Hacker News | floridsleeves's comments

API misuse in generated code can lead to severe problems such as resource leaks and program crashes. Existing code evaluation benchmarks and datasets focus on small tasks such as programming-interview questions, which deviate from the real-world coding help that developers actually ask LLMs for. To fill this gap, researchers from UCSD propose RobustAPI, a dataset for evaluating the reliability and robustness of code generated by LLMs. They collect 1208 coding questions from Stack Overflow covering 24 representative Java APIs and evaluate popular LLMs including GPT-3.5, GPT-4, Llama 2, and Vicuna. The results show that even for GPT-4, 62% of the generated code contains API misuses, which could have severe consequences if the code were introduced into real-world software.
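To illustrate the kind of misuse involved (a sketch of my own, not an example from the paper; the method names are hypothetical), a classic Java pattern is opening a stream without guaranteeing it is closed on error paths, which leaks the file handle:

    import java.io.FileInputStream;
    import java.io.IOException;

    public class StreamExample {
        // Misuse: if read() throws, close() is never reached,
        // so the underlying file handle leaks.
        static int firstByteLeaky(String path) throws IOException {
            FileInputStream in = new FileInputStream(path);
            int b = in.read();
            in.close();
            return b;
        }

        // Correct usage: try-with-resources closes the stream
        // on every exit path, including exceptions.
        static int firstByteSafe(String path) throws IOException {
            try (FileInputStream in = new FileInputStream(path)) {
                return in.read();
            }
        }
    }

Both versions compile and usually behave identically on the happy path, which is exactly why this class of bug survives casual testing and why a benchmark has to check API usage patterns rather than just output correctness.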


Great tool! Will try it out.

