Self-invoking code benchmarks help you decide which LLMs to use for your programming tasks




As the coding capabilities of large language models (LLMs) continue to improve, the benchmarks used to evaluate their performance become less and less useful.

With many LLMs achieving similarly high scores on these benchmarks, it becomes difficult to tell which of them is best suited to a specific company's software development projects.

A new paper from Yale University and Tsinghua University presents a novel way to test models on "self-invoking code generation": problems that require reasoning, generating code, and reusing existing code to solve more complex problems.

Self-invoking code generation is much closer to realistic programming scenarios than traditional benchmarks, and it provides a better measure of current LLMs' ability to solve real-world programming problems.

Self-invoking code generation

Two popular benchmarks for assessing the coding skills of LLMs are HumanEval and MBPP (Mostly Basic Python Problems), datasets of handcrafted problems that require the model to write code for simple tasks.

However, these benchmarks only cover a portion of the challenges that software developers face in the real world. In practical scenarios, software developers not only write new code, but also need to understand and reuse existing code and create reusable components to solve complex problems.

“The ability to understand and subsequently use one’s own generated code, (in other words) self-invoking code generation, plays an important role for LLMs to leverage their code generation reasoning skills that current benchmarks do not capture,” the researchers write.

To test LLMs' ability at self-invoking code generation, the researchers created two new benchmarks, HumanEval Pro and MBPP Pro, which extend the existing datasets. Each problem in HumanEval Pro and MBPP Pro builds on an existing example in the original dataset and introduces additional elements that require the model to solve the base problem and invoke that solution to solve a more complex problem.

Self-invoking code generation (Source: arXiv)

For example, the original problem might be something simple, such as writing a function that replaces all occurrences of a specific character in a string with a new character.

The extended problem would be to write a function that replaces occurrences of several different characters in a string with their specified replacements. To do this, the model has to write a new function that calls the function it generated for the simple problem, as in the sketch below.
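To make the setup concrete, here is a minimal Python sketch of what such a pair of problems might look like. The function names and the exact task are illustrative, not taken from HumanEval Pro itself.

```python
# Illustrative base problem and its self-invoking extension.
# Names and signatures are hypothetical, not from the actual benchmark.

def replace_char(text: str, old: str, new: str) -> str:
    """Base problem: replace every occurrence of one character."""
    return text.replace(old, new)

def replace_chars(text: str, replacements: dict[str, str]) -> str:
    """Self-invoking extension: apply several single-character replacements
    by calling the base-problem function generated earlier."""
    for old, new in replacements.items():
        text = replace_char(text, old, new)
    return text

# Example usage
assert replace_chars("banana", {"a": "o", "n": "m"}) == "bomomo"
```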

“This assessment of self-invoking code generation provides deeper insights into the programming skills of LLMs beyond the scope of single-problem code generation,” the researchers write.

LLMs perform poorly in self-invoking code generation

The researchers tested HumanEval Pro and MBPP Pro on more than 20 open and private models, including GPT-4o, OpenAI o1-mini and Claude 3.5 Sonnet, as well as the Qwen, DeepSeek and Codestral series.

Their results show a significant discrepancy between traditional coding benchmarks and self-invoking code generation tasks. “While frontier LLMs excel at generating individual code snippets, they often struggle to effectively use their self-generated code to solve more complex problems,” the researchers write.

For example, with a single generation (pass@1), o1-mini achieves 96.2% on HumanEval, but only 76.2% on HumanEval Pro.
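For reference, pass@1 means the model gets a single attempt per problem and is scored on whether that attempt passes all test cases. The standard unbiased pass@k estimator (from Chen et al.'s 2021 HumanEval/Codex work) can be computed as follows; this snippet is a generic sketch, not code from the new paper.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021):
    n = total samples generated per problem,
    c = samples among them that pass all tests,
    k = attempt budget being scored."""
    if n - c < k:
        return 1.0  # every k-subset is guaranteed to contain a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With a single generation per problem (n=1, k=1), pass@1 reduces to the
# fraction of problems whose one sample passes its tests.
```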

Another interesting finding is that while instruction fine-tuning provides significant improvements in simple coding tasks, it shows diminishing returns in generating self-invoking code. The researchers note that “current instruction-based fine-tuning approaches are not effective enough for more complex self-invoking code generation tasks,” suggesting that we need to rethink how we train baseline models for coding and reasoning tasks.

To advance research on self-invoking code generation, the researchers propose a technique to automatically repurpose existing coding benchmarks for self-invoking code generation. The approach uses frontier LLMs to generate self-invoking problems based on the original problems, then generates candidate solutions and verifies their correctness by executing the code and running test cases against it. The pipeline minimizes the need for manual code review, helping to generate more examples with less effort.

Automatic generation of self-invoking problems (Source: arXiv)
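The sketch below illustrates, under loose assumptions, what such a construction loop could look like in Python. The generate_with_llm helper and the prompt wording are placeholders rather than the authors' actual pipeline; only the execute-and-test verification step is spelled out concretely.

```python
# Hypothetical sketch of an automatic self-invoking benchmark builder.
import subprocess
import tempfile

def generate_with_llm(prompt: str) -> str:
    """Placeholder for a call to a frontier LLM via whatever API is in use."""
    raise NotImplementedError

def verify(candidate_solution: str, test_code: str, timeout: int = 10) -> bool:
    """Run the candidate solution together with its test cases in a subprocess;
    the example is only kept if all tests pass (non-zero exit means failure)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_solution + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

def build_self_invoking_example(base_problem: str) -> dict | None:
    """Turn one original benchmark problem into a self-invoking one."""
    new_problem = generate_with_llm(
        f"Write a harder problem that requires calling a solution to: {base_problem}")
    solution = generate_with_llm(
        f"Solve the following, reusing the base solution: {new_problem}")
    tests = generate_with_llm(
        f"Write assert-based test cases for: {new_problem}")
    if verify(solution, tests):
        return {"problem": new_problem, "solution": solution, "tests": tests}
    return None  # discard examples whose candidate solution fails its tests
```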

A complex landscape

This new benchmark family comes at a time when old coding benchmarks are quickly being saturated by frontier models. Current frontier models such as GPT-4o, o1 and Claude 3.5 Sonnet already achieve very high scores on HumanEval and MBPP, as well as on their more advanced versions, HumanEval+ and MBPP+.

At the same time, there are more complex benchmarks, such as SWE-bench, that evaluate models on end-to-end software engineering tasks requiring a wide range of skills, such as using external libraries and files and managing DevOps tools. SWE-bench is a very difficult benchmark, and even the most advanced models show only modest performance. For example, OpenAI o1 is inconsistent on SWE-bench Verified.

Self-invoking code generation lies somewhere between the simple benchmarks and SWE-bench. It helps assess a very specific type of reasoning ability: using existing code within a module to tackle complex problems. Self-invoking code benchmarks can prove to be a very practical indicator of the usefulness of LLMs in real-world settings, where human programmers remain in control and AI copilots help them with specific coding tasks in the software development process.

“HumanEval Pro and MBPP Pro are positioned to serve as valuable benchmarks for code-related evaluations and inspire future LLM development by illuminating current model deficiencies and encouraging innovation in training methods,” the researchers write.


