In a new case study, Hugging Face researchers have shown how small language models (SLMs) can be configured to outperform much larger models. Their results show that a Llama-3.2 model with 3B parameters can outperform the 70B version of the model on complex math problems.
Hugging Face has documented the entire process and provides a roadmap for companies looking to create their own tailored reasoning models.
Scaling test-time compute
The work is inspired by OpenAI o1, which uses extra "thinking" to solve complex math, coding, and reasoning problems.
The key idea behind models like o1 is to scale "test-time compute," which effectively means using more compute cycles during inference to test and verify different answers and reasoning paths before producing the final answer. Scaling test-time compute is especially useful when there is not enough memory to run a large model.
Since o1 is a private model and OpenAI has remained quiet about its internal operations, researchers are speculating about how it works and trying to reverse engineer the process. There are already several open alternatives to o1.
The Hugging Face work builds on a DeepMind study published in August, which examines the trade-offs between inference-time and pre-training compute. The study provides comprehensive guidelines for balancing training and inference compute to achieve the best results within a fixed budget.
Beyond using extra inference-time compute, the technique's success depends on two key components: a reward model that evaluates the SLM's answers, and a search algorithm that optimizes the path it takes to refine those answers.
Different reasoning algorithms
The simplest way to use test-time scaling is "majority voting": the same prompt is sent to the model multiple times and the most frequent answer is chosen. Majority voting can be useful on simple problems, but on complex reasoning problems, or tasks where errors are consistent across generations, its gains quickly plateau.
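Majority voting can be sketched in a few lines. The `generate` callable below is a hypothetical stand-in for any sampling call to the SLM, not an API from the Hugging Face recipe; the toy answer pool exists only so the sketch runs without a model.

```python
from collections import Counter

def majority_vote(generate, prompt, n=8):
    """Sample the model n times on the same prompt and return the
    most frequent answer along with its share of the votes."""
    answers = [generate(prompt) for _ in range(n)]
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / n

# Toy stand-in "model": a fixed pool of pre-sampled answers.
samples = iter(["42", "41", "42", "42", "40", "42", "41", "42"])
answer, share = majority_vote(lambda p: next(samples), "what is 2*21?", n=8)
# "42" wins with 5 of 8 votes
```

Note how the method has no notion of answer quality: if the model is consistently wrong, the wrong answer wins the vote, which is exactly the failure mode described above.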
A more advanced method is "Best-of-N". In this technique, the SLM generates multiple answers, but instead of taking a majority vote, a reward model evaluates the answers and selects the best one. "Weighted Best-of-N," a more nuanced version of this method, factors in consistency to choose answers that are both high-scoring and more frequent than others.
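The difference between the two variants comes down to how scores are aggregated. In this sketch, `reward` is a hypothetical stand-in for a trained reward model; plain Best-of-N would simply take `max(candidates, key=reward)`, while the weighted variant pools scores across identical answers so a frequently produced, decently scored answer can beat a one-off high scorer.

```python
from collections import defaultdict

def weighted_best_of_n(candidates, reward):
    """Sum reward scores over identical answers and return the
    answer with the highest pooled score."""
    totals = defaultdict(float)
    for ans in candidates:
        totals[ans] += reward(ans)
    return max(totals, key=totals.get)

# Toy reward table (an assumption for illustration): "x=3" scores
# highest individually, but "x=2" appears three times.
scores = {"x=3": 0.9, "x=2": 0.6}
cands = ["x=2", "x=2", "x=2", "x=3"]
best = weighted_best_of_n(cands, lambda a: scores[a])
# pooled: "x=2" -> 1.8, "x=3" -> 0.9, so "x=2" is chosen
```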
The researchers used a "process reward model" (PRM) that evaluates the SLM's response based not only on the final answer, but also on the multiple steps it goes through to reach it. Their experiments showed that weighted Best-of-N with a PRM brought Llama-3.2 1B close to the level of Llama-3.2 8B on the difficult MATH-500 benchmark.
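A PRM rates every intermediate reasoning step, not just the final answer. How those per-step scores are reduced to one number is a design choice; the product used below (equivalently, summed log-scores) is one common option, assumed here for illustration rather than taken from the case study.

```python
import math

def prm_score(step_scores):
    """Aggregate per-step PRM scores into one chain-level score
    by taking their product (via summed logs for stability)."""
    return math.exp(sum(math.log(s) for s in step_scores))

# A chain with one shaky middle step scores far lower than a
# uniformly solid one, even with identical final steps.
solid = prm_score([0.9, 0.9, 0.9])   # ~0.729
shaky = prm_score([0.9, 0.3, 0.9])   # ~0.243
```

This is the key advantage over an outcome-only reward model: a chain that stumbles mid-derivation is penalized even if its final answer looks plausible.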
Add search
To further improve the model's performance, the researchers added search algorithms to the model's reasoning process. Instead of generating the answer in a single pass, they used "beam search," an algorithm that guides the model's answer generation step by step.
At each step, the SLM generates several partial answers. The search algorithm uses the reward model to evaluate the answers and selects a subset that is worth further investigation. The process repeats until the model exhausts its inference budget or reaches the correct answer. In this way, the inference budget can be narrowed to focus on the most promising answers.
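The loop just described can be sketched as follows. Here `expand` (which proposes next-step continuations of a partial answer) and `score` (the reward model's rating of a partial answer) are hypothetical stand-ins; the toy setup at the bottom exists only so the sketch runs without a model.

```python
def beam_search(expand, score, prompt, beam_width=2, max_steps=3):
    """Step-wise beam search: at each step, expand every surviving
    partial answer, score the candidates with the reward model, and
    keep only the top beam_width of them."""
    beams = [prompt]
    for _ in range(max_steps):
        candidates = [b + step for b in beams for step in expand(b)]
        if not candidates:
            break
        beams = sorted(candidates, key=score, reverse=True)[:beam_width]
    return beams[0]

# Toy setup: each step appends "a" or "b"; the scorer prefers "a"s,
# so the search converges on the all-"a" path.
best = beam_search(lambda b: ["a", "b"],
                   lambda b: b.count("a"),
                   prompt="", beam_width=2, max_steps=3)
```

Because low-scoring partial answers are pruned at every step, the inference budget concentrates on the most promising paths rather than being spread uniformly across full generations.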
The researchers found that while beam search improves the model’s performance on complex problems, it tends to perform worse than other techniques on simple problems. To address this challenge, they added two additional elements to their inference strategy.
First came Diverse Verifier Tree Search (DVTS), a variant of beam search that ensures that the SLM does not get stuck in incorrect reasoning paths and diversifies its answer branches. Second, they developed a “computationally optimal scaling strategy,” as proposed in the DeepMind paper, that dynamically selects the best test-time scaling strategy based on the difficulty of the input problem.
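The compute-optimal idea amounts to routing each problem to the method that performs best at its difficulty level. The sketch below hard-codes that routing with made-up thresholds purely for illustration; in the DeepMind formulation the policy is derived from how each method actually performs per difficulty bin, and the difficulty estimate itself comes from the model, not from a fixed number.

```python
def compute_optimal_strategy(difficulty):
    """Pick a test-time scaling strategy from an estimated problem
    difficulty in [0, 1]. Thresholds are illustrative assumptions."""
    if difficulty < 0.3:
        return "best_of_n"    # easy: simple sampling is enough
    elif difficulty < 0.7:
        return "beam_search"  # medium: step-wise search pays off
    return "dvts"             # hard: diversify the search tree

easy, medium, hard = (compute_optimal_strategy(d) for d in (0.1, 0.5, 0.9))
```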
The combination of these techniques allowed Llama-3.2 1B to punch above its weight and significantly outperform the 8B model. They also found the strategy scaled: applied to Llama-3.2 3B, it outperformed the much larger 70B model.
Not a perfect solution yet
Scaling test-time compute changes the cost dynamics of models. Enterprises now have the ability to choose where to allocate their compute resources. For example, if memory is tight or you can tolerate slower response times, you can use a small model and spend more inference-time cycles to generate more accurate answers.
However, test-time scaling also has its limits. For example, in the Hugging Face experiments, the researchers used a specially trained Llama-3.1-8B model as the PRM, which requires running two models in parallel (though this is still far more resource-efficient than the 70B model). The researchers acknowledge that the holy grail of test-time scaling is "self-verification," where the original model checks its own answers rather than relying on an external verifier. This remains an open research area.
The test-time scaling technique presented in this study is also limited to problems whose answers can be clearly evaluated, such as coding and mathematics. Creating reward models and verifiers for subjective tasks such as creative writing and product design requires further research.
What is clear, however, is that test-time scaling has generated a lot of interest and activity, and we can expect more tools and techniques to come to market in the coming months. Companies should keep an eye on how the landscape develops.