DeepSeek-R1 has certainly created a lot of excitement and concern, especially as a rival to OpenAI's o1. So we put the two head-to-head in a small comparison on some simple data analysis and market research tasks.
To put the models on equal footing, we used Perplexity Pro Search, which now supports both o1 and R1. Our goal was to look beyond the benchmarks and see whether the models can actually perform ad hoc tasks that require gathering information from the web, picking out the right data and carrying out simple operations that would otherwise take considerable manual effort.
Both models are impressive but make mistakes when the prompts lack specificity. o1 is slightly better at reasoning tasks, but R1's transparency gives it an advantage in cases (and there will be some) where things go wrong.
Here is a breakdown of a few of our experiments, with links to the Perplexity pages where you can check the results yourself.
Calculating returns on investment from the web
Our first test gauged whether the models could calculate return on investment (ROI). We considered a scenario where a user invested $140 in the Magnificent Seven (Alphabet, Amazon, Apple, Meta, Microsoft, Nvidia, Tesla) on the first day of every month from January to December 2024. We asked the model to calculate the value of the portfolio at the current date.
To accomplish this task, the model would have to pull the price of all seven stocks for the first day of each month, split the monthly investment evenly across the stocks ($20 per stock), sum the monthly purchases, and calculate the portfolio value based on the current price of each stock.
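For reference, the arithmetic the models were asked to perform is ordinary dollar-cost averaging. Below is a minimal Python sketch of it; the tickers and prices are hypothetical placeholders, not the data the models saw.

```python
def portfolio_value(monthly_prices: dict[str, list[float]],
                    current_prices: dict[str, float],
                    monthly_investment: float = 140.0) -> float:
    """Value today of investing `monthly_investment` on the first day of each
    listed month, split evenly across the tickers."""
    per_stock = monthly_investment / len(monthly_prices)   # $20 per stock with 7 tickers
    total = 0.0
    for ticker, prices in monthly_prices.items():
        shares = sum(per_stock / p for p in prices)         # shares accumulated month by month
        total += shares * current_prices[ticker]             # valued at today's price
    return total

# Hypothetical placeholder prices for two tickers and two months, only to show the input shape;
# the actual test used first-of-month prices for all seven stocks across 2024.
buys = {"AAPL": [185.0, 188.0], "NVDA": [48.0, 62.0]}
today = {"AAPL": 235.0, "NVDA": 140.0}
print(round(portfolio_value(buys, today), 2))
```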
Both models failed at this task. o1 returned a list of stock prices for January 2024 and January 2025 along with a formula for calculating the portfolio value, but it could not compute the right values and essentially said there would be no ROI. R1, on the other hand, made the mistake of only investing in January 2024 and calculating the returns for January 2025.

However, the models' reasoning process was interesting. o1 did not provide much detail on how it reached its results, but R1's reasoning trace showed that it was missing information because Perplexity's retrieval engine had failed to obtain the monthly stock price data (many retrieval-augmented generation applications fail not because of the model's abilities but because of poor retrieval). This proved to be an important piece of feedback that led us to the next experiment.

Reasoning over file contents
We decided to run the same experiment as before, but instead of having the model retrieve the information from the web, we provided it in a text file. To do this, we copied the monthly stock data for each stock from Yahoo! Finance into a text file and gave it to the model. The file contained the name of each stock plus the HTML table holding the price for the first day of each month from January to December 2024 and the last recorded price. The data was not cleaned, to minimize manual effort and to test whether the model could pick out the right parts of the data.
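To give a sense of what "picking out the right parts" involves, the pasted tables can be parsed mechanically, as in the sketch below. It assumes a Yahoo!-style table layout with an "Open" price column (an assumption on our part, not the exact format we pasted) and drops any row whose price cell is not numeric, which is also how one would skip annotation rows such as stock splits.

```python
from io import StringIO

import pandas as pd

def monthly_prices_from_html(html_fragment: str, price_col: str = "Open") -> list[float]:
    """Parse one pasted Yahoo!-Finance-style history table and return its
    price column as floats, dropping non-price annotation rows."""
    # read_html returns one DataFrame per <table> element in the fragment
    table = pd.read_html(StringIO(html_fragment))[0]
    # Annotation rows (e.g. a "10:1 Stock Split" line) are not numeric and become NaN
    prices = pd.to_numeric(table[price_col], errors="coerce")
    return prices.dropna().tolist()
```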
Here, too, both models failed to provide the right answer. o1 seemed to have extracted the data from the file but suggested the calculation be done manually in a tool such as Excel. Its reasoning trace was very vague and contained nothing useful for troubleshooting the model. R1 also failed and gave no final answer, but its reasoning trace contained a lot of useful information.
For example, it was clear that the model had correctly parsed the HTML data for each stock and was able to extract the right information. It was also able to make the month-by-month investment calculation, sum the purchases, and calculate the final value according to the latest stock price in the table. However, that final value remained in its chain of reasoning and never made it into the final answer. The model had also been thrown off by a row in the Nvidia table marking the company's 10:1 stock split on June 10, 2024, and ended up misreporting the final value of the portfolio.

Here, again, the real differentiator was not the result itself but the ability to investigate how the model arrived at its response. In this case, R1 gave us the better experience, letting us understand the model's limitations and how we could reformulate our prompt and format our data to get better results in the future.
Comparing data over the web
Another experiment we ran required the model to compare the stats of four leading NBA centers and determine which one showed the best improvement in field goal percentage (FG%) from the 2022/2023 season to the 2023/2024 season. This task required the model to carry out multi-step reasoning over different data points. The catch in the prompt was that it included Victor Wembanyama, who only entered the league as a rookie in 2023.
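The comparison itself is a simple lookup-and-subtract over two seasons; the subtlety is excluding a player with no prior NBA season. A minimal sketch, with made-up player names and FG% values standing in for the real stats:

```python
# Hypothetical FG% figures (as fractions) purely to illustrate the comparison;
# None marks a player with no 2022/23 NBA season, such as a 2023 rookie.
fg_pct = {
    "Player A": {"2022-23": 0.553, "2023-24": 0.611},
    "Player B": {"2022-23": 0.585, "2023-24": 0.590},
    "Rookie":   {"2022-23": None,  "2023-24": 0.465},
}

improvement = {
    name: seasons["2023-24"] - seasons["2022-23"]
    for name, seasons in fg_pct.items()
    if seasons["2022-23"] is not None  # rookies have no prior NBA season to compare
}
best = max(improvement, key=improvement.get)
print(best, f"+{improvement[best]:.3f}")
```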
Retrieval for this prompt was much easier, since player stats are widely reported on the internet and are usually included in their Wikipedia and NBA profiles. Both models answered correctly (it's Giannis, if you were curious), although depending on the sources they used, their figures differed slightly. However, they did not realize that Wemby did not qualify for the comparison and gathered other stats from his time in the European league.
In its answer, R1 provided a better breakdown of the results, with a comparison table along with links to the sources it used. The added context allowed us to correct the prompt. After we modified the prompt to specify that we were looking for FG% from NBA seasons, the model correctly excluded Wemby from the results.

Conclusion
Reasoning models are powerful tools, but they still have a way to go before they can be fully trusted with tasks, especially as other components of LLM applications (beyond the language model itself) continue to develop. From our experiments, both o1 and R1 can still make basic mistakes. Despite their impressive results, they still need a bit of hand-holding to deliver accurate results.
Ideally, a reasoning model should be able to tell the user when it lacks the information needed for the task. Alternatively, the model's reasoning trace should help users better understand errors and correct their prompts to increase the accuracy and stability of the model's answers. In this regard, R1 had the upper hand. Hopefully, future reasoning models, including OpenAI's upcoming o3 series, will give users more visibility and control.