OpenAI confirms the new frontier models o3 and o3-mini

OpenAI is gradually inviting selected users to test a new family of reasoning models, o3 and o3-mini, successors to the o1 and o1-mini models whose full versions were released earlier this month.

OpenAI chose the name o3, skipping “o2,” to avoid trademark conflicts with the British telecom company O2, and because, as CEO Sam Altman put it today on the final day of the “12 Days of OpenAI” livestreams, the company “has a history of being really bad with names.”

Altman said the two new models will initially be made available to selected third-party researchers for safety testing, with o3-mini expected at the end of January 2025 and o3 “shortly thereafter.”

“We see this as the beginning of the next phase of AI, where you can use these models to do increasingly complex tasks that require a lot of thought,” Altman said. “On the last day of this event, we thought it would be fun to move from one frontier model to the next frontier model.”

The announcement comes just a day after Google unveiled and made publicly available its new Gemini 2.0 Flash Thinking model, another competing “reasoning” model that, unlike the OpenAI o1 series, lets users see the steps of its “reasoning” process documented in text bullet points.

The release of Gemini 2.0 Flash Thinking and now the announcement of o3 show that the competition between OpenAI and Google, and across the wider field of AI model providers, is entering a new and intense phase, as they offer not only LLMs and multimodal models but also advanced reasoning models. These may be better suited to harder problems in science, math, engineering, physics, and more.

Best performance yet on third-party benchmarks

Altman also said that the o3 model is “incredibly good at coding,” and the benchmarks shared by OpenAI back this up, showing that the model even outperforms o1 on programming tasks.

Exceptional coding performance: o3 outperforms o1 on SWE-Bench Verified by 22.8 percentage points and achieves a Codeforces rating of 2727, surpassing the 2665 rating of OpenAI’s own chief scientist.

Mastery of math and science: o3 scores 96.7% on the AIME 2024 exam, missing only one question, and 87.7% on GPQA Diamond, far exceeding the performance of human experts.

Frontier benchmarks: The model sets new records on demanding tests such as Epoch AI’s FrontierMath, solving 25.2% of problems where no other model exceeds 2%. On the ARC-AGI benchmark, o3 triples o1’s score and surpasses 85% (as verified live by the ARC Prize team), a milestone in conceptual reasoning.

Deliberative alignment

Alongside these advances, OpenAI deepened its commitment to safety and alignment.

The company introduced new research on deliberative alignment, a technique that has been instrumental in making o1 its most robust and aligned model to date.

This technique embeds human-written safety specifications into the models, allowing them to explicitly reason about those policies before generating responses.

The strategy aims to solve common safety challenges in LLMs, such as vulnerability to jailbreak attacks and excessive refusal of harmless prompts, by equipping the models with chain-of-thought (CoT) reasoning. This allows models to dynamically retrieve and apply safety specifications during inference.
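To make that idea concrete, here is a minimal, hypothetical sketch of what spec-conditioned inference could look like. The safety spec, prompt template, and EchoModel stand-in are illustrative assumptions, not OpenAI’s actual implementation:

```python
# Hypothetical sketch of spec-conditioned inference in the spirit of
# deliberative alignment. The spec text, prompt template, and EchoModel
# stand-in are assumptions for illustration only.

SAFETY_SPEC = """\
1. Refuse requests that facilitate clearly illegal activity.
2. Do not refuse harmless requests merely because they touch on a
   sensitive topic.
3. When refusing, cite which policy clause applies.
"""


def build_prompt(user_request: str) -> str:
    """Prepend the written policy and instruct the model to reason
    over it (chain of thought) before committing to a final answer."""
    return (
        "You are given the following safety policy:\n"
        f"{SAFETY_SPEC}\n"
        "First, think step by step about which policy clauses apply "
        "to the request below. Then give your final answer.\n\n"
        f"Request: {user_request}\n"
    )


class EchoModel:
    """Stand-in for a real LLM client so the sketch runs offline;
    a real system would call a text-generation API here."""

    def generate(self, prompt: str) -> str:
        return f"[model would reason over the policy here]\n{prompt}"


if __name__ == "__main__":
    model = EchoModel()
    print(model.generate(build_prompt("How do I pick a strong passphrase?")))
```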

Deliberative alignment improves on previous methods such as Reinforcement Learning from Human Feedback (RLHF) and Constitutional AI, which use safety specifications only to generate training labels rather than embedding the policies directly into the models.

By fine-tuning LLMs on safety-related prompts and their associated specifications, the approach produces models capable of policy-driven reasoning without relying heavily on human-labeled data.
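A similarly hedged sketch of the data-generation loop this implies is below; the `model` and `judge` objects, function name, and 0.8 threshold are assumptions for illustration, not the paper’s exact pipeline:

```python
# Hypothetical sketch of a deliberative-alignment-style training recipe:
# sample chain-of-thought completions WITH the written spec in context,
# filter them with a model-based judge instead of human labels, then
# fine-tune on prompt -> completion pairs WITHOUT the spec in the input,
# so the model internalizes the policy reasoning. The `model` and
# `judge` objects and the 0.8 threshold are illustrative assumptions.

from typing import Optional


def make_training_example(model, judge, spec: str, prompt: str) -> Optional[dict]:
    # 1. Sample a completion whose reasoning explicitly cites the spec.
    cot_completion = model.generate(
        f"Policy:\n{spec}\n"
        f"Reason step by step about the policy, then answer:\n{prompt}"
    )

    # 2. Score compliance automatically; discard low-quality samples
    #    rather than relying on human annotation.
    if judge.score(spec, prompt, cot_completion) < 0.8:
        return None

    # 3. Keep (prompt -> completion) with the spec stripped from the
    #    input, so the fine-tuned model learns to recall and apply the
    #    policy on its own at inference time.
    return {"input": prompt, "target": cot_completion}
```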

Results shared by OpenAI researchers in a new, not-yet-peer-reviewed paper indicate that the method improves performance on safety benchmarks, reduces harmful outputs, and improves compliance with content and style guidelines.

The key results highlight the o1 model’s advances over predecessors such as GPT-4o and other state-of-the-art models. Through deliberative alignment, the o1 series achieves strong jailbreak resistance and safe completions while minimizing excessive refusals of harmless prompts. The method also facilitates out-of-distribution generalization and shows robustness in multilingual and encoded jailbreak scenarios. These improvements are consistent with OpenAI’s goal of making AI systems safer and more interpretable as their capabilities increase.

This research will also play a key role in tuning o3 and o3-mini, ensuring their capabilities are both powerful and responsible.

How to request access to test o3 and o3-mini

Applications for early access can be submitted via the OpenAI website and close on January 10, 2025.

Applicants must fill out an online form that asks for a variety of information, including research focus, previous experience, and links to previously published papers and their code repositories on GitHub. Applicants must also choose which of the models, o3 or o3-mini, they want to test, as well as what they want to use them for.

Selected researchers will be granted access to o3 and o3-mini to explore their capabilities and contribute to security assessments. However, OpenAI’s form indicates that o3 will not be available for several weeks.

Researchers are encouraged to develop robust evaluations, create controlled demonstrations of high-risk features, and test models against scenarios not possible with widely used tools.

This initiative builds on the company’s established practices, including rigorous internal safety testing, collaborations with organizations such as the US and UK AI Safety Institutes, and its Preparedness Framework.

OpenAI will review applications on a rolling basis, with selection beginning immediately.

A new leap forward?

The launch of o3 and o3-mini signals an advance in AI performance, particularly in areas that demand advanced reasoning and problem-solving skills.

With their exceptional results in coding, mathematics, and conceptual benchmarks, these models demonstrate the rapid advances in AI research.

By inviting the broader research community to collaborate on safety testing, OpenAI aims to ensure that these capabilities are used responsibly.
