"We view this as sort of the beginning of the next phase of AI where you can use these models to do increasingly complex tasks that require a lot of reasoning," stated a representative during the announcement. While many expected the successor to O1 to be named O2, the company playfully cited their "grand tradition of OpenAI being really truly bad at names" and chose the moniker O3, alongside a smaller, more cost-effective version named O3 Mini.
Described as "very, very smart," o3 promises to push the boundaries of AI capabilities, while o3-mini is touted as "incredibly smart," with a focus on delivering "really good performance and cost."
However, the public will have to wait to get its hands on these powerful new tools. OpenAI announced that while it will not be launching the models publicly today, it is taking a novel approach to safety testing: starting immediately, researchers can apply for access to both o3 and o3-mini for public safety testing.
"We've taken safety testing seriously as our models get more and more capable," the announcement emphasized. "At this new level of capability, we want to try adding a new part of our safety testing procedure which is to allow public access for researchers that want to help us test." Interested researchers can find the application form on the OpenAI website, with applications closing on January 10th.
The company did offer a tantalizing glimpse into the capabilities of o3. Mark Chen, Head of Research at OpenAI, showcased the model's prowess on technical benchmarks. In coding, o3 achieved a remarkable 71.7% accuracy on the SWE-bench Verified benchmark, surpassing o1 by more than 20 percentage points. Further highlighting its coding abilities, o3 reached an impressive Elo rating of 2727 on the competitive programming platform Codeforces, significantly exceeding o1's score and even outperforming OpenAI's own chief scientist.
o3's mathematical abilities are equally striking. It achieved a near-perfect 96.7% accuracy on the notoriously challenging AIME math competition, compared to o1's 83.3%. On the GPQA Diamond benchmark, which tests performance on PhD-level science questions, o3 scored 87.7%, roughly a ten-percentage-point improvement over o1 and above the typical score of an expert PhD in their field.
Recognizing the limitations of current benchmarks, OpenAI highlighted o3's performance on the newly released FrontierMath benchmark from Epoch AI, considered the toughest mathematical benchmark available. While existing models struggle to exceed 2% accuracy, o3 achieved over 25% under aggressive test-time compute settings.
Adding to the excitement, Greg Kamradt, President of the ARC Prize Foundation, joined the announcement to reveal o3's groundbreaking performance on the ARC-AGI benchmark, a long-standing challenge in the AI world. In the five years since the benchmark's introduction, no system had surpassed a 5% score; o3 achieved a state-of-the-art 75.7% on the semi-private holdout set under low-compute conditions, making it the new number one on the public leaderboard. Remarkably, under high-compute settings, o3 scored an astounding 87.5% on the same set, exceeding human performance levels on this benchmark.
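For context, each ARC-AGI task shows a solver a handful of input-output grid pairs and asks it to infer the underlying transformation and apply it to a fresh input. The toy example below is only a hypothetical illustration of that format; real tasks are far less regular and much harder:

```python
# Toy illustration of the ARC-AGI task format: infer a grid
# transformation from example pairs, then apply it to a test input.
# This task's hidden rule is simply "transpose"; real tasks are not
# this regular.

examples = [
    ([[1, 2], [3, 4]], [[1, 3], [2, 4]]),
    ([[5, 0], [0, 5]], [[5, 0], [0, 5]]),
]
test_input = [[7, 8], [9, 0]]

def transpose(grid):
    return [list(row) for row in zip(*grid)]

# A solver would search over candidate transformations; here we just
# verify that one hypothesis explains every example pair.
assert all(transpose(i) == o for i, o in examples)
print(transpose(test_input))  # [[7, 9], [8, 0]]
```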
Turning the spotlight to o3-mini, Hongyu, who trained the model, emphasized its cost-efficiency and flexibility. With the recently introduced adaptive thinking time feature in the API, o3-mini will offer low, medium, and high reasoning-effort options, allowing users to tailor its performance to specific use cases. Live demonstrations showcased o3-mini's ability to generate and execute code, even evaluating its own performance on the challenging GPQA dataset with impressive speed and accuracy. Benchmark results further highlighted o3-mini's coding and math proficiency, often exceeding the performance of the original o1 while offering significantly reduced latency. OpenAI also confirmed that o3-mini will support popular API features like function calling and structured outputs.
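To make the reasoning-effort option concrete, here is a minimal sketch of what such a request might look like. It assumes an interface shaped like OpenAI's existing chat completions API; the exact model name and parameter spelling were not final at announcement time.

```python
# Sketch: selecting o3-mini's reasoning effort via the API.
# Assumes an interface like OpenAI's chat completions endpoint; the
# model name and "reasoning_effort" parameter are as described in the
# announcement, not a published spec.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o3-mini",
    reasoning_effort="low",  # "low", "medium", or "high"
    messages=[
        {"role": "user", "content": "How many primes are there below 100?"}
    ],
)
print(response.choices[0].message.content)
```

Function calling and structured outputs would presumably ride along in the same request shape, as they do for existing models.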
In addition to the new models, OpenAI announced a novel safety technique called "deliberative alignment." The method leverages the models' own reasoning capabilities to understand safety specifications and identify potentially harmful prompts, yielding more accurate refusals of unsafe requests and fewer over-refusals of benign ones compared to previous models.
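As described, deliberative alignment trains the model to reason over the text of the safety specification in its chain of thought before deciding how to respond. The sketch below is only a prompt-level approximation of that idea, with a made-up two-rule specification; the actual technique bakes the behavior in through fine-tuning rather than prompting.

```python
# Prompt-level approximation of the deliberative-alignment idea: show
# the model a safety specification and ask it to reason about which
# rule applies before answering or refusing. The real technique trains
# this behavior into the model; this only illustrates the shape.
from openai import OpenAI

client = OpenAI()

SAFETY_SPEC = """\
1. Refuse requests that meaningfully facilitate serious harm.
2. Answer benign requests fully; do not over-refuse.
"""  # stand-in for the real, much longer specification

def deliberate(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",  # any instruction-following model works for the sketch
        messages=[
            {"role": "system",
             "content": ("Safety specification:\n" + SAFETY_SPEC +
                         "Before answering, reason step by step about which "
                         "rule applies, then either answer or refuse.")},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content

print(deliberate("What household chemicals should never be mixed, and why?"))
```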
Looking ahead, OpenAI anticipates launching o3-mini around the end of January, with the full o3 model becoming generally available shortly after. The company emphasized that this timeline is contingent on the successful completion of the expanded safety testing.
The announcement of o3 and o3-mini marks a significant step forward in reasoning AI, promising a new era of complex task execution and underscoring OpenAI's commitment both to pushing technological boundaries and to responsible development through rigorous safety measures. The AI community eagerly awaits the chance to test and explore the full potential of these groundbreaking models.