OpenAI announced the release of its new series of AI models, OpenAI o1, with significantly advanced reasoning capabilities. According to OpenAI, what sets o1 apart from the GPT-4o family is that these models are designed to spend more time thinking before they respond. One caveat with older and current OpenAI models (e.g. GPT-4o and GPT-4o mini) is their limited reasoning and contextual-awareness capabilities, which lag behind advanced models like Anthropic’s Claude 3.5 Sonnet. OpenAI o1 is designed to help users complete complex tasks and solve harder problems than previous models could in science, coding, and math.
This blog explores OpenAI o1’s features, test results, and pricing, and compares it against existing benchmarks, GPT-4o, and Claude 3.5 Sonnet (you can compare currently leading models here).
1. Overview of OpenAI o1
OpenAI o1 is a model family designed specifically for advanced reasoning and problem-solving. According to OpenAI, the models can perform similarly to PhD students on challenging benchmark tasks in physics, chemistry, and biology, and the test results support that claim. Key highlights of the OpenAI o1 models include:
1.1 Performance Metrics
OpenAI o1 ranks in the 89th percentile on competitive programming questions and has shown remarkable results on standardized tests, outperforming human PhD-level accuracy on physics, biology, and chemistry benchmarks. The model also has a 128K-token context window and an October 2023 knowledge cutoff.
1.2 o1 Model Family
The series includes the o1-preview model, which has broader world knowledge and reasoning ability, and a smaller variant, o1-mini, which is faster and more cost-effective, especially for coding tasks. o1-mini is approximately 80% cheaper than o1-preview while maintaining competitive performance in coding evaluations.
1.3 Availability of o1 models
The o1 models are currently available in ChatGPT Plus (including access for Team and Enterprise users), as well as via the API for developers on tier 5 of API usage. In ChatGPT, there is a strict message limit of only 30 messages per week for o1-preview and 50 messages per week for o1-mini, after which you are required to switch to the GPT-4o models.
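For developers with API access, a request to an o1 model looks like a standard Chat Completions call. The sketch below only builds the request payload (so it runs without a network call or API key); `build_o1_request` is an illustrative helper, not part of any SDK. Note that, at launch, the o1 models reportedly do not support system messages or custom sampling parameters, so the payload carries only user messages:

```python
def build_o1_request(prompt: str, model: str = "o1-preview") -> dict:
    """Build a Chat Completions-style request body for an o1 model.

    o1 models at launch accept only user/assistant messages, so no
    system message or temperature is included here.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

# Example: the same payload shape could be passed to an OpenAI SDK client.
payload = build_o1_request("Prove that the square root of 2 is irrational.")
```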
1.4 Pricing for o1 models compared with GPT-4o
OpenAI has structured its pricing to cater to different user needs, with o1-mini being the most economical option. Here’s a breakdown of the pricing for the OpenAI o1 models:
| Model | Input Tokens | Output Tokens |
|---|---|---|
| OpenAI o1 | $15.00 / 1M | $60.00 / 1M |
| OpenAI o1-mini | $3.00 / 1M | $12.00 / 1M |
| GPT-4o (08-06) | $2.50 / 1M | $10.00 / 1M |
| GPT-4o mini | $0.15 / 1M | $0.60 / 1M |
| Claude 3.5 Sonnet | $3.00 / 1M | $15.00 / 1M |
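Using the per-million-token prices from the table above, you can estimate what a given request would cost on each model. A minimal sketch in Python (the prices are hardcoded from the table; the dictionary keys are illustrative labels, not official model IDs):

```python
# USD per 1M tokens, (input_price, output_price), taken from the table above.
PRICING = {
    "o1": (15.00, 60.00),
    "o1-mini": (3.00, 12.00),
    "gpt-4o": (2.50, 10.00),
    "gpt-4o-mini": (0.15, 0.60),
    "claude-3.5-sonnet": (3.00, 15.00),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one request."""
    in_price, out_price = PRICING[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

# Example: a request with 10K input and 2K output tokens.
print(estimate_cost("o1", 10_000, 2_000))       # 0.27
print(estimate_cost("o1-mini", 10_000, 2_000))  # 0.054
```

This makes the roughly 80% price gap between o1 and o1-mini concrete: the same request costs $0.27 on o1 but only about $0.05 on o1-mini.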
2. Comparison of OpenAI o1 vs GPT-4o
In rigorous testing, OpenAI o1 has demonstrated superior reasoning skills compared to its predecessors. For example, in a qualifying exam for the International Mathematics Olympiad, the o1 model scored 83%, while GPT-4o only managed 13%. Additionally, the o1 model scored significantly higher on jailbreaking tests, indicating a stronger adherence to safety protocols.
Performance Comparison
The charts below (courtesy of OpenAI) provide some interesting details about OpenAI o1’s technical performance across different metrics:
3. Comparison of OpenAI o1 vs Claude 3.5 Sonnet
Here are some quick points highlighting the differences between OpenAI o1, GPT-4o, and Claude 3.5 Sonnet:
- Reasoning Ability: OpenAI o1 outperforms GPT-4o in complex reasoning tasks, as evidenced by its superior scores in competitive programming and math challenges. However, its 128K context window is still smaller than the 500K context window available on Claude’s most premium plan, Claude for Enterprise.
- Safety and Compliance: OpenAI o1 has shown improved performance in safety tests, indicating better adherence to safety protocols compared to GPT-4o and Claude 3.5 Sonnet.
Claude also launched its own GitHub integration to ground responses in your personal data, which is especially helpful for code-generation use cases.
4. Conclusion
The introduction of OpenAI o1 marks a significant milestone in AI development, particularly in enhancing reasoning capabilities for complex problem-solving. OpenAI has said it expects to add browsing, file and image uploading, and other features to make the models more useful to everyone, and it will be interesting to follow these developments. At the same time, it is important to compare models and pick the one that works best for your use case; the most expensive model isn’t always the best. Leading models currently include GPT-4o, Claude 3.5 Sonnet, and Llama 3.1, and you can test multiple models to make the decision that works for you.