Hello, my name is Kakeya, the CEO of Scuti.
We specialize in offshore development and lab-based development with a focus on generative AI, as well as offering consulting services related to generative AI. Recently, we have been receiving many requests for system development integrated with generative AI.
On September 12, 2024, OpenAI announced the “OpenAI o1” series of AI models, which are equipped with advanced reasoning capabilities.
This AI model tackles complex problems using human-like thought processes, generating more refined, higher-precision outputs. The first in the series, “o1-preview,” was released as an early-access version alongside a lightweight version called “o1-mini”; both have drawn significant attention from researchers and developers worldwide.
In this article, we explain the technical details of OpenAI o1-preview/mini, compare them with previous models, and discuss benchmark results, use cases, and safety considerations.
OpenAI o1-preview / mini: An AI Model that Dramatically Enhances Reasoning Abilities
What makes OpenAI o1 so remarkable?
Chain of Thought (CoT) Reasoning Like Humans: OpenAI o1 mimics the human process of solving complex problems using “Chain of Thought,” allowing it to analyze problems step by step and derive solutions.
Expert-Level Capabilities: OpenAI o1 demonstrates expert-level abilities in highly specialized fields such as mathematics, coding, and science.
Consideration for Safety and Ethics: OpenAI o1 is designed to comply with safety regulations and avoid generating harmful content. It also incorporates technologies to promote ethical behavior and eliminate bias.
OpenAI o1-preview: Solving Complex Problems with Reasoning Abilities that Surpass GPT-4o
OpenAI o1-preview uses a technique called “Chain of Thought” to work through complex reasoning tasks in multiple stages, much as a human would, enabling advanced problem-solving.
o1-preview overcomes the challenges of complex reasoning that GPT-4o faced by employing human-like thinking processes, allowing it to tackle more sophisticated problems. It excels particularly in tasks requiring logical reasoning, strategic planning, and problem-solving.
o1-preview is not the next version of GPT-4o, but a new language model.
At present, o1-preview lacks some ChatGPT features, such as web search and file upload, so for everyday use cases GPT-4o may still be the better choice. In complex reasoning tasks, however, o1-preview raises the potential of AI to a new level and is expected to be a key milestone in future AI development.
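As a minimal sketch of what calling o1-preview looks like through the API (assuming the official OpenAI Python SDK; at launch, the o1 models accepted only user messages, exposed no temperature control, and used max_completion_tokens, which also covers the hidden reasoning tokens):

```python
# Minimal sketch: one o1-preview call via the OpenAI Python SDK.
# Assumes OPENAI_API_KEY is set and the account has o1 access
# (access was limited at launch).
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="o1-preview",
    # At launch, o1 models accept only "user" messages (no "system" role)
    # and do not expose sampling parameters such as temperature.
    messages=[
        {"role": "user", "content": "Prove that the sum of two odd integers is even."}
    ],
    # Caps the visible output plus the hidden reasoning tokens.
    max_completion_tokens=2000,
)

print(response.choices[0].message.content)
```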
OpenAI o1-mini: Specializing in STEM Reasoning with a Focus on Speed and Cost Efficiency
OpenAI o1-mini is a lightweight version of o1-preview that retains strong reasoning capabilities while dramatically improving processing speed and cost efficiency. Compared with o1-preview, o1-mini runs roughly 3 to 5 times faster, and its usage cost is 80% lower.
o1-mini is specifically trained in STEM fields (Science, Technology, Engineering, and Mathematics), particularly excelling in reasoning tasks related to mathematics and coding. Like o1-preview, o1-mini also uses “Chain of Thought” reasoning to solve complex problems step by step, similar to human processes.
o1-mini may not perform as well as o1-preview or GPT-4o on tasks that require broad general knowledge, because its training is concentrated on STEM content. For applications that demand high-precision reasoning on a limited budget, however, o1-mini is a powerful and attractive option.
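To make the cost gap concrete, here is a back-of-the-envelope estimate, assuming the approximate launch API prices (o1-preview: $15 per million input tokens / $60 per million output tokens; o1-mini: $3 / $12; check current pricing before relying on these figures):

```python
# Back-of-the-envelope cost comparison for a hypothetical workload.
# Prices are the approximate per-1M-token API rates at launch.
PRICES = {
    "o1-preview": {"input": 15.00, "output": 60.00},  # USD per 1M tokens
    "o1-mini": {"input": 3.00, "output": 12.00},
}

def cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the API cost of one request in USD."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Example: 2k input tokens, 4k output tokens (hidden reasoning bills as output).
for model in PRICES:
    print(f"{model}: ${cost_usd(model, 2_000, 4_000):.4f} per request")
# o1-mini comes out 80% cheaper, matching the figure quoted above.
```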
Benchmark Results of OpenAI o1
OpenAI o1-preview/mini has outperformed previous AI models in various benchmarks, elevating AI reasoning capabilities to a new level.
The following graph, published by OpenAI, compares o1’s performance in mathematics, programming, and PhD-level science with GPT-4o, showing that o1’s scores are overwhelmingly superior to those of GPT-4o.
Source: https://openai.com/index/learning-to-reason-with-llms/
Mathematics: Achieved a Score at the Top 500 Level in the United States on AIME
In the American Invitational Mathematics Examination (AIME), which measures the mathematical ability of top high school students, o1 averaged 74% (11.1 out of 15 problems) with a single sample per problem, 83% (12.5 out of 15) with consensus among 64 samples, and 93% (13.9 out of 15) when re-ranking 1,000 samples using a learned scoring function.
This score places the model among the top 500 students nationwide in the United States, above the cutoff for the USA Mathematical Olympiad, the selection stage for the International Mathematical Olympiad (IMO).
Source: https://openai.com/index/learning-to-reason-with-llms/ (The charts show AIME scores improving as the model’s train-time compute increases.)
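OpenAI has not published its exact consensus procedure, but “consensus among 64 samples” is commonly read as majority voting over independently sampled answers. A minimal sketch, assuming a hypothetical sample_answer helper that performs one model call:

```python
from collections import Counter

def sample_answer(problem: str) -> str:
    """Hypothetical stand-in for one sampled model answer to `problem`."""
    raise NotImplementedError  # e.g., one o1 API call per sample

def consensus_answer(problem: str, n_samples: int = 64) -> str:
    """Majority vote over independently sampled final answers.

    Sampling many chains of thought and keeping the most common final
    answer tends to beat a single sample, consistent with the jump from
    74% (1 sample) to 83% (64-sample consensus) on AIME.
    """
    answers = [sample_answer(problem) for _ in range(n_samples)]
    answer, _count = Counter(answers).most_common(1)[0]
    return answer
```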
Coding: Ranked in the 89th Percentile on Codeforces, Achieved High Accuracy on HumanEval
OpenAI fine-tuned a model based on OpenAI o1 to enhance its programming capabilities and entered it, under the same conditions as human contestants, in the 2024 International Olympiad in Informatics (IOI). The result was a score of 213 points, placing in the 49th percentile. This is about 60 points higher than a random submission strategy would have achieved.
When the submission limit was relaxed, the model scored 362.14 points, surpassing the gold medal threshold. In addition, in simulated Codeforces contests, the o1-based model achieved an Elo rating of 1807, outperforming 93% of human competitors.
As for o1-mini, it achieved an Elo rating of about 1650 on Codeforces, comparable to o1 (1673) and far above o1-preview (1258). This places it around the 86th percentile of Codeforces competitors. o1-mini also performed strongly on the HumanEval coding benchmark and in high-school-level cybersecurity CTF (capture-the-flag) competitions.
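For context, Codeforces ratings are Elo-style, so rating gaps translate directly into expected head-to-head scores. Using the standard Elo formula (general background, not something OpenAI published for this evaluation):

```latex
% Expected score of player A (rating R_A) against player B (rating R_B):
E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}
% Example: o1 (R_A = 1673) vs. o1-mini (R_B = 1650):
E_A = \frac{1}{1 + 10^{(1650 - 1673)/400}} \approx 0.53
% A near-even expected score, which is why the two models are called comparable.
```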
These results suggest that o1-preview/mini has advanced coding capabilities, reaching a level where they can compete with human programmers. By automating various coding tasks such as code generation, code review, and bug fixing, o1-preview/mini is expected to significantly contribute to the efficiency of software development.
Source: https://openai.com/index/learning-to-reason-with-llms/
Science: Achieved Accuracy Surpassing Human Experts in GPQA Diamond
In the scientific question-answering benchmark “GPQA Diamond,” o1 achieved accuracy surpassing that of human experts with PhDs, a result that drew worldwide attention. This is the first time an AI model has outperformed human experts on this benchmark, which requires advanced scientific expertise.
o1-preview also achieved a 73.3% accuracy rate in GPQA Diamond, while o1-mini scored 60.0%, both far exceeding GPT-4o’s 50.6%.
o1-preview/mini is expected to contribute significantly to the advancement of science and technology by assisting in tasks such as reading scientific papers, analyzing experimental data, and developing new drugs.
Source: https://openai.com/index/openai-o1-mini-advancing-cost-efficient-reasoning/
o1-mini Falls Short of GPT-4o in MMLU, Which Requires Broad General Knowledge
In MMLU, a multiple-choice benchmark covering 57 subjects, o1 achieved an accuracy of 92.3% and o1-preview 90.8%, both above GPT-4o’s 88.7%. o1-mini, however, scored 85.2%, below GPT-4o.
This is likely because o1-mini is specialized in STEM fields and does not perform as well as GPT-4o in tasks like MMLU, which require broad general knowledge.
Source: https://openai.com/index/openai-o1-mini-advancing-cost-efficient-reasoning/
Human Evaluation: o1-preview/mini Superior in Reasoning-Focused Fields
OpenAI has also conducted human evaluation experiments. In these experiments, evaluators compared the answers of o1-preview/mini and GPT-4o to determine which provided better responses.
As a result, in reasoning-focused fields such as data analysis, coding, and mathematics, o1-preview/mini’s answers were rated as superior to those of GPT-4o.
However, in language-focused tasks such as text generation and translation, GPT-4o’s answers were rated higher. This is likely because the o1 models are optimized for reasoning rather than natural-language generation, so they do not match GPT-4o in such tasks.
The graph below shows the percentage of responses rated as “better than GPT-4o.” A score of 50% indicates that the evaluation found little difference between the two, while a score above 50% means o1 was rated better than GPT-4o.
The three graphs on the right (programming, data analysis, and calculations) show higher ratings for o1 compared to GPT-4o.
Source: https://openai.com/index/openai-o1-mini-advancing-cost-efficient-reasoning/
Use Cases
OpenAI o1-preview/mini, with its advanced reasoning abilities, has the potential to assist in solving problems across various fields and extend human capabilities.
Programming: A Powerful Tool to Accelerate Software Development
o1-preview/mini is expected to significantly contribute to the efficiency of software development with its advanced coding abilities. By automating various coding tasks such as code generation, code review, and bug fixing, developers can focus on more creative work.
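As one illustrative sketch of automated code review (the prompt and helper function are our own, not an official workflow; assumes the OpenAI Python SDK and o1-mini access):

```python
# Illustrative sketch: asking o1-mini to review a code snippet.
from openai import OpenAI

client = OpenAI()

def review_code(source: str) -> str:
    """Request a structured review of `source` from o1-mini."""
    prompt = (
        "Review the following Python code. List any bugs first, "
        "then style issues, then show a corrected version.\n\n" + source
    )
    response = client.chat.completions.create(
        model="o1-mini",
        messages=[{"role": "user", "content": prompt}],  # user-only messages
    )
    return response.choices[0].message.content

buggy = "def mean(xs):\n    return sum(xs) / len(xs)  # crashes on empty lists\n"
print(review_code(buggy))
```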
In this video, o1-preview is used to implement a snake game in HTML, JS, and CSS. Next, the user instructs it to add obstacles in the shape of the letters “AI” to make the game more challenging. o1-preview modifies the code as instructed and creates a snake game with “AI”-shaped obstacles on the screen.
In another video, a user who lacked the skills to write code for complex requirements describes how they used o1-preview to generate, from a textual description, the code for a tool that visually explains the self-attention mechanism in a Transformer class.
o1-preview/mini is also expected to be a useful learning support tool for beginner programmers. It not only explains how to write and debug code clearly but also provides an interactive learning environment for understanding fundamental programming concepts.
Scientific Research: AI Research Assistant to Accelerate the Advancement of Science and Technology
o1-preview/mini has the potential to accelerate the advancement of science and technology by assisting in various scientific research tasks such as reading scientific papers, analyzing experimental data, and developing new drugs.
For example, o1-preview/mini can automatically analyze vast amounts of scientific papers and extract key information. It can also analyze experimental data and construct statistical models to verify hypotheses. Additionally, o1-preview/mini can design potential drug compounds and predict their efficacy.
In the following video, geneticist Catherine Brownstein explains how o1-preview is useful in genetic research on rare diseases.
Previously, researchers had to examine each paper manually, but with o1-preview they can quickly summarize the necessary information and easily obtain data on where genes are expressed and what functions they serve.
Mathematics: Solving Complex Mathematical Problems and Supporting the Discovery of New Mathematical Theories
o1-preview/mini can design algorithms to solve complex mathematical problems, simplify and transform mathematical expressions, and model real-world phenomena mathematically.
In the following video, o1-preview is tasked with solving a complex riddle related to age.
The problem is as follows: “The prince is as old as the princess will be when the prince is twice as old as the princess was when the prince’s age was half the sum of their current ages. How old are the prince and the princess?” Even for humans, this is difficult to parse and solve at a glance.
o1-preview analyzed the problem using Chain of Thought reasoning, set up variables, organized the conditions into equations, and arrived at the correct general solution: the princess is 6k years old and the prince 8k years old (where k is an arbitrary natural number).
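For readers who want to check the answer, here is one way to set up the equations (our own reconstruction; the notation in the video may differ). Let p be the princess’s current age and q the prince’s:

```latex
% Past moment: the prince's age was half the sum of their current ages,
% i.e. (p+q)/2, which was (q-p)/2 years ago; the princess was then
p - \frac{q-p}{2} = \frac{3p-q}{2}
% Future moment: the prince is twice that old, i.e. q + t = 3p - q,
% so t = 3p - 2q years from now.
% The prince's current age equals the princess's age at that moment:
q = p + (3p - 2q) = 4p - 2q \;\Longrightarrow\; 3q = 4p
% Hence p : q = 3 : 4; requiring (q-p)/2 to be a whole number of years
% gives p = 6k and q = 8k for a natural number k.
```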
Other Applications: Education, Finance, Law, and More
In addition to the fields mentioned above, o1-preview/mini can be applied to a wide range of fields such as education, finance, and law, for complex tasks that require human thought processes.
- Education: o1-preview/mini can provide individually optimized learning materials and instruction tailored to each student’s learning progress and comprehension level.
- Finance: o1-preview/mini can analyze vast amounts of financial data, predict market trends, and develop investment strategies.
- Law: o1-preview/mini can assist with the interpretation of legal documents and case law research, contributing to the efficiency of legal professionals.
Development with a Focus on Safety and Ethics
OpenAI has emphasized safety and ethics in the development of o1-preview/mini. The model is designed to avoid generating harmful content, engaging in unethical behavior, and violating privacy.
- Specific Safety Measures: Refusal of harmful prompts, elimination of bias, and ethical behavior. o1-preview/mini learns reasoning methods within the context of safety regulations, enabling more effective application of these rules. For example, if a user provides a prompt that encourages illegal activity, o1-preview/mini recognizes within the Chain of Thought process that the prompt violates safety regulations and rejects it.
Additionally, o1-preview/mini adopts various bias-reduction techniques to eliminate biases present in the training data. Furthermore, the model is designed to act in accordance with ethical guidelines, ensuring it avoids engaging in unethical behaviors.
Rigorous Safety Evaluation: Jailbreak Test, Bias Detection Test, and Ethics Evaluation Test
OpenAI has conducted various safety tests to evaluate the safety of o1-preview/mini. These tests include the “Jailbreak Test” to check whether the model adheres to safety regulations, the “Bias Detection Test” to see if the model generates biased information, and the “Ethics Evaluation Test” to determine if the model engages in unethical behavior.
Comparing GPT-4o and o1
So far, we have discussed OpenAI o1-preview/mini at length. GPT-4o appears to be better at text generation, while o1-mini may be superior at program generation, so let’s compare their outputs on two themes.
Traditional Japanese Comedy “Oogiri”
To compare text generation abilities, the following prompt was input, and the outputs were compared:
GPT-4o output
o1-mini output
As in the earlier comparison between GPT-4o and Claude 3.5 Sonnet, it seems the GPT models have no sense of humor. The output from o1-mini didn’t really resemble traditional Oogiri, so GPT-4o might still be the better option here…
However, o1-mini started the Chain of Thought process even with this kind of prompt, giving it a surreal but different type of humor.
Original Game Implementation
Next, let’s test o1-mini in its strong suit: programming. The following prompt was entered:
GPT-4o output
o1-mini output
This was an overwhelming victory for o1-mini!
First, the speed of the output was completely different. o1-mini felt about five times faster.
As for quality, GPT-4o’s output wasn’t even functional as a game. o1-mini’s game, on the other hand, ended up closer to Tetris than to Puyo Puyo, and a bug at one point prevented pieces from moving to the right, but it ran and was reasonably complete as a game.
It was disappointing that the squishy “Puyo Puyo” feel wasn’t there at all, though.
In any case, I could feel that o1-mini’s programming capabilities were superior to GPT-4o!