OpenAI and Google’s AI race: Is GPT-4o or Gemini 1.5 Pro ahead?
On benchmarks such as Massive Multitask Language Understanding (MMLU), GPQA, HumanEval, and MATH, GPT-4o surpasses the May release of Gemini 1.5 Pro.
In recent weeks, we covered OpenAI’s new model, GPT-4o, which can read people’s emotions from their facial expressions. The GPT-4o release was met with such enthusiasm that ChatGPT’s mobile app revenue jumped after launch, rising to $900,000 and nearly doubling its previous average of $491,000.
OpenAI moved up its spring update announcing GPT-4o to precede the Google I/O event, aiming to overshadow Google’s AI-focused announcements. At the event, where Google announced numerous AI innovations in various fields, it was revealed that the context window of Gemini 1.5 Pro had been expanded to 2 million tokens.
In previous evaluations, GPT-4 generally outperformed other models, though as we reported, Gemini Ultra managed to surpass GPT-4 on many benchmarks. After the latest developments, let’s take a look at how GPT-4o and Gemini stack up against each other.
GPT-4o, GPT-4 Turbo, and GPT-4 Comparison
GPT-4 Turbo features a context window of 128,000 tokens, four times larger than GPT-4’s. According to OpenAI, GPT-4o can respond to voice inputs in as little as 232 milliseconds. This figure is noteworthy because it approaches the average human response time in conversation, around 320 milliseconds. GPT-4o matches GPT-4 Turbo’s performance on English text and code. OpenAI also notes significant improvements on non-English text and says the model is much faster and 50% cheaper when accessed via the API.
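As a practical illustration, here is a minimal sketch of calling GPT-4o through the API with the openai Python SDK; the prompt is invented, and model names and pricing may change over time:

```python
# Minimal sketch: calling GPT-4o via the OpenAI API (openai Python SDK, v1-style client).
# Assumes OPENAI_API_KEY is set in the environment; the prompt is illustrative.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",  # per OpenAI, faster and 50% cheaper than GPT-4 Turbo via the API
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize the GPT-4o announcement in one sentence."},
    ],
)
print(response.choices[0].message.content)
```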
Before GPT-4o, using Voice Mode with ChatGPT meant average latencies of 2.8 seconds (with GPT-3.5) and 5.4 seconds (with GPT-4). A simple model first transcribed the audio to text, GPT-3.5 or GPT-4 took in that text and produced a text reply, and a third simple model converted the reply back into speech. With GPT-4o, all inputs and outputs, including text, image, and voice, are processed by the same neural network.
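To make the latency gap concrete, here is a hedged sketch of that old three-stage pattern as separate API calls; the endpoint and model names follow OpenAI’s public SDK, while the file names are invented and ChatGPT’s internal pipeline is simplified:

```python
# Sketch of the legacy three-stage Voice Mode flow: speech -> text -> LLM -> text -> speech.
# Each stage is a separate model call, which is where the multi-second latency came from.
from openai import OpenAI

client = OpenAI()

# 1) A simple model transcribes the user's audio to text.
with open("user_question.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=audio_file)

# 2) The language model reads text and produces text.
reply = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": transcript.text}],
)

# 3) A third model converts the reply text back into speech.
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=reply.choices[0].message.content,
)
speech.write_to_file("assistant_reply.mp3")
```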
Comparison of Gemini Versions: Current Gemini 1.5 Pro and Gemini 1.5 Flash
A new report released by Google in May compares the current Gemini 1.5 Pro and Gemini 1.5 Flash models with other models in the Gemini series. The report presents the updated Gemini 1.5 Pro as an upgrade from the previous February release of Gemini 1.5 Pro. This model outperforms its predecessor in most capabilities and benchmarks.
The current Gemini 1.5 Pro surpasses Gemini 1.0 Pro and 1.0 Ultra in various comparisons. Additionally, according to the report, the updated Gemini 1.5 Pro required significantly less compute to train. Compared to the previous version, the May release stands out for its multimodal reasoning capabilities.
Similarly, the report indicates that Gemini 1.5 Flash performs better than Gemini 1.0 Pro. In several benchmarks, it even shows comparable performance to 1.0 Ultra.
The data shared in the report highlights the progress the May update of Gemini 1.5 Pro has made over the February release. The model shows significant advances, particularly on the V*Bench benchmark, and also posts notable improvements on the EgoSchema, MATH, MathVista, and HumanEval evaluations.
Comparison with GPT-4 Turbo and Claude 2.1 on the MRCR Task
In the Multi-Round Co-reference Resolution (MRCR) task, Gemini models outperform GPT-4 Turbo and Claude 2.1, particularly at long context lengths. The task presents the model with a lengthy conversation between a user and the model, in which the user requests various types of writing, such as poems, riddles, and essays, and the model responds accordingly. In each conversation, two user requests whose topic and writing style differ from the rest are randomly inserted into the context. According to the report, the task also measures the AI’s reasoning abilities over long contexts.
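As an illustration of the setup, here is a hypothetical Python sketch of constructing such a probe; the filler topics, needle requests, and final question are invented, and this is not Google’s actual evaluation harness:

```python
# Hypothetical MRCR-style probe: a long conversation of writing requests with two
# off-topic "needles" hidden in it, ending with a question that forces the model
# to resolve a reference across the whole context.
import random

FILLER_TOPICS = ["the sea", "autumn", "trains", "coffee", "mountains"]

def build_mrcr_style_conversation(num_turns: int = 100) -> list[dict]:
    turns: list[dict] = []
    for topic in random.choices(FILLER_TOPICS, k=num_turns):
        turns.append({"role": "user", "content": f"Write a short poem about {topic}."})
        turns.append({"role": "assistant", "content": f"(a poem about {topic})"})
    # Insert two requests whose topic and style differ from the rest.
    for needle in ["a riddle about a locked door", "an essay on medieval trade routes"]:
        pos = random.randrange(0, len(turns) + 1, 2)  # even index keeps turn order intact
        turns.insert(pos, {"role": "assistant", "content": f"({needle})"})
        turns.insert(pos, {"role": "user", "content": f"Please write {needle}."})
    # Final probe: scoring checks whether the model reproduces the right request.
    turns.append({
        "role": "user",
        "content": "Earlier I asked for exactly one riddle. Repeat that request word for word.",
    })
    return turns

conversation = build_mrcr_style_conversation()
print(len(conversation), "turns; last:", conversation[-1]["content"])
```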
Gemini 1.5 Pro vs. GPT-4 Turbo
Gemini 1.5 Pro begins to outperform GPT-4 Turbo at around 8,000 tokens, while Gemini 1.5 Flash overtakes it at approximately 20,000 tokens. According to the report, both Google models maintain an average score of around 75% at 1 million tokens. In contrast, GPT-4 Turbo’s performance steadily declines, reaching about 60% at its 128,000 token limit. Claude 2.1, which can handle up to 200,000 tokens, scores around 20% at 128,000 tokens.
Comparison of Output Speed: Claude, GPT, and Gemini Models
Another comparison presented in the report focuses on output speed. It measures how quickly the models produce output in English, Japanese, Chinese, and French.
According to the shared information, Gemini 1.5 Flash produces the fastest output across all four evaluated languages. It outperforms Gemini 1.5 Pro, GPT-4 Turbo, Claude 3 Sonnet, and Claude 3 Opus. For English queries, Gemini 1.5 Flash produces over 650 characters per second, making it 30% faster than Claude 3 Haiku, the second fastest model among those evaluated.
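Characters per second is straightforward to measure yourself with a streaming API call. A rough sketch follows; the model name and prompt are illustrative, and network conditions will dominate small samples:

```python
# Rough sketch: measure output speed in characters per second with a streaming
# chat completion. Network latency and prompt choice will affect the result.
import time
from openai import OpenAI

client = OpenAI()

start = time.perf_counter()
chars = 0
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain transformers in about 200 words."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices:
        chars += len(chunk.choices[0].delta.content or "")
elapsed = time.perf_counter() - start
print(f"{chars} chars in {elapsed:.2f}s -> {chars / elapsed:.0f} chars/sec")
```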
Detailed Comparison and Recommendations
The report from Google contains many more comparisons. For a detailed review, it is recommended to read the full report.
Comparison of GPT-4o, Gemini Ultra 1.0, and Gemini Pro 1.5
In a blog post published by OpenAI, GPT-4o is compared on text evaluations against GPT-4 Turbo, GPT-4 (initial March 2023 release), Anthropic’s Claude 3 Opus, Google’s Gemini Pro 1.5 and Gemini Ultra 1.0, and Meta’s Llama 3 400B model.
Performance Comparison of GPT-4o, GPT-4 Turbo, and Other Models
In this comparison, GPT-4o outperforms GPT-4 Turbo in all evaluations except Discrete Reasoning Over Paragraphs (DROP). On DROP, GPT-4o closely matches Claude 3 Opus and outperforms Gemini Pro 1.5, which scores 78.9%.
MMLU Performance
On the Massive Multitask Language Understanding (MMLU) benchmark, GPT-4 scores 86.5%, while GPT-4o achieves 88.7%. Gemini 1.5 Pro scores 81.9% here, a figure that appears to come from the February version, as this is the value cited in Google’s documents for that release. The May version of Gemini Pro 1.5 scores 85.9%, still falling short of GPT-4o. Gemini Ultra, at 83.7%, also trails GPT-4o.
Performance on Various Benchmarks
GPQA (Graduate-Level Google-Proof Question Answering)
– GPT-4o: 53%
– Gemini 1.5 Pro (May version): 46.2%
MATH
– GPT-4o: 76.6%
– Gemini 1.5 Pro (February version): 58.5%
– Gemini 1.5 Pro (May version): 67.7%
– Gemini Ultra 1.0: 53.2%
HumanEval (code generation benchmark from “Evaluating Large Language Models Trained on Code”)
– GPT-4o: 90.2%
– Gemini 1.5 Pro (February version): 71.9%
– Gemini 1.5 Pro (May version): 84.1%
– Gemini Ultra: 53.2%
On this benchmark, the latest version of Gemini 1.5 Pro falls behind both GPT-4o and GPT-4 Turbo, with Gemini Ultra scoring even lower.
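For context on what HumanEval measures: each problem gives the model a function signature and docstring, and the completion counts as correct only if hidden unit tests pass. Here is a toy sketch of that pass/fail check; real harnesses sandbox the execution step, and the example problem is invented:

```python
# Toy HumanEval-style check: run a model-written function against unit tests.
# WARNING: real evaluation harnesses sandbox this step; never exec untrusted
# model output directly like this outside a toy example.
def passes_tests(candidate_code: str, test_code: str) -> bool:
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)  # define the model-written function
        exec(test_code, namespace)       # run hidden unit tests (plain asserts)
        return True
    except Exception:
        return False

# Hypothetical problem: the model was asked to implement add(a, b).
candidate = "def add(a, b):\n    return a + b\n"
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
print(passes_tests(candidate, tests))  # True -> counts toward the pass rate
```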
MGSM (Multilingual Grade School Math)
– GPT-4o: 90.5%
– GPT-4 Turbo: 88.5%
– Gemini 1.5 Pro (February version): 88.7%
– Gemini Ultra: 79%
Voice Translation Models
On speech translation, Google’s AudioPaLM-2 and Gemini models outperform Meta’s XLS-R and SeamlessM4T-v2 as well as OpenAI’s Whisper-v3. However, GPT-4o appears to slightly edge out Google’s models.
On automatic speech recognition, a comparison between GPT-4o and Whisper-v3 shows that GPT-4o outperforms Whisper-v3 across different languages.
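Speech recognition comparisons like this are typically scored with word error rate (WER): the word-level edit distance between the model’s transcript and a reference, divided by the reference length. A small self-contained sketch of the metric, with invented example strings:

```python
# Word error rate (WER): Levenshtein distance over words between a hypothesis
# transcript and a reference transcript, divided by the reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits needed to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # ~0.167
```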
Comparison between ChatGPT and Gemini Models
Considering all evaluations, GPT-4o generally performs better than GPT-4. Additionally, Gemini 1.5 Pro, with its 2 million token context window, excels particularly at handling long texts and content, outperforming the GPT-series models in that respect.
Gemini 1.5 Pro (2 million token context window, May version)
– Context Window: Gemini 1.5 Pro’s most notable feature is its extensive 2 million token context window, which allows the model to process far more text at once (a rough token-count sketch follows this list).
– Natural Language Processing Abilities: With its broad context window, Gemini 1.5 Pro can better understand complex and lengthy texts while preserving contextual relationships. This is a significant advantage, especially for long documents or multi-part conversations.
– Use Cases: Thanks to its extensive context window, Gemini 1.5 Pro can perform exceptionally well in tasks such as large-scale document analysis, long-term conversations, and multi-part story writing.
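To get a feel for what 2 million tokens means in practice, here is a rough sketch that counts a document’s tokens before sending it. It uses OpenAI’s tiktoken tokenizer as a proxy, since Gemini tokenizes differently, so treat the count as approximate; the file name is invented:

```python
# Rough sketch: estimate whether a document fits in a given context window.
# tiktoken is an OpenAI tokenizer; Gemini's tokenizer differs, so this is
# only an approximation for Gemini models.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # encoding used by GPT-4o-family models

def fits_in_context(text: str, context_window: int) -> bool:
    n_tokens = len(enc.encode(text))
    print(f"{n_tokens:,} tokens vs. a window of {context_window:,}")
    return n_tokens <= context_window

with open("large_document.txt") as f:
    fits_in_context(f.read(), context_window=2_000_000)  # Gemini 1.5 Pro's window
```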
General Comparison
– Context Management: Gemini 1.5 Pro can handle a much broader context compared to GPT-4 and GPT-4o, providing a significant advantage, especially in long texts and complex contexts.
– Performance and Use Cases: GPT-4 and GPT-4o are powerful models for general-purpose use and perform well in various tasks. However, due to context window limitations, they may risk forgetting some information or losing contextual relationships in very long texts.
– Customization and Optimization: While GPT-4o may offer performance optimizations, Gemini 1.5 Pro’s extensive context window provides more consistency and contextual accuracy in longer and more complex tasks.
Gemini Advanced made the following comparison:
GPT-4 and GPT-4o are versatile language models that excel in areas such as language proficiency, logic, reasoning, coding, and creativity. The multimodal capabilities of GPT-4o provide an advantage for applications that involve working with visual data.
Gemini 1.5 Pro stands out from its competitors in understanding and processing long and complex texts due to its massive context window. This feature is ideal for applications such as large-scale text analysis, document summarization, and tracking long conversation histories.
When deciding which model is more suitable for you, it’s important to evaluate based on the features you need and the application area.
Important Note: Gemini 1.5 Pro’s 2 million token context window has not yet been widely released; Google currently makes it available only to select developers and researchers. GPT-4 and GPT-4o are therefore the more widely accessible options for now.