Google markets Gemini 2.5/3.1 Pro as the world's best AI. Independent benchmarks tell a different story: there is no single "best" model. Gemini leads in reasoning and multimodal tasks. Claude leads on SWE-bench Verified and long-duration tasks. GPT leads in hard science and knowledge-work metrics. On several major benchmarks the top models land within 1-2 points of each other. Google's "world's best" claim is marketing, not fact.
The Claim
In March 2025, Google DeepMind announced Gemini 2.5 Pro with this headline:
"Gemini 2.5: Our most intelligent AI model"
The blog post made several specific claims:
- "State-of-the-art on a wide range of benchmarks"
- "Debuts at #1 on LMArena by a significant margin"
- "Strong reasoning and code capabilities, leading on common coding, math and science benchmarks"
Google's technical paper (arXiv:2507.06261) reinforced this: "Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks."
These are strong claims. "State-of-the-art" doesn't mean "pretty good." It means the best in the world. Is that true?
We analyzed 18 benchmarks from the independently run LM Council (May 2026), curated by AI Explained and run by Epoch AI, Scale AI, and other third parties. These scores are independently measured, not self-reported.
The Data: 18 Benchmarks, 3 Families, One Clear Picture
We tracked the top 3 AI model families — Gemini, GPT, and Claude — across every benchmark in the LM Council's May 2026 leaderboard. Here's what we found.
Overall Scorecard
| Benchmark | Category | Winner | Gemini Best | GPT Best | Claude Best |
|---|---|---|---|---|---|
| Humanity's Last Exam | General Reasoning | 🥇 Gemini | 44.7% (3.1 Pro) | 44.3% (GPT-5.5) | — |
| SimpleBench | Common Sense | 🥇 Gemini | 79.6% (3.1 Pro) | 76.9% (GPT-5.5 Pro) | — |
| GPQA Diamond | PhD Science | 🥇 GPT | 94.1% | 94.6% (GPT-5.4 Pro) | — |
| GDPval | Knowledge Work | 🥇 GPT | 53.5% (3 Pro) | 83.0% (GPT-5.4) | 59.6% (Opus 4.5) |
| SWE-bench Verified | Coding | 🥇 Claude | 80.6% (3.1 Pro) | 76.9% (GPT-5.4) | 83.5% (Opus 4.7) |
| SWE-bench Pro | Coding (Hard) | 🥇 GPT | 54.2% | 57.7% (GPT-5.4) | ~45% |
| Terminal-Bench 2.0 | Agentic | 🥇 Gemini/GPT | 78.4% (3.1 Pro) | 77.3% (5.3 Codex) | 69.9% (Opus 4.6) |
| METR Time Horizons | Long Tasks | 🥇 Claude | — | 352 min | 718 min (Opus 4.6) |
| FrontierMath | Research Math | 🥇 GPT | — | 50.0% (GPT-5.4) | 40.7% (Opus 4.6) |
| WeirdML v2 | ML Coding | 🥇 GPT | 72.1% (3.1 Pro) | 79.3% (5.3 Codex) | 65.9% (Opus 4.6) |
| WebDev Arena | Web Building | 🥇 Claude | — | 1480 (GPT-5.2) | 1512 (Opus 4.5) |
| Fiction.liveBench | Long Context | 🥇 o3/Grok | 90.6% (2.5 Pro) | 96.9% (GPT-5) | — |
| BALROG | Game Completion | 🥇 Gemini | 48.1% (3 Flash) | — | — |
| MATH Level 5 | Competition Math | 🥇 GPT | — | 98.1% (GPT-5) | — |
| OTIS Mock AIME | Advanced Math | 🥇 Claude | 95.6% (3.1 Pro) | 96.1% (GPT-5.2) | 97.8% (Opus 4.7) |
| GSO | Code Optimization | 🥇 GPT | 18.6% (3 Pro) | 27.4% (GPT-5.2) | 26.5% (Opus 4.5) |
| GeoBench | Visual GeoReasoning | 🥇 Gemini | 3893 (3 Pro) | 3789 (o3) | — |
| VPCT | Visual Physics | 🥇 Gemini | 91.0% (3 Pro) | 84.0% (GPT-5.2) | — |
The Score Tally
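Gemini leads in: 6 of 18 benchmarks (Humanity's Last Exam, SimpleBench, Terminal-Bench 2.0, BALROG, GeoBench, VPCT). Terminal-Bench 2.0 is a near-tie, with Gemini 3.1 Pro ahead of GPT-5.3 Codex by 78.4% to 77.3%.
GPT leads in: 8 of 18 benchmarks (GPQA Diamond, GDPval, SWE-bench Pro, FrontierMath, WeirdML v2, MATH Level 5, GSO, Fiction.liveBench). Fiction.liveBench is counted here because the scorecard lists OpenAI's o3 alongside Grok as the winner.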
Claude leads in: 4 of 18 benchmarks (SWE-bench Verified, METR Time Horizons, WebDev Arena, OTIS Mock AIME)
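The tally can be reproduced from the scorecard with a few lines of Python. This is a minimal sketch: the winner labels are copied from the table, and it follows the article's two judgment calls (Terminal-Bench 2.0 counted for Gemini, Fiction.liveBench counted for GPT). The names `WINNERS` and `tally` are illustrative.

```python
from collections import Counter

# Winner per benchmark, copied from the scorecard above.
# Assumption: Terminal-Bench 2.0 is counted for Gemini and
# Fiction.liveBench for GPT, matching the article's tally.
WINNERS = {
    "Humanity's Last Exam": "Gemini",
    "SimpleBench": "Gemini",
    "GPQA Diamond": "GPT",
    "GDPval": "GPT",
    "SWE-bench Verified": "Claude",
    "SWE-bench Pro": "GPT",
    "Terminal-Bench 2.0": "Gemini",
    "METR Time Horizons": "Claude",
    "FrontierMath": "GPT",
    "WeirdML v2": "GPT",
    "WebDev Arena": "Claude",
    "Fiction.liveBench": "GPT",
    "BALROG": "Gemini",
    "MATH Level 5": "GPT",
    "OTIS Mock AIME": "Claude",
    "GSO": "GPT",
    "GeoBench": "Gemini",
    "VPCT": "Gemini",
}


def tally(winners):
    """Count benchmark wins per model family."""
    return Counter(winners.values())


counts = tally(WINNERS)
for family, wins in counts.most_common():
    print(f"{family}: {wins} of {len(WINNERS)}")
```

No family wins a majority of the 18 benchmarks, which is the point of the scorecard.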
Why Google's "World's Best" Claim Falls Apart
1. Benchmark Selection Bias
Google chose benchmarks where Gemini excels and omitted where it doesn't. Their March 2025 announcement highlighted LMArena (human preference ranking) and GPQA / AIME — where Gemini 2.5 Pro was competitive at launch.
But even on GPQA Diamond, which Google touted, GPT-5.4 Pro now leads at 94.6% vs Gemini 3.1 Pro's 94.1%. A 0.5-point gap is small enough to be noise, but it still means Gemini is not the outright leader.
What Google didn't mention:
- On GDPval (knowledge work across 44 occupations), GPT-5.4 scores 83.0% vs Gemini 3 Pro's 53.5%, a 29.5-point gap.
- On FrontierMath (expert-level math), GPT-5.4 scores 50.0%, while Gemini doesn't even make the top 5.
- On METR Time Horizons (long-duration tasks), Claude Opus 4.6 handles 718 minutes of continuous work, while Gemini doesn't appear in the top 5.
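To put these gaps side by side, the sketch below computes how far Gemini's best score trails the leader on four benchmarks where both scores appear in the scorecard. The scores are copied from the table; the names `HEAD_TO_HEAD` and `gap_to_leader` are illustrative.

```python
# Each entry: (leading model, leader's score in %, Gemini's best score in %),
# copied from the scorecard above.
HEAD_TO_HEAD = {
    "GPQA Diamond": ("GPT-5.4 Pro", 94.6, 94.1),
    "GDPval": ("GPT-5.4", 83.0, 53.5),
    "SWE-bench Verified": ("Claude Opus 4.7", 83.5, 80.6),
    "WeirdML v2": ("GPT-5.3 Codex", 79.3, 72.1),
}


def gap_to_leader(leader_score, gemini_score):
    """Percentage-point gap between the leader and Gemini, to one decimal."""
    return round(leader_score - gemini_score, 1)


gaps = {
    name: gap_to_leader(leader_score, gemini_score)
    for name, (_, leader_score, gemini_score) in HEAD_TO_HEAD.items()
}

# Print the smallest gap first.
for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
    leader = HEAD_TO_HEAD[name][0]
    print(f"{name}: Gemini trails {leader} by {gap} points")
```

The spread runs from half a point on GPQA Diamond to 29.5 points on GDPval, so the size of the gap depends heavily on which benchmark is quoted.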
2. "State-of-the-Art" Is Time-Bound
When Google released Gemini 2.5 Pro in March 2025, it genuinely led many benchmarks. But the AI landscape moves fast:
- GPT-5.4 (late 2025) reclaimed leadership on hard science and knowledge work.
- Claude Opus 4.6/4.7 (early 2026) leads software engineering, with 83.5% on SWE-bench Verified.
- Gemini 3.1 Pro (early 2026, already present in the March 2026 results quoted below) reclaimed the lead on reasoning and multimodal tasks.
Google's "state-of-the-art" claim was true — for about 2 months. In AI, that's a lifetime.
3. The "One Model to Rule Them All" Fallacy
The most honest assessment comes from Byteiota's March 2026 coding benchmark analysis:
"The AI coding tool wars are over, and nobody won. March 2026 benchmark results show Claude Opus 4.6, GPT-5.4, and Gemini 3.1 Pro trading victories across different tasks, with top models landing within 1-2 points of each other on major benchmarks."
The question isn't "which model is best?" It's "which model is best for this job?"
| Task Type | Best Model | Why |
|---|---|---|
| Long-form codebase work | Claude Opus 4.6/4.7 | 1M context, leads SWE-bench |
| Terminal automation | GPT-5.3 Codex / Gemini 3.1 Pro | Near-tie at 77.3% vs 78.4% on Terminal-Bench 2.0 |
| Hard science research | GPT-5.4 Pro | 94.6% GPQA, 50% FrontierMath |
| Common-sense reasoning | Gemini 3.1 Pro | 79.6% SimpleBench |
| Visual/multimodal | Gemini 3 Pro | 91% VPCT, 3893 GeoBench |
| Knowledge work | GPT-5.4 | 83% GDPval |
| Cost-effective | Gemini 3.1 Pro | $2/$12 per M tokens |
| Competition-style math | Claude Opus 4.7 | 97.8% OTIS Mock AIME |
The Price Factor
| Model | Input (per M tokens) | Output (per M tokens) |
|---|---|---|
| Gemini 3.1 Pro | $2.00 | $12.00 |
| GPT-5.2 | $1.75 | $14.00 |
| Claude Opus 4.6 | Premium tier | |
| Grok 4.1 | $0.20 | $0.50 |
For budget-conscious teams, Gemini 3.1 Pro delivers the best price-to-performance ratio. But calling a model "world's best" when its main advantage is price is like calling a Honda Civic "the world's best car" because it has the best fuel economy.
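Per-token prices are easier to compare once they are turned into a cost per request. The sketch below does that arithmetic using the price table above; the 10,000-input / 2,000-output request size is a hypothetical workload chosen only for illustration, and Claude Opus 4.6 is left out because the table gives no figures for it.

```python
# USD per million tokens as (input, output), from the price table above.
# Claude Opus 4.6 is omitted: the table lists it only as "Premium tier".
PRICES = {
    "Gemini 3.1 Pro": (2.00, 12.00),
    "GPT-5.2": (1.75, 14.00),
    "Grok 4.1": (0.20, 0.50),
}


def request_cost(model, input_tokens, output_tokens):
    """Cost in USD of one request at the listed per-million-token prices."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical request size (an assumption, not from the article).
INPUT_TOKENS = 10_000
OUTPUT_TOKENS = 2_000

for model in PRICES:
    cost = request_cost(model, INPUT_TOKENS, OUTPUT_TOKENS)
    print(f"{model}: ${cost:.4f} per request")
```

With this mix, Gemini 3.1 Pro comes out slightly cheaper than GPT-5.2 despite its higher input price, because output tokens cost more per token; a different input/output ratio would shift the comparison.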
What This Means for You
If you're a developer:
Don't pick one model. Run 2-3 in a routing setup. Let cheap models handle docs. Let premium models handle complex architecture. 37% of enterprises already use 5+ models in production (IDC 2026).
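A routing setup can start as a plain lookup from task type to model. The sketch below is a minimal illustration, not a production router: the task labels are hypothetical, the model assignments loosely follow the task table and price table in this article, and sending docs to Grok 4.1 is an assumption based only on it being the cheapest model listed. No provider API is called.

```python
# Hypothetical task labels mapped to models named in this article.
# Assumption: "docs" goes to the cheapest listed model (Grok 4.1).
ROUTES = {
    "docs": "Grok 4.1",
    "codebase": "Claude Opus 4.7",
    "science": "GPT-5.4 Pro",
    "knowledge_work": "GPT-5.4",
}

# Fallback for unrecognized task types (an illustrative choice).
DEFAULT_MODEL = "Gemini 3.1 Pro"


def route(task_type: str) -> str:
    """Return the model name to use for a task type, or the default."""
    return ROUTES.get(task_type.strip().lower(), DEFAULT_MODEL)


for task in ["docs", "codebase", "science", "image_captioning"]:
    print(f"{task} -> {route(task)}")
```

Keeping the mapping in one table means a new benchmark result changes a single line rather than the calling code.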
If you're a buyer:
Ignore "world's best" claims. They are always marketing. Match the model to your specific task. Need visual reasoning? Gemini. Need reliable code fixes? Claude. Need deep scientific analysis? GPT.
If you're building a product:
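API diversity is risk management. Don't tie your product to one provider. Routing requests to cheaper models where they suffice can reportedly cut costs by 60-85% compared with sending everything to a single premium model, while maintaining or improving performance.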
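The size of the saving depends on how much traffic the cheap model can absorb. As a rough sketch, the arithmetic below blends two prices from this article's price table under an assumed split; the 80/20 traffic share and the equal input/output volume are hypothetical, and the calculation says nothing about output quality.

```python
# Cost in USD of 1M input + 1M output tokens, from the price table above.
CHEAP_COST = 0.20 + 0.50      # Grok 4.1
PREMIUM_COST = 2.00 + 12.00   # Gemini 3.1 Pro


def blended_cost(cheap_cost, premium_cost, cheap_share):
    """Average cost when cheap_share of traffic goes to the cheap model."""
    return cheap_share * cheap_cost + (1 - cheap_share) * premium_cost


def savings_vs_premium(cheap_cost, premium_cost, cheap_share):
    """Fractional saving relative to sending all traffic to the premium model."""
    return 1 - blended_cost(cheap_cost, premium_cost, cheap_share) / premium_cost


# Hypothetical split: 80% of requests are simple enough for the cheap model.
saving = savings_vs_premium(CHEAP_COST, PREMIUM_COST, cheap_share=0.8)
print(f"Estimated saving: {saving:.0%}")
```

Under these assumptions the saving lands at about 76%; a smaller cheap-model share, or a cheap model priced closer to the premium one, lowers it accordingly.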
Our Verdict
Google's "world's best AI" claim is a marketing statement, not a factual one.
The data shows a three-way split:
- Gemini leads in reasoning, multimodality, and cost efficiency
- GPT leads in hard science, knowledge work, and most math benchmarks
- Claude leads in software engineering and long-duration tasks
Is Gemini the world's best AI? Only if you define "best" as "best at the things Google chose to measure." In independent benchmarks, the answer is clear: there is no single best model, and anyone who tells you otherwise is selling something.
Data sources: LM Council Benchmarks (May 2026), Google DeepMind Blog (March 2025), arXiv:2507.06261, Byteiota Coding Benchmarks (March 2026)