TL;DR
Anthropic’s Claude Opus 4.8 leads the hardest coding benchmark (SWE-Bench Pro at 69.2%, 10.6 points ahead of GPT-5.5) but trails OpenAI on terminal-based tasks (Terminal-Bench 2.1: 74.6% vs 78.2%). Google’s Gemini 3.1 Pro wins browser-based coding tasks (BrowseComp single-agent: 85.9%) while charging the lowest token prices ($1.25/$5 per million). Two of Microsoft’s new MAI models hit 51% and ~53% on SWE-Bench Pro but lack terminal benchmark data. The market, worth $9.3B in 2026 and projected to reach ~$30B by 2031 (a 26% CAGR), has five serious contenders, each winning a different slice of the coding workload.
The Landscape
The AI code tools market hit $9.3 billion in 2026 and is projected to grow to ~$30 billion by 2031 — a 26% CAGR, per Mordor Intelligence cited in CNBC on June 1, 2026. Analyst Tomasz Tunguz of Theory Ventures told CNBC that “AI coding is the most attractive market for generative AI models,” and that “AI might eventually represent 30% to 60% of research and development spending.” Every major AI company is now building or buying its way into this space. The question: whose model actually performs best on real coding tasks?
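The growth figures above can be sanity-checked with the standard compound-growth formula. The short Python sketch below assumes a five-year horizon (2026 to 2031) and treats the quoted $9.3 billion, ~$30 billion, and 26% as exact inputs; the function names are illustrative and do not come from the cited report.

```python
def project(start_value: float, cagr: float, years: int) -> float:
    """Compound start_value forward at a constant annual growth rate."""
    return start_value * (1 + cagr) ** years


def implied_cagr(start_value: float, end_value: float, years: int) -> float:
    """Annual growth rate that takes start_value to end_value in `years`."""
    return (end_value / start_value) ** (1 / years) - 1


# Figures quoted in the text; the 5-year horizon (2026 -> 2031) is an assumption.
START_BILLIONS = 9.3
END_BILLIONS = 30.0
YEARS = 5

projected = project(START_BILLIONS, 0.26, YEARS)
rate = implied_cagr(START_BILLIONS, END_BILLIONS, YEARS)

print(f"Projected 2031 market at 26% CAGR: ${projected:.1f}B")
print(f"CAGR implied by $9.3B -> $30B: {rate:.1%}")
```

Under these assumptions the two figures agree to within rounding: compounding at 26% gives roughly $29.5 billion, and reaching $30 billion exactly would take about 26.4% a year.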
The Contenders
Anthropic – Claude Code
Claude Opus 4.8, released May 28, 2026, is the current SWE-Bench Pro leader at 69.2%, 10.6 points ahead of the next best model. It also leads on SWE-Bench Verified (88.6%) and OSWorld-Verified (83.4%), and Cursor’s co-founder Michael Truell confirmed that Opus 4.8 “exceeds prior Opus on CursorBench across all effort levels.” Anthropic’s Dynamic Workflows allow parallel subagents, and the model has a default 1M token context. The company, which had essentially zero coding revenue in 2024, reached a $965B valuation in its latest financing round and confidentially filed for an IPO on June 1, 2026.
OpenAI – Codex / GPT-5.5
GPT-5.5, released April 23, 2026, posts 58.6% on SWE-Bench Pro but leads the Unix terminal benchmark (Terminal-Bench 2.1 at 78.2%, the highest of any model). It also leads on HLE with tools (52.2%) and GDPval-AA (1,769 Elo). OpenAI has pivoted from consumer to enterprise (CNBC, May 11, 2026), and GPT-5.5, with its ~1M token context, now emphasizes CLI and computer-use capabilities. It is the strongest model for terminal-based development workflows.
Google – Gemini + Antigravity 2.0
Gemini 3.1 Pro, released February 19, 2026, scores 54.2% on SWE-Bench Pro. But its BrowseComp single-agent score (85.9%) is the highest in the industry, and its GPQA Diamond score (~94.3%) leads general reasoning benchmarks. Antigravity 2.0, announced at Google I/O in May 2026, is not a single model but a multi-agent orchestration layer that can “orchestrate multiple agents to execute tasks in parallel.” Gemini 3.5 Flash was also announced at I/O. Google also reset Gemini rate limits after developers complained about quota exhaustion. Its pricing is the lowest in the market: $1.25/$5 per million tokens for Gemini 3.1 Pro.
Microsoft – MAI + Copilot
At Build 2026 on June 2, Microsoft announced seven MAI models built from scratch with zero distillation and clean data lineage. Key models: MAI-Thinking-1 (35B MoE, ~53% SWE-Bench Pro, 97% on AIME 2025) and MAI-Code-1-Flash (5B parameters, 51% SWE-Bench Pro). Both are text-only with 256K context. Microsoft is shifting Copilot’s default models to MAI and starting to charge for Copilot based on usage (CNBC, May 22, 2026). The models are available on OpenRouter, Fireworks, and Baseten. Terminal-Bench data is not yet available.
Cursor
Cursor is not a model vendor but an IDE that integrates the best models. It signed an agreement with SpaceX giving SpaceX the right to acquire Cursor for $60 billion (CNBC, May 2026). CursorBench, its own benchmark, shows Claude Opus 4.8 leading across all effort levels. Cursor’s tool-calling efficiency, fewer steps, and better follow-through make it a strong distribution channel for whichever model performs best.
Coding Benchmark Comparison Table
| Model | Vendor | Released | SWE-Bench Pro | Terminal-Bench 2.1 | Pricing (per M tokens) | Context Window |
|---|---|---|---|---|---|---|
| Claude Opus 4.8 | Anthropic | May 28, 2026 | 69.2% | 74.6% | $5/$25 | 1M |
| GPT-5.5 | OpenAI | Apr 23, 2026 | 58.6% | 78.2% | ~$15/$60 | ~1M |
| Gemini 3.1 Pro | Google | Feb 19, 2026 | 54.2% | 70.3% | $1.25/$5 | 1M |
| MAI-Thinking-1 | Microsoft | Jun 2, 2026 | ~53% | N/A (data insufficient) | TBD | 256K |
| MAI-Code-1-Flash | Microsoft | Jun 2, 2026 | 51% | N/A (data insufficient) | TBD | 256K |
Additional known scores: Claude Opus 4.8 SWE-Bench Verified 88.6%, OSWorld-Verified 83.4%; GPT-5.5 BrowseComp single-agent 84.4%; Gemini 3.1 Pro BrowseComp single-agent 85.9%.
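To make the per-benchmark leads in the table easy to recompute, here is a minimal Python sketch that ranks models on each benchmark and reports the leader’s margin over the runner-up. The scores are copied from the table; the `SCORES` layout and the `rank` and `lead` helpers are illustrative, and MAI-Thinking-1’s approximate ~53% is entered as 53.0.

```python
# Scores copied from the comparison table; None marks benchmarks with no published data.
SCORES = {
    "Claude Opus 4.8": {"SWE-Bench Pro": 69.2, "Terminal-Bench 2.1": 74.6},
    "GPT-5.5": {"SWE-Bench Pro": 58.6, "Terminal-Bench 2.1": 78.2},
    "Gemini 3.1 Pro": {"SWE-Bench Pro": 54.2, "Terminal-Bench 2.1": 70.3},
    "MAI-Thinking-1": {"SWE-Bench Pro": 53.0, "Terminal-Bench 2.1": None},  # ~53% in the table
    "MAI-Code-1-Flash": {"SWE-Bench Pro": 51.0, "Terminal-Bench 2.1": None},
}


def rank(benchmark: str) -> list[tuple[str, float]]:
    """Return (model, score) pairs for one benchmark, best first, skipping missing data."""
    scored = [
        (model, results[benchmark])
        for model, results in SCORES.items()
        if results.get(benchmark) is not None
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def lead(benchmark: str) -> tuple[str, float]:
    """Return the leading model and its margin over the runner-up, in points."""
    (leader, best), (_, second) = rank(benchmark)[:2]
    return leader, round(best - second, 1)


for name in ("SWE-Bench Pro", "Terminal-Bench 2.1"):
    leader, margin = lead(name)
    print(f"{name}: {leader} leads by {margin} points")
```

With the table’s numbers this reports Claude Opus 4.8 ahead by 10.6 points on SWE-Bench Pro and GPT-5.5 ahead by 3.6 points on Terminal-Bench 2.1. Models without published scores are left out of the ranking rather than treated as zero.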
Revenue & Market Signals
- Anthropic had roughly zero coding revenue in 2024 and now carries a $965B valuation. The confidential IPO filing on June 1 signals confidence that coding will be its primary revenue driver.
- OpenAI pivoted hard to enterprise after Claude Code “ate its lunch on consumer” (analyst Gil Luria). GPT-5.5’s terminal strength suggests a focus on DevOps and backend workflows.
- Cursor secured a $60B acquisition right from SpaceX, demonstrating that startups can win at the distribution layer even if they don’t train their own models.
- Microsoft built seven models from scratch to reduce dependence on OpenAI and Anthropic. Its move to usage-based Copilot pricing mirrors the wider market trend.
The Late Entrants
Microsoft and Google entered the coding AI race later than Anthropic and OpenAI, but both are investing heavily.
Microsoft’s MAI models show how far small models can go: with only 5B parameters, MAI-Code-1-Flash achieves 51% on SWE-Bench Pro, about 3 points behind Gemini 3.1 Pro (54.2%). MAI-Thinking-1, a 35B MoE, matches Claude Opus 4.6-level performance (~53%). Terminal benchmarks have not yet been released, so the MAI models cannot be compared against GPT-5.5 or Claude on CLI tasks.
Google’s Antigravity 2.0 is not a model but an agent orchestration system. Combined with Gemini 3.1 Pro’s native multimodal support and leading BrowseComp score, Google is pursuing the “multi-agent coding assistant” angle rather than raw model performance. The low pricing ($1.25/$5 per million tokens) is a deliberate volume play.
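The pricing gap is easier to judge on a concrete workload. The Python sketch below uses the per-million-token prices from the comparison table and assumes each pair is input price followed by output price, which the table does not state explicitly. The workload size (2M input tokens, 500K output tokens) and the `workload_cost` helper are hypothetical, and GPT-5.5’s price is approximate in the source.

```python
# (input, output) USD per million tokens, from the comparison table.
# Reading each pair as input/output is an assumption; GPT-5.5's figure is approximate.
PRICES = {
    "Gemini 3.1 Pro": (1.25, 5.0),
    "Claude Opus 4.8": (5.0, 25.0),
    "GPT-5.5": (15.0, 60.0),
}


def workload_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a workload at the listed per-million-token prices."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical workload: 2M input tokens and 500K output tokens.
costs = {m: workload_cost(m, 2_000_000, 500_000) for m in PRICES}
for model, cost in sorted(costs.items(), key=lambda item: item[1]):
    print(f"{model}: ${cost:.2f}")
```

For this hypothetical mix, Gemini 3.1 Pro comes to $5.00, against $22.50 for Claude Opus 4.8 and about $60.00 for GPT-5.5. The ratios shift with the input/output mix, so a real comparison should use a team’s own token counts.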
The Data Verdict
No single model wins across all benchmarks. The data shows:
- Claude Opus 4.8 leads on the two most comprehensive coding benchmarks (SWE-Bench Pro and SWE-Bench Verified). It is the best for complex software engineering tasks.
- GPT-5.5 dominates terminal and CLI coding (Terminal-Bench 2.1 lead of 3.6 points). It is the strongest for DevOps, command-line development, and computer-use workflows.
- Gemini 3.1 Pro leads browser-based coding (BrowseComp single-agent 85.9%) and offers the lowest token cost. Best for web-focused development and cost-sensitive deployments.
- Microsoft’s MAI models are competitive on SWE-Bench Pro despite small parameter counts, but terminal and web benchmark results are missing, so they cannot yet be compared with the others on those workloads.
The market is not a monolith. Each vendor wins a distinct use case, and the “best” model depends entirely on the developer’s workflow. Anthropic’s lead on SWE-Bench Pro is the most publicized, but OpenAI’s terminal dominance and Google’s browsing edge are equally real.
Recommendation
- For complex software engineering teams: Claude Opus 4.8 — highest SWE-Bench Pro and SWE-Bench Verified scores.
- For terminal-heavy or DevOps tasks: GPT-5.5 — Terminal-Bench 2.1 leader and top computer use performance.
- For web-centric or cost-constrained projects: Gemini 3.1 Pro — best BrowseComp and lowest token pricing.
- For organizations wanting to avoid vendor lock-in: Evaluate MAI-Thinking-1 and MAI-Code-1-Flash once terminal benchmark data is released.