Claude Opus 4.8 Scores 61.4 on AAII — But Where Does It Still Lose?

TL;DR

Anthropic claims Claude Opus 4.8 is the first model to break 60 on the Artificial Analysis Intelligence Index (61.4), beating GPT-5.5 (60.2) by 1.2 points. The data confirms clear leads on SWE-Bench Pro (69.2% vs GPT-5.5’s 58.6%), HLE with tools (57.9% vs 52.2%), and GDPval-AA (1,890 Elo vs 1,769). Yet GPT-5.5 still leads Terminal-Bench 2.1 (78.2% vs 74.6%), and BrowseComp single-agent is a dead heat (84.4% vs 84.3%). GPQA Diamond is effectively a three-way tie: Opus 4.8 and GPT-5.5 both score 93.6%, and Opus 4.7 scores 94.2%. The AAII lead is real but narrower than it appears, and it doesn’t extend to every critical benchmark.

The Claim

Anthropic, May 28, 2026: “Claude Opus 4.8, taking the #1 spot on the Artificial Analysis Intelligence Index with a score of 61.4, dethroning GPT-5.5 for the first time since OpenAI’s April launch.”

The Data Sources

AAII Index: Artificial Analysis, May 28, 2026 (https://artificialanalysis.ai/articles/claude-opus-4-8-analysis-and-benchmarks)
SWE-Bench Pro (hard variant): Anthropic vs Artificial Analysis data, May 2026
Terminal-Bench 2.1: Anthropic vs Artificial Analysis data, May 2026
SWE-Bench Verified: Anthropic, May 2026
OSWorld-Verified: Anthropic, May 2026
BrowseComp (single-agent & multi-agent): Anthropic, May 2026
GPQA Diamond: Anthropic vs Artificial Analysis data, May 2026
HLE (with/without tools): Anthropic, May 2026
GDPval-AA (Elo): Artificial Analysis via Anthropic, May 2026
Code Safety & Alignment: Anthropic alignment team, May 2026
Box AI enterprise test: Box blog, May 2026 (https://blog.box.com/anthropics-opus-48-advances-enterprise-content-use-cases)
Harvey AI Legal Agent Benchmark: Harvey AI, May 2026
CursorBench: Michael Truell (Cursor co-founder), May 2026

Where They’re Right

The AAII crown is legitimate. Opus 4.8 scores 61.4 — the first model above 60 — with a +1.2 point gap over GPT-5.5 (60.2) and +4.1 over its predecessor Opus 4.7 (57.3). The index aggregates multiple capabilities, so this indicates broad progress.

Coding benchmarks show the largest leap. On SWE-Bench Pro (hard variant), Opus 4.8 reaches 69.2% — a 10.6 point lead over GPT-5.5 (58.6%) and 15 points over Gemini 3.1 Pro (54.2%). The gain over Opus 4.7 is a solid +4.9 points. On OSWorld-Verified (UI-driven VM tasks), it scores 83.4% vs GPT-5.5’s 78.7%, a 4.7 point lead.

Economically valuable work (GDPval-AA) shows a massive efficiency improvement. Opus 4.8 hits 1,890 Elo, outstripping GPT-5.5 by 121 points and Opus 4.7 by 137. Anthropic claims this is achieved with 15% fewer turns and 35% fewer output tokens than Opus 4.7, though it still uses ~30% more turns than GPT-5.5. The efficiency-curve trade-off matters for latency-sensitive deployments.
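The two turn-count figures above can be chained to see where GPT-5.5 sits relative to Opus 4.7. The following is a minimal sketch, assuming both percentages describe the same GDPval-AA task set and that "~30% more" is exact; the normalisation to Opus 4.7 = 1.0 and all variable names are illustrative, not from Anthropic.

```python
# Relative turn counts on GDPval-AA, normalised so that Opus 4.7 = 1.0.
# Assumption: the 15% and ~30% figures refer to the same set of tasks.
OPUS_47_TURNS = 1.0

# "15% fewer turns than Opus 4.7"
opus_48_turns = OPUS_47_TURNS * (1 - 0.15)

# "~30% more turns than GPT-5.5" means opus_48 = gpt_55 * 1.30
gpt_55_turns = opus_48_turns / (1 + 0.30)


def pct_fewer(baseline: float, value: float) -> float:
    """Percentage by which `value` is lower than `baseline`."""
    return (baseline - value) / baseline * 100


print(f"Opus 4.8 turns (Opus 4.7 = 1.0): {opus_48_turns:.3f}")
print(f"GPT-5.5 turns  (Opus 4.7 = 1.0): {gpt_55_turns:.3f}")
print(f"GPT-5.5 vs Opus 4.7: {pct_fewer(OPUS_47_TURNS, gpt_55_turns):.1f}% fewer turns")
```

Under these assumptions GPT-5.5 would need roughly a third fewer turns than Opus 4.7, which is why the 15% improvement still leaves Opus 4.8 behind on turn count.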

Code safety improvements are meaningful. Anthropic’s alignment team reports 4x fewer unflagged code flaws, a 10x reduction in overconfidence, and 0% on uncritically reporting flawed results — a first for Claude. These are internal metrics but align with the model’s observable behavior in enterprise tests (Box AI report drafting hit 87% vs 77% on Opus 4.7; Harvey AI’s highest Legal Agent Benchmark score).

Where It Gets Murky

Two key benchmarks still elude Opus 4.8. Terminal-Bench 2.1 (multi-tool CLI workflows) is GPT-5.5’s territory at 78.2% versus Opus 4.8’s 74.6%. The gap narrowed from 12.1 points (Opus 4.7) to 3.6 points, but GPT-5.5 remains the leader here. BrowseComp single-agent is closer still: GPT-5.5 (84.4%) and Gemini 3.1 Pro (85.9%) both beat Opus 4.8 (84.3%). The 0.1-point gap to GPT-5.5 is within noise, but the ordering still matters: Opus 4.8 trails both rivals here rather than leading them.

GPQA Diamond is a dead heat. Opus 4.8, Opus 4.7, and GPT-5.5 all sit at 93.6-94.2%. Opus 4.8 actually dips slightly from 4.7 (94.2% to 93.6%). This suggests the progress is not uniform across reasoning domains — or that GPQA Diamond is saturating. Three models within 0.6 points is not a leadership story.
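One rough way to check whether a 0.6-point spread means anything is to compare it with the sampling error of a single accuracy score. The sketch below uses the binomial standard error and assumes a question count of 198 for GPQA Diamond. That figure is not stated in this article, and the calculation treats each score as one independent pass over the set, which may not match how the scores were actually produced.

```python
import math

# Assumption: number of questions in GPQA Diamond (not given in the article).
N_QUESTIONS = 198


def standard_error_pts(accuracy_pct: float, n: int) -> float:
    """Binomial standard error of an accuracy score, in percentage points."""
    p = accuracy_pct / 100
    return math.sqrt(p * (1 - p) / n) * 100


se = standard_error_pts(93.6, N_QUESTIONS)
gap = 94.2 - 93.6  # Opus 4.7 minus Opus 4.8 / GPT-5.5

print(f"Standard error at 93.6%: {se:.2f} points")
print(f"Observed gap: {gap:.1f} points")
print("Gap is within one standard error" if gap < se else "Gap exceeds one standard error")
```

With these assumptions the standard error comes out at roughly 1.7 points, so the 0.6-point spread is well inside one standard error and consistent with the saturation reading.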

Pricing unchanged, but fast mode cheaper, with caveats. Opus 4.8 costs the same $5/$25 per million tokens as Opus 4.7. Fast mode is 3x cheaper, but standard-mode pricing didn’t drop: the discount applies to fast mode only, and standard output tokens remain expensive. Meanwhile, GPT-5.5’s pricing is undisclosed, making TCO comparisons impossible.
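Unchanged list prices do not mean unchanged cost per task if the model emits fewer tokens. The sketch below applies the 35% output-token reduction reported for GDPval-AA to a hypothetical workload. The token counts are invented for illustration, input tokens are assumed constant, and reading $5/$25 as input/output prices is an assumption based on common pricing convention.

```python
# List prices per million tokens, from the article.
# Assumption: $5 is the input price and $25 is the output price.
INPUT_PRICE_PER_M = 5.0
OUTPUT_PRICE_PER_M = 25.0


def cost_per_task(input_tokens: float, output_tokens: float) -> float:
    """Dollar cost of one task at standard-mode list prices."""
    return (input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000


# Hypothetical workload: 20k input tokens and 4k output tokens per task on Opus 4.7.
HYPOTHETICAL_INPUT = 20_000
HYPOTHETICAL_OUTPUT = 4_000

opus_47_cost = cost_per_task(HYPOTHETICAL_INPUT, HYPOTHETICAL_OUTPUT)
# Apply the reported 35% reduction in output tokens; input tokens held constant.
opus_48_cost = cost_per_task(HYPOTHETICAL_INPUT, HYPOTHETICAL_OUTPUT * (1 - 0.35))

saving_pct = (opus_47_cost - opus_48_cost) / opus_47_cost * 100
print(f"Opus 4.7: ${opus_47_cost:.3f} per task")
print(f"Opus 4.8: ${opus_48_cost:.3f} per task")
print(f"Saving: {saving_pct:.1f}%")
```

The saving depends heavily on the input-to-output ratio, so an output-heavy workload would benefit more than this example and an input-heavy one less.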

Enterprise data is sparse and vendor-aligned. Box’s tests are a single partner’s case studies, not independent audits. Harvey’s “highest ever” score lacks absolute numbers. Cursor’s claim of “exceeds prior Opus on CursorBench across all effort levels” is not accompanied by comparative data against GPT-5.5 or Gemini. The strongest numbers come from Anthropic’s own lab.

The efficiency trade-off is real. While Opus 4.8 uses fewer turns and tokens than Opus 4.7 on GDPval-AA, it still consumes ~30% more turns than GPT-5.5. For real-time interactive coding, this could offset the benchmark advantage. Anthropic didn’t disclose latency or cost-per-task comparisons.

Comparison Table

Benchmark	Opus 4.8	Opus 4.7	GPT-5.5	Gemini 3.1 Pro
AAII Index	61.4	57.3	60.2	—
SWE-Bench Pro (hard)	69.2%	64.3%	58.6%	54.2%
Terminal-Bench 2.1	74.6%	66.1%	78.2%	70.3%
GPQA Diamond	93.6%	94.2%	93.6%	—
HLE (with tools)	57.9%	54.7%	52.2%	—
GDPval-AA (Elo)	1,890	1,753	1,769	—
Pricing (per M tokens)	$5/$25	$5/$25	undisclosed	undisclosed

Dashes indicate that no comparable data point was available.
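The margins quoted throughout can be recomputed directly from the table. This is a small sketch that transcribes the rows above (omitting the dashed cells and the pricing row) and reports, for each benchmark, the leader and its margin over the runner-up; the function and variable names are illustrative.

```python
# Scores transcribed from the comparison table; cells shown as dashes are omitted.
scores = {
    "AAII": {"Opus 4.8": 61.4, "Opus 4.7": 57.3, "GPT-5.5": 60.2},
    "SWE-Bench Pro (hard)": {"Opus 4.8": 69.2, "Opus 4.7": 64.3, "GPT-5.5": 58.6, "Gemini 3.1 Pro": 54.2},
    "Terminal-Bench 2.1": {"Opus 4.8": 74.6, "Opus 4.7": 66.1, "GPT-5.5": 78.2, "Gemini 3.1 Pro": 70.3},
    "GPQA Diamond": {"Opus 4.8": 93.6, "Opus 4.7": 94.2, "GPT-5.5": 93.6},
    "HLE (with tools)": {"Opus 4.8": 57.9, "Opus 4.7": 54.7, "GPT-5.5": 52.2},
    "GDPval-AA (Elo)": {"Opus 4.8": 1890, "Opus 4.7": 1753, "GPT-5.5": 1769},
}


def leader_and_margin(row: dict) -> tuple:
    """Return (leading model, margin over the runner-up) for one benchmark row."""
    ranked = sorted(row.items(), key=lambda item: item[1], reverse=True)
    (top_model, top_score), (_, runner_up_score) = ranked[0], ranked[1]
    return top_model, round(top_score - runner_up_score, 1)


summary = {benchmark: leader_and_margin(row) for benchmark, row in scores.items()}

for benchmark, (model, margin) in summary.items():
    print(f"{benchmark}: {model} leads by {margin}")
```

Note that the runner-up is sometimes Opus 4.7 rather than a competitor, so the margin here can differ from the Opus 4.8 vs GPT-5.5 gaps quoted in the text.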

Verdict

Opus 4.8 leads on overall intelligence (AAII), software engineering (SWE-Bench Pro), and economically valuable work (GDPval-AA), but GPT-5.5 keeps a clear edge in complex tool orchestration (Terminal-Bench 2.1) and a nominal 0.1-point edge in web browsing (BrowseComp single-agent). The improvements are real, concentrated, and come with unchanged pricing, but the lead is narrower than the headline number suggests.

Recommendation

If you need best-in-class code generation, complex reasoning, or enterprise agent workflows that benefit from high efficiency per task, choose Opus 4.8. If your use case demands advanced CLI automation or web-based multi-tool orchestration, GPT-5.5 remains the safer bet until comparative latency and cost-per-task data emerge.