Claude Opus 4.8 Just Launched. Here's What Actually Changed.

Anthropic dropped Opus 4.8 yesterday — same price, sharper judgment, 4× less likely to hide buggy code. Plus $65B raised, Mythos coming for everyone, and dynamic workflows that run hundreds of agents at once.

By Aditya Kumar Jha · May 29, 2026 · 12 min read · AI Models

⚡ Published May 29, 2026 — tested and reported by Aditya Kumar Jha. Every figure sourced from the Opus 4.8 system card, Anthropic's official announcement, and verified press coverage published in the last 24 hours. Key facts: Claude Opus 4.8 launched May 28, 2026 — 41 days after Opus 4.7, Anthropic's fastest upgrade cycle on record. Pricing unchanged from Opus 4.7: $5 per million input tokens, $25 per million output tokens, as of May 29, 2026. Fast mode: 2.5× speed, $10/$50 per million tokens — three times cheaper than fast mode on previous Opus models. SWE-Bench Pro: 69.2% (Opus 4.8) vs 64.3% (Opus 4.7) vs 58.6% (GPT-5.5). Terminal-Bench 2.1: 74.6% (Opus 4.8) vs 78.2% (GPT-5.5) via Terminus-2 — GPT-5.5 still leads this benchmark. GDPval-AA knowledge work: 1890 Elo (Opus 4.8) vs 1769 (GPT-5.5) vs 1753 (Opus 4.7). USAMO 2026 math: 96.7% vs 69.3% — the single biggest jump in the entire 4.7-to-4.8 delta. GraphWalks 1M long-context: 68.1% vs 40.3% — a 69% improvement. System card: Opus 4.8 scored 0% on 'uncritically reporting flawed results' — first Claude model to hit zero on that measure. $65 billion Series H at $965 billion valuation. Revenue run rate: $47 billion annualised. Claude Mythos coming to all customers 'in the coming weeks.'

Claude Opus 4.7 launched six weeks ago. Developers who ran it on extended coding tasks hit the same wall: the model would finish a task, declare success, and move on — even when the code had problems it could have caught. Confident. Wrong. Silent about it. Anthropic shipped Opus 4.8 yesterday as a direct fix. The system card backs the claim: for the first time, a Claude model scored 0% on 'uncritically reporting flawed results' in Anthropic's alignment evaluation.

The launch came alongside two announcements that change the Anthropic picture entirely. First: a $65 billion Series H at a $965 billion valuation — Anthropic is now the most valuable AI company on the planet, ahead of OpenAI's $852 billion March figure. Second: confirmation that Claude Mythos, the model that found over 10,000 critical software vulnerabilities in one month through Project Glasswing, is coming to all customers in the coming weeks. Opus 4.8 is what you can use right now. Mythos is what Anthropic is preparing the world to handle.

Here is what most reviews will not tell you: Opus 4.8 is not competing with GPT-5.5. It is the proving ground for Mythos. Every honesty and alignment improvement in Opus 4.8 is Anthropic stress-testing the safety features it needs before releasing a model that can autonomously find and exploit zero-day vulnerabilities. You are not evaluating a product launch. You are watching a rehearsal.

What Actually Changed — Starting With the Number Nobody Is Reporting

The Honesty Story Is Bigger Than '4×'

Anthropic's press release said Opus 4.8 is 'around four times less likely' to let code flaws pass unremarked. The system card shows something larger. Opus 4.8 scored 0% on uncritically reporting flawed results — the first Claude model to reach zero on that specific measure. The alignment data also shows a more than ten-fold reduction in overconfidence versus Opus 4.7. The '4×' figure applies to code flagging specifically. The full honesty picture is a ten-fold drop in the behavior that makes AI agents most dangerous in production: confidently claiming progress that does not exist.

For anyone running Claude autonomously — multi-file refactors, overnight builds, financial analysis pipelines — this is the most significant practical change in the release. The previous failure was not just wrong code. It was wrong code with a confident explanation of why it was right, which prevented the human from investigating further. That behavior is now ten times less common.

All the Benchmark Scores — Including the Two Regressions

SWE-Bench Pro: 69.2%, up from 64.3%, with GPT-5.5 at 58.6%. A 10.6-point lead on the benchmark that tests real open-source repository resolution. GDPval-AA: 1890 Elo, up from 1753, clearing GPT-5.5 (1769) by 121 points. USAMO 2026 math: 96.7%, up from 69.3% — a 27-point gain and the single biggest delta in the entire release. GraphWalks at 1M tokens: 68.1%, up from 40.3%, a 69% improvement in long-context reasoning.

Terminal-Bench 2.1: 74.6%, up from 66.1%, but GPT-5.5 leads at 78.2% on the same Terminus-2 public harness — and reaches 83.4% on its native Codex CLI. For developers in terminal-heavy workflows, that gap is real. On GPQA Diamond, Opus 4.8 slipped to 93.6% from 94.2% — within trial variance on a near-saturated benchmark, but still a regression.

One result the launch coverage has almost entirely skipped: Vending-Bench 2, which simulates running a vending machine business over a full year, shows Opus 4.8 finishing with roughly $3,000–$5,800 versus Opus 4.7's $8,000–$11,000. DataCamp's review called it 'a bad result.' Anthropic did not highlight it. If long-horizon business simulation or planning is central to your use case, test that specific workload before committing to 4.8.

Benchmark	Opus 4.7	Opus 4.8	GPT-5.5
SWE-Bench Pro (agentic coding)	64.3%	69.2% ✅ leads	58.6%
Terminal-Bench 2.1 (Terminus-2 harness)	66.1%	74.6%	78.2% ✅ leads (83.4% Codex CLI)
OSWorld-Verified (computer use)	82.8%	83.4% ✅ leads	78.7%
MCP-Atlas (tool use)	77.3%	82.2% ✅ leads	75.3%
GDPval-AA (knowledge work, Elo)	1753	1890 ✅ leads (+137)	1769
USAMO 2026 (math)	69.3%	96.7% ✅ leads (+27.4pts)	not reported
GraphWalks 1M (long-context)	40.3%	68.1% ✅ leads (+69%)	45.4%
GPQA Diamond	94.2% ✅ leads	93.6% ⚠️ slight regression	not reported
Pricing (input / output per 1M tokens)	$5 / $25	$5 / $25 (same) ✅	$15 / $60

Dynamic Workflows: What It Lets You Do Today

Dynamic Workflows, available in Research Preview in Claude Code, lets Claude generate a plan, spawn hundreds of parallel subagents to execute it simultaneously, and verify all outputs before reporting back. Anthropic's example: codebase-scale migrations across hundreds of thousands of lines of code, from kickoff to merge, with the existing test suite as the acceptance bar.

A migration that previously required multiple sequential Claude Code sessions — each picking up context from the last — can now run as a single orchestrated task. Research Preview means expect rough edges. The capability is functional. Anthropic is gathering production data before GA.

Effort control also shipped alongside: Standard for quick tasks, Extra (xhigh in Claude Code) for hard problems, Max for the longest autonomous runs. Rate limits in Claude Code were raised to accommodate the higher token usage. DataCamp's review surfaced a useful finding from the system card: at minimum effort, Opus 4.8 already matches Opus 4.7 at maximum effort on SWE-Bench Pro. Standard-tier Opus 4.8 often beats Max-tier Opus 4.7 on cost per correct result.

Before committing to Max effort across your pipeline: run your three most token-heavy tasks at Standard, then compare output quality and token count against your Opus 4.7 logs. The Contra Collective review found Opus 4.8 resolves the same tasks while spending fewer reasoning tokens — meaning the effective cost per correct result often drops even though list pricing is identical. That shows up in your actual usage data, not in benchmark tables.

The $65 Billion Round and What $965 Billion Reflects

Anthropic raised $65 billion in a Series H round led by Altimeter Capital, Dragoneer, Greenoaks, and Sequoia Capital, with Capital Group, Coatue, D1 Capital Partners, GIC, ICONIQ, and XN as co-leads. The round included $15 billion of previously committed hyperscaler investment, including $5 billion from Amazon. Post-money valuation: $965 billion. Revenue run rate grew from $1 billion at the end of 2024 to $47 billion annualised as of May 2026. Nearly tripling in three months. That growth comes from enterprise adoption.

The valuation jump from $380 billion in February to $965 billion now has two drivers. The revenue figure commands frontier tech multiples. The second is Mythos. Project Glasswing results published May 26, 2026 showed Claude Mythos Preview finding 10,000-plus critical vulnerabilities in one month. Mozilla found 271 bugs in Firefox 150 using Mythos versus 27 in Firefox 148 using Opus 4.6 — a ten-fold increase on the same codebase. Apple, Google, Microsoft, AWS, Cloudflare, NVIDIA, CrowdStrike, and JPMorganChase are using it under restricted access. Investors are pricing in a future where Mythos-class models become the infrastructure layer of enterprise software security.

The sentence worth screenshotting: Mozilla found 271 vulnerabilities in Firefox 150 using Claude Mythos Preview — more than ten times the 27 it found in Firefox 148 using Claude Opus 4.6. Same codebase family. Different model. That is production results on one of the most widely deployed browsers in the world, not a synthetic benchmark. It is why Anthropic is worth $965 billion and why Mythos is not publicly available yet.

Claude Mythos: Why It Is Not Out Yet and the Path to General Access

Mythos Preview has been available to roughly 50 Project Glasswing partners since April 2026. The capability that explains the caution: Mythos can autonomously find zero-day vulnerabilities and build working exploits. Releasing it without safeguards strong enough to prevent misuse would be the same as publishing a live exploit database with no access controls — Anthropic's own framing on the Glasswing page.

Anthropic's stated plan: test Mythos-class safety features on an upcoming Opus model first, then deploy at Mythos. Opus 4.8 is that testbed. The path to Mythos for all users runs through one more Opus iteration — likely 4.9 or 5.0. If the six-week cadence holds, that puts the next Opus in early July. Mythos for general access would follow after safety validation. No date has been announced.

Should You Switch to Opus 4.8 Today?

Coming from Opus 4.7: upgrade immediately. Pricing identical. Every benchmark improved except GPQA Diamond (down 0.6 points, within noise) and Vending-Bench 2 (down significantly — check if long-horizon planning is your primary workload). The honesty improvements justify the switch for any autonomous coding use case.

Choosing between Opus 4.8 and GPT-5.5: Opus 4.8 leads on SWE-Bench Pro (+10.6 points), GDPval-AA (+121 Elo), OSWorld computer use (+4.7 points), MCP-Atlas tool use (+6.9 points), and USAMO math (+27 points). GPT-5.5 leads Terminal-Bench 2.1 (78.2% vs 74.6% on Terminus-2). GPT-5.5 also costs three times more per output token. For most agentic and knowledge work, Opus 4.8 has the stronger profile at a significantly lower price.

Evaluating Gemini 3.1 Pro: its one advantage is context length for massive-document research. On agentic coding and reasoning, Opus 4.8 leads. At high volume where token cost dominates, Gemini 3.5 Flash is the better comparison — not Gemini 3.1 Pro. The emerging production pattern is Gemini 3.5 Flash for parallel workers, Opus 4.8 as orchestrator.

Pricing, API Access, and Quick-Start Guide

Model ID: claude-opus-4-8, with a 1M-token context variant at claude-opus-4-8[1m]. Available immediately on the Anthropic API, claude.ai, Claude Code, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Azure.
Standard pricing as of May 29, 2026: $5 per million input tokens, $25 per million output tokens — identical to Opus 4.7. GPT-5.5 costs $15/$60 per million tokens.
Fast mode: $10 input / $50 output per million tokens at 2.5× speed. Three times cheaper than fast mode on previous Opus models. Claude Code access via /fast; API access via waitlist at claude.com/fast-mode.
Effort levels: Standard (default, uses high effort), Extra (xhigh in Claude Code), Max. Rate limits in Claude Code raised to support Extra and Max usage.
Dynamic Workflows: Research Preview in Claude Code — hundreds of parallel subagents for codebase-scale tasks. Not general availability yet.
Claude is now the first frontier model available simultaneously on all three major cloud hyperscalers: AWS Bedrock, Google Cloud Vertex AI, and Microsoft Azure.

Five Questions People Are Searching Right Now — Answered

Is Claude Opus 4.8 better than GPT-5.5?

On most benchmarks, yes: SWE-Bench Pro (69.2% vs 58.6%), GDPval-AA knowledge work (1890 vs 1769 Elo), OSWorld computer use (83.4% vs 78.7%), MCP-Atlas tool use (82.2% vs 75.3%). GPT-5.5 leads Terminal-Bench 2.1 (78.2% vs 74.6% on Terminus-2). Opus 4.8 also costs three times less per output token ($25 vs $60 per million).

What is the Claude Opus 4.8 price?

As of May 29, 2026: $5 per million input tokens, $25 per million output tokens — unchanged from Opus 4.7. Fast mode runs at 2.5× speed for $10/$50 per million tokens, three times cheaper than fast mode on previous Opus models.

What is Claude Opus 4.8 fast mode?

Fast mode runs Opus 4.8 at approximately 2.5× standard speed, priced at $10 input / $50 output per million tokens. Available in Claude Code via /fast now; API access requires a waitlist at claude.com/fast-mode.

What is Claude Opus 4.8 dynamic workflows?

Dynamic Workflows (Research Preview in Claude Code) lets Claude spawn hundreds of parallel subagents on a single large task — codebase migrations, large refactors — then verify all outputs before returning results. Not yet general availability.

When is Claude Mythos releasing to the public?

Anthropic said 'coming weeks' on May 28, 2026. No precise date. The plan is to validate Mythos-class safety features on a future Opus model first — likely Opus 4.9 or 5.0 — then open Mythos to all customers. Mythos can autonomously find and exploit zero-day vulnerabilities, which is why it is still under restricted access via Project Glasswing.

Insight