2026-04-17 | 6 min read | koscak.ai editorial

Two Red Bars. Read the chart end to end.

Claude Opus 4.7 launched on 2026-04-16 to near-universal "god-mode" verdicts. Anthropic's own benchmark chart shows two red bars - regressions against 4.6. One of them is the most load-bearing capability in modern offensive security. Here is what those two bars tell you, and what every security team should do before running an N+1 model in production.

Opus 4.7 ships with six improvements and two regressions against 4.6. The gated Mythos Preview sits +10.0 above public 4.7 on CyberGym.

TL;DR

Anthropic's own chart shows Opus 4.7 regressing against 4.6 on two axes: cybersecurity vulnerability reproduction (CyberGym: 73.8 to 73.1) and agentic search (BrowseComp: 83.7 to 79.3).
Mythos Preview, Anthropic's Glasswing defensive-cyber model, sits at 83.1% CyberGym - ten full points above public 4.7 - and is invitation-only.
Public Opus gets a mild cyber haircut; cyber-capable Opus lives behind a separate, gated tier. This is a business decision, not a training accident.
The rest of the chart is genuinely better: +10.9% SWE-bench Pro, +13% CharXiv, +5.3% OSWorld, +2.9% GPQA. 4.7 is the right default for coding, vision, reasoning, and computer use.
Verdict: default to 4.7 for everything except offensive-security work. Keep 4.6 available for bug-bounty and recon until Glasswing is accessible or the regression closes.

Why "better than 4.6" is a rigged question

Every model launch has the same story shape: the vendor ships N+1 and shows a bar chart where every bar is taller than N's. Buyers read "N+1 crushes N" and upgrade.

This framing rests on a silent assumption: that the N you are comparing against is the same N you were using last week. That is not always true. Providers can adjust the live behaviour of an already-deployed model through system prompts, tool-use pipelines, safety filters, routing, or weight swaps behind the same model ID. API users see whatever the provider serves today, not necessarily the weights that shipped at announcement.

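If the live version of N degrades in the weeks before N+1 launches, every side-by-side you run at launch compares N+1 against a weakened N. You end up measuring tier differentiation - how the provider has positioned the two models - not how much capability actually improved.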

This is not a conspiracy claim. It is a structural fact about how hosted models are distributed. The only reliable control for "did N+1 actually improve" is an output sample of N taken before the launch window, pinned locally, and replayed.

Any team running AI in production should be doing this already. Few are.
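
A minimal sketch of that control, assuming a hypothetical `call_model` callable that wraps whatever client your team already uses; the function names, record fields, and file layout here are illustrative, not part of any vendor SDK. It pins outputs to a local JSONL file and later replays the same prompts, scoring each new output against the pinned one with a plain text-similarity ratio. Text similarity is a crude proxy; a task-specific grader would be a better judge where you have one.

```python
import difflib
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable, Iterable


def pin_baseline(prompts: Iterable[str], call_model: Callable[[str], str],
                 model_id: str, path: Path) -> int:
    """Write one JSON line per prompt: hash, prompt, model ID, timestamp, output."""
    count = 0
    with path.open("w", encoding="utf-8") as fh:
        for prompt in prompts:
            record = {
                "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
                "prompt": prompt,
                "model_id": model_id,  # whatever ID you requested (assumption)
                "captured_at": datetime.now(timezone.utc).isoformat(),
                "output": call_model(prompt),
            }
            fh.write(json.dumps(record) + "\n")
            count += 1
    return count


def replay(path: Path, call_model: Callable[[str], str]) -> list[dict]:
    """Re-run every pinned prompt and score similarity to the pinned output."""
    results = []
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            record = json.loads(line)
            new_output = call_model(record["prompt"])
            ratio = difflib.SequenceMatcher(None, record["output"], new_output).ratio()
            results.append({
                "prompt_sha256": record["prompt_sha256"],
                "similarity": ratio,  # 1.0 means the output is byte-identical
            })
    return results
```

Run `pin_baseline` once before a launch window and keep the file under version control; run `replay` against the same model ID after the launch to see whether the live model has drifted from what you recorded.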

The two red bars

Anthropic's Opus 4.7 announcement chart is the one to spend a minute on. Eleven benchmarks. Two of them are red.

Regressions against 4.6:

CyberGym (cybersecurity vulnerability reproduction): 73.8% to 73.1% (-0.7 pts)
BrowseComp (agentic search): 83.7% to 79.3% (-4.4 pts)

The CyberGym drop is small on its own. The pattern is what matters. Anthropic's system card lists a separate model - Mythos Preview, part of the Glasswing initiative - at 83.1% CyberGym, ten full points above public 4.7. Mythos is invitation-only and targeted at defensive-cyber customers.
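
The arithmetic behind those deltas is easy to check. The short sketch below uses only the scores quoted in this post; the dictionary keys are illustrative labels, not official model IDs, and nothing here is measured locally.

```python
# Scores as quoted in this post (percent); nothing here is measured locally.
SCORES = {
    "CyberGym": {"opus_4_6": 73.8, "opus_4_7": 73.1, "mythos_preview": 83.1},
    "BrowseComp": {"opus_4_6": 83.7, "opus_4_7": 79.3},
}


def delta_pts(benchmark: str, new: str, old: str) -> float:
    """Percentage-point difference (new minus old), rounded to one decimal."""
    row = SCORES[benchmark]
    return round(row[new] - row[old], 1)


if __name__ == "__main__":
    print("CyberGym 4.6 -> 4.7:", delta_pts("CyberGym", "opus_4_7", "opus_4_6"))
    print("BrowseComp 4.6 -> 4.7:", delta_pts("BrowseComp", "opus_4_7", "opus_4_6"))
    print("CyberGym 4.7 -> Mythos:", delta_pts("CyberGym", "mythos_preview", "opus_4_7"))
```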

Read the positioning: public Opus takes a mild cyber haircut; cyber-capable Opus lives behind a separate, gated tier. This is neither surprising nor bad. Enterprise-security customers have been the first monetized tier of frontier model releases since 2023, and a 4.7-class cyber model in unrestricted public hands is a regulatory and liability problem.

Still, if cyber capability is load-bearing for your work - and for every offensive-security team, it is - the chart is telling you where to source it.

The agentic-search drop

The BrowseComp regression (-4.4 pts) is larger but gets less attention because agentic search is less visible. It matters for two classes of work:

OSINT and recon during engagements - any task that involves a model taking multi-step browsing actions against a real target.
Autonomous agents that search before acting - the standard pattern for issue-triage, research assistants, and bug-bounty pipelines that scan advisories before drafting.

If your agents rely on browse-then-act chains, re-measure them on 4.7 before flipping the default. The drop is small enough that general-purpose workloads will not notice; it is large enough that agent pipelines tuned on 4.6 can regress perceptibly.
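
One way to re-measure, sketched under assumptions: you already have per-task pass/fail results for the same task set on both models, recorded by your own harness. The `compare_runs` helper, its field names, and the 2-point default tolerance are hypothetical choices for illustration, not a recommended threshold.

```python
from dataclasses import dataclass


@dataclass
class Comparison:
    baseline_rate: float      # percent of shared tasks passed on the baseline model
    candidate_rate: float     # percent of shared tasks passed on the candidate model
    delta_pts: float          # candidate minus baseline, in percentage points
    newly_failing: list[str]  # tasks that passed on baseline but fail on candidate
    regressed: bool           # True when the drop exceeds the tolerance


def compare_runs(baseline: dict[str, bool], candidate: dict[str, bool],
                 tolerance_pts: float = 2.0) -> Comparison:
    """Compare per-task pass/fail results recorded for two models."""
    # Only tasks present in both runs are comparable.
    shared = sorted(baseline.keys() & candidate.keys())
    if not shared:
        raise ValueError("no tasks in common between the two runs")
    base_rate = 100.0 * sum(baseline[t] for t in shared) / len(shared)
    cand_rate = 100.0 * sum(candidate[t] for t in shared) / len(shared)
    delta = cand_rate - base_rate
    newly_failing = [t for t in shared if baseline[t] and not candidate[t]]
    return Comparison(base_rate, cand_rate, delta, newly_failing,
                      delta < -tolerance_pts)
```

The `newly_failing` list is usually more useful than the headline delta: it tells you which browse-then-act chains to inspect before flipping the default.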

What to actually do

Three things, in order.

Pin a historical baseline. Take a representative sample of your production prompts and run them against live 4.6 now, while the API still serves it. Store the outputs alongside the prompts, the model ID, and the capture date. This becomes your reference point for the next two launches, not just this one.
Route by task class, not by model preference. Default to 4.7 for coding, vision, reasoning, and computer use. Keep 4.6 on tap for offensive-security and agentic-search work until Glasswing is generally available or the regression closes. The Anthropic API supports per-request model selection; use it.
Stop trusting vendor benchmarks in isolation. Every frontier lab publishes charts where every bar goes up. Anthropic deserves credit for publishing two that went down. That is rare. Read the chart end to end before you read the press release.
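
Routing by task class can be as small as a lookup table. The sketch below is illustrative only: the task-class names are assumptions about how you label work, and the model ID strings are placeholders, not real identifiers - substitute the IDs your provider documents.

```python
# Placeholder model IDs (assumptions, not real identifiers).
MODEL_4_7 = "opus-4.7-placeholder"
MODEL_4_6 = "opus-4.6-placeholder"

DEFAULT_MODEL = MODEL_4_7

# Only the task classes that should NOT use the default need an entry.
ROUTES = {
    "offensive_security": MODEL_4_6,
    "agentic_search": MODEL_4_6,
}


def pick_model(task_class: str) -> str:
    """Return the model ID to send with a request for this task class."""
    return ROUTES.get(task_class, DEFAULT_MODEL)


if __name__ == "__main__":
    for task in ("coding", "vision", "offensive_security", "agentic_search"):
        print(task, "->", pick_model(task))
```

Keeping the exceptions in one table means closing the regression later is a one-line change: delete the entry and the task class falls back to the default.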

The uncomfortable implication

If Anthropic lists a gated cyber tier on its own system card, it is reasonable to assume other frontier vendors are doing something similar and not publishing it. When a buyer compares OpenAI, Anthropic, and Google on the same task today, they are comparing three tier-differentiated products, not three underlying model capabilities. The honest benchmark for frontier capability is the invitation-only tier at each lab, not the public API.

For most teams, this does not change daily work. For any team building on top of frontier models to do security, legal, or research work where the gap between public and enterprise tiers matters, it changes which questions get asked at procurement.

Read your vendor's chart. Read it end to end. And keep a copy of today's output before tomorrow's launch.

Update log

2026-04-17: Initial publication.

Sources + verification

Benchmark numbers are taken directly from Anthropic's own published announcement and system card. No numbers in this post are reproduced locally; if they were, we would publish the harness alongside. Any team that wants to reproduce a model evaluation is encouraged to do the same.