Claude for Cybersecurity: Opus 4.6 Still Leads 4.7 and 4.8 on Defense

Igor Kozlov

June 8, 20263 min readLLM

Claude for Cybersecurity: Opus 4.6 leads 4.7 and 4.8 on the Cyber Defense Benchmark leaderboard

The short answer: On Simbian's Cyber Defense Benchmark — 1,206 real attack-log investigations across 13 of 14 MITRE ATT&CK tactics — the strongest Claude on defense remains Opus 4.6 at 44.5% coverage. The newer Opus 4.7 (30.9%) and Opus 4.8 (32.7%) are tuned for faster, cheaper completion than for investigation depth. Sonnet 4.6 sits in between at 36.9%. Anthropic still owns the top four leaderboard slots, but no frontier LLM from any vendor clears the 50% threshold.

Anthropic released Opus 4.7 on April 16, 2026 as the first Claude with the Cyber Verification Program and production cybersecurity safeguards, followed by Opus 4.8. We ran every current Claude through the Cyber Defense Benchmark (arXiv 2604.19533) — a defense-oriented cybersecurity LLM gym with deterministic ground truth, covering 13 of 14 MITRE ATT&CK tactics and 105 procedures chained into real kill-chains. Which Claude should a SOC actually run today? Five months later, Opus 4.6 still holds the top spot.

▶ Watch the webinar — Why LLMs Fail in the SOC A 30-minute walkthrough of the Cyber Defense Benchmark: how every frontier model was scored, per-tactic breakdowns, and why no LLM clears 50% coverage on real attack logs.

Opus 4.6 leads. No model passes.

Fourteen frontier models, 1,206 investigations, 100K+ events per environment. Every model reasoned through SQL queries against real Windows attack telemetry, chained across 105 MITRE ATT&CK procedures. The passing threshold is 50% coverage. The leader is Opus 4.6 at 44.5%, short of the bar but the only model to pass 7 of 13 tactics. The top four slots all belong to Anthropic:

Rank	Model	Coverage	Median cost	Tactics passing
1	Opus 4.6	44.5%	$2.71	7 of 13
2	Sonnet 4.6	36.9%	$2.21	2 of 13
3	Opus 4.8	32.7%	$1.69	1 of 13
4	Opus 4.7	30.9%	$1.65	0 of 13

Every other frontier model (GPT 5.5, Gemini 3.5 Flash, GPT 5, DeepSeek 4 Pro, Kimi 2.6) sits below 27%. The category isn't saturated. It's failing.

Opus 4.7: tuned for speed and cost, not investigation depth

Opus 4.7 was the first Claude to ship with the Cyber Verification Program and automated detection of high-risk prompts — a meaningful safety posture for a frontier vendor. On the benchmark it lands at 30.9% coverage with 0 of 13 MITRE tactics passing. The mechanism is straightforward: 4.7 is tuned for fast, cheap completion. Its median investigation runs $1.65 in 243 seconds — the fastest and cheapest Claude on the leaderboard. Those are the right tradeoffs for coding and general agentic work, but on defense, coverage tracks how long the agent keeps pulling SQL against 100K+ events. The safeguards do what they were designed to do; they aren't a defensive-skill upgrade, and the benchmark doesn't measure safety posture.

Wall-clock investigation time vs coverage across frontier models, including Opus 4.8 — Cyber Defense Benchmark

Opus 4.8 follows the same curve

Opus 4.8 lifts coverage to 32.7% at $1.69 per investigation, with 1 of 13 MITRE tactics passing — roughly a two-point gain on 4.7 and the cheapest Claude on the leaderboard. The release character matches 4.7: optimized for fast, low-cost completion. On defensive work it still trails the five-month-older Opus 4.6 by 12 points of coverage. Across the 4.6 → 4.7 → 4.8 progression, headline benchmarks (coding, reasoning, agentic) have moved forward; defensive cybersecurity coverage on this benchmark has not. The most likely reason is the optimization target — investigation persistence isn't a metric recent post-training stacks are pointed at. We see the same trend across the other vendors, too, as compute resources are in high demand.

Sonnet 4.6 is the price-performance pick

Sonnet 4.6 at $2.21 per investigation outscores both Opus 4.7 (30.9%) and Opus 4.8 (32.7%) by a clean margin. For a SOC choosing among Claudes today, the picture is:

Coverage priority: Opus 4.6 at 44.5% — leads all Claudes, 23% cost premium over Sonnet.
Balanced pick: Sonnet 4.6 at 36.9% — best price-performance Claude.
Newer Opuses: 4.7 (30.9%) and 4.8 (32.7%) — cheaper than Sonnet, but on defense the lower cost reflects shorter investigation runs rather than more efficient reasoning.

Expected vs actual cost per investigation across frontier models, including Opus 4.8 — Cyber Defense Benchmark

Why does Claude top out at 44.5%?

The skills that win on SWE-bench and coding benchmarks do not transfer cleanly to defense. SOC reasoning needs hypothesis generation under heavy noise, kill-chain comprehension and investigation planning, and iterative SQL across 100K+ events per environment — none of which appear on offensive or general-purpose cybersecurity LLM evals. Defensive work also penalizes speed: Gemini 3 Flash, fastest model at 68 seconds, scores 14.6%. Investigation depth tracks investigation time. This is the load-bearing finding: the harness, the context, and the skills around the model can address this shortcomings and matter more than the model itself. Simbian's 95% performance in production setting independently verified by a global MSSP cross-validate it. Forty-nine points come from the substrate, not the model.

Cost vs coverage Pareto frontier across frontier models, including Opus 4.8 — Cyber Defense Benchmark

LLMs alone fail at defense. You need a harness.

The practical message from the leaderboard is not "pick a different Claude." It is: do not put a raw LLM in front of your alert queue. Every Claude tops out below 50% coverage, and the gap is structural — not the next-release problem people keep betting on. For SOC work today, Opus 4.6 is the strongest Claude; Claude Opus 4.7 and 4.8 are tuned for different jobs. But more fundamentally: no raw frontier LLM, from any vendor, belongs as the SOC. The model is the easiest part of the system to swap. The hard parts — context, retrieval, the investigation loop — are what actually move the score.

Why every frontier LLM fails on defense

Offense and defense are not symmetric problems.

Offense has a feedback signal. Did the exploit land? Did the shell open? Did I get root? The model knows when it has won. Every iteration refines toward that signal.
Defense has no such signal. Ground truth is invisible. The agent does not know how many malicious events exist, when it has found them all, or when normal enterprise noise has fooled it into calling reconnaissance benign.
Agents quit early. Frontier LLMs routinely stop investigating well before exhausting their query budget — they "believe" they are done. An agent that gives up is not a defender. It is a demo.

A bigger context window does not fix this. A more capable reasoning chain does not fix this. These are things frontier models already do well, and it was not enough. SOAR cannot fix it either — rule-based playbooks break on the first novel alert and require an engineer to rewrite them. Copilots cannot fix it — they wait for an analyst to ask the right question, and attackers do not respect shift changes.

How Simbian's AI SOC Agent gets to 95%

Simbian's AI SOC Agent runs on the same frontier LLMs that sit on the leaderboard above. The reasoning model is often a top-ranked Claude. What is different is the harness around it:

Context Lake™ grounds every investigation in your assets, identities, baseline behavior, prior incidents, analyst SOPs, and tribal knowledge. Generic models stop being generic the moment they touch your environment.
TrustedLLM™ is patented hallucination-resistant and prompt-injection hardened, adversarially trained for SOC reasoning rather than general chat.
Deterministic retrieval and agentic investigation loops are structured around the patterns analysts actually use — hypothesis generation, evidence chaining, kill-chain comprehension and investigation planning — not ReAct-for-its-own-sake. Every hypothesis is grounded in real telemetry, not guessed.
Federated reasoning across 100+ tools — SIEM, EDR, XDR, NDR, identity, cloud. No data migration, no vendor lock-in, no rebuilding playbooks every time a tool changes.
An assessment loop keeps the agent generating and evaluating hypotheses until the investigation is actually complete. No giving up at "good enough." No silent budget exhaustion.

The result in production: 92% of alerts auto-resolved, 95% verdict accuracy, 100% alert coverage, 24×7×365 with no shift gaps. Same model. Same data. Forty-nine points of coverage from the harness, not the LLM.

Raw LLM vs SOAR vs copilots vs Simbian

Approach	Coverage on defense	Fails on
Raw frontier LLM (best Claude)	44.5%	unknown ground truth, agent gives up early
SOAR playbooks	~25% automation	every novel alert, constant playbook maintenance
LLM copilots	analyst-paced	nights, weekends, anything an analyst forgot to ask
Simbian AI SOC Agent	95% in production	—

A model is a component. An agent is a product. If you are picking an LLM for your SOC, you are picking the wrong thing. Pick the harness.

→ Book a demo — see Simbian against your own telemetry

Frequently asked questions

Which Claude model is best for cybersecurity? Claude Opus 4.6 is the strongest Claude on the Cyber Defense Benchmark at 44.5% coverage with 7 of 13 MITRE tactics passing. Sonnet 4.6 is the best price-performance pick at 36.9% coverage and $2.21 per investigation. The newer Opus 4.7 (30.9%) and Opus 4.8 (32.7%) trail 4.6 — they're optimized for faster, cheaper completion, which trades against investigation runway on defensive work.

Does Opus 4.8 close the cybersecurity gap? Not yet. Opus 4.8 scores 32.7% with 1 of 13 MITRE tactics passing — roughly two points above 4.7, and still 12 points below the older Opus 4.6. It is the cheapest Claude per investigation ($1.69) and the second-fastest. Headline benchmarks have moved release-over-release; defensive cybersecurity coverage on this benchmark hasn't. For SOC reasoning on raw LLMs today, Opus 4.6 remains the strongest Claude.

Can any LLM pass the Cyber Defense Benchmark today? No. Across 14 frontier models from Anthropic, OpenAI, Google, and open-weight providers, none crossed the 50% coverage threshold. The leader sits at 44.5%. By contrast, frontier LLMs routinely score above 80% on offensive-security benchmarks. Defense is the harder problem.

Is Claude the best LLM for SOC work? Yes, by a clear margin. Every Anthropic model on the leaderboard outscores GPT 5.5 (26.4%), GPT 5 (17.4%), and every Gemini and open-weight model. The top four leaderboard slots — Opus 4.6, Sonnet 4.6, Opus 4.8, Opus 4.7 — are all Claude. The best LLM for SOC today is Opus 4.6.

What closes the gap between LLM cybersecurity scores and a real SOC? The harness around the model. On the same benchmark, the best frontier LLM alone scored 46%; the same model wrapped in Simbian's harness scored 95% in production environments, as it is independently verified by a global MSSPs. The 49-point lift comes from context, skills, and the agent loop, not the underlying LLM.

Share this article

Continue Reading

Security

LLM Security: GPT 5.5 vs Opus 4.7 and more

Which cybersecurity LLM should your SOC run? We tested GPT 5.5, Opus 4.7, and 10 others on real threat hunting. Performance, cost, speed compared.

Igor Kozlov

May 12, 2026

Security

The Cyber Defense Benchmark: Why Every Frontier LLM Failed

We ran Claude, GPT-5, Gemini, and 8 other frontier LLMs through 884 agentic threat-hunting runs on real attack telemetry. The headline result: zero passed.

Simbian Research Lab

April 28, 2026

Security

Do Not Trust Your SOC LLM

Cyber Defense Benchmark caught Opus 4.7, GPT 5.5, and Gemini 3.1 Pro reward-hacking, bypassing constraints, and agentsplaining. The LLM security risks every SOC must control.

Igor Kozlov

May 18, 2026

Sign up for Simbian's Newsletter

By submitting this form, you agree to our Privacy Policy.

Ask AI about Simbian

The short answer: On Simbian's Cyber Defense Benchmark — 1,206 real attack-log investigations across 13 of 14 MITRE ATT&CK tactics — the strongest Claude on defense remains Opus 4.6 at 44.5% coverage. The newer Opus 4.7 (30.9%) and Opus 4.8 (32.7%) are tuned for faster, cheaper completion than for investigation depth. Sonnet 4.6 sits in between at 36.9%. Anthropic still owns the top four leaderboard slots, but no frontier LLM from any vendor clears the 50% threshold.

▶ Watch the webinar — Why LLMs Fail in the SOC A 30-minute walkthrough of the Cyber Defense Benchmark: how every frontier model was scored, per-tactic breakdowns, and why no LLM clears 50% coverage on real attack logs.

Opus 4.6 leads. No model passes.

Rank	Model	Coverage	Median cost	Tactics passing
1	Opus 4.6	44.5%	$2.71	7 of 13
2	Sonnet 4.6	36.9%	$2.21	2 of 13
3	Opus 4.8	32.7%	$1.69	1 of 13
4	Opus 4.7	30.9%	$1.65	0 of 13

Every other frontier model (GPT 5.5, Gemini 3.5 Flash, GPT 5, DeepSeek 4 Pro, Kimi 2.6) sits below 27%. The category isn't saturated. It's failing.

Opus 4.7: tuned for speed and cost, not investigation depth

Wall-clock investigation time vs coverage across frontier models, including Opus 4.8 — Cyber Defense Benchmark

Opus 4.8 follows the same curve

Sonnet 4.6 is the price-performance pick

Sonnet 4.6 at $2.21 per investigation outscores both Opus 4.7 (30.9%) and Opus 4.8 (32.7%) by a clean margin. For a SOC choosing among Claudes today, the picture is:

Coverage priority: Opus 4.6 at 44.5% — leads all Claudes, 23% cost premium over Sonnet.
Balanced pick: Sonnet 4.6 at 36.9% — best price-performance Claude.
Newer Opuses: 4.7 (30.9%) and 4.8 (32.7%) — cheaper than Sonnet, but on defense the lower cost reflects shorter investigation runs rather than more efficient reasoning.

Expected vs actual cost per investigation across frontier models, including Opus 4.8 — Cyber Defense Benchmark

Why does Claude top out at 44.5%?

Cost vs coverage Pareto frontier across frontier models, including Opus 4.8 — Cyber Defense Benchmark

LLMs alone fail at defense. You need a harness.

Why every frontier LLM fails on defense

Offense and defense are not symmetric problems.

Offense has a feedback signal. Did the exploit land? Did the shell open? Did I get root? The model knows when it has won. Every iteration refines toward that signal.
Defense has no such signal. Ground truth is invisible. The agent does not know how many malicious events exist, when it has found them all, or when normal enterprise noise has fooled it into calling reconnaissance benign.
Agents quit early. Frontier LLMs routinely stop investigating well before exhausting their query budget — they "believe" they are done. An agent that gives up is not a defender. It is a demo.

How Simbian's AI SOC Agent gets to 95%

Simbian's AI SOC Agent runs on the same frontier LLMs that sit on the leaderboard above. The reasoning model is often a top-ranked Claude. What is different is the harness around it:

Context Lake™ grounds every investigation in your assets, identities, baseline behavior, prior incidents, analyst SOPs, and tribal knowledge. Generic models stop being generic the moment they touch your environment.
TrustedLLM™ is patented hallucination-resistant and prompt-injection hardened, adversarially trained for SOC reasoning rather than general chat.
Deterministic retrieval and agentic investigation loops are structured around the patterns analysts actually use — hypothesis generation, evidence chaining, kill-chain comprehension and investigation planning — not ReAct-for-its-own-sake. Every hypothesis is grounded in real telemetry, not guessed.
Federated reasoning across 100+ tools — SIEM, EDR, XDR, NDR, identity, cloud. No data migration, no vendor lock-in, no rebuilding playbooks every time a tool changes.
An assessment loop keeps the agent generating and evaluating hypotheses until the investigation is actually complete. No giving up at "good enough." No silent budget exhaustion.

Raw LLM vs SOAR vs copilots vs Simbian

Approach	Coverage on defense	Fails on
Raw frontier LLM (best Claude)	44.5%	unknown ground truth, agent gives up early
SOAR playbooks	~25% automation	every novel alert, constant playbook maintenance
LLM copilots	analyst-paced	nights, weekends, anything an analyst forgot to ask
Simbian AI SOC Agent	95% in production	—

A model is a component. An agent is a product. If you are picking an LLM for your SOC, you are picking the wrong thing. Pick the harness.

→ Book a demo — see Simbian against your own telemetry