
The best LLM for cybersecurity isn't the one with the highest general benchmark score. It's the one that best fits your SOC's specific needs: the one that finds the most attack evidence at a cost and speed your SOC can sustain. We tested 12 models end-to-end on 26 diverse, real Windows attack campaigns to answer that question with data, not spec sheets.
Each model ran autonomously inside a ReAct harness with access to a SQL log database. This setup is provider-agnostic, so the comparison stays objective. We have a proprietary harness too, and we plan to study Claude Code, Codex, and other commercial harnesses in a follow-up, but this time we benchmark the LLMs themselves. The 12 models span four provider groups.
Three numbers matter: performance, cost, and investigation time. Full methodology is in the Cyber Defense Benchmark paper.

Performance is measured using the Coverage Score: the fraction of malicious logs each model both finds and submits across every step of the attack procedure, averaged over 13 of the 14 MITRE ATT&CK tactics and normalised to 1. Cost is the cache-adjusted dollar amount per investigation. Not all providers enable prompt caching by default, and without it you reprocess the entire conversation history on every turn, so cumulative input tokens grow quadratically with turn count.
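To see the shape of that growth, here's a minimal Python sketch. All of the numbers in it (tokens per turn, list price, cache discount) are illustrative assumptions, not values measured in the benchmark:

```python
# Sketch: why prompt caching matters for agentic investigations.
# All numbers are illustrative assumptions, not benchmark measurements.

TURNS = 50               # SQL/analysis turns in one investigation
TOKENS_PER_TURN = 2_000  # new tokens appended to the history each turn
PRICE_PER_MTOK = 15.0    # hypothetical input list price, $/Mtok
CACHE_DISCOUNT = 0.10    # cached input billed at ~10% of list price

uncached = cached = 0.0
history = 0
for _ in range(TURNS):
    # Without caching, the full history is re-sent and re-billed every turn.
    uncached += history + TOKENS_PER_TURN
    # With caching, only the new suffix is billed at full price;
    # the cached prefix is billed at the discounted rate.
    cached += TOKENS_PER_TURN + history * CACHE_DISCOUNT
    history += TOKENS_PER_TURN

print(f"uncached input cost: ${uncached / 1e6 * PRICE_PER_MTOK:.2f}")
print(f"cached input cost:   ${cached / 1e6 * PRICE_PER_MTOK:.2f}")
# Uncached input tokens grow quadratically with turn count;
# cached cost grows roughly linearly.
```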
The Pareto frontier:
| Model | Coverage Score | Cost / Investigation |
|---|---|---|
| Opus 4.6 | 0.45 | $2.71 |
| Sonnet 4.6 | 0.37 | $2.21 |
| Opus 4.7 | 0.31 | $1.65 |
| GPT 5.5 | 0.26 | $2.37 |
Opus 4.7 is the cheapest flagship that still scores ~30% across tactics. Opus 4.6 holds the performance ceiling (45%) at a 64% cost premium. GPT 5.5 is dominated on this chart: roughly the same cost as Sonnet 4.6 with materially worse coverage.
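A quick way to verify the dominance claim is a standard Pareto check over the (coverage, cost) pairs from the table above. This sketch only restates the table's numbers:

```python
# Pareto dominance check over the (coverage score, $ per investigation)
# pairs from the table above. Higher coverage and lower cost are better.

models = {
    "Opus 4.6":   (0.45, 2.71),
    "Sonnet 4.6": (0.37, 2.21),
    "Opus 4.7":   (0.31, 1.65),
    "GPT 5.5":    (0.26, 2.37),
}

def dominated(name: str) -> bool:
    cov, cost = models[name]
    # Dominated if some other model is at least as good on both axes
    # and strictly better on at least one.
    return any(
        c >= cov and p <= cost and (c > cov or p < cost)
        for other, (c, p) in models.items() if other != name
    )

for name in models:
    print(name, "dominated" if dominated(name) else "on the frontier")
# GPT 5.5 is the only dominated point: Sonnet 4.6 delivers higher
# coverage at lower cost.
```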

Investigation time is the median wall-clock seconds spent on SQL querying and analysis turns (submit and retry overhead stripped so providers aren't penalised for infra throttling). On speed the ranking flips, fastest first: GPT 5.5 (~200s), Opus 4.7 (~250s), Opus 4.6 (~400s), Sonnet 4.6 (~450s).
Speed is dominated by turn count × output tokens per turn ÷ decoding rate, plus thinking budget:
| Model | Inference Speed |
|---|---|
| GPT 5.5 | ~90 tok/s |
| Opus 4.6 / 4.7, Sonnet 4.6 | ~60 tok/s |
OpenAI decodes roughly 1.5× faster than Anthropic at the same tier. But Opus 4.7 completes investigations in ~30 SQL turns versus Sonnet/Opus 4.6's ~50. Fewer turns is a bigger lever than per-turn speed.
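A back-of-envelope model makes that arithmetic concrete. The decode rates and turn counts come from the measurements above; the ~500 output tokens per turn is an assumed round number for illustration, not a measured value:

```python
# Back-of-envelope: investigation time ≈ turns × tokens/turn ÷ decode rate.
# Decode rates and turn counts come from the text above; tokens per turn
# is an assumption chosen for illustration.

TOKENS_PER_TURN = 500  # assumed output (answer + thinking) per turn

def est_seconds(turns: int, tok_per_s: float) -> float:
    """Ignores network latency, retries, and SQL execution time."""
    return turns * TOKENS_PER_TURN / tok_per_s

print(f"Opus 4.7 (30 turns @ 60 tok/s): ~{est_seconds(30, 60):.0f}s")  # ~250s
print(f"Opus 4.6 (50 turns @ 60 tok/s): ~{est_seconds(50, 60):.0f}s")  # ~417s
# Even a 1.5x decode-rate bump doesn't beat halving the turn count:
print(f"50 turns @ 90 tok/s:            ~{est_seconds(50, 90):.0f}s")  # ~278s
```

Under these assumptions, 30 turns at 60 tok/s still finishes ahead of 50 turns at 90 tok/s, which is the "fewer turns is a bigger lever" claim in miniature.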
Too many factors trade off against each other, and they cancel in ways you wouldn't predict from a spec sheet:
- Prefer LIMIT and projected columns over a broad SELECT * on an event table (see the sketch below). Pulling random data blows up the context window and hurts cost, speed, and performance.
- Track reasoning_tokens separately from completion_tokens; some providers bake thinking into completion silently. That directly affects your ability to decompose costs after the fact.

The AI SOC LLM Leaderboard tracks how these trade-offs shift as new models release.
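To make the first point concrete, here's the query-shape difference as a sketch. The `events` table, its columns, and the time window are hypothetical; the benchmark's actual schema isn't reproduced here:

```python
# Query-shape sketch against a hypothetical `events` table.

# Broad scan: every column of every row lands in the model's context.
BROAD = "SELECT * FROM events"

# Scoped query: project only the columns the current hypothesis needs,
# filter to a concrete window, and cap the row count.
SCOPED = """
SELECT timestamp, host, process_name, command_line
FROM events
WHERE event_id = 4688  -- Windows process creation
  AND timestamp BETWEEN '2024-05-01' AND '2024-05-02'
ORDER BY timestamp
LIMIT 50
"""
```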

Expected cost (linear fit from per-token pricing) vs actual cost (cache-adjusted) gives R² ≈ 0.83. About 80% of cross-model cost variance is explained by per-million-token list prices alone. The residuals are interesting: Opus 4.7 sits below the line (cheaper than predicted), Sonnet 4.6 sits above (more turns per investigation, not deeper thinking per turn).
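For reference, this is the shape of that fit. The price/cost pairs below are placeholders rather than the benchmark's measured per-model numbers; only the method (least-squares fit plus R²) carries over:

```python
import numpy as np

# Placeholder data: one (list price, measured cost) pair per model.
# These are NOT the benchmark's measured values.
list_price = np.array([3.0, 15.0, 10.0, 12.0, 1.0, 5.0])   # $/Mtok
actual_cost = np.array([0.9, 2.7, 1.7, 2.4, 0.4, 1.6])     # $/investigation

# Least-squares linear fit: actual ≈ a * price + b
a, b = np.polyfit(list_price, actual_cost, 1)
pred = a * list_price + b

# R²: share of cost variance explained by list price alone
ss_res = np.sum((actual_cost - pred) ** 2)
ss_tot = np.sum((actual_cost - actual_cost.mean()) ** 2)
print(f"R² = {1 - ss_res / ss_tot:.2f}")
print("residuals:", actual_cost - pred)
# Points below the fit line are cheaper than list price predicts;
# points above burn more tokens per investigation than price implies.
```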
If you don't have time or budget to benchmark, $/Mtok is a reasonable first-cut cost screen, but it's not a substitute for measuring performance or speed. Running our Cyber Defense Benchmark costs about $1.8K. That's the price of knowing rather than guessing.
- $/Mtok: good for roughly 80% of the cost variance, but it fails on the flagship tier.
- Newer flagships are optimising for cost, not performance. Opus 4.7 is cheaper and faster than Opus 4.6, but scores lower. GPT 5.5 is faster but more expensive than Opus 4.7, with lower coverage than either Opus.
If you're buying a security product, ask the vendor what model they're using today, what they were using six months ago, and whether they re-benchmark when the underlying model changes. The harness, the context architecture, and the engineering around the model matter as much as the model itself. For a deeper look at how LLM selection plays out in practice, watch Claude & OpenAI Will Change Security — Just Not the Way You Think. For a broader view of what's reshaping security operations, Security for Winners covers the strategic decisions security leaders are navigating right now.