How Autonomous Penetration Testing Works: Plan, Exploit, Prove

Ambuj Kumar

June 24, 2026 · 4 min read · Penetration Testing

[Figure: Autonomous penetration testing loop diagram. Four stages (plan, probe, exploit, prove) arranged as a cycle, with arrows returning to plan.]

Autonomous penetration testing is the use of AI agents to plan, execute, and verify offensive security tests against running applications with minimal human intervention. It works as a closed loop. An AI Pentest Agent plans the next move from current state, executes a probe or exploit through real tools, verifies the result before claiming a finding, and writes the proof — evidence, reproduction steps, severity — into a reviewable trace. Three controls keep it safe in production: an LLM-as-judge layer that vetoes risky actions, an ephemeral sandbox for white-box runs, and a finding-level reasoning trail every developer can audit.

Most "how AI pentest works" pages stop at the marketing layer. The agent recons, the agent exploits, the agent reports. Nothing in that sequence tells an AppSec engineer whether the box on the slide is doing real work or running a clever wrapper around Nmap. This piece opens the hood. What follows is the actual loop — what decides the next move, what turns an attempt into a finding, what keeps the agent from nuking production — written for the practitioner who has to defend the choice to their team.

Why "the AI just finds vulns" isn't an answer

The fastest way to lose an AppSec engineer is to call something autonomous and skip the mechanism. Read enough vendor pages and the picture blurs. Every ranking page for autonomous penetration testing names the same three components — a planner, a tool layer, a feedback loop — and almost none explain how those components decide anything. BreachLock, Astra, Picus, Horizon3, XBOW, Pentera, Synack: they all sit at the workflow surface. Recon, exploit, report. That description fits a scanner with a chatbot bolted on, and it fits a real agentic system. The vocabulary doesn't separate them. The mechanism does.

Practitioners ask three questions when they see an autonomous pentest demo. Is this just a wrapper around classic tooling? Does it actually exploit, or does it claim? How does it not nuke prod? The fact that those questions don't get answered on the vendor page is the gap this article closes. The 2026 OWASP Autonomous Penetration Testing Standard (APTS) treats the agent runtime as an untrusted component for the same reason: the model alone is not safe enough to act. The architecture around it does the work. That's what we walk through next.

The loop in one sentence: Plan, Probe, Exploit, Prove

An autonomous pentest run is one loop, four stages, repeated for every endpoint × method × role combination that the agent reaches:

Plan: read current state, choose the next action.
Probe: execute the action through a real tool, observe the response.
Exploit: when a probe looks promising, attempt validation under a safety judge.
Prove: verify the result, write evidence + reproduction steps + severity, return to Plan.
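As a rough illustration, the four stages can be written as one control loop. This is a minimal Python sketch, not Simbian's implementation: `RunState`, `run_loop`, and the four callables are hypothetical names, and the stubs inside `demo` stand in for a real planner, tool layer, safety judge, and verifier.

```python
from dataclasses import dataclass, field


@dataclass
class RunState:
    """What the planner can read: targets still pending and results so far."""
    pending: list                                  # (endpoint, method, role) tuples
    findings: list = field(default_factory=list)   # verified findings only
    log: list = field(default_factory=list)        # every probe and its observation


def run_loop(state, plan, probe, exploit, prove):
    """Repeat Plan -> Probe -> Exploit -> Prove until the planner returns None."""
    while True:
        action = plan(state)                     # Plan: choose the next action
        if action is None:
            return state
        observation = probe(action)              # Probe: run it, observe the response
        state.log.append((action, observation))
        if not observation.get("promising"):
            continue                             # nothing to validate, plan again
        attempt = exploit(action, observation)   # Exploit: may be vetoed (None)
        if attempt is None:
            continue
        finding = prove(attempt)                 # Prove: None if verification fails
        if finding is not None:
            state.findings.append(finding)


def demo():
    """Toy run with stub stages (all hypothetical)."""
    state = RunState(pending=[("/orders/7", "GET", "guest"),
                              ("/health", "GET", "guest")])

    def plan(s):
        return s.pending.pop(0) if s.pending else None

    def probe(action):
        return {"promising": action[0].startswith("/orders"), "status": 200}

    def exploit(action, observation):
        return {"action": action, "response": "order 7 body"}

    def prove(attempt):
        return {"target": attempt["action"], "evidence": attempt["response"]}

    return run_loop(state, plan, probe, exploit, prove)
```

Only the `prove` step appends to `findings`; a probe or an exploit attempt on its own never does.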

Everything else (adaptive discovery, multi-role parallel testing, supply-chain mode, retests) is a variation on which inputs feed the loop and how many attackers run it in parallel. The loop itself doesn't change. The rest of this article walks the four stages and the three safety controls that keep the loop honest. Where the implementation specifics matter, we name how Simbian's AI Pentest Agent does it; the loop applies to any honest agentic pentest design.

Stage 1: how the planner picks the next move

The planner is the part vendors gloss the hardest. Three implementations exist in public research and product copy:

Heuristic scoring: a decision tree weighted by tool output. Cheap, brittle, fails on novel logic. This is what most "automated" pentest tools call autonomy. It is not.
Reinforcement learning: a Q-learning or Markov-decision-process planner that scores actions against expected reward. Academic implementations exist (the NCBI 2025 RL paper is the cleanest reference) and a few research-grade vendors lean this way. Strong on repeatable tasks, weak on first-encounter logic flaws.
LLM-driven planning over state: the agent reads the current run state (endpoints discovered, methods exposed, prior responses, role under test, business context from a knowledge base) and asks a constrained model what to do next. The constraint comes from a skill library, a fixed set of allowed actions and how to invoke them, not free-form generation. This is the only design that handles novel business-logic flaws because it reasons about the application, not just the response code.

Simbian's design sits in the third category. The planner is an LLM reasoning over the run state, scoped to a skill library, with the Context Lake™ supplying application-specific context: known weak spots, prior pentest history, SecOps team annotations, hot entities from any live SOC investigation. The model doesn't invent attack steps from scratch. It picks the next skill from a vetted library and parameterizes it against the live target. That's the distinction between an agent that reasons and a script that runs.
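The "pick a skill, then parameterize it" constraint can be sketched as a validation step between the model and the tool layer. The skill names, the `Skill` type, and the `propose` callable below are assumptions made for illustration; the article does not describe the actual library format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Skill:
    name: str
    required_params: frozenset


# Hypothetical vetted library: the only actions the planner may emit.
SKILL_LIBRARY = {
    "fetch_as_role": Skill("fetch_as_role",
                           frozenset({"endpoint", "method", "role"})),
    "swap_object_id": Skill("swap_object_id",
                            frozenset({"endpoint", "role", "object_id"})),
}


def choose_next_action(run_state, propose):
    """Ask the model for the next step and accept it only if it is a vetted skill.

    `propose(run_state, skill_names)` stands in for the LLM call and must return
    {"skill": str, "params": dict}. Anything outside the library is rejected.
    """
    proposal = propose(run_state, sorted(SKILL_LIBRARY))
    skill = SKILL_LIBRARY.get(proposal.get("skill"))
    if skill is None:
        raise ValueError(f"unknown skill: {proposal.get('skill')!r}")
    params = proposal.get("params", {})
    if set(params) != skill.required_params:
        raise ValueError(f"bad parameters for {skill.name}: {sorted(params)}")
    return skill.name, params
```

The design choice is that the model's output is treated as a proposal to validate, never as a command to run directly.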

The practical test is this: ask the vendor what happens when their agent hits an endpoint it has never seen, with an authorization scheme it has never been trained on. A real agentic planner reasons through the new shape. A wrapped scanner returns a 404 with a smile.

Stage 2: probe is adaptive recon, not a scan

Recon in a real autonomous pentest is dynamic. Each request is a hypothesis the agent will mutate based on the response. The scanner sends a fixed payload library and reports what fires. The agent sends a probe, reads the response, updates its model of the target, and picks the next probe accordingly. That's why the same five-minute window in a scanner produces a list of CVE matches, and in an agent produces a map of reachable paths and the role under which each one is reachable.

The capability that makes recon useful at all is multi-role parallel testing. Single-user scanners structurally cannot find the authorization-boundary class of vulnerabilities. Broken Object Level Authorization (BOLA), Broken Function Level Authorization (BFLA), and privilege escalation all live at the seam between two users with different scopes. To find them, the test has to run as multiple users at once and cross-check what each one can reach.

In Simbian's AI Pentest Agent, this is implemented as parallel attackers. Each attacker is one running instance of the agent, configured with a different authenticated role (guest, regular user, developer, admin), spinning up together inside one run. The attackers cross-check: can the developer reach admin-only endpoints? Can the regular user mutate another user's resources? The agent doesn't find an IDOR by getting lucky with a payload. It finds one because four versions of the same request, each as a different user, produce inconsistent responses, and the planner can reason about the inconsistency.
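A simplified version of that cross-check might look like the following. It assumes the expected minimum role per endpoint is known up front (`min_role`), which is a simplification: in the design described above the planner reasons about the inconsistency rather than consulting a table. All names and the sample data are hypothetical.

```python
# Role ordering from least to most privileged (roles taken from the text above).
ROLE_RANK = {"guest": 0, "user": 1, "developer": 2, "admin": 3}


def authorization_anomalies(responses, min_role):
    """Flag roles that succeeded on a target they should not reach.

    responses: {(endpoint, method): {role: http_status}}
    min_role:  {(endpoint, method): lowest role expected to succeed}
    Returns a list of (target, role, status) hypotheses for the planner.
    """
    anomalies = []
    for target, by_role in sorted(responses.items()):
        floor = ROLE_RANK[min_role[target]]
        for role in sorted(by_role, key=ROLE_RANK.get):
            status = by_role[role]
            if ROLE_RANK[role] < floor and 200 <= status < 300:
                anomalies.append((target, role, status))
    return anomalies


# The same request sent by four parallel attackers, one per role.
responses = {
    ("/admin/users", "GET"): {"guest": 401, "user": 403,
                              "developer": 200, "admin": 200},
}
min_role = {("/admin/users", "GET"): "admin"}
print(authorization_anomalies(responses, min_role))
```

Each anomaly is still only a hypothesis at this point; it becomes a finding after the exploit and proof stages.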

For apps behind the firewall (internal portals, staging environments, partner-only services), Simbian's Cloud Link routes pentest traffic through a customer-deployed on-prem agent. Same loop, same findings UI, no exposed surface. Most external scanners structurally cannot reach those targets at all.

The auth-boundary class is where the gap between scanner and agent is widest. The OWASP API Security Top 10 puts BOLA at #1 and BFLA at #5 — the authorization boundary is consistently the most-exploited API surface in production. A test that doesn't run multiple roles in parallel cannot reach that class on principle.

Stage 3: exploit needs a safety layer, or it nukes prod

This is the stage that gets vendors in trouble at demo time. An agent that can plan an exploit can also plan a destructive one. SQL injection on a real customer database, a stored XSS that persists in shared state, a privilege escalation that resets an admin password: every one of those is a valid exploit chain in a permissive lab and a production incident in an unguarded production environment.

The safety layer has to live outside the planner, because the planner is the thing being constrained. OWASP APTS treats the agent runtime as untrusted for this reason. The architecture has to assume the model will sometimes propose an action the customer would not approve, and it has to catch that action before it executes.

Simbian's safety layer is an LLM-as-judge sitting between the planner and the tool layer. Every candidate exploit the planner produces is reviewed by a separate model whose only job is to answer two questions: Would this action exfiltrate data? Would this action disrupt the application? If the judge's assessment of either risk exceeds the configured threshold, the action is vetoed and the planner gets the veto reason as new state for its next decision. The judge is not a soft suggestion; it is a hard gate.

Three modes are available, picked per run:

Safe: judge enabled, exfiltration and disruption vetoed by default. Production posture.
Standard: full result, disrupt-class actions still blocked. Staging posture.
Full Throttle: deepest offensive depth, judge off. Recommended for staging and pre-prod only.
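One way to read the three modes is as a table of which risk classes the judge blocks. The sketch below makes that reading explicit. It is an interpretation, not the product's configuration: in particular, treating Standard as "exfiltration allowed, disruption blocked" is an assumption, and `judge`, `gate`, and the 0.5 threshold are hypothetical.

```python
from enum import Enum


class Mode(Enum):
    SAFE = "safe"
    STANDARD = "standard"
    FULL_THROTTLE = "full_throttle"


# Risk classes vetoed in each mode (assumed mapping, see lead-in).
BLOCKED = {
    Mode.SAFE: {"exfiltration", "disruption"},
    Mode.STANDARD: {"disruption"},
    Mode.FULL_THROTTLE: set(),          # judge off
}


def gate(action, mode, judge, threshold=0.5):
    """Hard gate between planner and tool layer.

    judge(action) stands in for the separate judge model and returns
    {"exfiltration": float, "disruption": float} risk scores in [0, 1].
    Returns (allowed, reason); the reason is fed back to the planner as state.
    """
    blocked = BLOCKED[mode]
    if not blocked:
        return True, "judge off"
    scores = judge(action)
    for risk in sorted(blocked):
        if scores[risk] > threshold:
            return False, f"vetoed: {risk} risk {scores[risk]:.2f} > {threshold:.2f}"
    return True, "approved"


def fake_judge(action):
    """Stub scores for a data-dumping action."""
    return {"exfiltration": 0.9, "disruption": 0.1}


print(gate("dump users table", Mode.SAFE, fake_judge))
print(gate("dump users table", Mode.STANDARD, fake_judge))
```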

Two more controls sit beneath the judge. The integration layer wraps every request in a scoped, observable proxy. The endpoint layer holds a customer-declared ignore list: paths the agent is not allowed to touch, ever. Stack the three and the run is engineered to operate without disrupting critical production environments. Not because the model behaves, but because the system around it blocks a misbehaving action before it executes. That is the gap between "the AI is probably safe" and "the AI is provably contained."
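The ignore list and the observable proxy can be sketched together as a wrapper that every outbound request passes through. The patterns, the `send` function, and the audit-log shape are hypothetical. The path check normalizes `..` segments but does not handle percent-encoding or other bypass tricks a production proxy would need to cover.

```python
import posixpath
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# Hypothetical customer-declared ignore list (glob patterns over URL paths).
IGNORE_LIST = ["/billing/*", "/admin/delete-account"]


def is_ignored(url, ignore_list=IGNORE_LIST):
    """True if the URL's normalized path matches any ignore pattern."""
    path = posixpath.normpath(urlsplit(url).path or "/")
    return any(fnmatchcase(path, pattern) for pattern in ignore_list)


def send(url, transport, audit_log):
    """Scoped proxy: refuse ignored paths before any traffic leaves, log every decision."""
    if is_ignored(url):
        audit_log.append(("blocked", url))
        raise PermissionError(f"path is on the ignore list: {url}")
    audit_log.append(("sent", url))
    return transport(url)
```

Because the check runs in the proxy rather than in the planner, a bad plan fails at execution time regardless of what the model proposed.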

Stage 4: proof is the thing, not the claim

The hardest engineering problem in autonomous pentesting is not finding vulnerabilities. It is making sure the agent isn't imagining them. Recent academic work on agentic security testing gave the failure mode a name — hallucinated compliance — where the agent reports a successful exploit it never actually executed. The pattern reproduced across frontier models and across agent frameworks. Different vendors, same lie.

The defense is not a smarter model. The defense is verification before claim. An honest agentic design draws a hard line between attempted and proved. An attempt produces a hypothesis; only a verified attempt produces a finding. Verification means the agent reproduces the exploit deterministically, captures the evidence — the actual HTTP exchange, the actual shell output, the actual screenshot — and writes those into the record. If the reproduction step fails, the finding doesn't exist.

Every finding in Simbian's AI Pentest Agent ships with three things:

Deterministic reproduction steps: a developer can run them by hand and watch the exploit fire (or fail) in the open.
Evidence: one of three forms is required — HTTP request/response exchange, shell output, or screenshot. No claim without one of the three.
Thought Traces: the finding-level reasoning trail showing how the agent got there. What it tried, what worked, what didn't, what context made it pick this path. The trace is the bridge between the agent's logic and the pen-tester's review.
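The line between attempted and proved can be expressed as a function that only returns a finding when replaying the reproduction steps succeeds and yields one of the three evidence forms. This is a sketch under assumptions: the `Finding` fields, the `replay` callable, and the choice of two replay runs are illustrative, not taken from the product.

```python
from dataclasses import dataclass

# The three evidence forms named above.
EVIDENCE_KINDS = {"http_exchange", "shell_output", "screenshot"}


@dataclass(frozen=True)
class Finding:
    title: str
    repro_steps: tuple
    evidence_kind: str
    evidence: str
    trace: tuple          # reasoning trail: what was tried and why


def prove(title, repro_steps, replay, trace, runs=2):
    """Promote an attempt to a Finding only if every replay succeeds with evidence.

    replay(repro_steps) stands in for re-executing the steps against the target;
    it returns (evidence_kind, payload) on success or None on failure.
    """
    if runs < 1:
        raise ValueError("at least one replay is required")
    evidence = None
    for _ in range(runs):
        result = replay(repro_steps)
        if result is None:
            return None                      # reproduction failed: no finding
        kind, payload = result
        if kind not in EVIDENCE_KINDS or not payload:
            return None                      # a claim without evidence is dropped
        evidence = (kind, payload)
    return Finding(title, tuple(repro_steps), evidence[0], evidence[1], tuple(trace))
```

A hallucinated exploit has nothing to replay, so `prove` returns `None` and the claim never reaches the report.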

Thought Traces are the named feature that does the heavy lifting on the "we are not a scanner" claim. A scanner finding is a signature match. An agentic finding ships with its origin story. The trace is also what makes the next retest smarter: when a senior pen-tester edits a finding (changes the severity, rewrites the remediation, adds a CWE), the edit creates a new version with an intent comment, and that comment becomes input the agent considers on the next run. Hallucination doesn't compound across cycles because unverified findings don't survive verification, and verified findings get sharper with every human review.

Same loop, three testing modes: black-box, white-box, supply-chain

The loop is invariant. What changes between modes is the input bundle the planner reads on Stage 1.

Black-box: target URL plus optional auth. The agent reasons from the outside in, with no source code or SBOM context. External-attacker model.
White-box: target URL plus auth plus repository URL plus commit plus a read-only Personal Access Token. The planner uses source code as hints to drive sharper probes in the runtime context. Findings still get verified at runtime, just with the benefit of code-level priors.
Supply-chain: target URL plus auth plus repo plus an SBOM. The agent tests for vulnerabilities introduced by libraries the application code depends on, including chains of library bugs that only become exploitable when specific package combinations interact.
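The three input bundles can be written down as one record plus a per-mode list of required fields. The field names and the `missing_inputs` helper are hypothetical; the required fields per mode follow the descriptions above. The record models what the customer supplies for a run, not what the agent's reasoning loop is allowed to see.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RunInputs:
    target_url: str
    auth: Optional[str] = None            # optional in black-box mode
    repo_url: Optional[str] = None
    commit: Optional[str] = None
    read_only_pat: Optional[str] = None   # read-only Personal Access Token
    sbom_path: Optional[str] = None


# Fields each mode needs, per the three descriptions above.
REQUIRED = {
    "black-box": ("target_url",),
    "white-box": ("target_url", "auth", "repo_url", "commit", "read_only_pat"),
    "supply-chain": ("target_url", "auth", "repo_url", "sbom_path"),
}


def missing_inputs(mode, inputs):
    """Return the names of required fields that are absent for the given mode."""
    return [name for name in REQUIRED[mode] if not getattr(inputs, name)]


only_url = RunInputs(target_url="https://app.example")
print(missing_inputs("black-box", only_url))
print(missing_inputs("supply-chain", only_url))
```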

Buyers often conflate two different things here. Supply-chain pentesting in this context tests the packages your application depends on. It is not third-party vendor risk; it does not assess your SaaS suppliers or fill out vendor security questionnaires. SCRM and TPRM are a separate product category. If the vendor conversation is about packages-as-attack-surface, supply-chain mode is the right shelf. If the conversation is about vendor governance, it is the wrong product.

White-box is the mode that worries CISOs and engineering managers, and reasonably so. Giving an AI agent access to source code is not a small ask. The trust model has to be visible. Simbian's is built around four guardrails. The PAT is read-only and clone-only — no push, no admin scopes. The token lives in an outer deterministic pipeline; the agent reasoning loop never sees it. Every run clones into a fresh ephemeral sandbox that is destroyed when the scan finishes. And a policy layer on top of the planner pins actions to the scoped pentest objective. Nothing in white-box mode gives the agent a path to mutate the repo or persist code beyond the run. The source is an input the agent reads, not a substrate it writes to.
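Two of those guardrails, the ephemeral sandbox and keeping the token out of the agent's view, can be illustrated with the standard library. This is a sketch of the idea only: `clone` and `analyze` are hypothetical callables, where `clone` represents the outer pipeline that holds the token and `analyze` represents the agent, which receives a directory path and nothing else.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def ephemeral_sandbox(prefix="pentest-run-"):
    """Yield a fresh working directory that is deleted when the block exits."""
    with tempfile.TemporaryDirectory(prefix=prefix) as path:
        yield path


def run_white_box(clone, analyze):
    """Clone into a throwaway directory, let the agent read it, then destroy it.

    clone(workdir):   outer pipeline step; the only code that touches the token.
    analyze(workdir): agent step; sees files on disk, never the credential.
    """
    with ephemeral_sandbox() as workdir:
        clone(workdir)
        hints = analyze(workdir)
    return hints, workdir   # workdir no longer exists at this point
```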

Where humans stay in control

A pentest agent that ships its findings unedited is asking for trouble. The right design assumes a senior pen-tester will review every run, edit some fraction of findings, and feed those edits back into the system. "Self-improving, not self-driving" is the autonomy posture that actually works in production.

In Simbian, every finding ships as v1, agent-authored. A reviewer can change any field — description, remediation, CWE labels, evidence, the CVSS attack-vector inputs that recompute the score live, severity — and each change creates a new immutable version attributed to the editor. Every version carries an intent comment explaining why the edit happened. The agent reads that comment on the next retest. The chain accumulates: v1 (agent) → v2 (human edit) → v3 (agent retest) → v4 (human edit), and so on. Reports always pull the latest version. The version history is the artifact compliance teams hand the auditor when the question is how findings actually get triaged.
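The version chain behaves like an append-only history of immutable records. A minimal sketch, assuming a small subset of fields and hypothetical names (`FindingVersion`, `add_version`):

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class FindingVersion:
    version: int
    author: str        # "agent" or the reviewer's identity
    severity: str
    remediation: str
    intent: str        # why this version exists


def add_version(history, author, intent, **changes):
    """Return a new history with one more immutable version appended.

    Earlier versions are never modified; the intent comment is mandatory.
    """
    if not intent.strip():
        raise ValueError("every edit needs an intent comment")
    latest = history[-1]
    new = replace(latest, version=latest.version + 1,
                  author=author, intent=intent, **changes)
    return history + (new,)


v1 = FindingVersion(1, "agent", "High", "Validate object ownership",
                    "initial agent-authored finding")
history = add_version((v1,), "reviewer@example.com",
                      "Records contain PII; raising severity",
                      severity="Critical")
latest = history[-1]   # reports pull the latest version
print(latest.version, latest.severity)
```

On the next retest, the intent comments in `history` are the input the agent would read alongside the finding itself.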

The run-level severity uses a maximum rollup, not an average. The reasoning is operational: one Critical means the application is in a Critical state. Averaging a Critical with three Lows hides the Critical. The whole point of the rollup is action triage; the worst finding dictates the action.
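The difference between the two rollups is easy to show with the example from the paragraph above, one Critical and three Lows. The five-level severity scale and the function names are assumptions for illustration.

```python
# Assumed severity scale, lowest to highest.
SEVERITY_ORDER = ["Info", "Low", "Medium", "High", "Critical"]


def run_severity(finding_severities):
    """Maximum rollup: the worst finding sets the run-level severity."""
    if not finding_severities:
        return None
    return max(finding_severities, key=SEVERITY_ORDER.index)


def averaged_severity(finding_severities):
    """Mean rollup, shown only for contrast."""
    ranks = [SEVERITY_ORDER.index(s) for s in finding_severities]
    return SEVERITY_ORDER[round(sum(ranks) / len(ranks))]


findings = ["Critical", "Low", "Low", "Low"]
print(run_severity(findings))       # Critical
print(averaged_severity(findings))  # Medium
```

Under the averaged rollup the run would be triaged as Medium, which is exactly the masking effect the maximum rollup avoids.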

The loop continues past the report

The Plan → Probe → Exploit → Prove loop is what most pentest content stops at, because the report is the deliverable. The interesting question is what happens next. In an isolated pentest, the answer is "the report goes into a backlog, and detection engineering might look at it in six weeks." In a unified platform, the pentest finding is the start of a different loop. Each verified finding becomes a detection question for the SOC: did we catch someone exploiting this? It becomes a hunt hypothesis for the Threat Hunt Agent: do we see historical evidence in our logs? It becomes a rule-engineering input: can we catch it next time, and which technique ID in MITRE ATT&CK does it map to?

That handoff (pentest finding → SOC detection question → Threat Hunt hypothesis) is the offense-to-defense closed loop on a shared Context Lake™ and a shared MITRE coordinate system. It is not a feature of the pentest agent; it is a property of running the four agents on one substrate. Worth naming here because the offense-to-defense loop is the only part of autonomous pentesting where the buyer question is structural: does the platform measure whether the finding actually changes the defense posture, or does it stop at the PDF.

What this looks like at scale

The proof point that the loop works at production scale is RapidCosmos Federal Credit Union: a six-month deployment, $1.8B in assets, federal-employee membership. Their starting posture was ARMM Level 2 (informed; basic guardrails). After six months on the loop, they reported ARMM Level 4 (quantitatively managed), 92% false-positive reduction, 88% remediation-time reduction (two hours collapsed to under 15 seconds for the headline workflow), and a Resiliency Index of 9.2/10. The full case is documented in the RapidCosmos AI Pentest case study. The relevant point here: the metrics moved because the loop ran continuously, not because a vendor showed up for an annual engagement.

One detail worth pulling out of that case: every engagement on the agentic tier includes up to five retests by design, because the loop only proves its worth on the retest. The first run finds the bug. The fifth run is where the agent has read the human's edits, the planner has internalized the team's intent, and the retest is sharper than the senior pen-tester's original brief. The economics follow the loop, not the other way around.

Frequently asked questions

Q: How is autonomous penetration testing different from automated penetration testing? Automated pentesting runs a fixed library of probes against a target and reports the matches; the test does not change based on what the target says back. Autonomous pentesting adapts. The agent reads each response, updates its model of the target, and picks the next action accordingly. Automated is signature replay; autonomous is reasoning over state. The capability gap is widest on novel logic flaws and authorization-boundary bugs, neither of which a fixed payload library can reach.

Q: Can autonomous pentesting be run safely in production? Yes, with the right architecture. The model alone is not safe; the system around it has to gate every action. In practice that means an external safety judge that vetoes exfiltration and disruption before execution, a scoped proxy that observes every request, a pre-declared ignore list that excludes paths the customer rules out, and run-mode controls that allow the customer to dial the depth (Safe, Standard, Full Throttle). With those four controls in place, autonomous pentests run against production every release without disrupting it. Without them, they should not.

Q: Do AI pentest agents hallucinate findings? Without verification, yes. Recent academic work documented hallucinated compliance (agents reporting successful exploits they never executed) as a reproducible failure mode across frontier models. The fix is structural, not model-based: draw a hard line between attempted and proved, require evidence (HTTP exchange, shell output, screenshot) plus deterministic reproduction steps for every finding, and discard any finding that fails the reproduction. Hallucination doesn't survive the verification gate.

Q: What is agentic penetration testing, and is it different from autonomous pentesting? Agentic and autonomous are used interchangeably in 2026 product copy. The useful distinction is whether the system runs as a single agent or as a fleet of parallel attacker instances inside one run. Agentic implementations spin up multiple attackers (each one a running instance of the agent, often configured with a different role or scope) and cross-check between them. That parallelism is what makes BOLA, BFLA, and privilege-escalation paths reachable.

Q: Can an AI pentest agent find zero-day vulnerabilities? Sometimes. The agent does not synthesize novel techniques out of nothing; it picks from a skill library and parameterizes each skill against the live target. What it can do is combine known techniques into chains the human library hadn't catalogued, and exercise application logic that signature scanners ignore. The realistic claim is novel-to-this-application findings, not novel-to-the-world exploits. That distinction matters for marketing copy and matters more for compliance evidence.

Q: Does autonomous penetration testing satisfy SOC 2, PCI DSS, and HIPAA pentest requirements? Yes, when the engagement is scoped against the relevant control set and the findings carry the artifacts auditors expect: methodology, scope, evidence, remediation tracking, and a signed report. For teams that need the managed-delivery version, penetration testing services like the LRQA Continuous Pentest Service pair the AI Pentest Agent with human specialists who validate, sign off, and ship a compliance-ready report in days rather than weeks. The autonomous tier and the compliance tier are not in tension; they compose.

What to do with this

If you're evaluating an autonomous pentest vendor, the questions in this article are the test. Ask the planner question: what does it read, and from what library does it choose? Ask the proof question: what evidence ships with a finding, and what happens to a claim that fails reproduction? Ask the safety question: where does the safety layer live, and what is the smallest action it can veto? A real agentic system answers each of them with a mechanism. A wrapper around a scanner answers each of them with adjectives. The difference is whether the work is happening in the model or in the marketing.

The honest test is to run it against your own app, on a target you know cold, and read the Thought Trace. If the agent's reasoning matches yours on the easy bugs and surfaces something you missed on the hard ones, the loop is real. If the report reads like a scanner output dressed in agent vocabulary, it isn't. Either way, the trace tells the truth — which is the only reason the loop is worth running in the first place.

