Gadi Bashvitz

Author

Published Date: May 28, 2026

Estimated Read Time: 11 minutes

AI Security Review Fails In Practice: Claude Opus 4.6 Missed Critical Vulnerabilities & Generated Dangerous False Positives

Why AI Security Reviews Still Fail Without Runtime Validation

Table Of Contents

  1. Introduction
  2. The AI Security Experiment
  3. What The AI Actually Found
  4. What The AI Missed
  5. Why AI Security Reviews Fail Without Runtime Validation
  6. The Bigger Signal: The AI Security Gap
  7. Why Traditional Application Security Testing Cannot Keep Up
  8. How Bright STAR Solves This Problem
  9. Taking The Next Step In AI Security
  10. Final Thoughts

Introduction

AI coding assistants are changing the way we build software.

We can now build applications, APIs, authentication workflows, and infrastructure configurations in minutes using tools like Claude Code, GitHub Copilot, Gemini, Cursor, ChatGPT, Amazon Q, and many other AI-powered development platforms.

As a result, AI-generated code is making software delivery faster than it has ever been.

But it also introduces a critical application security challenge:

Can AI Reliably Secure the Code It Generates?

This question matters more than ever because AI is no longer generating small snippets of code or assisting with boilerplate functions.

Modern AI systems are increasingly responsible for generating:

  1. Entire applications
  2. APIs and integrations
  3. Authentication logic
  4. Authorization workflows
  5. Infrastructure configurations
  6. MCP integrations
  7. Runtime security mechanisms

As organizations embrace AI-assisted development, the volume of AI-generated code entering production environments continues to grow.

And if vulnerabilities are being generated at machine speed, can security validation keep up?

Traditional security review processes were not designed for an environment where entire applications can be created in minutes.

To better understand the effectiveness of modern AI security reviews, we conducted a real-world experiment using Claude Code Opus 4.6.

The objective was simple:

Determine whether AI can reliably identify security vulnerabilities in the code it generates – and whether those findings hold up under runtime validation.

The results exposed significant gaps between AI-generated security assessments and real-world application security outcomes.

The AI Security Experiment

To evaluate the effectiveness of AI security testing, we built a fully functional application consisting of approximately 300 lines of code using Claude Code Opus 4.6.

Once the application was generated, we intentionally inserted two critical vulnerabilities into the codebase.

The goal was straightforward:

Could the same AI model reliably discover those vulnerabilities during a security review?

To answer that question, we conducted five independent AI security reviews against the same application.

The methodology followed a simple process:

  1. Generate an AI-created application
  2. Introduce critical vulnerabilities
  3. Run multiple AI security reviews
  4. Analyze detection consistency
  5. Validate exploitability through runtime testing
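
Step 4 of this process, analyzing detection consistency, can be sketched in a few lines of Python. The finding identifiers below are hypothetical placeholders rather than the experiment's data, and the ratio shown (findings present in every run divided by all distinct findings) is only one plausible way to measure consistency; the article does not state how its own percentages were calculated.

```python
from collections import Counter

# Hypothetical finding identifiers from five review runs of the same code.
# Illustrative placeholders only, not the experiment's actual results.
runs = [
    {"sqli-login", "weak-auth", "dead-code-eval"},
    {"sqli-login", "dead-code-eval", "open-redirect"},
    {"sqli-login", "weak-auth", "dead-code-eval"},
    {"sqli-login", "dead-code-eval"},
    {"sqli-login", "dead-code-eval", "verbose-errors"},
]


def consistency_report(runs):
    """Summarize how stable findings are across repeated review runs."""
    counts = Counter(finding for run in runs for finding in run)
    distinct = set(counts)
    in_every_run = {f for f, n in counts.items() if n == len(runs)}
    seen_once = {f for f, n in counts.items() if n == 1}
    return {
        "distinct_findings": len(distinct),
        "in_every_run": sorted(in_every_run),
        "seen_once": sorted(seen_once),
        # Share of distinct findings that every single run reported.
        "consistency_ratio": len(in_every_run) / len(distinct) if distinct else 1.0,
    }


report = consistency_report(runs)
print(report)
```

A deterministic scanner run five times against unchanged code would be expected to score 1.0 on this ratio, which is what makes the metric a useful baseline for comparison.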

At first glance, the hypothesis appeared reasonable.

If AI can generate an application, surely it should be capable of identifying security flaws within that application.

The results proved far more complicated.

What the AI Actually Found

Across the five independent AI security reviews, the findings revealed a surprising level of inconsistency.

Key Findings

Observation | Result
Vulnerabilities consistently identified across all five scans | Only 32%
Findings classified as false positives | 60%
Scans that missed planted critical vulnerabilities | 60%
Scans that flagged dead code as critical | 100%
Findings validated consistently across runs | Approximately 30%

The AI identified a variety of potential security issues, including:

  1. Input validation weaknesses
  2. Authentication concerns
  3. Unsafe database operations
  4. Potential injection paths
  5. Application logic flaws

At first glance, these findings appeared encouraging.

However, a closer look revealed a more concerning reality.

Detection was highly inconsistent.

Some vulnerabilities appeared in only one of the five scans.

Others disappeared entirely.

Several findings changed severity ratings between scans, while some vulnerabilities were incorrectly explained or classified as secure.

Most concerning of all, certain critical vulnerabilities were never discovered.

This means that running the same AI security review multiple times against the same application produced materially different results.

For security teams seeking consistency and confidence, that variability creates a significant challenge.

The Reality Behind the Findings

A deeper review of the results revealed a mix of:

  1. Legitimate vulnerabilities
  2. Dead-code findings
  3. False positives
  4. Context-dependent observations
  5. Overstated severity ratings

When runtime validation was performed, many reported findings could not actually be exploited.

This highlights one of the biggest limitations of AI-powered security reviews:

AI reasoning is probabilistic – not deterministic.

Large language models generate conclusions based on probabilities, patterns, and context.

Security testing, however, requires repeatable and verifiable outcomes.

Because in application security, confidence without validation is a risk.

What the AI Missed

While Claude Opus 4.6 successfully identified some vulnerabilities, the experiment also revealed important blind spots.

Several security issues were:

  1. Incorrectly classified
  2. Poorly explained
  3. Completely overlooked

Examples included:

  1. Improper authentication handling
  2. Weak authorization logic
  3. Unsafe input processing paths
  4. Potential injection vectors

In some cases, the AI even generated explanations describing vulnerable code as secure.

This creates one of the most dangerous failure modes in modern AI-assisted development.

Developers often trust AI-generated explanations.

If those explanations are wrong, vulnerable code can move directly into production environments with a false sense of security.

The Most Concerning Result: Missed XSS Vulnerabilities

Perhaps the most significant finding from the experiment was that Claude Opus 4.6 completely failed to detect two intentionally planted XSS vulnerabilities.

These vulnerabilities included:

  1. A text/html default fallback XSS
  2. An application/xml namespace XSS

The attack chain required multiple layers of analysis, including:

  1. Multi-step indirection
  2. Content negotiation logic
  3. Runtime rendering behavior
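
To make the first of these vulnerability classes concrete, here is a minimal sketch of how a text/html default fallback can arise from content negotiation. The planted code from the experiment is not published, so everything below (the function names `pick_content_type` and `render_greeting`, the supported types, the payload) is a hypothetical illustration of the general pattern, not a reproduction of it.

```python
import json

PAYLOAD = "<script>alert(1)</script>"


def pick_content_type(accept_header: str) -> str:
    """Naive negotiation: honor two known types, otherwise fall back to HTML."""
    accept = accept_header.lower()
    if "application/json" in accept:
        return "application/json"
    if "text/plain" in accept:
        return "text/plain"
    # Default fallback: any unrecognized or missing Accept value lands here.
    return "text/html"


def render_greeting(accept_header: str, name: str) -> tuple[str, str]:
    """Return (content_type, body). Only the fallback branch emits HTML."""
    content_type = pick_content_type(accept_header)
    if content_type == "application/json":
        return content_type, json.dumps({"greeting": f"Hello, {name}"})
    if content_type == "text/plain":
        return content_type, f"Hello, {name}"
    return content_type, f"<p>Hello, {name}</p>"  # unescaped input: XSS sink


for accept in ("application/json", "text/plain", "image/png", ""):
    ctype, body = render_greeting(accept, PAYLOAD)
    # The payload only matters where the response is labeled as HTML.
    risky = ctype == "text/html" and PAYLOAD in body
    print(f"Accept={accept!r} -> {ctype} (payload served as HTML: {risky})")
```

In this sketch the sink is only reached when the Accept header names a type the handler does not recognize, or is missing altogether. A reviewer reading the JSON and plain-text branches in isolation could plausibly conclude the handler is safe, which is the kind of multi-step indirection described above.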

This is exactly the type of complexity that static, code-only AI security reviews struggle to understand.

The vulnerabilities were only fully visible when the application was analyzed during runtime execution.

Static reasoning alone failed to uncover them.

And that distinction matters.

Because attackers exploit runtime behavior – not theoretical code patterns.

Why AI Security Reviews Fail Without Runtime Validation

The results of this experiment highlight a broader truth about modern AI security testing.

Large language models are exceptionally good at:

  1. Pattern recognition
  2. Code generation
  3. Documentation
  4. Security explanations

But they continue to struggle with one critical capability:

Runtime security validation.

Security is not about identifying code that looks suspicious.

Security is about determining whether a vulnerability can actually be exploited.

And that requires runtime testing.

The research identified several reasons why AI security reviews continue to struggle.

1. LLMs Do Not Execute Applications

AI models analyze source code statically.

They rely on:

  1. Patterns
  2. Heuristics
  3. Probabilistic reasoning

They do not:

  1. Execute applications
  2. Trigger attack chains
  3. Observe runtime behavior
  4. Validate exploitability

Without execution, vulnerabilities often remain theoretical assumptions rather than proven security risks.
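
The difference between flagging code and executing it can be shown with a small in-process probe. The sketch below is an assumption-laden illustration, not how any particular product works: `app`, `legacy_render`, `probe`, the routes, and the finding names are all hypothetical. It uses only the Python standard library and calls a WSGI application directly, so no network or server is required.

```python
from urllib.parse import parse_qs, quote
from wsgiref.util import setup_testing_defaults

PAYLOAD = "<script>alert(1)</script>"


def legacy_render(name):
    """Unescaped sink that no route calls: dead code at runtime."""
    return f"<div>{name}</div>"


def app(environ, start_response):
    """Tiny hypothetical WSGI app with a single reachable route."""
    if environ["PATH_INFO"] == "/greet":
        name = parse_qs(environ.get("QUERY_STRING", "")).get("name", [""])[0]
        start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
        return [f"<p>Hello, {name}</p>".encode("utf-8")]  # reflects input unescaped
    start_response("404 Not Found", [("Content-Type", "text/plain")])
    return [b"not found"]


def probe(app, path, payload):
    """Execute one request in-process; True if the payload comes back as HTML."""
    environ = {}
    setup_testing_defaults(environ)  # fills in the required WSGI keys
    environ["PATH_INFO"] = path
    environ["QUERY_STRING"] = "name=" + quote(payload)
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured["status"] = status
        captured["headers"] = dict(headers)

    body = b"".join(app(environ, start_response)).decode("utf-8")
    is_html = captured["headers"].get("Content-Type", "").startswith("text/html")
    return captured["status"].startswith("200") and is_html and payload in body


# Hypothetical static findings: (finding id, route assumed to expose it).
static_findings = [
    ("reflected-xss-greet", "/greet"),
    ("unescaped-legacy-render", "/legacy"),  # flagged sink, but no route serves it
]
results = {finding: probe(app, path, PAYLOAD) for finding, path in static_findings}
for finding, confirmed in results.items():
    print(finding, "->", "confirmed at runtime" if confirmed else "not reproduced")
```

Both functions contain the same unescaped pattern, yet only one is reachable through a request. Note that "not reproduced" with a single payload is weaker than proof of safety; a real runtime tester would try many payloads and entry points before drawing that conclusion.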

2. AI Security Results Are Probabilistic

Another major limitation uncovered during the experiment was inconsistency.

Each AI security review was influenced by factors such as:

  1. Prompt phrasing
  2. Model randomness
  3. Context window limitations
  4. Response variability

This explains why multiple security scans against the same codebase produced different findings.

Some vulnerabilities appeared in one review but disappeared in the next.

Others changed severity ratings or received entirely different explanations.

This behavior is expected from large language models because they are designed to generate probabilistic outputs rather than deterministic results.

Security testing, however, requires:

  1. Consistency
  2. Repeatability
  3. Reliability

A vulnerability should not appear or disappear based on how a prompt is worded.

Security teams need answers that remain stable across every assessment.

And today, AI security reviews alone cannot consistently provide that level of confidence.

3. AI Lacks Exploit Validation

Perhaps the biggest limitation of AI security reviews is the inability to validate exploitability.

Most AI-powered security tools are capable of identifying:

  1. Potential vulnerabilities
  2. Suspicious code patterns
  3. Security anti-patterns

What they often cannot determine is:

  1. Whether a vulnerability can actually be exploited
  2. Whether the vulnerability is reachable during runtime
  3. Whether a remediation successfully eliminated the issue

This creates two dangerous outcomes.

False Positives

Security teams spend time investigating vulnerabilities that are not actually exploitable.

False Confidence

Developers believe vulnerabilities have been resolved when exploitability still exists.

Both scenarios become increasingly risky as organizations deploy larger volumes of AI-generated applications.

Without runtime validation, security becomes based on assumptions rather than evidence.

The Bigger Signal: The AI Security Gap

While the findings from Claude Opus 4.6 were concerning, the broader implications are even more significant.

This experiment exposed a growing gap between AI-powered software development and modern application security.

Today, AI is generating a rapidly increasing percentage of production software.

Industry estimates suggest:

  1. 30–40% of production code is already AI-generated
  2. Some organizations report more than 70% AI-assisted development
  3. AI-generated applications are becoming common across SaaS environments

Development velocity is accelerating dramatically.

Security validation is not.

Most existing application security programs still depend heavily on:

  1. Static analysis
  2. Heuristic detection
  3. AI code reviews
  4. Manual validation

While these approaches provide value, none of them reliably prove exploitability.

More importantly, they were never designed for modern AI-driven architectures.

Today’s organizations are increasingly deploying:

  1. MCP servers
  2. Agentic AI systems
  3. AI APIs
  4. Autonomous workflows
  5. AI-powered integrations

These environments introduce new attack surfaces and runtime behaviors that traditional security reviews often struggle to understand.

As AI-generated code becomes the norm, the gap between software creation and security validation will continue to grow.

And that gap creates risk.

Why Traditional Application Security Testing Cannot Keep Up

AI-generated software introduces an entirely new challenge:

Machine-Generated Vulnerabilities at Machine Speed

A developer can now generate:

  1. An entire application
  2. Authentication workflows
  3. Authorization systems
  4. APIs
  5. Infrastructure configurations
  6. Complex business logic

Within minutes.

Development work that previously required days or weeks can now happen almost instantly.

The problem is that vulnerabilities scale at the same speed.

If AI generates insecure code, organizations may be introducing security risk faster than traditional AppSec teams can review it.

This creates a fundamental mismatch.

Development accelerates.

Security struggles to keep pace.

Traditional approaches built around periodic reviews, static analysis, and manual validation were not designed for AI-generated software at scale.

Modern application security requires a different model.

One built around:

  1. Runtime validation
  2. Deterministic testing
  3. Continuous exploit verification
  4. Automated security validation

Assumptions cannot scale as fast as AI-generated code.

How Bright STAR Solves This Problem

The challenges identified throughout this research are precisely why Bright Security built STAR (Security Testing & Autonomous Remediation).

STAR was designed specifically for modern development environments where AI-generated applications, APIs, and autonomous systems are becoming increasingly common.

Unlike traditional AI security review tools that rely heavily on static analysis and probabilistic reasoning, STAR focuses on validated security outcomes.

The objective is simple:

Prove security issues exist before reporting them.

And verify they are fixed before closing them.

1. STAR Proves Exploitability

Most AI security reviews identify potential vulnerabilities.

STAR validates real vulnerabilities.

Rather than relying solely on code interpretation, STAR:

  1. Executes applications
  2. Discovers real attack paths
  3. Validates runtime behavior
  4. Confirms exploitability

This dramatically reduces uncertainty and ensures security teams focus on issues that actually matter.

Because exploitable vulnerabilities create risk.

Theoretical vulnerabilities create noise.

2. STAR Eliminates False Positives

One of the biggest challenges with traditional security tools is alert fatigue.

Developers frequently encounter:

  1. Large vulnerability lists
  2. Dead-code findings
  3. Non-exploitable issues
  4. Low-confidence results

As a result, teams spend valuable time chasing findings that never represented real risk.

STAR takes a fundamentally different approach.

By combining:

  1. Runtime DAST validation
  2. Deterministic testing
  3. AI-optimized security workflows

STAR focuses only on:

  1. Exploitable vulnerabilities
  2. Production-relevant findings
  3. Actionable security risks

This allows security teams to spend less time investigating noise and more time fixing real issues.

3. STAR Validates Remediation

Finding vulnerabilities is only half the problem.

Organizations must also verify that fixes actually work.

STAR closes the loop by automatically re-testing applications after remediation.

This process validates:

  1. Whether exploitability still exists
  2. Whether remediation was successful
  3. Whether the vulnerability has truly been eliminated

The result is a continuous security validation cycle:

Find – Validate – Remediate – Re-Test – Verify
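
As a rough illustration of the re-test and verify steps, the following sketch runs the same attack payloads against a renderer before and after a fix. It is a simplified model under stated assumptions, not a description of STAR's implementation: the renderers, the payload list, and `validation_cycle` are hypothetical names invented for this example.

```python
import html

# Two illustrative payloads; a real test suite would use many more.
PAYLOADS = ["<script>alert(1)</script>", "<img src=x onerror=alert(1)>"]


def render_vulnerable(name: str) -> str:
    return f"<p>Hello, {name}</p>"


def render_blocklist_fix(name: str) -> str:
    # Incomplete remediation: strips one tag and nothing else.
    return f"<p>Hello, {name.replace('<script>', '')}</p>"


def render_escaped_fix(name: str) -> str:
    # Output encoding with the standard library's html.escape.
    return f"<p>Hello, {html.escape(name)}</p>"


def is_exploitable(render) -> bool:
    """Execute the renderer with each payload and look for it unescaped."""
    return any(payload in render(payload) for payload in PAYLOADS)


def validation_cycle(before, after):
    """Validate the finding on the original code, then re-test the fix."""
    steps = [("validate", is_exploitable(before))]
    if steps[0][1]:
        still_exploitable = is_exploitable(after)
        steps.append(("re-test", still_exploitable))
        steps.append(("verify", not still_exploitable))
    return steps


print(validation_cycle(render_vulnerable, render_blocklist_fix))
print(validation_cycle(render_vulnerable, render_escaped_fix))
```

The blocklist fix looks like a remediation in a code diff, but re-testing shows the second payload still passes through unchanged, so the verify step fails. Only the escaped version clears every payload in this small set.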

Most AI code review solutions cannot reliably perform this workflow today.

Taking the Next Step in AI Security

Artificial intelligence is one of the most powerful productivity accelerators software development has ever experienced.

Its ability to generate applications, APIs, integrations, and workflows is transforming how modern engineering teams operate.

But AI should not become its own security gatekeeper.

Securing AI-generated applications requires solutions capable of understanding:

  1. Runtime behavior
  2. Dynamic attack paths
  3. AI execution chains
  4. Real exploitability
  5. Complex application flows

This becomes increasingly important as organizations continue adopting:

  1. AI coding assistants
  2. AI-generated APIs
  3. Agentic AI workflows
  4. MCP architectures
  5. Autonomous development systems

The future of application security is not about slowing development down.

It is about enabling organizations to move faster without sacrificing confidence.

Bright Security provides the runtime validation layer necessary to safely deploy AI-generated applications at scale.

Whether teams build with Claude, GPT, Gemini, Cursor, or custom LLMs, runtime validation remains essential.

Because AI can generate code.

But security still requires proof.

Final Thoughts

This research revealed a critical reality about the future of AI security.

AI can generate software far faster than traditional security processes can validate it.

Claude Opus 4.6 successfully identified some vulnerabilities during testing.

However, the experiment also exposed several important limitations:

  1. Inconsistent detection results
  2. High false-positive rates
  3. Missed critical vulnerabilities
  4. Lack of exploit validation
  5. Limited runtime visibility

Together, these issues create a growing security gap across modern software development environments.

As organizations increasingly rely on AI-generated code, security teams need more than AI-generated recommendations.

They need:

  1. Runtime validation
  2. Continuous exploit verification
  3. Deterministic testing
  4. Real attack simulation
  5. Runtime security testing

The future of AI security is not about identifying potential vulnerabilities.

It is about proving vulnerabilities exist, validating that fixes work, and continuously verifying security outcomes in production-like environments.

Because in application security:

Finding a vulnerability is only the beginning.

Proving it is exploitable – and proving it has been eliminated – is what truly matters.
