Gadi Bashvitz

Author

Published Date: May 28, 2026

Estimated Read Time: 11 minutes

AI Security Review Fails In Practice: Claude Opus 4.6 Missed Critical Vulnerabilities & Generated Dangerous False Positives

Why AI Security Reviews Still Fail Without Runtime Validation

Introduction
The AI Security Experiment
What The AI Actually Found
What The AI Missed
Why AI Security Reviews Fail Without Runtime Validation
The Bigger Signal: The AI Security Gap
Why Traditional Application Security Testing Cannot Keep Up
How Bright STAR Solves This Problem
Taking The Next Step In AI Security
Final Thoughts

Introduction

AI coding assistants are changing the way we build software.

Tools like Claude Code, GitHub Copilot, Gemini, Cursor, ChatGPT, Amazon Q, and many other AI-powered development platforms can now produce applications, APIs, authentication workflows, and infrastructure configurations in just a few minutes.

As a result, software delivery is happening faster than ever before.

But it also introduces a critical application security challenge:

Can AI Reliably Secure the Code It Generates?

This question matters more than ever because AI is no longer limited to generating small snippets of code or assisting with boilerplate functions.

Modern AI systems are increasingly responsible for generating:

Entire applications
APIs and integrations
Authentication logic
Authorization workflows
Infrastructure configurations
MCP integrations
Runtime security mechanisms

As organizations embrace AI-assisted development, the volume of AI-generated code entering production environments continues to grow.

And if vulnerabilities are being generated at machine speed, can security validation keep up?

Traditional security review processes were not designed for an environment where entire applications can be created in minutes.

To better understand the effectiveness of modern AI security reviews, we conducted a real-world experiment using Claude Code Opus 4.6.

The objective was simple:

Determine whether AI can reliably identify security vulnerabilities in the code it generates – and whether those findings hold up under runtime validation.

The results exposed significant gaps between AI-generated security assessments and real-world application security outcomes.

The AI Security Experiment

To evaluate the effectiveness of AI security testing, we built a fully functional application consisting of approximately 300 lines of code using Claude Code Opus 4.6.

Once the application was generated, we intentionally inserted two critical vulnerabilities into the codebase.

The goal was straightforward:

Could the same AI model reliably discover those vulnerabilities during a security review?

To answer that question, we conducted five independent AI security reviews against the same application.

The methodology followed a simple process:

Generate an AI-created application
Introduce critical vulnerabilities
Run multiple AI security reviews
Analyze detection consistency
Validate exploitability through runtime testing
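
The "analyze detection consistency" step can be sketched roughly as follows. This is a minimal illustration that assumes each review has been reduced to a set of finding identifiers; the `consistency` helper and the sample identifiers are hypothetical and are not the data or tooling used in the experiment.

```python
from collections import Counter


def consistency(runs):
    """Summarize how stable findings are across repeated reviews.

    `runs` is a list of sets, one per review, each holding
    hypothetical finding identifiers.
    """
    counts = Counter(finding for run in runs for finding in run)
    total = len(counts)
    in_every_run = sorted(f for f, n in counts.items() if n == len(runs))
    in_single_run = sorted(f for f, n in counts.items() if n == 1)
    return {
        "distinct_findings": total,
        "in_every_run": in_every_run,
        "in_single_run": in_single_run,
        # Share of distinct findings that every single run reported.
        "consistent_share": len(in_every_run) / total if total else 0.0,
    }


# Made-up sample data: three reviews of the same code.
sample_runs = [
    {"sqli-login", "xss-profile", "weak-hash"},
    {"sqli-login", "open-redirect"},
    {"sqli-login", "xss-profile"},
]
summary = consistency(sample_runs)
print(summary)
```

A finding that shows up in only one run is a candidate for either a false positive or an unreliable detection, and in both cases it needs independent validation.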

At first glance, the hypothesis appeared reasonable.

If AI can generate an application, surely it should be capable of identifying security flaws within that application.

The results proved far more complicated.

What the AI Actually Found

Across the five independent AI security reviews, the findings revealed a surprising level of inconsistency.

Key Findings

Observation	Result
Vulnerabilities consistently identified across all five scans	Only 32%
Findings classified as false positives	60%
Scans that missed planted critical vulnerabilities	60%
Scans that flagged dead code as critical	100%
Findings validated consistently across runs	Approximately 30%

The AI identified a variety of potential security issues, including:

Input validation weaknesses
Authentication concerns
Unsafe database operations
Potential injection paths
Application logic flaws

At first glance, these findings appeared encouraging.

However, a closer look revealed a more concerning reality.

Detection was highly inconsistent.

Some vulnerabilities appeared in only one of the five scans.

Others disappeared entirely.

Several findings changed severity ratings between scans, while some vulnerabilities were incorrectly explained or classified as secure.

Most concerning of all, certain critical vulnerabilities were never discovered.

This means that running the same AI security review multiple times against the same application produced materially different results.

For security teams seeking consistency and confidence, that variability creates a significant challenge.

The Reality Behind the Findings

A deeper review of the results revealed a mix of:

Legitimate vulnerabilities
Dead-code findings
False positives
Context-dependent observations
Overstated severity ratings

When runtime validation was performed, many reported findings could not actually be exploited.

This highlights one of the biggest limitations of AI-powered security reviews:

AI reasoning is probabilistic – not deterministic.

Large language models generate conclusions based on probabilities, patterns, and context.

Security testing, however, requires repeatable and verifiable outcomes.

Because in application security, confidence without validation is a risk.

What the AI Missed

While Claude Opus 4.6 successfully identified some vulnerabilities, the experiment also revealed important blind spots.

Several security issues were:

Incorrectly classified
Poorly explained
Completely overlooked

Examples included:

Improper authentication handling
Weak authorization logic
Unsafe input processing paths
Potential injection vectors

In some cases, the AI even generated explanations describing vulnerable code as secure.

This creates one of the most dangerous failure modes in modern AI-assisted development.

Developers often trust AI-generated explanations.

If those explanations are wrong, vulnerable code can move directly into production environments with a false sense of security.

The Most Concerning Result: Missed XSS Vulnerabilities

Perhaps the most significant finding from the experiment was that Claude Opus 4.6 completely failed to detect two intentionally planted XSS vulnerabilities.

These vulnerabilities included:

A text/html default-fallback XSS
An application/xml namespace XSS

The attack chain required multiple layers of analysis, including:

Multi-step indirection
Content negotiation logic
Runtime rendering behavior

This is exactly the type of complexity that AI security reviews based on static reasoning struggle to understand.

The vulnerabilities were only fully visible when the application was analyzed during runtime execution.

Static reasoning alone failed to uncover them.

And that distinction matters.

Because attackers exploit runtime behavior – not theoretical code patterns.
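
The vulnerable code from the experiment is not reproduced in this article, so the following is only a hypothetical reconstruction of what a text/html default-fallback XSS can look like. The function names, the list of supported types, and the payload are assumptions made for illustration.

```python
import html
import json

SUPPORTED = ("application/json", "text/plain")


def negotiate(accept_header):
    """Pick a response type; anything unrecognized falls back to text/html."""
    for part in accept_header.split(","):
        media_type = part.split(";")[0].strip().lower()
        if media_type in SUPPORTED:
            return media_type
    return "text/html"  # the risky default


def render_vulnerable(accept_header, message):
    """The JSON and plain-text branches look safe; the fallback is not."""
    content_type = negotiate(accept_header)
    if content_type == "application/json":
        return content_type, json.dumps({"message": message})
    if content_type == "text/plain":
        return content_type, message
    # Fallback path: user input lands in an HTML response unescaped.
    return content_type, "<p>" + message + "</p>"


def render_fixed(accept_header, message):
    """Same negotiation, but the HTML fallback escapes user input."""
    content_type = negotiate(accept_header)
    if content_type == "application/json":
        return content_type, json.dumps({"message": message})
    if content_type == "text/plain":
        return content_type, message
    return content_type, "<p>" + html.escape(message) + "</p>"


payload = "<script>alert(1)</script>"
vulnerable_type, vulnerable_body = render_vulnerable("application/unknown", payload)
fixed_type, fixed_body = render_fixed("application/unknown", payload)
print(vulnerable_type, payload in vulnerable_body)
print(fixed_type, payload in fixed_body)
```

In this sketch the flaw only appears when a request carries an Accept header the application does not recognize, which is why reading any single function in isolation can miss it.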

Why AI Security Reviews Fail Without Runtime Validation

The results of this experiment highlight a broader truth about modern AI security testing.

Large language models are exceptionally good at:

Pattern recognition
Code generation
Documentation
Security explanations

But they continue to struggle with one critical capability:

Runtime security validation.

Security is not about identifying code that looks suspicious.

Security is about determining whether a vulnerability can actually be exploited.

And that requires runtime testing.

The research identified several reasons why AI security reviews continue to struggle.

1. LLMs Do Not Execute Applications

AI models analyze source code statically.

They rely on:

Patterns
Heuristics
Probabilistic reasoning

They do not:

Execute applications
Trigger attack chains
Observe runtime behavior
Validate exploitability

Without execution, vulnerabilities often remain theoretical assumptions rather than proven security risks.
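
To make the difference concrete, here is a minimal sketch of a runtime check using only the Python standard library. It starts a deliberately vulnerable toy endpoint, sends a payload, and observes the actual response. The `EchoHandler` endpoint, the `q` parameter, and the payload are hypothetical, and this is not a description of how any particular scanner works.

```python
import threading
import urllib.parse
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

PAYLOAD = "<script>alert(1)</script>"


class EchoHandler(BaseHTTPRequestHandler):
    """Deliberately vulnerable toy endpoint: reflects ?q= into HTML unescaped."""

    def do_GET(self):
        query = urllib.parse.urlparse(self.path).query
        value = urllib.parse.parse_qs(query).get("q", [""])[0]
        body = ("<html><body>" + value + "</body></html>").encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


def is_reflected(base_url, payload=PAYLOAD):
    """Send the payload and report whether it comes back unescaped as HTML."""
    # An empty ProxyHandler keeps the request on the local machine.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    url = base_url + "/?q=" + urllib.parse.quote(payload, safe="")
    with opener.open(url, timeout=5) as response:
        content_type = response.headers.get("Content-Type", "")
        body = response.read().decode("utf-8")
    return "text/html" in content_type and payload in body


server = HTTPServer(("127.0.0.1", 0), EchoHandler)  # port 0 picks a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
try:
    exploitable = is_reflected("http://127.0.0.1:%d" % server.server_port)
finally:
    server.shutdown()
    server.server_close()
print("payload reflected unescaped:", exploitable)
```

The verdict here comes from an observed response rather than from an interpretation of the source, so repeating the check against the same build yields the same answer.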

2. AI Security Results Are Probabilistic

Another major limitation uncovered during the experiment was inconsistency.

Each AI security review was influenced by factors such as:

Prompt phrasing
Model randomness
Context window limitations
Response variability

This explains why multiple security scans against the same codebase produced different findings.

Some vulnerabilities appeared in one review but disappeared in the next.

Others changed severity ratings or received entirely different explanations.

This behavior is expected from large language models because they are designed to generate probabilistic outputs rather than deterministic results.

Security testing, however, requires:

Consistency
Repeatability
Reliability

A vulnerability should not appear or disappear based on how a prompt is worded.

Security teams need answers that remain stable across every assessment.

And today, AI security reviews alone cannot consistently provide that level of confidence.
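
A simple probability model shows why repeated probabilistic scans rarely agree. The sketch below assumes each scan independently detects a given issue with the same probability `p`; that independence assumption and the sample values of `p` are illustrative only and were not measured in the experiment.

```python
def all_runs_detect(p, runs):
    """Chance that every one of `runs` independent scans reports the issue."""
    return p ** runs


def at_least_one_miss(p, runs):
    """Chance that one or more of the scans fails to report the issue."""
    return 1 - all_runs_detect(p, runs)


# Illustrative per-scan detection probabilities, not measured values.
for p in (0.5, 0.8, 0.95):
    print(p, round(all_runs_detect(p, 5), 3), round(at_least_one_miss(p, 5), 3))
```

Under this model, even a scan that catches an issue 95% of the time reports it in all five runs only about 77% of the time.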

3. AI Lacks Exploit Validation

Perhaps the biggest limitation of AI security reviews is the inability to validate exploitability.

Most AI-powered security tools are capable of identifying:

Potential vulnerabilities
Suspicious code patterns
Security anti-patterns

What they often cannot determine is:

Whether a vulnerability can actually be exploited
Whether the vulnerability is reachable during runtime
Whether a remediation successfully eliminated the issue

This creates two dangerous outcomes.

False Positives

Security teams spend time investigating vulnerabilities that are not actually exploitable.

False Confidence

Developers believe vulnerabilities have been resolved when exploitability still exists.

Both scenarios become increasingly risky as organizations deploy larger volumes of AI-generated applications.

Without runtime validation, security becomes based on assumptions rather than evidence.
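
Dead-code findings are one place where a quick reachability check helps with triage. The sketch below uses Python's `ast` module to list functions that are defined but never referenced by name in a module. The sample source, the function names, and the `entry_points` parameter are hypothetical. The check is also deliberately crude: it ignores dynamic dispatch and framework route registration, and a function referenced only by other dead code still counts as referenced.

```python
import ast

SOURCE = '''
def run_query(sql):
    return "executing " + sql

def legacy_export(user_input):
    return run_query("SELECT * FROM t WHERE name = " + user_input)

def handler(user_id):
    return run_query("SELECT * FROM t WHERE id = ?")
'''


def unreferenced_functions(source, entry_points=()):
    """Return functions that are defined but never referenced by name."""
    tree = ast.parse(source)
    defined = {
        node.name for node in ast.walk(tree) if isinstance(node, ast.FunctionDef)
    }
    referenced = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            referenced.add(node.id)
        elif isinstance(node, ast.Attribute):
            referenced.add(node.attr)
    # Entry points are invoked from outside the module, so never flag them.
    return sorted(defined - referenced - set(entry_points))


dead = unreferenced_functions(SOURCE, entry_points={"handler"})
print(dead)
```

In this sample, `legacy_export` contains an injection pattern but nothing calls it, so a report rating it critical would overstate the risk. A static check like this can only suggest that code is unreachable; confirming it still requires exercising the running application.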

The Bigger Signal: The AI Security Gap

While the findings from Claude Opus 4.6 were concerning, the broader implications are even more significant.

This experiment exposed a growing gap between AI-powered software development and modern application security.

Today, AI is generating a rapidly increasing percentage of production software.

Industry estimates suggest:

30–40% of production code is already AI-generated
Some organizations report more than 70% AI-assisted development
AI-generated applications are becoming common across SaaS environments

Development velocity is accelerating dramatically.

Security validation is not.

Most existing application security programs still depend heavily on:

Static analysis
Heuristic detection
AI code reviews
Manual validation

While these approaches provide value, none of them reliably prove exploitability.

More importantly, they were never designed for modern AI-driven architectures.

Today’s organizations are increasingly deploying:

MCP servers
Agentic AI systems
AI APIs
Autonomous workflows
AI-powered integrations

These environments introduce new attack surfaces and runtime behaviors that traditional security reviews often struggle to understand.

As AI-generated code becomes the norm, the gap between software creation and security validation will continue to grow.

And that gap creates risk.

Why Traditional Application Security Testing Cannot Keep Up

AI-generated software introduces an entirely new challenge:

Machine-Generated Vulnerabilities at Machine Speed

A developer can now generate:

An entire application
Authentication workflows
Authorization systems
APIs
Infrastructure configurations
Complex business logic

Within minutes.

Development effort that previously required days or weeks can now happen almost instantly.

The problem is that vulnerabilities scale at the same speed.

If AI generates insecure code, organizations may be introducing security risk faster than traditional AppSec teams can review it.

This creates a fundamental mismatch.

Development accelerates.

Security struggles to keep pace.

Traditional approaches built around periodic reviews, static analysis, and manual validation were not designed for AI-generated software at scale.

Modern application security requires a different model.

One built around:

Runtime validation
Deterministic testing
Continuous exploit verification
Automated security validation

Assumptions cannot scale as fast as AI-generated code.

How Bright STAR Solves This Problem

The challenges identified throughout this research are precisely why Bright Security built STAR (Security Testing & Autonomous Remediation).

STAR was designed specifically for modern development environments where AI-generated applications, APIs, and autonomous systems are becoming increasingly common.

Unlike traditional AI security review tools that rely heavily on static analysis and probabilistic reasoning, STAR focuses on validated security outcomes.

The objective is simple:

Prove security issues exist before reporting them.

And verify they are fixed before closing them.

1. STAR Proves Exploitability

Most AI security reviews identify potential vulnerabilities.

STAR validates real vulnerabilities.

Rather than relying solely on code interpretation, STAR:

Executes applications
Discovers real attack paths
Validates runtime behavior
Confirms exploitability

This dramatically reduces uncertainty and ensures security teams focus on issues that actually matter.

Because exploitable vulnerabilities create risk.

Theoretical vulnerabilities create noise.

2. STAR Eliminates False Positives

One of the biggest challenges with traditional security tools is alert fatigue.

Developers frequently encounter:

Large vulnerability lists
Dead-code findings
Non-exploitable issues
Low-confidence results

As a result, teams spend valuable time chasing findings that never represented real risk.

STAR takes a fundamentally different approach.

By combining:

Runtime DAST validation
Deterministic testing
AI-optimized security workflows

STAR focuses only on:

Exploitable vulnerabilities
Production-relevant findings
Actionable security risks

This allows security teams to spend less time investigating noise and more time fixing real issues.

3. STAR Validates Remediation

Finding vulnerabilities is only half the problem.

Organizations must also verify that fixes actually work.

STAR closes the loop by automatically re-testing applications after remediation.

This process validates:

Whether exploitability still exists
Whether remediation was successful
Whether the vulnerability has truly been eliminated

The result is a continuous security validation cycle:

Find – Validate – Remediate – Re-Test – Verify
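
The re-test step of this cycle can be illustrated generically. The sketch below is not Bright STAR's interface. It is a hypothetical, self-contained example in which the same payloads are replayed against a renderer before and after a fix, and the fix counts as verified only if the issue was proven first and no payload survives afterwards.

```python
import html

PAYLOADS = ["<script>alert(1)</script>", "<img src=x onerror=alert(1)>"]


def render_before_fix(comment):
    """Hypothetical renderer with the original flaw: no output escaping."""
    return "<div class='comment'>" + comment + "</div>"


def render_after_fix(comment):
    """Hypothetical remediated renderer: escapes user input."""
    return "<div class='comment'>" + html.escape(comment) + "</div>"


def exploitable_payloads(render, payloads=PAYLOADS):
    """Return the payloads that come back from `render` unescaped."""
    return [p for p in payloads if p in render(p)]


def verify_fix(before, after):
    """Validate the issue first, then re-test the remediated code."""
    found = exploitable_payloads(before)
    remaining = exploitable_payloads(after)
    return {
        "validated": bool(found),
        "fixed": bool(found) and not remaining,
        "remaining": remaining,
    }


result = verify_fix(render_before_fix, render_after_fix)
print(result)
```

Replaying identical payloads before and after the change is what turns "the code looks fixed" into evidence that the exploit no longer works.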

Most AI code review solutions cannot reliably perform this workflow today.

Taking the Next Step in AI Security

Artificial intelligence is one of the most powerful productivity accelerators software development has ever experienced.

Its ability to generate applications, APIs, integrations, and workflows is transforming how modern engineering teams operate.

But AI should not become its own security gatekeeper.

Securing AI-generated applications requires solutions capable of understanding:

Runtime behavior
Dynamic attack paths
AI execution chains
Real exploitability
Complex application flows

This becomes increasingly important as organizations continue adopting:

AI coding assistants
AI-generated APIs
Agentic AI workflows
MCP architectures
Autonomous development systems

The future of application security is not about slowing development down.

It is about enabling organizations to move faster without sacrificing confidence.

Bright Security provides the runtime validation layer necessary to safely deploy AI-generated applications at scale.

Whether teams build with Claude, GPT, Gemini, Cursor, or custom LLMs, runtime validation remains essential.

Because AI can generate code.

But security still requires proof.

Final Thoughts

This research revealed a critical reality about the future of AI security.

AI can generate software far faster than traditional security processes can validate it.

Claude Opus 4.6 successfully identified some vulnerabilities during testing.

However, the experiment also exposed several important limitations:

Inconsistent detection results
High false-positive rates
Missed critical vulnerabilities
Lack of exploit validation
Limited runtime visibility

Together, these issues create a growing security gap across modern software development environments.

As organizations increasingly rely on AI-generated code, security teams need more than AI-generated recommendations.

They need:

Runtime validation
Continuous exploit verification
Deterministic testing
Real attack simulation

The future of AI security is not about identifying potential vulnerabilities.

It is about proving vulnerabilities exist, validating that fixes work, and continuously verifying security outcomes in production-like environments.

Because in application security:

Finding a vulnerability is only the beginning.

Proving it is exploitable – and proving it has been eliminated – is what truly matters.

Stop testing.

Start Assuring.

Join the world’s leading companies securing the next big cyber frontier with Bright STAR.
