Gadi Bashvitz

Author

Published Date: May 28, 2026

Estimated Read Time: 8 minutes

AI Security Review Fails Again: Claude Opus 4.6 Struggles To Reliably Remediate Vulnerabilities

Why Runtime Validation Still Matters in AI Security Workflows

Introduction
Why We Ran This Experiment
The Research Setup
Initial Vulnerability Detection Results
AI Remediation Results
When AI Fixes Introduced New Vulnerabilities
The Hidden Cost of AI Security Reviews
What Security Teams Are Learning the Hard Way
Why Runtime Validation Still Matters
How Bright STAR Changed the Results
Cost Comparison: AI-Only vs Bright STAR
The Future of AI Security Is Runtime Validation
Key Research Findings
Final Thoughts

Introduction

Artificial intelligence is rapidly transforming the way software is built, reviewed, and secured.

Across modern engineering organizations, teams are increasingly relying on:

AI coding assistants
AI-powered security review tools
Autonomous remediation workflows
AI-generated applications and APIs

The vision is compelling.

AI can generate code faster than ever before. It can flag problems during development and even suggest fixes on its own. As models improve, many companies are starting to treat AI-driven remediation as a reliable way to keep their applications secure.

But one question remains largely unanswered:

Can AI reliably eliminate security vulnerabilities, or does it simply create the appearance of security improvements?

To answer that question, we conducted a real-world experiment using Claude Opus 4.6. Our objective was to evaluate the model’s ability to:

Detect vulnerabilities
Generate remediation recommendations
Re-analyze the updated code
Validate whether security issues were actually resolved

What we discovered revealed significant limitations in AI-driven remediation workflows, including inconsistent fixes, newly introduced vulnerabilities, escalating token costs, and a critical gap in runtime security validation.

Why We Ran This Experiment

As organizations continue adopting AI coding assistants, AI security review platforms, and autonomous development workflows, a new challenge is emerging:

Can AI reliably secure the code it helps create?

Much of the industry conversation around AI-assisted development focuses on:

Detection accuracy
Development speed
Productivity gains
Code generation capabilities

While these benefits are important, they often overlook a more critical requirement: validating whether vulnerabilities are truly eliminated in runtime environments.

Security outcomes cannot be measured solely by code reviews or remediation suggestions. The real test is whether an application remains exploitable after changes have been implemented.

Our goal was to evaluate whether modern large language models could consistently do the following, rather than simply produce remediation that appears correct on the surface:

Detect vulnerabilities
Recommend effective fixes
Eliminate runtime exploitability

The Research Setup

To simulate a realistic engineering workflow, we generated a deliberately vulnerable application containing approximately 450 lines of code using Claude Code powered by Opus 4.6.

The workflow followed a standard security review process:

Security review
Vulnerability detection
AI-generated remediation
Re-analysis of updated code
Runtime security validation

The objective was straightforward:

Could AI reliably fix the vulnerabilities it identified and prove that those vulnerabilities were no longer exploitable?

This approach allowed us to evaluate not only vulnerability detection capabilities but also the reliability of AI-generated remediation under realistic conditions.
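The review steps listed above amount to a loop: apply a fix, then probe the running application again, and stop only when the probe no longer succeeds. The following is a minimal sketch of that loop, not the harness used in the experiment. The names `Finding`, `probe`, `apply_fix`, and `review_cycle` are hypothetical, and the probes are stand-ins for real runtime exploit checks.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Finding:
    name: str
    # Hypothetical runtime probe: returns True while the exploit still works.
    probe: Callable[[], bool]


def review_cycle(findings: List[Finding],
                 apply_fix: Callable[[Finding], None],
                 max_rounds: int = 3) -> List[str]:
    """Apply fixes and re-probe until nothing is exploitable or rounds run out.

    Returns the names of findings that are still exploitable at the end.
    """
    remaining = list(findings)
    for _ in range(max_rounds):
        if not remaining:
            break
        for finding in remaining:
            apply_fix(finding)
        # Keep only the findings whose probe still succeeds after the fix.
        remaining = [f for f in remaining if f.probe()]
    return [f.name for f in remaining]


# Toy application state: True means "still exploitable".
state = {"sqli": True, "auth": True}


def apply_fix(finding: Finding) -> None:
    # In this toy setup only the SQL injection fix actually works.
    if finding.name == "sqli":
        state["sqli"] = False


findings = [
    Finding("sqli", lambda: state["sqli"]),
    Finding("auth", lambda: state["auth"]),
]
still_exploitable = review_cycle(findings, apply_fix)
print(still_exploitable)
```

In this toy run the authentication finding survives every round, so it is reported as still exploitable even though a fix was "applied" each time. The exit condition is the probe result, not the fact that a patch was generated.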

Initial Vulnerability Detection Results

Claude Opus 4.6 successfully identified several common security weaknesses during the initial review.

Among the issues detected were:

SQL injection vulnerabilities
Authentication weaknesses
Input validation flaws
Access control issues
Dependency-related risks

These results demonstrate that modern LLMs are becoming increasingly effective at recognizing common security patterns and identifying potentially vulnerable code paths.

However, identifying vulnerabilities is only one part of the security equation.

Detection alone does not make an application secure.

The true challenge begins when remediation is introduced, and organizations attempt to verify that vulnerabilities have actually been removed.

AI Remediation Results

The remediation phase produced mixed outcomes.

While some vulnerabilities were partially addressed, many issues remained unresolved or continued to be exploitable during runtime validation.

Several remediation attempts suffered from one or more of the following problems:

Vulnerabilities remained exploitable
Fixes were incomplete
Runtime validation continued to fail
Security assumptions did not hold under real-world testing

In multiple cases, the generated remediation appeared correct when reviewing the source code.

The code looked cleaner.

The security recommendations appeared reasonable.

The vulnerability seemed resolved.

However, runtime testing revealed that exploitability still existed.

This created a dangerous illusion of security – an environment where applications appeared more secure without actually reducing risk.

The results also varied significantly across remediation attempts, highlighting the inconsistency that still exists within AI-driven security workflows.
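A small illustration of how a fix can read well and still fail at runtime, using SQL injection as the example. This is a hypothetical case written for this article's argument, not code from the experiment: the function names, the table, and the payload are all assumptions. The first "fix" escapes quotes, which looks reasonable in review, but the value is interpolated into a numeric context where no quote is needed to inject.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Build a throwaway in-memory database with two users."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE users (id INTEGER, name TEXT)")
    db.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])
    return db


def lookup_plausible_fix(db: sqlite3.Connection, user_id: str):
    # Looks like a fix: single quotes are escaped. The value still lands in
    # the SQL text unquoted, so a quote-free payload changes the query.
    cleaned = user_id.replace("'", "''")
    return db.execute(f"SELECT name FROM users WHERE id = {cleaned}").fetchall()


def lookup_parameterized(db: sqlite3.Connection, user_id: str):
    # The value is bound as data and never parsed as SQL.
    return db.execute("SELECT name FROM users WHERE id = ?", (user_id,)).fetchall()


def is_exploitable(lookup) -> bool:
    """Runtime check: does a quote-free payload return more than one row?"""
    db = make_db()
    rows = lookup(db, "1 OR 1=1")
    return len(rows) > 1


print("plausible fix exploitable:", is_exploitable(lookup_plausible_fix))
print("parameterized exploitable:", is_exploitable(lookup_parameterized))
```

The check in `is_exploitable` is deliberately behavioral: it sends a payload and inspects the result, rather than judging the source code by how it looks.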

When AI Fixes Introduced New Vulnerabilities

One of the most significant findings from the experiment was that some remediation attempts introduced entirely new security issues.

Examples included:

Weak validation logic
Improper authentication handling
Incomplete input sanitization
Expanded attack surface exposure

In several instances:

Previously unreachable paths became accessible
Runtime assumptions failed unexpectedly
Overall security posture worsened after remediation

These findings expose a fundamental limitation of LLM-based security workflows.

Large language models are optimized to generate plausible solutions – not to guarantee secure runtime behavior.

As a result, remediation that appears correct in code reviews can still introduce unintended security consequences that are only discovered through runtime validation.

The Hidden Cost of AI Security Reviews

Security effectiveness was not the only challenge uncovered during the research.

Cost efficiency emerged as another major concern.

Token consumption increased significantly across repeated remediation cycles.

Each additional review required:

Re-analyzing the application
Generating new remediation suggestions
Reviewing updated code
Performing additional validation
Repeating the process when fixes failed

One of the most expensive behaviors observed during testing involved remediation attempts targeting dead code and non-reachable execution paths.

The model frequently spent resources attempting to fix code that had little or no impact on runtime security outcomes.

This increased the following without delivering meaningful security improvements:

Processing costs
Token consumption
Operational overhead
Remediation complexity

For organizations operating at scale, these inefficiencies can quickly become expensive.
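One way to avoid spending remediation effort on dead code is to filter findings by reachability before asking for a fix. The sketch below assumes a static call graph and a list of entry points are available. The article does not say how reachability was determined, so the graph, the function names, and the findings here are all hypothetical. A static graph can also miss dynamically dispatched calls, so this is a coarse filter rather than proof of unreachability.

```python
from collections import deque
from typing import Dict, Iterable, List, Set


def reachable(call_graph: Dict[str, List[str]], entry_points: Iterable[str]) -> Set[str]:
    """Breadth-first search over the call graph starting from the entry points."""
    seen = set(entry_points)
    queue = deque(seen)
    while queue:
        fn = queue.popleft()
        for callee in call_graph.get(fn, ()):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen


# Hypothetical call graph: caller -> callees.
call_graph = {
    "handle_login": ["check_password", "load_user"],
    "load_user": ["run_query"],
    "legacy_export": ["run_raw_sql"],  # nothing reaches legacy_export
}

# Hypothetical findings: (function, issue).
findings = [
    ("run_query", "SQL injection"),
    ("run_raw_sql", "SQL injection"),
]

live = reachable(call_graph, ["handle_login"])
worth_fixing = [f for f in findings if f[0] in live]
print(worth_fixing)
```

Here only the finding in `run_query` is passed on for remediation. The finding in `run_raw_sql` sits behind a function that no entry point reaches, so fixing it would consume tokens without changing runtime exposure.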

What Security Teams Are Learning the Hard Way

Over the last several years, organizations have rapidly embraced:

AI coding assistants
AI-powered security review workflows
Autonomous remediation pipelines

Yet many security teams are discovering that expectations and reality are often very different.

Assumption	Reality
AI automatically fixes vulnerabilities	Many vulnerabilities remain exploitable
AI reduces security costs	Token costs increase rapidly
AI understands application architecture	AI optimizes for plausible outputs
AI replaces runtime validation	Runtime validation becomes even more important

As AI-generated code becomes increasingly common across SaaS organizations, runtime security validation is becoming more essential – not less.

Why Runtime Validation Still Matters

The research exposed a critical gap within many AI security workflows.

Large language models do not perform deterministic runtime validation.

AI can:

Rewrite code
Suggest fixes
Improve syntax
Identify common security patterns

But AI cannot reliably:

Prove exploitability
Validate runtime behavior
Confirm vulnerability elimination

This creates a significant disconnect between code that appears secure and applications that are actually secure.

Without runtime validation, vulnerabilities can:

Remain exploitable
Shift to new attack paths
Reappear in unexpected ways
Introduce additional security risks

For modern application security programs, runtime validation is no longer optional – it is essential.

How Bright STAR Changed the Results

To better understand the impact of runtime validation, we compared an AI-only security workflow against Bright STAR.

Rather than relying solely on LLM-generated analysis, Bright STAR combines:

Runtime validation
Exploit verification
Deterministic testing
AI-guided remediation

This approach significantly improved:

Validation accuracy
Runtime verification
Remediation reliability
Cost efficiency

Bright STAR reduced the following while simultaneously improving security outcomes:

Token consumption
Operational costs
False positives
Unnecessary remediation cycles

The difference was clear:

Instead of assuming vulnerabilities were fixed, Bright STAR verified whether vulnerabilities were actually eliminated.

Cost Comparison: AI-Only vs Bright STAR

The cost analysis revealed substantial efficiency differences between AI-only security workflows and Bright STAR runtime validation workflows.

Bright STAR Workflow

Approximately $0.62 per scan
Approximately 217K tokens across 14 specialized tasks

Full AI Security Pipeline

$9.67–$21.60 per scan
Approximately 377K tokens across 15 agents

Estimated Enterprise Cost (100 PRs Per Day)

Workflow	Estimated Annual Cost
Full AI Pipeline	~$3.1M/year
Bright STAR Workflow	~$89K/year
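
The annual figures follow from per-scan cost multiplied by scan volume, so it helps to make the volume explicit. The article does not state how many scans each PR triggers or how many days per year are counted. The sketch below therefore treats one scan per PR and 365 days as assumptions and also works backward from the published totals. The function name `annual_cost` is illustrative.

```python
def annual_cost(cost_per_scan: float, scans_per_day: float, days_per_year: int = 365) -> float:
    """Annual spend = cost per scan x scans per day x days per year."""
    return cost_per_scan * scans_per_day * days_per_year


# Per-scan prices quoted in the article.
STAR_PER_SCAN = 0.62
AI_PER_SCAN_LOW, AI_PER_SCAN_HIGH = 9.67, 21.60

# Assumption: exactly one scan per PR, 100 PRs per day, 365 days per year.
star_one_scan = annual_cost(STAR_PER_SCAN, 100)
ai_one_scan_low = annual_cost(AI_PER_SCAN_LOW, 100)
ai_one_scan_high = annual_cost(AI_PER_SCAN_HIGH, 100)

# Scan volume implied by the published annual totals.
implied_scans_star = 89_000 / STAR_PER_SCAN
implied_scans_ai = 3_100_000 / AI_PER_SCAN_HIGH

print(f"one scan per PR: STAR ${star_one_scan:,.0f}, "
      f"AI ${ai_one_scan_low:,.0f} to ${ai_one_scan_high:,.0f}")
print(f"implied scans per year: STAR {implied_scans_star:,.0f}, AI {implied_scans_ai:,.0f}")
```

Under the one-scan-per-PR assumption the totals come to about $22.6K for Bright STAR and roughly $353K to $788K for the full AI pipeline. Both published totals correspond to roughly 143,500 scans per year at the quoted prices (using the upper per-scan price for the AI pipeline). That would be consistent with each PR triggering several scans, although the article does not say so. Either way, the ratio between the two workflows is about 35 to 1 at the upper AI price.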

The analysis demonstrated that runtime validation significantly reduced the following while improving confidence in security outcomes:

Token usage
Operational expenses
Remediation overhead

The Future of AI Security Is Runtime Validation

The future of AI security is not simply about detecting vulnerabilities or generating remediation suggestions.

It is about proving that vulnerabilities are gone.

As organizations continue adopting the following, the need for runtime validation will only increase:

AI coding assistants
AI-generated APIs
MCP-based architectures
Autonomous development workflows

The most effective security programs will combine AI-driven productivity with deterministic security verification.

Because generating a fix is not the same as proving security.

Key Research Findings

Research Area	Observation
Vulnerability Detection	Generally effective
Remediation Reliability	Inconsistent
Runtime Validation by the Model	Limited
Token Consumption	High
Operational Cost	Significant
Need for Runtime Verification	Critical

The research demonstrates that AI can accelerate many aspects of application security.

However, without deterministic runtime validation, organizations risk scaling vulnerabilities faster than they eliminate them.

Final Thoughts

Our experiment showed that Claude Opus 4.6 was capable of identifying multiple security vulnerabilities across a vulnerable application.

However, it struggled to consistently remediate those issues and validate the resulting runtime security outcomes.

Key findings included:

Inconsistent remediation success
Introduction of new vulnerabilities
Significant token consumption
Missing runtime validation

AI will continue to play an important role in modern software development.

But AI-generated remediation without runtime validation creates a dangerous false sense of security.

As AI-generated code becomes standard across modern engineering teams, security programs must evolve beyond recommendation-based workflows and embrace deterministic runtime verification.

Because in application security, appearing secure and being secure are not the same thing.

