Engineering 12 min read Mar 15, 2026

When Good Analysis Misses Critical Problems

OpenClaw looked great by every standard metric. Clean TypeScript. Strong architecture. 480,000 lines of tests. Our analysis gave it an A. It also had 8 CVEs, 42,000 exposed instances, and a supply chain attack that poisoned 20% of its plugin marketplace. Here's what we learned about why good analysis still misses critical problems.

[Chart: Octokraft's score for OpenClaw over the three-day investigation -- Day 1: A (93.6), Day 1.5: A- (87.3), Day 2: B- (70.0), Day 2.5: C+ (69.6), Day 3: C (60.0)]

The Moment I Stopped Trusting Our Score

I ran Octokraft's analysis on OpenClaw on March 13. The score came back: A (93.6).

I stared at it. This was the project with 8 CVEs in 6 weeks. The one Cisco called "a security nightmare." The one where security researchers found 42,665 exposed instances, 93% with authentication bypass. The one whose plugin marketplace had 824 malicious skills planted by attackers. The one an entire academic paper was written about.

A.

I thought maybe the team had fixed everything. So I checked. Between the first CVE disclosure and today, over 11,000 PRs were merged on OpenClaw. The codebase grew to 1.1 million lines of code with 480,000 lines of tests. That's a staggering amount of work. But still -- an A?

Something was wrong. Not with OpenClaw. With us.


What the A Was Actually Measuring

I dug into the category breakdown. Security: 100. Test coverage: 100. The two categories that should have been the hardest hit by OpenClaw's history were both perfect scores.

We already had all four analysis dimensions: static analysis, graph intelligence, convention detection, and LLM-powered exploration. The problem wasn't missing capabilities. The problem was how the score was composed.

Security scoring was dominated by static findings. OpenClaw's code passes every linting rule. Proper TypeScript types. Good import patterns. The graph detected structural coupling but not trust boundary violations. The LLM explored code but wasn't asking the right questions about behavioral failure modes. Each dimension was doing its job, but the scoring formula wasn't weighting what mattered.
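To make "structural coupling but not trust boundary violations" concrete, here is a toy sketch of the check the graph layer wasn't doing: flagging direct edges from modules that receive untrusted input into privileged modules, without a validator in between. Every module name and set here is invented for illustration, not OpenClaw's real structure.

```python
# Hypothetical dependency graph; all names are invented for illustration.
deps = {
    "http/webhook": ["plugins/loader"],       # untrusted input -> privileged code
    "http/auth": ["core/session"],
    "plugins/loader": ["core/exec"],
    "input/sanitize": ["plugins/loader"],
}
untrusted = {"http/webhook", "http/auth"}      # modules receiving external input
privileged = {"plugins/loader", "core/exec"}   # modules that can run code
validators = {"input/sanitize"}                # modules that must sit in between

def boundary_violations(deps, untrusted, privileged, validators):
    """Direct edges from untrusted modules into privileged ones that do not
    go through a validator are potential trust-boundary violations."""
    return [
        (src, dst)
        for src, targets in deps.items()
        for dst in targets
        if src in untrusted and dst in privileged and src not in validators
    ]

print(boundary_violations(deps, untrusted, privileged, validators))
# -> [('http/webhook', 'plugins/loader')]
```

A coupling metric would treat all four edges alike; only the boundary-aware query singles out the dangerous one.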

Test coverage was based on test-to-code ratio. OpenClaw has a 43.7% ratio -- excellent by any standard metric. What the score didn't capture: those 480,000 lines of tests only cover 29% of the codebase structurally. You can have a massive test suite that barely touches most of your code.
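A quick sanity check on those numbers shows how far apart the two metrics sit. The covered-line count below is our assumption, backed out from the post's 29% figure; the post reports a 43.7% ratio from the real line counts.

```python
# Toy reconstruction of the post's numbers: a great test-to-code ratio
# alongside poor structural coverage. covered_loc is assumed, not measured.
code_loc = 1_100_000       # production lines of code
test_loc = 480_000         # lines of test code
covered_loc = 319_000      # production lines actually executed by tests (assumed)

ratio = test_loc / code_loc        # what the old score measured
coverage = covered_loc / code_loc  # what the new score measures

print(f"test-to-code ratio: {ratio:.1%}")     # ~43.6% -- looks excellent
print(f"structural coverage: {coverage:.1%}") # 29.0% -- most code untested
```

The ratio only says how much test code exists; it says nothing about which production lines those tests ever execute.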

The score wasn't wrong because we lacked tools. It was wrong because the tools were answering the wrong questions, and the scoring wasn't listening to the signals that mattered.


What We Changed

We overhauled the analysis pipeline. Not just the scoring formula. The entire analysis flow -- how we plan investigations, how the agent explores code, how we extract findings, and how each dimension contributes to the final score.

The core insight: perfect code quality can hide huge behavioral issues. OpenClaw's code is syntactically clean, well-typed, and follows consistent patterns. Every dimension confirmed this. But runtime behaviors -- what happens when connections drop, when plugins misbehave, when inputs aren't validated at trust boundaries -- weren't surfacing because we weren't looking hard enough.

Deeper security scanning

Our static layer was catching style issues but not security patterns. We added security-focused rule sets (OWASP top-ten, secrets detection, injection patterns) that immediately surfaced XSS vectors, hard-coded credentials, and patterns that style linting alone never catches.

Better context, better questions

We improved how much the AI analysis understands about the codebase before it starts investigating, and we shifted its focus from "does this code look right?" to "what happens when things go wrong?" -- behavioral failure modes, trust boundary violations, edge cases in error handling. These are the questions that surface the kind of issues OpenClaw's CVEs represented.

Test quality, not test volume

We replaced naive test-to-code ratio with structural coverage analysis. This is the change that crashed OpenClaw's test score from 100 to 36. 480,000 lines of tests sounds impressive. 29% structural coverage tells the real story.

Scoring that penalizes what matters

We added grade caps: if your security score is below 30, you can't score above a C overall, regardless of how clean your code style is. We fixed how issue density is calculated. We made the score reflect what the analysis actually finds, not just what's easy to count.
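A minimal sketch of capped scoring, under our own assumed weights and grade bands -- the post gives the cap rule (security below 30 caps the grade at C) but not Octokraft's actual formula. The bands are chosen to agree with the scores and grades reported in the post.

```python
# Assumed grade bands, consistent with the post: 93.6->A, 87.3->A-,
# 70.0->B-, 69.6->C+, 60.0->C. The weights below are also assumptions.
GRADE_BANDS = [(90, "A"), (85, "A-"), (70, "B-"), (65, "C+"), (55, "C"), (0, "F")]
C_CEILING = 64.9  # highest numeric score that still maps to a C

def grade(score: float) -> str:
    return next(g for cutoff, g in GRADE_BANDS if score >= cutoff)

def overall(security, tests, runtime, structure):
    """Weighted score with a hard cap when security fails outright."""
    score = 0.35 * security + 0.25 * tests + 0.25 * runtime + 0.15 * structure
    if security < 30:                  # the cap rule from the post
        score = min(score, C_CEILING)
    return score, grade(score)

# Clean style can no longer buy its way past a failing security score:
print(overall(security=25, tests=90, runtime=90, structure=100))
# -> (64.9, 'C')
```

Without the cap, the same inputs would average out to a respectable 68.75; the cap is what keeps one failing dimension from being diluted by the others.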

Each iteration ran the full pipeline end-to-end. Each time the score moved closer to what the codebase actually deserved.


The Score Journey

Iteration | Score | What changed
Initial run | A (93.6) | Original pipeline. Security 100, test coverage 100. No Semgrep, no graph-guided agent, naive test ratio.
+ Semgrep SAST | A- (87.3) | Added security-focused static analysis. Caught XSS, hard-coded credentials, injection patterns. Code smells recalibrated.
+ Graph-guided LLM | B- (70.0) | Agent queries FalkorDB before exploring. Behavioral missions found plugin isolation gaps, message loss, missing rate limiting.
+ Test quality analysis | C+ (69.6) | Structural coverage replaced naive test ratio. 480K test LOC, only 29% coverage. Test score crashed from 100 to 36.
+ Scoring overhaul | C (60.0) | Grade caps, LOC dilution fix, parallel workflow. Security: 30. Tests: 36. Runtime: 61. Score reflects reality.

Each change peeled back a layer. Original analysis: A. Deeper security scanning: A-. Behavioral failure-mode analysis: B-. Real structural coverage: C+. Proper weighting: C.

The final C isn't a verdict on OpenClaw's engineering. It's a reflection of where the codebase actually stands across all dimensions -- security architecture, test effectiveness, runtime behavior, and structural quality. An A was hiding real problems. A C surfaces them.


The Numbers That Made It Real

Some data points that crystallized during the investigation:

  • 11,000+ -- PRs merged between first CVE and today
  • 1.1M -- lines of code (up from ~900K pre-CVE)
  • 480K -- lines of test code (43.7% ratio, 29% structural coverage)
  • A to C -- same codebase, same week, deeper analysis

The 11,000 PRs number still blows my mind. That's a massive amount of work in a short time. And it shows -- the codebase has genuinely improved in security. All CVEs are patched. Custom lint scripts encode lessons from each vulnerability. The security audit tests are thorough. But 11,000 PRs in 8 weeks also means a lot of code was written under pressure, and that pressure shows in the structural quality.


What I Learned Building This

Clean code is not correct code

OpenClaw passes every linter. It has good TypeScript types. It follows consistent patterns. It also silently loses messages on abort, can't unload plugins, and doesn't rate-limit any of its channel integrations. Code quality metrics measure how the code is written. They don't measure what the code does.

Each analysis dimension sees a different reality

Static analysis sees syntax and patterns. Graph analysis sees structure and dependencies. Convention detection sees consistency. LLM analysis sees behavior and edge cases. OpenClaw scored well on the first three and poorly on the last one. If you only run linters, you'll only see what linters see.

Test volume is the biggest vanity metric in software

480,000 lines of tests. 29% structural coverage. I've never seen a more convincing example that test quantity is not test quality. The ratio looks amazing on a dashboard. The coverage tells you most of the codebase is untested.

A tool that can't admit when it's wrong is useless

Octokraft gave OpenClaw an A. That A was wrong. If I hadn't questioned it, we'd be shipping a product that tells users "this project with 8 CVEs is excellent." The most important feature of an analysis platform isn't the analysis -- it's the willingness to rebuild when the results don't match reality.


You Need a Platform That Sees Everything

The lesson from this investigation isn't about OpenClaw. It's about the limits of any single analysis approach.

A linter would have given OpenClaw a pass. A test coverage tool would have been impressed by the 43.7% ratio. A dependency scanner would have found known CVE patterns. Each tool sees one dimension of reality.

What you actually need is all four dimensions working together:

  • Static analysis for the code quality floor -- linting, security patterns, compliance
  • Graph intelligence for the structural truth -- dependencies, coupling, where data flows
  • Convention detection for the consistency reality -- does the team follow its own rules?
  • LLM-powered behavioral analysis for the hard questions -- what happens when things go wrong?

No single dimension would have taken OpenClaw from an A to a C. All four together did. That's the platform we're building.

See your codebase across all four dimensions

Static analysis, graph intelligence, convention detection, and LLM-powered behavioral analysis. All in one platform.

Try Octokraft