Deep Dive · 18 min read · Mar 15, 2026

OpenClaw Scores a C. The Most-Starred Project on GitHub Has Real Problems.

285,000 GitHub stars. 8 CVEs in 6 weeks. 42,000 exposed instances. We analyzed OpenClaw's 1.1 million lines of code across four analysis dimensions. The most-starred project on GitHub scores a C.

Overall Grade: C (60.0)
Security Score: 30 (25 issues found)
Test Quality: 36 (480K test LOC)
Total Issues: 3,592 (across all categories)

The Most-Starred Project on GitHub

OpenClaw is the fastest-growing open-source project in GitHub history. It reached 100,000 stars in two days. React took 8 years. Linux took 12. It gained 25,310 stars in a single day on January 26 -- the highest single-day total ever recorded.

It's also been called a "security nightmare" by Cisco, "structurally broken" by Conscia, and "ridiculous to try to secure" by Aikido Security. Its creator, Peter Steinberger, has since joined OpenAI. The project is being moved to an independent 501(c)(3) foundation.

We ran Octokraft's full analysis pipeline on the current codebase. Four analysis dimensions: static analysis for code quality, graph analysis for structural patterns, convention detection for consistency, and LLM-powered semantic analysis for behavioral risks. 1.1 million lines of code. 480,000 lines of tests.

The result: 60 / C.


The Full Scorecard

| Category | Score | What It Means |
| --- | --- | --- |
| Security | 29.7 | 25 security issues including 2 critical, 5 high. Plugin sandboxing gaps, replay attacks, credential handling. |
| Test Quality | 36.0 | 480K test LOC but only 29% structural coverage. Shallow breadth despite massive volume. |
| Runtime Risks | 61.0 | 29 runtime issues. Message loss on abort, no plugin unloading, no rate limiting across channels. |
| Code Smell | 72.7 | 1,788 code smells. Session key coupling, unbounded growth in core orchestration. |
| Consistency | 81.7 | 1,387 consistency issues across 368 detected conventions. |
| Dead Code | 98.9 | 223 dead code items. Relatively clean. |
| Duplication | 100.0 | No significant duplication detected. |
| Compliance | 100.0 | No compliance violations. |

The three categories that drag the score down tell the real story: security (29.7), test quality (36.0), and runtime risks (61.0). These aren't style issues or nitpicks. They're the categories that determine whether your software works correctly and safely in production.


Security: 29.7 / 100

This is a project with 8 CVEs disclosed in 6 weeks, 42,665 exposed instances identified by security researchers (93.4% with authentication bypass), and 20% of its plugin marketplace poisoned with malware. The security score reflects that history.

Critical Findings

Path traversal protection gaps in sandbox

src/agents/sandbox-paths.ts -- The sandbox path validation can be bypassed in edge cases involving symlinks and relative path resolution. This is the same class of vulnerability that led to CVE-2026-25475 (local file inclusion via MEDIA: path).

Blocked host paths -- incomplete coverage

src/agents/sandbox/validate-sandbox-security.ts -- The denylist blocks /etc, /proc, /sys, /dev, /root, /boot, docker.sock. But the completeness of this list is critical. New mount points or platform-specific paths could bypass it.
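One way to sidestep the completeness problem is to invert the check: instead of enumerating every dangerous mount point, permit only the roots the agent actually needs. A hedged sketch (the `ALLOWED_ROOTS` values are hypothetical, not OpenClaw's configuration):

```typescript
import * as path from "node:path";

// Hypothetical allowlist: only these directories are reachable. A new
// platform mount point is denied by default rather than allowed by omission.
const ALLOWED_ROOTS = ["/workspace", "/tmp/openclaw"];

export function isHostPathAllowed(p: string): boolean {
  // Normalize first so "/workspace/../etc" collapses before the check.
  const normalized = path.posix.normalize(p);
  return ALLOWED_ROOTS.some(
    (root) => normalized === root || normalized.startsWith(root + "/"),
  );
}
```

The design tradeoff: allowlists fail closed when the platform changes, denylists fail open.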

High-Severity Security Issues

| Finding | File | Source |
| --- | --- | --- |
| No replay attack protection for webhooks | extensions/bluebubbles/src/monitor.ts | LLM Agent |
| No process isolation for plugins | src/plugins/loader.ts | LLM Agent |
| No routing loop detection | src/auto-reply/reply/route-reply.ts | LLM Agent |
| XSS via direct Response write from user input | src/media/server.ts | Static (Semgrep) |
| Google service account credentials in test files | extensions/googlechat/ | Static (Semgrep) |

The mix of sources is telling. Static analysis caught the hardcoded credentials and XSS patterns. The LLM agent found the architectural security gaps -- no replay protection, no plugin isolation, no loop detection. Neither analysis alone would have given the full picture.

The Plugin Isolation Problem

This one deserves special attention. OpenClaw's plugins run in the same Node.js process as the gateway. No sandboxing. No VM isolation. No separate worker. A plugin has full access to the global state, can register HTTP handlers without authentication, and can read the config including credential paths.

This is exactly the architectural weakness that enabled the ClawHavoc supply chain attack. 824+ malicious skills were planted in the marketplace, installing keyloggers and credential stealers. The attack worked because plugins had unrestricted access to the host environment. Our analysis flagged this as a high-severity finding independently of the known CVEs.
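For contrast, even Node's built-in `worker_threads` provides a coarse boundary: a separate JS realm, message passing instead of shared globals, and per-worker heap caps. A sketch of what minimal isolation could look like (illustrative only -- real isolation of untrusted plugins needs at least a separate process, and ideally a stronger sandbox):

```typescript
import { Worker } from "node:worker_threads";

// Hypothetical sketch: run a plugin in a Worker so it cannot touch the
// gateway's globals directly and a crash cannot take down the main thread.
export function runPluginIsolated(pluginSource: string): Worker {
  const worker = new Worker(pluginSource, {
    eval: true,        // treat the string as plugin code (illustrative only)
    workerData: {},    // pass an explicit, minimal config surface -- no ambient state
    resourceLimits: {
      maxOldGenerationSizeMb: 128, // cap the plugin's heap
    },
  });
  worker.on("error", (err) => {
    // A throwing plugin surfaces here instead of crashing the gateway.
    console.error("plugin failed:", err.message);
  });
  return worker;
}
```

The key property is that every capability the plugin gets must now be passed across the message boundary explicitly, rather than inherited from the host process.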


Test Quality: 36.0 / 100

OpenClaw has 480,598 lines of test code. That's a 43.7% test-to-code ratio -- higher than most production projects. And yet the test quality score is 36.

| Metric | Value | Assessment |
| --- | --- | --- |
| Test Code Ratio | 43.7% | Excellent volume |
| Structural Coverage | 29.1% | Poor breadth |
| Assertion Density | 0.544 | Below average |
| Mock Usage | 18.4% | Moderate |
| Test LOC | 480,598 | Massive |

This is the paradox: 480,000 lines of tests covering only 29% of the codebase. That's an enormous amount of test code for a modest coverage footprint. It means the tests are deep but narrow -- concentrated on specific scenarios rather than broadly covering code paths.

The context matters. 82% of OpenClaw's test files were added after the CVEs were disclosed. The security audit tests are genuinely good -- adversarial scenarios, filesystem manipulation, environment edge cases. But they were written reactively, targeting specific vulnerability classes rather than building systematic coverage.

A project with 480K lines of test code should have better than 29% structural coverage. The gap between test volume and actual coverage is one of the clearest signals that test quantity is not test quality.


Runtime Risks: 61.0 / 100

Runtime issues are the "what happens when things go wrong" category. OpenClaw has 29 runtime findings, 4 of them high-severity.

Message loss on abort

When an AbortSignal fires mid-delivery, the delivery queue acks the entry (marks it done) but undelivered chunks are silently lost. No retry. No dead-letter queue. The user never knows their message was partially delivered.
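An abort-safe delivery loop acks only after the final chunk and surfaces the undelivered tail for requeueing. A minimal sketch, with `send` standing in for the channel call (hypothetical shape, not OpenClaw's queue API):

```typescript
type Chunk = string;

// Hypothetical abort-safe delivery: never ack until every chunk is sent;
// on abort, return the remainder so the caller can requeue it.
export async function deliver(
  chunks: Chunk[],
  send: (c: Chunk) => Promise<void>,
  signal: AbortSignal,
): Promise<{ delivered: number; requeued: Chunk[] }> {
  let delivered = 0;
  for (const chunk of chunks) {
    if (signal.aborted) {
      // Surface the undelivered tail instead of acking the whole entry.
      return { delivered, requeued: chunks.slice(delivered) };
    }
    await send(chunk);
    delivered++;
  }
  return { delivered, requeued: [] }; // only now is it safe to ack
}
```

The requeued tail could feed a retry queue or a dead-letter queue; the point is that the abort path returns the loss explicitly instead of swallowing it.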

No plugin unloading

Once loaded, plugin records stay in the registry until process exit. Hooks, HTTP routes, and channels registered by plugins are never cleaned up. Disabling a plugin requires a full restart.
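The usual fix is to record a teardown callback for every hook, route, or channel a plugin registers, so unloading can reverse them. A sketch of such a registry (hypothetical, not OpenClaw's API):

```typescript
type Cleanup = () => void;

// Hypothetical registry: every registration stores an undo, so disabling
// a plugin no longer requires a full process restart.
export class PluginRegistry {
  private cleanups = new Map<string, Cleanup[]>();

  register(pluginId: string, cleanup: Cleanup): void {
    const list = this.cleanups.get(pluginId) ?? [];
    list.push(cleanup);
    this.cleanups.set(pluginId, list);
  }

  // Runs teardown in reverse registration order; returns how many ran.
  unload(pluginId: string): number {
    const list = this.cleanups.get(pluginId) ?? [];
    for (const cleanup of [...list].reverse()) cleanup();
    this.cleanups.delete(pluginId);
    return list.length;
  }
}
```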

No rate limiting across channels

WhatsApp, Signal, Slack -- none of the channel extensions implement rate limit handling. When a provider throttles requests, the error is either swallowed or propagated as a generic failure.
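Handling a provider throttle generally means detecting the 429, honoring its `Retry-After` hint when present, and backing off otherwise. A hedged sketch, with `sendOnce` as a stand-in for a channel's HTTP call and an assumed error shape (`status`, `retryAfterSec`):

```typescript
// Hypothetical retry wrapper: rate-limit errors are retried with backoff
// instead of being swallowed or surfaced as a generic failure.
export async function sendWithRateLimit<T>(
  sendOnce: () => Promise<T>,
  maxRetries = 3,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await sendOnce();
    } catch (err: any) {
      const isRateLimit = err?.status === 429; // assumed error shape
      if (!isRateLimit || attempt >= maxRetries) throw err;
      // Prefer the server's Retry-After; else exponential backoff (1s, 2s, 4s...).
      const waitMs = (err.retryAfterSec ?? 2 ** attempt) * 1000;
      await new Promise((resolve) => setTimeout(resolve, waitMs));
    }
  }
}
```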

No resource limits on spawned processes

The process supervisor spawns child processes with output caps and timeouts but no memory or CPU limits. A misbehaving skill can consume unbounded resources.
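On Linux, one low-effort way to add memory and CPU caps on top of the existing output/timeout limits is to wrap the child in util-linux's `prlimit`. A sketch (the specific limits are illustrative, and this assumes `prlimit` is available on the host):

```typescript
import { spawn } from "node:child_process";

// Hypothetical wrapper (Linux): cap the child's address space and CPU time
// so a misbehaving skill cannot consume unbounded resources.
export function spawnCapped(cmd: string, args: string[]) {
  return spawn("prlimit", [
    "--as=268435456", // 256 MiB address-space cap
    "--cpu=30",       // 30 CPU-seconds cap
    "--", cmd, ...args,
  ], { timeout: 60_000, stdio: "pipe" });
}
```

Cgroups (or container-level limits) are the more robust option for long-lived workers, but `prlimit` covers the unbounded-single-process case with one line.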

These are behavioral issues that don't show up in traditional code quality metrics. The code is syntactically clean. It follows patterns. It has tests. But the runtime behavior -- what happens when connections drop, when providers rate-limit, when plugins misbehave -- has significant gaps.


Code Quality and Consistency: The Structural View

The static and graph analysis layers found 3,175 structural issues (1,788 code smells + 1,387 consistency issues). These are the high-volume, lower-severity findings from Biome/Semgrep static analysis and FalkorDB graph rules.

| Source | Issues | What It Catches |
| --- | --- | --- |
| Static Analyzer | 2,326 | Linting violations, unsafe patterns, dead assignments, type issues |
| Graph Rules | 1,222 | High fan-out, circular dependencies, god classes, coupling metrics |
| LLM Agent | 44 | Behavioral risks, architectural gaps, security boundaries |

The convention analysis detected 368 coding conventions across the codebase with an average compliance of 86.4%. Most conventions are well-followed. But the outliers are informative:

| Convention | Compliance | Category |
| --- | --- | --- |
| Custom error class extension | 25% | Error handling |
| Zod schema strict mode | 33% | API design |
| Private field prefix (# vs _) | 33% | Naming |
| Client instance caching | 40% | Structure |
| Env dependency injection | 40% | API design |
| JSDoc on public utilities | 43% | Documentation |

When a project establishes a convention and only follows it 25-40% of the time, it creates cognitive overhead for every contributor. Is the pattern the intended approach, or the exception? With 1,075 contributors and 11,000+ commits in the post-CVE period, convention drift is inevitable without automated enforcement.
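As an illustration of the least-followed convention above, extending a custom error class in TypeScript looks like this (the class and field names are hypothetical, not OpenClaw's):

```typescript
// Illustrative custom error: a named subclass carries structured context
// (here, which plugin failed) instead of a bare Error with a string message.
export class PluginLoadError extends Error {
  constructor(
    public readonly pluginId: string,
    message: string,
  ) {
    super(message);
    this.name = "PluginLoadError"; // keeps stack traces and logs identifiable
  }
}
```

When only 25% of error sites follow this pattern, the other 75% throw bare `Error`s that callers can't distinguish programmatically.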


Architecture: B Overall, With Scaling Concerns

Modularity: 78
Patterns: 80
Coupling: 72
Scalability: 65

The architecture review is the one bright spot. OpenClaw's plugin system is genuinely well-designed -- 50+ channel extensions with clean SDK boundaries, consistent plugin hooks, factory patterns, and TypeBox/Zod validation. The channel adapter interface is one of the better plugin architectures in the open-source ecosystem.

The concerns are at the system level: the auto-reply engine creates broad coupling across the codebase, session management could become a bottleneck at scale, and the trusted-operator security model means scaling beyond single-user deployment requires fundamental architectural changes.


The CVE Context

Our analysis doesn't perform security auditing or penetration testing. The 8 CVEs were discovered by dedicated security researchers at DepthFirst, Oasis Security, Endor Labs, and others -- not by code quality tooling.

But the architectural conditions that enabled those CVEs are exactly what our analysis detects:

No plugin isolation → ClawHavoc supply chain attack

Plugins share process memory and global state. 824+ malicious skills exploited this to install keyloggers and steal credentials.

No replay protection → Webhook exploitation

Missing nonce/timestamp validation on webhook endpoints. Captured requests can be replayed indefinitely.
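Basic replay protection combines a freshness window, a nonce cache, and a signature that covers both. A sketch of what such a check could look like (the helper names, header layout, and 5-minute window are assumptions, not any real webhook spec):

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Nonce cache; production code would use bounded/TTL'd storage.
const seenNonces = new Set<string>();

// Hypothetical sender-side helper: HMAC over timestamp, nonce, and body.
export function signWebhook(secret: string, body: string, nonce: string, timestamp: number): string {
  return createHmac("sha256", secret).update(`${timestamp}.${nonce}.${body}`).digest("hex");
}

export function verifyWebhook(
  secret: string,
  body: string,
  nonce: string,
  timestamp: number, // sender's clock, epoch seconds
  signature: string, // hex HMAC of `${timestamp}.${nonce}.${body}`
  nowSec = Math.floor(Date.now() / 1000),
): boolean {
  if (Math.abs(nowSec - timestamp) > 300) return false; // stale: >5 min skew
  if (seenNonces.has(nonce)) return false;              // replayed
  const expected = createHmac("sha256", secret)
    .update(`${timestamp}.${nonce}.${body}`)
    .digest();
  const given = Buffer.from(signature, "hex");
  // Length check first: timingSafeEqual throws on mismatched lengths.
  if (given.length !== expected.length || !timingSafeEqual(given, expected)) {
    return false;
  }
  seenNonces.add(nonce); // record only after successful verification
  return true;
}
```

Because the timestamp and nonce are inside the signed payload, an attacker can't freshen a captured request without re-signing it.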

No routing loop detection → Agent hijacking

Multi-hop routing loops (A->B->C->A) are not detected. Combined with the ClawJacked WebSocket vulnerability, this created a full agent takeover path.
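Loop detection can be as simple as carrying the set of already-visited agents with each routed message, plus a hop cap as a backstop. A minimal sketch (types and names are illustrative):

```typescript
export interface RoutedMessage {
  body: string;
  visited: string[]; // agent ids this message has already passed through
}

// Hypothetical routing step: rejects A->B->C->A at the hop that closes
// the cycle, and bounds total hops even when ids differ every time.
export function routeTo(msg: RoutedMessage, nextAgent: string, maxHops = 8): RoutedMessage {
  if (msg.visited.includes(nextAgent)) {
    throw new Error(`routing loop detected: ${[...msg.visited, nextAgent].join(" -> ")}`);
  }
  if (msg.visited.length >= maxHops) {
    throw new Error("max routing hops exceeded");
  }
  return { ...msg, visited: [...msg.visited, nextAgent] };
}
```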

Credential handling gaps → Token exfiltration

Environment variables and config values contain credentials in plaintext. Combined with the unvalidated gatewayUrl parameter, this enabled one-click RCE.

Security firms have been consistent in their assessment. Cisco called personal AI agents like OpenClaw "a security nightmare." Aikido Security's initial audit found 512 vulnerabilities (8 critical). An academic paper on arXiv analyzed the systemic risks. The fundamental tension -- AI agents need broad system access to be useful, but that access creates attack surface -- remains unresolved.


The Response: 11,000+ Commits in 8 Weeks

To the OpenClaw team's credit, the response was swift. Every CVE was patched within days, often within 24 hours. They shipped 12 custom lint scripts encoding lessons from each vulnerability. They added mandatory browser authentication, SSRF deny policies, TLS 1.3 enforcement, and proper path validation.

The scale of the response is staggering: over 11,000 PRs merged between the first CVE disclosure and today. 315,000 lines of test code added. An entire src/security/ directory built from scratch.

But rapid remediation under pressure creates its own debt. The convention compliance dropped. The code smell count tripled. The consistency issues quadrupled. The code got more secure while getting structurally messier. That tradeoff was reasonable in a crisis -- but it means the codebase now carries the weight of both the original architectural gaps and the rushed remediation.


What This Tells Us

Stars don't equal quality

285,000 GitHub stars. A C grade. The most-starred project on GitHub has genuine security issues, shallow test coverage despite massive test volume, and runtime behaviors that fail silently. Popularity measures interest, not engineering maturity.

Test volume is not test quality

480,000 lines of test code. 29% structural coverage. A test quality score of 36. You can have a 43.7% test-to-code ratio and still have most of your codebase untested. Volume is a vanity metric without coverage breadth.

Code quality hides behavioral issues

The code is syntactically clean. It follows patterns. The architecture is well-designed. And yet: messages are silently lost on abort, plugins can't be unloaded, rate limits aren't handled, and spawned processes have no resource caps. These behavioral issues only surface through deep semantic analysis -- they're invisible to linters and code review.

You need all four dimensions

Static analysis found the XSS and hard credentials. Graph analysis found the structural coupling and god classes. Convention detection found the consistency drift. LLM analysis found the behavioral risks -- plugin isolation, message loss, missing rate limiting. No single dimension gives the full picture. You need all four working together.


Methodology

Analysis was run using Octokraft's full pipeline against the latest commit on the default branch of openclaw/openclaw as of March 15, 2026. The pipeline includes:

  • Static analysis via Biome and Semgrep for code quality, security patterns, and compliance
  • Graph analysis via FalkorDB for structural patterns, coupling metrics, and dependency analysis
  • Convention detection for identifying and measuring adherence to coding patterns across the codebase
  • LLM-powered semantic analysis for behavioral risks, architectural gaps, and security boundary assessment

The CVEs discussed were discovered by DepthFirst, Oasis Security, Endor Labs, and other security researchers. Our pipeline detects architectural conditions that enable vulnerabilities, not the vulnerabilities themselves.
