Deep Dive · 18 min read · Mar 15, 2026

OpenClaw Scores a C. The Most-Starred Project on GitHub Has Real Problems.

285,000 GitHub stars. 8 CVEs in 6 weeks. 42,000 exposed instances. We analyzed OpenClaw's 1.1 million lines of code across four analysis dimensions. The most-starred project on GitHub scores a C.

Overall Grade: C (60.0)
Security Score: 30 (25 issues found)
Test Quality: 36 (480K test LOC)
Total Issues: 3,592 (across all categories)

The Most-Starred Project on GitHub

OpenClaw is the fastest-growing open-source project in GitHub history. It reached 100,000 stars in two days. React took 8 years. Linux took 12. It gained 25,310 stars in a single day on January 26 -- the highest single-day total ever recorded.

It's also been called a "security nightmare" by Cisco, "structurally broken" by Conscia, and "ridiculous to try to secure" by Aikido Security. Its creator, Peter Steinberger, has since joined OpenAI. The project is being moved to an independent 501(c)(3) foundation.

We ran Octokraft's full analysis pipeline on the current codebase. Four analysis dimensions: static analysis for code quality, graph analysis for structural patterns, convention detection for consistency, and LLM-powered semantic analysis for behavioral risks. 1.1 million lines of code. 480,000 lines of tests.

The result: 60 / C.


The Full Scorecard

| Category | Score | What It Means |
| --- | --- | --- |
| Security | 29.7 | 25 security issues including 2 critical, 5 high. Plugin sandboxing gaps, replay attacks, credential handling. |
| Test Quality | 36.0 | 480K test LOC but only 29% structural coverage. Shallow breadth despite massive volume. |
| Runtime Risks | 61.0 | 29 runtime issues. Message loss on abort, no plugin unloading, no rate limiting across channels. |
| Code Smell | 72.7 | 1,788 code smells. Session key coupling, unbounded growth in core orchestration. |
| Consistency | 81.7 | 1,387 consistency issues across 368 detected conventions. |
| Dead Code | 98.9 | 223 dead code items. Relatively clean. |
| Duplication | 100.0 | No significant duplication detected. |
| Compliance | 100.0 | No compliance violations. |

The three categories that drag the score down tell the real story: security (29.7), test quality (36.0), and runtime risks (61.0). These aren't style issues or nitpicks. They're the categories that determine whether your software works correctly and safely in production.


Security: 29.7 / 100

This is a project with 8 CVEs disclosed in 6 weeks, 42,665 exposed instances identified by security researchers (93.4% with authentication bypass), and 20% of its plugin marketplace poisoned with malware. The security score reflects that history.

Critical Findings

Path traversal protection gaps in sandbox

src/agents/sandbox-paths.ts -- The sandbox path validation can be bypassed in edge cases involving symlinks and relative path resolution. This is the same class of vulnerability that led to CVE-2026-25475 (local file inclusion via MEDIA: path).

Blocked host paths -- incomplete coverage

src/agents/sandbox/validate-sandbox-security.ts -- The denylist blocks /etc, /proc, /sys, /dev, /root, /boot, docker.sock. But the completeness of this list is critical. New mount points or platform-specific paths could bypass it.
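One way to sidestep the completeness problem is to invert the check: instead of enumerating every dangerous mount point, permit only the roots the agent actually needs. A hedged sketch (the `ALLOWED_ROOTS` values are hypothetical, not OpenClaw's configuration):

```typescript
import * as path from "node:path";

// Hypothetical allowlist: only these directories are reachable. A new
// platform mount point is denied by default rather than allowed by omission.
const ALLOWED_ROOTS = ["/workspace", "/tmp/openclaw"];

export function isHostPathAllowed(p: string): boolean {
  // Normalize first so "/workspace/../etc" collapses before the check.
  const normalized = path.posix.normalize(p);
  return ALLOWED_ROOTS.some(
    (root) => normalized === root || normalized.startsWith(root + "/"),
  );
}
```

The design tradeoff: allowlists fail closed when the platform changes, denylists fail open.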

High-Severity Security Issues

| Finding | File | Source |
| --- | --- | --- |
| No replay attack protection for webhooks | extensions/bluebubbles/src/monitor.ts | LLM Agent |
| No process isolation for plugins | src/plugins/loader.ts | LLM Agent |
| No routing loop detection | src/auto-reply/reply/route-reply.ts | LLM Agent |
| XSS via direct Response write from user input | src/media/server.ts | Static (Semgrep) |
| Google service account credentials in test files | extensions/googlechat/ | Static (Semgrep) |

The mix of sources is telling. Static analysis caught the hardcoded credentials and XSS patterns. The LLM agent found the architectural security gaps -- no replay protection, no plugin isolation, no loop detection. Neither analysis alone would have given the full picture.

The Plugin Isolation Problem

This one deserves special attention. OpenClaw's plugins run in the same Node.js process as the gateway. No sandboxing. No VM isolation. No separate worker. A plugin has full access to the global state, can register HTTP handlers without authentication, and can read the config including credential paths.

This is exactly the architectural weakness that enabled the ClawHavoc supply chain attack. 824+ malicious skills were planted in the marketplace, installing keyloggers and credential stealers. The attack worked because plugins had unrestricted access to the host environment. Our analysis flagged this as a high-severity finding independently of the known CVEs.
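For contrast, even Node's built-in `worker_threads` provides a coarse boundary: a separate JS realm, message passing instead of shared globals, and per-worker heap caps. A sketch of what minimal isolation could look like (illustrative only -- real isolation of untrusted plugins needs at least a separate process, and ideally a stronger sandbox):

```typescript
import { Worker } from "node:worker_threads";

// Hypothetical sketch: run a plugin in a Worker so it cannot touch the
// gateway's globals directly and a crash cannot take down the main thread.
export function runPluginIsolated(pluginSource: string): Worker {
  const worker = new Worker(pluginSource, {
    eval: true,        // treat the string as plugin code (illustrative only)
    workerData: {},    // pass an explicit, minimal config surface -- no ambient state
    resourceLimits: {
      maxOldGenerationSizeMb: 128, // cap the plugin's heap
    },
  });
  worker.on("error", (err) => {
    // A throwing plugin surfaces here instead of crashing the gateway.
    console.error("plugin failed:", err.message);
  });
  return worker;
}
```

The key property is that every capability the plugin gets must now be passed across the message boundary explicitly, rather than inherited from the host process.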


Test Quality: 36.0 / 100

OpenClaw has 480,598 lines of test code. That's a 43.7% test-to-code ratio -- higher than most production projects. And yet the test quality score is 36.

| Metric | Value | Assessment |
| --- | --- | --- |
| Test Code Ratio | 43.7% | Excellent volume |
| Structural Coverage | 29.1% | Poor breadth |
| Assertion Density | 0.544 | Below average |
| Mock Usage | 18.4% | Moderate |
| Test LOC | 480,598 | Massive |

This is the paradox: 480,000 lines of tests covering only 29% of the codebase. That's an enormous amount of test code for a modest coverage footprint. It means the tests are deep but narrow -- concentrated on specific scenarios rather than broadly covering code paths.

The context matters. 82% of OpenClaw's test files were added after the CVEs were disclosed. The security audit tests are genuinely good -- adversarial scenarios, filesystem manipulation, environment edge cases. But they were written reactively, targeting specific vulnerability classes rather than building systematic coverage.

A project with 480K lines of test code should have better than 29% structural coverage. The gap between test volume and actual coverage is one of the clearest signals that test quantity is not test quality.


Runtime Risks: 61.0 / 100

Runtime issues are the "what happens when things go wrong" category. OpenClaw has 29 runtime findings, 4 of them high-severity.

Message loss on abort

When an AbortSignal fires mid-delivery, the delivery queue acks the entry (marks it done) but undelivered chunks are silently lost. No retry. No dead-letter queue. The user never knows their message was partially delivered.
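An abort-safe delivery loop acks only after the final chunk and surfaces the undelivered tail for requeueing. A minimal sketch, with `send` standing in for the channel call (hypothetical shape, not OpenClaw's queue API):

```typescript
type Chunk = string;

// Hypothetical abort-safe delivery: never ack until every chunk is sent;
// on abort, return the remainder so the caller can requeue it.
export async function deliver(
  chunks: Chunk[],
  send: (c: Chunk) => Promise<void>,
  signal: AbortSignal,
): Promise<{ delivered: number; requeued: Chunk[] }> {
  let delivered = 0;
  for (const chunk of chunks) {
    if (signal.aborted) {
      // Surface the undelivered tail instead of acking the whole entry.
      return { delivered, requeued: chunks.slice(delivered) };
    }
    await send(chunk);
    delivered++;
  }
  return { delivered, requeued: [] }; // only now is it safe to ack
}
```

The requeued tail could feed a retry queue or a dead-letter queue; the point is that the abort path returns the loss explicitly instead of swallowing it.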

No plugin unloading

Once loaded, plugin records stay in the registry until process exit. Hooks, HTTP routes, and channels registered by plugins are never cleaned up. Disabling a plugin requires a full restart.
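The usual fix is to record a teardown callback for every hook, route, or channel a plugin registers, so unloading can reverse them. A sketch of such a registry (hypothetical, not OpenClaw's API):

```typescript
type Cleanup = () => void;

// Hypothetical registry: every registration stores an undo, so disabling
// a plugin no longer requires a full process restart.
export class PluginRegistry {
  private cleanups = new Map<string, Cleanup[]>();

  register(pluginId: string, cleanup: Cleanup): void {
    const list = this.cleanups.get(pluginId) ?? [];
    list.push(cleanup);
    this.cleanups.set(pluginId, list);
  }

  // Runs teardown in reverse registration order; returns how many ran.
  unload(pluginId: string): number {
    const list = this.cleanups.get(pluginId) ?? [];
    for (const cleanup of [...list].reverse()) cleanup();
    this.cleanups.delete(pluginId);
    return list.length;
  }
}
```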

No rate limiting across channels

WhatsApp, Signal, Slack -- none of the channel extensions implement rate limit handling. When a provider throttles requests, the error is either swallowed or propagated as a generic failure.
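Handling a provider throttle generally means detecting the 429, honoring its `Retry-After` hint when present, and backing off otherwise. A hedged sketch, with `sendOnce` as a stand-in for a channel's HTTP call and an assumed error shape (`status`, `retryAfterSec`):

```typescript
// Hypothetical retry wrapper: rate-limit errors are retried with backoff
// instead of being swallowed or surfaced as a generic failure.
export async function sendWithRateLimit<T>(
  sendOnce: () => Promise<T>,
  maxRetries = 3,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await sendOnce();
    } catch (err: any) {
      const isRateLimit = err?.status === 429; // assumed error shape
      if (!isRateLimit || attempt >= maxRetries) throw err;
      // Prefer the server's Retry-After; else exponential backoff (1s, 2s, 4s...).
      const waitMs = (err.retryAfterSec ?? 2 ** attempt) * 1000;
      await new Promise((resolve) => setTimeout(resolve, waitMs));
    }
  }
}
```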

No resource limits on spawned processes

The process supervisor spawns child processes with output caps and timeouts but no memory or CPU limits. A misbehaving skill can consume unbounded resources.
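On Linux, one low-effort way to add memory and CPU caps on top of the existing output/timeout limits is to wrap the child in util-linux's `prlimit`. A sketch (the specific limits are illustrative, and this assumes `prlimit` is available on the host):

```typescript
import { spawn } from "node:child_process";

// Hypothetical wrapper (Linux): cap the child's address space and CPU time
// so a misbehaving skill cannot consume unbounded resources.
export function spawnCapped(cmd: string, args: string[]) {
  return spawn("prlimit", [
    "--as=268435456", // 256 MiB address-space cap
    "--cpu=30",       // 30 CPU-seconds cap
    "--", cmd, ...args,
  ], { timeout: 60_000, stdio: "pipe" });
}
```

Cgroups (or container-level limits) are the more robust option for long-lived workers, but `prlimit` covers the unbounded-single-process case with one line.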

These are behavioral issues that don't show up in traditional code quality metrics. The code is syntactically clean. It follows patterns. It has tests. But the runtime behavior -- what happens when connections drop, when providers rate-limit, when plugins misbehave -- has significant gaps.


Code Quality and Consistency: The Structural View

The static and graph analysis layers found 3,175 structural issues (1,788 code smells + 1,387 consistency issues). These are the high-volume, lower-severity findings from Biome/Semgrep static analysis and FalkorDB graph rules.

| Source | Issues | What It Catches |
| --- | --- | --- |
| Static Analyzer | 2,326 | Linting violations, unsafe patterns, dead assignments, type issues |
| Graph Rules | 1,222 | High fan-out, circular dependencies, god classes, coupling metrics |
| LLM Agent | 44 | Behavioral risks, architectural gaps, security boundaries |

The convention analysis detected 368 coding conventions across the codebase with an average compliance of 86.4%. Most conventions are well-followed. But the outliers are informative:

| Convention | Compliance | Category |
| --- | --- | --- |
| Custom error class extension | 25% | Error handling |
| Zod schema strict mode | 33% | API design |
| Private field prefix (# vs _) | 33% | Naming |
| Client instance caching | 40% | Structure |
| Env dependency injection | 40% | API design |
| JSDoc on public utilities | 43% | Documentation |

When a project establishes a convention and only follows it 25-40% of the time, it creates cognitive overhead for every contributor. Is the pattern the intended approach, or the exception? With 1,075 contributors and 11,000+ commits in the post-CVE period, convention drift is inevitable without automated enforcement.
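As an illustration of the least-followed convention above, extending a custom error class in TypeScript looks like this (the class and field names are hypothetical, not OpenClaw's):

```typescript
// Illustrative custom error: a named subclass carries structured context
// (here, which plugin failed) instead of a bare Error with a string message.
export class PluginLoadError extends Error {
  constructor(
    public readonly pluginId: string,
    message: string,
  ) {
    super(message);
    this.name = "PluginLoadError"; // keeps stack traces and logs identifiable
  }
}
```

When only 25% of error sites follow this pattern, the other 75% throw bare `Error`s that callers can't distinguish programmatically.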


Architecture: B Overall, With Scaling Concerns

Modularity: 78
Patterns: 80
Coupling: 72
Scalability: 65

The architecture review is the one bright spot. OpenClaw's plugin system is genuinely well-designed -- 50+ channel extensions with clean SDK boundaries, consistent plugin hooks, factory patterns, and TypeBox/Zod validation. The channel adapter interface is one of the better plugin architectures in the open-source ecosystem.

The concerns are at the system level: the auto-reply engine creates broad coupling across the codebase, session management could become a bottleneck at scale, and the trusted-operator security model means scaling beyond single-user deployment requires fundamental architectural changes.


The CVE Context

Our analysis doesn't perform security auditing or penetration testing. The 8 CVEs were discovered by dedicated security researchers at DepthFirst, Oasis Security, Endor Labs, and others -- not by code quality tooling.

But the architectural conditions that enabled those CVEs are exactly what our analysis detects:

No plugin isolation → ClawHavoc supply chain attack

Plugins share process memory and global state. 824+ malicious skills exploited this to install keyloggers and steal credentials.

No replay protection → Webhook exploitation

Missing nonce/timestamp validation on webhook endpoints. Captured requests can be replayed indefinitely.
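Basic replay protection combines a freshness window, a nonce cache, and a signature that covers both. A sketch of what such a check could look like (the helper names, header layout, and 5-minute window are assumptions, not any real webhook spec):

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Nonce cache; production code would use bounded/TTL'd storage.
const seenNonces = new Set<string>();

// Hypothetical sender-side helper: HMAC over timestamp, nonce, and body.
export function signWebhook(secret: string, body: string, nonce: string, timestamp: number): string {
  return createHmac("sha256", secret).update(`${timestamp}.${nonce}.${body}`).digest("hex");
}

export function verifyWebhook(
  secret: string,
  body: string,
  nonce: string,
  timestamp: number, // sender's clock, epoch seconds
  signature: string, // hex HMAC of `${timestamp}.${nonce}.${body}`
  nowSec = Math.floor(Date.now() / 1000),
): boolean {
  if (Math.abs(nowSec - timestamp) > 300) return false; // stale: >5 min skew
  if (seenNonces.has(nonce)) return false;              // replayed
  const expected = createHmac("sha256", secret)
    .update(`${timestamp}.${nonce}.${body}`)
    .digest();
  const given = Buffer.from(signature, "hex");
  // Length check first: timingSafeEqual throws on mismatched lengths.
  if (given.length !== expected.length || !timingSafeEqual(given, expected)) {
    return false;
  }
  seenNonces.add(nonce); // record only after successful verification
  return true;
}
```

Because the timestamp and nonce are inside the signed payload, an attacker can't freshen a captured request without re-signing it.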

No routing loop detection → Agent hijacking

Multi-hop routing loops (A->B->C->A) are not detected. Combined with the ClawJacked WebSocket vulnerability, this created a full agent takeover path.
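Loop detection can be as simple as carrying the set of already-visited agents with each routed message, plus a hop cap as a backstop. A minimal sketch (types and names are illustrative):

```typescript
export interface RoutedMessage {
  body: string;
  visited: string[]; // agent ids this message has already passed through
}

// Hypothetical routing step: rejects A->B->C->A at the hop that closes
// the cycle, and bounds total hops even when ids differ every time.
export function routeTo(msg: RoutedMessage, nextAgent: string, maxHops = 8): RoutedMessage {
  if (msg.visited.includes(nextAgent)) {
    throw new Error(`routing loop detected: ${[...msg.visited, nextAgent].join(" -> ")}`);
  }
  if (msg.visited.length >= maxHops) {
    throw new Error("max routing hops exceeded");
  }
  return { ...msg, visited: [...msg.visited, nextAgent] };
}
```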

Credential handling gaps → Token exfiltration

Environment variables and config values contain credentials in plaintext. Combined with the unvalidated gatewayUrl parameter, this enabled one-click RCE.

Security firms have been consistent in their assessment. Cisco called personal AI agents like OpenClaw "a security nightmare." Aikido Security's initial audit found 512 vulnerabilities (8 critical). An academic paper on arXiv analyzed the systemic risks. The fundamental tension -- AI agents need broad system access to be useful, but that access creates attack surface -- remains unresolved.


The Response: 11,000+ Commits in 8 Weeks

To the OpenClaw team's credit, the response was swift. Every CVE was patched within days, often within 24 hours. They shipped 12 custom lint scripts encoding lessons from each vulnerability. They added mandatory browser authentication, SSRF deny policies, TLS 1.3 enforcement, and proper path validation.

The scale of the response is staggering: over 11,000 PRs merged between the first CVE disclosure and today. 315,000 lines of test code added. An entire src/security/ directory built from scratch.

But rapid remediation under pressure creates its own debt. The convention compliance dropped. The code smell count tripled. The consistency issues quadrupled. The code got more secure while getting structurally messier. That tradeoff was reasonable in a crisis -- but it means the codebase now carries the weight of both the original architectural gaps and the rushed remediation.


What This Tells Us

Stars don't equal quality

285,000 GitHub stars. A C grade. The most-starred project on GitHub has genuine security issues, shallow test coverage despite massive test volume, and runtime behaviors that fail silently. Popularity measures interest, not engineering maturity.

Test volume is not test quality

480,000 lines of test code. 29% structural coverage. A test quality score of 36. You can have a 43.7% test-to-code ratio and still have most of your codebase untested. Volume is a vanity metric without coverage breadth.

Code quality hides behavioral issues

The code is syntactically clean. It follows patterns. The architecture is well-designed. And yet: messages are silently lost on abort, plugins can't be unloaded, rate limits aren't handled, and spawned processes have no resource caps. These behavioral issues only surface through deep semantic analysis -- they're invisible to linters and code review.

You need all four dimensions

Static analysis found the XSS and hard credentials. Graph analysis found the structural coupling and god classes. Convention detection found the consistency drift. LLM analysis found the behavioral risks -- plugin isolation, message loss, missing rate limiting. No single dimension gives the full picture. You need all four working together.


Methodology

Analysis was run using Octokraft's full pipeline against the latest commit on the default branch of openclaw/openclaw as of March 15, 2026. The pipeline includes:

  • Static analysis via Biome and Semgrep for code quality, security patterns, and compliance
  • Graph analysis via FalkorDB for structural patterns, coupling metrics, and dependency analysis
  • Convention detection for identifying and measuring adherence to coding patterns across the codebase
  • LLM-powered semantic analysis for behavioral risks, architectural gaps, and security boundary assessment

The CVEs discussed were discovered by DepthFirst, Oasis Security, Endor Labs, and other security researchers. Our pipeline detects architectural conditions that enable vulnerabilities, not the vulnerabilities themselves.
