The Most-Starred Project on GitHub
OpenClaw is the fastest-growing open-source project in GitHub history. It reached 100,000 stars in two days; React took 8 years to get there, Linux took 12. It gained 25,310 stars in a single day on January 26 -- the highest single-day total ever recorded on GitHub.
It's also been called a "security nightmare" by Cisco, "structurally broken" by Conscia, and "ridiculous to try to secure" by Aikido Security. Its creator, Peter Steinberger, has since joined OpenAI. The project is being moved to an independent 501(c)(3) foundation.
We ran Octokraft's full analysis pipeline on the current codebase. Four analysis dimensions: static analysis for code quality, graph analysis for structural patterns, convention detection for consistency, and LLM-powered semantic analysis for behavioral risks. 1.1 million lines of code. 480,000 lines of tests.
The result: an overall score of 60 out of 100 -- a C grade.
The Full Scorecard
| Category | Score | What It Means |
|---|---|---|
| Security | 29.7 | 25 security issues including 2 critical, 5 high. Plugin sandboxing gaps, replay attacks, credential handling. |
| Test Quality | 36.0 | 480K test LOC but only 29% structural coverage. Shallow breadth despite massive volume. |
| Runtime Risks | 61.0 | 29 runtime issues. Message loss on abort, no plugin unloading, no rate limiting across channels. |
| Code Smell | 72.7 | 1,788 code smells. Session key coupling, unbounded growth in core orchestration. |
| Consistency | 81.7 | 1,387 consistency issues across 368 detected conventions. |
| Dead Code | 98.9 | 223 dead code items. Relatively clean. |
| Duplication | 100.0 | No significant duplication detected. |
| Compliance | 100.0 | No compliance violations. |
The three categories that drag the score down tell the real story: security (29.7), test quality (36.0), and runtime risks (61.0). These aren't style issues or nitpicks. They're the categories that determine whether your software works correctly and safely in production.
Security: 29.7 / 100
This is a project that had 8 CVEs disclosed in 6 weeks, had 42,665 exposed instances identified by security researchers (93.4% with authentication bypass), and had 20% of its plugin marketplace poisoned with malware. The security score reflects that history.
Critical Findings
Path traversal protection gaps in sandbox
src/agents/sandbox-paths.ts -- The sandbox path validation can be bypassed in edge cases involving symlinks and relative path resolution. This is the same class of vulnerability that led to CVE-2026-25475 (local file inclusion via MEDIA: path).
Blocked host paths -- incomplete coverage
src/agents/sandbox/validate-sandbox-security.ts -- The denylist blocks /etc, /proc, /sys, /dev, /root, /boot, and docker.sock. But a denylist is only as strong as it is complete: new mount points or platform-specific paths could bypass it.
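As an illustration of why this matters -- sketched with a hypothetical `isBlockedPath` helper and prefix list, not OpenClaw's actual implementation -- a denylist check is only meaningful after the candidate path has been canonicalized; a production version would also resolve symlinks (e.g. via `fs.realpathSync`) before comparing:

```typescript
import path from "node:path";

// Hypothetical denylist sketch. Resolving the path first normalizes "..",
// ".", and duplicate separators, so "/tmp/../etc/passwd" cannot slip past
// a naive prefix check. Symlink resolution is deliberately omitted here.
const BLOCKED_PREFIXES = ["/etc", "/proc", "/sys", "/dev", "/root", "/boot"];
const BLOCKED_FILES = ["/var/run/docker.sock"];

function isBlockedPath(candidate: string): boolean {
  const resolved = path.resolve(candidate);
  if (BLOCKED_FILES.includes(resolved)) return true;
  // Match the prefix itself or anything beneath it, but not "/etcetera".
  return BLOCKED_PREFIXES.some(
    (prefix) => resolved === prefix || resolved.startsWith(prefix + path.sep),
  );
}
```

Note the boundary check (`prefix + path.sep`): without it, a sibling directory like `/etcetera` would be blocked by accident, which is the kind of subtle bug denylists accumulate.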
High-Severity Security Issues
| Finding | File | Source |
|---|---|---|
| No replay attack protection for webhooks | extensions/bluebubbles/src/monitor.ts | LLM Agent |
| No process isolation for plugins | src/plugins/loader.ts | LLM Agent |
| No routing loop detection | src/auto-reply/reply/route-reply.ts | LLM Agent |
| XSS via direct Response write from user input | src/media/server.ts | Static (Semgrep) |
| Google service account credentials in test files | extensions/googlechat/ | Static (Semgrep) |
The mix of sources is telling. Static analysis caught the hardcoded credentials and XSS patterns. The LLM agent found the architectural security gaps -- no replay protection, no plugin isolation, no loop detection. Neither analysis alone would have given the full picture.
The Plugin Isolation Problem
This one deserves special attention. OpenClaw's plugins run in the same Node.js process as the gateway. No sandboxing. No VM isolation. No separate worker. A plugin has full access to the global state, can register HTTP handlers without authentication, and can read the config including credential paths.
This is exactly the architectural weakness that enabled the ClawHavoc supply chain attack. 824+ malicious skills were planted in the marketplace, installing keyloggers and credential stealers. The attack worked because plugins had unrestricted access to the host environment. Our analysis flagged this as a high-severity finding independently of the known CVEs.
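A minimal sketch of the missing boundary, using a hypothetical `runPluginIsolated` helper: running plugin code in a `worker_threads` Worker gives it a separate JavaScript context, so it cannot read the gateway's global state, and `resourceLimits` caps its heap. This is weaker than full process or VM isolation, but it illustrates the boundary the current loader lacks:

```typescript
import { Worker } from "node:worker_threads";

// Hypothetical sketch: evaluate untrusted plugin source in a worker thread
// instead of the gateway's main thread. Workers do not share the parent's
// globals, and resourceLimits bounds the worker's heap usage.
function runPluginIsolated(pluginSource: string): Promise<unknown> {
  return new Promise((resolve, reject) => {
    const worker = new Worker(pluginSource, {
      eval: true, // treat the string as the worker's (CommonJS) entry script
      resourceLimits: { maxOldGenerationSizeMb: 64 },
    });
    worker.once("message", resolve);
    worker.once("error", reject);
  });
}
```

A plugin run this way that tries to read, say, a `gatewaySecret` property off `globalThis` sees `undefined` -- the parent's globals simply aren't there.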
Test Quality: 36.0 / 100
OpenClaw has 480,598 lines of test code. That's a 43.7% test-to-code ratio -- higher than most production projects. And yet the test quality score is 36.
| Metric | Value | Assessment |
|---|---|---|
| Test Code Ratio | 43.7% | Excellent volume |
| Structural Coverage | 29.1% | Poor breadth |
| Assertion Density | 0.544 | Below average |
| Mock Usage | 18.4% | Moderate |
| Test LOC | 480,598 | Massive |
This is the paradox: 480,000 lines of tests covering only 29% of the codebase. That's an enormous amount of test code for a modest coverage footprint. It means the tests are deep but narrow -- concentrated on specific scenarios rather than broadly covering code paths.
The context matters. 82% of OpenClaw's test files were added after the CVEs were disclosed. The security audit tests are genuinely good -- adversarial scenarios, filesystem manipulation, environment edge cases. But they were written reactively, targeting specific vulnerability classes rather than building systematic coverage.
A project with 480K lines of test code should have better than 29% structural coverage. The gap between test volume and actual coverage is one of the clearest signals that test quantity is not test quality.
Runtime Risks: 61.0 / 100
Runtime issues are the "what happens when things go wrong" category. OpenClaw has 29 runtime findings, 4 of them high-severity.
Message loss on abort
When an AbortSignal fires mid-delivery, the delivery queue acks the entry (marks it done) but undelivered chunks are silently lost. No retry. No dead-letter queue. The user never knows their message was partially delivered.
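One way to avoid this class of loss -- sketched here with hypothetical names, not OpenClaw's actual queue API -- is to ack only after every chunk has been sent, and to hand unsent chunks back for requeue when the signal fires mid-delivery:

```typescript
// Hypothetical abort-safe delivery sketch: the entry is acked only when all
// chunks went out; on abort, the unsent remainder is returned for requeue
// (or a dead-letter queue) instead of being dropped along with the ack.
type Chunk = string;

interface DeliveryResult {
  acked: boolean;
  requeued: Chunk[];
}

async function deliverChunks(
  chunks: Chunk[],
  send: (chunk: Chunk) => Promise<void>,
  signal: AbortSignal,
): Promise<DeliveryResult> {
  for (let i = 0; i < chunks.length; i++) {
    if (signal.aborted) {
      // Abort mid-delivery: do NOT ack; report what was never sent.
      return { acked: false, requeued: chunks.slice(i) };
    }
    await send(chunks[i]);
  }
  return { acked: true, requeued: [] };
}
```

The key invariant is that `acked` and a non-empty `requeued` are mutually exclusive: either everything was delivered, or the caller knows exactly what remains.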
No plugin unloading
Once loaded, plugin records stay in the registry until process exit. Hooks, HTTP routes, and channels registered by plugins are never cleaned up. Disabling a plugin requires a full restart.
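The fix is a registry that tracks disposers, sketched here with a hypothetical `PluginRegistry` (not the actual loader): every hook, route, or channel registration records a cleanup function, and unloading a plugin runs all of them:

```typescript
// Hypothetical unloadable-registry sketch: registrations record a disposer
// per plugin, so disabling a plugin tears down its hooks/routes/channels
// without a process restart.
type Disposer = () => void;

class PluginRegistry {
  private disposers = new Map<string, Disposer[]>();

  register(pluginId: string, dispose: Disposer): void {
    const list = this.disposers.get(pluginId) ?? [];
    list.push(dispose);
    this.disposers.set(pluginId, list);
  }

  unload(pluginId: string): void {
    // Run every cleanup registered by this plugin, then forget it.
    for (const dispose of this.disposers.get(pluginId) ?? []) dispose();
    this.disposers.delete(pluginId);
  }

  isLoaded(pluginId: string): boolean {
    return this.disposers.has(pluginId);
  }
}
```
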
No rate limiting across channels
WhatsApp, Signal, Slack -- none of the channel extensions implement rate limit handling. When a provider throttles requests, the error is either swallowed or propagated as a generic failure.
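A minimal sketch of what throttle handling could look like, with hypothetical names: honor the provider's Retry-After on a 429, fall back to exponential backoff when none is given, and only surface the error once retries are exhausted:

```typescript
// Hypothetical retry sketch for provider throttling. `attempt` is whatever
// sends the message and reports the provider's status; `sleep` is injectable
// so the backoff is testable without real timers.
interface SendOutcome {
  status: number;
  retryAfterMs?: number; // parsed from the provider's Retry-After header
}

async function sendWithRetry(
  attempt: () => Promise<SendOutcome>,
  maxRetries = 3,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise((r) => setTimeout(r, ms)),
): Promise<SendOutcome> {
  let outcome = await attempt();
  for (let retry = 0; retry < maxRetries && outcome.status === 429; retry++) {
    // Prefer the provider's hint; otherwise back off exponentially.
    await sleep(outcome.retryAfterMs ?? 1000 * 2 ** retry);
    outcome = await attempt();
  }
  return outcome;
}
```
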
No resource limits on spawned processes
The process supervisor spawns child processes with output caps and timeouts but no memory or CPU limits. A misbehaving skill can consume unbounded resources.
These are behavioral issues that don't show up in traditional code quality metrics. The code is syntactically clean. It follows patterns. It has tests. But the runtime behavior -- what happens when connections drop, when providers rate-limit, when plugins misbehave -- has significant gaps.
Code Quality and Consistency: The Structural View
The static and graph analysis layers found 3,175 structural issues (1,788 code smells + 1,387 consistency issues). These are the high-volume, lower-severity findings from Biome/Semgrep static analysis and FalkorDB graph rules.
| Source | Issues | What It Catches |
|---|---|---|
| Static Analyzer | 2,326 | Linting violations, unsafe patterns, dead assignments, type issues |
| Graph Rules | 1,222 | High fan-out, circular dependencies, god classes, coupling metrics |
| LLM Agent | 44 | Behavioral risks, architectural gaps, security boundaries |
The convention analysis detected 368 coding conventions across the codebase with an average compliance of 86.4%. Most conventions are well-followed. But the outliers are informative:
| Convention | Compliance | Category |
|---|---|---|
| Custom error class extension | 25% | Error handling |
| Zod schema strict mode | 33% | API design |
| Private field prefix (# vs _) | 33% | Naming |
| Client instance caching | 40% | Structure |
| Env dependency injection | 40% | API design |
| JSDoc on public utilities | 43% | Documentation |
When a project establishes a convention and only follows it 25-40% of the time, it creates cognitive overhead for every contributor. Is the pattern the intended approach, or the exception? With 1,075 contributors and 11,000+ commits in the post-CVE period, convention drift is inevitable without automated enforcement.
Architecture: B Overall, With Scaling Concerns
| Dimension | Score |
|---|---|
| Modularity | 78 |
| Patterns | 80 |
| Coupling | 72 |
| Scalability | 65 |
The architecture review is the one bright spot. OpenClaw's plugin system is genuinely well-designed -- 50+ channel extensions with clean SDK boundaries, consistent plugin hooks, factory patterns, and TypeBox/Zod validation. The channel adapter interface is one of the better plugin architectures in the open-source ecosystem.
The concerns are at the system level: the auto-reply engine creates broad coupling across the codebase, session management could become a bottleneck at scale, and the trusted-operator security model means scaling beyond single-user deployment requires fundamental architectural changes.
The CVE Context
Our analysis doesn't perform security auditing or penetration testing. The 8 CVEs were discovered by dedicated security researchers at DepthFirst, Oasis Security, Endor Labs, and others -- not by code quality tooling.
But the architectural conditions that enabled those CVEs are exactly what our analysis detects:
No plugin isolation → ClawHavoc supply chain attack
Plugins share process memory and global state. 824+ malicious skills exploited this to install keyloggers and steal credentials.
No replay protection → Webhook exploitation
Missing nonce/timestamp validation on webhook endpoints. Captured requests can be replayed indefinitely.
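Replay protection of the kind this finding describes can be sketched as follows (hypothetical `acceptWebhook`; a real implementation would also verify an HMAC signature over the payload):

```typescript
// Hypothetical replay-protection sketch: reject requests whose timestamp is
// outside a freshness window, then reject any nonce already seen within it.
const WINDOW_MS = 5 * 60 * 1000; // 5-minute freshness window
const seenNonces = new Map<string, number>(); // nonce -> timestamp

function acceptWebhook(
  nonce: string,
  timestampMs: number,
  now = Date.now(),
): boolean {
  if (Math.abs(now - timestampMs) > WINDOW_MS) return false; // stale or future-dated
  if (seenNonces.has(nonce)) return false; // replayed request
  seenNonces.set(nonce, timestampMs);
  // Evict expired nonces so the set stays bounded by the window.
  for (const [n, t] of seenNonces) {
    if (now - t > WINDOW_MS) seenNonces.delete(n);
  }
  return true;
}
```

Because nonces outside the window are rejected by the timestamp check alone, the nonce set never needs to grow beyond one window's worth of traffic.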
No routing loop detection → Agent hijacking
Multi-hop routing loops (A->B->C->A) are not detected. Combined with the ClawJacked WebSocket vulnerability, this created a full agent takeover path.
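Loop detection itself is cheap -- a hop budget plus a visited set carried with the message, sketched here with hypothetical names:

```typescript
// Hypothetical loop-detection sketch: each routed message records the agents
// it has passed through; a route that revisits an agent (A->B->C->A) or
// exceeds the hop budget is refused instead of being forwarded.
const MAX_HOPS = 8; // illustrative budget, not OpenClaw's actual limit

interface RoutedMessage {
  visited: string[]; // agent ids, in routing order
}

function canRouteTo(msg: RoutedMessage, nextAgent: string): boolean {
  if (msg.visited.length >= MAX_HOPS) return false; // hop budget exhausted
  return !msg.visited.includes(nextAgent); // refuse any revisit
}
```
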
Credential handling gaps → Token exfiltration
Environment variables and config values contain credentials in plaintext. Combined with the unvalidated gatewayUrl parameter, this enabled one-click RCE.
Security firms have been consistent in their assessment. Cisco called personal AI agents like OpenClaw "a security nightmare." Aikido Security's initial audit found 512 vulnerabilities (8 critical). An academic paper on arXiv analyzed the systemic risks. The fundamental tension -- AI agents need broad system access to be useful, but that access creates attack surface -- remains unresolved.
The Response: 11,000+ Commits in 8 Weeks
To the OpenClaw team's credit, the response was swift. Every CVE was patched within days, often within 24 hours. They shipped 12 custom lint scripts encoding lessons from each vulnerability. They added mandatory browser authentication, SSRF deny policies, TLS 1.3 enforcement, and proper path validation.
The scale of the response is staggering: over 11,000 commits merged between the first CVE disclosure and today. 315,000 lines of test code added. An entire src/security/ directory built from scratch.
But rapid remediation under pressure creates its own debt. The convention compliance dropped. The code smell count tripled. The consistency issues quadrupled. The code got more secure while getting structurally messier. That tradeoff was reasonable in a crisis -- but it means the codebase now carries the weight of both the original architectural gaps and the rushed remediation.
What This Tells Us
Stars don't equal quality
285,000 GitHub stars. A C grade. The most-starred project on GitHub has genuine security issues, shallow test coverage despite massive test volume, and runtime behaviors that fail silently. Popularity measures interest, not engineering maturity.
Test volume is not test quality
480,000 lines of test code. 29% structural coverage. A test quality score of 36. You can have a 43.7% test-to-code ratio and still have most of your codebase untested. Volume is a vanity metric without coverage breadth.
Code quality hides behavioral issues
The code is syntactically clean. It follows patterns. The architecture is well-designed. And yet: messages are silently lost on abort, plugins can't be unloaded, rate limits aren't handled, and spawned processes have no resource caps. These behavioral issues only surface through deep semantic analysis -- they're invisible to linters and code review.
You need all four dimensions
Static analysis found the XSS and hard credentials. Graph analysis found the structural coupling and god classes. Convention detection found the consistency drift. LLM analysis found the behavioral risks -- plugin isolation, message loss, missing rate limiting. No single dimension gives the full picture. You need all four working together.
Methodology
Analysis was run using Octokraft's full pipeline against the latest commit on the default branch of openclaw/openclaw as of March 15, 2026. The pipeline includes:
- Static analysis via Biome and Semgrep for code quality, security patterns, and compliance
- Graph analysis via FalkorDB for structural patterns, coupling metrics, and dependency analysis
- Convention detection for identifying and measuring adherence to coding patterns across the codebase
- LLM-powered semantic analysis for behavioral risks, architectural gaps, and security boundary assessment
The CVEs discussed were discovered by DepthFirst, Oasis Security, Endor Labs, and other security researchers. Our pipeline detects architectural conditions that enable vulnerabilities, not the vulnerabilities themselves.