The Contenders
| | OpenAI Codex | Gemini CLI | OpenCode |
|---|---|---|---|
| Language | Rust | TypeScript | TypeScript |
| Lines of Code | 443,699 | 399,417 | 419,527 |
| Test LOC | 83,689 | 257,893 | 34,902 |
| Open PRs | 165 | 472 | 490 |
| 30-day activity | +12.4K / -1.6K | +11.2K / -5.1K | +10.8K / -5.2K |
All three are actively developed, similarly sized codebases (~400-440K LOC). What separates them is everything under the hood.
Score Breakdown by Category
Each category is scored 0-100. Higher is better.
| Category | Codex | Gemini CLI | OpenCode |
|---|---|---|---|
| Security | 100 | 99.98 | 99.41 |
| Runtime Risks | 100 | 80.87 | 50.24 |
| Test Coverage | 100 | 86.47 | 91.88 |
| Code Smells | 57.94 | 34.57 | 34.02 |
| Duplication | 100 | 100 | 100 |
| Dead Code | 97.37 | 97.62 | 98.13 |
| Consistency | 98.34 | 16.32 | 25.00 |
| Compliance | 100 | 100 | 100 |
Issue Density: Rust vs TypeScript
Codex has 6-8x fewer issues per line of code than both TypeScript projects. This is partly Rust's strict compiler: the kinds of problems it rejects at build time are the same kinds that surface as analysis findings in TypeScript. Rust's ownership model, exhaustive pattern matching, and strong type system eliminate entire categories of bugs before they reach production.
Issue Severity
| Severity | Codex | Gemini CLI | OpenCode |
|---|---|---|---|
| Critical | 0 | 0 | 0 |
| High | 0 | 32 | 6 |
| Medium | 439 | 3,564 | 3,030 |
| Low | 50 | 45 | 34 |
Zero high-severity issues for Codex. Gemini CLI has the most, with 32 high-severity findings: 25 in code smells and 7 in testing.
Where the Issues Live
| Category | Codex | Gemini CLI | OpenCode |
|---|---|---|---|
| Code Smells | 431 | 783 | 1,067 |
| Consistency | 10 | 2,734 | 1,678 |
| Runtime | 0 | 63 | 277 |
| Dead Code | 48 | 39 | 32 |
| Testing | 0 | 25 | 15 |
| Security | 0 | 2 | 1 |
Consistency is where Gemini CLI and OpenCode bleed points. With 2,734 consistency issues, Gemini CLI's score in this category drops to just 16/100.
OpenCode has the most runtime risk issues (277), 4.4x more than Gemini CLI, pulling its runtime risks score down to 50/100.
Where the Findings Come From
| Source | Codex | Gemini CLI | OpenCode |
|---|---|---|---|
| Graph Analysis | 477 (97.5%) | 257 (7.0%) | 131 (4.3%) |
| Static Analyzers | 11 (2.2%) | 3,357 (92.1%) | 2,924 (95.2%) |
| LLM Agent | 1 (0.2%) | 32 (0.9%) | 15 (0.5%) |
The static analyzer numbers deserve context, and some honesty about our methodology.
Why Biome, not ESLint?
Octokraft uses Biome as its static analyzer for TypeScript/JavaScript projects. Biome is faster, more comprehensive, and applies a consistent ruleset across all projects regardless of what linting the project has configured. This gives us an apples-to-apples comparison, but it also means we're sometimes flagging things the project's own tooling would ignore.
What's Driving the Numbers
The top Biome findings across both TypeScript projects:
| Rule | Gemini CLI | OpenCode | What it catches |
|---|---|---|---|
| useLiteralKeys | 557 | 102 | obj["key"] instead of obj.key |
| noNonNullAssertion | 364 | 153 | TypeScript ! operator usage |
| noExplicitAny | 104 | 332 | Using the any type |
| useImportType | 121 | 82 | Missing import type |
| useTemplate | 57 | 146 | String concat vs template literals |
| noUnusedVariables | - | 67 | Declared but never used |
| noUnusedImports | - | 49 | Imported but never used |
Some of these are stylistic preferences (literal keys, template strings). Others are genuine code quality signals, particularly noExplicitAny and noUnusedVariables.
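To make concrete why noExplicitAny is more than a style nit, here is a small illustrative sketch (our own example, not code from any of the three projects) contrasting `any` with `unknown`:

```typescript
// With `any`, a typo compiles cleanly and fails silently at runtime:
function countWithAny(payload: any): number {
  return payload.cuont; // typo: compiles fine, returns undefined
}

// With `unknown`, the compiler forces explicit narrowing before use:
function countWithUnknown(payload: unknown): number {
  if (
    typeof payload === "object" &&
    payload !== null &&
    "count" in payload &&
    typeof (payload as { count: unknown }).count === "number"
  ) {
    return (payload as { count: number }).count;
  }
  throw new Error("payload has no numeric 'count' field");
}
```

The `unknown` version is noisier, but every assumption about the payload's shape is now visible and checked, which is exactly what the `any`-heavy code below gives up.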
To see what 332 any uses actually look like in practice, here's a function from OpenCode's OpenAI provider adapter:
```typescript
export function fromOpenaiRequest(body: any): CommonRequest {
  if (!body || typeof body !== "object") return body
  const toImg = (p: any) => {
    if (!p || typeof p !== "object") return undefined
    if ((p as any).type === "image_url" && (p as any).image_url)
      return { type: "image_url", image_url: (p as any).image_url }
    if ((p as any).type === "input_image" && (p as any).image_url)
      return { type: "image_url", image_url: (p as any).image_url }
    const s = (p as any).source
    if (!s || typeof s !== "object") return undefined
    if ((s as any).type === "url" && typeof (s as any).url === "string")
      return { type: "image_url", image_url: { url: (s as any).url } }
    if (
      (s as any).type === "base64" &&
      typeof (s as any).media_type === "string" &&
      typeof (s as any).data === "string"
    )
      return {
        type: "image_url",
        image_url: { url: `data:${(s as any).media_type};base64,${(s as any).data}` },
      }
  }
  // ...
```

That's 15+ as any casts in a single function. The parameter p is already typed any, so every (p as any) cast is redundant, and the nested (s as any) casts show a lack of type narrowing. A single interface definition would eliminate all of these and catch shape mismatches at compile time.
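As a sketch of what that interface could look like, here is one possible discriminated union covering the shapes the casts above imply. The type and function names are ours, not OpenCode's, and this simplifies some cases (e.g. it assumes image_url is always an object):

```typescript
// Hypothetical types inferred from the casts in the excerpt above.
interface ImageUrlPart {
  type: "image_url" | "input_image";
  image_url: { url: string };
}

interface SourcePart {
  source:
    | { type: "url"; url: string }
    | { type: "base64"; media_type: string; data: string };
}

type ContentPart = ImageUrlPart | SourcePart;

function toImg(p: ContentPart): ImageUrlPart | undefined {
  // Discriminated-union narrowing replaces every `as any` cast.
  if ("type" in p && p.image_url) {
    return { type: "image_url", image_url: p.image_url };
  }
  if ("source" in p) {
    const s = p.source;
    if (s.type === "url") {
      return { type: "image_url", image_url: { url: s.url } };
    }
    return {
      type: "image_url",
      image_url: { url: `data:${s.media_type};base64,${s.data}` },
    };
  }
  return undefined;
}
```

The runtime behavior is the same, but a malformed caller now fails at compile time instead of producing `undefined` somewhere downstream.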
This pattern extends beyond business logic into core utilities. Here's OpenCode's logging module:
```typescript
export namespace Log {
  const ctx = Context.create<{
    tags: Record<string, any>
  }>()
  export function create(tags?: Record<string, any>) {
    const result = {
      info(message?: any, extra?: Record<string, any>) {
        const prefix = Object.entries({
          ...use().tags,
          ...tags,
          ...extra,
        })
          .map(([key, value]) => `${key}=${value}`)
          .join(" ")
        console.log(prefix, message)
        return result
      },
      // ...
    }
  }
}
```

Five any types in a 55-line file. The tags, message, and extra parameters could all be typed; Record<string, string | number | boolean> would cover most logging use cases. When any lives in core utilities, it propagates through every call site.
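A minimal typed version might look like the sketch below. This is our own illustration, not OpenCode's API (the Context/use machinery is omitted), but the formatting logic is the same:

```typescript
// Narrow union instead of `any`: anything that stringifies cleanly.
type TagValue = string | number | boolean;
type Tags = Record<string, TagValue>;

function createLogger(baseTags: Tags = {}) {
  return {
    // `message` and `extra` are now checked at every call site.
    info(message: string, extra: Tags = {}): string {
      const prefix = Object.entries({ ...baseTags, ...extra })
        .map(([key, value]) => `${key}=${value}`)
        .join(" ");
      const line = `${prefix} ${message}`.trim();
      console.log(line);
      return line;
    },
  };
}
```

With this signature, passing an object or a function as a tag value becomes a compile error instead of a `[object Object]` in the logs.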
What Their Own Linters Say
We dug into each project's linting setup. The contrast is striking.
Gemini CLI: Thorough ESLint config
A 400-line ESLint config explicitly sets @typescript-eslint/no-explicit-any: "error", meaning their own rules ban any. The 104 Biome findings for this rule represent violations of Google's own standards that slipped through. They also enforce consistent-type-imports, no-floating-promises, no-unsafe-assignment, and custom rules like banning raw fetch() in favor of safeFetch() for SSRF protection.
OpenCode: No linter, but context matters
No ESLint. No Biome. No Prettier. The only check is bun turbo typecheck, and even that runs without strict: true in tsconfig. The 332 any uses, 67 unused variables, and 49 unused imports have no automated guardrails. For a smaller, community-driven project competing against OpenAI and Google, this is understandable: lint configs aren't the first thing you set up when you're shipping fast. But a lint config is the single highest-leverage improvement they could make.
Codex: Strict Clippy
Rust's Clippy with 36+ deny-level lints in Cargo.toml makes this the strictest configuration of the three. Rust's compiler plus Clippy leave very little for external analysis to find, which is why 97.5% of Codex's findings come from graph analysis (structural patterns) rather than static linting.
Testing: A Tale of Three Philosophies
| Metric | Codex | Gemini CLI | OpenCode |
|---|---|---|---|
| Test-to-Code Ratio | 18.9% | 64.6% | 8.3% |
| Structural Coverage | 19.5% | 45.3% | 10.1% |
| Assertion Density | 1.22 | 0.25 | 0.28 |
| Mock Usage | 15.5% | 15.8% | 0.6% |
Gemini CLI has a staggering 64.6% test-to-code ratio: nearly two-thirds of the codebase is test code. It also has the highest structural coverage at 45.3%, backed by 738+ test files. Google's testing culture shows.
Codex writes fewer but denser tests. Despite lower coverage numbers, Codex's assertion density (1.22 assertions per test function) is 4-5x higher than either TypeScript project. Fewer tests, but each one validates more.
OpenCode has the weakest test story at 8.3% test ratio and only 10.1% structural coverage, with near-zero mock usage (0.6%). This is a real risk area as the project scales.
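To illustrate what assertion density actually measures, here is a toy sketch (our own, not code from any of the three projects). Both tests below "cover" parseDuration, but only the dense one pins down its behavior:

```typescript
function parseDuration(s: string): number {
  const m = /^(\d+)(ms|s)$/.exec(s);
  if (!m) throw new Error(`invalid duration: ${s}`);
  return m[2] === "s" ? Number(m[1]) * 1000 : Number(m[1]);
}

// Shallow test: runs the code but asserts nothing; it passes as long as
// nothing throws. Suites full of these drive density toward 0.25.
function testShallow() {
  parseDuration("5s");
}

// Dense test: several assertions per test function, closer to Codex's 1.22.
function testDense() {
  if (parseDuration("5s") !== 5000) throw new Error("seconds case");
  if (parseDuration("250ms") !== 250) throw new Error("millis case");
  let threw = false;
  try { parseDuration("5h"); } catch { threw = true; }
  if (!threw) throw new Error("should reject unknown unit");
}
```

The shallow test would happily keep passing if parseDuration started returning the wrong number; the dense one would not.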
Architecture Quality
All three received a B architecture rating, but the dimension scores reveal different strengths:
| Dimension | Codex | Gemini CLI | OpenCode |
|---|---|---|---|
| Modularity | 78 | 82 | 78 |
| Coupling | 72 | 72 | 72 |
| Scalability | 75 | 78 | 75 |
| Patterns | 82 | 85 | 82 |
Gemini CLI leads in every architecture dimension. Its monorepo structure with dedicated packages demonstrates Google's experience building large-scale TypeScript applications. The policy engine, hook system, and declarative tool abstractions are standout patterns.
That said, even Gemini CLI has structural debt. Our graph analysis flagged ClearcutLogger, a single telemetry class spanning 1,941 lines, as a god class:
```typescript
// 1,941 lines, 61 methods
export class ClearcutLogger {
  private static instance: ClearcutLogger;
  private config?: Config;
  private sessionData: EventValue[] = [];
  private promptId: string = '';
  private readonly installationManager: InstallationManager;
  private readonly userAccountManager: UserAccountManager;
  private readonly hashedGHRepositoryName?: string;
  private readonly events: FixedDeque<LogEventEntry[]>;
  private lastFlushTime: number = Date.now();
  private flushing: boolean = false;
  private pendingFlush: boolean = false;
  // ... 61 methods follow
}
```

Similarly, their Storage class (403 lines, 54 methods) handles everything from session management to OAuth token paths to plan directories. Both are candidates for decomposition.
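One possible direction for that decomposition, sketched with hypothetical names of our own (not Gemini CLI's), is to peel buffering and flush scheduling out of the logger into small, independently testable collaborators:

```typescript
// Bounded event buffer: owns only storage and eviction.
class EventBuffer<T> {
  private events: T[] = [];
  constructor(private readonly capacity: number) {}
  push(event: T): void {
    if (this.events.length >= this.capacity) this.events.shift(); // drop oldest
    this.events.push(event);
  }
  drain(): T[] {
    const batch = this.events;
    this.events = [];
    return batch;
  }
  get size(): number {
    return this.events.length;
  }
}

// Flush scheduler: owns only the "is it time yet?" decision.
class FlushScheduler {
  private lastFlush: number;
  constructor(private readonly intervalMs: number, now: number = Date.now()) {
    this.lastFlush = now;
  }
  shouldFlush(now: number = Date.now()): boolean {
    return now - this.lastFlush >= this.intervalMs;
  }
  markFlushed(now: number = Date.now()): void {
    this.lastFlush = now;
  }
}
```

The telemetry class would then compose these plus a transport, shrinking from 61 methods to a thin coordinator, and each piece can be unit-tested without mocking the network.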
Codex's Rust workspace with 70+ crates is impressive in granularity, though the core crate has grown into a monolithic hub. The main codex.rs file is 7,226 lines with 54 public functions; it's the central nervous system that everything routes through. This is the primary architectural concern our graph analysis flagged.
OpenCode's 20+ package monorepo is well-structured with an event bus pattern and multi-platform support (CLI, web, desktop, cloud), but large files and cross-module dependencies create maintenance pressure.
Architectural Highlights
Codex
- Protocol abstraction layer
- Platform-specific sandboxing
- SQ/EQ async communication
- 36+ Clippy deny-level lints
Gemini CLI
- Rule-based policy engine
- Comprehensive hook system
- Structured error hierarchy
- AbortSignal cancellation
OpenCode
- Async context DI
- SST infrastructure-as-code
- Zod + OpenAPI generation
- JSON-to-SQLite migration
Codebase Dynamics
| Metric | Codex | Gemini CLI | OpenCode |
|---|---|---|---|
| Churn Rate | 3.16% | 0% | 3.80% |
| Refactoring Rate | 12.7% | 0% | 48.1% |
| Convention Consistency | 81.6% | 92.6% | 85.3% |
| Convention Deviations | 25 | 13 | 26 |
OpenCode's 48% refactoring rate is the standout number here: almost half of recent changes are refactors rather than new features. This signals a team actively investing in codebase quality, even without formal lint tooling. It's the kind of organic discipline that's hard to fake.
Gemini CLI's near-zero churn and refactoring is unusual, possibly reflecting a stable initial architecture or a recent major release that reset activity metrics. Their 92.6% convention consistency is the highest of the three, with only 13 deviations detected.
Key Takeaways
1. Rust's Compiler Is a Code Quality Multiplier
Codex's 1.1 issues per 1K LOC, versus 7-9 for the TypeScript projects, isn't just about coding skill; it's about language design. Rust's ownership model, exhaustive matching, and strict type system prevent entire categories of issues from existing.
2. Tests Can Be a Liability Without Density
Gemini CLI has a massive test suite (65% of the codebase), but with an assertion density of only 0.25, many of those tests may be shallow. Codex's approach of fewer tests with roughly 5x the assertions is arguably more valuable. Coverage percentage matters less than what you're actually asserting.
3. OpenCode Is the Underdog That Holds Its Own
OpenCode is a community-driven project competing against OpenAI and Google, and it scores a B with architecture scores identical to Codex. Its 48% refactoring rate shows a team actively investing in quality. The multi-platform support (CLI, web, desktop, cloud) is the most ambitious of the three. The one clear gap: zero lint enforcement. Adding a Biome or ESLint config would be the single highest-leverage improvement they could make, and would likely push that B toward a B+.
4. Consistency Debt Is the Debt Nobody Talks About
Both TypeScript projects have thousands of consistency issues. Most are stylistic: Biome preferences that differ from the project's conventions. But some are real: Gemini CLI's own ESLint config bans any, yet 104 violations slipped through. Convention drift makes every code review harder, every onboarding slower, and every refactor riskier.
5. Architecture Scores Are Remarkably Similar
All three converge around the same architecture grade (B) with identical coupling scores (72). The "coding agent" problem space naturally leads to certain architectural patterns -and all three teams are solving it competently.
6. Test Investment Separates Good From Great
The spread is dramatic: Gemini CLI at 65% test ratio, Codex at 19%, OpenCode at 8%. But raw coverage isn't the whole story; Codex's 5x assertion density suggests quality beats quantity. For OpenCode, getting from 8% to even 20% would significantly reduce the 277 runtime risk issues and build confidence for contributors working on those 490 open PRs.
Methodology
Each project was analyzed using Octokraft's full analysis pipeline, which combines three detection layers:
Layer 1: Static Analysis
Octokraft runs Biome for TypeScript/JavaScript and Clippy for Rust. We chose Biome over ESLint deliberately: it's faster (written in Rust), provides a unified formatter and linter, and applies a consistent ruleset across projects. This matters for benchmarks: if we used each project's own ESLint config, we'd be comparing different standards. Biome gives us one ruler for everyone.
Layer 2: Graph Analysis
Octokraft indexes every symbol, dependency, and call path into a FalkorDB knowledge graph. This detects structural issues no line-by-line linter can find: high fan-out modules, dead code (exported but never imported), god classes, circular dependencies, and coupling patterns.
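The dead-code check, for example, reduces to a set-difference query over the graph. Here is a deliberately toy sketch of the idea (our illustration, not Octokraft's implementation):

```typescript
// A symbol exported by some module.
interface SymbolNode {
  name: string;
  exportedBy: string; // module that exports it
}

// Symbols that are exported but never imported anywhere are
// dead-code candidates.
function findDeadExports(
  symbols: SymbolNode[],
  imports: Array<{ module: string; symbol: string }>,
): string[] {
  const imported = new Set(imports.map((i) => i.symbol));
  return symbols.filter((s) => !imported.has(s.name)).map((s) => s.name);
}
```

In the real pipeline this runs as a graph query over every indexed symbol, and has to account for entry points, public API surfaces, and dynamic imports, but the core relation is the same.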
Layer 3: LLM-Powered Analysis
An AI agent reads the actual source code and identifies semantic issues: incomplete migrations, test quality problems, hardcoded credentials, architectural anti-patterns. These are findings that require understanding intent, not just syntax.
On top of these, Octokraft runs convention detection, architecture review, and health scoring across 8 weighted categories.
What we'd improve
The Biome ruleset could be smarter about distinguishing stylistic preferences (useLiteralKeys) from genuine quality signals (noExplicitAny). A future version might weight rules differently or let projects bring their own config. We're also exploring running each project's own linter in addition to Biome, to show both "what their tools catch" and "what they're missing."
All analyses were run against the latest commit on the default branch as of March 11, 2026.