Benchmark · 12 min read · Mar 12, 2026

Code Quality Showdown: OpenAI Codex vs Google Gemini CLI vs OpenCode

We ran Octokraft's full analysis pipeline on three of the most talked-about open-source coding agents. Here's what we found.

OpenAI Codex: A (94.8) · Rust · 443K LOC

Google Gemini CLI: B+ (80.1) · TypeScript · 399K LOC

OpenCode: B (76.4) · TypeScript · 420K LOC

The Contenders

| Metric | OpenAI Codex | Gemini CLI | OpenCode |
|---|---|---|---|
| Language | Rust | TypeScript | TypeScript |
| Lines of Code | 443,699 | 399,417 | 419,527 |
| Test LOC | 83,689 | 257,893 | 34,902 |
| Open PRs | 165 | 472 | 490 |
| 30-day activity | +12.4K / -1.6K | +11.2K / -5.1K | +10.8K / -5.2K |

All three are actively developed, similarly sized codebases (~400-440K LOC). What separates them is everything under the hood.


Score Breakdown by Category

Each category is scored 0-100. Higher is better.

| Category | Codex | Gemini CLI | OpenCode |
|---|---|---|---|
| Security | 100 | 99.98 | 99.41 |
| Runtime Risks | 100 | 80.87 | 50.24 |
| Test Coverage | 100 | 86.47 | 91.88 |
| Code Smells | 57.94 | 34.57 | 34.02 |
| Duplication | 100 | 100 | 100 |
| Dead Code | 97.37 | 97.62 | 98.13 |
| Consistency | 98.34 | 16.32 | 25.00 |
| Compliance | 100 | 100 | 100 |

Issue Density: Rust vs TypeScript

  • Codex: 1.1 issues / 1K LOC
  • OpenCode: 7.3 issues / 1K LOC
  • Gemini CLI: 9.1 issues / 1K LOC

Codex has 6-8x fewer issues per line of code than both TypeScript projects. This is partly Rust's strict compiler catching problems at build time that surface as analysis findings in TypeScript. Rust's ownership model, exhaustive pattern matching, and strong type system eliminate entire categories of bugs before they reach production.

Issue Severity

| Severity | Codex | Gemini CLI | OpenCode |
|---|---|---|---|
| Critical | 0 | 0 | 0 |
| High | 0 | 32 | 6 |
| Medium | 439 | 3,564 | 3,030 |
| Low | 50 | 45 | 34 |

Zero high-severity issues for Codex. Gemini CLI has the most, with 32 high-severity findings, mostly in testing (7) and code smells (25).

Where the Issues Live

| Category | Codex | Gemini CLI | OpenCode |
|---|---|---|---|
| Code Smells | 431 | 783 | 1,067 |
| Consistency | 10 | 2,734 | 1,678 |
| Runtime | 0 | 63 | 277 |
| Dead Code | 48 | 39 | 32 |
| Testing | 0 | 25 | 15 |
| Security | 0 | 2 | 1 |

Consistency is where Gemini CLI and OpenCode bleed points. With 2,734 consistency issues, Gemini CLI's score in this category drops to just 16/100.

OpenCode has the most runtime risk issues (277), 4.4x more than Gemini CLI, pulling its runtime risks score down to 50/100.


Where the Findings Come From

| Source | Codex | Gemini CLI | OpenCode |
|---|---|---|---|
| Graph Analysis | 477 (97.5%) | 257 (7.0%) | 131 (4.3%) |
| Static Analyzers | 11 (2.2%) | 3,357 (92.1%) | 2,924 (95.2%) |
| LLM Agent | 1 (0.2%) | 32 (0.9%) | 15 (0.5%) |

The static analyzer numbers deserve context, and some honesty about our methodology.

Why Biome, not ESLint?

Octokraft uses Biome as its static analyzer for TypeScript/JavaScript projects. Biome is faster, more comprehensive, and applies a consistent ruleset across all projects regardless of what linting each project has configured. This gives us an apples-to-apples comparison, but it also means we're sometimes flagging things the project's own tooling would ignore.

What's Driving the Numbers

The top Biome findings across both TypeScript projects:

| Rule | Gemini CLI | OpenCode | What it catches |
|---|---|---|---|
| useLiteralKeys | 557 | 102 | obj["key"] instead of obj.key |
| noNonNullAssertion | 364 | 153 | TypeScript ! operator usage |
| noExplicitAny | 104 | 332 | Using the any type |
| useImportType | 121 | 82 | Missing import type |
| useTemplate | 57 | 146 | String concat vs template literals |
| noUnusedVariables | - | 67 | Declared but never used |
| noUnusedImports | - | 49 | Imported but never used |

Some of these are stylistic preferences (literal keys, template strings). Others are genuine code quality signals, particularly noExplicitAny and noUnusedVariables.

To see what 332 any uses actually look like in practice, here's a function from OpenCode's OpenAI provider adapter:

opencode/packages/console/app/src/routes/zen/util/provider/openai.ts
export function fromOpenaiRequest(body: any): CommonRequest {
  if (!body || typeof body !== "object") return body

  const toImg = (p: any) => {
    if (!p || typeof p !== "object") return undefined
    if ((p as any).type === "image_url" && (p as any).image_url)
      return { type: "image_url", image_url: (p as any).image_url }
    if ((p as any).type === "input_image" && (p as any).image_url)
      return { type: "image_url", image_url: (p as any).image_url }
    const s = (p as any).source
    if (!s || typeof s !== "object") return undefined
    if ((s as any).type === "url" && typeof (s as any).url === "string")
      return { type: "image_url", image_url: { url: (s as any).url } }
    if (
      (s as any).type === "base64" &&
      typeof (s as any).media_type === "string" &&
      typeof (s as any).data === "string"
    )
      return {
        type: "image_url",
        image_url: { url: `data:${(s as any).media_type};base64,${(s as any).data}` },
      }
  }
  // ... (rest of the function elided)
}

That's 15+ as any casts in a single function. The parameter p is already typed any, so every (p as any) cast is redundant, and the nested (s as any) casts show a lack of type narrowing. A single interface definition would eliminate all of them and catch shape mismatches at compile time.
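As a sketch of what that interface-based version could look like (the type and field names below are ours, inferred from the snippet above, and are not OpenCode's actual definitions), a discriminated union lets the compiler do the narrowing that the as any casts bypass:

```typescript
// Hypothetical types inferred from the snippet above; not OpenCode's definitions.
interface UrlSource { type: "url"; url: string }
interface Base64Source { type: "base64"; media_type: string; data: string }
interface ImagePart {
  type?: string
  image_url?: { url: string }
  source?: UrlSource | Base64Source
}

function toImg(p: ImagePart): { type: "image_url"; image_url: { url: string } } | undefined {
  if (p.image_url && (p.type === "image_url" || p.type === "input_image"))
    return { type: "image_url", image_url: p.image_url }
  const s = p.source
  if (!s) return undefined
  if (s.type === "url")
    return { type: "image_url", image_url: { url: s.url } }
  // s is narrowed to Base64Source here; no cast needed.
  return { type: "image_url", image_url: { url: `data:${s.media_type};base64,${s.data}` } }
}
```

With the union in place, a base64 source missing its data field is rejected at build time, which is exactly the class of shape mismatch the any version silently accepts.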

This pattern extends beyond business logic into core utilities. Here's OpenCode's logging module:

opencode/packages/console/core/src/util/log.ts
export namespace Log {
  const ctx = Context.create<{
    tags: Record<string, any>
  }>()

  export function create(tags?: Record<string, any>) {
    const result = {
      info(message?: any, extra?: Record<string, any>) {
        const prefix = Object.entries({
          ...use().tags,
          ...tags,
          ...extra,
        })
          .map(([key, value]) => `${key}=${value}`)
          .join(" ")
        console.log(prefix, message)
        return result
      },
      // ...
    }
    return result
  }
}

Five any types in a 55-line file. The tags, message, and extra parameters could all be typed: Record<string, string | number | boolean> would cover most logging use cases. When any lives in core utilities, it propagates through every callsite.
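A minimal sketch of that typed alternative (the createLogger name is ours; this is not OpenCode's actual API):

```typescript
// Typed tags: wide enough for logging, narrow enough to catch mistakes.
type Tags = Record<string, string | number | boolean>

function createLogger(tags: Tags = {}) {
  return {
    info(message: string, extra: Tags = {}) {
      const prefix = Object.entries({ ...tags, ...extra })
        .map(([key, value]) => `${key}=${value}`)
        .join(" ")
      console.log(prefix, message)
      return prefix // returned so callers (and tests) can inspect it
    },
  }
}
```

Passing an object or a function as a tag value now fails at compile time instead of serializing as [object Object] at runtime.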

What Their Own Linters Say

We dug into each project's linting setup. The contrast is striking.

Gemini CLI: Thorough ESLint config

A 400-line ESLint config that explicitly sets @typescript-eslint/no-explicit-any: "error"; their own rules ban any. The 104 Biome findings for this rule represent violations of Google's own standards that slipped through. They also enforce consistent-type-imports, no-floating-promises, no-unsafe-assignment, and custom rules like banning raw fetch() in favor of safeFetch() for SSRF protection.
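We haven't reproduced the 400-line file here, but the rules described above look roughly like this in ESLint flat-config form (a sketch, not Gemini CLI's actual config; the safeFetch convention is approximated with a built-in restriction rule rather than their custom plugin):

```javascript
// eslint.config.js sketch only, not Gemini CLI's actual configuration
export default [
  {
    rules: {
      "@typescript-eslint/no-explicit-any": "error",
      "@typescript-eslint/consistent-type-imports": "error",
      "@typescript-eslint/no-floating-promises": "error",
      "@typescript-eslint/no-unsafe-assignment": "error",
      // Approximates their custom safeFetch rule:
      "no-restricted-globals": [
        "error",
        { name: "fetch", message: "Use safeFetch() for SSRF protection." },
      ],
    },
  },
];
```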

OpenCode: No linter, but context matters

No ESLint. No Biome. No Prettier. The only check is bun turbo typecheck, and even that runs without strict: true in tsconfig. The 332 any uses, 67 unused variables, and 49 unused imports have no automated guardrails. For a smaller, community-driven project competing against OpenAI and Google, this is understandable: lint configs aren't the first thing you set up when you're shipping fast. But adding one is the single highest-leverage improvement they could make.
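Even a minimal Biome config would put guardrails on the findings above. A sketch (rule-group placement can vary across Biome versions, so treat the group names as indicative):

```json
{
  "linter": {
    "enabled": true,
    "rules": {
      "recommended": true,
      "suspicious": { "noExplicitAny": "error" },
      "correctness": {
        "noUnusedVariables": "error",
        "noUnusedImports": "error"
      }
    }
  }
}
```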

Codex: Strict Clippy

Rust's Clippy with 36+ deny-level lints in Cargo.toml, the strictest configuration of the three. Rust's compiler plus Clippy leave very little for external analysis to find, which is why 97.5% of Codex's findings come from graph analysis (structural patterns) rather than static linting.
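For comparison, deny-level Clippy lints are declared directly in Cargo.toml. This is a generic sketch of the mechanism, not Codex's actual list:

```toml
# Requires Rust 1.74+; each "deny" entry fails the build on violation.
[lints.clippy]
unwrap_used = "deny"
expect_used = "deny"
panic = "deny"
todo = "deny"
```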


Testing: A Tale of Three Philosophies

| Metric | Codex | Gemini CLI | OpenCode |
|---|---|---|---|
| Test-to-Code Ratio | 18.9% | 64.6% | 8.3% |
| Structural Coverage | 19.5% | 45.3% | 10.1% |
| Assertion Density | 1.22 | 0.25 | 0.28 |
| Mock Usage | 15.5% | 15.8% | 0.6% |

Gemini CLI has a staggering 64.6% test-to-code ratio: nearly two-thirds of the codebase is test code. It also has the highest structural coverage at 45.3%, backed by 738+ test files. Google's testing culture shows.

Codex writes fewer but denser tests. Despite lower coverage numbers, Codex's assertion density (1.22 assertions per test function) is 4-5x higher than either TypeScript project. Fewer tests, but each one validates more.

OpenCode has the weakest test story at 8.3% test ratio and only 10.1% structural coverage, with near-zero mock usage (0.6%). This is a real risk area as the project scales.


Architecture Quality

All three received a B architecture rating, but the dimension scores reveal different strengths:

| Dimension | Codex | Gemini CLI | OpenCode |
|---|---|---|---|
| Modularity | 78 | 82 | 78 |
| Coupling | 72 | 72 | 72 |
| Scalability | 75 | 78 | 75 |
| Patterns | 82 | 85 | 82 |

Gemini CLI leads in every architecture dimension. Its monorepo structure with dedicated packages demonstrates Google's experience building large-scale TypeScript applications. The policy engine, hook system, and declarative tool abstractions are standout patterns.

That said, even Gemini CLI has structural debt. Our graph analysis flagged ClearcutLogger, a single telemetry class spanning 1,941 lines, as a god class:

gemini-cli/packages/core/src/telemetry/clearcut-logger/clearcut-logger.ts
// 1,941 lines, 61 methods
export class ClearcutLogger {
  private static instance: ClearcutLogger;
  private config?: Config;
  private sessionData: EventValue[] = [];
  private promptId: string = '';
  private readonly installationManager: InstallationManager;
  private readonly userAccountManager: UserAccountManager;
  private readonly hashedGHRepositoryName?: string;

  private readonly events: FixedDeque<LogEventEntry[]>;
  private lastFlushTime: number = Date.now();
  private flushing: boolean = false;
  private pendingFlush: boolean = false;
  // ... 61 methods follow
}

Similarly, their Storage class (403 lines, 54 methods) handles everything from session management to OAuth token paths to plan directories. Both are candidates for decomposition.
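Decomposition here doesn't require anything exotic. A generic sketch (ours, not Gemini CLI's code) of splitting a telemetry god class into focused collaborators:

```typescript
// Each class owns one responsibility; the logger just composes them.
class EventQueue {
  private events: object[] = []
  enqueue(event: object) { this.events.push(event) }
  drain(): object[] { const out = this.events; this.events = []; return out }
}

class Flusher {
  constructor(private send: (batch: object[]) => void) {}
  flush(queue: EventQueue) {
    const batch = queue.drain()
    if (batch.length > 0) this.send(batch)
  }
}

class TelemetryLogger {
  constructor(
    private queue = new EventQueue(),
    private flusher = new Flusher((batch) => console.log(JSON.stringify(batch))),
  ) {}
  log(event: object) { this.queue.enqueue(event) }
  flush() { this.flusher.flush(this.queue) }
}
```

Session state, installation identity, and flush scheduling would get the same treatment: each becomes a small class testable in isolation instead of 61 methods sharing one set of private fields.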

Codex's Rust workspace with 70+ crates is impressive in granularity, though the core crate has grown into a monolithic hub. The main codex.rs file is 7,226 lines with 54 public functions; it's the central nervous system that everything routes through. This is the primary architectural concern our graph analysis flagged.

OpenCode's 20+ package monorepo is well-structured with an event bus pattern and multi-platform support (CLI, web, desktop, cloud), but large files and cross-module dependencies create maintenance pressure.

Architectural Highlights

Codex

  • Protocol abstraction layer
  • Platform-specific sandboxing
  • SQ/EQ async communication
  • 36+ Clippy deny-level lints

Gemini CLI

  • Rule-based policy engine
  • Comprehensive hook system
  • Structured error hierarchy
  • AbortSignal cancellation

OpenCode

  • Async context DI
  • SST infrastructure-as-code
  • Zod + OpenAPI generation
  • JSON-to-SQLite migration

Codebase Dynamics

| Metric | Codex | Gemini CLI | OpenCode |
|---|---|---|---|
| Churn Rate | 3.16% | 0% | 3.80% |
| Refactoring Rate | 12.7% | 0% | 48.1% |
| Convention Consistency | 81.6% | 92.6% | 85.3% |
| Convention Deviations | 25 | 13 | 26 |

OpenCode's 48% refactoring rate is the standout number here -almost half of recent changes are refactors rather than new features. This signals a team actively investing in codebase quality, even without formal lint tooling. It's the kind of organic discipline that's hard to fake.

Gemini CLI's near-zero churn and refactoring rates are unusual, possibly reflecting a stable initial architecture or a recent major release that reset activity metrics. Their 92.6% convention consistency is the highest of the three, with only 13 deviations detected.


Key Takeaways

1. Rust's Compiler Is a Code Quality Multiplier

Codex's 1.1 issues per 1K LOC vs 7-9 for the TypeScript projects isn't just about coding skill; it's about language design. Rust's ownership model, exhaustive matching, and strict type system prevent entire categories of issues from existing.

2. Tests Can Be a Liability Without Density

Gemini CLI has a massive test suite (65% of the codebase), but with an assertion density of only 0.25, many of those tests may be shallow. Codex's approach (fewer tests with 5x the assertions) is arguably more valuable. Coverage percentage matters less than what you're actually asserting.

3. OpenCode Is the Underdog That Holds Its Own

OpenCode is a community-driven project competing against OpenAI and Google, and it scores a B with architecture scores identical to Codex's. Its 48% refactoring rate shows a team actively investing in quality. The multi-platform support (CLI, web, desktop, cloud) is the most ambitious of the three. The one clear gap: zero lint enforcement. Adding a Biome or ESLint config would be the single highest-leverage improvement they could make, and would likely push that B toward a B+.

4. Consistency Debt Is the Debt Nobody Talks About

Both TypeScript projects have thousands of consistency issues. Most are stylistic: Biome preferences that differ from the project's conventions. But some are real: Gemini CLI's own ESLint config bans any, yet 104 violations slipped through. Convention drift makes every code review harder, every onboarding slower, and every refactor riskier.

5. Architecture Scores Are Remarkably Similar

All three converge on the same architecture grade (B) with identical coupling scores (72). The "coding agent" problem space naturally leads to certain architectural patterns, and all three teams are solving it competently.

6. Test Investment Separates Good From Great

The spread is dramatic: Gemini CLI at 65% test ratio, Codex at 19%, OpenCode at 8%. But raw coverage isn't the whole story: Codex's 5x assertion density suggests quality beats quantity. For OpenCode, getting from 8% to even 20% would significantly reduce the 277 runtime risk issues and build confidence for contributors working on those 490 open PRs.


Methodology

Each project was analyzed using Octokraft's full analysis pipeline, which combines three detection layers:

Layer 1: Static Analysis

Octokraft runs Biome for TypeScript/JavaScript and Clippy for Rust. We chose Biome over ESLint deliberately: it's faster (written in Rust), provides a unified formatter + linter, and applies a consistent ruleset across projects. This matters for benchmarks: if we used each project's own ESLint config, we'd be comparing different standards. Biome gives us one ruler for everyone.

Layer 2: Graph Analysis

Octokraft indexes every symbol, dependency, and call path into a FalkorDB knowledge graph. This detects structural issues no line-by-line linter can find: high fan-out modules, dead code (exported but never imported), god classes, circular dependencies, and coupling patterns.
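The "exported but never imported" check, for instance, is a pure graph query. A toy sketch of the idea in TypeScript (module and symbol names are hypothetical; Octokraft's real pipeline runs this as a query over the FalkorDB graph rather than in-memory objects):

```typescript
// A module's exports, plus which symbols it imports from which other modules.
type Module = {
  exports: string[]
  imports: Record<string, string[]> // source module -> imported symbols
}

function deadExports(graph: Record<string, Module>): string[] {
  // Collect every "module#symbol" pair that some module actually imports.
  const used = new Set<string>()
  for (const mod of Object.values(graph)) {
    for (const [source, symbols] of Object.entries(mod.imports)) {
      for (const sym of symbols) used.add(`${source}#${sym}`)
    }
  }
  // Anything exported but absent from the used set is a dead export.
  const dead: string[] = []
  for (const [name, mod] of Object.entries(graph)) {
    for (const exp of mod.exports) {
      if (!used.has(`${name}#${exp}`)) dead.push(`${name}#${exp}`)
    }
  }
  return dead
}
```

A line-by-line linter sees each file in isolation; only a whole-repo graph can answer "does anyone import this?"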

Layer 3: LLM-Powered Analysis

An AI agent reads the actual source code and identifies semantic issues: incomplete migrations, test quality problems, hardcoded credentials, architectural anti-patterns. These are findings that require understanding intent, not just syntax.

On top of these, Octokraft runs convention detection, architecture review, and health scoring across 8 weighted categories.

What we'd improve

The Biome ruleset could be smarter about distinguishing stylistic preferences (useLiteralKeys) from genuine quality signals (noExplicitAny). A future version might weight rules differently or let projects bring their own config. We're also exploring running each project's own linter in addition to Biome, to show both "what their tools catch" and "what they're missing."

All analyses were run against the latest commit on the default branch as of March 11, 2026.

Run this analysis on your codebase

Octokraft gives you health scores, architecture reviews, and convention detection for your repositories.

Try Octokraft