Benchmark · 12 min read · Apr 9, 2026

Three AI Coding Agents Compared: OpenCode 82.4, Codex 82.3, Gemini CLI 70.6

Octokraft's full analysis pipeline scored three prominent open-source AI coding agents. All three score B- or above. OpenCode and Codex nearly tie at B+. The differences cluster in security, runtime safety, and testing, the categories where language choice and engineering investment matter most.


The Contenders

| | OpenCode | Codex | Gemini CLI |
| --- | --- | --- | --- |
| Language | Go/TypeScript | Rust | TypeScript |
| Lines of Code | 274K | 555K | 449K |
| Score | 82.4 | 82.3 | 70.6 |
| Grade | B+ | B+ | B- |
| Assessment Issues | 39 | 854 | 672 |

Three actively developed codebases ranging from 274K to 555K lines of code. OpenCode edges Codex by 0.1 points, a gap that is statistically meaningless. The real separation is in the 11.8-point drop to Gemini CLI, and in how each project distributes its strengths across categories.


Score Breakdown by Category

| Category | OpenCode | Codex | Gemini CLI | What It Measures |
| --- | --- | --- | --- | --- |
| Security | 64.4 | 84.4 | 48.3 | Codex (Rust) leads but no longer scores 100. OpenCode (Go) sits in the middle. |
| Runtime Risks | 63.2 | 91.8 | 56.6 | Rust's ownership model gives Codex a 28-point lead over the next closest. |
| Testing | 90.8 | 37.4 | 56.3 | OpenCode leads by 34 points. Codex trails despite Rust's strong test tooling. |
| Code Smells | 85.5 | 90.5 | 79.5 | All three stay above 79. Codex is the cleanest. |
| Consistency | 99.9 | 88.6 | 97.0 | OpenCode and Gemini maintain strong style consistency. Codex trails. |
| Dead Code | 100.0 | 98.5 | 97.4 | All three are clean. No significant dead code. |

Codex leads on security at 84.4, but no project achieves a perfect score. The pipeline found security patterns in both Rust and Go codebases that surface-level analysis would miss. Codex's 36-point advantage over Gemini CLI (84.4 vs. 48.3) reflects Rust's ownership model, but it is not the blanket immunity that language choice alone might suggest.

The biggest separation is in testing. OpenCode leads at 90.8; Codex sits at 37.4. Go's idiomatic testing infrastructure, combined with OpenCode's investment in tests, puts the project well ahead of both the Rust and TypeScript codebases in this category.


Security: The Language Dividend, Diminished

The data complicates the clean story that Rust and Go projects achieve perfect security scores because the type system prevents common vulnerability classes.

Codex (Rust) scores 84.4 on security with 854 assessment issues. The analysis found patterns that compile-time safety does not cover: unsafe blocks, dependency-chain risks, and configuration-level security gaps. Rust prevents buffer overflows and null pointer dereferences. It does not prevent logic errors in authentication flows, insecure defaults in configuration, or unsafe trust boundaries in dependency handling.

OpenCode (Go) scores 64.4. Go's type system catches fewer vulnerability classes than Rust's ownership model, and the analysis accordingly surfaced more patterns.

Gemini CLI (TypeScript) scores 48.3. TypeScript's dynamic runtime creates more surface area for security findings than compiled alternatives, and the gap shows.

The language dividend is real as a relative advantage. Codex leads Gemini by 36 points on security. But the absolute guarantee, the claim that Rust and Go projects get perfect security scores for free, does not hold. Compile-time safety narrows the attack surface. It does not eliminate it.


Testing: OpenCode Takes the Lead

| Metric | OpenCode | Codex | Gemini CLI |
| --- | --- | --- | --- |
| Test Score | 90.8 | 37.4 | 56.3 |
| Assessment Issues | 39 | 854 | 672 |
| Active Issues | 793 | 1,371 | 1,454 |

OpenCode's test score of 90.8 is the highest across all three projects. Go's testing conventions, particularly the built-in testing package and table-driven test patterns, create a low-friction path to high structural coverage.

Codex's test score sits at 37.4 despite Rust's strong test tooling. A high-quality type system reduces the categories of bugs that tests need to catch, but it does not produce tests on its own. Codex's 555K LOC codebase has a lower ratio of test infrastructure to production code than the pipeline expects.

Gemini CLI sits in the middle at 56.3. Google's testing culture shows in the TypeScript codebase, but the analysis revealed gaps in structural coverage that keep the score below 60.


Broader Context: Six AI Agents Ranked

These three projects are part of a larger set of AI coding agent repositories in the benchmark. The full ranking provides additional context.

| Rank | Agent | Score | Grade | Language | LOC |
| --- | --- | --- | --- | --- | --- |
| 1 | Goose | 83.7 | B+ | Rust | 206K |
| 2 | OpenCode | 82.4 | B+ | Go/TS | 274K |
| 3 | Codex | 82.3 | B+ | Rust | 555K |
| 4 | Gemini CLI | 70.6 | B- | TS | 449K |
| 5 | OpenHands | 65.6 | C+ | Python | 361K |
| 6 | Sweep | 64.9 | C | Python | 81K |

Goose (Rust, 206K LOC) leads all AI coding agents at 83.7, edging out OpenCode by 1.3 points. The top three are all B+ and all written in Rust or Go. The bottom two are both Python. The language-to-grade correlation in AI agent codebases mirrors the pattern across the full 31-project benchmark.


What This Shows

Language choice still matters, but less absolutely

Codex (Rust) and OpenCode (Go) score higher overall, but neither achieves perfect security. The language dividend in security is 36 points (Codex 84.4 vs. Gemini 48.3). TypeScript's weaker type safety still shows in runtime scores (Gemini 56.6 vs. Codex 91.8), but the gap narrows in other categories.

Testing investment drives OpenCode's position

OpenCode's 90.8 test score is the highest in the comparison and compensates for its lower security and runtime scores. Go's testing ecosystem creates a low-friction path to strong coverage. Codex's 37.4, by contrast, shows that a strong type system does not automatically produce strong tests.

All three remain well-engineered

B- to B+ is a narrow 11.8-point range. All three have code smell scores above 79, consistency above 88, and dead code scores above 97. The differences cluster in the high-weight categories (security 2.0x, runtime 1.5x, testing 1.5x) where language choice and test investment have the most impact on the weighted score.


Methodology

Analysis was run using Octokraft's full pipeline against the latest commit on the default branch of each repository as of April 8, 2026. The pipeline includes:

  • Static analysis via Biome and Semgrep for code quality, security patterns, and compliance
  • Graph analysis via FalkorDB for structural patterns, coupling metrics, and dependency analysis
  • Convention detection for identifying and measuring adherence to coding patterns
  • LLM-powered semantic analysis for behavioral risks, architectural gaps, and security boundaries

Health scores weight categories differently: security 2.0x, runtime and testing 1.5x, code smells 1.0x, others below 1.0x. All three projects were analyzed with identical configuration.

Analyze your coding agent's output

Static analysis, graph intelligence, convention detection, and LLM-powered behavioral analysis. See what your tools produce.

Try Octokraft