Benchmark · 12 min read · Mar 12, 2026

Code Quality Showdown: OpenAI Codex vs Google Gemini CLI vs OpenCode

We ran Octokraft's full analysis pipeline on three of the most talked-about open-source coding agents. Here's what we found.

OpenAI Codex: A (94.8) · Rust · 443K LOC

Google Gemini CLI: B+ (80.1) · TypeScript · 399K LOC

OpenCode: B (76.4) · TypeScript · 420K LOC

The Contenders

| Metric | OpenAI Codex | Gemini CLI | OpenCode |
|---|---|---|---|
| Language | Rust | TypeScript | TypeScript |
| Lines of Code | 443,699 | 399,417 | 419,527 |
| Test LOC | 83,689 | 257,893 | 34,902 |
| Open PRs | 165 | 472 | 490 |
| 30-day activity | +12.4K / -1.6K | +11.2K / -5.1K | +10.8K / -5.2K |

All three are actively developed, similarly sized codebases (~400-440K LOC). What separates them is everything under the hood.


Score Breakdown by Category

Each category is scored 0-100. Higher is better.

| Category | Codex | Gemini CLI | OpenCode |
|---|---|---|---|
| Security | 100 | 99.98 | 99.41 |
| Runtime Risks | 100 | 80.87 | 50.24 |
| Test Coverage | 100 | 86.47 | 91.88 |
| Code Smells | 57.94 | 34.57 | 34.02 |
| Duplication | 100 | 100 | 100 |
| Dead Code | 97.37 | 97.62 | 98.13 |
| Consistency | 98.34 | 16.32 | 25.00 |
| Compliance | 100 | 100 | 100 |

Issue Density: Rust vs TypeScript

  • Codex: 1.1 issues / 1K LOC
  • OpenCode: 7.3 issues / 1K LOC
  • Gemini CLI: 9.1 issues / 1K LOC

Codex has 6-8x fewer issues per line of code than both TypeScript projects. This is partly Rust's strict compiler catching problems at build time that surface as analysis findings in TypeScript. Rust's ownership model, exhaustive pattern matching, and strong type system eliminate entire categories of bugs before they reach production.

Issue Severity

| Severity | Codex | Gemini CLI | OpenCode |
|---|---|---|---|
| Critical | 0 | 0 | 0 |
| High | 0 | 32 | 6 |
| Medium | 439 | 3,564 | 3,030 |
| Low | 50 | 45 | 34 |

Zero high-severity issues for Codex. Gemini CLI has the most, with 32 high-severity findings, mostly in testing (7) and code smells (25).

Where the Issues Live

| Category | Codex | Gemini CLI | OpenCode |
|---|---|---|---|
| Code Smells | 431 | 783 | 1,067 |
| Consistency | 10 | 2,734 | 1,678 |
| Runtime | 0 | 63 | 277 |
| Dead Code | 48 | 39 | 32 |
| Testing | 0 | 25 | 15 |
| Security | 0 | 2 | 1 |

Consistency is where Gemini CLI and OpenCode bleed points. With 2,734 consistency issues, Gemini CLI's score in this category drops to just 16/100.

OpenCode has the most runtime risk issues (277), 4.4x more than Gemini CLI, pulling its runtime risks score down to 50/100.


Where the Findings Come From

| Source | Codex | Gemini CLI | OpenCode |
|---|---|---|---|
| Graph Analysis | 477 (97.5%) | 257 (7.0%) | 131 (4.3%) |
| Static Analyzers | 11 (2.2%) | 3,357 (92.1%) | 2,924 (95.2%) |
| LLM Agent | 1 (0.2%) | 32 (0.9%) | 15 (0.5%) |

The static analyzer numbers deserve context, and some honesty about our methodology.

Why Biome, not ESLint?

Octokraft uses Biome as its static analyzer for TypeScript/JavaScript projects. Biome is faster, more comprehensive, and applies a consistent ruleset across all projects regardless of what linting each project has configured. This gives us an apples-to-apples comparison, but it also means we're sometimes flagging things the project's own tooling would ignore.

What's Driving the Numbers

The top Biome findings across both TypeScript projects:

| Rule | Gemini CLI | OpenCode | What it catches |
|---|---|---|---|
| useLiteralKeys | 557 | 102 | obj["key"] instead of obj.key |
| noNonNullAssertion | 364 | 153 | TypeScript ! operator usage |
| noExplicitAny | 104 | 332 | Using the any type |
| useImportType | 121 | 82 | Missing import type |
| useTemplate | 57 | 146 | String concat vs template literals |
| noUnusedVariables | - | 67 | Declared but never used |
| noUnusedImports | - | 49 | Imported but never used |

Some of these are stylistic preferences (literal keys, template strings). Others are genuine code quality signals, particularly noExplicitAny and noUnusedVariables.

To see what 332 any uses actually look like in practice, here's a function from OpenCode's OpenAI provider adapter:

opencode/packages/console/app/src/routes/zen/util/provider/openai.ts
export function fromOpenaiRequest(body: any): CommonRequest {
  if (!body || typeof body !== "object") return body

  const toImg = (p: any) => {
    if (!p || typeof p !== "object") return undefined
    if ((p as any).type === "image_url" && (p as any).image_url)
      return { type: "image_url", image_url: (p as any).image_url }
    if ((p as any).type === "input_image" && (p as any).image_url)
      return { type: "image_url", image_url: (p as any).image_url }
    const s = (p as any).source
    if (!s || typeof s !== "object") return undefined
    if ((s as any).type === "url" && typeof (s as any).url === "string")
      return { type: "image_url", image_url: { url: (s as any).url } }
    if (
      (s as any).type === "base64" &&
      typeof (s as any).media_type === "string" &&
      typeof (s as any).data === "string"
    )
      return {
        type: "image_url",
        image_url: { url: `data:${(s as any).media_type};base64,${(s as any).data}` },
      }
  }
  // ... (rest of the function elided)
}

That's 15+ as any casts in a single function. The parameter p is already typed any, so every (p as any) cast is redundant, and the nested (s as any) casts show a lack of type narrowing. A single interface definition would eliminate all of them and catch shape mismatches at compile time.
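As a sketch of what that interface-based version could look like (the type and field names below are ours, inferred from the snippet above, and are not OpenCode's actual definitions), a discriminated union lets the compiler do the narrowing that the as any casts bypass:

```typescript
// Hypothetical types inferred from the snippet above; not OpenCode's definitions.
interface UrlSource { type: "url"; url: string }
interface Base64Source { type: "base64"; media_type: string; data: string }
interface ImagePart {
  type?: string
  image_url?: { url: string }
  source?: UrlSource | Base64Source
}

function toImg(p: ImagePart): { type: "image_url"; image_url: { url: string } } | undefined {
  if (p.image_url && (p.type === "image_url" || p.type === "input_image"))
    return { type: "image_url", image_url: p.image_url }
  const s = p.source
  if (!s) return undefined
  if (s.type === "url")
    return { type: "image_url", image_url: { url: s.url } }
  // s is narrowed to Base64Source here; no cast needed.
  return { type: "image_url", image_url: { url: `data:${s.media_type};base64,${s.data}` } }
}
```

With the union in place, a base64 source missing its data field is rejected at build time, which is exactly the class of shape mismatch the any version silently accepts.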

This pattern extends beyond business logic into core utilities. Here's OpenCode's logging module:

opencode/packages/console/core/src/util/log.ts
export namespace Log {
  const ctx = Context.create<{
    tags: Record<string, any>
  }>()

  export function create(tags?: Record<string, any>) {
    const result = {
      info(message?: any, extra?: Record<string, any>) {
        const prefix = Object.entries({
          ...use().tags,
          ...tags,
          ...extra,
        })
          .map(([key, value]) => `${key}=${value}`)
          .join(" ")
        console.log(prefix, message)
        return result
      },
      // ...
    }
    return result
  }
}

Five any types in a 55-line file. The tags, message, and extra parameters could all be typed: Record<string, string | number | boolean> would cover most logging use cases. When any lives in core utilities, it propagates through every callsite.
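A minimal sketch of that typed alternative (the createLogger name is ours; this is not OpenCode's actual API):

```typescript
// Typed tags: wide enough for logging, narrow enough to catch mistakes.
type Tags = Record<string, string | number | boolean>

function createLogger(tags: Tags = {}) {
  return {
    info(message: string, extra: Tags = {}) {
      const prefix = Object.entries({ ...tags, ...extra })
        .map(([key, value]) => `${key}=${value}`)
        .join(" ")
      console.log(prefix, message)
      return prefix // returned so callers (and tests) can inspect it
    },
  }
}
```

Passing an object or a function as a tag value now fails at compile time instead of serializing as [object Object] at runtime.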

What Their Own Linters Say

We dug into each project's linting setup. The contrast is striking.

Gemini CLI: Thorough ESLint config

A 400-line ESLint config that explicitly sets @typescript-eslint/no-explicit-any: "error"; their own rules ban any. The 104 Biome findings for this rule represent violations of Google's own standards that slipped through. They also enforce consistent-type-imports, no-floating-promises, no-unsafe-assignment, and custom rules like banning raw fetch() in favor of safeFetch() for SSRF protection.
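We haven't reproduced the 400-line file here, but the rules described above look roughly like this in ESLint flat-config form (a sketch, not Gemini CLI's actual config; the safeFetch convention is approximated with a built-in restriction rule rather than their custom plugin):

```javascript
// eslint.config.js sketch only, not Gemini CLI's actual configuration
export default [
  {
    rules: {
      "@typescript-eslint/no-explicit-any": "error",
      "@typescript-eslint/consistent-type-imports": "error",
      "@typescript-eslint/no-floating-promises": "error",
      "@typescript-eslint/no-unsafe-assignment": "error",
      // Approximates their custom safeFetch rule:
      "no-restricted-globals": [
        "error",
        { name: "fetch", message: "Use safeFetch() for SSRF protection." },
      ],
    },
  },
];
```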

OpenCode: No linter, but context matters

No ESLint. No Biome. No Prettier. The only check is bun turbo typecheck, and even that runs without strict: true in tsconfig. The 332 any uses, 67 unused variables, and 49 unused imports have no automated guardrails. For a smaller, community-driven project competing against OpenAI and Google, this is understandable: lint configs aren't the first thing you set up when you're shipping fast. But adding one is the single highest-leverage improvement they could make.
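Even a minimal Biome config would put guardrails on the findings above. A sketch (rule-group placement can vary across Biome versions, so treat the group names as indicative):

```json
{
  "linter": {
    "enabled": true,
    "rules": {
      "recommended": true,
      "suspicious": { "noExplicitAny": "error" },
      "correctness": {
        "noUnusedVariables": "error",
        "noUnusedImports": "error"
      }
    }
  }
}
```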

Codex: Strict Clippy

Rust's Clippy with 36+ deny-level lints in Cargo.toml, the strictest configuration of the three. Rust's compiler plus Clippy leave very little for external analysis to find, which is why 97.5% of Codex's findings come from graph analysis (structural patterns) rather than static linting.
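For comparison, deny-level Clippy lints are declared directly in Cargo.toml. This is a generic sketch of the mechanism, not Codex's actual list:

```toml
# Requires Rust 1.74+; each "deny" entry fails the build on violation.
[lints.clippy]
unwrap_used = "deny"
expect_used = "deny"
panic = "deny"
todo = "deny"
```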


Testing: A Tale of Three Philosophies

| Metric | Codex | Gemini CLI | OpenCode |
|---|---|---|---|
| Test-to-Code Ratio | 18.9% | 64.6% | 8.3% |
| Structural Coverage | 19.5% | 45.3% | 10.1% |
| Assertion Density | 1.22 | 0.25 | 0.28 |
| Mock Usage | 15.5% | 15.8% | 0.6% |

Gemini CLI has a staggering 64.6% test-to-code ratio: nearly two-thirds of the codebase is test code. It also has the highest structural coverage at 45.3%, backed by 738+ test files. Google's testing culture shows.

Codex writes fewer but denser tests. Despite lower coverage numbers, Codex's assertion density (1.22 assertions per test function) is 4-5x higher than either TypeScript project. Fewer tests, but each one validates more.

OpenCode has the weakest test story at 8.3% test ratio and only 10.1% structural coverage, with near-zero mock usage (0.6%). This is a real risk area as the project scales.


Architecture Quality

All three received a B architecture rating, but the dimension scores reveal different strengths:

| Dimension | Codex | Gemini CLI | OpenCode |
|---|---|---|---|
| Modularity | 78 | 82 | 78 |
| Coupling | 72 | 72 | 72 |
| Scalability | 75 | 78 | 75 |
| Patterns | 82 | 85 | 82 |

Gemini CLI leads in every architecture dimension. Its monorepo structure with dedicated packages demonstrates Google's experience building large-scale TypeScript applications. The policy engine, hook system, and declarative tool abstractions are standout patterns.

That said, even Gemini CLI has structural debt. Our graph analysis flagged ClearcutLogger, a single telemetry class spanning 1,941 lines, as a god class:

gemini-cli/packages/core/src/telemetry/clearcut-logger/clearcut-logger.ts
// 1,941 lines, 61 methods
export class ClearcutLogger {
  private static instance: ClearcutLogger;
  private config?: Config;
  private sessionData: EventValue[] = [];
  private promptId: string = '';
  private readonly installationManager: InstallationManager;
  private readonly userAccountManager: UserAccountManager;
  private readonly hashedGHRepositoryName?: string;

  private readonly events: FixedDeque<LogEventEntry[]>;
  private lastFlushTime: number = Date.now();
  private flushing: boolean = false;
  private pendingFlush: boolean = false;
  // ... 61 methods follow
}

Similarly, their Storage class (403 lines, 54 methods) handles everything from session management to OAuth token paths to plan directories. Both are candidates for decomposition.
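Decomposition here doesn't require anything exotic. A generic sketch (ours, not Gemini CLI's code) of splitting a telemetry god class into focused collaborators:

```typescript
// Each class owns one responsibility; the logger just composes them.
class EventQueue {
  private events: object[] = []
  enqueue(event: object) { this.events.push(event) }
  drain(): object[] { const out = this.events; this.events = []; return out }
}

class Flusher {
  constructor(private send: (batch: object[]) => void) {}
  flush(queue: EventQueue) {
    const batch = queue.drain()
    if (batch.length > 0) this.send(batch)
  }
}

class TelemetryLogger {
  constructor(
    private queue = new EventQueue(),
    private flusher = new Flusher((batch) => console.log(JSON.stringify(batch))),
  ) {}
  log(event: object) { this.queue.enqueue(event) }
  flush() { this.flusher.flush(this.queue) }
}
```

Session state, installation identity, and flush scheduling would get the same treatment: each becomes a small class testable in isolation instead of 61 methods sharing one set of private fields.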

Codex's Rust workspace with 70+ crates is impressive in granularity, though the core crate has grown into a monolithic hub. The main codex.rs file is 7,226 lines with 54 public functions; it's the central nervous system that everything routes through. This is the primary architectural concern our graph analysis flagged.

OpenCode's 20+ package monorepo is well-structured with an event bus pattern and multi-platform support (CLI, web, desktop, cloud), but large files and cross-module dependencies create maintenance pressure.

Architectural Highlights

Codex

  • Protocol abstraction layer
  • Platform-specific sandboxing
  • SQ/EQ async communication
  • 36+ Clippy deny-level lints

Gemini CLI

  • Rule-based policy engine
  • Comprehensive hook system
  • Structured error hierarchy
  • AbortSignal cancellation

OpenCode

  • Async context DI
  • SST infrastructure-as-code
  • Zod + OpenAPI generation
  • JSON-to-SQLite migration

Codebase Dynamics

| Metric | Codex | Gemini CLI | OpenCode |
|---|---|---|---|
| Churn Rate | 3.16% | 0% | 3.80% |
| Refactoring Rate | 12.7% | 0% | 48.1% |
| Convention Consistency | 81.6% | 92.6% | 85.3% |
| Convention Deviations | 25 | 13 | 26 |

OpenCode's 48% refactoring rate is the standout number here -almost half of recent changes are refactors rather than new features. This signals a team actively investing in codebase quality, even without formal lint tooling. It's the kind of organic discipline that's hard to fake.

Gemini CLI's near-zero churn and refactoring rates are unusual, possibly reflecting a stable initial architecture or a recent major release that reset activity metrics. Their 92.6% convention consistency is the highest of the three, with only 13 deviations detected.


Key Takeaways

1. Rust's Compiler Is a Code Quality Multiplier

Codex's 1.1 issues per 1K LOC vs 7-9 for the TypeScript projects isn't just about coding skill; it's about language design. Rust's ownership model, exhaustive matching, and strict type system prevent entire categories of issues from existing.

2. Tests Can Be a Liability Without Density

Gemini CLI has a massive test suite (65% of the codebase), but with an assertion density of only 0.25, many of those tests may be shallow. Codex's approach (fewer tests with 5x the assertions) is arguably more valuable. Coverage percentage matters less than what you're actually asserting.

3. OpenCode Is the Underdog That Holds Its Own

OpenCode is a community-driven project competing against OpenAI and Google, and it scores a B with architecture scores identical to Codex's. Its 48% refactoring rate shows a team actively investing in quality. The multi-platform support (CLI, web, desktop, cloud) is the most ambitious of the three. The one clear gap: zero lint enforcement. Adding a Biome or ESLint config would be the single highest-leverage improvement they could make, and would likely push that B toward a B+.

4. Consistency Debt Is the Debt Nobody Talks About

Both TypeScript projects have thousands of consistency issues. Most are stylistic: Biome preferences that differ from the project's conventions. But some are real: Gemini CLI's own ESLint config bans any, yet 104 violations slipped through. Convention drift makes every code review harder, every onboarding slower, and every refactor riskier.

5. Architecture Scores Are Remarkably Similar

All three converge on the same architecture grade (B) with identical coupling scores (72). The "coding agent" problem space naturally leads to certain architectural patterns, and all three teams are solving it competently.

6. Test Investment Separates Good From Great

The spread is dramatic: Gemini CLI at 65% test ratio, Codex at 19%, OpenCode at 8%. But raw coverage isn't the whole story: Codex's 5x assertion density suggests quality beats quantity. For OpenCode, getting from 8% to even 20% would significantly reduce the 277 runtime risk issues and build confidence for contributors working on those 490 open PRs.


Methodology

Each project was analyzed using Octokraft's full analysis pipeline, which combines three detection layers:

Layer 1: Static Analysis

Octokraft runs Biome for TypeScript/JavaScript and Clippy for Rust. We chose Biome over ESLint deliberately: it's faster (written in Rust), provides a unified formatter + linter, and applies a consistent ruleset across projects. This matters for benchmarks: if we used each project's own ESLint config, we'd be comparing different standards. Biome gives us one ruler for everyone.

Layer 2: Graph Analysis

Octokraft indexes every symbol, dependency, and call path into a FalkorDB knowledge graph. This detects structural issues no line-by-line linter can find: high fan-out modules, dead code (exported but never imported), god classes, circular dependencies, and coupling patterns.
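The "exported but never imported" check, for instance, is a pure graph query. A toy sketch of the idea in TypeScript (module and symbol names are hypothetical; Octokraft's real pipeline runs this as a query over the FalkorDB graph rather than in-memory objects):

```typescript
// A module's exports, plus which symbols it imports from which other modules.
type Module = {
  exports: string[]
  imports: Record<string, string[]> // source module -> imported symbols
}

function deadExports(graph: Record<string, Module>): string[] {
  // Collect every "module#symbol" pair that some module actually imports.
  const used = new Set<string>()
  for (const mod of Object.values(graph)) {
    for (const [source, symbols] of Object.entries(mod.imports)) {
      for (const sym of symbols) used.add(`${source}#${sym}`)
    }
  }
  // Anything exported but absent from the used set is a dead export.
  const dead: string[] = []
  for (const [name, mod] of Object.entries(graph)) {
    for (const exp of mod.exports) {
      if (!used.has(`${name}#${exp}`)) dead.push(`${name}#${exp}`)
    }
  }
  return dead
}
```

A line-by-line linter sees each file in isolation; only a whole-repo graph can answer "does anyone import this?"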

Layer 3: LLM-Powered Analysis

An AI agent reads the actual source code and identifies semantic issues: incomplete migrations, test quality problems, hardcoded credentials, architectural anti-patterns. These are findings that require understanding intent, not just syntax.

On top of these, Octokraft runs convention detection, architecture review, and health scoring across 8 weighted categories.

What we'd improve

The Biome ruleset could be smarter about distinguishing stylistic preferences (useLiteralKeys) from genuine quality signals (noExplicitAny). A future version might weight rules differently or let projects bring their own config. We're also exploring running each project's own linter in addition to Biome, to show both "what their tools catch" and "what they're missing."

All analyses were run against the latest commit on the default branch as of March 11, 2026.

Run this analysis on your codebase

Octokraft gives you health scores, architecture reviews, and convention detection for your repositories.

Try Octokraft