The Contenders
| Metric | OpenCode | Codex | Gemini CLI |
|---|---|---|---|
| Language | Go/TypeScript | Rust | TypeScript |
| Lines of Code | 274K | 555K | 449K |
| Score | 82.4 | 82.3 | 70.6 |
| Grade | B+ | B+ | B- |
| Assessment Issues | 39 | 854 | 672 |
Three actively developed codebases ranging from 274K to 555K lines of code. OpenCode edges Codex by 0.1 points, a gap that is statistically meaningless. The real separation is in the 11.8-point drop to Gemini CLI, and in how each project distributes its strengths across categories.
Score Breakdown by Category
| Category | OpenCode | Codex | Gemini CLI | What It Measures |
|---|---|---|---|---|
| Security | 64.4 | 84.4 | 48.3 | Codex (Rust) leads but no longer scores 100. OpenCode (Go) sits in the middle. |
| Runtime Risks | 63.2 | 91.8 | 56.6 | Rust's ownership model gives Codex a 28-point lead over the next closest. |
| Testing | 90.8 | 37.4 | 56.3 | OpenCode leads by 34 points. Codex trails despite Rust's strong test tooling. |
| Code Smells | 85.5 | 90.5 | 79.5 | All three stay above 79. Codex is the cleanest. |
| Consistency | 99.9 | 88.6 | 97.0 | OpenCode and Gemini maintain strong style consistency. Codex trails. |
| Dead Code | 100.0 | 98.5 | 97.4 | All three are clean. No significant dead code. |
Codex leads on security at 84.4, but no project achieves a perfect score. The pipeline found security patterns in both Rust and Go codebases that surface-level analysis would miss. Codex's 36-point advantage over Gemini CLI (84.4 vs. 48.3) reflects Rust's ownership model, but it is not the blanket immunity that language choice alone might suggest.
The biggest separation is in testing. OpenCode leads at 90.8; Codex sits at 37.4. Go's idiomatic testing infrastructure, combined with OpenCode's test investment, puts it well ahead of both the Rust and TypeScript projects in this category.
Security: The Language Dividend, Diminished
The data complicates a clean story: that Rust and Go projects achieve perfect security scores simply because the type system prevents common vulnerability classes.
Codex (Rust) scores 84.4 on security with 854 assessment issues. The analysis found patterns that compile-time safety does not cover: unsafe blocks, dependency-chain risks, and configuration-level security gaps. Rust prevents buffer overflows and null pointer dereferences. It does not prevent logic errors in authentication flows, insecure defaults in configuration, or unsafe trust boundaries in dependency handling.
OpenCode (Go) scores 64.4. Go's type system catches fewer vulnerability classes than Rust's ownership model, and the analysis accordingly surfaced more patterns.
Gemini CLI (TypeScript) scores 48.3. TypeScript's dynamic runtime creates more surface area for security findings than compiled alternatives, and the gap shows.
The language dividend is real as a relative advantage. Codex leads Gemini by 36 points on security. But the absolute guarantee, the claim that Rust and Go projects get perfect security scores for free, does not hold. Compile-time safety narrows the attack surface. It does not eliminate it.
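The class of bug that compile-time safety misses can be as mundane as an inverted comparison. The sketch below is hypothetical, not drawn from any of these codebases; it only illustrates that a type checker in any of the three languages accepts it without complaint:

```go
package main

import (
	"errors"
	"fmt"
)

// checkToken sketches a logic error no type system catches: an
// inverted comparison that rejects the valid token and accepts
// everything else. Rust and Go both compile this happily.
// The function and names are hypothetical, for illustration only.
func checkToken(token, expected string) error {
	if token == expected { // BUG: condition is inverted; should be !=
		return errors.New("invalid token")
	}
	return nil
}

func main() {
	fmt.Println(checkToken("secret", "secret")) // the valid token is rejected
	fmt.Println(checkToken("wrong", "secret"))  // an invalid token passes
}
```

Finding this kind of flaw takes semantic analysis of the authentication flow, not a compiler.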
Testing: OpenCode Takes the Lead
| Metric | OpenCode | Codex | Gemini CLI |
|---|---|---|---|
| Test Score | 90.8 | 37.4 | 56.3 |
| Assessment Issues | 39 | 854 | 672 |
| Active Issues | 793 | 1,371 | 1,454 |
OpenCode's test score of 90.8 is the highest across all three projects. Go's testing conventions, particularly the built-in testing package and table-driven test patterns, create a low-friction path to high structural coverage.
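The table-driven pattern the article refers to looks roughly like the sketch below. The `Grade` function and its cutoffs are assumptions inferred from the published score table (82.3 maps to B+, 70.6 to B-, 65.6 to C+, 64.9 to C), not Octokraft's actual rubric:

```go
package main

import "testing"

// Grade maps a numeric health score to a letter grade. The cutoffs
// are illustrative assumptions, not the benchmark's real rubric.
func Grade(score float64) string {
	switch {
	case score >= 80:
		return "B+"
	case score >= 70:
		return "B-"
	case score >= 65:
		return "C+"
	default:
		return "C"
	}
}

// TestGrade shows Go's table-driven pattern: a slice of named cases,
// one t.Run subtest per row. Adding coverage means adding a row.
func TestGrade(t *testing.T) {
	cases := []struct {
		name  string
		score float64
		want  string
	}{
		{"OpenCode", 82.4, "B+"},
		{"Codex", 82.3, "B+"},
		{"Gemini CLI", 70.6, "B-"},
		{"Sweep", 64.9, "C"},
	}
	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			if got := Grade(tc.score); got != tc.want {
				t.Errorf("Grade(%.1f) = %q, want %q", tc.score, got, tc.want)
			}
		})
	}
}
```

Because each case is one struct literal, the marginal cost of another test is a single line, which is exactly the low-friction path the score reflects.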
Codex's test score sits at 37.4 despite Rust's strong test tooling. A high-quality type system reduces the categories of bugs that tests need to catch, but it does not produce tests on its own. Codex's 555K LOC codebase has a lower ratio of test infrastructure to production code than the pipeline expects.
Gemini CLI sits in the middle at 56.3. Google's testing culture shows in the TypeScript codebase, but the analysis revealed gaps in structural coverage that keep the score below 60.
Broader Context: Six AI Agents Ranked
These three projects are part of a larger set of AI coding agent repositories in the benchmark. The full ranking provides additional context.
| Rank | Agent | Score | Grade | Language | LOC |
|---|---|---|---|---|---|
| 1 | Goose | 83.7 | B+ | Rust | 206K |
| 2 | OpenCode | 82.4 | B+ | Go/TS | 274K |
| 3 | Codex | 82.3 | B+ | Rust | 555K |
| 4 | Gemini CLI | 70.6 | B- | TS | 449K |
| 5 | OpenHands | 65.6 | C+ | Python | 361K |
| 6 | Sweep | 64.9 | C | Python | 81K |
Goose (Rust, 206K LOC) leads all AI coding agents at 83.7, edging out OpenCode by 1.3 points. The top three are all B+ and all written in Rust or Go. The bottom two are both Python. The language-to-grade correlation in AI agent codebases mirrors the pattern across the full 31-project benchmark.
What This Shows
Language choice still matters, but less absolutely
Codex (Rust) and OpenCode (Go) score higher overall, but neither achieves perfect security. The language dividend in security is 36 points (Codex 84.4 vs. Gemini 48.3). TypeScript's weaker type safety still shows in runtime scores (Gemini 56.6 vs. Codex 91.8), but the gap narrows in other categories.
Testing investment drives OpenCode's position
OpenCode's 90.8 test score is the highest in the comparison and compensates for its lower security and runtime scores. Go's testing ecosystem creates a low-friction path to strong coverage. Codex's 37.4, by contrast, shows that a strong type system does not automatically produce strong tests.
All three remain well-engineered
B- to B+ is a narrow 11.8-point range. All three have code smell scores above 79, consistency above 88, and dead code scores above 97. The differences cluster in the high-weight categories (security 2.0x, runtime 1.5x, testing 1.5x) where language choice and test investment have the most impact on the weighted score.
Methodology
Analysis was run using Octokraft's full pipeline against the latest commit on the default branch of each repository as of April 8, 2026. The pipeline includes:
- Static analysis via Biome and Semgrep for code quality, security patterns, and compliance
- Graph analysis via FalkorDB for structural patterns, coupling metrics, and dependency analysis
- Convention detection for identifying and measuring adherence to coding patterns
- LLM-powered semantic analysis for behavioral risks, architectural gaps, and security boundaries
Health scores weight categories differently: security 2.0x, runtime and testing 1.5x, code smells 1.0x, others below 1.0x. All three projects were analyzed with identical configuration.