Benchmark · 14 min read · Mar 15, 2026

Claude Code vs Codex CLI: Same Model, Same Task, Different Agent

Same LLM. Same codebase. Same prompts. Same tools. One variable: the agent CLI. 2.7x difference in findings. 2.2x difference in cost. Zero overlap in security coverage.

Metric | Codex | Claude
Duration | 23m | 35m
Investigation tasks | 10 | 18
Issues found | 27 | 72
Critical issues | 0 | 5

A note on scope: This benchmark measures how each agent performs as a code analysis tool -- reading, exploring, and reasoning about an existing codebase. We did not test code generation, refactoring, or bug fixing. The agents were read-only investigators, not authors. That said, the behavioral differences we observed -- planning strategies, exploration depth, tool usage patterns -- are properties of the agent CLI itself, not the task. We speculate on what this means for code generation at the end.


The Setup

Target repository: OpenClaw (~1M LOC, TypeScript/Node.js)

Task: Automated code health analysis. Plan investigation tasks, execute them in parallel, evaluate and structure the findings.

LLM: GLM-5:cloud via Ollama's Responses API. Identical model, identical endpoint, identical API key for both runs.

Agent CLIs:

  • Codex CLI (OpenAI) with a custom model provider pointing at the Ollama endpoint
  • Claude Code (Anthropic) with the same Ollama endpoint via proxy mode

Both agents had identical tool access: file reading, grep, glob, shell execution, and file writing.


The Numbers

Metric | Codex CLI | Claude Code | Delta
Total Duration | 22m 56s | 35m 21s | 1.5x
Investigation Tasks | 10 | 18 | 1.8x
Issues Found | 27 | 72 | 2.7x
Critical Issues | 0 | 5 | --
High Issues | 6 | 14 | 2.3x
Total Tokens | 7.76M | 17.16M | 2.2x
LLM API Calls | 15 | 29 | 1.9x
Issues per 1M Tokens | 3.5 | 4.2 | +20%

How Each Agent Planned the Work

Both agents received the same planning prompt with the same codebase context. They produced very different investigation strategies.

Codex: 10 Focused Subsystem Investigations

Codex planned 10 tasks, each targeting a specific subsystem. The strategy was subsystem-by-subsystem, architecturally scoped:

# | Investigation | Focus
1 | Gateway Ingress | Message entry and routing from WhatsApp/Telegram through to agents
2 | Agent Orchestration | Model selection, fallback chains, tool execution
3 | Voice Call State Machine | State transitions, timers, persistence
4 | Webhook Security | Twilio/Plivo signature validation, replay attacks
5 | Channel Plugins | Interface contracts across messaging channels
6 | Config System | Hot reload, validation, migration
7 | MSTeams Messaging | Complex send path, token refresh, error classification
8 | BlueBubbles Monitor | Webhook processing, attachment handling
9 | Extension Runtime | Plugin loading, isolation boundaries
10 | Error Propagation | Cross-cutting error handling patterns

Claude Code: 18 Narrower, Risk-Oriented Investigations

Claude planned 18 tasks -- nearly twice as many -- with a different philosophy. Instead of broad subsystem reviews, Claude created narrowly scoped, risk-oriented investigations:

# | Investigation | Focus
1 | Message Routing Pipeline | Delivery guarantees, session key failures
2 | Gateway Server Lifecycle | Crash recovery, signal handling
3 | Tool Execution Security | Command injection, path traversal, sandboxing
4 | Model Fallback Chain | Auth lockout, streaming failures
5 | Channel Plugin System | Reconnection strategies, capability mismatches
6 | Webhook Security | Timing attacks, replay prevention, SSRF
7 | Session Store Concurrency | Race conditions, partial writes, disk failures
8 | Extension Channel Lifecycle | Dependency isolation, crash containment
9 | Voice Call State Machine | Timer races, stuck states
10 | Multi-Agent Spawning | Infinite loops, spawn limits, isolation
11 | Config Validation | Startup failures, silent defaults
12 | Plugin API Surface | Security boundaries, unauthorized registration
13 | Hook System Lifecycle | Dynamic code loading, infinite loops
14 | Memory Search & Embeddings | Dimension mismatches, store bloat
15 | Streaming Response Handling | Buffer growth, cancellation
16 | Canvas Host Browser Control | XSS, unauthenticated HTTP, arbitrary JS
17 | Mobile App Gateway | Disconnection handling, offline queueing
18 | Account Resolution & Caching | Credential protection, cache invalidation

Claude covered areas Codex didn't touch at all: multi-agent spawning, hook system security, memory/embeddings, streaming buffers, canvas/browser control, and mobile app connections. These additional tasks found 8 of the 15 security issues and 12 of the 32 runtime issues.


What Each Agent Actually Found

The Security Gap

This is the most striking difference. Claude found 15 security issues including 5 critical ones. Codex found zero.

Critical Finding | File | Confidence
Plugin HTTP handlers bypass gateway authentication | server/plugins-http.ts | 95%
No file system sandboxing for plugins | plugins/registry.ts | 95%
Entire BlueBubbles channel untested | extensions/bluebubbles/ | 90%
Entire Signal channel untested | extensions/signal/ | 90%
Gateway server lifecycle has zero tests | server.impl.ts | 95%

Claude's medium security findings included timing attacks in token comparison, webhook replay vulnerabilities, dynamic code loading from user-configurable paths, an unauthenticated canvas host HTTP server, and missing sandboxing for plugin file system access.

Codex investigated webhook security and did find the replay attack vulnerability and timing-safe comparison issues -- but these observations stayed in its exploration notes and never surfaced as structured findings. The knowledge was there; the final evaluation didn't capture it.

Where Both Agents Agreed

Both agents investigated the voice call state machine, webhook security, channel plugins, config system, and extension runtime. Their findings overlapped significantly:

Voice Call State Machine

Both found: TTS failure leaves calls in inconsistent state (high severity), calls can get stuck in non-terminal states, timer race conditions, persistence gaps after restart ("zombie calls")
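The "stuck in a non-terminal state" failure mode both agents flagged has a standard mitigation: a watchdog timer that forces every call into a terminal state after a maximum lifetime. A minimal sketch in TypeScript -- the state names and timeout are hypothetical, not OpenClaw's actual implementation:

```typescript
type CallState = "ringing" | "active" | "ended" | "failed";
const TERMINAL: CallState[] = ["ended", "failed"];

export class CallSession {
  state: CallState = "ringing";
  private watchdog?: ReturnType<typeof setTimeout>;

  constructor(maxLifetimeMs: number) {
    // Watchdog: no matter what the TTS or telephony layer does, the
    // call is forced into a terminal state after maxLifetimeMs.
    this.watchdog = setTimeout(() => {
      if (!TERMINAL.includes(this.state)) this.transition("failed");
    }, maxLifetimeMs);
  }

  transition(next: CallState): void {
    this.state = next;
    if (TERMINAL.includes(next) && this.watchdog) {
      clearTimeout(this.watchdog); // stop the timer once we are terminal
      this.watchdog = undefined;
    }
  }
}
```

Persisting the deadline alongside the call record would also address the "zombie calls after restart" finding, since a recovering process can re-arm expired watchdogs.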

Channel Plugins

Both found: inconsistent error handling across channels (MSTeams sophisticated, others raw), rate limiting varies wildly, no capability mismatch detection

Webhook Security

Both found: no replay prevention for Twilio webhooks, timing-safe comparison is mostly correct but has edge cases
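The "mostly correct but has edge cases" caveat around timing-safe comparison usually comes down to length handling: Node's `crypto.timingSafeEqual` throws when inputs differ in length, and a naive early length check leaks that difference. A common fix -- sketched here, not taken from OpenClaw -- is to HMAC both values first so the compared buffers are always the same length:

```typescript
import { createHmac, timingSafeEqual, randomBytes } from "node:crypto";

// timingSafeEqual(a, b) throws if a.length !== b.length, and checking
// lengths up front leaks information. HMAC-ing both values with an
// ephemeral random key makes the compared buffers a fixed 32 bytes.
export function safeCompare(a: string, b: string): boolean {
  const key = randomBytes(32);
  const ha = createHmac("sha256", key).update(a).digest();
  const hb = createHmac("sha256", key).update(b).digest();
  return timingSafeEqual(ha, hb);
}
```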

Where Codex Went Deeper

On the 10 subsystems Codex investigated, its reports were highly detailed:

  • Agent Orchestration: Codex mapped the complete 7-layer tool permission hierarchy and 14-section system prompt composition -- details Claude didn't capture at this level
  • MSTeams Messaging: Codex traced the full send path including token delegation, conversation store atomicity (temp-file-then-rename), and the deliberate design decision to NOT retry ambiguous transport errors to prevent duplicate posts
  • BlueBubbles Monitor: Codex found the specific edge case where attachment download failures produce placeholder text with undefined MediaUrls -- the agent sees a placeholder but has no actual media to process

Where Claude Went Deeper

Claude's 8 additional tasks found categories of issues Codex never touched:

  • Multi-Agent Spawning: No hard limit on subagent spawn count, agent-to-agent allowlist bypass, no rate limiting on tool invocations
  • Hook System: Dynamic code loading from user-configurable paths (high severity), hooks can modify core gateway data, no infinite loop protection
  • Plugin API Surface: Plugins receive full config with credential paths, tools auto-register without user consent, HTTP handlers lack auth
  • Canvas Host: HTTP server binds to 0.0.0.0 with no authentication, canvas.eval allows arbitrary JS execution without sandboxing
  • Streaming Responses: Unbounded buffer growth accumulating without size limits
  • Memory/LanceDB: No bloat prevention -- no cache limit, no cleanup, no compaction
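The unbounded-buffer finding in the streaming path has a conventional mitigation: cap the accumulated bytes and fail fast when the cap is exceeded. A sketch with a hypothetical limit -- illustrative only, not OpenClaw's code:

```typescript
// Collect an async byte stream into one buffer, refusing to grow past
// maxBytes (default: a hypothetical 4 MiB cap).
export async function collectStream(
  chunks: AsyncIterable<Uint8Array>,
  maxBytes = 4 * 1024 * 1024
): Promise<Uint8Array> {
  const parts: Uint8Array[] = [];
  let total = 0;
  for await (const chunk of chunks) {
    total += chunk.byteLength;
    if (total > maxBytes) {
      // Fail fast instead of letting memory grow without bound.
      throw new Error(`stream exceeded ${maxBytes}-byte limit`);
    }
    parts.push(chunk);
  }
  const out = new Uint8Array(total);
  let offset = 0;
  for (const p of parts) {
    out.set(p, offset);
    offset += p.byteLength;
  }
  return out;
}
```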

Token Economics

Where the Tokens Went

Phase | Codex | Claude | What Happened
Planning | 110K | 42K | Codex spent more tokens deciding; Claude decided faster but planned more tasks
Code Exploration | 4.90M | 12.48M | Claude read 2.5x more files across 1.8x more tasks
Final Evaluation | 78K | 2.11M | Claude re-read source files to verify findings; Codex produced output directly
Test Quality | 2.68M | 2.36M | Nearly identical -- same task, same prompt

The exploration phase dominates: the overwhelming majority of tokens went to agents reading source files. Planning was a rounding error for both -- 110K tokens for Codex, 42K for Claude -- and even Claude's 2.11M-token evaluation phase was largely spent re-reading source to verify findings.

Per-Task Token Variance

Tokens per task | Codex | Claude
Lightest task | 119K | 196K
Heaviest task | 1.22M | 1.63M
Average per task | 490K | 693K
Variance range | 10x | 8x

The heaviest tasks for both agents were investigations into deeply connected subsystems -- webhook handling and plugin architectures -- where the agent followed dependency chains across many files.

The Agent CLI Overhead Tax

The Codex CLI injects a ~7,500 token system prompt on every turn (tool definitions, safety rules, sandbox config). Claude Code's overhead is comparable. Over multiple turns per task, this compounds -- but it's still dwarfed by the 500K-1.6M tokens of actual code exploration per task.

Neither agent used prompt caching in this configuration. With caching, the per-turn overhead would amortize to near-zero on subsequent turns.
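A back-of-envelope check of that claim, using the benchmark's own numbers and assuming the same ~7,500-token per-call overhead for both CLIs (the article only says Claude's is "comparable"):

```typescript
// Per-turn system-prompt overhead as a share of each run's total tokens.
const OVERHEAD_PER_CALL = 7_500;

function overheadShare(apiCalls: number, totalTokens: number): number {
  return (apiCalls * OVERHEAD_PER_CALL) / totalTokens;
}

const codex = overheadShare(15, 7_760_000);   // 112.5K of 7.76M, ~1.4%
const claude = overheadShare(29, 17_160_000); // 217.5K of 17.16M, ~1.3%
```

Even uncached, the scaffolding overhead is one to two percent of each run, which is consistent with the "dwarfed by exploration" observation above.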


When to Use Which Agent

Use Codex When: Speed and Architecture Understanding Matter

Codex excels at deep subsystem mapping. Its 10-task approach produced richly detailed reports -- complete state machine documentation, multi-layer permission hierarchies, full request-path traces with design rationale.

  • Onboarding onto an unfamiliar codebase -- Codex's subsystem reports read like architectural documentation
  • Understanding how a specific feature works end-to-end -- the MSTeams send-path trace and voice call state machine map were thorough enough to inform design decisions
  • Routine health checks -- 23 minutes and 7.8M tokens for a solid assessment of 10 core subsystems
  • Token budget constraints -- half the cost of Claude for core architectural and runtime findings

Weakness: Can miss entire risk categories. Zero security findings on a codebase with real security issues is a significant blind spot.

Use Claude Code When: Security and Risk Coverage Matter

Claude excels at risk-oriented investigation. Its 18-task approach of narrowly scoped, threat-model-driven investigations found 2.7x more issues -- and, critically, found the issues that matter most.

  • Assessing security posture -- 15 security issues including unauthenticated endpoints, unsandboxed plugin APIs, and dynamic code loading
  • Preparing for a security review or audit -- findings map directly to OWASP-style categories
  • Evaluating a codebase before acquisition or major investment -- comprehensive coverage across 18 areas
  • Finding cross-cutting issues -- problems that span subsystem boundaries

Weakness: 1.5x slower and 2.2x more expensive. Shallower per-subsystem depth -- more findings, less architectural understanding of each area.

The Hybrid Approach

Run Codex first for a fast architectural map -- understand the subsystems, identify the complex areas, get a baseline. Then run Claude second on flagged areas -- targeted security and risk investigations where they matter most. This gives you Codex's speed and depth where it's strong (architecture, runtime) and Claude's thoroughness where it's strong (security, cross-cutting risks) -- without paying Claude's token cost for the entire codebase.


What This Might Mean for Code Generation

We tested these agents as readers, not writers. But the most revealing signal isn't in the findings themselves -- it's in how each agent decomposed the problem before doing any work.

Given an open-ended prompt ("analyze this codebase for health issues"), each agent made a planning decision that defined everything that followed. Codex broke the codebase into 10 architectural subsystems and investigated each one deeply. Claude broke it into 18 risk-oriented concerns and investigated each one narrowly. Same prompt, same model -- completely different decomposition strategy.

That planning behavior is the clearest window into how these agents would approach code generation.

Codex: Targeted, Subsystem-Aware

Codex's instinct is to identify the relevant subsystem and go deep. It mapped state machines exhaustively, traced send paths end-to-end, documented multi-layer permission hierarchies. For code generation, this suggests an agent that will understand the local architecture well, produce code that fits existing patterns, and work fast -- but stay within its lane. Just as it missed security issues that crossed subsystem boundaries, it might implement a feature without considering how it interacts with authentication, error propagation, or adjacent modules.

Codex is the agent you'd want for well-scoped tasks: "add retry logic to the message sender" or "refactor the state machine to handle TTS failures."

Claude Code: Holistic, Risk-Aware

Claude's instinct is to survey the landscape and identify risks across boundaries. It didn't just look at the plugin system -- it created separate investigations for plugin API security, hook system lifecycle, and extension isolation. For code generation, this suggests an agent that will read more widely before writing, consider what could go wrong, and touch more files per change -- adding input validation, checking auth boundaries, and handling edge cases without being explicitly prompted.

Claude is the agent you'd want for changes with broad implications: "add a new channel plugin type" or "implement rate limiting across the gateway."

The Planning Signal

The strongest predictor from this benchmark isn't speed or token count -- it's the planning step. In roughly 1% of total tokens, each agent revealed its fundamental approach to problem decomposition:

Codex thinks in subsystems

"What are the components, and what does each one do?"

Claude thinks in risks

"What could go wrong, and where are the boundaries?"

When writing code, this likely translates to: Codex builds the thing you asked for, well-integrated with its immediate surroundings. Claude builds the thing you asked for, plus the guardrails you didn't ask for -- at the cost of time and tokens.

Neither approach is better. They're different tools for different problems. But we haven't tested code generation yet. That's the next benchmark.


The Invisible Variable

Same model, same temperature, same prompts, same tools -- but the agent CLI's scaffolding changed everything. The system prompts, tool definitions, exploration patterns, and instruction-following behavior that each CLI wraps around the model produced a 2.7x difference in findings and a 2.2x difference in cost.

The model is constant. The agent is the variable.

This has practical implications for anyone building on top of coding agents: your choice of agent CLI isn't just a preference or vendor lock-in decision. It's a decision about what kinds of problems your system will find -- and what kinds it will miss.

Run this analysis on your codebase

Octokraft gives you multi-agent code health analysis with Claude Code, Codex CLI, and more.

Try Octokraft