A note on scope: This benchmark measures how each agent performs as a code analysis tool -- reading, exploring, and reasoning about an existing codebase. We did not test code generation, refactoring, or bug fixing. The agents were read-only investigators, not authors. That said, the behavioral differences we observed -- planning strategies, exploration depth, tool usage patterns -- are properties of the agent CLI itself, not the task. We speculate on what this means for code generation at the end.
The Setup
Target repository: OpenClaw (~1M LOC, TypeScript/Node.js)
Task: Automated code health analysis. Plan investigation tasks, execute them in parallel, evaluate and structure the findings.
LLM: GLM-5:cloud via Ollama's Responses API. Identical model, identical endpoint, identical API key for both runs.
Agent CLIs:
- Codex CLI (OpenAI) with a custom model provider pointing at the Ollama endpoint
- Claude Code (Anthropic) with the same Ollama endpoint via proxy mode
Both agents had identical tool access: file reading, grep, glob, shell execution, and file writing.
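For concreteness, pointing Codex CLI at a local Ollama endpoint looks roughly like this. This is a hedged sketch: the key names follow the Codex CLI `config.toml` conventions as documented at time of writing, but the exact schema may differ across CLI versions, and the model name mirrors the one used in this benchmark.

```toml
# ~/.codex/config.toml -- custom model provider pointing at Ollama (sketch;
# verify keys against your Codex CLI version)
model = "glm-5:cloud"
model_provider = "ollama"

[model_providers.ollama]
name = "Ollama"
base_url = "http://localhost:11434/v1"
wire_api = "responses"
```

Claude Code was pointed at the same endpoint through its proxy mode, which amounts to overriding the API base URL (an `ANTHROPIC_BASE_URL`-style environment override) so requests route to Ollama instead of Anthropic's API.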
The Numbers
| Metric | Codex CLI | Claude Code | Delta |
|---|---|---|---|
| Total Duration | 22m 56s | 35m 21s | 1.5x |
| Investigation Tasks | 10 | 18 | 1.8x |
| Issues Found | 27 | 72 | 2.7x |
| Critical Issues | 0 | 5 | -- |
| High Issues | 6 | 14 | 2.3x |
| Total Tokens | 7.76M | 17.16M | 2.2x |
| LLM API Calls | 15 | 29 | 1.9x |
| Issues per 1M Tokens | 3.5 | 4.2 | +20% |
How Each Agent Planned the Work
Both agents received the same planning prompt with the same codebase context. They produced very different investigation strategies.
Codex: 10 Focused Subsystem Investigations
Codex planned 10 tasks, each targeting a specific subsystem. The strategy was subsystem-by-subsystem, architecturally scoped:
| # | Investigation | Focus |
|---|---|---|
| 1 | Gateway Ingress | Message entry and routing from WhatsApp/Telegram through to agents |
| 2 | Agent Orchestration | Model selection, fallback chains, tool execution |
| 3 | Voice Call State Machine | State transitions, timers, persistence |
| 4 | Webhook Security | Twilio/Plivo signature validation, replay attacks |
| 5 | Channel Plugins | Interface contracts across messaging channels |
| 6 | Config System | Hot reload, validation, migration |
| 7 | MSTeams Messaging | Complex send path, token refresh, error classification |
| 8 | BlueBubbles Monitor | Webhook processing, attachment handling |
| 9 | Extension Runtime | Plugin loading, isolation boundaries |
| 10 | Error Propagation | Cross-cutting error handling patterns |
Claude Code: 18 Narrower, Risk-Oriented Investigations
Claude planned 18 tasks -- nearly twice as many -- with a different philosophy. Instead of broad subsystem reviews, Claude created narrowly scoped, risk-oriented investigations:
| # | Investigation | Focus |
|---|---|---|
| 1 | Message Routing Pipeline | Delivery guarantees, session key failures |
| 2 | Gateway Server Lifecycle | Crash recovery, signal handling |
| 3 | Tool Execution Security | Command injection, path traversal, sandboxing |
| 4 | Model Fallback Chain | Auth lockout, streaming failures |
| 5 | Channel Plugin System | Reconnection strategies, capability mismatches |
| 6 | Webhook Security | Timing attacks, replay prevention, SSRF |
| 7 | Session Store Concurrency | Race conditions, partial writes, disk failures |
| 8 | Extension Channel Lifecycle | Dependency isolation, crash containment |
| 9 | Voice Call State Machine | Timer races, stuck states |
| 10 | Multi-Agent Spawning | Infinite loops, spawn limits, isolation |
| 11 | Config Validation | Startup failures, silent defaults |
| 12 | Plugin API Surface | Security boundaries, unauthorized registration |
| 13 | Hook System Lifecycle | Dynamic code loading, infinite loops |
| 14 | Memory Search & Embeddings | Dimension mismatches, store bloat |
| 15 | Streaming Response Handling | Buffer growth, cancellation |
| 16 | Canvas Host Browser Control | XSS, unauthenticated HTTP, arbitrary JS |
| 17 | Mobile App Gateway | Disconnection handling, offline queueing |
| 18 | Account Resolution & Caching | Credential protection, cache invalidation |
Claude covered areas Codex didn't touch at all: multi-agent spawning, hook system security, memory/embeddings, streaming buffers, canvas/browser control, and mobile app connections. These additional tasks found 8 of the 15 security issues and 12 of the 32 runtime issues.
What Each Agent Actually Found
The Security Gap
This is the most striking difference. Claude found 15 security issues including 5 critical ones. Codex found zero.
| Critical Finding | File | Confidence |
|---|---|---|
| Plugin HTTP handlers bypass gateway authentication | server/plugins-http.ts | 95% |
| No file system sandboxing for plugins | plugins/registry.ts | 95% |
| Entire BlueBubbles channel untested | extensions/bluebubbles/ | 90% |
| Entire Signal channel untested | extensions/signal/ | 90% |
| Gateway server lifecycle has zero tests | server.impl.ts | 95% |
Claude's medium security findings included timing attacks in token comparison, webhook replay vulnerabilities, dynamic code loading from user-configurable paths, an unauthenticated canvas host HTTP server, and missing sandboxing for plugin file system access.
Codex investigated webhook security and did find the replay attack vulnerability and timing-safe comparison issues -- but these observations stayed in its exploration notes and never surfaced as structured findings. The knowledge was there; the final evaluation didn't capture it.
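One way to close that gap is to force every exploration note through a structured finding schema at evaluation time, so each observation is either promoted to a finding or explicitly surfaced as unpromoted rather than silently dropped. A minimal sketch in TypeScript -- the schema and field names here are hypothetical, invented for this illustration, not OpenClaw's or either CLI's actual format:

```typescript
// Hypothetical finding schema; the fields mirror the columns used in this
// write-up (file, severity, confidence) but are invented for this sketch.
interface Finding {
  title: string;
  file: string;
  severity: "critical" | "high" | "medium" | "low";
  confidence: number; // 0-100
}

// Promote raw exploration notes into findings. Notes the promoter rejects
// are returned explicitly instead of vanishing from the final report.
function promoteNotes(
  notes: string[],
  promote: (note: string) => Finding | null
): { findings: Finding[]; unpromoted: string[] } {
  const findings: Finding[] = [];
  const unpromoted: string[] = [];
  for (const note of notes) {
    const finding = promote(note);
    if (finding) findings.push(finding);
    else unpromoted.push(note);
  }
  return { findings, unpromoted };
}
```

With a step like this, Codex's replay-attack observation would have had to show up somewhere in the output, either as a finding or as an unpromoted note for a human to triage.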
Where Both Agents Agreed
Both agents investigated the voice call state machine, webhook security, channel plugins, config system, and extension runtime. Their findings overlapped significantly:
Voice Call State Machine
Both found: TTS failure leaves calls in inconsistent state (high severity), calls can get stuck in non-terminal states, timer race conditions, persistence gaps after restart ("zombie calls")
Channel Plugins
Both found: inconsistent error handling across channels (MSTeams sophisticated, others raw), rate limiting varies wildly, no capability mismatch detection
Webhook Security
Both found: no replay prevention for Twilio webhooks, timing-safe comparison is mostly correct but has edge cases
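The usual edge case in this area is length handling: Node's `crypto.timingSafeEqual` throws when its inputs differ in length, so a naive wrapper either crashes on malformed input or short-circuits on length and leaks it. A generic sketch of the standard mitigation -- hash both sides to fixed-length buffers before comparing (this is a common pattern, not OpenClaw's actual code):

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Compare a received webhook signature against the expected one in constant
// time. HMAC-ing both sides first yields fixed-length buffers, so
// timingSafeEqual never throws on a length mismatch and the comparison
// leaks neither content nor length.
function signaturesMatch(received: string, expected: string, key: string): boolean {
  const a = createHmac("sha256", key).update(received).digest();
  const b = createHmac("sha256", key).update(expected).digest();
  return timingSafeEqual(a, b);
}
```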
Where Codex Went Deeper
On the 10 subsystems Codex investigated, its reports were highly detailed:
- Agent Orchestration: Codex mapped the complete 7-layer tool permission hierarchy and 14-section system prompt composition -- details Claude didn't capture at this level
- MSTeams Messaging: Codex traced the full send path including token delegation, conversation store atomicity (temp-file-then-rename), and the deliberate design decision to NOT retry ambiguous transport errors to prevent duplicate posts
- BlueBubbles Monitor: Codex found the specific edge case where attachment download failures produce placeholder text with undefined MediaUrls -- the agent sees a placeholder but has no actual media to process
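The temp-file-then-rename pattern Codex documented in the conversation store is worth spelling out, because the rename is what makes the write atomic on POSIX file systems. A generic sketch of the pattern, not the actual OpenClaw implementation:

```typescript
import { writeFileSync, renameSync } from "node:fs";

// Write the full payload to a temp file, then rename it over the target.
// rename() is atomic on POSIX, so a concurrent reader sees either the old
// file or the new one -- never a partially written store.
function atomicWrite(path: string, data: string): void {
  const tmp = `${path}.tmp-${process.pid}`;
  writeFileSync(tmp, data);
  renameSync(tmp, path);
}
```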
Where Claude Went Deeper
Claude's 8 additional tasks found categories of issues Codex never touched:
- Multi-Agent Spawning: No hard limit on subagent spawn count, agent-to-agent allowlist bypass, no rate limiting on tool invocations
- Hook System: Dynamic code loading from user-configurable paths (high severity), hooks can modify core gateway data, no infinite loop protection
- Plugin API Surface: Plugins receive full config with credential paths, tools auto-register without user consent, HTTP handlers lack auth
- Canvas Host: HTTP server binds to 0.0.0.0 with no authentication, `canvas.eval` allows arbitrary JS execution without sandboxing
- Streaming Responses: Unbounded buffer growth -- chunks accumulate without size limits
- Memory/LanceDB: No bloat prevention -- no cache limit, no cleanup, no compaction
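The streaming finding is, in principle, a small fix: cap accumulation and fail (or flush) at the limit. A minimal sketch of a bounded buffer, assuming the accumulation happens in application code -- hypothetical, not OpenClaw's implementation:

```typescript
// Accumulates streamed chunks up to a hard byte limit. Throwing (or flushing)
// at the cap is what prevents the unbounded growth described above.
class BoundedBuffer {
  private chunks: Buffer[] = [];
  private size = 0;

  constructor(private readonly maxBytes: number) {}

  push(chunk: Buffer): void {
    if (this.size + chunk.length > this.maxBytes) {
      throw new Error(`stream exceeded ${this.maxBytes} byte limit`);
    }
    this.chunks.push(chunk);
    this.size += chunk.length;
  }

  toString(): string {
    return Buffer.concat(this.chunks).toString("utf8");
  }
}
```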
Token Economics
Where the Tokens Went
| Phase | Codex | Claude | What Happened |
|---|---|---|---|
| Planning | 110K | 42K | Codex spent more tokens deciding; Claude decided faster but planned more tasks |
| Code Exploration | 4.90M | 12.48M | Claude read 2.5x more files across 1.8x more tasks |
| Final Evaluation | 78K | 2.11M | Claude re-read source files to verify findings; Codex produced output directly |
| Test Quality | 2.68M | 2.36M | Nearly identical -- same task, same prompt |
Exploration dominates for both agents: reading source files consumed the bulk of each token budget. For Codex, planning and evaluation together account for under 3% of its total; Claude is the exception, spending 2.11M tokens -- about 12% of its total -- re-reading source files during evaluation.
Per-Task Token Variance
| | Codex | Claude |
|---|---|---|
| Lightest task | 119K tokens | 196K tokens |
| Heaviest task | 1.22M tokens | 1.63M tokens |
| Average per task | 490K tokens | 693K tokens |
| Variance range | 10x | 8x |
The heaviest tasks for both agents were investigations into deeply connected subsystems -- webhook handling and plugin architectures -- where the agent followed dependency chains across many files.
The Agent CLI Overhead Tax
The Codex CLI injects a ~7,500 token system prompt on every turn (tool definitions, safety rules, sandbox config). Claude Code's overhead is comparable. Over multiple turns per task, this compounds -- but it's still dwarfed by the 500K-1.6M tokens of actual code exploration per task.
Neither agent used prompt caching in this configuration. With caching, the per-turn overhead would amortize to near-zero on subsequent turns.
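The back-of-the-envelope math, using the benchmark's own numbers and treating each API call as one turn carrying the ~7.5K-token scaffold (an assumption, since Claude Code's exact overhead wasn't measured):

```typescript
// Rough share of total tokens spent on per-turn CLI scaffolding.
function overheadShare(calls: number, perCallTokens: number, totalTokens: number): number {
  return (calls * perCallTokens) / totalTokens;
}

const codexShare = overheadShare(15, 7_500, 7_760_000);   // ~112.5K tokens, ~1.4%
const claudeShare = overheadShare(29, 7_500, 17_160_000); // ~217.5K tokens, ~1.3%
```

Either way, scaffolding overhead lands around 1-2% of the run -- real, but nowhere near the exploration cost.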
When to Use Which Agent
Use Codex When: Speed and Architecture Understanding Matter
Codex excels at deep subsystem mapping. Its 10-task approach produced richly detailed reports -- complete state machine documentation, multi-layer permission hierarchies, full request-path traces with design rationale.
- Onboarding onto an unfamiliar codebase -- Codex's subsystem reports read like architectural documentation
- Understanding how a specific feature works end-to-end -- the MSTeams send-path trace and voice call state machine map were thorough enough to inform design decisions
- Routine health checks -- 23 minutes and 7.8M tokens for a solid assessment of 10 core subsystems
- Token budget constraints -- half the cost of Claude for core architectural and runtime findings
Weakness: Can miss entire risk categories. Zero security findings on a codebase with real security issues is a significant blind spot.
Use Claude Code When: Security and Risk Coverage Matter
Claude excels at risk-oriented investigation. Its 18-task approach of narrowly scoped, threat-model-driven investigations found 2.7x more issues -- and, critically, surfaced the issues that matter most.
- Assessing security posture -- 15 security issues including unauthenticated endpoints, unsandboxed plugin APIs, and dynamic code loading
- Preparing for a security review or audit -- findings map directly to OWASP-style categories
- Evaluating a codebase before acquisition or major investment -- comprehensive coverage across 18 areas
- Finding cross-cutting issues -- problems that span subsystem boundaries
Weakness: 1.5x slower and 2.2x more expensive. Shallower per-subsystem depth -- more findings, less architectural understanding of each area.
The Hybrid Approach
Run Codex first for a fast architectural map -- understand the subsystems, identify the complex areas, get a baseline. Then run Claude second on flagged areas -- targeted security and risk investigations where they matter most. This gives you Codex's speed and depth where it's strong (architecture, runtime) and Claude's thoroughness where it's strong (security, cross-cutting risks) -- without paying Claude's token cost for the entire codebase.
What This Might Mean for Code Generation
We tested these agents as readers, not writers. But the most revealing signal isn't in the findings themselves -- it's in how each agent decomposed the problem before doing any work.
Given an open-ended prompt ("analyze this codebase for health issues"), each agent made a planning decision that defined everything that followed. Codex broke the codebase into 10 architectural subsystems and investigated each one deeply. Claude broke it into 18 risk-oriented concerns and investigated each one narrowly. Same prompt, same model -- completely different decomposition strategy.
That planning behavior is the clearest window into how these agents would approach code generation.
Codex: Targeted, Subsystem-Aware
Codex's instinct is to identify the relevant subsystem and go deep. It mapped state machines exhaustively, traced send paths end-to-end, documented multi-layer permission hierarchies. For code generation, this suggests an agent that will understand the local architecture well, produce code that fits existing patterns, and work fast -- but stay within its lane. Just as it missed security issues that crossed subsystem boundaries, it might implement a feature without considering how it interacts with authentication, error propagation, or adjacent modules.
Codex is the agent you'd want for well-scoped tasks: "add retry logic to the message sender" or "refactor the state machine to handle TTS failures."
Claude Code: Holistic, Risk-Aware
Claude's instinct is to survey the landscape and identify risks across boundaries. It didn't just look at the plugin system -- it created separate investigations for plugin API security, hook system lifecycle, and extension isolation. For code generation, this suggests an agent that will read more widely before writing, consider what could go wrong, and touch more files per change -- adding input validation, checking auth boundaries, and handling edge cases without being explicitly prompted.
Claude is the agent you'd want for changes with broad implications: "add a new channel plugin type" or "implement rate limiting across the gateway."
The Planning Signal
The strongest predictor from this benchmark isn't speed or token count -- it's the planning step. In roughly 1% of its total tokens (110K for Codex, 42K for Claude), each agent revealed its fundamental approach to problem decomposition:
Codex thinks in subsystems
"What are the components, and what does each one do?"
Claude thinks in risks
"What could go wrong, and where are the boundaries?"
When writing code, this likely translates to: Codex builds the thing you asked for, well-integrated with its immediate surroundings. Claude builds the thing you asked for, plus the guardrails you didn't ask for -- at the cost of time and tokens.
Neither approach is better. They're different tools for different problems. But we haven't tested code generation yet. That's the next benchmark.
The Invisible Variable
Same model, same temperature, same prompts, same tools -- but the agent CLI's scaffolding changed everything. The system prompts, tool definitions, exploration patterns, and instruction-following behavior that each CLI wraps around the model produced a 2.7x difference in findings and a 2.2x difference in cost.
The model is constant. The agent is the variable.
This has practical implications for anyone building on top of coding agents: your choice of agent CLI isn't just a preference or vendor lock-in decision. It's a decision about what kinds of problems your system will find -- and what kinds it will miss.