A note on scope: This benchmark measures how each agent performs as a code analysis tool -- reading, exploring, and reasoning about an existing codebase. We did not test code generation, refactoring, or bug fixing. The agents were read-only investigators, not authors. That said, the behavioral differences we observed -- planning strategies, exploration depth, tool usage patterns -- are properties of the agent CLI itself, not the task. We speculate on what this means for code generation at the end.
The Setup
Target repository: OpenClaw (~1M LOC, TypeScript/Node.js)
Task: Automated code health analysis. Plan investigation tasks, execute them in parallel, evaluate and structure the findings.
LLM: GLM-5:cloud via Ollama's Responses API. Identical model, identical endpoint, identical API key for both runs.
Agent CLIs:
- Codex CLI (OpenAI) with a custom model provider pointing at the Ollama endpoint
- Claude Code (Anthropic) with the same Ollama endpoint via proxy mode
Both agents had identical tool access: file reading, grep, glob, shell execution, and file writing.
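For concreteness, pointing Codex CLI at a local Ollama endpoint looks roughly like this. This is a hedged sketch: the key names follow the Codex CLI `config.toml` conventions as documented at time of writing, but the exact schema may differ across CLI versions, and the model name mirrors the one used in this benchmark.

```toml
# ~/.codex/config.toml -- custom model provider pointing at Ollama (sketch;
# verify keys against your Codex CLI version)
model = "glm-5:cloud"
model_provider = "ollama"

[model_providers.ollama]
name = "Ollama"
base_url = "http://localhost:11434/v1"
wire_api = "responses"
```

Claude Code was pointed at the same endpoint through its proxy mode, which amounts to overriding the API base URL (an `ANTHROPIC_BASE_URL`-style environment override) so requests route to Ollama instead of Anthropic's API.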
The Numbers
| Metric | Codex CLI | Claude Code | Delta |
|---|---|---|---|
| Total Duration | 22m 56s | 35m 21s | 1.5x |
| Investigation Tasks | 10 | 18 | 1.8x |
| Issues Found | 27 | 72 | 2.7x |
| Critical Issues | 0 | 5 | -- |
| High Issues | 6 | 14 | 2.3x |
| Total Tokens | 7.76M | 17.16M | 2.2x |
| LLM API Calls | 15 | 29 | 1.9x |
| Issues per 1M Tokens | 3.5 | 4.2 | +20% |
How Each Agent Planned the Work
Both agents received the same planning prompt with the same codebase context. They produced very different investigation strategies.
Codex: 10 Focused Subsystem Investigations
Codex planned 10 tasks, each targeting a specific subsystem. The strategy was subsystem-by-subsystem, architecturally scoped:
| # | Investigation | Focus |
|---|---|---|
| 1 | Gateway Ingress | Message entry and routing from WhatsApp/Telegram through to agents |
| 2 | Agent Orchestration | Model selection, fallback chains, tool execution |
| 3 | Voice Call State Machine | State transitions, timers, persistence |
| 4 | Webhook Security | Twilio/Plivo signature validation, replay attacks |
| 5 | Channel Plugins | Interface contracts across messaging channels |
| 6 | Config System | Hot reload, validation, migration |
| 7 | MSTeams Messaging | Complex send path, token refresh, error classification |
| 8 | BlueBubbles Monitor | Webhook processing, attachment handling |
| 9 | Extension Runtime | Plugin loading, isolation boundaries |
| 10 | Error Propagation | Cross-cutting error handling patterns |
Claude Code: 18 Narrower, Risk-Oriented Investigations
Claude planned 18 tasks -- nearly twice as many -- with a different philosophy. Instead of broad subsystem reviews, Claude created narrowly scoped, risk-oriented investigations:
| # | Investigation | Focus |
|---|---|---|
| 1 | Message Routing Pipeline | Delivery guarantees, session key failures |
| 2 | Gateway Server Lifecycle | Crash recovery, signal handling |
| 3 | Tool Execution Security | Command injection, path traversal, sandboxing |
| 4 | Model Fallback Chain | Auth lockout, streaming failures |
| 5 | Channel Plugin System | Reconnection strategies, capability mismatches |
| 6 | Webhook Security | Timing attacks, replay prevention, SSRF |
| 7 | Session Store Concurrency | Race conditions, partial writes, disk failures |
| 8 | Extension Channel Lifecycle | Dependency isolation, crash containment |
| 9 | Voice Call State Machine | Timer races, stuck states |
| 10 | Multi-Agent Spawning | Infinite loops, spawn limits, isolation |
| 11 | Config Validation | Startup failures, silent defaults |
| 12 | Plugin API Surface | Security boundaries, unauthorized registration |
| 13 | Hook System Lifecycle | Dynamic code loading, infinite loops |
| 14 | Memory Search & Embeddings | Dimension mismatches, store bloat |
| 15 | Streaming Response Handling | Buffer growth, cancellation |
| 16 | Canvas Host Browser Control | XSS, unauthenticated HTTP, arbitrary JS |
| 17 | Mobile App Gateway | Disconnection handling, offline queueing |
| 18 | Account Resolution & Caching | Credential protection, cache invalidation |
Claude covered areas Codex didn't touch at all: multi-agent spawning, hook system security, memory/embeddings, streaming buffers, canvas/browser control, and mobile app connections. These additional tasks found 8 of the 15 security issues and 12 of the 32 runtime issues.
What Each Agent Actually Found
The Security Gap
This is the most striking difference. Claude found 15 security issues including 5 critical ones. Codex found zero.
| Critical Finding | File | Confidence |
|---|---|---|
| Plugin HTTP handlers bypass gateway authentication | server/plugins-http.ts | 95% |
| No file system sandboxing for plugins | plugins/registry.ts | 95% |
| Entire BlueBubbles channel untested | extensions/bluebubbles/ | 90% |
| Entire Signal channel untested | extensions/signal/ | 90% |
| Gateway server lifecycle has zero tests | server.impl.ts | 95% |
Claude's medium security findings included timing attacks in token comparison, webhook replay vulnerabilities, dynamic code loading from user-configurable paths, an unauthenticated canvas host HTTP server, and missing sandboxing for plugin file system access.
Codex investigated webhook security and did find the replay attack vulnerability and timing-safe comparison issues -- but these observations stayed in its exploration notes and never surfaced as structured findings. The knowledge was there; the final evaluation didn't capture it.
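One way to close that gap is to force every exploration note through a structured finding schema at evaluation time, so each observation is either promoted to a finding or explicitly surfaced as unpromoted rather than silently dropped. A minimal sketch in TypeScript -- the schema and field names here are hypothetical, invented for this illustration, not OpenClaw's or either CLI's actual format:

```typescript
// Hypothetical finding schema; the fields mirror the columns used in this
// write-up (file, severity, confidence) but are invented for this sketch.
interface Finding {
  title: string;
  file: string;
  severity: "critical" | "high" | "medium" | "low";
  confidence: number; // 0-100
}

// Promote raw exploration notes into findings. Notes the promoter rejects
// are returned explicitly instead of vanishing from the final report.
function promoteNotes(
  notes: string[],
  promote: (note: string) => Finding | null
): { findings: Finding[]; unpromoted: string[] } {
  const findings: Finding[] = [];
  const unpromoted: string[] = [];
  for (const note of notes) {
    const finding = promote(note);
    if (finding) findings.push(finding);
    else unpromoted.push(note);
  }
  return { findings, unpromoted };
}
```

With a step like this, Codex's replay-attack observation would have had to show up somewhere in the output, either as a finding or as an unpromoted note for a human to triage.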
Where Both Agents Agreed
Both agents investigated the voice call state machine, webhook security, channel plugins, config system, and extension runtime. Their findings overlapped significantly:
Voice Call State Machine
Both found: TTS failure leaves calls in inconsistent state (high severity), calls can get stuck in non-terminal states, timer race conditions, persistence gaps after restart ("zombie calls")
Channel Plugins
Both found: inconsistent error handling across channels (MSTeams sophisticated, others raw), rate limiting varies wildly, no capability mismatch detection
Webhook Security
Both found: no replay prevention for Twilio webhooks, timing-safe comparison is mostly correct but has edge cases
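The usual edge case in this area is length handling: Node's `crypto.timingSafeEqual` throws when its inputs differ in length, so a naive wrapper either crashes on malformed input or short-circuits on length and leaks it. A generic sketch of the standard mitigation -- hash both sides to fixed-length buffers before comparing (this is a common pattern, not OpenClaw's actual code):

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Compare a received webhook signature against the expected one in constant
// time. HMAC-ing both sides first yields fixed-length buffers, so
// timingSafeEqual never throws on a length mismatch and the comparison
// leaks neither content nor length.
function signaturesMatch(received: string, expected: string, key: string): boolean {
  const a = createHmac("sha256", key).update(received).digest();
  const b = createHmac("sha256", key).update(expected).digest();
  return timingSafeEqual(a, b);
}
```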
Where Codex Went Deeper
On the 10 subsystems Codex investigated, its reports were highly detailed:
- Agent Orchestration: Codex mapped the complete 7-layer tool permission hierarchy and 14-section system prompt composition -- details Claude didn't capture at this level
- MSTeams Messaging: Codex traced the full send path including token delegation, conversation store atomicity (temp-file-then-rename), and the deliberate design decision to NOT retry ambiguous transport errors to prevent duplicate posts
- BlueBubbles Monitor: Codex found the specific edge case where attachment download failures produce placeholder text with undefined MediaUrls -- the agent sees a placeholder but has no actual media to process
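The temp-file-then-rename pattern Codex documented in the conversation store is worth spelling out, because the rename is what makes the write atomic on POSIX file systems. A generic sketch of the pattern, not the actual OpenClaw implementation:

```typescript
import { writeFileSync, renameSync } from "node:fs";

// Write the full payload to a temp file, then rename it over the target.
// rename() is atomic on POSIX, so a concurrent reader sees either the old
// file or the new one -- never a partially written store.
function atomicWrite(path: string, data: string): void {
  const tmp = `${path}.tmp-${process.pid}`;
  writeFileSync(tmp, data);
  renameSync(tmp, path);
}
```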
Where Claude Went Deeper
Claude's 8 additional tasks found categories of issues Codex never touched:
- Multi-Agent Spawning: No hard limit on subagent spawn count, agent-to-agent allowlist bypass, no rate limiting on tool invocations
- Hook System: Dynamic code loading from user-configurable paths (high severity), hooks can modify core gateway data, no infinite loop protection
- Plugin API Surface: Plugins receive full config with credential paths, tools auto-register without user consent, HTTP handlers lack auth
- Canvas Host: HTTP server binds to 0.0.0.0 with no authentication, `canvas.eval` allows arbitrary JS execution without sandboxing
- Streaming Responses: Unbounded buffer growth -- chunks accumulate without size limits
- Memory/LanceDB: No bloat prevention -- no cache limit, no cleanup, no compaction
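The streaming finding is, in principle, a small fix: cap accumulation and fail (or flush) at the limit. A minimal sketch of a bounded buffer, assuming the accumulation happens in application code -- hypothetical, not OpenClaw's implementation:

```typescript
// Accumulates streamed chunks up to a hard byte limit. Throwing (or flushing)
// at the cap is what prevents the unbounded growth described above.
class BoundedBuffer {
  private chunks: Buffer[] = [];
  private size = 0;

  constructor(private readonly maxBytes: number) {}

  push(chunk: Buffer): void {
    if (this.size + chunk.length > this.maxBytes) {
      throw new Error(`stream exceeded ${this.maxBytes} byte limit`);
    }
    this.chunks.push(chunk);
    this.size += chunk.length;
  }

  toString(): string {
    return Buffer.concat(this.chunks).toString("utf8");
  }
}
```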
Token Economics
Where the Tokens Went
| Phase | Codex | Claude | What Happened |
|---|---|---|---|
| Planning | 110K | 42K | Codex spent more tokens deciding; Claude decided faster but planned more tasks |
| Code Exploration | 4.90M | 12.48M | Claude read 2.5x more files across 1.8x more tasks |
| Final Evaluation | 78K | 2.11M | Claude re-read source files to verify findings; Codex produced output directly |
| Test Quality | 2.68M | 2.36M | Nearly identical -- same task, same prompt |
Exploration dominates for both agents: reading source files consumed the bulk of each token budget. For Codex, planning and evaluation together account for under 3% of its total; Claude is the exception, spending 2.11M tokens -- about 12% of its total -- re-reading source files during evaluation.
Per-Task Token Variance
| | Codex | Claude |
|---|---|---|
| Lightest task | 119K tokens | 196K tokens |
| Heaviest task | 1.22M tokens | 1.63M tokens |
| Average per task | 490K tokens | 693K tokens |
| Variance range | 10x | 8x |
The heaviest tasks for both agents were investigations into deeply connected subsystems -- webhook handling and plugin architectures -- where the agent followed dependency chains across many files.
The Agent CLI Overhead Tax
The Codex CLI injects a ~7,500 token system prompt on every turn (tool definitions, safety rules, sandbox config). Claude Code's overhead is comparable. Over multiple turns per task, this compounds -- but it's still dwarfed by the 500K-1.6M tokens of actual code exploration per task.
Neither agent used prompt caching in this configuration. With caching, the per-turn overhead would amortize to near-zero on subsequent turns.
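The back-of-the-envelope math, using the benchmark's own numbers and treating each API call as one turn carrying the ~7.5K-token scaffold (an assumption, since Claude Code's exact overhead wasn't measured):

```typescript
// Rough share of total tokens spent on per-turn CLI scaffolding.
function overheadShare(calls: number, perCallTokens: number, totalTokens: number): number {
  return (calls * perCallTokens) / totalTokens;
}

const codexShare = overheadShare(15, 7_500, 7_760_000);   // ~112.5K tokens, ~1.4%
const claudeShare = overheadShare(29, 7_500, 17_160_000); // ~217.5K tokens, ~1.3%
```

Either way, scaffolding overhead lands around 1-2% of the run -- real, but nowhere near the exploration cost.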
When to Use Which Agent
Use Codex When: Speed and Architecture Understanding Matter
Codex excels at deep subsystem mapping. Its 10-task approach produced richly detailed reports -- complete state machine documentation, multi-layer permission hierarchies, full request-path traces with design rationale.
- Onboarding onto an unfamiliar codebase -- Codex's subsystem reports read like architectural documentation
- Understanding how a specific feature works end-to-end -- the MSTeams send-path trace and voice call state machine map were thorough enough to inform design decisions
- Routine health checks -- 23 minutes and 7.8M tokens for a solid assessment of 10 core subsystems
- Token budget constraints -- half the cost of Claude for core architectural and runtime findings
Weakness: Can miss entire risk categories. Zero security findings on a codebase with real security issues is a significant blind spot.
Use Claude Code When: Security and Risk Coverage Matter
Claude excels at risk-oriented investigation. Its 18-task approach of narrowly scoped, threat-model-driven investigations found 2.7x more issues -- and, critically, surfaced the issues that matter most.
- Assessing security posture -- 15 security issues including unauthenticated endpoints, unsandboxed plugin APIs, and dynamic code loading
- Preparing for a security review or audit -- findings map directly to OWASP-style categories
- Evaluating a codebase before acquisition or major investment -- comprehensive coverage across 18 areas
- Finding cross-cutting issues -- problems that span subsystem boundaries
Weakness: 1.5x slower and 2.2x more expensive. Shallower per-subsystem depth -- more findings, less architectural understanding of each area.
The Hybrid Approach
Run Codex first for a fast architectural map -- understand the subsystems, identify the complex areas, get a baseline. Then run Claude second on flagged areas -- targeted security and risk investigations where they matter most. This gives you Codex's speed and depth where it's strong (architecture, runtime) and Claude's thoroughness where it's strong (security, cross-cutting risks) -- without paying Claude's token cost for the entire codebase.
What This Might Mean for Code Generation
We tested these agents as readers, not writers. But the most revealing signal isn't in the findings themselves -- it's in how each agent decomposed the problem before doing any work.
Given an open-ended prompt ("analyze this codebase for health issues"), each agent made a planning decision that defined everything that followed. Codex broke the codebase into 10 architectural subsystems and investigated each one deeply. Claude broke it into 18 risk-oriented concerns and investigated each one narrowly. Same prompt, same model -- completely different decomposition strategy.
That planning behavior is the clearest window into how these agents would approach code generation.
Codex: Targeted, Subsystem-Aware
Codex's instinct is to identify the relevant subsystem and go deep. It mapped state machines exhaustively, traced send paths end-to-end, documented multi-layer permission hierarchies. For code generation, this suggests an agent that will understand the local architecture well, produce code that fits existing patterns, and work fast -- but stay within its lane. Just as it missed security issues that crossed subsystem boundaries, it might implement a feature without considering how it interacts with authentication, error propagation, or adjacent modules.
Codex is the agent you'd want for well-scoped tasks: "add retry logic to the message sender" or "refactor the state machine to handle TTS failures."
Claude Code: Holistic, Risk-Aware
Claude's instinct is to survey the landscape and identify risks across boundaries. It didn't just look at the plugin system -- it created separate investigations for plugin API security, hook system lifecycle, and extension isolation. For code generation, this suggests an agent that will read more widely before writing, consider what could go wrong, and touch more files per change -- adding input validation, checking auth boundaries, and handling edge cases without being explicitly prompted.
Claude is the agent you'd want for changes with broad implications: "add a new channel plugin type" or "implement rate limiting across the gateway."
The Planning Signal
The strongest predictor from this benchmark isn't speed or token count -- it's the planning step. In roughly 1% of its total tokens (110K for Codex, 42K for Claude), each agent revealed its fundamental approach to problem decomposition:
Codex thinks in subsystems
"What are the components, and what does each one do?"
Claude thinks in risks
"What could go wrong, and where are the boundaries?"
When writing code, this likely translates to: Codex builds the thing you asked for, well-integrated with its immediate surroundings. Claude builds the thing you asked for, plus the guardrails you didn't ask for -- at the cost of time and tokens.
Neither approach is better. They're different tools for different problems. But we haven't tested code generation yet. That's the next benchmark.
The Invisible Variable
Same model, same temperature, same prompts, same tools -- but the agent CLI's scaffolding changed everything. The system prompts, tool definitions, exploration patterns, and instruction-following behavior that each CLI wraps around the model produced a 2.7x difference in findings and a 2.2x difference in cost.
The model is constant. The agent is the variable.
This has practical implications for anyone building on top of coding agents: your choice of agent CLI isn't just a preference or vendor lock-in decision. It's a decision about what kinds of problems your system will find -- and what kinds it will miss.