When Good Analysis Misses Critical Problems
OpenClaw looked great by every standard metric. Clean code, strong architecture, massive test suite. It also had 8 CVEs. Here's why good analysis isn't enough.
What we're learning about AI-generated code, quality drift, and keeping velocity without the mess.
We ran the exact same code analysis on the same codebase with the same LLM. The only variable: which agent CLI executed the work. 2.7x difference in findings. 2.2x difference in cost.
We ran Octokraft's full analysis pipeline on three open-source coding agents. Codex scored an A, Gemini CLI a B+, OpenCode a B. Here's what the data shows, including why one ships with no linter at all.
285,000 GitHub stars. 8 CVEs. 42,000 exposed instances. We analyzed OpenClaw's 1.1 million lines of code across four dimensions. Here's what we found.
Most technical debt processes fail because they rely on separate tracking rituals. The sustainable model is tooling that surfaces friction where work already happens.
More articles coming soon.