The Question
Most engineers agree that typed, compiled languages produce safer code. The question is where that advantage shows up, how large it is, and how much team practice can close the gap.
The same Octokraft analysis pipeline that scored 24 open source projects for the heritage vs. AI-heavy benchmark ran the numbers by language instead. Rust, Go, TypeScript, Python, PHP, and Swift. The smallest project is 10,000 lines. The largest is 2.3 million. Same rubric, same weights, same severity multipliers. The only variable is the static analyzers matched to each ecosystem.
Language here means the primary language of the repository -- a distinction that matters for mixed codebases. Grafana and Mattermost are counted under TypeScript because their product surface is TypeScript-led even though their backend security findings include Go code. Dify is counted under Python even though it carries a large TypeScript frontend. The language tables show the failure modes that dominate each repository, not a clean compiler-only laboratory experiment.
Rust dominates, as expected. Python trails, as expected. But the spread within each language group scrambles the rankings, and the category-level breakdowns draw a line between where language mechanics stop and team practices start.
The Averages
| Language | Avg Score | Grade | Min | Max | Spread | Projects |
|---|---|---|---|---|---|---|
| Rust | 89.0 | A- | 82.3 | 99.5 | 17.2 | 4 |
| Go | 79.7 | B | 50.1 | 97.9 | 47.8 | 5 |
| TypeScript | 77.2 | B- | 53.3 | 95.5 | 42.2 | 14 |
| Swift | 69.1 | C+ | 1 project | 1 | ||
| Python | 64.9 | C | 41.3 | 79.2 | 37.9 | 6 |
| PHP | 42.4 | D- | 1 project | 1 | ||
Rust's lowest-scoring project (Codex, 82.3) beats the average of every other language group. The floor that the Rust compiler establishes is higher than the ceiling most other languages reach on average. Go's top project (Traefik, 97.9) beats every Rust project. But Go's bottom (gh CLI, 50.1) sits below every Python project except wttr.in.
Look at the spread column. Rust has a 17.2-point gap between its best and worst project. Go has 47.8. The compiler constrains the floor. It does not constrain the ceiling, and it does not constrain the variance between teams using the same language.
The average also hides what each language spends its score on. Rust's open issues cluster in testing and structural coupling -- the floor stays high because security and runtime almost never collapse. Go's spread is wide because one half of the group looks like Traefik and Terraform while the other half looks like Kubernetes and gh CLI, with completely different dominant issue categories. TypeScript's average is pulled in two directions: high-scoring repos like Supabase and Next.js on one side, and small or fast-moving repos with concentrated security/runtime debt on the other. Python's average is dragged down by framework-level security debt in Django and low-floor operational risk in wttr.in, even though scikit-learn still lands near the top of the language group.
Security: The Biggest Language Differentiator
Security is the category where language choice has the most effect. The gap between Rust (91.3 average) and Python (22.7) is 68.6 points. No other category produces a spread that wide.
| Language | Avg Security | Min | Max | Total Issues |
|---|---|---|---|---|
| Rust | 91.3 | 84.2 | 100.0 | 38 |
| Go | 71.9 | 22.2 | 100.0 | 2,233 |
| TypeScript | 62.9 | 3.4 | 100.0 | 494 |
| Python | 22.7 | 7.1 | 54.5 | 882 |
Rust's compiler eliminates whole classes of memory-unsafety, but the sampled issue descriptions tell a more specific story than "Rust is safer." Rust security findings mostly come from escape hatches and tooling edges: certificate validation bypass flags in Deno, subprocess and tar extraction issues in Goose/Codex Python SDK glue, and a small tail of shell or archive handling. Across four Rust projects totaling 1.4 million lines of code, there are 38 security issues. Most of the remaining risk sits where the code deliberately steps outside the compiler's guarantees.
Go's security average (71.9) is dominated by one repo, and the raw total overstates the severity. Kubernetes accounts for 2,210 of Go's 2,233 total security issues, but that inventory breaks down to 2 high, 2,207 medium, and 1 low. Kubernetes makes Go look noisy rather than catastrophically insecure. The useful takeaway is narrower: Go's escape hatches are permissive enough that a very large systems project can accumulate a lot of medium-severity exposure around unsafe, randomness, and TLS boundaries. But the sharpest critical/high security concentrations in this benchmark sit elsewhere -- in repos like Appwrite, Dify, Mattermost, Cal.com, n8n, and Open SaaS. Remove Kubernetes and the rest of Go looks much closer to Rust on security volume.
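The escape hatches named above are concrete code patterns, not abstractions. The sketch below is illustrative of the two most common medium-severity finding classes (predictable randomness and disabled TLS verification), not code from Kubernetes or any benchmarked repo; all names are hypothetical.

```go
package main

import (
	"crypto/rand"
	"crypto/tls"
	"encoding/hex"
	"fmt"
	mathrand "math/rand"
)

// Weak: a seeded math/rand stream is fully predictable -- fine for
// load-balancing jitter, dangerous for tokens, nonces, or session IDs.
func weakToken() string {
	r := mathrand.New(mathrand.NewSource(42)) // fixed seed: attacker-reproducible
	b := make([]byte, 16)
	for i := range b {
		b[i] = byte(r.Intn(256))
	}
	return hex.EncodeToString(b)
}

// Safe: crypto/rand draws from the OS CSPRNG.
func strongToken() (string, error) {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

func main() {
	// The TLS escape hatch: one field disables certificate validation
	// for every connection made through this config.
	insecure := &tls.Config{InsecureSkipVerify: true} // a classic medium-severity finding

	fmt.Println(weakToken() == weakToken()) // deterministic, hence flaggable
	tok, _ := strongToken()
	fmt.Println(len(tok)) // 32 hex characters, unpredictable
	fmt.Println(insecure.InsecureSkipVerify)
}
```

Both patterns compile and run without complaint, which is exactly why they accumulate: the compiler has no opinion about them, and only a static analyzer or reviewer pushes back.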
TypeScript (62.9) is not one security profile. The sampled descriptions split into two groups. One is feature-boundary failures in TypeScript-native code: missing authorization in Open SaaS, dangerouslySetInnerHTML and postMessage('*') patterns in Cal.com and Excalidraw, and CORS or command-execution issues in n8n. The other is mixed-platform debt inside TypeScript-primary products like Grafana and Mattermost, where backend Go security issues show up under the same repo score. TypeScript's average sits in the middle because the language protects neither against backend platform risk nor against trust-boundary shortcuts in application code.
Python's security average (22.7) has a different origin. The descriptions repeatedly point to string-built SQL, unescaped template/rendering surfaces, deserialization, and shell/process execution. Django alone contributes 654 issues, many around as_sql, mark_safe, and open redirect patterns. Dify adds hardcoded SQL expressions, logger credential disclosure, and SQLAlchemy text usage. Sweep adds direct command execution risk. The common denominator is not dynamic typing per se -- it is that Python frameworks and libraries expose powerful string-driven extension points, and the language does little to force those boundaries into safer forms.
Language shapes which security mistakes come naturally. Rust mistakes are usually explicit opt-outs. Go mistakes often come from unsafe or network-facing systems code. TypeScript mistakes often come from boundary modeling in application features. Python mistakes often come from framework surfaces that are flexible enough to let dangerous behavior look normal.
Runtime Safety: Where Compiler Mechanics Show Up
Rust averages 96.8 on runtime safety. Python averages 57.3. That 39.5-point gap traces directly to error handling models.
| Language | Avg Runtime | Error Handling Model |
|---|---|---|
| Rust | 96.8 | Result<T, E> + pattern matching, compiler-enforced |
| Go | 73.1 | Explicit error returns, convention-enforced |
| TypeScript | 72.7 | Exceptions + optional chaining, partially enforced |
| Python | 57.3 | Exceptions, unchecked |
Rust's Result type and ownership rules show up in the sampled descriptions. Rust runtime findings are mostly operational and architectural -- unbounded channels in Codex, command hot paths in Starship, shell timeout behavior in Goose, request-body limits in Deno -- not basic correctness failures. Every Rust project in the benchmark scores above 91 on runtime because the compiler removes the easy ways to mishandle memory and error propagation. The remaining issues are about service behavior under load, not whether a null dereference or unchecked error slips through routine code.
Go requires explicit if err != nil checks but does not enforce them. The descriptions show what that means project by project. Traefik's runtime issues are moderate operational concerns: global connection state, configuration thrash, goroutine observability. Terraform's are heavier but still about cancellation, state integrity, and resource cleanup. gh CLI falls apart because its sampled runtime issues sit on central shared-state and auth paths: a global SSO header data race, swallowed token retrieval errors, unsynchronized config caching. Same language, same idioms -- but one repo treats runtime discipline as a design constraint and the other treats it as a local implementation detail.
TypeScript's runtime average (72.7) splits between disciplined teams and permissive failure handling. Supabase lands at 99.5 because its unresolved descriptions are small operational edge cases and one deployment-path import failure, not systemic error handling gaps. At the other end of the group, the sampled descriptions fill up with empty catch blocks, silent fallbacks, missing rate limits, missing timeouts, unsafe casts, and production assumptions that only hold in development. Optional chaining helps with null access. It does nothing for lifecycle control, retries, backpressure, or cleanup.
Python's runtime issues expose the cost of mixing high-level code with low-level or global-state-heavy frameworks. Django's critical runtime finding is shared database connection state. scikit-learn's runtime findings sit in Cython allocation, threadpool initialization, and large-memory behavior. Dify exposes unauthenticated webhook endpoints without rate limiting. wttr.in and OpenHands pull the average down with thin operational safeguards. Python's runtime floor is not low because exceptions are inherently bad -- it is low because the language offers almost no enforcement around shared mutable state, concurrency boundaries, or resource limits.
Supabase matching Rust at 99.5 matters because of what cannot explain it: the language. TypeScript can equal Rust on runtime only when a team manually builds the operational discipline that Rust makes harder to skip.
The Go Spread: Same Language, 47.8-Point Gap
Traefik scores 97.9. gh CLI scores 50.1. Both are Go. Both use the same compiler, the same type system, the same standard library. The 47.8-point gap is the widest of any language group, and it exists entirely because of project practices.
| Category | Traefik (97.9) | Terraform (92.7) | Prometheus (85.2) | K8s (72.8) | gh CLI (50.1) |
|---|---|---|---|---|---|
| Security | 100.0 | 93.1 | 100.0 | 22.2 | 44.3 |
| Runtime | 96.6 | 70.3 | 90.2 | 92.6 | 15.7 |
| Testing | 98.9 | 99.4 | 26.0 | 65.3 | 9.4 |
| Code Quality | 88.3 | 95.4 | 95.2 | 83.9 | 55.0 |
| Consistency | 99.7 | 99.8 | 99.8 | 98.9 | 94.9 |
| Dead Code | 100.0 | 100.0 | 100.0 | 95.4 | 79.5 |
Look at the consistency row. Even gh CLI, which scores 50.1 overall, manages 94.9 on consistency. Go's toolchain enforces formatting (gofmt), import organization, and naming conventions at the language level. Consistency is the one category where Go's compiler genuinely constrains the floor. Every Go project scores 94.9 or higher on it.
Testing is where the gap blows open. Traefik and Terraform both score above 98. Prometheus drops to 26.0. gh CLI scores 9.4. The test infrastructure exists in all four projects. gh CLI has a test-code ratio of 0.78. Traefik's ratio is 0.75. But the sampled descriptions show that Traefik's open testing debt is narrow and peripheral, while gh CLI's sits on verification itself: a fake mock verifier, crypto tests that never exercise real verification, and broad untested command paths. The difference is structural: test organization, assertion quality, coverage distribution, and where the weak tests sit. Not raw volume.
(A caveat on Prometheus: the structural coverage metric measures direct test-to-function call edges, which undercounts projects that route assertions through test helpers. Prometheus has 177K lines of test code. The low score reflects a measurement limitation for that test architecture, not an absence of testing.)
Code smell density separates them further. gh CLI has 156 code smell issues across 148K lines of code. Traefik has 3 across 163K. Similar-sized codebases, 52x difference in code smells.
Testing: Culture Over Tooling
TypeScript has the most testing tooling of any language in the benchmark. Jest, Vitest, Playwright, Testing Library, Cypress. It also has the lowest testing average of any multi-project language group.
| Language | Avg Testing | Min | Max |
|---|---|---|---|
| Rust | 62.9 | 34.7 | 97.8 |
| Go | 59.8 | 9.4 | 99.4 |
| Python | 58.0 | 4.6 | 99.2 |
| TypeScript | 53.7 | 15.4 | 90.8 |
Testing issues account for 54.7% of all TypeScript issues in the benchmark. The sampled descriptions show not one problem but three. First, breadth debt: n8n, Cal.com, Next.js, Grafana, and Dify all carry large inventories of Untested Function and Untested Method findings on surfaces like Page, execute, and getStyles. The callable surface grows faster than direct verification. Second, false-confidence debt: Supabase has a test with no assertion, LibreChat has waitFor without await, and multiple projects carry low assertion-density findings. Third, disabled-critical-path debt: Twenty comments out core workflow suites and Cal.com skips confirm-handler coverage. The tooling is not the constraint. The issue descriptions show that the test surface is present, but the strongest guarantees are missing exactly where the application crosses stateful boundaries.
The split within Go projects is even sharper. Terraform scores 99.4 on testing. Prometheus scores 26.0. gh CLI scores 9.4. The descriptions explain why. Terraform's remaining testing debt is narrow. Prometheus misses retry and contract testing on notifier and storage interfaces. gh CLI has broken confidence at the center of the verification story itself. The range from top to bottom is 90 points within one language because testing quality depends less on syntax and more on whether the team treats tests as executable documentation or as scaffolding.
The test-code ratio paradox makes this concrete. Appwrite (PHP) has 5.68x more test code than production code and scores 2.6 on testing because its sampled descriptions show health tests that only check HTTP 200, no tests for critical cascade delete logic, and broad untested action handlers. gh CLI has a 0.78 test-code ratio and scores 9.4. Traefik has a 0.75 ratio and scores 98.9. Volume is not quality. The issue descriptions repeatedly show that low testing scores come from weak trust in what the tests prove.
Code Quality and Consistency
Code quality -- code smell density, structural complexity, god classes -- is the category where TypeScript performs best relative to other languages.
| Language | Avg Code Quality | Avg Consistency |
|---|---|---|
| TypeScript | 89.2 | 96.0 |
| Rust | 88.4 | 93.2 |
| Go | 83.6 | 98.6 |
| Python | 79.7 | 92.2 |
The descriptions behind code quality are more informative than the averages. TypeScript leads this category partly because many of its smell findings are medium- or low-severity maintainability signals distributed across very large repos: unused imports, high-fan-out handlers, static-method repositories, and boundary leakage between feature and API layers. That still matters, but it is usually cheaper debt than the central architectural breakpoints seen in Vinext or Twenty. The language's ecosystem encourages interface-heavy decomposition, but the sampled descriptions show that TypeScript can just as easily centralize itself into giant adapters, dataloaders, and plugin files when the repo allows it.
Rust's code quality average stays high because even its weaker projects tend to fail structurally rather than chaotically. Goose and Codex accumulate provider coupling, server-route coupling, and large TUI or security modules, but the descriptions still read like architecture debt inside otherwise constrained systems.
Go's lower code-quality average is driven by central hubs: Traefik's dynamic config package, Terraform's core engine and protocol evolution, Kubernetes' controller and kubelet sprawl, and gh CLI's pkg/cmd/pr/shared/ god module. The recurring pattern is not poor syntax hygiene -- it is large coordination points where one package becomes the meeting place for too many responsibilities.
Python's lower code-quality average comes from global-state and coupling-heavy frameworks rather than from style. The sampled descriptions point to Django app-registry and core-module centralization, Dify service/controller coupling, OpenHands global singletons, and scikit-learn utils hubs. Python gives teams broad freedom to compose modules however they like. The weak cases show what happens when nothing in the language pushes back on that freedom.
Consistency is where Go's toolchain advantage is clearest. gofmt eliminates style debates. Import organization is enforced. Naming conventions are prescribed. Go averages 98.6 on consistency. Even the worst Go project on consistency (gh CLI, 94.9) outscores the average of every other language group on this metric. The sampled descriptions show that Go consistency failures are usually architectural exceptions, not style drift: a god config package, divergent command pattern, or kubelet-scale central module.
TypeScript and Python show the opposite pattern. They accumulate many more style-level consistency findings -- optional-chain suggestions, unnecessary fragments, implicit any, static-only classes. Those findings can be numerous without doing much score damage because they do not necessarily signal architectural breakage. When TypeScript or Python consistency does collapse, it usually happens for the same reason as Go: a central service or file violates the codebase's own intended boundaries.
Duplication is a solved problem across all languages: nearly every one of the 24 projects scores 100.0.
What Language Buys You, and What It Doesn't
Language sets the floor by making some failure modes common and others rare. Rust largely removes memory-unsafety and unchecked error propagation, so its issue descriptions skew toward operational edges, test gaps, and oversized coordination modules. Go gives teams a safer default than Python or TypeScript but leaves enough escape hatches that large systems code can pile up unsafe, TLS, concurrency, and shared-state risk when it needs to. TypeScript allows disciplined teams like Supabase to build very healthy systems, but its recurring issue descriptions show that application-boundary mistakes, weak verification, and silent operational fallbacks remain easy to write. Python remains the most permissive of the multi-project groups: powerful frameworks, string-driven APIs, and limited enforcement around shared state make its security and runtime issues qualitatively broader.
The compiler cannot write tests. Rust's testing average (62.9) is only marginally better than Go's (59.8) or Python's (58.0). The sampled Rust testing descriptions are full of untested routes, missing sandbox checks, and absent direct verification on important boundaries. A safer language reduces the number of defects tests need to catch. It does not ensure the tests exist.
The compiler cannot preserve architecture. God modules appear in every language: Kubernetes' kubelet in Go, Twenty's DataloaderService in TypeScript, Vinext's index.ts, Django's core registries, Codex's app.rs, and Appwrite's action and exception hubs in PHP. Once a team lets too many responsibilities accumulate in one place, the module graph dominates the score more than the syntax does.
The practical takeaway is narrower than "language does not matter" and more useful than "just choose Rust." Language determines which mistakes are cheap to make. Team practice determines whether the codebase keeps making them after the first warning.
| Category | #1 | #2 | #3 | #4 |
|---|---|---|---|---|
| Overall | Rust (89.0) | Go (79.7) | TypeScript (77.2) | Python (64.9) |
| Security | Rust (91.3) | Go (71.9) | TypeScript (62.9) | Python (22.7) |
| Runtime | Rust (96.8) | Go (73.1) | TypeScript (72.7) | Python (57.3) |
| Testing | Rust (62.9) | Go (59.8) | Python (58.0) | TypeScript (53.7) |
| Code Quality | TypeScript (89.2) | Rust (88.4) | Go (83.6) | Python (79.7) |
| Consistency | Go (98.6) | TypeScript (96.0) | Rust (93.2) | Python (92.2) |
The projects at the top of every language group share the same traits: testing discipline, clean module boundaries, active maintenance. The projects at the bottom share the same gaps, regardless of language. Rust's floor is higher than every other language's average. But the 47.8-point spread within Go, and the fact that a TypeScript project (Supabase) beats all but one Go project, means the team operating the language still matters more than the language itself.
Data from Octokraft's analysis of 24 open-source repositories across Rust, Go, TypeScript, Python, PHP, and Swift. All projects analyzed with identical rule sets, category weights, and severity multipliers. Full methodology described in the benchmark post.