App Clinic 28 min read Apr 20, 2026

App Clinic: Grafana

A dissection of the Grafana codebase: how it works, what the team does well, where the technical debt sits, and what other projects can learn from it.

What the app is

Grafana is an open-source platform for visualizing metrics, logs, and traces. Engineers connect it to data sources like Prometheus, Loki, Elasticsearch, InfluxDB, and Postgres, then build dashboards, run ad-hoc queries, and configure alerts against those sources. It ships as a single binary that serves an HTTP API and a React frontend.

Grafana Labs maintains the project. It is used in production for infrastructure monitoring across companies large and small, and also serves as the upstream for Grafana Labs' hosted product and for related projects like Loki and Mimir.

The backend is written in Go, built with Wire for compile-time dependency injection. The frontend is TypeScript and React, with Redux Toolkit for state and Emotion for styling. The repository also publishes a set of npm packages under the @grafana/ scope and ships several standalone Go applications that run alongside the main server.


Repository anatomy

Grafana is a monorepo. The top-level layout:

| Directory | What lives here |
| --- | --- |
| pkg/ | Go backend: HTTP API, domain services, storage, plugin integration, infrastructure |
| apps/ | Standalone Go applications built on the Grafana App SDK. Each one manages a Kubernetes-style resource: dashboards, folders, alerts, provisioning, secrets, IAM, and others |
| packages/ | TypeScript libraries published to npm as @grafana/data, @grafana/ui, @grafana/runtime, @grafana/prometheus, and a dozen more |
| public/app/ | React frontend: features, core chrome, Redux store |
| devenv/ | Developer environment: Docker blocks, Nginx configs, local CDN |
| kinds/, kindsv2/, cue.mod/ | CUE schema definitions used to generate Go and TypeScript types |
| e2e/, e2e-playwright/ | End-to-end test suites |
| scripts/, packaging/, hack/ | Build, packaging, and tooling scripts |
| docs/, contribute/ | Documentation and contributor guides |
| conf/ | Default configuration files shipped with the binary |

Tests live next to the code they exercise. Go files follow the _test.go convention; frontend files use *.test.ts and *.test.tsx. Mocks are generated alongside their interfaces using mockery. Schema-derived types are generated from CUE and committed as *_gen.ts and *_gen.go.

Dependencies are not vendored into the repository. Go modules are listed in go.mod; frontend dependencies are managed via Yarn workspaces declared in package.json.


How the app works

Grafana runs as a single Go binary. Google Wire resolves the dependency graph at build time and generates the initialization code; at startup, that code constructs every backend service through explicit constructor calls. The binary serves both the legacy HTTP API and a Kubernetes-style resource API. The React frontend ships with the same binary and talks to both APIs over HTTP. The entry point is pkg/cmd/grafana/main.go, which launches either the CLI or the server.

Two HTTP APIs side by side

Grafana exposes two HTTP APIs in parallel. The older one lives under /api/… and is served by pkg/api/http_server.go — the same file where the HTTPServer struct holds references to every backend service that exposes HTTP endpoints. Route registration and middleware composition happen inside this one package, and every legacy endpoint the Grafana API documentation lists is a handler method on HTTPServer.

The newer one lives under /apis/… and is a Kubernetes-style API built on top of k8s.io/apiserver. pkg/services/apiserver/ holds the server; each resource type — dashboards, folders, alert rules, provisioning jobs, secrets, IAM — is declared as a versioned resource with CRUD and watch semantics. A frontend request that lists dashboards can either hit GET /api/search?type=dash-db on the legacy path, or GET /apis/dashboard.grafana.app/v1/dashboards on the new one, and the response shapes differ because the two APIs expose different contracts.

The migration is gradual. Older screens in public/app/ still hit /api/…; newer screens — dashboard-scene, the unified alerting UI, provisioning — hit /apis/…. A developer adding a brand-new feature today writes a resource-API handler, while a developer touching an existing feature usually still writes against the legacy handler, because that is where the existing code sits.

Two storage layers

Grafana also carries two storage layers, and during the migration they run together against the same database. pkg/services/sqlstore/ is the legacy SQL store built on xorm, which backs most legacy-API features and supports PostgreSQL, MySQL, and SQLite. pkg/storage/unified/ is the newer layer backing the resource API; its implementation kvStorageBackend in pkg/storage/unified/resource/storage_backend.go owns the key-value store, an event store, a notifier that fans out to watch subscribers, a garbage collector, and a dashboard-version retention policy.

Both layers can run against the same PostgreSQL, MySQL, or SQLite instance because they use disjoint tables. During the feature migration, the same resource type can have handlers in both layers — a request to /api/dashboards hits the legacy xorm-backed store, while a request to /apis/dashboard.grafana.app/v1/dashboards hits kvStorageBackend. A single Grafana deployment runs both paths concurrently until the feature is fully migrated over.

Apps, alerting, real-time

apps/ contains standalone Go binaries built on the Grafana App SDK — apps/dashboard/, apps/folder/, apps/alerting/, apps/provisioning/, apps/secret/, apps/iam/. Each app manages exactly one Kubernetes-style resource and follows the same internal layout: CUE kinds under kinds/ declare the schema, generated Go types land in pkg/apis/, and business logic lives in pkg/app/ or pkg/repository/. Each app compiles to its own binary and registers itself against the resource API layer at startup.

pkg/services/ngalert/ is the unified alerting subsystem. The scheduler runs each alert rule on its configured interval, the evaluator executes the rule's expression against the configured data source, the state manager tracks firing/pending/resolved transitions per rule, and the notifier routes firing alerts to the configured receivers. Sub-packages split the work: eval/, state/, schedule/, notifier/, sender/, provisioning/, cluster/, api/.

pkg/services/live/ hosts a Centrifuge-based WebSocket server that pushes dashboard updates to the browser in real time — new annotations and data-source state changes appear on panels without a reload. pkg/services/rendering/ drives a headless-browser service that renders dashboards into PNGs for alert-email images and share links.

Plugins

Plugins extend Grafana in three ways: data sources (connect a new backing service), panels (new visualisation types), and app plugins (custom pages and actions). They are loaded by pkg/plugins/manager/. Each plugin is either signed by Grafana Labs (the production default), unsigned but allowlisted in config (used for internal and pre-release plugins), or unsigned with development mode enabled. The signature decision lives in pkg/plugins/manager/signature/authorizer.go.

Backend plugins — data sources and app plugins — run as separate OS processes, which means a plugin crashing does not crash Grafana, and they communicate with the main binary over gRPC. Frontend plugins are React modules loaded into the browser by the plugin runtime in packages/grafana-runtime/.


Architecture

Octokraft's architecture review evaluates the codebase across four dimensions — modularity, coupling, scalability, and patterns — and writes up strengths, weaknesses, and a module-by-module description. Grafana's scores:

| Dimension | Score |
| --- | --- |
| Modularity | 65 |
| Coupling | 55 |
| Scalability | 60 |
| Patterns | 75 |
| Overall | C |

The executive summary, verbatim:

Grafana's architecture shows mature foundations with Wire dependency injection, interface-driven design, and Kubernetes-style APIs. However, accumulated technical debt requires attention: ngalert has 17-service coupling, alert evaluation lacks backpressure, and HTTPServer has 70+ dependencies needing decomposition.

Strengths

The architecture review names three patterns as the codebase's foundational strengths.

Wire dependency injection. The backend uses Google Wire for compile-time dependency injection. pkg/server/wire.go declares the dependency sets. A set looks like this:

var withOTelSet = wire.NewSet(
    otelTracer,
    grpcserver.ProvideService,
    interceptors.ProvideAuthenticator,
)

Each item is either a provider function that constructs a service (by convention named Provide*, although withOTelSet also lists otelTracer), or a wire.Bind(new(Interface), new(*ImplType)) that tells Wire to satisfy the interface with the concrete type. The main set wireSet in the same file composes dozens of these smaller sets, including wireBasicSet, withOTelSet, and sets for HTTP, storage, alerting, authentication, and access control. The Wire tool reads these declarations and generates pkg/server/wire_gen.go, which constructs every service through explicit constructor calls in the correct order.

Because the generation happens at compile time, a circular dependency becomes a build error (Wire refuses to generate the file) rather than a runtime stack overflow. A missing provider is a build error. An ambiguous binding is a build error. Every new service in pkg/services/ plugs in by adding one line to a set — no registry walks, no reflection, no startup-time resolution.
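To make the compile-time point concrete, here is a minimal hand-written sketch of the kind of code a generated initializer amounts to. Tracer, GRPCServer, Authenticator, the Provide* functions, and initializeServer are hypothetical stand-ins rather than Grafana's types, and the real wire_gen.go is produced by the Wire tool, not written by hand.

```go
package main

import "fmt"

// Hypothetical services; none of these are Grafana types.
type Tracer struct{ name string }

type GRPCServer struct{ tracer *Tracer }

type Authenticator struct{ server *GRPCServer }

// Provider functions: each one declares its dependencies as parameters.
func ProvideTracer() *Tracer { return &Tracer{name: "otel"} }

func ProvideGRPCServer(t *Tracer) (*GRPCServer, error) {
	if t == nil {
		return nil, fmt.Errorf("grpc server: tracer is required")
	}
	return &GRPCServer{tracer: t}, nil
}

func ProvideAuthenticator(s *GRPCServer) *Authenticator {
	return &Authenticator{server: s}
}

// initializeServer mirrors the shape of a generated initializer:
// plain constructor calls in dependency order, with no reflection.
func initializeServer() (*Authenticator, error) {
	tracer := ProvideTracer()
	server, err := ProvideGRPCServer(tracer)
	if err != nil {
		return nil, err
	}
	return ProvideAuthenticator(server), nil
}

func main() {
	auth, err := initializeServer()
	if err != nil {
		panic(err)
	}
	fmt.Println(auth.server.tracer.name)
}
```

Because the initializer is ordinary Go, a provider whose parameter has no matching constructor simply fails to compile, which is the property the paragraph above describes.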

Interface-driven design. Domain services under pkg/services/ define their interface in a parent package and keep the implementation and the test fakes in sibling sub-packages. For example, pkg/services/dashboards/dashboards.go declares the DashboardService interface; pkg/services/dashboards/dashboardimpl/ implements it; pkg/services/dashboards/dashboardtest/ provides fakes for tests. The pattern lets callers depend on the interface rather than the concrete type, and it lets the test package supply a lightweight replacement without pulling in the full service graph. Grafana applies the same layout across authentication, access control, search, provisioning, alerting, and most other domain services.
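A compressed sketch of that layout, folded into one file for readability. In the layout described above the three pieces live in separate packages; DashboardGetter, storeBacked, FakeDashboardGetter, and describe are hypothetical names chosen for illustration, not Grafana's actual interface.

```go
package main

import (
	"errors"
	"fmt"
)

// DashboardGetter is the contract callers depend on (hypothetical).
type DashboardGetter interface {
	GetTitle(uid string) (string, error)
}

// storeBacked stands in for the production implementation.
type storeBacked struct{ titles map[string]string }

func (s *storeBacked) GetTitle(uid string) (string, error) {
	title, ok := s.titles[uid]
	if !ok {
		return "", errors.New("dashboard not found")
	}
	return title, nil
}

// FakeDashboardGetter is a lightweight replacement for tests.
type FakeDashboardGetter struct {
	Title string
	Err   error
}

func (f *FakeDashboardGetter) GetTitle(string) (string, error) {
	return f.Title, f.Err
}

// describe depends only on the interface, so either type can be passed in.
func describe(g DashboardGetter, uid string) string {
	title, err := g.GetTitle(uid)
	if err != nil {
		return "unknown dashboard"
	}
	return "dashboard: " + title
}

func main() {
	prod := &storeBacked{titles: map[string]string{"abc": "CPU"}}
	fmt.Println(describe(prod, "abc"))
	fmt.Println(describe(&FakeDashboardGetter{Err: errors.New("boom")}, "abc"))
}
```

The caller never names a concrete type, so a test can exercise describe without constructing a store at all.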

Kubernetes-style resource APIs. The newer API layer under pkg/apiserver/ and pkg/aggregator/ is built on k8s.io/apiserver. Resources are declared as CUE schemas in kinds/ or apps/<name>/kinds/, and a type generator produces Go and TypeScript types from those schemas. Each resource supports the standard Kubernetes verbs — Get, List, Watch, Create, Update, Delete — plus admission validators and mutators. apps/provisioning/pkg/apis/admission/combined_validator.go is a reference implementation: admission iterates every registered Validator and returns the first structured field.ErrorList error. This layer is gradually replacing the legacy /api/… handlers with resource-oriented /apis/… endpoints.

Weaknesses

The review makes three recommendations: decompose HTTPServer, split ngalert into bounded sub-services, and add backpressure to alert evaluation.

HTTPServer has 70+ dependencies. pkg/api/http_server.go is 1,106 lines and the HTTPServer struct references roughly seventy backend services. A small slice of its declared fields:

type HTTPServer struct {
    // ... (the struct opens with a log, web mux, middleware slice, route registry)
    DataSourceCache              datasources.CacheService
    AuthTokenService             auth.UserTokenService
    QuotaService                 quota.Service
    ProvisioningService          provisioning.ProvisioningService
    AccessControl                accesscontrol.AccessControl
    pluginClient                 plugins.Client
    pluginStore                  pluginstore.Store
    SearchService                search.Service
    Live                         *live.GrafanaLive
    SQLStore                     db.DB
    AlertNG                      *ngalert.AlertNG
    SecretsService               secrets.Service
    DataSourcesService           datasources.DataSourceService
    // ... fifty-five more fields ...
}

Every feature that exposes HTTP endpoints adds a field to this struct. The review recommends decomposing the server into route groups, each with its own focused dependency set, and using handler composition rather than a single coordinating struct. The change is a large refactor — every call site that accesses hs.SecretsService or hs.SearchService would need a new injection point — and it has not been done.
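One way to picture the recommended shape is a route group that owns its handlers and receives only the dependencies it uses. The sketch below is an assumption about what such a decomposition could look like, not the review's or the team's design; SecretReader, secretsRoutes, routeGroup, newMux, statusFor, and the /api/secrets/exists path are all hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// SecretReader is a narrow, hypothetical dependency for one route group.
type SecretReader interface {
	Get(key string) (string, bool)
}

type mapSecrets map[string]string

func (m mapSecrets) Get(key string) (string, bool) {
	v, ok := m[key]
	return v, ok
}

// secretsRoutes owns its handlers and its small dependency set.
type secretsRoutes struct{ secrets SecretReader }

func (r *secretsRoutes) Register(mux *http.ServeMux) {
	mux.HandleFunc("/api/secrets/exists", r.exists)
}

func (r *secretsRoutes) exists(w http.ResponseWriter, req *http.Request) {
	if _, ok := r.secrets.Get(req.URL.Query().Get("key")); !ok {
		http.NotFound(w, req)
		return
	}
	fmt.Fprint(w, "ok")
}

// routeGroup is all the server needs to know about a feature.
type routeGroup interface {
	Register(mux *http.ServeMux)
}

func newMux(groups ...routeGroup) *http.ServeMux {
	mux := http.NewServeMux()
	for _, g := range groups {
		g.Register(mux)
	}
	return mux
}

// statusFor issues an in-memory GET and returns the response status code.
func statusFor(h http.Handler, target string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, target, nil))
	return rec.Code
}

func main() {
	mux := newMux(&secretsRoutes{secrets: mapSecrets{"db": "s3cr3t"}})
	fmt.Println(statusFor(mux, "/api/secrets/exists?key=db"))
	fmt.Println(statusFor(mux, "/api/secrets/exists?key=missing"))
}
```

With this shape, adding a feature means adding a route group rather than adding a field to a shared struct, which is the coupling the review is pointing at.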

ngalert has 17-service coupling. pkg/services/ngalert/ngalert.go is 897 lines. Its ProvideService constructor combines roughly seventeen other services: the evaluation engine, state manager, scheduler, notifier, outbound sender, historian, external Alertmanager client, rendering service, access control, cluster coordinator, and more. The review recommends splitting ngalert into bounded sub-services — one for rule evaluation, one for notification routing, one for alert state management — each with a smaller dependency set. Today ngalert is the single entry point for alerting, and every piece of alerting behaviour routes through the same constructor.

Alert evaluation lacks backpressure. pkg/services/ngalert/eval/ runs rule evaluations on a fixed schedule and queries the underlying data sources. There are no bounded queues, no circuit breakers, and no rate limits between the scheduler and the evaluator. Under load — many rules evaluating at once, slow data source, failed alert deliveries — nothing slows the pipeline down other than the rules themselves. The review recommends bounded queues with backpressure so that evaluation cannot exhaust memory or goroutine budget.
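A bounded queue with backpressure can be sketched with a buffered channel and a non-blocking send. This is an illustration of the general technique under assumed names (boundedQueue, TrySubmit, ErrQueueFull), not the ngalert scheduler or the review's proposed implementation.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrQueueFull signals backpressure to the producer.
var ErrQueueFull = errors.New("evaluation queue full")

// boundedQueue is a hypothetical evaluation queue with a fixed buffer
// and a fixed number of workers.
type boundedQueue struct {
	jobs chan func()
	wg   sync.WaitGroup
}

func newBoundedQueue(capacity, workers int) *boundedQueue {
	q := &boundedQueue{jobs: make(chan func(), capacity)}
	for i := 0; i < workers; i++ {
		q.wg.Add(1)
		go func() {
			defer q.wg.Done()
			for job := range q.jobs {
				job()
			}
		}()
	}
	return q
}

// TrySubmit never blocks: when the buffer is full the caller is told so
// and can skip, delay, or count the missed evaluation.
func (q *boundedQueue) TrySubmit(job func()) error {
	select {
	case q.jobs <- job:
		return nil
	default:
		return ErrQueueFull
	}
}

// Close stops accepting work and waits for the workers to drain the buffer.
func (q *boundedQueue) Close() {
	close(q.jobs)
	q.wg.Wait()
}

func main() {
	q := newBoundedQueue(1, 1)
	started, release := make(chan struct{}), make(chan struct{})
	_ = q.TrySubmit(func() { close(started); <-release }) // occupies the only worker
	<-started
	fmt.Println(q.TrySubmit(func() {})) // fits in the one-slot buffer
	fmt.Println(q.TrySubmit(func() {})) // rejected: worker busy, buffer full
	close(release)
	q.Close()
}
```

Memory and goroutine count are capped by capacity and workers, so a slow data source causes rejected submissions instead of unbounded growth. What to do with a rejected evaluation is a policy decision this sketch leaves open.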

Module layout

The review also documents each major area:

| Area | Purpose |
| --- | --- |
| pkg/api/ | Legacy /api/… HTTP handlers; handlers call services, services call stores |
| pkg/services/ | Domain services (dashboards, ngalert, authn, accesscontrol, search, sqlstore, provisioning) |
| pkg/server/ | Application bootstrap and Wire DI configuration |
| pkg/storage/unified/ | Resource-API storage backend, implements k8s.io/apiserver/pkg/storage.Interface |
| pkg/storage/secret/ | Encrypted secrets storage with key management |
| pkg/apiserver/, pkg/aggregator/ | Kubernetes-style API layer hosting /apis/… |
| apps/<name>/ | Standalone Go apps, one Kubernetes-style resource each |
| public/app/core/ | React core: chrome, sidebar, page framework, context providers |
| public/app/features/ | Domain React modules (dashboard, explore, alerting, plugins, datasources) |
| public/app/store/ | Single Redux store with RTK Query |
| packages/grafana-data | Core data structures (DataFrame, Field, transformations) |
| packages/grafana-ui | React component library |
| packages/grafana-runtime | Runtime services (backend, location, data sources) |
| packages/grafana-schema | CUE-generated TypeScript types |
| packages/grafana-scenes | Declarative dashboard framework |

Files with the most dependencies

Three files concentrate most of the cross-service references:

  • pkg/api/http_server.go (1,106 lines): HTTPServer with roughly seventy backend services. The most tightly coupled file in the codebase.
  • pkg/services/ngalert/ngalert.go (897 lines) — the alerting constructor with seventeen sub-services.
  • pkg/storage/unified/resource/storage_backend.go (2,199 lines) — kvStorageBackend with storage, event, notifier, pruner, garbage collector, bulk lock, and dashboard-version retention responsibilities.

Conventions

A convention is a pattern the team uses repeatedly across the codebase — a naming rule, an error-handling rule, a way to structure a component, a way to wire a service. Octokraft finds them by scanning files in similar roles (UI components, workers, validators, services, models, configs) and looking for code patterns the team applies over and over. A convention shows up when the same approach appears in two or more places; the tool then checks every other file in the same role to see whether it conforms.

As an example, every job worker in apps/provisioning/pkg/ implements two methods: IsSupported(job) to decide whether the worker handles the job, and Process(ctx, repo, job, progress) to execute it. That pattern appears in delete/worker.go, deleteresources/worker.go, and export/worker.go. Because three separate workers follow the same structure, Octokraft records it as a convention and treats any new worker that skips it as a deviation.

Grafana has 71 detected conventions: 67 have 100% compliance and 4 fall short of it. The six described in the following subsections are the ones that most directly govern how the team writes code.

Runtime services follow a strict singleton-with-startup-guard pattern

The convention detector records this as 100%-compliant across five files. Every runtime service exposed by packages/grafana-runtime/src/services/ uses the same structure: a module-level let factory or let instance, a setXxx() function that asserts the factory is unset and throws if called twice, and a getXxx() / createXxx() function that throws if the service is accessed before startup completes. QueryRunner.ts is the reference implementation:

let factory: QueryRunnerFactory | undefined;

export const setQueryRunnerFactory = (instance: QueryRunnerFactory): void => {
  if (factory) {
    throw new Error('Runner should only be set when Grafana is starting.');
  }
  factory = instance;
};

export const createQueryRunner = (): QueryRunner => {
  if (!factory) {
    throw new Error('`createQueryRunner` can only be used after Grafana instance has started.');
  }
  return factory();
};

The same pattern appears in EchoSrv.ts, CorrelationsService.ts, backendSrv.ts, and appEvents.ts. Global state stays global, but the lifecycle is explicit: a service accessed out of order throws at the call site instead of returning stale or undefined state.

Job workers follow a full tracing-and-progress contract

The convention detector picks this up through nine separate convention entries: one each for the two-method interface, the tracing-span pattern, the metrics deferral, the progress recorder, the repository type assertion, the StageOptions commit-once policy, concurrency.ForEachJob for concurrent work, explicit context.WithTimeout wrapping, and structured field-error return values. All nine hit 100% compliance across the three scanned worker files. Every job worker in apps/provisioning/pkg/ and pkg/registry/apis/provisioning/jobs/ implements the same two-method interface: IsSupported(job) to check compatibility, and Process(ctx, repo, job, progress) to execute. Inside Process, each worker follows the same six-step sequence:

  1. Start an OpenTelemetry span with tracing.Start(...).
  2. Defer a cleanup function that records the error and always calls span.End().
  3. Record metrics through a deferred function that captures an outcome variable.
  4. Report progress through a shared JobProgressRecorder rather than direct status field writes.
  5. Type-assert the repository to ReaderWriter with an ok check.
  6. Wrap any repository mutations in a StageOptions struct with StageModeCommitOnlyOnce, an explicit timeout, and PushOnWrites: false.

pkg/registry/apis/provisioning/jobs/delete/worker.go puts every step on display inside a single Process body. The code, abridged:

func (w *Worker) Process(ctx context.Context, repo repository.Repository,
    job provisioning.Job, progress jobs.JobProgressRecorder) (processErr error) {

    ctx, span := tracing.Start(ctx, "provisioning.delete.process")       // step 1
    defer func() {                                                       // step 2
        if processErr != nil { _ = tracing.Error(span, processErr) }
        span.End()
    }()

    start, outcome := time.Now(), utils.ErrorOutcome
    defer func() {                                                       // step 3
        w.metrics.RecordJob("delete", outcome, resourcesDeleted,
            time.Since(start).Seconds())
    }()

    progress.SetTotal(ctx, len(paths)+len(opts.Resources))               // step 4
    progress.StrictMaxErrors(1)

    fn := func(repo repository.Repository, _ bool) error {
        rw, ok := repo.(repository.ReaderWriter)                         // step 5
        if !ok { return errors.New("repository is not a ReaderWriter") }
        // ... domain logic: resolve paths, delete files, record progress ...
        return nil
    }

    stageOptions := repository.StageOptions{                             // step 6
        Mode:                  repository.StageModeCommitOnlyOnce,
        CommitOnlyOnceMessage: msg,
        PushOnWrites:          false,
        Timeout:               10 * time.Minute,
    }
    return w.wrapFn(ctx, repo, stageOptions, fn)
}

deleteresources/worker.go and export/worker.go follow the same order. Concurrent resource deletion uses concurrency.ForEachJob from github.com/grafana/dskit/concurrency with an explicit maxWorkers parameter — no ad-hoc goroutine pools. Individual Kubernetes API operations wrap themselves in their own context.WithTimeout rather than relying on the parent context deadline.

Every new worker that plugs into the same machinery inherits the span, the metrics, the progress reporting, and the mutation policy — none of it has to be re-implemented by the author.

Validators compose through a shared interface with first-error-wins

Validators in the resource API layer implement a shared Validator interface and are composed through a CombinedValidator that runs each validator in turn and returns the first error. Constructor functions return the interface, not the concrete pointer:

func NewReferencedByRepositoriesValidator(
    repoLister repository.RepositoryByConnectionLister,
) appadmission.Validator {
    return &ReferencedByRepositoriesValidator{repoLister: repoLister}
}

External dependencies are passed as constructor arguments — no global lookups, no lazy init, no registry reads. Single-purpose validators guard on operation type at entry and return nil for non-matching operations (if a.GetOperation() != admission.Delete { return nil }). Validation errors use apierrors.NewInvalid with a field.ErrorList carrying structured field paths rather than plain strings, so the client gets a precise, machine-readable error back.
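The composition rule is easy to see in a stripped-down form. The sketch below uses simplified, hypothetical stand-ins (Operation, Request, inUseValidator, nameValidator) and plain errors in place of the k8s.io/apiserver admission attributes and field.ErrorList described above; only the first-error-wins loop and the interface-typed constructors are the point.

```go
package main

import (
	"errors"
	"fmt"
)

// Operation and Request are simplified stand-ins for admission attributes.
type Operation string

const (
	Create Operation = "CREATE"
	Delete Operation = "DELETE"
)

type Request struct {
	Op   Operation
	Name string
}

type Validator interface {
	Validate(req Request) error
}

// inUseValidator only cares about deletes and returns nil otherwise.
type inUseValidator struct{ inUse map[string]bool }

// NewInUseValidator returns the interface, not the concrete pointer.
func NewInUseValidator(inUse map[string]bool) Validator {
	return &inUseValidator{inUse: inUse}
}

func (v *inUseValidator) Validate(req Request) error {
	if req.Op != Delete {
		return nil // guard on operation type at entry
	}
	if v.inUse[req.Name] {
		return fmt.Errorf("metadata.name: %q is still referenced", req.Name)
	}
	return nil
}

type nameValidator struct{}

func (nameValidator) Validate(req Request) error {
	if req.Name == "" {
		return errors.New("metadata.name: required")
	}
	return nil
}

type combined struct{ validators []Validator }

func NewCombinedValidator(vs ...Validator) Validator {
	return &combined{validators: vs}
}

// Validate runs validators in order and stops at the first error.
func (c *combined) Validate(req Request) error {
	for _, v := range c.validators {
		if err := v.Validate(req); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	v := NewCombinedValidator(nameValidator{}, NewInUseValidator(map[string]bool{"repo-a": true}))
	fmt.Println(v.Validate(Request{Op: Create, Name: "repo-a"}))
	fmt.Println(v.Validate(Request{Op: Delete, Name: "repo-a"}))
}
```

Registration order decides which error a client sees when several validators would fail, so cheaper or more fundamental checks usually go first.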

Shared React components declare their contract up front

Every shared component in packages/grafana-ui/src/ declares an explicit Props type, destructures props with defaults in the function signature, accesses the theme via useTheme2(), builds styles through useStyles2(getStyles), uses Emotion's css() and cx() for class names, pulls user-facing strings through <Trans> and t() from @grafana/i18n, and picks up test IDs from @grafana/e2e-selectors rather than hardcoded strings. Button.tsx, Modal.tsx, and Branding.tsx all follow this pattern.

For mutually exclusive prop variants, the team uses discriminated union types with never for the impossible fields. Modal.tsx has WithStringTitleProps extends BaseProps { title: string } on one side and a parallel shape with title?: never on the other — TypeScript forces callers to pick one. Components using forwardRef set Component.displayName so React DevTools shows the right name (Button.displayName = 'Button').

Factories fail fast on duplicate registration and return sorted types

The factory pattern across apps/provisioning/pkg/repository/factory.go and apps/provisioning/pkg/connection/factory.go is identical on both sides. Each defines an Extra interface with Type, Build, Mutate, and Validate methods; accepts a map of enabled types and a slice of extras at construction time; returns an error if two extras register the same type ("repository type %q is already registered"); and returns the enabled types alphabetically via sort.Slice. GitHub client factories add an optional Client *http.Client field with the comment "exists primarily for testing", and construction follows a three-tier priority: injected client first, token-based auth second, anonymous fallback third.

Configuration mistakes — duplicate registrations, missing types — fail at startup rather than at runtime.
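A reduced sketch of the fail-fast registration and sorted listing follows. The Extra interface here is trimmed to its Type method, and simpleExtra, NewFactory's exact signature, and the Types method name are assumptions made for illustration; only the duplicate-registration error text is taken from the description above.

```go
package main

import (
	"fmt"
	"sort"
)

// Extra is a trimmed-down, hypothetical version of the interface; only
// Type is needed to show registration.
type Extra interface {
	Type() string
}

type simpleExtra string

func (s simpleExtra) Type() string { return string(s) }

type Factory interface {
	Types() []string
}

type factory struct {
	extras  map[string]Extra
	enabled map[string]struct{}
}

// Compile-time assertion that *factory satisfies Factory.
var _ Factory = (*factory)(nil)

// NewFactory rejects duplicate registrations before anything is served.
func NewFactory(enabled map[string]struct{}, extras []Extra) (Factory, error) {
	f := &factory{extras: make(map[string]Extra, len(extras)), enabled: enabled}
	for _, e := range extras {
		if _, dup := f.extras[e.Type()]; dup {
			return nil, fmt.Errorf("repository type %q is already registered", e.Type())
		}
		f.extras[e.Type()] = e
	}
	return f, nil
}

// Types returns the enabled, registered types in alphabetical order.
func (f *factory) Types() []string {
	types := make([]string, 0, len(f.enabled))
	for t := range f.enabled {
		if _, ok := f.extras[t]; ok {
			types = append(types, t)
		}
	}
	sort.Slice(types, func(i, j int) bool { return types[i] < types[j] })
	return types
}

func main() {
	_, err := NewFactory(nil, []Extra{simpleExtra("git"), simpleExtra("git")})
	fmt.Println(err)

	enabled := map[string]struct{}{"local": {}, "git": {}}
	f, _ := NewFactory(enabled, []Extra{simpleExtra("local"), simpleExtra("git")})
	fmt.Println(f.Types())
}
```

Sorting the result matters because Go map iteration order is randomized; without it, the list of enabled types would differ between calls.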

Deprecations name their replacement and their removal version

The convention detector records 100% compliance across the two scanned deprecation sites. Services being phased out include @deprecated JSDoc with both the replacement service and the version in which the old one disappears. LocationSrv.ts:

/**
 * @deprecated in favor of {@link locationService} and will be removed in Grafana 9
 */

backendSrv.ts marks request() and datasourceRequest() as deprecated and directs callers to fetch(). TypeScript types in packages/grafana-runtime/src/config.ts pair @deprecated with an inline // TODO remove in G13 so both the reader and the static checker see the planned removal.

The four conventions below 100% compliance

Of the 71 conventions the detector found, four dip below 100%. Each one involves 2–4 scanned files, so a single deviation moves the number significantly.

| Convention | Compliance | Conforming | Deviating |
| --- | --- | --- | --- |
| Reusable UI components use memo and forwardRef | 2/4 (50%) | Button.tsx, LokiQueryEditor.tsx | DashboardSettings.tsx (page-level, role-appropriate) + one other primitive |
| Compile-time interface assertion via blank identifier | 1/2 (50%) | apps/provisioning/pkg/connection/factory.go (var _ Factory = (*factory)(nil)) | apps/provisioning/pkg/repository/factory.go (missing assertion) |
| Database models are unexported structs with xorm tags | 1/2 (50%) | pkg/infra/serverlock/model.go (type serverLock struct, lowercase) | pkg/infra/kvstore/model.go (exports Item and Key) |
| Validator constructors return interface, not concrete type | 2/3 (67%) | NewCombinedValidator(...) Validator, NewReferencedByRepositoriesValidator(...) Validator | NewRolePermissionValidator(...) *RolePermissionValidator in pkg/registry/apis/iam/authorizer/role_permission_validator.go |

All four are local, low-stakes, and easy to bring in line: one memo wrap, one one-line compile-time assertion, one case conversion plus factory functions, one interface-typed return value.


Issues at a glance — breakdown by category

Octokraft classifies every finding into one of six categories: security, runtime, testing, code smell, dead code, and consistency. The category matrix for Grafana:

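| Category | Critical | High | Medium | Low | Info | Total |
| --- | --- | --- | --- | --- | --- | --- |
| testing | 0 | 0 | 1,723 | 1 | 0 | 1,724 |
| code_smell | 0 | 13 | 575 | 118 | 0 | 706 |
| dead_code | 0 | 1 | 296 | 198 | 0 | 495 |
| security | 1 | 19 | 181 | 0 | 1 | 202 |
| runtime | 0 | 1 | 5 | 1 | 0 | 7 |
| consistency | 0 | 0 | 1 | 1 | 0 | 2 |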

The rest of this section walks through each category: where it concentrates in the codebase, plus a small selection of representative findings.

Security — 202 findings

Octokraft reports 202 security findings in Grafana — 1 critical, 19 high, 181 medium, and 1 informational. The category covers injection, trust, crypto defaults, and hardening.

Octokraft flagged the most consequential of the 202 as a critical path traversal in pkg/api/frontendlogging/source_maps.go. When the browser sends a JavaScript stack trace, the server resolves the source-map URL into a filesystem path and reads it — a legitimate feature that helps map minified stack frames back to the original source. The guard on that path is one line:

path := strings.ReplaceAll(sourceMapLocation.path, "../", "") // just in case

The comment is the team's own, and // just in case admits the strip is a token effort rather than a real containment check. A crafted URL can still walk outside StaticRootPath by encoding the traversal (..%2f), doubling it (....// collapses back to ../ after the single-pass strip), or using absolute path segments. The other high-severity findings in the list below follow the same pattern: small pieces of code at the request edge where a one-line change would close the gap.
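The doubling bypass, and one common way to check containment instead, can be shown in a few lines. stripTraversal reproduces the quoted strip; resolveUnderRoot is a hypothetical helper, not Grafana's fix, and the /srv/public root is an assumed example. The sketch assumes the input has already been URL-decoded and does not account for symlinks inside the root.

```go
package main

import (
	"errors"
	"fmt"
	"path/filepath"
	"strings"
)

// stripTraversal reproduces the single-pass strip quoted above.
func stripTraversal(p string) string {
	return strings.ReplaceAll(p, "../", "")
}

// resolveUnderRoot joins the requested path onto root, then verifies that
// the cleaned result is still located inside root.
func resolveUnderRoot(root, requested string) (string, error) {
	full := filepath.Join(root, requested) // Join also cleans the result
	rel, err := filepath.Rel(root, full)
	if err != nil {
		return "", err
	}
	if rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return "", errors.New("path escapes root")
	}
	return full, nil
}

func main() {
	// Removing the inner "../" from "....//" leaves a fresh "../" behind.
	stripped := stripTraversal("....//etc/passwd")
	fmt.Println(stripped)

	_, err := resolveUnderRoot("/srv/public", stripped)
	fmt.Println(err)
}
```

The difference is that the second function checks the outcome (where the path ends up) rather than the input (which substrings it contains), so it does not depend on enumerating every spelling of a traversal.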

Top areas:

| Directory | Findings |
| --- | --- |
| pkg/services | 47 |
| devenv/docker | 29 |
| pkg/registry | 19 |
| pkg/storage | 17 |
| devenv/frontend-service | 12 |
| pkg/util | 10 |

Representative issues:

  • critical
    pkg/api/frontendlogging/source_maps.go:69
    Path traversal in source map processing

    The function resolves filesystem paths from URL components in stack-trace payloads. A strings.ReplaceAll(path, "../", "") guard runs before the read, but it does not handle encoded traversal sequences or absolute-path escapes.

  • high
    pkg/services/validations/oss.go:19
    SSRF protection gap in OSS version

    OSSDataSourceRequestValidator.Validate and OSSDataSourceRequestURLValidator.Validate return nil unconditionally. The OSS build binds these no-op implementations; enterprise provides a real validator.

  • high
    pkg/services/authn/clients/render.go:36
    Render key authentication has low barrier to privilege escalation

    The render client authenticates any request carrying a renderKey cookie by handing the key to the render service. Possession of the cookie is the only credential.

  • high
    pkg/middleware/subpath_redirect.go:19
    Open redirect via user-supplied req.RequestURI

    The middleware builds the redirect target by concatenating cfg.AppURL with a trimmed req.RequestURI. A crafted path can manipulate the resulting URL.

  • high
    pkg/plugins/manager/signature/authorizer.go:27
    Dev mode bypasses all plugin signature checks

    When DevMode is set, any unsigned plugin loads without verification.

  • high
    pkg/middleware/cookies/cookies.go:42
    Session cookie Secure flag is configuration-driven

    The Secure attribute comes from CookieOptions, not from an unconditional default. Deployments running behind HTTPS must set cookie_secure = true.

  • high
    packages/grafana-ui/src/components/RenderUserContentAsHTML/RenderUserContentAsHTML.tsx:24
    dangerouslySetInnerHTML used from non-constant definition

    This is the intended sanitizing wrapper — textUtil.sanitize(content) runs before the assignment — but the rule still flags the sink as a hotspot to audit.

  • medium
    devenv/docker/blocks/alert_webhook_listener/Dockerfile:7
    Container does not specify a USER directive

    A container without an explicit non-root USER runs its entrypoint as root. Eight devenv/docker/blocks/ Dockerfiles share the finding.

Runtime — 7 findings

Octokraft reports 7 runtime findings: 1 high, 5 medium, and 1 low. Five of the seven are inside pkg/services/ngalert/; the exceptions are one storage-layer shadow-read and the routing finding in pkg/api/api.go. The category covers goroutine lifecycle, context cancellation, timers, and shutdown semantics.

Octokraft's single high-severity runtime finding is worth a close look. pkg/services/ngalert/notifier/redis_peer.go is the alerting cluster peer. Near the end of newRedisPeer, after registering Prometheus counters and subscribing to two Redis pubsub channels, the constructor launches five long-lived goroutines and returns:

p.subs[fullStateChannel] = p.redis.Subscribe(context.Background(), ...)
p.subs[fullStateChannelReq] = p.redis.Subscribe(context.Background(), ...)
p.subsMtx.Unlock()

go p.heartbeatLoop()
go p.membersSyncLoop()
go p.fullStateSyncPublishLoop()
go p.fullStateSyncReceiveLoop()
go p.fullStateReqReceiveLoop()

return p, nil

The goroutines each run forever on their own timers. They do not take a cancellation context, and the subscriptions above them hold context.Background(), which is also uncancellable. If the caller wants the peer to stop, it has to call a Stop() method and hope that every loop checks the stop signal. If a later initialization step fails downstream, or the constructor runs a second time, the first batch of goroutines is already running and nothing stops them. Most of the other six runtime findings are variations on the same pattern, chiefly in other alerting subsystems.
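For contrast, here is a minimal sketch of a constructor that owns a cancellation context and tracks the goroutines it starts. The peer type, startLoop, the running counter, and demo are hypothetical; this shows the general lifecycle pattern, not a patch for redis_peer.go.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// peer is a hypothetical stand-in for a cluster peer with background loops.
type peer struct {
	cancel  context.CancelFunc
	wg      sync.WaitGroup
	running atomic.Int64 // number of loops currently alive
}

// newPeer derives a cancellable context and hands it to every loop, so a
// single Stop call (or a cancelled parent) ends all of them.
func newPeer(parent context.Context, interval time.Duration) *peer {
	ctx, cancel := context.WithCancel(parent)
	p := &peer{cancel: cancel}
	p.startLoop(ctx, interval, func() { /* heartbeat work would go here */ })
	p.startLoop(ctx, interval, func() { /* member sync work would go here */ })
	return p
}

func (p *peer) startLoop(ctx context.Context, interval time.Duration, tick func()) {
	p.wg.Add(1)
	p.running.Add(1)
	go func() {
		defer p.wg.Done()
		defer p.running.Add(-1)
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				tick()
			}
		}
	}()
}

// Stop cancels the context and blocks until every loop has returned.
func (p *peer) Stop() {
	p.cancel()
	p.wg.Wait()
}

// demo reports how many loops are alive before and after Stop.
func demo() (before, after int64) {
	p := newPeer(context.Background(), 10*time.Millisecond)
	before = p.running.Load()
	p.Stop()
	return before, p.running.Load()
}

func main() {
	before, after := demo()
	fmt.Println("loops before Stop:", before, "after Stop:", after)
}
```

Because Stop waits on the WaitGroup, a caller that hits an error after construction can tear the peer down and know that no loop outlives it.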

Representative issues (all 7):

  • high
    pkg/services/ngalert/notifier/redis_peer.go:236
    Goroutine leak on Redis peer initialization failure
  • medium
    pkg/services/ngalert/sender/sender.go:136
    Context leak in ExternalAlertmanager sender
  • medium
    pkg/storage/unified/resource/search_client.go:144
    Background goroutine without completion tracking
  • medium
    pkg/services/ngalert/schedule/schedule.go:386
    Untracked goroutine in schedule updates
  • medium
    pkg/api/api.go
    Monolithic HTTP routing
  • medium
    pkg/services/ngalert/notifier/redis_peer.go:476
    Busy-wait pattern with time.Sleep in message loops
  • low
    pkg/services/ngalert/sender/router.go:433
    Mixed timer/ticker patterns across codebase

Six of the seven findings repeat the same pattern: a goroutine starts in a constructor, and cleanup relies on someone calling Stop() from outside rather than on an owned cancellation context.

Code smell — 706 findings

Octokraft reports 706 code-smell findings — 13 high, 575 medium, and 118 low. The category picks up high fan-out, high instability, god-class types, and coordination-overload patterns.

The headline finding Octokraft flagged is kvStorageBackend in pkg/storage/unified/resource/storage_backend.go:57 — a 25-method type inside a 2,199-line file. On paper, a god class. In context, it is the single implementation of the whole unified-storage backend: the key-value data store, the append-only event store, the notifier that fans watch events out to subscribers, the garbage collector, the pruner that enforces dashboard-version retention, the bulk lock, and the last-import tracker. Each piece is a distinct responsibility, and each one has to coordinate with the others (a prune mutates the data store while a watch subscriber is streaming; a garbage collector competes with a pruner for the same tombstones). Splitting them would force explicit coordination across the boundary. The team has chosen to accept the large type — and HTTPServer at 70+ fields is the same trade at the HTTP edge.

Top areas:

Directory               Findings
pkg/services            236
pkg/registry            146
packages/grafana-ui     80
pkg/storage             60
apps/provisioning       28
pkg/api                 18

Representative high-severity issues:

  • high
    pkg/api/
    HTTPServer dependency explosion (arch-review/coupling)
  • high
    pkg/services/ngalert/api/provisioning.go:66
    God class: ProvisioningApiHandler (31 methods, 2 lines)
  • high
    pkg/storage/unified/resource/storage_backend.go:57
    God class: kvStorageBackend (25 methods, 38 lines)
  • high
    pkg/registry/apis/provisioning/jobs/progress.go:40
    God class: jobProgressRecorder (21 methods, 20 lines)
  • high
    pkg/components/dashdiffs/formatter_json.go:114
    God class: JSONFormatter (17 methods, 11 lines)

Representative medium-severity issues:

  • medium
    apps/advisor/pkg/app/utils.go:42
    High fan-out: processCheck calls 22 functions
  • medium
    apps/dashboard/pkg/migration/conversion/conversion.go:94
    High fan-out: RegisterConversions calls 21 functions

Dead code — 495 findings

Octokraft reports 495 dead-code findings — 1 high, 296 medium, and 198 low. The category reports unreferenced symbols and untested transport methods that have callers but no direct test coverage.

A representative example: packages/grafana-runtime/src/services/LocationSrv.ts exports three functions that the Grafana team itself stopped using years ago. The file still ships because @grafana/runtime is a published npm package, and every external plugin author built against that API. The top of each exported function carries the team's own marker:

/**
 * @deprecated in favor of {@link locationService} and will be removed in Grafana 9
 */

Deleting the code would break every plugin that upgrades its @grafana/runtime dependency without first migrating to locationService. Keeping it in the repository is the team's explicit choice — the deprecation names the replacement and the removal version. For an SDK library, "dead code" and "retained compatibility" are the same code, and the distinction lives in the deprecation comment. The other dead-code findings in this section cluster around the same pattern: legacy RBAC action registrations kept for rollback safety, generated helper methods on transport types that external tooling may call, and similar SDK-boundary code paths.

Top areas:

Directory               Findings
pkg/services            113
packages/grafana-ui     63
pkg/registry            31
apps/provisioning       30
packages/grafana-data   19

Representative issues:

  • high
    pkg/login/social/connectors/generic_oauth.go:42
    God class: SocialGenericOAuth (27 methods, 12 lines)
  • medium
    pkg/registry/apis/iam/legacy/team_binding.go:413
    Untested method: DeleteTeamMember (1 caller, no test coverage)
  • medium
    pkg/services/correlations/correlations.go:109
    Untested method: DeleteCorrelationsBySourceUID (1 caller, no test coverage)
  • medium
    pkg/infra/features/client.go:124
    Untested function: CreateHTTPClientForProvider (1 caller, no test coverage)
  • medium
    apps/scope/pkg/apis/scope/v0alpha1/register.go:147
    Untested function: AddKnownTypes (1 caller, no test coverage)

Testing — 1,724 findings

Octokraft reports 1,724 testing findings. 1,723 of them come from the same rule — graph/untested_code — which flags public functions and methods that no test file calls directly.

A concrete example: packages/grafana-ui/src/themes/GlobalStyles/filterTable.ts exports getFilterTableStyles(theme), a small function that returns an Emotion css object for filter-table rows:

export function getFilterTableStyles(theme: GrafanaTheme2) {
  return css({
    '.filter-table *': { boxSizing: 'border-box' },
    // ... more style declarations ...
  });
}

The function is flagged as untested. The FilterTable component that consumes these styles has its own test file, which exercises the visual output end-to-end — the tests cover the styling behaviour through the component, but no test file calls getFilterTableStyles directly, so the graph analysis records one untested call target. The 469 packages/grafana-ui findings in the table below are mostly the same pattern: theme helpers, Storybook exports, and small formatting utilities that parent-component tests cover transitively. Direct unit tests on each of these would also catch a class of bug the parent tests cannot see — a regression in the helper that happens to leave the parent's rendered output visually intact.

Top areas:

Directory                     Findings
packages/grafana-ui           469
pkg/services                  389
pkg/registry                  111
packages/grafana-data         99
packages/grafana-prometheus   73

Representative issues:

  • medium
    packages/grafana-ui/src/themes/GlobalStyles/filterTable.ts:5
    Untested function: getFilterTableStyles (1 caller, no test coverage)
  • medium
    packages/grafana-data/src/utils/variables.ts:5
    Untested function: containsSearchFilter (1 caller, no test coverage)
  • medium
    packages/grafana-data/src/utils/location.ts:14
    Untested function: maybeParseUrl (1 caller, no test coverage)
  • medium
    apps/secret/decrypt/v1beta1/decrypt_grpc.pb.go:63
    Untested method: DecryptSecureValues (2 callers, no test coverage)
  • medium
    pkg/services/secrets/migrator/provisioning.go:28
    Untested method: Rollback (1 caller, no test coverage)

Consistency — 2 findings

The consistency category is almost empty.

  • medium
    pkg/
    Technical debt accumulation
  • low
    public/app/features/plugins/sandbox/codeLoader.ts:93
    SRI checks feature-gated instead of mandatory

The SRI finding is the more concrete of the two. verifySRI in codeLoader.ts is responsible for checking the subresource-integrity hash of plugin code before the browser sandbox executes it. The first three lines of the function are:

async function verifySRI(pluginCode: string, moduleHash?: string): Promise<boolean> {
  if (!config.featureToggles.pluginsSriChecks) {
    return true;
  }

When the pluginsSriChecks feature toggle is off, verifySRI returns true without checking anything, and the plugin loads. The gate is intentional — the team wanted to roll the integrity check out progressively rather than break every installation on the same release. A deployment with the toggle left off loads plugin code without integrity verification; a deployment with the toggle on treats a tampered bundle as a hard error.


Findings, grouped by pattern

Where the previous section listed findings category by category, these patterns group them by shared cause — the same underlying decision, migration, or piece of code history showing up across multiple rules and severities.

Pattern 1 — The integration files that concentrate cross-service dependencies

The single arch-review/coupling high finding covers the same two files the architecture review names as the most tightly coupled: pkg/api/http_server.go (70+ backend services on HTTPServer) and pkg/services/ngalert/ngalert.go (17 sub-services on the alerting constructor). The architecture review itself calls these out as the top two decomposition targets. Both are large refactors (decomposing HTTPServer into route groups and splitting ngalert into bounded sub-services), and neither is blocking correctness today.

Pattern 2 — Four human-written coordinator types on subsystem boundaries

Outside the coupling finding above, Octokraft flagged four types as god classes:

  • high
    pkg/storage/unified/resource/storage_backend.go:57
    kvStorageBackend (25 methods)

    Key-value store, event store, notifier, garbage collector, pruner, bulk lock, dashboard-version retention, last-import tracker

  • high
    pkg/services/ngalert/api/provisioning.go:66
    ProvisioningApiHandler (31 methods)

    Every legacy alerting provisioning HTTP endpoint

  • high
    pkg/registry/apis/provisioning/jobs/progress.go:40
    jobProgressRecorder (21 methods)

    Totals, errors, warnings, summaries, URLs, resource outcomes, metrics for every provisioning job

  • high
    pkg/components/dashdiffs/formatter_json.go:114
    JSONFormatter (17 methods)

    Dashboard diff output formatting

Three of the four sit on migration or compatibility boundaries. kvStorageBackend is the whole resource-API storage layer in one type. ProvisioningApiHandler keeps the legacy alerting provisioning API working while the resource-API version matures. jobProgressRecorder is the single progress-reporting contract every worker uses. Splitting these would break a shared contract; the smaller fix is to extract whichever piece of each type churns most often.

Pattern 3 — Testing covers behaviour end-to-end but not every function

Of the 1,724 testing findings Octokraft reports, 1,723 come from one rule — graph/untested_code — which flags functions and methods that no test file references directly. The concentrations are in the shared UI and data packages: 469 in packages/grafana-ui/src, 389 in pkg/services, 99 in packages/grafana-data/src, 73 in packages/grafana-prometheus/src.

The overall testing picture is one of depth rather than breadth: tests directly call 35.3% of functions, and the test-to-code ratio is 45.3%. Assertion density is 0.505. Only 14.2% of test references point at mocks; the rest exercise real implementations. Adding contract tests for the top shared UI primitives and moving story-only exports into dedicated *.stories.tsx files would close most of the untested-function count without changing the testing approach.

Pattern 4 — Dead code is a retained compatibility layer

Octokraft's 495 dead-code findings concentrate in the places the team knowingly keeps around:

  • Deprecated runtime services in packages/grafana-runtime/src/services/LocationSrv.ts, deprecated methods on backendSrv.ts, legacy @deprecated-tagged config types in config.ts. External plugins still import them.
  • Legacy RBAC registrations in pkg/services/ngalert/accesscontrol/roles.go — specifically the deprecatedActionsRole block with a comment stating it exists "just to keep the actions in the registry".
  • Pre-migration helper methods on plugin, storage, and login code paths.

Pattern 5 — A centralized HTML sanitizer exists; seven call sites light up as sinks

Octokraft flagged seven high-severity dangerouslySetInnerHTML findings, all in the shared UI and Prometheus packages. packages/grafana-data/src/text/sanitize.ts is the shared DOMPurify-backed sanitizer. packages/grafana-ui/src/components/RenderUserContentAsHTML/RenderUserContentAsHTML.tsx wraps it in a documented component:

export function RenderUserContentAsHTML<T>({
  component, content, ...rest
}: PropsWithChildren<RenderUserContentAsHTMLProps<T>>): JSX.Element {
  return React.createElement(component || 'span', {
    ...rest,
    dangerouslySetInnerHTML: { __html: textUtil.sanitize(content) },
  });
}

All seven high security findings for dangerouslySetInnerHTML sit in packages/grafana-ui/src/components/ or packages/grafana-prometheus/src/querybuilder/. RenderUserContentAsHTML.tsx itself uses the sink; the point is that sanitization happens inside the wrapper. The other six call sites either render trusted bundled strings (Prometheus query help) or pass a noSanitize flag the caller controls. packages/grafana-ui/src/components/Table/TableNG/Cells/MarkdownCell.tsx takes the caller-controlled route:

dangerouslySetInnerHTML={{
  __html: renderMarkdown(renderValue.text, { noSanitize: disableSanitizeHtml }).trim(),
}}

Routing every remaining sink through RenderUserContentAsHTML and retiring the noSanitize escape hatch would close the inventory. Until then, annotating each call site with the reason sanitization is safe there would match the pattern RenderUserContentAsHTML.tsx already sets.

Pattern 6 — Older crypto defaults in infrastructure code

Octokraft flagged these medium-severity findings across pkg/services, pkg/storage, and pkg/util:

Rule                          Count
math/rand used                29
Missing SSL minimum version   21
MD5 used                      11
SHA1 used                     6
HTTP without TLS              7

math/rand is used for jitter, lock IDs, and non-security identifiers. MD5 and SHA1 are used for integrity checksums and cache keys, not authentication. Missing SSL minimum version appears in outbound HTTP clients where the Go default is already safe but not explicitly pinned. The common thread is that these decisions were made before the current defaults became standard, and nothing has forced a full sweep since. A single pass (math/rand → math/rand/v2, MD5/SHA1 → SHA-256, tls.Config{MinVersion: tls.VersionTLS12} on outbound clients) retires the inventory.
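The three substitutions in that pass can be sketched together. This is a minimal illustration, not code from the repository: newOutboundClient, cacheKey, and jitter are hypothetical helper names, and the sketch assumes Go 1.22 or later for math/rand/v2.

```go
package main

import (
	"crypto/sha256"
	"crypto/tls"
	"encoding/hex"
	"fmt"
	"math/rand/v2"
	"net/http"
	"time"
)

// newOutboundClient (hypothetical helper) pins the TLS floor explicitly
// instead of relying on the library default.
func newOutboundClient(timeout time.Duration) *http.Client {
	return &http.Client{
		Timeout: timeout,
		Transport: &http.Transport{
			TLSClientConfig: &tls.Config{MinVersion: tls.VersionTLS12},
		},
	}
}

// cacheKey (hypothetical helper) derives a hex cache key with SHA-256
// where older code might have reached for MD5 or SHA-1.
func cacheKey(payload []byte) string {
	sum := sha256.Sum256(payload)
	return hex.EncodeToString(sum[:])
}

// jitter (hypothetical helper) returns a non-security random delay in
// [0, max) using math/rand/v2. rand.N panics on n <= 0, hence the guard.
func jitter(max time.Duration) time.Duration {
	if max <= 0 {
		return 0
	}
	return rand.N(max)
}

func main() {
	client := newOutboundClient(10 * time.Second)
	tr := client.Transport.(*http.Transport)
	fmt.Println(tr.TLSClientConfig.MinVersion == tls.VersionTLS12)
	fmt.Println(len(cacheKey([]byte("dashboard-42"))))
	fmt.Println(jitter(time.Second) < time.Second)
}
```

None of the three changes alters behaviour for the non-security uses described above; the value is that the next scan starts from an empty inventory.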

Pattern 7 — Deliberate unsafe usage in performance paths

Octokraft flagged 40 unsafe findings, concentrated in pkg/services, pkg/storage, and pkg/util. Typical uses are unsafe.Pointer casts between []byte and string in parsers and unsafe.Sizeof in memory-accounting code. Each one sits on a frequently executed code path and is there deliberately to avoid a copy. Adding a one-line justification comment above each unsafe block — what the cast is safe for, and a reference to the benchmark — would make future review cheaper.
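As an illustration of what such a justification comment could look like, here is a minimal sketch. It assumes Go 1.20 or later (for unsafe.String and unsafe.SliceData, which may differ from the unsafe.Pointer casts in the repository), and both the function name and the benchmark name in the comment are hypothetical.

```go
package main

import (
	"fmt"
	"unsafe"
)

// bytesToString reinterprets b as a string without copying.
//
// SAFETY: valid only if the caller never mutates b after this call,
// because the returned string shares b's backing array.
// WHY: avoids one allocation and copy per parsed token on a hot path.
// BENCHMARK: BenchmarkBytesToString (hypothetical name, for illustration).
func bytesToString(b []byte) string {
	if len(b) == 0 {
		return ""
	}
	return unsafe.String(unsafe.SliceData(b), len(b))
}

func main() {
	buf := []byte(`up{job="grafana"}`)
	fmt.Println(bytesToString(buf))
}
```

The comment carries the three things a reviewer needs: the invariant that makes the cast safe, the reason it exists, and where to check that the reason still holds.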

Pattern 8 — Developer-only hardening findings in devenv/

Octokraft placed 29 security findings in devenv/docker/blocks, 12 in devenv/frontend-service/configs, and 8 in devenv/local_cdn. Sub-patterns:

Rule                                               Count
Nginx add_header override (header redefinition)    14
Dockerfile missing USER directive                  9
Nginx $host trust                                  9

Plus the one high finding at devenv/frontend-service/configs/nginx.conf:41 — the $age path parameter is written into a response header without a whitespace filter, allowing newline injection.

None of these configs ship with Grafana. They run the developer compose environment locally with test blocks for Prometheus, Loki, InfluxDB, and similar. The nginx.conf high is still worth fixing because devenv/ configs are referenced in documentation and do get copied into real deployments.

Pattern 9 — The team has already flagged the weakest request-edge boundaries

Octokraft's critical security finding and several of the high ones all involve HTTP handlers or plugin loading. On three of them, the team has already written down that the guard is incomplete — and Octokraft surfaces the same spots from a different angle.

Path traversal in source-map resolution. The one critical finding in the codebase. pkg/api/frontendlogging/source_maps.go implements a helpful feature: when the browser reports a JavaScript error, Grafana tries to find the source map that matches the minified script, so the stack trace the operator sees in the UI can be resolved back to the original source filenames and line numbers. The lookup takes the source URL from the error report, figures out which filesystem directory the file lives in (core static build dir or a plugin dir), and reads the .map file from that directory.

guessSourceMapLocation handles the URL-to-path step:

if strings.HasPrefix(u.Path, "/public/build/") ||
   (store.cfg.CDNRootURL != nil && ...) {
    pathParts := strings.SplitN(u.Path, "/public/build/", 2)
    if len(pathParts) == 2 {
        return &sourceMapLocation{
            dir:      store.cfg.StaticRootPath,
            path:     filepath.Join("build", pathParts[1]+".map"),
            pluginID: "",
        }, nil
    }
} else if strings.HasPrefix(u.Path, "/public/plugins/") {
    // similar branch: resolve plugin path → plugin directory
}

The result is a sourceMapLocation with a base directory (dir) and a relative path (path). The caller then reads the file:

path := strings.ReplaceAll(sourceMapLocation.path, "../", "") // just in case
content, err := store.readSourceMap(sourceMapLocation.dir, path)

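The // just in case comment on that one line is the team's own admission. strings.ReplaceAll makes a single left-to-right pass: it removes each non-overlapping ../ it finds and never re-scans its own output. That makes it a denylist filter, not a containment check. The classic bypass for a filter of this kind is ....//, where removing the inner ../ slides the remaining characters together into a fresh ../. The filter also does nothing about absolute path segments, and it only sees the literal string, so an encoded form such as ..%2f passes through untouched if any later layer decodes it. In the snippet above, filepath.Join has already cleaned the path (collapsing // and resolving .. components) before the replacement runs, and the appended .map suffix restricts which files can be named, so how far a crafted path such as /public/build/....//....//....//etc/passwd actually gets depends on how those unrelated normalisation steps happen to interact. Nothing in the function asserts that the final path stays under StaticRootPath. The attacker only needs to convince the browser to report a JavaScript error against a crafted URL, which plugin code running in the same browser session can do.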
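The single-pass behaviour is easy to confirm in isolation. The sketch below applies the same strings.ReplaceAll call to two inputs; stripTraversal is a hypothetical wrapper name, and the inputs are fed to the filter directly rather than through the rest of the lookup.

```go
package main

import (
	"fmt"
	"strings"
)

// stripTraversal (hypothetical name) mirrors the one-line filter:
// a single pass that removes every non-overlapping "../".
func stripTraversal(p string) string {
	return strings.ReplaceAll(p, "../", "")
}

func main() {
	// Plain traversal is removed.
	fmt.Println(stripTraversal("../../etc/passwd"))
	// After the inner "../" is removed, the leftovers form a new "../".
	fmt.Println(stripTraversal("....//....//etc/passwd"))
}
```

The first call prints etc/passwd; the second prints ../../etc/passwd, because the output of the replacement is never scanned again.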

The readSourceMap call at the bottom of the function trusts that path is already safe. Grafana's own frontend security utilities include packages/grafana-data/src/text/sanitize.ts, which has a validatePath() helper that decodes repeatedly and rejects traversal patterns, mixed separators, and control characters. That helper is on the frontend; the Go backend has filepath.Clean, which collapses redundant separators, resolves .. components, and normalises the path. The containment check the team meant to write is a few lines:

clean := filepath.Clean(filepath.Join(sourceMapLocation.dir, sourceMapLocation.path))
if !strings.HasPrefix(clean, filepath.Clean(sourceMapLocation.dir)+string(os.PathSeparator)) {
    return nil, errors.New("resolved source map path escapes the root")
}

The existing // just in case would then come out — the check it was apologising for would be real.
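A self-contained variant of the same idea, for readers who want to run it: this sketch uses filepath.Rel instead of a prefix comparison, which also behaves when the root is the filesystem root. resolveUnderRoot and errEscapesRoot are hypothetical names, and the sketch does not resolve symlinks, which a production check may also need to consider.

```go
package main

import (
	"errors"
	"fmt"
	"path/filepath"
	"strings"
)

var errEscapesRoot = errors.New("resolved path escapes the root")

// resolveUnderRoot (hypothetical helper) joins rel onto root and rejects
// any result that is not located under root. Symlinks are not resolved.
func resolveUnderRoot(root, rel string) (string, error) {
	cleanRoot := filepath.Clean(root)
	full := filepath.Join(cleanRoot, rel) // Join also cleans the result

	// Express full relative to the root; an escape shows up as a
	// leading ".." component.
	relToRoot, err := filepath.Rel(cleanRoot, full)
	if err != nil {
		return "", err
	}
	if relToRoot == ".." || strings.HasPrefix(relToRoot, ".."+string(filepath.Separator)) {
		return "", errEscapesRoot
	}
	return full, nil
}

func main() {
	fmt.Println(resolveUnderRoot("/srv/static", "build/app.js.map"))
	fmt.Println(resolveUnderRoot("/srv/static", "../../etc/passwd.map"))
}
```

The design point is that the check runs on the final resolved path, after every normalisation step, so it does not matter which spelling of the traversal the attacker chose.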

The other six findings in this pattern are shorter to describe, and they appear with full detail in the security findings listed earlier:

  • OSS data-source validator returns nil in pkg/services/validations/oss.go — both Validate methods have empty bodies. Enterprise plugs in a real SSRF validator; the OSS build expects the operator to control outbound reachability at the network layer.
  • Open redirect in the subpath middleware in pkg/middleware/subpath_redirect.go — the redirect URL is composed from user-supplied req.RequestURI, and a crafted path can rewrite the host component.
  • Render key authentication in pkg/services/authn/clients/render.go — a renderKey cookie is the only credential; a leaked key converts directly into the associated user's permissions.
  • Dev-mode plugin signature bypass in pkg/plugins/manager/signature/authorizer.go: if u.cfg.DevMode { return true }. Every unsigned plugin loads when DevMode is on. The risk is a production deployment inheriting DevMode: true from a misapplied template.
  • Session cookie Secure flag in pkg/middleware/cookies/cookies.go — the attribute is read from CookieOptions.Secure, which is configuration-driven rather than defaulted to true.
  • Child-process execution from build helpers in packages/grafana-api-clients/src/generator/helpers.ts and scripts/levitate-show-affected-plugins.js — both call execSync with an interpolated argument string. Developer tooling, not runtime paths, but execFileSync with an argument array closes the pattern.
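For the open-redirect item in the list above, one common guard is to accept only same-origin paths before issuing the redirect. The following is a minimal sketch of that general technique, not the middleware's actual code; isLocalRedirect is a hypothetical name.

```go
package main

import (
	"fmt"
	"net/url"
)

// isLocalRedirect (hypothetical helper) reports whether target is a
// same-origin path: it must start with a single "/" and must not parse
// to anything carrying a scheme or host.
func isLocalRedirect(target string) bool {
	if len(target) == 0 || target[0] != '/' {
		return false
	}
	// "//host" and "/\host" are treated as protocol-relative by browsers.
	if len(target) > 1 && (target[1] == '/' || target[1] == '\\') {
		return false
	}
	u, err := url.Parse(target)
	if err != nil {
		return false
	}
	return u.Scheme == "" && u.Host == ""
}

func main() {
	for _, t := range []string{"/dashboards?orgId=1", "//evil.example/x", "/\\evil.example", "https://evil.example"} {
		fmt.Printf("%q -> %v\n", t, isLocalRedirect(t))
	}
}
```

A handler would fall back to a fixed default location whenever the check returns false, rather than trying to repair the target.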

Pattern 10 — Alerting background goroutines have no owned cancellation

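Octokraft placed five of its seven runtime findings inside pkg/services/ngalert/. A sixth, in unified storage, has the same shape, and the seventh is the HTTP routing finding related to Pattern 1. The same underlying lifecycle pattern repeats across six of them: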

  • high
    pkg/services/ngalert/notifier/redis_peer.go:236

    newRedisPeer starts five background loops (heartbeatLoop, membersSyncLoop, fullStateSyncPublishLoop, fullStateSyncReceiveLoop, fullStateReqReceiveLoop) after a ping test. If any later init step fails, the goroutines keep running.

  • medium
    pkg/services/ngalert/sender/sender.go:136

    External Alertmanager sender creates a long-lived background context; shutdown relies on Stop() being called by the caller.

  • medium
    pkg/services/ngalert/schedule/schedule.go:386

    Schedule updates spawn goroutines without tracking handles.

  • medium
    pkg/services/ngalert/notifier/redis_peer.go:476

    Message loops use time.Sleep polling rather than channel-driven wait.

  • medium
    pkg/storage/unified/resource/search_client.go:144

    Background goroutine without completion tracking.

  • medium
    pkg/api/api.go

    Monolithic HTTP routing (related to the HTTPServer coupling in Pattern 1).

  • low
    pkg/services/ngalert/sender/router.go:433

    Mixed timer/ticker patterns across the run loop.

Alerting is the most concurrent code in the codebase: it runs the scheduler, cluster gossip, external senders, and message loops in parallel. Goroutines start in constructors, and cleanup is the caller's job. pkg/storage/unified/resource/broadcaster.go in the same repository does the opposite: explicit termination channels, documented ownership, bounded buffers. Applying the broadcaster pattern to the alerting subsystem (passing a cancellation context into each constructor, storing goroutine handles, waiting on them in Stop()) would close the goroutine-lifecycle findings in this list with one refactor.


Tradeoffs visible in the code

Limitations the team has written into the code itself — TODOs with a removal target, deprecation markers that name their replacement, FIXMEs that admit a gap, feature flags that gate a half-finished control.

Deprecated runtime services kept alive. Octokraft flags packages/grafana-runtime/src/services/LocationSrv.ts as dead code — every exported function looks unreferenced from the internal graph. The team's own marker on all three exports, @deprecated in favor of {@link locationService} and will be removed in Grafana 9, explains why: external plugins still import from the file. packages/grafana-runtime/src/services/backendSrv.ts shows the same alignment — Octokraft dead-code findings on request() and datasourceRequest(), and a team deprecation comment pointing callers at fetch().

Config types duplicated with a target removal version. packages/grafana-runtime/src/config.ts defines AzureSettings, AzureCloudInfo, AppPluginConfig, and several others with the pattern:

/**
 * @deprecated Use the type from `@grafana/data`
 */
// TODO remove in G13
export interface AzureSettings { ... }

The type is kept so plugins compiled against the old location still work; the removal is scheduled.

Legacy RBAC actions kept in the registry. pkg/services/ngalert/accesscontrol/roles.go declares a deprecatedActionsRole with a comment explaining why:

// deprecatedActionsRole contains deprecated actions just to keep the actions
// in the registry. The actions are granted to Admin just to make sure we do
// not accidentally completely lose access to an API or feature that happen
// to use only legacy
deprecatedActionsRole = accesscontrol.RoleRegistration{ ... }

The team chose to register the role rather than delete the actions — the comment admits the motivation is defensive.

Plugin subresource-integrity checks gated by a feature toggle. Octokraft flagged this as one of only two findings in the consistency category: "SRI checks feature-gated instead of mandatory" at public/app/features/plugins/sandbox/codeLoader.ts:93. The team's code says the same thing:

if (!config.featureToggles.pluginsSriChecks) {
  return true;
}

SRI verification exists; it runs only when pluginsSriChecks is enabled. The gate lets the team roll the check out progressively, but any deployment with the toggle off loads plugin code without integrity verification. Octokraft's finding is the observation, the feature-toggle branch is the team's admission — the same gap written twice.

KV-backend migration still expected to land everywhere. The most operationally interesting TODO in the codebase, at the top of kvStorageBackend.WriteEvent in pkg/storage/unified/resource/storage_backend.go:

func (k *kvStorageBackend) WriteEvent(ctx context.Context, event WriteEvent) (int64, error) {
    if err := event.Validate(); err != nil {
        return 0, apierrors.NewBadRequest(err.Error())
    }

    // If an RV was passed in the old (microsecond) format, convert it to snowflake.
    // TODO: remove this once the KV backend has been deployed everywhere.
    if event.PreviousRV > 0 && !isSnowflake(event.PreviousRV) {
        event.PreviousRV = rvmanager.SnowflakeFromRV(event.PreviousRV)
    }

    rv := k.snowflake.Generate().Int64()
    // ... verify PreviousRV against the latest stored RV, then write ...
}

WriteEvent is the single entry point for every create, update, and delete flowing into the unified resource storage. event.PreviousRV is the caller's optimistic-concurrency token: "I think the current resource version is X; fail if it is not." A few lines below, the code fetches the latest stored RV and returns a 409 conflict if latestKey.ResourceVersion != event.PreviousRV. That equality check is the reason the TODO exists.

The older storage backend generated resource versions as microsecond timestamps. The new KV backend generates them with k.snowflake.Generate() — 64-bit Snowflake IDs whose high bits encode a timestamp and whose low bits encode a per-node sequence. The two formats are not comparable as integers. A rolling deploy puts both formats in flight at the same time: the cluster node that just upgraded writes a snowflake RV, and the API gateway or plugin process that has not upgraded reads the resource, caches the microsecond RV, and sends it back on the next write as PreviousRV. Without the conversion, the equality check fails, the write is rejected with a spurious conflict, and the caller sees "resource version mismatch" even though it has the latest version.

The conversion normalises the caller's token before the compare. isSnowflake does a cheap magnitude check on the integer; SnowflakeFromRV re-encodes a microsecond timestamp into the snowflake format. The cost is one branch and one arithmetic operation per write. The benefit is that a half-migrated cluster keeps working. The TODO marks the exit condition: once every caller, every cache, and every stored RV is snowflake-format, the branch can go. Until then, the comment names the code as load-bearing migration scaffolding, and the next reader knows not to delete it on sight.
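The reasoning can be made concrete with a toy model. Everything numeric in this sketch is an assumption chosen for illustration (the epoch, the bit layout, the threshold), and looksLikeSnowflake and snowflakeFromMicros are hypothetical functions, not the repository's isSnowflake or SnowflakeFromRV. The toy also leaves the sequence bits at zero, so a converted token can equal the stored one.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical constants: they only illustrate the shape of the check.
const (
	epochMs        = int64(1288834974657) // assumed custom epoch, milliseconds
	lowBits        = 22                   // assumed bits reserved for node + sequence
	snowflakeFloor = int64(1) << 59       // assumed threshold separating the formats
)

var errConflict = errors.New("resource version mismatch")

// looksLikeSnowflake is a magnitude check: microsecond timestamps are far
// smaller than IDs whose timestamp has been shifted left by lowBits.
func looksLikeSnowflake(rv int64) bool { return rv >= snowflakeFloor }

// snowflakeFromMicros re-encodes a microsecond timestamp in the toy layout.
func snowflakeFromMicros(rv int64) int64 {
	return (rv/1000 - epochMs) << lowBits
}

// checkPreviousRV normalises the caller's token, then applies the
// optimistic-concurrency equality check.
func checkPreviousRV(latest, previous int64) error {
	if previous > 0 && !looksLikeSnowflake(previous) {
		previous = snowflakeFromMicros(previous)
	}
	if latest != previous {
		return errConflict
	}
	return nil
}

func main() {
	oldRV := int64(1_700_000_000_000_000) // microsecond-format token held by an old client
	latest := snowflakeFromMicros(oldRV)  // the same version as stored in the new format

	fmt.Println("raw compare equal:", latest == oldRV)
	fmt.Println("normalised check error:", checkPreviousRV(latest, oldRV))
}
```

The raw comparison fails even though both values name the same version, which is the spurious conflict described above; the normalised check accepts the old token and still rejects a genuinely stale one.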

Legacy storage resource naming. pkg/registry/apps/alerting/notifications/receiver/legacy_storage.go, timeinterval/legacy_storage.go, and templategroup/legacy_storage.go all contain // TODO remove when metadata.name can be defined by user. The legacy alerting API uses a different naming contract than the new resource API; the TODOs mark the bridge code.

Modal accessibility gap. Octokraft flagged no direct finding at this exact line, but its convention detector notes Modal.tsx as one of the ten-plus files that form the shared UI component contract — precisely the code that is supposed to be accessible. On line 80, the team left a FIXME:

// FIXME: custom title components won't get an accessible title.

The shared Modal accepts either a string title (which gets an aria-label automatically) or a custom JSX title (which does not). The comment admits the accessibility guarantee is incomplete for the custom path — the kind of debt that sits for years in a comment because no one is tracking comments.

Spinner deprecated type kept for one more cycle. packages/grafana-ui/src/components/Spinner/Spinner.tsx has // TODO remove once we fully remove the deprecated type on two lines. The old prop shape is still accepted alongside the new one.


What Grafana does well

1. Compile-time dependency injection via Wire. pkg/server/wire.go declares every dependency set, and pkg/server/wire_gen.go is generated by the Wire tool. Every service has a Provide*() factory, and the generated code constructs the whole graph through explicit constructor calls. Circular dependencies become compile errors. The graph is human-readable. Startup order is deterministic. For a repository the size of Grafana's backend, compile-time DI is what lets the team add a new service without going hunting for registration points.

2. Interface-driven service construction. Domain services under pkg/services/ define their interface in the service's root package and keep the implementation and the test fake in sibling sub-packages. pkg/services/dashboards/dashboards.go declares DashboardService, pkg/services/dashboards/dashboardimpl/ implements it, pkg/services/dashboards/dashboardtest/ supplies the fake. Callers depend on the interface, tests can swap the implementation without pulling in the full service graph, and the pattern repeats across authentication, RBAC, search, provisioning, alerting, and every other domain service.

3. Kubernetes-style resource APIs on top of CUE schemas. pkg/apiserver/ and pkg/aggregator/ host a Kubernetes-compatible API layer. Resource schemas live as CUE definitions in kinds/ and apps/<name>/kinds/, and a code generator produces Go and TypeScript types from the same source. Resources support the standard verbs (Get, List, Watch, Create, Update, Delete) plus admission validators and mutators. apps/provisioning/pkg/apis/admission/combined_validator.go is the reference: admission runs the registered Validators in order and returns the first structured field error. The approach gives the team a migration path off legacy /api/… handlers without rewriting every feature at once.

4. Runtime services as singletons with startup-time guards. packages/grafana-runtime/src/services/QueryRunner.ts, EchoSrv.ts, CorrelationsService.ts, backendSrv.ts, and appEvents.ts share the same structure: a module-level factory variable, a setXxx() function that throws if called twice, and a getXxx() / createXxx() function that throws if called before startup finishes. Global state stays global, but the lifecycle is explicit and misuse throws at the call site instead of silently returning stale state.

5. Shared worker contract with tracing and metrics by deferral. Every job worker in pkg/registry/apis/provisioning/jobs/ implements IsSupported(job) and Process(ctx, job, progress). Inside Process, the worker opens an OpenTelemetry span with tracing.Start(...), defers a cleanup function that records the error and calls span.End(), and records metrics through another deferred function that captures an outcome variable. Every worker is traced, measured, and cleaned up identically without each author writing the boilerplate.

6. One progress-reporting contract for all jobs. pkg/registry/apis/provisioning/jobs/progress.go defines JobProgressRecorder with SetTotal, SetMessage, Record, Complete, StrictMaxErrors, and related methods. Every worker takes the recorder as an argument; no worker writes job status fields directly. Progress semantics live in one file, and worker code stays focused on the domain task. The type is large (21 methods — Octokraft flags it as a god class), but the trade is that there is a single place to change if the progress contract evolves.

7. Factory with fail-fast type registration. apps/provisioning/pkg/repository/factory.go and apps/provisioning/pkg/connection/factory.go implement the same pattern: an Extra interface with Type, Build, Mutate, and Validate methods, a ProvideFactory(enabled map, extras []Extra) constructor that returns an error on duplicate type registration, and a Types() method that returns enabled types in alphabetical order. Configuration errors are caught at startup, not at runtime, and the order is deterministic so that tests, logs, and APIs produce stable output.

8. Validator composition with first-error-wins. apps/provisioning/pkg/apis/admission/combined_validator.go defines a Validator interface; NewCombinedValidator returns the interface type and iterates the registered validators, returning the first error. Each single-purpose validator guards on operation type at entry (if a.GetOperation() != admission.Delete { return nil }) and reports failures as structured apierrors.NewInvalid with field.ErrorList rather than plain strings. Admission logic stays composable and error messages stay machine-readable.

9. Centralized HTML sanitizer with a documented wrapper. packages/grafana-data/src/text/sanitize.ts is the shared DOMPurify-plus-xss sanitizer. packages/grafana-ui/src/components/RenderUserContentAsHTML/RenderUserContentAsHTML.tsx is a component that wraps dangerouslySetInnerHTML around textUtil.sanitize(content) and is the recommended way to render untrusted content. Having a named, documented component means every React reviewer has a single answer to "how do I render this HTML safely?" rather than having to justify each dangerouslySetInnerHTML call from first principles.

10. Broadcaster pattern for owned concurrency. pkg/storage/unified/resource/broadcaster.go is the reference implementation for fan-out concurrency in the codebase. Its lifecycle decisions are the opposite of the ones the alerting subsystem makes, and six of the seven runtime findings in pkg/services/ngalert/ would disappear if the alerting code adopted the same pattern.

The constructor and its contract:

// NewBroadcaster creates a broadcaster that fans out items received on input to
// all active subscribers. The caller owns the input channel and is responsible
// for closing it when no more data will be sent. The broadcaster terminates
// when either ctx is cancelled or input is closed.
func NewBroadcaster[T any](ctx context.Context, input <-chan T,
    metrics *BroadcasterMetrics) Broadcaster[T] {
    return newBroadcasterWithSizes[T](ctx, input, watchChanSize,
        defaultOverflowCap, metrics)
}

Three things are worth naming. First, ctx is a required argument, not an afterthought. The broadcaster's internal struct stores shouldTerminate <-chan struct{} and terminated chan struct{} — the first is ctx.Done(), the second is closed by the stream goroutine when it exits. Termination is a fact the broadcaster owns, not a side effect of the caller remembering to call Stop(). Second, input-channel ownership is documented in the comment. The caller closes it; the broadcaster does not. That single line of documentation prevents the most common goroutine-leak bug: two pieces of code both thinking they own the channel. Third, metrics are passed in at construction. There is no "add a counter later" path — a Prometheus registry produces a typed BroadcasterMetrics struct, the struct is handed to the constructor, and the broadcaster writes to it forever after.

Subscribe picks up the same discipline:

func (b *broadcaster[T]) Subscribe(ctx context.Context, name string) (<-chan T, error) {
    sub := &subscription[T]{name: name, ch: make(chan T, b.watchBufSize)}

    select {
    case <-ctx.Done():
        b.metrics.SubscriptionsTotal.WithLabelValues(subscriptionResultCtxCanceled).Inc()
        return nil, ctx.Err()
    case <-b.terminated:
        b.metrics.SubscriptionsTotal.WithLabelValues(subscriptionResultTerminated).Inc()
        return nil, io.EOF
    case b.subscribe <- sub:
        return sub.ch, nil
    }
}

A subscription request has three possible outcomes and each one is handled. The client context cancels, and the call returns ctx.Err() and increments a counter with the reason ctx_canceled. The broadcaster has already terminated, and the call returns io.EOF with the reason terminated. The subscription goes through, and a typed channel comes back. There is no fourth case where the subscriber lingers waiting for a broadcaster that no longer exists.

Backpressure is bounded. Each subscriber gets a channel of size watchChanSize (1,000 items). When the subscriber cannot keep up, the broadcaster buffers overflow into subscription.overflow up to defaultOverflowCap (50,000 items). Past that, the subscriber is disconnected and the reason — overflow_cap — is recorded in the metrics. The contract is explicit: slow subscribers get disconnected rather than blocking the stream for every other subscriber.
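The bounded-overflow contract can be modelled without any goroutines. The sketch below is a hypothetical single-subscriber model, not the broadcaster's actual implementation: the subscription fields, newSubscription, flush, and offer are all invented for illustration, and tiny sizes stand in for 1,000 and 50,000.

```go
package main

import "fmt"

// subscription is a hypothetical model of one subscriber's delivery state.
type subscription[T any] struct {
	ch          chan T // bounded buffer the subscriber reads from
	overflow    []T    // items waiting for room in ch, oldest first
	overflowCap int    // hard limit on len(overflow)
}

func newSubscription[T any](chanSize, overflowCap int) *subscription[T] {
	return &subscription[T]{ch: make(chan T, chanSize), overflowCap: overflowCap}
}

// flush moves queued overflow items into ch while there is room, in order.
func (s *subscription[T]) flush() {
	for len(s.overflow) > 0 {
		select {
		case s.ch <- s.overflow[0]:
			s.overflow = s.overflow[1:]
		default:
			return // channel is full again; keep the rest queued
		}
	}
}

// offer hands item to the subscriber without ever blocking. It returns false
// when the overflow buffer is full, which is the signal to disconnect.
func (s *subscription[T]) offer(item T) bool {
	s.flush()
	if len(s.overflow) == 0 {
		// Nothing is queued ahead of item, so direct delivery keeps order.
		select {
		case s.ch <- item:
			return true
		default:
		}
	}
	if len(s.overflow) >= s.overflowCap {
		return false
	}
	s.overflow = append(s.overflow, item)
	return true
}

func main() {
	sub := newSubscription[int](2, 3)
	for i := 1; i <= 6; i++ {
		fmt.Println(i, sub.offer(i))
	}
}
```

With a channel of 2 and an overflow cap of 3, the first five offers are accepted and the sixth is refused. The sender never blocks on a slow reader, which is the property that protects every other subscriber.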

Now compare with pkg/services/ngalert/notifier/redis_peer.go. newRedisPeer launches five goroutines — heartbeatLoop, membersSyncLoop, fullStateSyncPublishLoop, fullStateSyncReceiveLoop, fullStateReqReceiveLoop — and returns. The goroutines do not take a cancellation context. The constructor does not return a handle the caller can call Stop() on. Two Redis Subscribe calls use context.Background() — a context that cannot be cancelled. If the caller wants the peer to stop, every loop has to poll a shared stop flag; if any init step fails after the go statements, nothing cleans up. That pattern is present in six of the seven runtime findings Octokraft flagged.

The broadcaster pattern addresses those six findings structurally. Take a cancellation context in the constructor. Store it. Use it in every internal select. Close a terminated channel when the stream loop exits. Hand the caller a typed subscription API. Record the reason on every disconnect. One file in this repository already contains every piece of that pattern. Reusing it for the alerting subsystem is a concrete refactor, not a vague recommendation.
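To make the refactor concrete, here is a minimal sketch of a constructor that owns its goroutine's lifetime. It is not the redis_peer code: peer, newPeer, Done, and cancelStopsLoop are hypothetical names, and a single ticker loop stands in for the five real loops.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// peer is a hypothetical component with one background loop.
type peer struct {
	terminated chan struct{} // closed by the loop goroutine when it exits
}

// newPeer takes the cancellation context up front. The loop selects on
// ctx.Done() in every iteration, so cancelling ctx is the only stop signal.
func newPeer(ctx context.Context, interval time.Duration, beat func()) *peer {
	p := &peer{terminated: make(chan struct{})}
	go func() {
		defer close(p.terminated) // termination becomes an observable fact
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-ticker.C:
				beat()
			}
		}
	}()
	return p
}

// Done lets callers wait for the loop to exit, for example during shutdown.
func (p *peer) Done() <-chan struct{} { return p.terminated }

// cancelStopsLoop starts a peer, cancels its context, and reports whether
// the loop exited within one second.
func cancelStopsLoop() bool {
	ctx, cancel := context.WithCancel(context.Background())
	p := newPeer(ctx, time.Millisecond, func() {})
	cancel()
	select {
	case <-p.Done():
		return true
	case <-time.After(time.Second):
		return false
	}
}

func main() {
	fmt.Println("loop exited after cancel:", cancelStopsLoop())
}
```

A caller that can both cancel and wait on Done() can shut the component down deterministically, and can also unwind cleanly if a later initialization step fails after the goroutine has started.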

11. Component contract conventions applied uniformly. Every shared React component in packages/grafana-ui/src/ follows the same rules: explicit Props types, defaults in the destructuring signature, theme via useTheme2(), styles via useStyles2(getStyles), Emotion's css() and cx() for class composition, user-facing strings through <Trans> and t() from @grafana/i18n, and test IDs imported from @grafana/e2e-selectors rather than hardcoded. Components using forwardRef set Component.displayName. Components with mutually exclusive prop variants use discriminated union types with never for impossible fields. Pick one component at random from packages/grafana-ui/src/components/ and the same set of rules applies.

12. Deprecations that name replacement and removal version. The team pairs @deprecated with the name of the new API and, where a removal is scheduled, the version in which the old one disappears: "@deprecated in favor of {@link locationService} and will be removed in Grafana 9" in LocationSrv.ts, "@deprecated Use the fetch function instead" on backendSrv.request(), and "// TODO remove in G13" next to "@deprecated Use the type from @grafana/data" in config.ts. A reader who hits a deprecation knows where to go and, in most cases, when the grace period ends. For a project shipping SDK packages that external plugins depend on, that single rule is what keeps deprecation debt manageable.


Closing summary

Grafana is a mature observability platform — a monorepo with a Go backend on Google Wire, a TypeScript and React frontend, a family of @grafana/ npm packages published for external plugin authors, and several standalone Go applications for Kubernetes-style resources. Two HTTP APIs serve traffic side by side while the team migrates off the original SQL-backed storage to a unified resource-API backend. The architecture review scores the codebase at 65 modularity, 55 coupling, 60 scalability, 75 patterns — overall C — and the full health score is B+ at 80.21. 67 of 71 detected conventions hit 100% compliance, and the four deviations are all local.

Two groups of findings warrant direct action. The first is a cluster of weak checks in HTTP handlers and plugin loading, and three of them are places where the team has already written down that the guard is incomplete: the strings.ReplaceAll(path, "../", "") // just in case comment in pkg/api/frontendlogging/source_maps.go, the empty-bodied OSS SSRF validator in pkg/services/validations/oss.go, and the feature-toggled plugin SRI check in public/app/features/plugins/sandbox/codeLoader.ts. The second is pkg/services/ngalert/ — every one of the seven runtime findings is an alerting-subsystem goroutine that relies on an external Stop() call rather than an owned cancellation context, and a reference implementation already lives in the same repository (pkg/storage/unified/resource/broadcaster.go). The debt is concentrated, understood, and fixable. Alongside it, the team holds 67 of 71 conventions at 100% compliance, ships deprecations with both a replacement name and a removal version, and runs every job worker through the same tracing-and-progress contract.


Explore this analysis in the showcase

The full Grafana analysis is available as an interactive project in Octokraft. Every finding this post references — and every finding it did not — is browsable there:

  • the scorecard with every dimension score (and what each dimension measures)
  • all 202 security findings, all 706 code-smell findings, and every other category filtered and searchable
  • the architecture review in full, with strengths, weaknesses, and recommendations linked to the files they describe
  • all 71 detected conventions with their conforming and deviating file lists
  • file-level drill-down: click any finding and the source opens at the flagged line

Methodology

Analysis was run against the grafana/grafana repository on branch main at commit b2bc44c38f1f49c59d7937b24d7ad06bdbce567e. Octokraft produced scores across eight dimensions (security, runtime risks, test coverage, code smell, duplication, dead code, consistency, compliance), detected 71 conventions, and generated an architecture review and documentation pass. Every number and file path in this post traces back to that analysis or to the source at that commit.
