ai-memory · Engineering discipline — the hardcoded-literal ratchet

Why a gate, not a guideline

Instructions decay. Ratchets don't.

This codebase is developed by AI coding agents under operator direction — that is the substrate's own AI-NHI workflow, documented, not hidden. The campaign exists because of an honest observation about that workflow, recorded verbatim in the gate's header comments: repeating the "no hardcoded literals" instruction to the agent did not stop the regression. A scattered magic string (the header's examples: format!("anonymous:req-{}", …) ×~8, "memory not found" ×~6) gets reproduced every time an agent pattern-matches surrounding, already-rotten code. The proven fix — the same one that ended the vendor-literal regression class before it — is a mechanical HARD-BLOCK in CI.

# scripts/check-hardcoded-literals.sh — the contract, from the header WHAT IT BLOCKS: a string literal of length >= MIN_LEN (10) that appears on >= DUP_THRESHOLD (3) distinct PRODUCTION sites is a magic value that should be a single named `const` referenced by name. Such a duplicated literal is a HARD-BLOCK when its site-count exceeds the frozen baseline: - existing duplications are grandfathered in the baseline file; - ADDING an occurrence (count rises above baseline) FAILS; - a brand-new duplicated literal (absent from baseline) FAILS; - REMOVING occurrences never fails (burn-down is always allowed); - the baseline can only shrink — "thresholds rise, never fall".

Deliberately not flagged, to keep the gate low-false-positive and therefore load-bearing: literals under 10 chars (short JSON keys), single-site literals, comment / use-path / attribute lines, const/static definition lines (those are the good pattern), and test code behind the shared production-vs-test boundary heuristic. Magic numbers are out of scope here — the SECS_PER_* class is already gated by the companion scripts/check-vendor-literals.sh, and a general numeric gate is too noisy to be load-bearing. The gate also ships --self-test: it injects a contrived new triplicated literal, verifies the HARD-BLOCK fires, and cleans up — proving in CI that the gate is enforcement, not decoration.

The burn-down

497 entries in. 28 out.

The baseline was frozen at campaign start: every double-quoted literal ≥10 chars appearing on ≥3 production sites, 497 entries totalling 2,847 sites. Six batches later, 28 entries on 108 sites remain — and the regenerated baseline file (scripts/qc-allowlists/hardcoded-literals-baseline.txt) is exactly those 28 lines. Everything else was a byte-preserving hoist: each routed const or helper produces the exact pre-sweep wire/SQL/log bytes.

497 → 28

baseline entries (−94%)

From the campaign-start freeze to the post-batch-6 regeneration.

2,847 → 108

production sites (−96%)

≈2,700 production call sites now reference a named const or shared helper instead of a repeated literal.

batches

Quota DDL parity → identity sentinels → JSON-RPC wire consts → route paths → SQL/header/tracing/tool-name sweeps → the final field-name + census batch.

28 / 28

survivors justified

Every remaining entry classified with a one-line justification in the committed census.

Where the literals went: the SSOT modules

batch 2

identity::sentinels

Every internal/system principal string (DAEMON_PRINCIPAL, ANONYMOUS_INVALID, AI_CURATOR, …) as one named const — 82 production sites routed. These are authz-relevant: ownership gates exempt callers whose principal equals one of them. validate::RESERVED_AGENT_IDS is now built from the sentinel consts (it was a parallel literal list with a "MUST stay in sync" comment), pinned by an invariant test. One anonymous_request_id() helper replaced 10 divergent synthesis sites (#1560).

batch 3

mcp::jsonrpc

The JSON-RPC 2.0 version tag, reserved error codes (-32700/-32600/-32601/-32602), MCP method names, and the protocolVersion revision as named consts. The crate-root METHOD_* consts became aliases of the domain-canonical set.

batch 4

handlers::routes

One const per production HTTP route path — 78 consts. The router registers them, and the postgres surface gate (207 literals in postgres_gate.rs), the federation receiver, and the CLI doctor all match on them — so route gating structurally cannot drift from route registration.

batch 6

models::field_names

Extended by +57 consts for wire/row keys (agent_pubkey … updated_since), routed across ~60 files including the full federation-sync response-key set, via the established json!-key / .get() / try_get() forms.

batch 6

errors::msg

Filesystem-context helpers opening() / reading() / writing() plus reuse of the existing msg::invalid() — error prose synthesized in one place instead of re-typed per call site.

batches 4b–5

…and the long tail

SQL transaction fragments (SQL_BEGIN_IMMEDIATE et al. — 49 copies collapsed), shared auth-header spellings, 14 duplicated tracing targets routed through consts, and tool-name / wire-enum values routed through their owning types (MemoryLinkRelation::as_str(), AttestLevel::as_str() — both link allowlist arrays are now built from the enum).

Full batch mechanics: CHANGELOG §"#1558 hardcoded-literal SSOT remediation campaign" and v0.7.0 release notes §#1558.

Lint the linter

Three accuracy bugs — in the gate itself.

A gate that miscounts is worse than no gate: it either blocks honest work or silently waves duplication through. During the campaign the production-vs-test boundary heuristic was found wrong three times, and each finding was filed, fixed, and regression-pinned like any substrate defect:

#1561

cfg(test)-attributed modules

The boundary only excluded mod tests blocks by name — a #[cfg(test)] mod l2_2_audit_tests with a non-standard name leaked test literals into the production baseline. Fixed: the attr+mod pairing is part of the boundary.

batch 5c

Whole-file test fixtures

A file-level #![cfg(test)] inner attribute makes the entire file test code — those files no longer count toward production literal totals.

#1577

Declarations vs. inline bodies

The boundary must not fire on a #[cfg(test)] mod x; declaration whose body lives in another file (one such line made the gate skip 13.9k production lines), and cfg(all(test, …)) modules are now caught.

The discipline point: enforcement tooling gets the same defect workflow as the substrate — issue filed at discovery, fix, regression pin, close. The gate's accuracy is itself under test (--self-test runs in CI).

The irreducible floor

28 survivors, every one justified.

The campaign did not end with "good enough." It ended with a committed census — scripts/qc-allowlists/hardcoded-literals-irreducible.md — classifying each of the 28 surviving entries with a one-line justification for why it cannot shrink further under the current boundaries. Four classes:

Class	Entries	Meaning	Example survivor
CARVEOUT	9	Every production site lives in the 8 vendor carve-out files (`llm.rs`, `config.rs`, `mine.rs`, …) that hold the canonical vendor alias/default tables and are frozen for this campaign.	`http://localhost:11434` — all 9 sites are `config.rs` defaults/resolvers/template.
CARVEOUT-DOMINANT	3	≥3 sites are carve-out-frozen; the residual sites are below the duplication threshold on their own and cannot reference a const the frozen owner does not export.	`mini_lm_l6_v2` — 4 sites in the `config.rs` SSOT def; 1 residual match-arm pattern.
SEPARATE-CRATE	11	Sites live only in `tools/*` standalone QA/orchestration binaries — separate crates that cannot reference `ai_memory::` consts.	The `T0-A1-CORE` … `T0-CONTRACT` question ids in `tools/t0-orchestrate`.
HYBRID	5	Sites split between carve-out files and a tools binary; neither side can route to the other.	`ANTHROPIC_API_KEY` — the per-vendor key fallback table (frozen) + the t0 env table (separate crate).

That census is the auditable answer to "why does any duplication remain?" — and because the baseline is a shrink-only ratchet, the floor can only ever go down from here.

Honesty about behavior changes. The sweep was predominantly byte-preserving, and the two wire-visible exceptions are disclosed as operator advisories, not buried: #1562 — 58 tracing sites used field syntax that RUST_LOG target filtering cannot match; converting them to real metadata targets means postgres SAL adapter events now emit under store::postgres / store::postgres::kg (an ai_memory=debug filter no longer matches them). #1560 — unifying 10 divergent anonymous-request-id synthesis sites onto one helper fixed 8 of them that stamped a full 36-char UUID against the documented uuid8 contract; anonymous principals in logs/audit rows now carry the 8-char suffix everywhere.

What it signals

Maintainability you can verify, discipline you can replay.

For an SME evaluating the substrate, the campaign is evidence on two axes:

substrate maintainability

One spelling per concept

Reserved principals, JSON-RPC wire constants, HTTP route paths, wire field names, SQL fragments, auth headers, tracing targets — each now has exactly one definition site, and several previously parallel lists (RESERVED_AGENT_IDS, both link allowlist arrays) are derived from their owning type instead of maintained alongside it. Drift between registration and gating, or between enum and allowlist, is now a compile-time impossibility rather than a review-time hope.

enforcement, not aspiration

The floor cannot regress silently

check-hardcoded-literals.sh runs in CI as a HARD-BLOCK beside fmt/clippy/test/audit and the companion vendor-literal gate. New duplication above the 28-entry floor fails the build; the baseline file can only shrink; --self-test proves the block actually fires. At campaign close the working tree held: cargo check (default + sal-postgres) clean, clippy -D warnings -D clippy::pedantic clean, cargo fmt --check clean, and both literal gates PASS.

ai-nhi development discipline

Built by agents, governed mechanically

The six batches were executed by AI coding agents under operator direction — the same AI-NHI workflow the substrate documents for itself. The honest engineering lesson the campaign encodes: agent pattern-matching reproduces whatever the surrounding code does, so quality directives must live in mechanical gates and committed artifacts (baseline, census, self-test), not in prompts. The operator sets the rule once; CI enforces it on every future session, human or NHI.

Related reading: Tracing atlas (the #1562 target fix in operational context) · Developer deep-dive · Engineering standards · Frozen claims.

One spelling per fact.