ai-memory — Reproducible Baselines

▸ Contract 1 · do-1461 fleet

Two independent clean-room 0→60 fleet builds.

The do-1461 reference architecture (15 nodes, 3 regions, 9 federated peers, Batman MAXIMUM-SECURE posture) was stood up, fully verified, torn down, and stood up again from nothing — terraform destroy → fresh terraform apply → full push-based provision → complete re-verification. Both rounds minted the same 100% GREEN result: 119/119 verify checks PASS, asserted by the same harness against two physically distinct fleets.

# the whole contract is one directory, one deterministic flow
cd deploy/do-1461
make seed up provision validate test   # build fleet, verify, full-spectrum test
make down                              # tear it all down

Independent clean-room rounds: 2

Verify checks PASS, both rounds: 119/119

Nodes asserting the per-round golden binary sha256: 15/15

Full-spectrum test checks PASS (round-2 retest): 150/150

What is pinned, and where

Pinned artifact	Mechanism	Asserted by
Golden binary	Per-round `sha256` pinned as the named constant `GOLDEN_SHA256` in `deploy/do-1461/provision/lib.sh` (env-overridable for forks). No literal SHA appears on this page by convention — the pin rotates with scheduled redeploys; the mechanism is the contract.	`binary.sha256` verify row, fleet-asserted on all 15 nodes; `binary.version` pins `0.9.0`
Schema	v78, lockstep across all 9 peers (postgres adapter)	`caps.db_schema_version` verify row per peer
Substrate stack	Pinned pgdg `.deb`s (PostgreSQL 18.4, Apache AGE 1.7.0, pgvector 0.8.2) + pinned Ollama release — named constants in `provision/lib.sh`	`store.pg_version` / `store.age_version` / `store.pgvector_version` / `store.age_graph` verify rows per pg node
Seed corpus	`sha256`-pinned in `CORPUS_MANIFEST.json`; attested Atlas Corpus baseline committed under `deploy/do-1461/atlas/results/`	Atlas ingest reports (`atlas-*.tsv`)
Trust material	Campaign CA + per-node keys generated once per round; zero-touch enrollment proves peers join via CA credential alone	`zerotouch::cred_ca_chain[peer]` ×9 + `zerotouch::unenrolled_status[peer]` ×9 (full-spectrum harness)
Tunables	Every knob is a named constant in `provision/lib.sh`, not a magic literal	Reviewed source; the same constants drive both rounds

The two-round evidence pair

Round 1: verify-20260608T231547Z.tsv. Round 2: verify-20260609T133956Z.tsv (both under .local-runs/do-1461/reports/, emitted by deploy/do-1461/validate/run.sh). Both reports contain the same 119 checks with the same expectations, all PASS. A row-by-row diff of the two rounds shows exactly three kinds of value-level difference, spanning 20 rows in total (15 + 3 + 2). Each one is expected and none is a behavioral delta:

Differing field	Why it differs	Rows
`binary.sha256` value	Each round pins and fleet-asserts its own golden build. Within a round, all 15 nodes report the identical hash — that is the per-round determinism claim.	15 (one per node)
`daemon_pg.tls` observed counter	The check asserts "≥1 ssl backend, 0 plaintext"; the observed `ssl=` count is a cumulative connection counter (9 vs 18). Criterion and verdict identical.	3 (one per pg node)
`fleet federation.write` / `federation.convergence` probe id	The convergence probe writes a fresh throwaway memory each run; the row carries that run's UUID. Criterion and verdict identical.	2

Every other field of every other row is byte-identical across the two rounds. The round-2 fleet additionally passed the full-spectrum suite (test-20260609T161203Z.tsv, 150/150 PASS across the regression / crypto / federation / zerotouch / a2a / ai_nhi / nsa_gaps / curator groups) and the recursive-learning suite (recursive-20260609T160851Z.tsv).
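The row-by-row comparison described above can be sketched as follows. This is a minimal illustration, not the project's tooling: the column layout (node, check name, then observed data) and the helper names `index_rows` and `diff_rounds` are assumptions, and the sample rows are placeholders rather than contents of the real reports.

```rust
use std::collections::BTreeMap;

/// Index a TSV report by (node, check). The column layout is an assumption:
/// column 0 = node, column 1 = check name, remaining columns = observed data.
fn index_rows(tsv: &str) -> BTreeMap<(String, String), Vec<String>> {
    tsv.lines()
        .skip(1) // assumed single header row
        .filter(|line| !line.trim().is_empty())
        .map(|line| {
            let cols: Vec<String> = line.split('\t').map(str::to_string).collect();
            let key = (cols[0].clone(), cols.get(1).cloned().unwrap_or_default());
            (key, cols)
        })
        .collect()
}

/// Count, per check name, the rows whose contents differ between two rounds.
/// Rows present in only one report are ignored in this sketch.
fn diff_rounds(round1: &str, round2: &str) -> BTreeMap<String, usize> {
    let (a, b) = (index_rows(round1), index_rows(round2));
    let mut counts = BTreeMap::new();
    for (key, cols_a) in &a {
        if let Some(cols_b) = b.get(key) {
            if cols_a != cols_b {
                *counts.entry(key.1.clone()).or_insert(0) += 1;
            }
        }
    }
    counts
}

fn main() {
    // Placeholder data, not taken from the real reports.
    let round1 = "node\tcheck\tobserved\tverdict\n\
                  n1\tbinary.sha256\taaa\tPASS\n\
                  n2\tbinary.sha256\taaa\tPASS\n\
                  n1\tbinary.version\t0.9.0\tPASS\n";
    let round2 = "node\tcheck\tobserved\tverdict\n\
                  n1\tbinary.sha256\tbbb\tPASS\n\
                  n2\tbinary.sha256\tbbb\tPASS\n\
                  n1\tbinary.version\t0.9.0\tPASS\n";
    for (check, rows) in diff_rounds(round1, round2) {
        println!("{check}: {rows} differing row(s)");
    }
}
```

Grouping the differing rows by check name is what makes a summary like the table above possible: each expected difference shows up as one check with a row count, and anything outside that list would stand out immediately.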

Honest scope

"Reproducible" here means: from the committed deploy/do-1461/ directory, two independent clean-room executions of the same flow produced the same 100%-green verdict against the same pinned artifact set, with each round's binary hash fleet-asserted on every node. It does not mean bit-identical reports (timestamps, run UUIDs, and cumulative counters necessarily differ, as itemised above), and it does not claim a deterministic build of the binary itself: the golden binary is pinned by hash and asserted, not rebuilt bit-for-bit per round. One platform footnote carries over from the reference page: DigitalOcean's per-region default VPC container cannot be deleted, so teardown destroys 100% of compute and data while the empty VPC shell is reused. It holds no state between runs.

▸ Contract 2 · Performance

Bench baselines: absolute budgets plus drift gates.

Performance claims are held by two independent guards in src/bench.rs, both reproducible from the CLI. The absolute guard checks every measured p95 against the published budget table; the baseline guard checks every measured p95 against a previously captured run, catching drift that stays inside the absolute budget.

Budget table

PERFORMANCE.md is the authoritative latency contract. Honest split, stated in the doc itself: 7 of 14 budget rows are bench-verified; the remaining 7 are explicitly marked [advisory] pending bench fixtures — advisory rows are never claimed as verified.

Absolute guard

src/bench.rs::P95_TOLERANCE = 1.10 — a run fails when any measured p95 exceeds its PERFORMANCE.md budget by more than the published 10% tolerance. Budgets are pinned in-code and cross-checked by in-module tests against the doc table.
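The absolute check amounts to a single comparison per operation. The sketch below is an illustration under stated assumptions, not the code in src/bench.rs: only the constant name and its value come from the text, while the `BenchRow` shape, the `over_budget` helper, the operation names, and the numbers are hypothetical.

```rust
/// Tolerance multiplier quoted in the text for src/bench.rs::P95_TOLERANCE.
const P95_TOLERANCE: f64 = 1.10;

/// Hypothetical row shape: operation name plus measured and budgeted p95 (ms).
struct BenchRow<'a> {
    op: &'a str,
    measured_p95_ms: f64,
    budget_p95_ms: f64,
}

/// Names of operations whose measured p95 exceeds budget × tolerance.
fn over_budget<'a>(rows: &[BenchRow<'a>]) -> Vec<&'a str> {
    rows.iter()
        .filter(|r| r.measured_p95_ms > r.budget_p95_ms * P95_TOLERANCE)
        .map(|r| r.op)
        .collect()
}

fn main() {
    // Illustrative numbers only; these are not rows from PERFORMANCE.md.
    let rows = [
        BenchRow { op: "store", measured_p95_ms: 21.0, budget_p95_ms: 20.0 }, // 5% over: passes
        BenchRow { op: "recall", measured_p95_ms: 56.0, budget_p95_ms: 50.0 }, // 12% over: fails
    ];
    println!("over budget: {:?}", over_budget(&rows));
}
```

A run is green under this guard only when the returned list is empty.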

Baseline guard

ai-memory bench --baseline <file.json> compares each operation's fresh p95 against a captured baseline JSON (load_baseline / compare_against_baseline); growth beyond --regression-threshold (default DEFAULT_REGRESSION_THRESHOLD_PCT = 10.0, src/bench.rs) is reported as a regression. Operations absent from the baseline are skipped, so adding a bench row never invalidates old baselines.
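A drift comparison of this kind can be sketched as below. This is not the body of `compare_against_baseline`: the `regressions` helper, the data shapes, and the sample numbers are hypothetical. The growth formula (percentage increase relative to the baseline p95) is also an assumption, since the text does not spell it out.

```rust
use std::collections::HashMap;

/// Default threshold quoted in the text (DEFAULT_REGRESSION_THRESHOLD_PCT).
const DEFAULT_REGRESSION_THRESHOLD_PCT: f64 = 10.0;

/// Hypothetical comparison: returns (operation, growth %) for each regression.
/// Growth is assumed to be the percentage increase over the baseline p95.
fn regressions<'a>(
    baseline: &HashMap<&str, f64>,
    fresh: &[(&'a str, f64)],
    threshold_pct: f64,
) -> Vec<(&'a str, f64)> {
    let mut out = Vec::new();
    for &(op, p95) in fresh {
        // Operations absent from the baseline are skipped, as the text describes.
        let Some(&base) = baseline.get(op) else { continue };
        let growth_pct = (p95 - base) / base * 100.0;
        if growth_pct > threshold_pct {
            out.push((op, growth_pct));
        }
    }
    out
}

fn main() {
    // Illustrative numbers only, in milliseconds.
    let baseline = HashMap::from([("store", 20.0), ("recall", 50.0)]);
    let fresh = [("store", 21.0), ("recall", 56.0), ("link", 9.0)];
    for (op, pct) in regressions(&baseline, &fresh, DEFAULT_REGRESSION_THRESHOLD_PCT) {
        println!("regression: {op} grew {pct:.1}%");
    }
}
```

In this example `link` has no baseline entry and is skipped, which mirrors the property that adding a bench row never invalidates an older baseline file.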

CI gate

.github/workflows/bench.yml (job bench) runs the bench workload on every pull request against main / develop / release/** and fails the PR when any p95 breaks budget × tolerance.

# capture a baseline on a known-good HEAD, then gate future runs against it
BASELINE="baseline-$(git rev-parse --short HEAD).json"
ai-memory bench --json > "$BASELINE"
ai-memory bench --baseline "$BASELINE" --regression-threshold 10

The committed-baseline precedent

docs/BASELINE-v0.6.3.1.md is the project's committed-baseline precedent: a canonical, code-derived snapshot of what release v0.6.3.1 actually was (workspace, features, surfaces, budgets — every number with a traceable file:line source, zero aspirational claims). It is the reference document the long-running drift audit (#512) measures published surfaces against. The same discipline carries into v0.7.0: baselines are committed artifacts with provenance, not screenshots.

▸ Contract 3 · Test suite

The release-gate full-suite reproducibility contract.

The final v0.7.0 release-gate campaign (2026-05-22 dossier) closed with a full-suite verdict of 7,321 passed / 0 failed / 0 skipped — and published the six-point contract that lets anyone re-derive that verdict rather than take it on faith:

Contract point	Pinned value (from the dossier's "Reproducibility contract")
1 · Branch + tip	`release/v0.7.0-mobile-ci-1068` (tracking `origin/release/v0.7.0`), HEAD `fd172f2cf629309514cd5dad486c2e59ac4eed39`
2 · Binary	Single release binary, sha256 `d4b60aa5b8f97470d95007f30bddb15e7e35c3855f0085c6b4f43d57f6b4ef3e`
3 · Exact invocation	`cargo test --release --no-default-features --features sal,sal-postgres,sqlite-bundled -- --include-ignored --test-threads=1`
4 · Environment	macOS Sequoia / Darwin 25.4.0; lan-parity PG + AGE container (`PG16 + AGE 1.6.0 + pgvector 0.8.2` on `127.0.0.1:15432`); `AI_MEMORY_TEST_POSTGRES_URL` + `AI_MEMORY_TEST_AGE_URL` bound to it
5 · Schema version	v49 (current at that campaign's HEAD; the ladder has since advanced to v78. Re-running at today's HEAD tests against the current schema constant, so reproducing the v49 verdict requires checking out the pinned tip SHA from point 1)
6 · Authoring + QC agents	Named in the dossier, with two independent QC passes

The dossier also binds the verdict to the campaign discipline that produced it: every failure filed as a GitHub issue at the moment of discovery, every fix retested against a freshly recompiled binary, and the final run executed against the composite tip — so the published number is the output of a documented procedure, not a survivor-curated log. The same README carries the per-issue root-cause table and per-track results (track-a/b/c files alongside it).

▸ Why three contracts

Each contract pins a different failure mode.

Contract	Question it answers	Primary evidence
Fleet (do-1461)	"Does the documented deployment actually converge to the verified state — twice, from nothing?"	`verify-20260608T231547Z.tsv` + `verify-20260609T133956Z.tsv` (119/119 both), reference architecture page
Bench	"Is this build still inside its published latency contract, and has it drifted against a known-good run?"	`PERFORMANCE.md`, `src/bench.rs` (`P95_TOLERANCE`, `DEFAULT_REGRESSION_THRESHOLD_PCT`), `Bench · bench` CI job, BASELINE-v0.6.3.1.md
Full suite	"Can a third party re-derive the release-gate test verdict from pinned inputs?"	Release-gate dossier §Reproducibility contract

Control-level verification evidence (which named test or harness check exercises which NSA CSI control) lives on the companion page: NSA CSI MCP Control → Test Matrix.