A result you cannot reproduce is an anecdote. ai-memory v0.7.0 ships three reproducibility contracts, each with a committed mechanism and citable evidence: the two-round clean-room fleet reproduction (do-1461), the performance-baseline + regression-gate mechanism (src/bench.rs), and the release-gate full-suite contract (the 2026-05-22 final campaign). This page documents what each contract pins, where the pins live, and how to re-run them.
The do-1461 reference architecture (15 nodes, 3 regions, 9 federated peers, Batman MAXIMUM-SECURE posture) was stood up, fully verified, torn down, and stood up again from nothing — terraform destroy → fresh terraform apply → full push-based provision → complete re-verification. Both rounds minted the same 100% GREEN result: 119/119 verify checks PASS, asserted by the same harness against two physically distinct fleets.

    # the whole contract is one directory, one deterministic flow
    cd deploy/do-1461
    make seed up provision validate test   # build fleet, verify, full-spectrum test
    make down                              # tear it all down

| Pinned artifact | Mechanism | Asserted by |
|---|---|---|
| Golden binary | Per-round sha256 pinned as the named constant GOLDEN_SHA256 in deploy/do-1461/provision/lib.sh (env-overridable for forks). No literal SHA appears on this page by convention: the pin rotates with scheduled redeploys; the mechanism is the contract. | binary.sha256 verify row, fleet-asserted on all 15 nodes; binary.version pins 0.7.0 |
| Schema | v55, lockstep across all 9 peers (postgres adapter) | caps.db_schema_version verify row per peer |
| Substrate stack | Pinned pgdg .debs (PostgreSQL 18.4, Apache AGE 1.7.0, pgvector 0.8.2) + pinned Ollama release; named constants in provision/lib.sh | store.pg_version / store.age_version / store.pgvector_version / store.age_graph verify rows per pg node |
| Seed corpus | sha256-pinned in CORPUS_MANIFEST.json; attested Atlas Corpus baseline committed under deploy/do-1461/atlas/results/ | Atlas ingest reports (atlas-*.tsv) |
| Trust material | Campaign CA + per-node keys generated once per round; zero-touch enrollment proves peers join via CA credential alone | zerotouch::cred_ca_chain[peer] ×9 + zerotouch::unenrolled_status[peer] ×9 (full-spectrum harness) |
| Tunables | Every knob is a named constant in provision/lib.sh, not a magic literal | Reviewed source; the same constants drive both rounds |
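
The fleet-level binary.sha256 assertion amounts to comparing every node's reported hash against the single per-round pin. A minimal sketch of that comparison follows; the function name, the node names, and the short hash strings are hypothetical placeholders, and the real check lives in the validate harness rather than in Rust.

```rust
/// Returns the nodes whose reported binary hash differs from the pinned one.
/// `reported` holds (node name, reported sha256) pairs; both are illustrative.
fn mismatched_nodes<'a>(reported: &[(&'a str, &str)], golden_sha256: &str) -> Vec<&'a str> {
    reported
        .iter()
        .filter(|(_, hash)| *hash != golden_sha256)
        .map(|(node, _)| *node)
        .collect()
}

fn main() {
    // Placeholder values standing in for GOLDEN_SHA256 and per-node reports.
    let golden = "aaaa";
    let reported = [("node-01", "aaaa"), ("node-02", "aaaa"), ("node-03", "bbbb")];

    let bad = mismatched_nodes(&reported, golden);
    println!("mismatched nodes: {:?}", bad); // prints: mismatched nodes: ["node-03"]
}
```

An empty result is the per-round claim: every node runs the same pinned build.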
Round 1: verify-20260608T231547Z.tsv. Round 2: verify-20260609T133956Z.tsv (both under .local-runs/do-1461/reports/, emitted by deploy/do-1461/validate/run.sh). Both reports contain the same 119 checks with the same expectations, all PASS. A row-by-row diff of the two rounds shows exactly three kinds of value-level difference, each one expected and none a behavioral delta:
| Differing field | Why it differs | Rows |
|---|---|---|
| binary.sha256 value | Each round pins and fleet-asserts its own golden build. Within a round, all 15 nodes report the identical hash; that is the per-round determinism claim. | 15 (one per node) |
| daemon_pg.tls observed counter | The check asserts "≥1 ssl backend, 0 plaintext"; the observed ssl= count is a cumulative connection counter (9 vs 18). Criterion and verdict identical. | 3 (one per pg node) |
| fleet federation.write / federation.convergence probe id | The convergence probe writes a fresh throwaway memory each run; the row carries that run's UUID. Criterion and verdict identical. | 2 |
Every other field of every other row is byte-identical across the two rounds. The round-2 fleet additionally passed the full-spectrum suite (test-20260609T161203Z.tsv, 150/150 PASS across the regression / crypto / federation / zerotouch / a2a / ai_nhi / nsa_gaps / curator groups) and the recursive-learning suite (recursive-20260609T160851Z.tsv).
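
The row-by-row comparison can be sketched as a diff that tolerates value changes only on the checks itemised in the table. The four-column layout (node, check, verdict, observed) and the function names below are assumptions made for illustration; the real column layout of the verify-*.tsv reports is not documented on this page.

```rust
use std::collections::BTreeMap;

/// Checks whose observed value is expected to differ between rounds.
const VOLATILE_CHECKS: [&str; 4] = [
    "binary.sha256",
    "daemon_pg.tls",
    "federation.write",
    "federation.convergence",
];

/// (node, check) -> (verdict, observed). Hypothetical row shape.
type Report = BTreeMap<(String, String), (String, String)>;

/// Parses tab-separated rows; lines without exactly four columns are skipped.
fn parse_report(tsv: &str) -> Report {
    let mut rows = Report::new();
    for line in tsv.lines() {
        let cols: Vec<&str> = line.split('\t').collect();
        if cols.len() == 4 {
            rows.insert(
                (cols[0].to_string(), cols[1].to_string()),
                (cols[2].to_string(), cols[3].to_string()),
            );
        }
    }
    rows
}

/// Lists differences that are NOT explained by a volatile check.
fn unexpected_diffs(a: &Report, b: &Report) -> Vec<String> {
    let mut out = Vec::new();
    for (key, (verdict_a, observed_a)) in a {
        match b.get(key) {
            None => out.push(format!("{}/{}: missing in second report", key.0, key.1)),
            Some((verdict_b, observed_b)) => {
                let volatile = VOLATILE_CHECKS.contains(&key.1.as_str());
                if verdict_a != verdict_b {
                    out.push(format!("{}/{}: verdict changed", key.0, key.1));
                } else if observed_a != observed_b && !volatile {
                    out.push(format!("{}/{}: observed value changed", key.0, key.1));
                }
            }
        }
    }
    for key in b.keys() {
        if !a.contains_key(key) {
            out.push(format!("{}/{}: missing in first report", key.0, key.1));
        }
    }
    out
}

fn main() {
    // Made-up sample rows; only the 9 vs 18 counter echoes the table above.
    let round1 = parse_report("pg-1\tdaemon_pg.tls\tPASS\tssl=9\npg-1\tbinary.version\tPASS\t0.7.0\n");
    let round2 = parse_report("pg-1\tdaemon_pg.tls\tPASS\tssl=18\npg-1\tbinary.version\tPASS\t0.7.0\n");
    println!("unexpected diffs: {:?}", unexpected_diffs(&round1, &round2));
}
```

A verdict change is always reported, even on a volatile check, because only the observed value is allowed to move between rounds.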
What "reproducible" means here: from the same deploy/do-1461/ directory, two independent clean-room executions of the same flow produced the same 100%-green verdict against the same pinned artifact set, with each round's binary hash fleet-asserted on every node. It does not mean bit-identical reports (timestamps, run UUIDs, and cumulative counters necessarily differ, as itemised above), and it does not claim a deterministic build of the binary itself: the golden binary is pinned by hash and asserted, not rebuilt bit-for-bit per round. One platform footnote carries over from the reference page: DigitalOcean's per-region default VPC container cannot be deleted, so teardown destroys 100% of compute and data while the empty VPC shell is reused; it holds no state between runs.
Performance claims are held by two independent guards in src/bench.rs, both reproducible from the CLI. The absolute guard checks every measured p95 against the published budget table; the baseline guard checks every measured p95 against a previously captured run, catching drift that stays inside the absolute budget.
PERFORMANCE.md is the authoritative latency contract. Honest split, stated in the doc itself: 7 of 14 budget rows are bench-verified; the remaining 7 are explicitly marked [advisory] pending bench fixtures — advisory rows are never claimed as verified.
src/bench.rs::P95_TOLERANCE = 1.10 — a run fails when any measured p95 exceeds its PERFORMANCE.md budget by more than the published 10% tolerance. Budgets are pinned in-code and cross-checked by in-module tests against the doc table.
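
The absolute guard reduces to a single comparison per operation. The sketch below assumes the constant value quoted above; the `BenchRow` struct, the `breaks_budget` function, and the sample numbers are hypothetical and are not taken from src/bench.rs or PERFORMANCE.md.

```rust
/// Assumed to mirror the documented P95_TOLERANCE in src/bench.rs.
const P95_TOLERANCE: f64 = 1.10;

/// Hypothetical row shape: operation name, budget, and measured p95 (both in ms).
struct BenchRow {
    op: &'static str,
    budget_ms: f64,
    measured_p95_ms: f64,
}

/// A row fails when its measured p95 exceeds budget x tolerance.
fn breaks_budget(row: &BenchRow) -> bool {
    row.measured_p95_ms > row.budget_ms * P95_TOLERANCE
}

fn main() {
    // Illustrative numbers only.
    let rows = [
        BenchRow { op: "op_a", budget_ms: 50.0, measured_p95_ms: 54.0 }, // within 55.0
        BenchRow { op: "op_b", budget_ms: 50.0, measured_p95_ms: 56.0 }, // over 55.0
    ];
    let failures: Vec<&str> = rows
        .iter()
        .filter(|r| breaks_budget(r))
        .map(|r| r.op)
        .collect();
    println!("budget failures: {:?}", failures); // prints: budget failures: ["op_b"]
}
```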
ai-memory bench --baseline <file.json> compares each operation's fresh p95 against a captured baseline JSON (load_baseline / compare_against_baseline); growth beyond --regression-threshold (default DEFAULT_REGRESSION_THRESHOLD_PCT = 10.0, src/bench.rs) is reported as a regression. Operations absent from the baseline are skipped, so adding a bench row never invalidates old baselines.
.github/workflows/bench.yml (job bench) runs the bench workload on every pull request against main / develop / release/** and fails the PR when any p95 breaks budget × tolerance.

    # capture a baseline on known-good HEAD, then gate future runs against it
    BASELINE="baseline-$(git rev-parse --short HEAD).json"
    ai-memory bench --json > "$BASELINE"
    # later, on the candidate build: compare against the captured file
    ai-memory bench --baseline "$BASELINE" --regression-threshold 10

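
The baseline guard can be sketched as a percentage-growth check that skips operations missing from the baseline. The real load_baseline / compare_against_baseline signatures and the baseline JSON layout are not shown on this page, so the `find_regressions` function, the map type, and the sample values here are assumptions.

```rust
use std::collections::BTreeMap;

/// Assumed to mirror the documented default in src/bench.rs.
const DEFAULT_REGRESSION_THRESHOLD_PCT: f64 = 10.0;

/// Returns (operation, growth in percent) for every op whose fresh p95 grew
/// past the threshold. Ops absent from the baseline are skipped, so adding a
/// new bench row does not invalidate an older baseline.
fn find_regressions(
    fresh: &BTreeMap<String, f64>,
    baseline: &BTreeMap<String, f64>,
    threshold_pct: f64,
) -> Vec<(String, f64)> {
    let mut regressions = Vec::new();
    for (op, &fresh_p95) in fresh {
        if let Some(&base_p95) = baseline.get(op) {
            if base_p95 <= 0.0 {
                continue; // avoid dividing by a zero or negative baseline
            }
            let growth_pct = (fresh_p95 - base_p95) / base_p95 * 100.0;
            if growth_pct > threshold_pct {
                regressions.push((op.clone(), growth_pct));
            }
        }
    }
    regressions
}

fn main() {
    // Illustrative p95 values in milliseconds.
    let mut baseline = BTreeMap::new();
    baseline.insert("store".to_string(), 10.0);
    baseline.insert("recall".to_string(), 20.0);

    let mut fresh = BTreeMap::new();
    fresh.insert("store".to_string(), 10.5); // +5%: inside the threshold
    fresh.insert("recall".to_string(), 25.0); // +25%: regression
    fresh.insert("new_op".to_string(), 99.0); // not in baseline: skipped

    for (op, pct) in find_regressions(&fresh, &baseline, DEFAULT_REGRESSION_THRESHOLD_PCT) {
        println!("regression: {} grew {:.1}%", op, pct);
    }
}
```

Comparing relative growth rather than absolute milliseconds is what lets this guard catch drift that still sits inside the absolute budget.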
docs/BASELINE-v0.6.3.1.md is the project's committed-baseline precedent: a canonical, code-derived snapshot of what release v0.6.3.1 actually was (workspace, features, surfaces, budgets — every number with a traceable file:line source, zero aspirational claims). It is the reference document the long-running drift audit (#512) measures published surfaces against. The same discipline carries into v0.7.0: baselines are committed artifacts with provenance, not screenshots.
The final v0.7.0 release-gate campaign (2026-05-22 dossier) closed with a full-suite verdict of 7,321 passed / 0 failed / 0 skipped — and published the six-point contract that lets anyone re-derive that verdict rather than take it on faith:
| Contract point | Pinned value (from the dossier's "Reproducibility contract") |
|---|---|
| 1 · Branch + tip | release/v0.7.0-mobile-ci-1068 (tracking origin/release/v0.7.0), HEAD fd172f2cf629309514cd5dad486c2e59ac4eed39 |
| 2 · Binary | Single release binary, sha256 d4b60aa5b8f97470d95007f30bddb15e7e35c3855f0085c6b4f43d57f6b4ef3e |
| 3 · Exact invocation | cargo test --release --no-default-features --features sal,sal-postgres,sqlite-bundled -- --include-ignored --test-threads=1 |
| 4 · Environment | macOS Sequoia / Darwin 25.4.0; lan-parity PG + AGE container (PG16 + AGE 1.6.0 + pgvector 0.8.2 on 127.0.0.1:15432); AI_MEMORY_TEST_POSTGRES_URL + AI_MEMORY_TEST_AGE_URL bound to it |
| 5 · Schema version | v49 (current at that campaign's HEAD; the ladder has since advanced to v55 — re-running at today's HEAD reproduces against the current schema constant, which is the point of pinning the tip SHA) |
| 6 · Authoring + QC agents | Named in the dossier, with two independent QC passes |
The dossier also binds the verdict to the campaign discipline that produced it: every failure filed as a GitHub issue at the moment of discovery, every fix retested against a freshly recompiled binary, and the final run executed against the composite tip — so the published number is the output of a documented procedure, not a survivor-curated log. The same README carries the per-issue root-cause table and per-track results (track-a/b/c files alongside it).
| Contract | Question it answers | Primary evidence |
|---|---|---|
| Fleet (do-1461) | "Does the documented deployment actually converge to the verified state — twice, from nothing?" | verify-20260608T231547Z.tsv + verify-20260609T133956Z.tsv (119/119 both), reference architecture page |
| Bench | "Is this build still inside its published latency contract, and has it drifted against a known-good run?" | PERFORMANCE.md, src/bench.rs (P95_TOLERANCE, DEFAULT_REGRESSION_THRESHOLD_PCT), Bench · bench CI job, BASELINE-v0.6.3.1.md |
| Full suite | "Can a third party re-derive the release-gate test verdict from pinned inputs?" | Release-gate dossier §Reproducibility contract |
Control-level verification evidence (which named test or harness check exercises which NSA CSI control) lives on the companion page: NSA CSI MCP Control → Test Matrix.