ai-memory ship-gate¶
v0.6.0.0 — FULL 4/4 GREEN on real DigitalOcean infrastructure
Campaign v0.6.0.0-final-r25 returned overall_pass: true on 2026-04-20 at 18:05 UTC.
Every phase passed on fresh DigitalOcean droplets against commit 710ad76:
- Phase 1 functional — ✅ PASS on all three peer droplets
- Phase 2 federation — ✅ PASS (200/200/200 convergence, both quorum probes classified correctly)
- Phase 3 migration — ✅ PASS (1000/1000 SQLite↔Postgres round-trip, 0 errors)
- Phase 4 chaos — ✅ PASS (
kill_primary_mid_writeconvergence_bound = 1.0)
→ r25 evidence JSON (on GitHub) · → r25 offline HTML · → v0.6.0 release notes
Reproducible release-testing for ai-memory-mcp. Every release candidate is validated by a four-phase campaign on real DigitalOcean infrastructure before a git tag is cut. This site is where the evidence is published.
- Latest campaign runs · live dashboard of every attempt
- Methodology · exactly what each phase measures
- 14-day soak · long-horizon stability testing (84 runs per release, every 4h)
- Reproducing · run it yourself on your own DO account
- Security · TLS, mTLS, dead-man switch, key custody
The 60-second pitch¶
Releasing infrastructure software without publishing the evidence of its correctness is asking users to take it on faith. ai-memory-mcp does the opposite. Every release candidate runs through this campaign on fresh DigitalOcean droplets, every phase produces a JSON artifact, every artifact lands in this repository, and every artifact is rendered into a browsable page on this site. If you want to dispute a finding, fork the repo, point the workflow at your own DigitalOcean account, and reproduce the run in about fifteen minutes for roughly ten cents.
What this means to you¶
Correctness claims land differently depending on who's reading. Three audiences, same underlying truth, three framings:
What does the green "overall_pass" badge on a campaign page mean for you?
It means the version of ai-memory you are about to install via
brew, apt, Docker, or cargo install passed every test we
know how to put it through, on fresh cloud hardware, within the
past few hours. Specifically:
- Your memories write, read, and survive a process restart.
- If you run three ai-memory instances that talk to each other, a memory written on one shows up on the others.
- If one of those instances crashes mid-write, the survivors don't lose data.
- Upgrades between SQLite and Postgres storage don't corrupt anything.
You don't have to take our word for any of this. Every artifact — every write attempt, every crash, every count check — is linked from Campaign runs. Your IT administrator, your security reviewer, or your curious teenager can read them.
A green badge is a promise the evidence is there. A red badge means a release is blocked until we've fixed whatever failed.
What business risk does this campaign buy down, at what cost?
- Regression risk. Every release candidate is validated against a four-phase protocol that exercises functional, federation, migration, and chaos behaviors. Campaign status is a hard gate on publishing a git tag. A red campaign means no tag, no package, no Docker image. Product leadership can see what is and isn't shippable at any moment in time.
- Audit posture. Artifacts are immutable, versioned, and public. A compliance reviewer asking "how do you know this build is safe for production?" gets a URL, not a narrative. Every campaign run is signed by GitHub Actions (OIDC) and pinned to a specific git SHA, so provenance is traceable from "what's running in my fleet" back to "which exact test batch validated it".
- Velocity. A full campaign completes in ~13-15 minutes. Release decisions don't block on multi-hour QA cycles.
- Cost. ~$0.10 of DigitalOcean compute per clean run under the runner-driven SSH methodology. In-droplet dead-man switch caps worst-case spend at well under $10 per run even in pathological failure modes.
- Peer review. Any customer, partner, or regulator can reproduce the campaign on their own account and confirm or dispute the result. Structurally different from closed-box vendors attesting their own test suites.
ROI framing. The question is "what would a silent data-loss bug in the field cost us?" That number is large. This campaign is small-number insurance against it.
What invariants does the campaign defend, and how?
| Invariant | Phase | Pass criterion |
|---|---|---|
| Single-node CRUD + MCP + curator functional | 1 | All per-node assertions green |
| Quorum writes (W=2/N=3) eventually consistent on all three peers | 2 | Each node's count ≥ 95% of ok after 90 s settle; both quorum probes classify correctly |
| SQLite ↔ Postgres migration idempotent + lossless | 3 | src.total == dst.total, errors = 0, re-run no-op |
Fault tolerance: convergence_bound ≥ 0.995 per chaos class |
4 | min(count_node1, count_node2) / total_ok across cycles |
The aggregator (scripts/collect_reports.sh) emits a single
campaign-summary.json with top-level overall_pass. The
workflow fails the build on overall_pass: false.
Claim shape for chaos — specifically that we emit a convergence bound, not a loss probability, and why — is documented in ADR-0001 § Chaos-testing methodology. The full mechanics per phase, including settle-time derivation, probe semantics, and what the metric does and does NOT prove, live in Methodology.
Goals of the campaign¶
Every campaign chases six goals, in priority order. If a change doesn't strengthen one of these, it doesn't belong.
- Catch regressions before they ship. A red campaign blocks the release tag. No bypass paths — urgent hotfixes go through their own campaign first.
- Exercise real infrastructure, not mocks. Mocks lie about timing, about kernel-level partition semantics, about SIGKILL racing with an open WAL. Real droplets don't.
- Publish evidence, not claims. Every assertion is paired with
the artifact that established it. Sentences aren't truth;
runs/<campaign>/phase2.jsonis the thing that makes sentences true. - Keep the methodology peer-reviewable. Scripts are shell, not a DSL. Infrastructure is Terraform, not click-ops. Anyone with a DigitalOcean account and this repo can run the exact same test.
- Bound cost and blast radius. Campaigns cost dimes, not dollars. A runaway campaign is killed by the in-droplet dead-man switch at 8 hours regardless of workflow state. A misconfigured workflow cannot exceed its 60-minute single-job ceiling.
- Document what the campaign does NOT cover. LLM-mediated paths, benchmark numbers, and week-long soaks are out of scope and live elsewhere — see Methodology § What the methodology does NOT cover.
The four phases¶
| Phase | Measures | Default budget |
|---|---|---|
| 1 — functional | CRUD, MCP handshake, curator dry-run, backup | ~30 s (3 nodes in parallel) |
| 2 — federation | 4 agents × 50 writes × 3 nodes, 90 s settle, quorum probes | ~3 min |
| 3 — migration | 1000-memory SQLite → Postgres → SQLite round-trip | ~3 min |
| 4 — chaos | 50 cycles × 2 default fault classes (configurable) | ~6-8 min |
Full mechanics: Methodology.
Infrastructure¶
| Component | Size | Role |
|---|---|---|
node-a, node-b, node-c |
s-2vcpu-4gb |
Peer nodes for quorum mesh + migration host |
chaos-client |
s-1vcpu-2gb |
Orchestration + Phase 4 local 3-process harness |
| VPC | 10.250.0.0/24 in region |
Peer-to-peer federation (port 9077) is VPC-only |
| Firewall | DO Cloud Firewall | SSH from runner; 9077 from VPC + runner only |
All infrastructure is defined in
terraform/.
The workflow runs terraform apply at campaign start and
terraform destroy at teardown regardless of phase outcome.
Current release gate¶
v0.6.0.0 — CERTIFIED. Tag pushed at commit 710ad76 on
2026-04-20 at 18:07 UTC. See Campaign runs for the r25
artefacts and every prior attempt in the series.
Next gate: v0.6.0.1 — will focus on closing partition_minority
convergence, deferred from the v0.6.0 release (was informational,
not release-blocking — kill_primary_mid_write is the disaster
scenario customers care about and is the required chaos class).
Cost per run¶
~$0.10 for a clean ~15-minute run under the runner-driven SSH
methodology (commit f81bd76) — the GitHub Actions runner holds
orchestration, four DigitalOcean droplets hold workload. The
in-droplet dead-man switch still caps uptime at 8 hours regardless
of workflow state — see Security.