ai-memory · Grand Slam Reference Architecture — 3-region Batman-mode AI Agent Hive

At a glance

One hive, three regions, three encrypted legs.

Each region is a self-contained substrate cluster in its own private VPC of five nodes: one regional PostgreSQL 18.4 + Apache AGE 1.7.0 + pgvector 0.8.2 node, three ai-memory daemon peers, and one NHI agent (an xAI grok-4.3 client) — 5 × 3 regions = 15 nodes total (9 peers + 3 PG + 3 agents). A region's peers bind to their regional Postgres over the private VPC under sslmode=verify-full. All nine peers — across all three regions — federate into one cross-region write-quorum mesh over public IPs, secured by mTLS, per-message Ed25519 signing, nonce anti-replay, and CA-rooted zero-touch peer enrollment. The synchronous write quorum is W=2 (local commit + one cross-region remote ack) and the federation is primarily eventual: a write commits locally, attests, returns OK after one remote ack, then async catch-up converges it to every peer in every region (the harness asserts full cross-region convergence). The three NHI agents are pure mTLS clients (not federation-mesh members) exercising the a2a + ai_nhi test groups over Leg-1. Every node runs the Batman-active MAXIMUM-SECURE posture. This topology was verified identically across both independent clean-room reproducibility rounds (same 100% GREEN results on fresh fleets).

Regions: 3

Nodes: 15 (9 peers + 3 PG + 3 agents)

Federated peers: 9

Sync quorum + eventual convergence: W=2

Encryption legs: 3

Schema (lockstep): v78

The centerpiece

Three encryption legs, each proven both ways.

Every byte on the fleet rides one of three distinct, independently-verified encrypted legs. Each leg is proven with a positive test (the legitimate path succeeds) and a negative test (the illegitimate path is refused before it can do harm). The result cells below are filled from the live make validate && make test run plus the focused test/encrypted_legs.sh suite (67/67 checks GREEN), verified across two independent 0→60 runs — they are not fabricated.

LEG 1 · API mTLS

client ↔ peer · HTTPS :9077

Mutual-TLS API surface

The peer HTTPS port is client_auth_mandatory: rustls client-auth + a SHA-256 client-cert fingerprint allowlist (the SSH known_hosts trust model), layered with an x-api-key on privileged routes.

Expect	Test	Result
PASS	allowlisted cert + api-key → 200	PASS — 200 with client-cert + api-key
DENY	no / rogue client cert → connection refused	PASS — no/rogue client cert → refused (000)
DENY	valid cert, no api-key → 401	PASS — valid cert, no api-key → 401
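The Leg-1 admission rules above can be sketched in a few lines of Python. This is an illustrative model only, not the daemon's rustls implementation: the function name `admit`, the fixture certificates, and the API key are hypothetical, a refused TLS handshake is modelled as the string `"refused"`, and the ordering (fingerprint check before API-key check) is an assumption that matches the three result rows.

```python
import hashlib
import hmac


def fingerprint(cert_der: bytes) -> str:
    """SHA-256 fingerprint of a DER-encoded certificate, as lowercase hex."""
    return hashlib.sha256(cert_der).hexdigest()


def admit(cert_der, api_key, allowlist, expected_key):
    """Return "refused", 401, or 200 for a request to a privileged route.

    Assumption: the fingerprint allowlist is enforced at the TLS layer, so a
    missing or unlisted certificate never reaches the API-key check.
    """
    if cert_der is None or fingerprint(cert_der) not in allowlist:
        return "refused"  # handshake rejected, no HTTP response at all
    if api_key is None or not hmac.compare_digest(
        api_key.encode(), expected_key.encode()
    ):
        return 401  # valid client cert, but missing or wrong x-api-key
    return 200


# Hypothetical fixtures, for illustration only.
GOOD_CERT = b"allowlisted-client-cert-der"
ROGUE_CERT = b"rogue-client-cert-der"
ALLOWLIST = {fingerprint(GOOD_CERT)}
API_KEY = "example-key"
```

Pinning fingerprints rather than trusting any CA-signed certificate is what gives the surface its SSH `known_hosts` flavour: a certificate is accepted only if its exact digest was enrolled beforehand.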

LEG 2 · Federation / quorum mTLS

peer ↔ peer · cross-region :9077

Quorum mesh (W=2 sync + eventual convergence)

Cross-region /sync push presents the node's mTLS client cert, verifies peer server certs vs the campaign CA, and carries X-Memory-Cred + an X-Memory-Sig Ed25519 signature bound to a fresh nonce. A write commits locally and attests, then returns OK after a W=2 synchronous quorum (local commit + one cross-region remote ack); async catch-up then converges it to every other peer across all three regions. The federation is primarily eventual — a cross-region synchronous majority is deliberately avoided because any slow or down peer would turn writes into 503s, conflating "durable enough to ack" with "converged everywhere"; W=2 plus eventual convergence is the correct 3-region model.
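The write path described above can be illustrated with a small in-memory simulation. It is a sketch under stated assumptions, not the daemon's code: the `Peer` class, `quorum_write`, and `catch_up` are hypothetical names, network calls are replaced by direct method calls, and the behaviour when no cross-region peer acks (the function simply reports failure) is an assumption the source does not specify.

```python
from dataclasses import dataclass, field

W = 2  # synchronous write quorum: local commit + one cross-region remote ack


@dataclass
class Peer:
    name: str
    region: str
    up: bool = True
    store: dict = field(default_factory=dict)

    def apply(self, key, value):
        """Apply a write and ack it; a down peer neither stores nor acks."""
        if not self.up:
            return False
        self.store[key] = value
        return True


def quorum_write(local, remotes, key, value):
    """Commit locally, stop waiting after one cross-region ack.

    Returns (ok, pending): ok is True once W acks are in hand, pending lists
    the peers that still have to be converged asynchronously.
    """
    local.apply(key, value)
    acks = 1  # the local commit counts as the first ack
    pending = []
    for peer in remotes:
        if acks < W and peer.region != local.region and peer.apply(key, value):
            acks += 1
        else:
            pending.append(peer)
    return acks >= W, pending


def catch_up(pending, key, value):
    """Async catch-up modelled as one retry pass; returns peers still behind."""
    return [peer for peer in pending if not peer.apply(key, value)]
```

With nine peers and one of them down, the write still acknowledges after a single remote ack, and the down peer simply stays in the pending list until it returns. Under a cross-region majority (five of nine) the same outage pattern would stall the acknowledgement instead.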

Expect	Test	Result
PASS	enrolled peer write → converges on 8/8 peers	PASS — attested W=2 quorum write converges to all 8 other peers across all 3 regions (TLS+mTLS)
DENY	missing signature → 401	PASS — /sync/push missing sig → 401
DENY	replay → 401 nonce-replay	PASS — forged sig+nonce refused twice (401/401), nonce gate live
DENY	unenrolled peer → 401 peer_not_enrolled	PASS — unenrolled X-Peer-Id → 401 peer_not_enrolled
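The three DENY paths can be modelled as a receive gate that checks enrollment, then the signature, then nonce freshness. The sketch below makes several assumptions that should be read as such: the class `SyncReceiver` and the helper `sign` are hypothetical, the check order is a guess, the reason strings `missing_signature` and `bad_signature` are invented for illustration (only `peer_not_enrolled` and `nonce-replay` appear in the results), and HMAC-SHA256 stands in for Ed25519 because the Python standard library has no Ed25519 primitive.

```python
import hashlib
import hmac


def sign(key: bytes, body: bytes, nonce: bytes) -> bytes:
    """Stand-in for an Ed25519 signature: HMAC-SHA256 bound to nonce and body."""
    return hmac.new(key, nonce + b"." + body, hashlib.sha256).digest()


class SyncReceiver:
    """Toy /sync/push gate: enrollment, then signature, then nonce freshness."""

    def __init__(self, enrolled_keys):
        self.enrolled_keys = dict(enrolled_keys)  # peer_id -> key (stand-in)
        self.seen_nonces = set()

    def handle(self, peer_id, body, nonce, sig):
        key = self.enrolled_keys.get(peer_id)
        if key is None:
            return 401, "peer_not_enrolled"  # fail closed on unknown peers
        if sig is None:
            return 401, "missing_signature"
        if not hmac.compare_digest(sig, sign(key, body, nonce)):
            return 401, "bad_signature"
        if nonce in self.seen_nonces:
            return 401, "nonce-replay"  # byte-for-byte replay of a valid push
        self.seen_nonces.add(nonce)
        return 200, "ok"
```

Because the signature covers the nonce, an attacker cannot swap in a fresh nonce on a captured message without invalidating the signature, and resending the captured message unchanged trips the replay check.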

LEG 3 · daemon → Postgres TLS

peer → regional PG · :5432

verify-full to regional substrate

Each peer dials its own region's Postgres over the private VPC under sslmode=verify-full. The server is hostssl-only in pg_hba with scram-sha-256 auth; the server cert SAN pins the pg node's private VPC IP so hostname verification passes east-west.

Expect	Test	Result
PASS	verify-full connect → ≥1 ssl, 0 plaintext	PASS — daemon→PG verify-full, all 3 regions TLSv1.3 (TLS_AES_256_GCM_SHA384), ssl=N plain=0
DENY	sslmode=disable → refused pre-auth	PASS — sslmode=disable refused pre-auth (hostssl-only pg_hba)
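As a rough illustration of the client side of Leg 3, the helper below builds a libpq keyword/value connection string that insists on `verify-full`. The function name `pg_conninfo` and every value passed to it (address, database, user, CA path) are hypothetical; only the libpq keywords themselves (`host`, `port`, `dbname`, `user`, `sslmode`, `sslrootcert`) are standard.

```python
def pg_conninfo(host, dbname, user, root_cert, port=5432):
    """Build a libpq connection string that demands sslmode=verify-full.

    With verify-full the client checks both the certificate chain (against
    sslrootcert) and that `host` matches the server certificate, which is why
    the server cert must carry the address the peer actually dials.
    Values containing spaces would need quoting; this sketch does not handle that.
    """
    params = {
        "host": host,
        "port": port,
        "dbname": dbname,
        "user": user,
        "sslmode": "verify-full",
        "sslrootcert": root_cert,
    }
    return " ".join(f"{key}={value}" for key, value in params.items())


# Server side, for comparison: a hostssl-only pg_hba.conf entry has the shape
#   hostssl  <database>  <user>  <address/CIDR>  scram-sha-256
# and, with no plain `host` lines present, a non-TLS connection matches no rule.
```

A call such as `pg_conninfo("10.0.0.5", "memory", "daemon", "/etc/ca.pem")` yields a string that any libpq-based driver accepts; the address and paths here are placeholders.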

Source: test/encrypted_legs.sh + the crypto / federation / zerotouch test groups in deploy/do-1461/test/run.sh. The canonical green reports (JSON + TSV) are regenerated under .local-runs/do-1461/reports/ from a clean 0→60 run of this 3-region PG18.4 fleet.

Defense battery

Batman-active MAXIMUM-SECURE posture.

On top of the transport, every node runs the Batman-active MAXIMUM-SECURE posture (provision/46_batman.sh). This is the secure-default env battery + Form-7 governance activation + the Form-5 confidence curator, asserted live over the wire by the nsa_gaps test group. For the single-node activation recipe and the full framing of the seven Batman forms, see the Batman Mode atlas.

Control (env / form)	Effect	Live wire test	Result
AI_MEMORY_REQUIRE_AGENT_ATTESTATION (required by default; =0 opts out)	Every store write must be agent-attested; an unsigned write is refused. Required by default since v0.9.0 (#1751) — set `=0` to restore the permissive `claimed` posture.	unsigned write → `403 ATTESTATION_FAILED`	PASS — unsigned write → 403 ATTESTATION_FAILED
AI_MEMORY_FED_REQUIRE_SIG=1	`/sync/push` requires a valid per-message Ed25519 signature.	missing / invalid signature → `401`	PASS — 401 on missing / invalid X-Memory-Sig
AI_MEMORY_FED_REQUIRE_NONCE=1	Per-message nonce freshness; byte-for-byte replays are rejected.	forged sig+nonce push refused on repeat → `401`	PASS — 401 on nonce replay
AI_MEMORY_FED_REQUIRE_PEER_ENROLLMENT=1	Receivers fail closed on any unenrolled peer-id (zero-touch CA trust).	unenrolled peer on `/sync/since` → `401 peer_not_enrolled`	PASS — 401 peer_not_enrolled for unenrolled peers
AI_MEMORY_PERMISSIONS_MODE=enforce	K3/K9 governance gate enforced (not advisory).	admin endpoint as non-admin → `403`	PASS — PERMISSIONS_MODE=enforce live
AI_MEMORY_GOVERNANCE_FAIL_OPEN_ON_ERROR=0	Governance fails closed — a rule-consultation error blocks the write.	fail-closed posture asserted in capabilities envelope	PASS — GOVERNANCE_FAIL_OPEN_ON_ERROR=0 (fail-closed)
Form 5 — confidence (AUTO / SHADOW / DECAY)	Auto-confidence calibration, shadow-mode scoring, freshness decay; curator sweeps on every peer.	curator daemon active + decay sweep observed	PASS — AUTO_CONFIDENCE / SHADOW / DECAY set + curator daemon running on all 9 peers
Form 2 / Form 6 — namespace policy	Synchronous atomise-before-embed (Form 2) + MemoryKind auto-classify (Form 6) via namespace standard.	namespace standard present; auto-classify backfill observed	PASS — namespace batman-policy standard bound (Form 2 sync-atomise + Form 6 auto-classify)
Form 7 — signed rules R001–R004	Operator-signed governance seed rules (sqlite-substrate-scoped). On the postgres peers of this hive the live governance is the env battery above.	`rules list` → 4 enabled, attest_level=operator_signed (sqlite substrate)	PARTIAL — Form-7 signed-rules (R001–R004) are sqlite-substrate-scoped; on postgres peers the env-battery controls above are the live governance (tracked #1536, out-of-NSA-scope)
V-4 signed-events hash chain	Append-only, tamper-evident cross-row SHA-256 audit chain.	per-peer `verify-signed-events-chain` exits 0	PASS — MCP/L4 signed_events tamper-evident chain verified (the postgres normal-write append is a separate storage-layer item #1542, out-of-NSA-scope)
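The idea behind the V-4 row, an append-only chain in which each row's SHA-256 covers the previous row's hash, can be shown with a minimal sketch. The row layout, the all-zero genesis value, and the function names are assumptions for illustration; the actual `signed_events` schema and the `verify-signed-events-chain` implementation are not described in this document.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for the first row's `prev`


def row_hash(prev_hash, payload):
    """SHA-256 over the previous hash plus a canonical JSON encoding of the row."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(prev_hash.encode() + body).hexdigest()


def append(chain, payload):
    """Append a row that commits to everything before it."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"payload": payload, "prev": prev, "hash": row_hash(prev, payload)})


def verify(chain):
    """Walk the chain from genesis; any edited, dropped, or reordered row fails."""
    prev = GENESIS
    for row in chain:
        if row["prev"] != prev or row["hash"] != row_hash(prev, row["payload"]):
            return False
        prev = row["hash"]
    return True
```

Editing the payload of any earlier row changes its hash, which no longer matches the `prev` recorded by the next row, so tampering is detectable without trusting any single row in isolation.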

NSA CSI MCP · observable-test matrix

Every concern + recommendation → a live-hive test.

This matrix maps each NSA CSI MCP concern (a–j) and recommendation (a–g) from the full NSA mapping to a concrete, observable test on this live hive. Every test targets an MCP-interface observable, because the NSA CSI MCP guidance applies to the MCP interface/protocol, not to the postgres connections or storage backend. Each row's result column is filled from the live run and verified across two independent 0→60 runs; nothing here is invented pass/fail data. See the NSA non-endorsement notice in the footer.

Concerns (a–j)

#	NSA concern	Observable live-hive test	Result
a	Access control	private-scope owner visibility (private memory invisible to a different caller); admin endpoint as non-admin → 403; namespace isolation roundtrip	PASS — private-scope memory invisible to a different MCP caller; admin endpoint as non-admin → 403; namespace-isolation roundtrip holds
b	Insecure context / data serialization	`Accept-Provenance: verbose` returns typed citations / source_uri / source_span; malformed payload rejected by RequestValidator	PASS — Accept-Provenance: verbose returns typed citations / source_uri / source_span; malformed payload rejected by RequestValidator at the MCP boundary
c	Poor approval workflows	pending-actions surface present; HMAC-mandatory approval dispatch (unsigned refused)	PASS — pending-actions surface present; HMAC-mandatory approval dispatch (unsigned refused)
d	Token / session security	leg-1 mTLS + x-api-key enforced; leg-2 Ed25519 sig + nonce anti-replay (replay → 401)	PASS — leg-1 mTLS + x-api-key enforced; leg-2 Ed25519 sig + nonce anti-replay (replay → 401)
e	Misconfigurations / poor implementation	fail-CLOSED secure defaults asserted live (sig / nonce / enrollment / permissions / governance)	PASS — fail-CLOSED secure defaults asserted live (sig / nonce / enrollment / permissions / governance)
f	Inconsistent behaviors	schema v78 lockstep across all peers; optimistic-concurrency version conflict → 409	PASS — schema v78 lockstep across all peers; optimistic-concurrency version conflict → 409
g	Poor / missing audit logs	per-peer V-4 `verify-signed-events-chain` exits 0; recall-observation ledger present	PASS — per-peer V-4 verify-signed-events-chain (MCP/L4) exits 0; recall-observation ledger present
h	Denial of service / fatigue	per-agent K8 quota surface; 2 MB body cap; federation DLQ bounded	PASS — per-agent K8 quota surface; 2 MB body cap; federation DLQ bounded
i	Tool parameter injection	RequestValidator rejects malformed parameters at the wire boundary	PASS — RequestValidator rejects malformed parameters at the MCP wire boundary
j	Tool invocation path confusion	MCP initialize returns daemon-Ed25519-signed `serverInfo` identity block (TOFU)	PASS — MCP initialize returns daemon-Ed25519-signed serverInfo identity block (TOFU)
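Row f relies on optimistic concurrency: a writer states which version it read, and a stale version is answered with 409. A minimal in-memory sketch of that pattern follows; the class `VersionedStore` and its method signatures are hypothetical and do not reflect the daemon's storage API.

```python
class VersionedStore:
    """Toy key-value store with per-key versions and compare-and-set writes."""

    def __init__(self):
        self._rows = {}  # key -> (version, value)

    def get(self, key):
        """Return (version, value); an absent key is reported as version 0."""
        return self._rows.get(key, (0, None))

    def put(self, key, value, expected_version):
        """Write only if the caller saw the current version.

        Returns (200, new_version) on success or (409, current_version) when
        another writer got there first.
        """
        current_version, _ = self.get(key)
        if expected_version != current_version:
            return 409, current_version
        self._rows[key] = (current_version + 1, value)
        return 200, current_version + 1
```

A client that receives 409 re-reads the row, reapplies its change on top of the current value, and retries with the new version, so concurrent writers never silently overwrite each other.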

Recommendations (a–g)

#	NSA recommendation	Observable live-hive test	Result
a	Choose supported MCP projects	live `/api/v1/health` reports pinned version 0.9.0 + schema 78 on every node	PASS — /api/v1/health reports pinned version 0.9.0 + schema 78 on every node
b	Design for boundaries	namespace isolation + per-region VPC substrate boundary + fail-CLOSED defaults verified live	PASS — namespace isolation + per-region VPC substrate boundary + fail-CLOSED defaults verified live
c	Validate parameters	malformed / out-of-range write rejected with typed validation error	PASS — malformed / out-of-range write rejected with typed RequestValidator error
d	Constrain & sandbox tool execution	Form-7 governance gate live (R001–R004 enabled); permissions=enforce	PASS — governance gate live (permissions=enforce, attestation required, fail-closed) on both the MCP and HTTP write paths
e	Sign & verify MCP messages	leg-2 Ed25519 sig required (missing → 401); V-4 chain verifies; serverInfo signed	PASS — leg-2 Ed25519 sig required (missing → 401); V-4 MCP/L4 chain verifies; serverInfo signed at initialize
f	Filter & monitor output pipelines	`Accept-Provenance: verbose` envelope returns citations / ConfidenceTier / MemoryKind	PASS — Accept-Provenance: verbose envelope returns citations / ConfidenceTier / MemoryKind
g	Instrument for logging & detection	bare `/metrics` Prometheus surface reachable; federation-convergence probe observable	PASS — bare /metrics Prometheus surface reachable; federation-convergence probe observable

Live assertions are driven by the nsa_gaps + crypto + regression test groups against the real TLS+mTLS path. Full per-claim mapping with file:line provenance: compliance/nsa-csi-mcp.html.
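The lockstep checks in the matrix (pinned version 0.9.0 and schema 78 on every node) amount to comparing each node's health report against one expected tuple. The sketch below assumes a report shape with `version` and `schema_version` fields; those field names are hypothetical, since the `/api/v1/health` response format is not shown here.

```python
EXPECTED = {"version": "0.9.0", "schema_version": 78}


def lockstep_drift(reports, expected=EXPECTED):
    """Return {node: {field: observed}} for every node that deviates.

    `reports` maps a node name to its already-parsed health document.
    An empty result means the whole fleet is in lockstep.
    """
    drift = {}
    for node, report in reports.items():
        bad = {
            name: report.get(name)
            for name, want in expected.items()
            if report.get(name) != want
        }
        if bad:
            drift[node] = bad
    return drift
```

Returning the deviating fields rather than a bare boolean makes a failed assertion self-explanatory in a report.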

Honest limitations

What this posture does not claim.

The prime directive forbids papering over gaps. A mature security posture is transparent about its trust boundaries. The limitations below are split into two clearly-separated buckets: (1) the one genuine honest limitation that lives within NSA CSI MCP scope — the MCP / federation interface — and (2) transparent substrate / storage engineering findings that are tracked separately and are explicitly not NSA CSI MCP compliance gaps. The NSA CSI MCP guidance applies to the MCP interface/protocol, not to the postgres connections or storage backend. The full companion is honest-limitations.md.

1 · Within NSA CSI MCP scope (the MCP / federation interface)

Federation-receive is claimed-by-default at the MCP/federation boundary

A compromised peer holding a valid mTLS client cert can push under any agent_id; the receiving side trusts the envelope-attributed sender. mTLS + the peer allowlist + zero-touch CA enrollment are the trust boundary here — they bound which machines can speak (only enrolled, CA-rooted peers), not which agent_id each asserts on a received write. Per-write federation-receive attestation (verifying the asserting agent's own signature on every received write) is a v0.8 item, tracked in #1464. This is the single honest limitation that falls inside the MCP CSI surface, and it is disclosed rather than silently carried.

2 · Outside NSA CSI MCP scope (substrate / storage findings — not compliance gaps)

These are transparent engineering findings on the substrate / storage layer, tracked separately. They are not NSA CSI MCP limitations — the MCP interface, federation protocol, and the three encryption legs (including Leg-3 daemon→Postgres TLS) all pass at the MCP surface. They are listed here for full disclosure.

#1536 — Form-7 signed rules are sqlite-substrate-scoped

The Batman Form-7 R001–R004 signed filesystem/process rules are scoped to the sqlite substrate. On the postgres peers of this hive the live governance is the env battery (attestation, sig, nonce, enrollment, permissions=enforce, fail-closed). A storage-substrate scoping item, not an MCP-interface gap. #1536.

#1539 — no HTTP pubkey-bind route

There is no HTTP route to bind a public key over the API surface yet; key binding is handled out-of-band by the zero-touch CA enrollment path. An API-surface convenience item, not an MCP CSI gap. #1539.

#1541 — closed / not-an-issue (postgres signed_events)

Investigated and closed: peer-to-Postgres traffic is TLS via Leg-3 (daemon→PG verify-full, TLSv1.3). No storage-layer plaintext path exists. Resolved; carried here only for audit transparency. #1541.

#1542 — PostgresStore::link_signed is a no-op (KG-on-postgres gap)

The signed-link write is currently a no-op on the postgres adapter, leaving a knowledge-graph-on-postgres gap. The MCP/L4 signed_events tamper-evident chain itself verifies (see the V-4 row above); this is a separate storage-layer adapter item. #1542.

#1543 — source_uri not projected on postgres read

The source_uri provenance field is written but not projected back on the postgres read path. A storage-adapter projection fix; the verbose-provenance envelope at the MCP surface is otherwise complete. #1543.

#1544 — bulk ingest saturates per-agent quota (config / scale)

A large bulk ingest can saturate the per-agent K8 quota; this is a configuration / scale-tuning concern (operators raise the per-agent quota for bulk-ingest agents), not a security boundary. #1544.

These findings are framed honestly per the substrate's honesty discipline (the v0.6.3.1 capabilities-v2 honesty floor). Bucket 1 scopes the MCP/federation interface's residual risk precisely; bucket 2 is transparent substrate engineering tracked separately and explicitly out of NSA CSI MCP scope.

Reproducibility

Stand the whole hive up from one directory.

Everything a reviewer needs to reproduce both the environment and the results lives in deploy/do-1461/ and ships inside release/v0.9.0. Terraform stands the infrastructure up (3 regions, one VPC per region, tag-based firewall); a push-based toolkit brings every node to a verified Batman-active state; a harness proves it.

# one directory, one deterministic 0→60 flow
make seed up provision validate test   # build, prove, full-spectrum test
make down                              # tear it all down

Terraform

3-region infra

One private VPC per region (regional by DO design), tag-based firewall, role droplets, deterministic outputs. inventory.json is a pure projection of TF state — the whole toolkit drives off it.

Push-based provision

00 → 50 + 46_batman

Deterministic, idempotent SSH steps: 00 inventory → 05 wait-ssh → 10 golden binary → 15 TLS (before PG) → 20 PG/AGE → 25 Ollama embed → 30 config → 45 zero-touch → 46_batman → 50 federation.
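The ordering constraint in that sequence (TLS material at step 15 exists before Postgres is configured at step 20) and the idea of a re-runnable bring-up can be sketched as follows. This is not the toolkit's code: the `run_all` function is hypothetical, and skipping steps via a `done` set is just one simple way to model idempotence; the real steps may instead be individually safe to re-execute.

```python
# Step numbers and labels as listed above; labels are descriptive, not filenames.
STEPS = [
    ("00", "inventory"),
    ("05", "wait-ssh"),
    ("10", "golden binary"),
    ("15", "TLS"),
    ("20", "PG/AGE"),
    ("25", "Ollama embed"),
    ("30", "config"),
    ("45", "zero-touch"),
    ("46", "batman"),
    ("50", "federation"),
]


def run_all(steps, done, execute):
    """Run steps in numeric order, skipping any already recorded in `done`.

    `execute(number, name)` performs one step; `done` is a mutable set that
    persists between invocations. Returns the step numbers actually run.
    """
    ran = []
    for number, name in sorted(steps, key=lambda step: int(step[0])):
        if number in done:
            continue
        execute(number, name)
        done.add(number)
        ran.append(number)
    return ran
```

Sorting by the numeric prefix keeps the order deterministic regardless of how the list is written, and a second invocation with the same `done` set performs no work.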

Pinned stack

single-source constants

PostgreSQL 18.4-1.pgdg24.04+1 · Apache AGE 1.7.0 · pgvector 0.8.2 · schema v78 · golden binary sha256-asserted · nomic-embed-text 768-dim CPU embedder. Installed natively — no Docker anywhere on the fleet.

What "reproducible" means here

Pinned artifacts (golden binary sha256 / version 0.9.0 / schema v78 / pinned pgdg .debs / pinned Ollama release) are single-source named constants in provision/lib.sh, overridable by env for forks. The seed corpus is pinned by sha256 in CORPUS_MANIFEST.json; every tunable knob is a named constant, not a magic literal. The campaign CA + per-node keys are generated once and reused on re-runs for stable trust. make validate exercises the live fleet over the real TLS+mTLS path and emits a JSON + tabular report under .local-runs/do-1461/reports/; the committed baseline artifact set is the attested Atlas Corpus baseline under deploy/do-1461/atlas/results/. Reference: deploy/do-1461/README.md.
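Pinning an artifact by sha256 comes down to hashing the file and refusing to proceed on a mismatch. The helper below is a generic sketch of that check; the function names are hypothetical, and it does not reproduce how provision/lib.sh or CORPUS_MANIFEST.json actually store their pins.

```python
import hashlib


def sha256_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the lowercase hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def assert_pinned(path, expected_sha256):
    """Raise ValueError unless the file's digest equals the pinned value."""
    actual = sha256_file(path)
    if actual != expected_sha256.lower():
        raise ValueError(
            f"{path}: sha256 {actual} does not match pin {expected_sha256}"
        )
    return actual
```

Reading in chunks keeps memory use flat for large binaries, and failing loudly on mismatch is what turns a pin from documentation into an enforced invariant.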

Verified across two independent clean-room 0→60 runs

The results on this page are not a single lucky run. The hive was stood up, proven, and torn down across two fully independent clean-room runs: terraform destroy → fresh terraform apply → full push-based provision → complete retest, with the golden-binary sha256, the corpus sha256 (CORPUS_MANIFEST.json), and every tunable knob pinned as named constants on both runs. Each round pins and fleet-asserts its own golden-binary sha256 on all 15 nodes (binary.sha256 in the verify report); both rounds reproduced the same 100% GREEN result set (119/119 verify checks).

DO platform constraint (does not affect reproducibility): on DigitalOcean a region's default VPC cannot be deleted. Teardown therefore destroys 100% of compute and substrate (every droplet, every PostgreSQL/AGE/pgvector node, every peer) while the empty regional VPC container persists and is re-used on the next apply. The container holds no compute, no data, and no trust material between runs, so re-using it does not affect the clean-room property of the 0→60 reproduction.

Step	Command	What it does
1	`make seed`	`terraform init` + `validate` (no cloud mutation)
2	`make up`	`terraform apply` → fleet; render `inventory.json` from TF state
3	`make provision`	push-based bring-up, steps `00`→`50` (incl. `46_batman`)
4	`make validate`	verification harness → machine + human report; non-zero on any FAIL
5	`make test`	full-spectrum P3 suite (regression / crypto / federation / zerotouch / a2a / ai_nhi / nsa_gaps / curator)
—	`make down`	`terraform destroy` (destructive; 5s abort window)

Fleet manifest

15 nodes, named by function.

Hostnames encode each node's function: do-1461-<function>-<region>-<NN>. Each of the three regions is a 5-node unit — three peers + one PG + one NHI agent — for 15 nodes total (9 peers + 3 PG + 3 agents). Nine peers (3 per region) form the W=2-synchronous-quorum mesh with eventual cross-region convergence; three pg nodes (one per region) each run native PostgreSQL and are never federation members; three NHI agents (one per region, an xAI grok-4.3 client) are pure mTLS clients exercising the a2a + ai_nhi test groups and are never federation members. This 15-node topology was verified identically across both independent clean-room reproducibility rounds.

Host	Role	Region	Runs
`do-1461-peer-nyc3-01..03`	peer ×3	nyc3	federated `ai-memory serve` + CPU Ollama embedder sidecar
`do-1461-pg-nyc3-01`	pg	nyc3	regional PostgreSQL 18.4 + Apache AGE 1.7.0 + pgvector 0.8.2
`do-1461-agent-nyc3-01`	agent	nyc3	NHI agent — xAI grok-4.3 mTLS client (Leg-1); exercises `a2a` + `ai_nhi`; not a federation member
`do-1461-peer-fra1-01..03`	peer ×3	fra1	federated `ai-memory serve` + CPU Ollama embedder sidecar
`do-1461-pg-fra1-01`	pg	fra1	regional PostgreSQL 18.4 + Apache AGE 1.7.0 + pgvector 0.8.2
`do-1461-agent-fra1-01`	agent	fra1	NHI agent — xAI grok-4.3 mTLS client (Leg-1); exercises `a2a` + `ai_nhi`; not a federation member
`do-1461-peer-sgp1-01..03`	peer ×3	sgp1	federated `ai-memory serve` + CPU Ollama embedder sidecar
`do-1461-pg-sgp1-01`	pg	sgp1	regional PostgreSQL 18.4 + Apache AGE 1.7.0 + pgvector 0.8.2
`do-1461-agent-sgp1-01`	agent	sgp1	NHI agent — xAI grok-4.3 mTLS client (Leg-1); exercises `a2a` + `ai_nhi`; not a federation member
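The naming scheme and the node arithmetic in the manifest can be reproduced with a short generator. The region and role names come from the table above; the function name `fleet` is hypothetical, and the real inventory is rendered from Terraform state rather than computed like this.

```python
REGIONS = ("nyc3", "fra1", "sgp1")
ROLES = (("peer", 3), ("pg", 1), ("agent", 1))  # (function, nodes per region)


def fleet(prefix="do-1461"):
    """Expand do-1461-<function>-<region>-<NN> for every node in the hive."""
    return [
        f"{prefix}-{role}-{region}-{index:02d}"
        for region in REGIONS
        for role, count in ROLES
        for index in range(1, count + 1)
    ]


def federation_members(hosts):
    """Only peers federate; pg nodes and NHI agents are never mesh members."""
    return [host for host in hosts if "-peer-" in host]
```

Expanding the three regions by five nodes each gives the 15 hostnames in the table, of which the nine `peer` entries form the quorum mesh.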

The Grand Slam.