Hundreds of nodes. Thousands of agents. Multiple racks. One coherent memory fabric.

The core fabric ships in v0.6.3: the peer mesh, W-of-N quorum writes, mTLS with a fingerprint allowlist, federated governance, pending decisions, namespace metadata, and capabilities introspection v2. The piece still maturing toward GA at this scale is the shared distributed store (Postgres + pgvector, behind the sal-postgres Cargo feature today, GA targeted for v0.7).
The diagram distinguishes shipping vs roadmap: solid edges = today, indigo dashed = v0.7 GA piece.
The big T4-relevant capability that's already wired:
- src/main.rs:405-411 — --quorum-writes N enables W-of-N replication; --quorum-peers a,b,c lists the targets.
- src/handlers.rs:442-454 — the write path calls broadcast_store_quorum then finalise_quorum. A local commit plus (W-1) peer acks within --quorum-timeout-ms (default 2000) returns 200 OK with the achieved ack count in the response body; falling short returns 503 quorum_not_met (see the curl sketch below).
- Quorum fanout also covers link, consolidate, pending_decision, namespace_meta, and the clear_namespace_meta path.
- Outbound quorum calls authenticate with --quorum-client-cert / --quorum-client-key / --quorum-ca-cert.
- ADR-0001 is realised, not aspirational. Phase 1 (scaffold) and Phase 2 (wired into the write path) shipped; Phase 3 (chaos harness) and Phase 4 (formal convergence-bound report) are still open. In production, the convergence guarantees are exercised by the a2a-gate scenario suite (CHANGELOG #325/#326/#327).
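To make the ack semantics concrete, here's a hedged smoke test. The route (/api/v1/memories) and the JSON field names are illustrative assumptions; src/handlers.rs:442-454 has the actual surface in your build.

```bash
# Hypothetical route and payload shape; only the status codes and the
# "ack count in the response" behaviour are documented above.
curl -sS -w '\nHTTP %{http_code}\n' \
  --cert client.crt --key client.key --cacert ca.crt \
  -X POST https://rack-a-2:9077/api/v1/memories \
  -H 'Content-Type: application/json' \
  -d '{"namespace": "org/eu/finance", "content": "Q3 close checklist"}'
# Quorum met:   HTTP 200 with the achieved ack count in the body
# Quorum short: HTTP 503 with error "quorum_not_met"
```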
The single-SQLite-per-node ceiling bites at T4 scale (>10⁶ memories per node, hot writers). The structural fix is a shared store.
Today (v0.6.x): sal-postgres Cargo feature flag exists; sqlx + pgvector deps wired. The Postgres adapter compiles and runs; correctness fixes shipped in v0.6.0 pre-tag (#294 SAL upsert key alignment, #295 metadata.agent_id immutability via jsonb_set, #296 tier-downgrade protection via SQL tier_rank(), #297 schema parity with 6 tables + generated scope_idx column). That work is the foundation.
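To kick the tires on the adapter today, opt in at build time and point the store at a scratch database. A sketch: the localhost URL and database name are placeholders, and the flags mirror the examples later in this post.

```bash
# sal-postgres is off by default in v0.6.x; enable it explicitly.
cargo build --release --features sal-postgres

# Placeholder connection string; any Postgres with the pgvector
# extension installed will do for a scratch run.
./target/release/ai-memory serve \
  --store-url postgres://ai-memory@localhost:5432/store_dev \
  --bind 127.0.0.1:9077 \
  --tier semantic
```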
v0.7 GA: Performance maturation, migration tool for SQLite→Postgres on existing fleets, pgvector index tuning at >10⁷ memories, official deployment recipes.
What the shared store unlocks at GA:
- Concurrent writers stop contending on per-node files (the SQLite fcntl lock issue goes away).
- One shared pgvector index per cluster, instead of every node embedding and indexing independently.
- The Postgres operational ecosystem: PITR, replicas, central monitoring (see the table below).

The topology just gets denser.

- sync_push works the same whether you have 4 nodes or 40. Full mesh costs n*(n-1)/2 edges (40 nodes means 780 gossip edges), which the operator should plan for; most T4 deployments will use a partial mesh with a few designated bridge nodes per rack.
- broadcast_namespace_meta_quorum and broadcast_pending_decision_quorum mean a tightening of policy on org/eu/finance propagates to every rack, every node. v0.6.3 capabilities introspection lets every agent and operator confirm that the policy version they're seeing matches the published version.
- Pending decisions route to per-rack approver groups (rack-a/approvers), so on-call shifts work cleanly even at fleet scale.
- rustls is already wired into the peer transport. mTLS enforcement is opt-in today; it's the obvious default at T4.
- Poll /api/v1/capabilities to confirm every node is on the same schema version and feature tier (a drift-check sketch follows the comparison table). Drift is an alerting condition, not a surprise.

| Dimension | Today (v0.6.3) | v0.7 GA |
|---|---|---|
| Total memories per cluster | bounded by per-node HNSW RAM | pgvector → 10⁸+ |
| Write durability across racks | W-of-N quorum (--quorum-writes N) | unchanged |
| Partition tolerance | quorum-bounded divergence; 503 quorum_not_met on shortfall | unchanged |
| Operational primitives | per-node SQLite, replicated; backups per node | Postgres ecosystem (PITR, replicas, central monitoring) |
| Vector index drift | each node embeds & indexes independently | shared pgvector |
| mTLS | enforced via fingerprint allowlist | unchanged |
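The capabilities drift check mentioned above can be a cron-able script. A sketch that diffs each node's /api/v1/capabilities document against the first node's; node names, ports, and cert paths are placeholders, and jq is assumed to be installed.

```bash
#!/usr/bin/env bash
# Alert when any node's capabilities document diverges from node 1's.
set -euo pipefail
nodes=(rack-a-1 rack-a-2 rack-a-3 rack-b-1 rack-b-2)
baseline=""
for n in "${nodes[@]}"; do
  caps=$(curl -sS --cert client.crt --key client.key --cacert ca.crt \
           "https://${n}:9077/api/v1/capabilities" | jq -S .)
  if [[ -z "$baseline" ]]; then
    baseline="$caps"                       # first node sets the reference
  elif [[ "$caps" != "$baseline" ]]; then
    echo "DRIFT: ${n} disagrees with ${nodes[0]}" >&2   # page on this
  fi
done
```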
A 10-rack, 100-node fleet with 3-of-5 quorum is in scope today. The v0.7 piece that's still maturing is the Postgres backbone; vector-index drift across nodes and the per-node SQLite operational footprint are the things the shared store fixes. Here's what a rack node looks like today:
```bash
# rack-a/node-2 — quorum 3-of-5 across racks a/b/c, mTLS enforced
ai-memory --db /var/lib/ai-memory/store.db serve \
  --bind 0.0.0.0:9077 \
  --tier semantic \
  --quorum-writes 3 \
  --quorum-peers https://rack-a-1:9077,https://rack-a-3:9077,https://rack-b-1:9077,https://rack-b-2:9077,https://rack-c-1:9077 \
  --quorum-timeout-ms 1500 \
  --tls-cert /etc/ai-memory/tls.crt \
  --tls-key /etc/ai-memory/tls.key \
  --mtls-allowlist /etc/ai-memory/peer-fingerprints.txt \
  --quorum-client-cert /etc/ai-memory/client.crt \
  --quorum-client-key /etc/ai-memory/client.key \
  --quorum-ca-cert /etc/ai-memory/ca.crt \
  --catchup-interval-secs 15
```
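Once the node is up, a quick verification from any machine holding an allowlisted client cert (paths as in the launch command):

```bash
# The allowlist is enforced at the TLS layer, so present a known cert.
curl -sS \
  --cert /etc/ai-memory/client.crt \
  --key /etc/ai-memory/client.key \
  --cacert /etc/ai-memory/ca.crt \
  https://rack-a-2:9077/api/v1/capabilities
# A cert whose fingerprint isn't in peer-fingerprints.txt is refused
# before any HTTP exchange happens.
```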
v0.7 GA target — same node, swap to Postgres-backed store:
```bash
ai-memory serve \
  --store-url postgres://ai-memory@pg-cluster.svc.cluster.local:5432/store \
  --bind 0.0.0.0:9077 \
  --tier semantic \
  --quorum-writes 3 \
  --quorum-peers https://rack-a-1:9077,https://rack-a-3:9077,https://rack-b-1:9077
```
The peer-fingerprints.txt file lists trusted SHA-256 fingerprints (one per line, optional : separators, # comments). Passing --mtls-allowlist is what makes mTLS enforcement mandatory: if a peer's cert isn't on the list, the connection is refused at the TLS handshake.
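For illustration, a peer-fingerprints.txt with placeholder values (pull real fingerprints with openssl x509 -noout -fingerprint -sha256 -in peer.crt):

```
# /etc/ai-memory/peer-fingerprints.txt
# One trusted SHA-256 fingerprint per line; ':' separators optional.

# rack-a-1 (placeholder value)
AB:CD:EF:01:23:45:67:89:AB:CD:EF:01:23:45:67:89:AB:CD:EF:01:23:45:67:89:AB:CD:EF:01:23:45:67:89

# rack-a-3 (same format, no separators)
abcdef0123456789abcdef0123456789abcdef0123456789abcdef0123456789
```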
A policy tightening on org/eu/finance reaches every node within seconds. Pending-action queues are visible per-rack for on-call routing. Background catch-up rides on sync_push.

Source pointers:

- src/main.rs:405-447 — quorum-write CLI surface
- src/main.rs:230 — --store-url postgres://... URL syntax
- src/main.rs:380-393 — TLS / mTLS allowlist enforcement
- src/handlers.rs:442-454, 524-540 — broadcast_store_quorum + ack counting
- src/replication.rs (422 lines) — QuorumWriter + AckTracker (functional)
- Cargo.toml — sal-postgres feature, sqlx + pgvector deps
- CHANGELOG.md — v0.6.0 pre-tag SAL fixes (#294-#297), v0.6.2 fanout bugs (#325-#327)
- docs/ADR-0001-quorum-replication.md — the realised design
- docs/ARCHITECTURAL_LIMITS.md — the honest ceiling on each dimension