Hundreds of nodes. Thousands of agents. Multiple racks. One coherent memory fabric.

The core fabric ships in v0.6.3: the peer mesh, W-of-N quorum writes, mTLS with a fingerprint allowlist, federated governance, pending decisions, namespace metadata, and capabilities introspection v2. The piece still maturing toward GA at this scale is the shared distributed store (Postgres + pgvector, behind the sal-postgres Cargo feature today, GA targeted for v0.7).
The diagram distinguishes shipping vs roadmap: solid edges = today, indigo dashed = v0.7 GA piece.
The big T4-relevant capability that's already wired:
- src/main.rs:405-411 — --quorum-writes N enables W-of-N replication; --quorum-peers a,b,c lists the targets.
- src/handlers.rs:442-454 — the write path calls broadcast_store_quorum then finalise_quorum. A local commit plus (W-1) peer acks within --quorum-timeout-ms (default 2000) returns 200 OK with the achieved ack count in the response body; falling short returns 503 quorum_not_met (see the curl sketch below).
- Quorum fanout also covers link, consolidate, pending_decision, namespace_meta, and the clear_namespace_meta path.
- Outbound quorum calls authenticate with --quorum-client-cert / --quorum-client-key / --quorum-ca-cert.
- ADR-0001 is realised, not aspirational. Phase 1 (scaffold) and Phase 2 (wired into the write path) shipped; Phase 3 (chaos harness) and Phase 4 (formal convergence-bound report) are still open. In production, the convergence guarantees are exercised by the a2a-gate scenario suite (CHANGELOG #325/#326/#327).
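To make the ack semantics concrete, here's a hedged smoke test. The route (/api/v1/memories) and the JSON field names are illustrative assumptions; src/handlers.rs:442-454 has the actual surface in your build.

```bash
# Hypothetical route and payload shape; only the status codes and the
# "ack count in the response" behaviour are documented above.
curl -sS -w '\nHTTP %{http_code}\n' \
  --cert client.crt --key client.key --cacert ca.crt \
  -X POST https://rack-a-2:9077/api/v1/memories \
  -H 'Content-Type: application/json' \
  -d '{"namespace": "org/eu/finance", "content": "Q3 close checklist"}'
# Quorum met:   HTTP 200 with the achieved ack count in the body
# Quorum short: HTTP 503 with error "quorum_not_met"
```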
The single-SQLite-per-node ceiling bites at T4 scale (>10⁶ memories per node, hot writers). The structural fix is a shared store.
Today (v0.6.x): sal-postgres Cargo feature flag exists; sqlx + pgvector deps wired. The Postgres adapter compiles and runs; correctness fixes shipped in v0.6.0 pre-tag (#294 SAL upsert key alignment, #295 metadata.agent_id immutability via jsonb_set, #296 tier-downgrade protection via SQL tier_rank(), #297 schema parity with 6 tables + generated scope_idx column). That work is the foundation.
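To kick the tires on the adapter today, opt in at build time and point the store at a scratch database. A sketch: the localhost URL and database name are placeholders, and the flags mirror the examples later in this post.

```bash
# sal-postgres is off by default in v0.6.x; enable it explicitly.
cargo build --release --features sal-postgres

# Placeholder connection string; any Postgres with the pgvector
# extension installed will do for a scratch run.
./target/release/ai-memory serve \
  --store-url postgres://ai-memory@localhost:5432/store_dev \
  --bind 127.0.0.1:9077 \
  --tier semantic
```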
v0.7 GA: Performance maturation, migration tool for SQLite→Postgres on existing fleets, pgvector index tuning at >10⁷ memories, official deployment recipes.
What the shared store unlocks at GA:
- Concurrent writers stop contending on per-node files (the SQLite fcntl lock issue goes away).
- One shared pgvector index per cluster, instead of every node embedding and indexing independently.
- The Postgres operational ecosystem: PITR, replicas, central monitoring (see the table below).

The topology just gets denser.

- sync_push works the same whether you have 4 nodes or 40. Full mesh costs n*(n-1)/2 edges (40 nodes means 780 gossip edges), which the operator should plan for; most T4 deployments will use a partial mesh with a few designated bridge nodes per rack.
- broadcast_namespace_meta_quorum and broadcast_pending_decision_quorum mean a tightening of policy on org/eu/finance propagates to every rack, every node. v0.6.3 capabilities introspection lets every agent and operator confirm that the policy version they're seeing matches the published version.
- Pending decisions route to per-rack approver groups (rack-a/approvers), so on-call shifts work cleanly even at fleet scale.
- rustls is already wired into the peer transport. mTLS enforcement is opt-in today; it's the obvious default at T4.
- Poll /api/v1/capabilities to confirm every node is on the same schema version and feature tier (a drift-check sketch follows the comparison table). Drift is an alerting condition, not a surprise.

| Dimension | Today (v0.6.3) | v0.7 GA |
|---|---|---|
| Total memories per cluster | bounded by per-node HNSW RAM | pgvector → 10⁸+ |
| Write durability across racks | W-of-N quorum (--quorum-writes N) | unchanged |
| Partition tolerance | quorum-bounded divergence; 503 quorum_not_met on shortfall | unchanged |
| Operational primitives | per-node SQLite, replicated; backups per node | Postgres ecosystem (PITR, replicas, central monitoring) |
| Vector index drift | each node embeds & indexes independently | shared pgvector |
| mTLS | enforced via fingerprint allowlist | unchanged |
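The capabilities drift check mentioned above can be a cron-able script. A sketch that diffs each node's /api/v1/capabilities document against the first node's; node names, ports, and cert paths are placeholders, and jq is assumed to be installed.

```bash
#!/usr/bin/env bash
# Alert when any node's capabilities document diverges from node 1's.
set -euo pipefail
nodes=(rack-a-1 rack-a-2 rack-a-3 rack-b-1 rack-b-2)
baseline=""
for n in "${nodes[@]}"; do
  caps=$(curl -sS --cert client.crt --key client.key --cacert ca.crt \
           "https://${n}:9077/api/v1/capabilities" | jq -S .)
  if [[ -z "$baseline" ]]; then
    baseline="$caps"                       # first node sets the reference
  elif [[ "$caps" != "$baseline" ]]; then
    echo "DRIFT: ${n} disagrees with ${nodes[0]}" >&2   # page on this
  fi
done
```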
A 10-rack, 100-node fleet with 3-of-5 quorum is in scope today. The v0.7 piece that's still maturing is the Postgres backbone; vector-index drift across nodes and the per-node SQLite operational footprint are the things the shared store fixes. Here's what a rack node looks like today:
```bash
# rack-a/node-2 — quorum 3-of-5 across racks a/b/c, mTLS enforced
ai-memory --db /var/lib/ai-memory/store.db serve \
  --bind 0.0.0.0:9077 \
  --tier semantic \
  --quorum-writes 3 \
  --quorum-peers https://rack-a-1:9077,https://rack-a-3:9077,https://rack-b-1:9077,https://rack-b-2:9077,https://rack-c-1:9077 \
  --quorum-timeout-ms 1500 \
  --tls-cert /etc/ai-memory/tls.crt \
  --tls-key /etc/ai-memory/tls.key \
  --mtls-allowlist /etc/ai-memory/peer-fingerprints.txt \
  --quorum-client-cert /etc/ai-memory/client.crt \
  --quorum-client-key /etc/ai-memory/client.key \
  --quorum-ca-cert /etc/ai-memory/ca.crt \
  --catchup-interval-secs 15
```
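Once the node is up, a quick verification from any machine holding an allowlisted client cert (paths as in the launch command):

```bash
# The allowlist is enforced at the TLS layer, so present a known cert.
curl -sS \
  --cert /etc/ai-memory/client.crt \
  --key /etc/ai-memory/client.key \
  --cacert /etc/ai-memory/ca.crt \
  https://rack-a-2:9077/api/v1/capabilities
# A cert whose fingerprint isn't in peer-fingerprints.txt is refused
# before any HTTP exchange happens.
```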
v0.7 GA target — same node, swap to Postgres-backed store:
```bash
ai-memory serve \
  --store-url postgres://ai-memory@pg-cluster.svc.cluster.local:5432/store \
  --bind 0.0.0.0:9077 \
  --tier semantic \
  --quorum-writes 3 \
  --quorum-peers https://rack-a-1:9077,https://rack-a-3:9077,https://rack-b-1:9077
```
The peer-fingerprints.txt file lists trusted SHA-256 fingerprints (one per line, optional : separators, # comments). Passing --mtls-allowlist is what makes mTLS enforcement mandatory: if a peer's cert isn't on the list, the connection is refused at the TLS handshake.
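For illustration, a peer-fingerprints.txt with placeholder values (pull real fingerprints with openssl x509 -noout -fingerprint -sha256 -in peer.crt):

```
# /etc/ai-memory/peer-fingerprints.txt
# One trusted SHA-256 fingerprint per line; ':' separators optional.

# rack-a-1 (placeholder value)
AB:CD:EF:01:23:45:67:89:AB:CD:EF:01:23:45:67:89:AB:CD:EF:01:23:45:67:89:AB:CD:EF:01:23:45:67:89

# rack-a-3 (same format, no separators)
abcdef0123456789abcdef0123456789abcdef0123456789abcdef0123456789
```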
A policy tightening on org/eu/finance reaches every node within seconds. Pending-action queues are visible per-rack for on-call routing. Background catch-up rides on sync_push.

Source pointers:

- src/main.rs:405-447 — quorum-write CLI surface
- src/main.rs:230 — --store-url postgres://... URL syntax
- src/main.rs:380-393 — TLS / mTLS allowlist enforcement
- src/handlers.rs:442-454, 524-540 — broadcast_store_quorum + ack counting
- src/replication.rs (422 lines) — QuorumWriter + AckTracker (functional)
- Cargo.toml — sal-postgres feature, sqlx + pgvector deps
- CHANGELOG.md — v0.6.0 pre-tag SAL fixes (#294-#297), v0.6.2 fanout bugs (#325-#327)
- docs/ADR-0001-quorum-replication.md — the realised design
- docs/ARCHITECTURAL_LIMITS.md — the honest ceiling on each dimension