Twenty agents. Four nodes. One coherent knowledge mesh — with W-of-N quorum writes shipping today. Each node runs its own ai-memory + SQLite; writes, links, governance, and pending decisions fan out via sync_push with vector-clock causality.
This is the first distributed tier. Each node runs an ai-memory process backed by its own SQLite. Writes, links, archives, namespace metadata, and governance decisions fan out to every peer over POST /api/v1/sync/push. Vector-clock causality (sync/since) lets a peer that drops off the network catch up cleanly. The recall pipeline stays local on each node — fast hot reads, no cross-node round trips.
The big upgrade in current builds: the quorum-write contract is wired. Set --quorum-writes N --quorum-peers a,b,c and every HTTP write fans out to every peer, returning OK only after the local commit plus W-1 peer acks (W being the --quorum-writes value) land within --quorum-timeout-ms. Falling short returns HTTP 503 quorum_not_met. This is real durability, not best-effort.
- src/daemon_runtime.rs (ServeArgs): --quorum-writes N --quorum-peers <list> is wired into the write path. The handlers call broadcast_store_quorum then finalise_quorum to count peer acks; 503 quorum_not_met returns when the deadline (--quorum-timeout-ms, default 2000) lapses without W-1 acks. ADR-0001 is realised, not aspirational. Same wiring applies to broadcast_link_quorum, broadcast_consolidate_quorum, broadcast_pending_decision_quorum, broadcast_namespace_meta_quorum.
- sync_push fanout for everything: src/federation/sync.rs ships 10 broadcast functions (store, delete, archive, restore, link, consolidate, pending action, pending decision, namespace metadata, namespace metadata clear). Each is invoked from the corresponding handler in src/handlers/ after the local commit lands.
- sync_state (last-seen / last-pulled). A peer that drops off polls GET /api/v1/sync/since?peer=node-a&since=<rfc3339-ts> and gets only the rows updated after that point. The serve daemon also runs a --catchup-interval-secs loop (default 30s) that pulls peers proactively for any updates missed during partition.
- --tls-cert + --tls-key switches serve to TLS; adding --mtls-allowlist <path> enforces client-cert mTLS where every connection's peer must present a cert whose fingerprint is on the allowlist. Quorum POSTs use --quorum-client-cert / --quorum-client-key / --quorum-ca-cert for the outbound side.
- broadcast_namespace_meta_quorum (src/federation/sync.rs) propagates GovernancePolicy changes to every peer with quorum semantics. A new strict policy on agents/secops set on node-A is enforced on all 4 nodes within seconds, with the same W-of-N durability as memory writes.
- broadcast_pending_decision_quorum (src/federation/sync.rs) means an approve/reject on one node turns into a committed (or audited) state on every peer with quorum-bounded latency.

This is real, durable, multi-node consistency for a knowledge mesh.
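
The ack-counting rule described above can be sketched in a few lines of Python. This is an illustration of the contract only, not the code in src/federation/: the names `QuorumOutcome` and `finalise_quorum_sketch`, the representation of peer acks as arrival times, and the choice to count an ack landing exactly on the deadline are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QuorumOutcome:
    ok: bool                     # True when the write may return OK
    acks: int                    # peer acks that arrived before the deadline
    error: Optional[str] = None  # "quorum_not_met" maps to HTTP 503


def finalise_quorum_sketch(w, ack_times_ms, timeout_ms=2000):
    """Decide a write's outcome from peer ack arrival times (hypothetical helper).

    w            -- the --quorum-writes value (0 disables quorum)
    ack_times_ms -- per peer, ms after the local commit when its ack arrived;
                    None means the peer never answered
    timeout_ms   -- the --quorum-timeout-ms value
    """
    if w == 0:
        # Quorum disabled: the local commit alone is enough.
        return QuorumOutcome(ok=True, acks=0)
    in_time = [t for t in ack_times_ms if t is not None and t <= timeout_ms]
    needed = w - 1  # the local commit counts as the first of W
    if len(in_time) >= needed:
        return QuorumOutcome(ok=True, acks=len(in_time))
    return QuorumOutcome(ok=False, acks=len(in_time), error="quorum_not_met")
```

With three peers of which one acks at 150 ms, one never answers, and one acks at 2500 ms, `finalise_quorum_sketch(2, [150, None, 2500])` succeeds, while the same acks with `w=3` fall short and report `quorum_not_met`.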
- signature column on memory_links (reserved v0.6.3 schema v15) is POPULATED with Ed25519 per-link attestation, attest_level + signed_at landed, and #626 Layer-3 verifies caller-presented store-surface signatures. Still permissive by default: unsigned writes land attest_level = "claimed" unless AI_MEMORY_REQUIRE_AGENT_ATTESTATION=1, and the federation receive path trusts mTLS + peer allowlist rather than per-row signatures; both are tracked for v0.8 hardening under #1464.
- --features sal-postgres. Apache AGE Cypher graph backend auto-detected when pg_extension installed. SAL trait abstracts sqlite vs postgres+AGE; ai-memory serve --store-url postgres://… selects the postgres path.
- DELETE /api/v1/links works today.

If your fleet needs strong cross-region governance consensus, that's v1.0+. Everything else listed is shipping.
1. a1 on node-A calls memory_store.
2. Allow → continue, Pending → queue, Deny → reject.
3. INSERT succeeds. WAL fsynced. Vector clock bumped.
4. broadcast_store_quorum spawns one HTTP POST /api/v1/sync/push per configured peer (B, C, D). With --quorum-writes 0 the local response returns immediately; with quorum enabled it returns only after W-1 peer acks (or 503 quorum_not_met at the deadline).

Node-C drops off the cluster for 4 hours. When it comes back, its supervisor calls:
curl 'http://node-a:9077/api/v1/sync/since?peer=node-c&since=2026-06-09T04:00:00Z'
Node-A returns the delta — every row updated after that timestamp. Node-C applies them in causal order, bumps its own sync_state, and re-enters the steady-state mesh.
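
To make the delta rule concrete, here is a minimal Python sketch of the filter a peer could apply when answering sync/since. The row shape (`id`, `updated_at`) and the helper names are hypothetical, and sorting by timestamp is only a stand-in for the causal ordering the real pipeline derives from vector clocks.

```python
from datetime import datetime


def parse_rfc3339(ts):
    # datetime.fromisoformat only accepts a trailing "Z" on Python 3.11+,
    # so normalise it to an explicit UTC offset first.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def delta_since(rows, since):
    """Return the rows updated strictly after `since`, oldest first."""
    cutoff = parse_rfc3339(since)
    fresh = [r for r in rows if parse_rfc3339(r["updated_at"]) > cutoff]
    return sorted(fresh, key=lambda r: parse_rfc3339(r["updated_at"]))


rows = [
    {"id": "m1", "updated_at": "2026-06-09T03:59:59Z"},  # before the cutoff
    {"id": "m2", "updated_at": "2026-06-09T06:30:00Z"},
    {"id": "m3", "updated_at": "2026-06-09T04:15:00Z"},
]
delta = delta_since(rows, "2026-06-09T04:00:00Z")
print([r["id"] for r in delta])  # ['m3', 'm2']
```

Only the rows touched during the outage travel over the wire, which is what keeps a four-hour catch-up cheap compared with a full resync.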
# On node-A: tighten policy on a sensitive namespace
# (the standard memory carries the policy; bind it by id)
ai-memory namespace set-standard --namespace org/legal/contracts \
--id <standard-memory-uuid> \
--governance '{"write":"approve","delete":"approve"}'
Within seconds:
1. namespace_meta row updated on node-A.
2. broadcast_namespace_meta_quorum fires POST /api/v1/sync/push to B, C, D.
3. org/legal/contracts now hits the local governance gate, which queues the write as Pending.
4. memory_pending_approve. The decision broadcasts via broadcast_pending_decision_quorum. All four nodes commit (or audit-reject) the queued write coherently.

That's the honest version of "federated governance": eventually consistent, but coherent.
All real flags, all in v0.7.0:
# node-a: W=2, so a write needs the local commit plus 1 of 3 peer acks within 2s
ai-memory --db /var/lib/ai-memory/store.db serve \
--host 0.0.0.0 --port 9077 \
--quorum-writes 2 \
--quorum-peers https://node-b:9077,https://node-c:9077,https://node-d:9077 \
--quorum-timeout-ms 2000 \
--tls-cert /etc/ai-memory/tls.crt \
--tls-key /etc/ai-memory/tls.key \
--mtls-allowlist /etc/ai-memory/peer-fingerprints.txt \
--quorum-client-cert /etc/ai-memory/client.crt \
--quorum-client-key /etc/ai-memory/client.key \
--quorum-ca-cert /etc/ai-memory/ca.crt \
--catchup-interval-secs 30
The peer-fingerprints.txt file is a newline-delimited list of SHA-256 fingerprints (with or without : separators; comments start with #). serve refuses any peer whose cert fingerprint is not on the list — that's the peer-mesh identity gate.
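
The allowlist format is simple enough to sketch a parser for. The Python below is an illustration under stated assumptions, not the loader that serve uses: the function names are hypothetical, only whole-line comments are handled, malformed lines raise an error (the real behaviour is not documented here), and the fingerprint is assumed to be the SHA-256 digest of the DER-encoded certificate.

```python
import hashlib


def normalise_fingerprint(text):
    """Lower-case hex with any ':' separators removed."""
    return text.replace(":", "").strip().lower()


def parse_allowlist(contents):
    """Parse allowlist text into a set of normalised SHA-256 fingerprints."""
    allowed = set()
    for line in contents.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fp = normalise_fingerprint(line)
        if len(fp) != 64 or any(c not in "0123456789abcdef" for c in fp):
            raise ValueError(f"not a SHA-256 fingerprint: {line!r}")
        allowed.add(fp)
    return allowed


def is_peer_allowed(cert_der, allowed):
    """Check a DER-encoded certificate against the parsed allowlist."""
    return hashlib.sha256(cert_der).hexdigest() in allowed


# Demo with stand-in bytes instead of a real certificate.
cert = b"example-der-bytes"
plain = hashlib.sha256(cert).hexdigest()
colon_form = ":".join(plain[i:i + 2] for i in range(0, 64, 2)).upper()
allowed = parse_allowlist(f"# node-b\n{colon_form}\n\n")
```

Normalising both forms to bare lower-case hex before comparison is what lets operators paste fingerprints straight from tools that print the colon-separated, upper-case form.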
For long-running pull-based reconciliation, run a sync-daemon alongside serve:
ai-memory sync-daemon \
--peers https://node-b:9077,https://node-c:9077,https://node-d:9077 \
--interval 2 \
--client-cert /etc/ai-memory/client.crt \
--client-key /etc/ai-memory/client.key
The postgres-backed substrate pins an exact, tested matrix. These are the single source of truth in deploy/docker-1461/provision/lib.sh and are asserted at bring-up — the validate harness refuses to certify a stack whose probed versions drift from the pins below.
| Component | Canonical version | SSOT pin |
|---|---|---|
| PostgreSQL | 18.4 | PG_APT_VERSION=18.4-1.pgdg13+1 · EXPECTED_PG_VERSION=18.4 |
| Apache AGE | 1.7.0 | AGE_BASE_IMAGE=apache/age:release_PG18_1.7.0 · EXPECTED_AGE_VERSION=1.7.0 |
| pgvector (server extension) | 0.8.2 | PGVECTOR_APT_VERSION=0.8.2-1.pgdg13+1 |
| pgvector (Rust binding crate) | 0.4 | Cargo.toml → pgvector = "0.4" |
| ai-memory postgres schema | v57 | schema parity with SQLite at v57 (v0.7.0; CURRENT_SCHEMA_VERSION = 57 in both adapters; Track C validated at v55 — the v56/v57 arms are additive index/column DDL) |
The bundled stacked image at deploy/docker-1461/Dockerfile.pg-age-vector (ARG AGE_BASE_IMAGE=apache/age:release_PG18_1.7.0, ARG PG_MAJOR=18) layers pgvector 0.8.2 onto the AGE base so K8s / ECS / Cloud Run operators don't build AGE from source. Alternate tested matrix: infra/lan-parity-test/ legitimately runs PG 16 + AGE 1.6.0 + pgvector 0.8.2 as a second tested combination; the recommended install targets the PG 18.4 / AGE 1.7.0 matrix above.
config.toml: the full option surface.

Every deployment reads a single schema-versioned file at ~/.config/ai-memory/config.toml. The precedence ladder is uniform: CLI flag > AI_MEMORY_* env > [section] > legacy flat field > compiled default. The complete field-by-field reference (with types, defaults, and resolver semantics) lives in docs/CONFIG_SCHEMA.md.
schema_version = 2
tier = "autonomous"
db = "/var/lib/ai-memory/store.db"
# Postgres connection-pool + query bounds.
postgres_pool_max_connections = 16 # env: AI_MEMORY_PG_POOL_MAX
postgres_pool_min_connections = 2 # env: AI_MEMORY_PG_POOL_MIN
postgres_acquire_timeout_secs = 30 # env: AI_MEMORY_PG_ACQUIRE_TIMEOUT_SECS
postgres_statement_timeout_secs = 30 # 0 disables
request_timeout_secs = 60 # axum slowloris guard
llm_call_timeout_secs = 30
[llm] # backend / model / base_url / api_key_env|api_key_file
[embeddings] # backend / url / model / backfill_batch
[reranker] # enabled / model
[storage] # default_namespace / archive_on_gc / archive_max_days / max_memory_mb
[limits] # max_memories_per_day / max_storage_bytes / max_links_per_day / max_page_size
[identity] # anonymize_default
[audit] # enabled / path / hash_chain / attestation_cadence_minutes / retention_days
# + [audit.compliance.{soc2,hipaa,gdpr,fedramp}]
[transcripts] # default_ttl_secs / archive_grace_secs / max_decompressed_bytes
# + [transcripts.namespaces."<pattern>"]
[hooks] # [hooks.subscription] hmac_secret (secret; chmod 600)
[subscriptions] # allow_loopback_webhooks (default false — SSRF guard)
[verify] # require_nonce (link-verify replay protection)
[agents] # [agents.defaults.recall_scope] namespaces / since / tier / limit
[governance] # require_operator_pubkey (fail-closed rule enforcement)
[confidence] # shadow_retention_days
[admin] # agent_ids = [...] (default-closed admin allowlist)
[mcp] # profile
[permissions] # mode = "enforce" | "advisory" | "off"
All sections are default-safe — an absent block selects the compiled default and preserves existing behaviour. Inline api_key = "<literal>" is rejected at parse time; reference secrets via api_key_env / api_key_file (mode 0400) only.
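
The precedence ladder (CLI flag > AI_MEMORY_* env > [section] > legacy flat field > compiled default) amounts to a first-match lookup. The sketch below shows that shape in Python; `resolve`, the dict-per-rung representation, and the option name `some_option` are hypothetical, and the real resolver semantics are the ones documented in docs/CONFIG_SCHEMA.md.

```python
_MISSING = object()  # sentinel so that falsy values such as 0 still count as set


def resolve(name, cli, env, section, legacy, default):
    """Return (value, source) from the highest rung that defines `name`.

    Each of cli / env / section / legacy is a dict holding the options
    that were explicitly set at that level.
    """
    ladder = [
        ("cli", cli),
        ("env", env),
        ("section", section),
        ("legacy", legacy),
    ]
    for source, values in ladder:
        value = values.get(name, _MISSING)
        if value is not _MISSING:
            return value, source
    return default, "default"


# The env rung overrides the [section] value; the CLI rung is empty here.
value, source = resolve(
    "some_option",
    cli={},
    env={"some_option": 32},
    section={"some_option": 16},
    legacy={},
    default=8,
)
print(value, source)  # 32 env
```

Returning the source alongside the value is a design choice for the example: it makes it easy to log which rung won when a deployment behaves unexpectedly.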
memory_agent_register) get the same scope visibility on every node. Auto-tagging on node-A produces tags that are stored as memory metadata and replicated to all peers, so every node sees the same enriched view.

| Dimension | T3 ceiling | When it bites |
|---|---|---|
| Cluster size | ~10 nodes before fanout latency dominates | Beyond that, walk to T4 |
| Concurrent writes per node | ~10–20 (T2 ceiling, multiplied) | Each node is independently bottlenecked by its mutex |
| Write durability | W-of-N quorum-writes when --quorum-writes >= 1; local-WAL-only when 0 | Choose your contract per deployment |
| Cross-node consistency | Eventual; vector clocks resolve order; ties break last-writer-wins | Use memory_detect_contradiction to surface drift |
| Vector index | Per-node, independent | Each node holds its own HNSW; embedding cost paid N times |
| TLS | mTLS supported, not enforced by default | Enforce before joining real peers |
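
The cross-node consistency row can be illustrated with a small Python sketch of vector-clock comparison plus a last-writer-wins fallback. It reads "ties" as concurrent edits, which is an interpretation rather than something stated above; the function names and the row shape (`clock`, `updated_at`) are hypothetical.

```python
def compare_clocks(a, b):
    """Compare two vector clocks given as {node_id: counter} dicts.

    Returns "before", "after", "equal", or "concurrent".
    """
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


def pick_winner(x, y):
    """Choose between two versions of a row."""
    order = compare_clocks(x["clock"], y["clock"])
    if order == "before":
        return y
    if order in ("after", "equal"):
        return x
    # Concurrent edits: fall back to last-writer-wins on the timestamp.
    # Same-format UTC strings compare correctly as plain strings.
    return x if x["updated_at"] >= y["updated_at"] else y


a = {"clock": {"node-a": 2, "node-b": 1}, "updated_at": "2026-06-09T04:00:05Z"}
b = {"clock": {"node-a": 1, "node-b": 2}, "updated_at": "2026-06-09T04:00:07Z"}
print(compare_clocks(a["clock"], b["clock"]))  # concurrent
print(pick_winner(a, b) is b)                  # True
```

Last-writer-wins silently discards the losing edit, which is why the table points to memory_detect_contradiction for surfacing drift.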
- src/daemon_runtime.rs (ServeArgs): quorum-write CLI flags (--quorum-writes, --quorum-peers, --quorum-timeout-ms, --quorum-client-cert/-key/-ca-cert, --catchup-interval-secs)
- src/daemon_runtime.rs (ServeArgs): TLS / mTLS allowlist flags (--tls-cert, --tls-key, --mtls-allowlist)
- src/federation/quorum.rs + handlers: broadcast_store_quorum + finalise_quorum ack-counting in the write path
- src/federation/sync.rs: 10 broadcast functions (broadcast_store_quorum, broadcast_delete_quorum, broadcast_archive_quorum, broadcast_restore_quorum, broadcast_link_quorum, broadcast_consolidate_quorum, broadcast_pending_quorum, broadcast_pending_decision_quorum, broadcast_namespace_meta_quorum, broadcast_namespace_meta_clear_quorum)
- src/replication.rs: QuorumPolicy + AckTracker implementation
- src/handlers/: POST /api/v1/sync/push ingress, GET /api/v1/sync/since causal catch-up
- docs/ADR-0001-quorum-replication.md: the realised design
- CHANGELOG.md: #325 link fanout, #326 consolidate fanout, #327 embedder readiness