SHIPS TODAY · v0.7.0

Tier 3 — Multi-node cluster

Twenty agents. Four nodes. One coherent knowledge mesh — with W-of-N quorum writes shipping today. Each node runs its own ai-memory + SQLite; writes, links, governance, and pending decisions fan out via sync_push with vector-clock causality.

4 nodes · 20 agents · W-of-N quorum · mTLS allowlist · federated governance

This is the first distributed tier. Each node runs an ai-memory process backed by its own SQLite. Writes, links, archives, namespace metadata, and governance decisions fan out to every peer over POST /api/v1/sync/push. Vector-clock causality (sync/since) lets a peer that drops off the network catch up cleanly. The recall pipeline stays local on each node — fast hot reads, no cross-node round trips.
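
To make "vector-clock causality" concrete, here is the textbook comparison a peer can use to decide whether one update happened before another or whether the two are concurrent. This is a generic sketch, not ai-memory's implementation: the dict-of-counters representation and the names `compare_clocks` and `merge_clocks` are assumptions made for illustration.

```python
from typing import Dict

Clock = Dict[str, int]  # node id -> counter (assumed representation)


def compare_clocks(a: Clock, b: Clock) -> str:
    """Return 'before', 'after', 'equal', or 'concurrent' for clock a vs b."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"  # neither dominates: a tie-break rule is needed


def merge_clocks(a: Clock, b: Clock) -> Clock:
    """Element-wise maximum, taken after applying a remote update."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}
```

Two updates that compare as "concurrent" are the case where a tie-break such as last-writer-wins has to step in.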

The big upgrade in current builds: the quorum-write contract is wired. Set --quorum-writes W --quorum-peers a,b,c and every HTTP write fans out to every peer, returning OK only after the local commit plus W-1 peer acks land within --quorum-timeout-ms. Falling short returns HTTP 503 quorum_not_met. This is real durability, not best-effort.
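
The arithmetic behind that contract is small enough to check by hand. The sketch below works out how many peer acks a given --quorum-writes value demands and how many peers can be down before writes start returning 503. The helper `quorum_plan` is hypothetical, and raising on an unsatisfiable quorum is this sketch's choice, not documented CLI behaviour.

```python
def quorum_plan(quorum_writes: int, peer_count: int) -> dict:
    """Summarise what a --quorum-writes value implies for a node with peer_count peers."""
    if quorum_writes < 0:
        raise ValueError("quorum_writes must be >= 0")
    if quorum_writes == 0:
        # Quorum disabled: the write returns after the local commit alone.
        return {"peer_acks_required": 0, "peer_failures_tolerated": peer_count}
    required = quorum_writes - 1  # the local commit counts as one of W
    if required > peer_count:
        raise ValueError("not enough peers to ever satisfy the quorum")
    return {
        "peer_acks_required": required,
        "peer_failures_tolerated": peer_count - required,
    }
```

For the four-node mesh, `quorum_plan(2, 3)` needs one peer ack and tolerates two unreachable peers; `quorum_plan(4, 3)` needs all three and tolerates none.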

Architecture diagram

4 nodes · 5 agents each · sync_push mesh.

[Diagram: T3 · 4 nodes · 5 agents each · sync_push mesh. node-a (10.0.1.10:9077), node-b (10.0.1.11:9077), node-c (10.0.1.12:9077), and node-d (10.0.1.13:9077) each run SQLite + HNSW behind a policy gate and host 5 namespace-isolated agents (a1 to a5); all four are shown at vector clock 7421. A 12-edge full peer mesh carries sync_push fanout (memories · links · namespace_meta · pending decisions) with vector-clock causality (sync/since). Federated governance rides on namespace_meta fanout and pending_decision sync. Each node has its own governance gate and serves recall locally.]
Each node holds its own SQLite + HNSW. Writes commit locally then fan out to peers. Recalls stay local. Governance policies, namespace metadata, and pending-action decisions propagate the same way memories do.

What ships today

The mesh is real. Source-cited.

This is real, durable, multi-node consistency for a knowledge mesh.

What's still roadmap

Honest about the remaining gaps.

If your fleet needs strong cross-region governance consensus, that's v1.0+. Everything else listed is shipping.

Walkthrough

What's actually happening.

Write path on node-A

  1. Agent a1 on node-A calls memory_store.
  2. Local governance gate runs against the federated policy. Allow → continue, Pending → queue, Deny → reject.
  3. Local INSERT succeeds. WAL fsynced. Vector clock bumped.
  4. broadcast_store_quorum spawns one HTTP POST /api/v1/sync/push per configured peer (B, C, D). With --quorum-writes 0 the local response returns immediately; with quorum enabled it returns only after W-1 peer acks (or 503 quorum_not_met at the deadline).
  5. Each peer receives the push, validates, applies, and bumps its own SyncState entry for node-A.
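
Step 4 is the interesting one, so here is a minimal simulation of it: fan out to every peer in parallel, count acks, and give up at the deadline. The peers are plain Python callables standing in for the HTTP push, and `quorum_write` and `simulated_peer` are hypothetical names; this shows the shape of the contract, not the body of broadcast_store_quorum.

```python
import concurrent.futures as cf
import time
from typing import Callable, Dict


def simulated_peer(delay_s: float, ok: bool = True) -> Callable[[], bool]:
    """Build a stand-in peer that acks (or refuses) after delay_s seconds."""
    def push() -> bool:
        time.sleep(delay_s)
        return ok
    return push


def quorum_write(push_fns: Dict[str, Callable[[], bool]],
                 quorum_writes: int, timeout_s: float) -> int:
    """Return 200 once W-1 peers ack, or 503 if the deadline passes first.

    The local commit is assumed to have already succeeded.
    """
    needed = max(quorum_writes - 1, 0)
    pool = cf.ThreadPoolExecutor(max_workers=max(len(push_fns), 1))
    futures = [pool.submit(fn) for fn in push_fns.values()]
    status = 200 if needed == 0 else 503
    acks = 0
    try:
        if needed > 0:
            for fut in cf.as_completed(futures, timeout=timeout_s):
                if fut.exception() is None and fut.result():
                    acks += 1
                if acks >= needed:
                    status = 200
                    break
    except cf.TimeoutError:
        status = 503  # quorum_not_met at the deadline
    finally:
        # Do not block on stragglers; their pushes finish in the background.
        pool.shutdown(wait=False)
    return status
```

With one fast peer and two slow ones, `quorum_write(peers, 2, 0.2)` returns 200 as soon as the fast peer acks; the slow pushes still complete later, which is what keeps the mesh converging.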

A peer rejoins after a partition

Node-C drops off the cluster for 4 hours. When it comes back, its supervisor calls:

curl 'http://node-a:9077/api/v1/sync/since?peer=node-c&since=2026-06-09T04:00:00Z'

Node-A returns the delta — every row updated after that timestamp. Node-C applies them in causal order, bumps its own sync_state, and re-enters the steady-state mesh.
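
The catch-up can be pictured as two small functions: one that selects the delta on the serving side, and one that applies it on the rejoining side. The row shape (`id`, `updated_at`) and the function names are assumptions, and timestamp ordering stands in here for the causal ordering the real sync path uses.

```python
from datetime import datetime
from typing import Dict, List


def parse_ts(ts: str) -> datetime:
    """Parse an RFC 3339 UTC timestamp such as 2026-06-09T04:00:00Z."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def delta_since(rows: List[dict], since: str) -> List[dict]:
    """Rows updated strictly after `since`, oldest first."""
    cutoff = parse_ts(since)
    fresh = [r for r in rows if parse_ts(r["updated_at"]) > cutoff]
    return sorted(fresh, key=lambda r: parse_ts(r["updated_at"]))


def apply_delta(store: Dict[str, dict], delta: List[dict]) -> int:
    """Upsert rows by id, keeping the newer copy; returns rows applied."""
    applied = 0
    for row in delta:
        current = store.get(row["id"])
        if current is None or parse_ts(row["updated_at"]) > parse_ts(current["updated_at"]):
            store[row["id"]] = row
            applied += 1
    return applied
```

Keeping the newer copy on conflict makes the apply step idempotent, so replaying the same delta twice is harmless.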

Federated governance — a real example

# On node-A: tighten policy on a sensitive namespace
# (the standard memory carries the policy; bind it by id)
ai-memory namespace set-standard --namespace org/legal/contracts \
  --id <standard-memory-uuid> \
  --governance '{"write":"approve","delete":"approve"}'

Within seconds:

That's the honest version of "federated governance" — eventually consistent, but coherent.
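
One way to picture how a synced policy turns into the Allow / Pending / Deny outcomes of the write path: each node evaluates the same JSON locally. The mapping below is an assumption for illustration only. Treating "approve" as "queue for approval" follows the example above, while the handling of unlisted actions and unrecognised rule values is this sketch's guess, not documented behaviour.

```python
import json


def gate_decision(governance_json: str, action: str) -> str:
    """Map a namespace governance policy and an action to allow/pending/deny."""
    policy = json.loads(governance_json)
    rule = policy.get(action)
    if rule is None:
        return "allow"    # assumption: an unlisted action is unrestricted
    if rule == "approve":
        return "pending"  # queued until an approver signs off
    return "deny"         # assumption: unrecognised rules fail closed
```

Because every node runs the same check against the same propagated policy, a write to org/legal/contracts lands in the pending queue regardless of which node the agent is attached to.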

Deployment recipe

Quorum writes + mTLS peer mesh.

All real flags, all in v0.7.0:

# node-a: --quorum-writes 2 means the local commit plus 1 peer ack (W-1) from the 3 peers, within 2s
ai-memory --db /var/lib/ai-memory/store.db serve \
  --host 0.0.0.0 --port 9077 \
  --quorum-writes 2 \
  --quorum-peers https://node-b:9077,https://node-c:9077,https://node-d:9077 \
  --quorum-timeout-ms 2000 \
  --tls-cert /etc/ai-memory/tls.crt \
  --tls-key /etc/ai-memory/tls.key \
  --mtls-allowlist /etc/ai-memory/peer-fingerprints.txt \
  --quorum-client-cert /etc/ai-memory/client.crt \
  --quorum-client-key /etc/ai-memory/client.key \
  --quorum-ca-cert /etc/ai-memory/ca.crt \
  --catchup-interval-secs 30

The peer-fingerprints.txt file is a newline-delimited list of SHA-256 fingerprints (with or without : separators; comments start with #). serve refuses any peer whose cert fingerprint is not on the list — that's the peer-mesh identity gate.
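
For a feel of what that file format implies, here is a small parser and check. It is a sketch, not the code serve runs: the function names are hypothetical, it assumes fingerprints are the SHA-256 of the DER-encoded certificate, and it handles only whole-line # comments.

```python
import hashlib
from typing import Set


def normalize_fingerprint(text: str) -> str:
    """Lowercase hex with ':' separators removed; raises on malformed input."""
    hexed = text.strip().replace(":", "").lower()
    if len(hexed) != 64 or any(c not in "0123456789abcdef" for c in hexed):
        raise ValueError(f"not a SHA-256 fingerprint: {text!r}")
    return hexed


def load_allowlist(body: str) -> Set[str]:
    """Parse newline-delimited fingerprints, skipping blanks and # comments."""
    allowed = set()
    for line in body.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        allowed.add(normalize_fingerprint(line))
    return allowed


def peer_allowed(cert_der: bytes, allowed: Set[str]) -> bool:
    """A peer passes only if the SHA-256 of its DER certificate is listed."""
    return hashlib.sha256(cert_der).hexdigest() in allowed
```

Normalising both sides to lowercase hex without separators is what lets `AB:CD:...` and `abcd...` entries match the same certificate.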

For long-running pull-based reconciliation, run a sync-daemon alongside serve:

ai-memory sync-daemon \
  --peers https://node-b:9077,https://node-c:9077,https://node-d:9077 \
  --interval 2 \
  --client-cert /etc/ai-memory/client.crt \
  --client-key /etc/ai-memory/client.key

Substrate

Pinned Enterprise Federated component versions.

The Postgres-backed substrate pins an exact, tested matrix. The pins live in deploy/docker-1461/provision/lib.sh, the single source of truth, and are asserted at bring-up: the validate harness refuses to certify a stack whose probed versions drift from the pins below.

Component | Canonical version | SSOT pin
PostgreSQL | 18.4 | PG_APT_VERSION=18.4-1.pgdg13+1 · EXPECTED_PG_VERSION=18.4
Apache AGE | 1.7.0 | AGE_BASE_IMAGE=apache/age:release_PG18_1.7.0 · EXPECTED_AGE_VERSION=1.7.0
pgvector (server extension) | 0.8.2 | PGVECTOR_APT_VERSION=0.8.2-1.pgdg13+1
pgvector (Rust binding crate) | 0.4 | Cargo.toml: pgvector = "0.4"
ai-memory postgres schema | v57 | schema parity with SQLite at v57 (v0.7.0; CURRENT_SCHEMA_VERSION = 57 in both adapters; Track C validated at v55; the v56/v57 arms are additive index/column DDL)

The bundled stacked image at deploy/docker-1461/Dockerfile.pg-age-vector (ARG AGE_BASE_IMAGE=apache/age:release_PG18_1.7.0, ARG PG_MAJOR=18) layers pgvector 0.8.2 onto the AGE base so K8s / ECS / Cloud Run operators don't build AGE from source. Alternate tested matrix: infra/lan-parity-test/ legitimately runs PG 16 + AGE 1.6.0 + pgvector 0.8.2 as a second tested combination; the recommended install targets the PG 18.4 / AGE 1.7.0 matrix above.
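
A drift check of the kind the validate harness performs can be approximated in a few lines. This is a hypothetical stand-in, not the harness itself: the component keys, the exact-match comparison, and the name `version_drift` are assumptions, and only the version numbers come from the pins listed above.

```python
from typing import Dict, List

# Pins copied from the table above.
EXPECTED = {
    "postgresql": "18.4",
    "age": "1.7.0",
    "pgvector": "0.8.2",
}


def version_drift(probed: Dict[str, str],
                  expected: Dict[str, str] = EXPECTED) -> List[str]:
    """Return one message per component whose probed version differs from its pin."""
    problems = []
    for component, pin in expected.items():
        found = probed.get(component)
        if found != pin:
            problems.append(f"{component}: expected {pin}, probed {found or 'nothing'}")
    return problems
```

An empty list means the stack matches the recommended matrix; the alternate PG 16 + AGE 1.6.0 combination would need its own `expected` dict.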

Configuration

Federated config.toml — the full option surface.

Every deployment reads a single schema-versioned file at ~/.config/ai-memory/config.toml. The precedence ladder is uniform: CLI flag > AI_MEMORY_* env > [section] > legacy flat field > compiled default. The complete field-by-field reference (with types, defaults, and resolver semantics) lives in docs/CONFIG_SCHEMA.md.
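
The precedence ladder reads naturally as "first layer that is set wins". The resolver below is a minimal sketch of that rule, not the project's resolver: `resolve_setting` is a hypothetical name, `None` is assumed to mean "not set", and no type coercion is attempted on the env string.

```python
from typing import Any, Optional, Tuple


def resolve_setting(cli: Optional[Any], env: Optional[str],
                    section: Optional[Any], legacy_flat: Optional[Any],
                    compiled_default: Any) -> Tuple[Any, str]:
    """Return (value, source) from the highest-precedence layer that is set."""
    layers = [
        ("cli", cli),             # CLI flag
        ("env", env),             # AI_MEMORY_* environment variable
        ("section", section),     # [section] table in config.toml
        ("legacy", legacy_flat),  # legacy flat field
    ]
    for source, value in layers:
        if value is not None:
            return value, source
    return compiled_default, "default"
```

For example, with no CLI flag, an env value of "32", and a file value of 16, `resolve_setting(None, "32", 16, None, 8)` picks the env value; returning the source alongside the value makes it easy to log why a setting ended up where it did.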

schema_version = 2
tier = "autonomous"
db   = "/var/lib/ai-memory/store.db"

# Postgres connection-pool + query bounds.
postgres_pool_max_connections   = 16    # env: AI_MEMORY_PG_POOL_MAX
postgres_pool_min_connections   = 2     # env: AI_MEMORY_PG_POOL_MIN
postgres_acquire_timeout_secs   = 30    # env: AI_MEMORY_PG_ACQUIRE_TIMEOUT_SECS
postgres_statement_timeout_secs = 30    # 0 disables
request_timeout_secs  = 60              # axum slowloris guard
llm_call_timeout_secs = 30

[llm]            # backend / model / base_url / api_key_env|api_key_file
[embeddings]     # backend / url / model / backfill_batch
[reranker]       # enabled / model
[storage]        # default_namespace / archive_on_gc / archive_max_days / max_memory_mb
[limits]         # max_memories_per_day / max_storage_bytes / max_links_per_day / max_page_size
[identity]       # anonymize_default
[audit]          # enabled / path / hash_chain / attestation_cadence_minutes / retention_days
                 #   + [audit.compliance.{soc2,hipaa,gdpr,fedramp}]
[transcripts]    # default_ttl_secs / archive_grace_secs / max_decompressed_bytes
                 #   + [transcripts.namespaces."<pattern>"]
[hooks]          # [hooks.subscription] hmac_secret  (secret; chmod 600)
[subscriptions]  # allow_loopback_webhooks  (default false — SSRF guard)
[verify]         # require_nonce  (link-verify replay protection)
[agents]         # [agents.defaults.recall_scope] namespaces / since / tier / limit
[governance]     # require_operator_pubkey  (fail-closed rule enforcement)
[confidence]     # shadow_retention_days
[admin]          # agent_ids = [...]  (default-closed admin allowlist)
[mcp]            # profile
[permissions]    # mode = "enforce" | "advisory" | "off"

All sections are default-safe — an absent block selects the compiled default and preserves existing behaviour. Inline api_key = "<literal>" is rejected at parse time; reference secrets via api_key_env / api_key_file (mode 0400) only.
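
The inline-secret rule is easy to lint for in your own tooling before a config ever reaches a node. The sketch below walks an already-parsed config (a nested dict, however you loaded it) and reports every literal api_key it finds. It is an illustrative pre-flight check with a hypothetical name, separate from the parse-time rejection ai-memory performs.

```python
from typing import Any, List, Mapping


def find_inline_secrets(config: Mapping[str, Any], path: str = "") -> List[str]:
    """Return dotted paths of every literal api_key found in a parsed config."""
    hits = []
    for key, value in config.items():
        here = f"{path}.{key}" if path else key
        if key == "api_key":
            hits.append(here)  # literal secret: should be api_key_env / api_key_file
        elif isinstance(value, Mapping):
            hits.extend(find_inline_secrets(value, here))
    return hits
```

A config that only uses api_key_env or api_key_file yields an empty list; anything else names the offending table so it can be fixed before deployment.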

Wiring

Governance, skills, and attestations at T3.

Limits

Honest ceilings.

Dimension | T3 ceiling | When it bites
Cluster size | ~10 nodes before fanout latency dominates | Beyond that, walk to T4
Concurrent writes per node | ~10–20 (T2 ceiling, multiplied) | Each node is independently bottlenecked by its mutex
Write durability | W-of-N quorum-writes when --quorum-writes >= 1; local-WAL-only when 0 | Choose your contract per deployment
Cross-node consistency | Eventual; vector clocks resolve order; ties break last-writer-wins | Use memory_detect_contradiction to surface drift
Vector index | Per-node, independent | Each node holds its own HNSW; embedding cost paid N times
TLS | mTLS supported, not enforced by default | Enforce before joining real peers

Source

Source-of-truth references.