ai-memory · Enterprise Reference Architecture — CPU + Memory + GPU Federated Nodes

Local Ollama embeddings on GPU-equipped nodes: zero per-token cost and no embedding egress. This is the #1598 reference shape for enterprise federated deployments where every memory-bearing node has a compatible GPU. Operator GPU policy: Ollama runs only on GPU-equipped nodes. The "Embeddings Reachability (#1598)" section of ai-memory doctor raises a GPU-policy WARN when backend = ollama resolves on a host with no detectable NVIDIA GPU. In this architecture every node satisfies the policy by construction.
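The policy rule above reduces to a two-input check. The sketch below is illustrative only: how ai-memory doctor actually detects a GPU is not stated on this page, so the `nvidia-smi -L` probe and both function names are assumptions.

```python
import shutil
import subprocess


def nvidia_gpu_present() -> bool:
    """Best-effort probe (assumption): True if nvidia-smi lists at least one GPU."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return False
    try:
        out = subprocess.run(
            [exe, "-L"], capture_output=True, text=True, timeout=5, check=False
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return out.returncode == 0 and "GPU " in out.stdout


def gpu_policy_status(backend: str, gpu_present: bool) -> str:
    """Return "WARN" when Ollama is selected on a GPU-less host, else "OK"."""
    if backend == "ollama" and not gpu_present:
        return "WARN"
    return "OK"
```

Calling `gpu_policy_status("ollama", nvidia_gpu_present())` on a correctly provisioned node of this architecture should yield "OK"; a CPU-only node scheduled by mistake would yield "WARN".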

The CPU-only sibling of this page is Enterprise Reference Architecture: CPU + Memory Federated Nodes (API embeddings, no Ollama anywhere). The visual catalog of all deployment topologies is docs/reference-architectures.md; capacity / cost / SLA planning is docs/enterprise-deployment.md.

Topology

NHI agent: mTLS client → :9077
GPU embed: localhost /api/embed (loopback)
Store: SQLite / Postgres + AGE
Federation: W-of-N quorum · mTLS + Ed25519 · Batman maximum-secure

DB schema: v78
MCP tools (full / core): 101 / 7
CLI subcommands (default / sal): 87 / 89
HTTP routes / paths: 92 / 78

When to choose this. Your fleet nodes have GPUs (or you are sizing new nodes and embedding volume justifies them). You want the lowest embedding latency, zero per-token embedding cost, and no embedding traffic leaving the node — at the price of GPU hardware, Ollama as a per-node dependency, and model weights in every node image.

Per-node configuration

Every memory-bearing node in this fleet is the same shape — one ai-memory serve daemon, one node-local GPU Ollama embedder, one store. The spec below is the per-node contract; the canonical config.toml and the verified AI_MEMORY_* override battery follow.

Node role

Memory-bearing peer

One ai-memory serve daemon per node, autonomous tier (keyword → semantic → smart → autonomous reranking). Identical across the fleet — every node both serves recall and is a federation quorum member.

Hardware / sizing

CPU + RAM + GPU

CPU cores for FTS5 / HNSW / cross-encoder rerank; RAM for the in-memory HNSW index + the 256 MiB sqlite mmap default. NVIDIA GPU mandatory (operator GPU policy) to host Ollama — ai-memory doctor WARNs on a GPU-less node running backend = ollama.

Embedder

Local GPU Ollama

backend = ollama, nomic-embed-text (768d, in KNOWN_EMBEDDING_DIMS) over the loopback /api/embed wire — ~5–30 ms, $0 per token, no embedding egress. The compiled-default backend.

Chat LLM

Decoupled (#1067)

Independent of the embedder: point [llm] at a cloud vendor (openrouter / x-ai/grok-4.3 reference) or at the same local Ollama for a full airgap. Any tier can speak to any provider.

Store

SQLite or Postgres + AGE

Single-node: SQLite WAL + FTS5. Shared fleet: PostgreSQL + Apache AGE + pgvector via ai-memory serve --store-url postgres://… (the SAL trait path). Schema v78 in lockstep across all peers.

Federation & security

W-of-N · Batman

W-of-N synchronous quorum, vector-clock merge, periodic catch-up pull. mTLS + fingerprint allowlist, per-message Ed25519 (X-Memory-Sig + X-Memory-Nonce), peer enrollment fail-closed by default at v0.8.0 (#1789), permissions = enforce.
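As a rough illustration of the W-of-N rule in the federation row above, the sketch below counts peer acknowledgements and commits only when at least W arrive. The `QuorumResult` type, the `quorum_write` name, and the question of whether the local node counts toward W are all assumptions; docs/federation.md is the authority on the real behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuorumResult:
    acks: int      # peers that acknowledged the write
    required: int  # W

    @property
    def committed(self) -> bool:
        return self.acks >= self.required


def quorum_write(peer_acks: dict[str, bool], w: int) -> QuorumResult:
    """Count acknowledgements; the write commits only if at least W of N peers acked."""
    n = len(peer_acks)
    if not 1 <= w <= n:
        raise ValueError(f"W must satisfy 1 <= W <= N, got W={w}, N={n}")
    return QuorumResult(acks=sum(peer_acks.values()), required=w)
```

With three peers and W = 2, one unreachable peer still lets the write commit; with W = 3 the same outage blocks it.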

Key `AI_MEMORY_*` overrides

The env battery a node sets explicitly (each resolves through the uniform ladder CLI flag > AI_MEMORY_* env > config.toml > compiled default):

Knob	Value	Effect
`AI_MEMORY_EMBED_BACKEND`	`ollama`	Node-local GPU embeddings via native `/api/embed` (no auth)
`AI_MEMORY_EMBED_BASE_URL`	`http://localhost:11434`	Loopback Ollama endpoint — keeps embedding traffic on-node
`AI_MEMORY_EMBED_MODEL`	`nomic-embed-text`	768-dim, resolved from `KNOWN_EMBEDDING_DIMS`
`AI_MEMORY_LLM_BACKEND`	`openrouter`	Chat LLM, decoupled from the embedder (#1067)
`AI_MEMORY_LLM_MODEL`	`x-ai/grok-4.3`	Vendor model id, passed verbatim to the chat endpoint
`AI_MEMORY_PERMISSIONS_MODE`	`enforce`	K3/K9 governance gate (secure default)
`AI_MEMORY_REQUIRE_AGENT_ATTESTATION`	`1`	Fail-closed: unsigned writes rejected `403`
`AI_MEMORY_FED_REQUIRE_SIG`	`1`	Reject unsigned `/sync/push` with `401`
`AI_MEMORY_FED_REQUIRE_NONCE`	`1`	Per-message replay guard (`X-Memory-Nonce`)
`AI_MEMORY_FED_REQUIRE_PEER_ENROLLMENT`	`1`	Unenrolled peer → `401`; secure default ON at v0.8.0 (#1789)
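The precedence ladder these overrides resolve through (CLI flag > AI_MEMORY_* env > config.toml > compiled default) can be sketched as a first-match lookup. This is a minimal sketch, not the daemon's resolver: the `resolve` helper, the flat key names, and the key-to-env-name mapping are hypothetical, and it treats any present env var (even an empty one) as set.

```python
import os
from typing import Mapping, Optional


def resolve(
    key: str,
    cli_flags: Mapping[str, str],
    config: Mapping[str, str],
    default: str,
    env: Optional[Mapping[str, str]] = None,
) -> tuple[str, str]:
    """Resolve one setting; returns (value, source) for the first layer that sets it."""
    env = os.environ if env is None else env
    if key in cli_flags:
        return cli_flags[key], "cli"
    env_name = "AI_MEMORY_" + key.upper()  # assumed naming convention
    if env_name in env:
        return env[env_name], "env"
    if key in config:
        return config[key], "config"
    return default, "default"
```

For example, with `AI_MEMORY_EMBED_BACKEND=ollama` in the environment and a different backend in config.toml, `resolve("embed_backend", {}, config, "ollama")` reports the env layer as the source.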

Canonical `config.toml`

schema_version = 2
tier = "autonomous"

[llm]
# Chat LLM is independent of the embedder (#1067): point it at a
# cloud vendor, or at the same local Ollama for full airgap.
backend     = "openrouter"
model       = "x-ai/grok-4.3"
api_key_env = "OPENROUTER_API_KEY"

[embeddings]
backend = "ollama"                    # local GPU Ollama — the compiled
                                      # default backend
url     = "http://localhost:11434"    # synonym of base_url; this is the
                                      # ollama default, shown explicit
model   = "nomic-embed-text"          # 768d, Apache 2.0, USA (Nomic);
                                      # in KNOWN_EMBEDDING_DIMS
backfill_batch = 100
# No api_key_* — Ollama's native /api/embed wire shape is unauthenticated
# and loopback-only in this shape.

[reranker]
enabled = true
model   = "ms-marco-MiniLM-L-6-v2"

[storage]
default_namespace = "fleet"
archive_on_gc     = true
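The [embeddings] block above is only safe without an API key because the Ollama endpoint is loopback-only. A deployment script could assert that before starting the daemon; the helper below is a hypothetical sketch of such a check, not part of ai-memory.

```python
import ipaddress
from urllib.parse import urlsplit


def is_loopback_url(url: str) -> bool:
    """True if the URL's host is localhost or a loopback IP literal."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # A DNS name other than localhost cannot be proven loopback here.
        return False
```

The configured `http://localhost:11434` passes; a LAN address such as `http://10.0.0.5:11434` would not, which signals that embedding traffic leaves the node.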

Backfill requests to Ollama send truncate: true (#1595) so over-length inputs are truncated server-side instead of failing the batch; per-row fallback + skip-with-WARN applies on this backend the same as on API backends.
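The backfill behavior described above (batch request with truncate: true, then per-row fallback and skip-with-WARN) might look roughly like the following. This is a sketch under assumptions: `embed_payload`, `backfill`, and the injected `embed` callable are illustrative names, and the real retry logic lives inside ai-memory. The payload fields (`model`, `input`, `truncate`) follow Ollama's /api/embed request shape.

```python
import json
import logging
from typing import Callable, Sequence

log = logging.getLogger("backfill")

# An embedder takes a list of texts and returns one vector per text.
EmbedFn = Callable[[Sequence[str]], list[list[float]]]


def embed_payload(model: str, inputs: Sequence[str]) -> bytes:
    """JSON body for POST /api/embed with server-side truncation enabled."""
    return json.dumps({"model": model, "input": list(inputs), "truncate": True}).encode()


def backfill(rows: Sequence[tuple[int, str]], embed: EmbedFn, batch_size: int = 100):
    """Embed rows in batches; on batch failure retry row by row, skipping failures."""
    done: dict[int, list[float]] = {}
    skipped: list[int] = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        try:
            vectors = embed([text for _, text in batch])
            done.update({rid: vec for (rid, _), vec in zip(batch, vectors)})
        except Exception:
            # Batch failed: fall back to one request per row.
            for rid, text in batch:
                try:
                    done[rid] = embed([text])[0]
                except Exception as exc:
                    log.warning("skipping row %s: %s", rid, exc)
                    skipped.append(rid)
    return done, skipped
```

Injecting `embed` keeps the sketch testable without a running Ollama; a real caller would wrap an HTTP POST of `embed_payload(...)` to the loopback endpoint.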

Federation skeleton

Identical to the CPU-only sibling; the embedder leg is the only difference. Per docs/federation.md, the transport layer uses TLS with an mTLS fingerprint allowlist, the application layer uses --api-key, and the identity layer uses per-message Ed25519 signed sync (X-Memory-Sig + X-Memory-Nonce, secure by default) with peer enrollment fail-closed by default at v0.8.0 (#1789). Replication combines W-of-N quorum writes, vector-clock merge, and a periodic catch-up pull.
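For readers unfamiliar with vector-clock merge, the textbook form is an element-wise maximum, with a pairwise comparison to detect concurrent writes. The sketch below shows that generic technique only; the clock representation (a dict of node id to counter) and both function names are assumptions, not ai-memory's wire format.

```python
def merge(a: dict[str, int], b: dict[str, int]) -> dict[str, int]:
    """Element-wise maximum of two vector clocks; missing entries count as 0."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}


def compare(a: dict[str, int], b: dict[str, int]) -> str:
    """Classify a relative to b: "equal", "before", "after", or "concurrent"."""
    nodes = a.keys() | b.keys()
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"
```

A "concurrent" result is the case a merge policy has to resolve, since neither write causally precedes the other.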

Embedding-dim consistency is fleet-critical here too: every peer must run the same embedding model and dimension. Mixed fleets (some GPU nodes on local nomic-embed-text, some CPU nodes on a 3072-dim API model) are NOT a supported shape. Pick one architecture per federation, or align the API nodes on the same 768-dim model the GPU nodes serve. After any model change, migrate each node by running ai-memory reembed --dry-run first, then ai-memory reembed.
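A fleet-wide pre-flight check for this rule could compare each peer's reported (model, dim) pair. The sketch below is hypothetical: how a peer reports its embedder is not covered here, and treating the most common pair as the reference is an assumption made for illustration.

```python
from collections import Counter


def mismatched_peers(peers: dict[str, tuple[str, int]]) -> list[str]:
    """Return peers whose (model, dim) differs from the most common pair in the fleet."""
    if not peers:
        return []
    reference, _count = Counter(peers.values()).most_common(1)[0]
    return sorted(name for name, shape in peers.items() if shape != reference)
```

An empty result means the fleet is uniform; any listed peer needs realignment and a re-embed before it joins the federation.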

When to choose which architecture

Dimension	CPU + Memory (API embeddings)	CPU + Memory + GPU (local Ollama)
Node hardware	Commodity VMs / containers; no accelerator	GPU on every memory-bearing node
Embedding backend	Any #1067 alias (`openrouter` reference) or self-hosted TEI / vLLM / llama.cpp server (`openai-compatible`)	Local Ollama, native `/api/embed`
Ollama on nodes	None anywhere	Required, GPU-backed (operator GPU policy)
Embed latency p50	~80–300 ms (API hop)	~5–30 ms (localhost GPU)
Marginal embed cost	Per-token API spend (e.g. ~$0.20/M on the gemini-embedding-2 reference)	$0 after hardware
Embedding egress	Cloud shape: yes (paid no-training routes); airgapped shape: LAN-only	None (loopback)
Reference model	`google/gemini-embedding-2` (3072d) cloud; `nomic-embed-text-v1.5` (768d) airgapped	`nomic-embed-text` (768d)
Re-embed on adoption	Cloud shape: yes (768d → 3072d, `ai-memory reembed`); airgapped nomic shape: none	None (same default model/dim)
Node image size	Small (no weights)	+ Ollama + model weights
Failure mode	API outage → loud keyword-mode degradation (#1593), truthful capabilities (#1594)	Local Ollama down → same fail-closed degradation, but failure domain is per-node
Doctor GPU-policy WARN	Never fires (`backend != ollama`)	Never fires (GPU present); fires on a mis-scheduled CPU-only node
Choose when	Fleet is CPU-only; elastic / containerized; embedding volume modest or bursty	GPUs already present; highest embed volume; hard data-locality requirements with no self-host serving tier

Hybrid note: a self-hosted TEI/vLLM serving node with a GPU, fronting CPU-only ai-memory nodes via backend = "openai-compatible", is the CPU + Memory architecture (Shape B) with GPU-accelerated serving — it keeps the fleet nodes Ollama-free and complies with the GPU policy at the serving tier.
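The marginal-cost row in the table invites a break-even estimate: how many embedded tokens it takes for API spend to match a one-time GPU outlay. The helper below is a back-of-the-envelope sketch; the function name is hypothetical, and it deliberately ignores power, operations, and depreciation.

```python
def breakeven_tokens(gpu_cost_usd: float, api_price_per_million_usd: float) -> float:
    """Tokens at which cumulative API embedding spend equals the GPU hardware cost."""
    if api_price_per_million_usd <= 0:
        raise ValueError("API price per million tokens must be positive")
    return gpu_cost_usd / api_price_per_million_usd * 1_000_000
```

Using the table's ~$0.20/M reference price and an assumed (not sourced) $2,000 per-node GPU cost, `breakeven_tokens(2000, 0.20)` comes to roughly 10 billion tokens per node.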
