Reference Architecture

Enterprise Reference Architecture: CPU + Memory Federated Nodes

API embeddings on commodity CPU + memory nodes — no GPU, no Ollama, full semantic/autonomous tier. The #1598 reference shape for enterprise federated ai-memory deployments.

API embeddings on commodity CPU nodes — no GPU, no Ollama, full semantic/autonomous tier. This is the #1598 reference shape for enterprise federated deployments where the fleet is CPU + memory only (cloud VMs, droplets, containers, on-prem servers without accelerators). The embedder is wired to an embedding API — either a cloud vendor (the operator-selected reference is OpenRouter + google/gemini-embedding-2) or a self-hosted OpenAI-compatible serving stack for airgapped sites — so no node runs Ollama anywhere.

The GPU sibling of this page is Enterprise Reference Architecture: CPU + Memory + GPU Federated Nodes (local Ollama embeddings on GPU-equipped nodes). A when-to-choose-which table lives on that page. The visual catalog of all deployment topologies is docs/reference-architectures.md; capacity / cost / SLA planning is docs/enterprise-deployment.md.

Topology

                      federated mesh (N nodes, W-of-N quorum)
   ┌────────────────────────────────────────────────────────────────────┐
   │                                                                    │
   │   node-a (CPU+RAM)          node-b (CPU+RAM)      node-c (CPU+RAM) │
   │   ┌────────────────┐  mTLS  ┌────────────────┐    ┌──────────────┐ │
   │   │ ai-memory serve│◀──────▶│ ai-memory serve│◀──▶│ ai-memory    │ │
   │   │  sqlite/pg+AGE │ signed │  sqlite/pg+AGE │    │ serve        │ │
   │   └───────┬────────┘  sync  └───────┬────────┘    └──────┬───────┘ │
   │           │ HTTPS (Bearer)          │                    │         │
   └───────────┼────────────────────────┼────────────────────┼─────────┘
               ▼                        ▼                    ▼
        ┌──────────────────────────────────────────────────────────┐
        │  Embedding API — ONE of:                                  │
        │   • cloud: openrouter → google/gemini-embedding-2 (3072d) │
        │   • airgapped: self-hosted TEI / vLLM / llama.cpp server  │
        │     exposing OpenAI-compatible /v1/embeddings             │
        └──────────────────────────────────────────────────────────┘

   embeddings hop p50: ~80–300 ms (API)    federation: W=2 quorum, mTLS,
                                           X-Memory-Sig + X-Memory-Nonce

When to choose this. Your fleet is commodity CPU nodes (cloud VMs, containers, droplets) and you want full semantic recall + autonomous-tier features without provisioning GPUs. You accept an 80–300 ms embedding hop (per store / per recall query embed) in exchange for zero accelerator cost and a fleet image with no model weights. Operator policy: Ollama runs only on GPU-equipped nodes — on a CPU-only fleet the ai-memory doctor "Embeddings Reachability (#1598)" section fires a GPU-policy WARN if backend = ollama is resolved, which this architecture never triggers.

Shape A — cloud embeddings (OpenRouter reference)

Operator-selected reference cloud model: google/gemini-embedding-2 via OpenRouter — 3072-dim, 8192-token context, ~$0.20/M tokens, Google AI Studio provider, USA. Policy constraints baked into this reference: USA models only, paid-tier only (no :free routes — paid routing avoids free-tier providers with data-retention/training terms).

/etc/ai-memory/config.toml (per node, identical across the fleet):

schema_version = 2
tier = "autonomous"

[llm]
backend     = "openrouter"            # chat LLM — any #1067 alias works
model       = "x-ai/grok-4.3"         # USA-vendor policy example
api_key_env = "OPENROUTER_API_KEY"

[embeddings]
backend     = "openrouter"            # #1598 — API embeddings
# base_url omitted — the openrouter alias pre-fills
# https://openrouter.ai/api/v1
model       = "google/gemini-embedding-2"   # native 3072-dim
dim         = 768                     # #1598 fleet follow-up: explicit dim is
                                      # sent as the wire `dimensions` param —
                                      # gemini-embedding-2 truncates server-side
                                      # (Matryoshka). REQUIRED posture for
                                      # pgvector-backed federated fleets: ANN
                                      # indexes cap at 2000 dims and the fleet
                                      # schemas template vector(768). sqlite-only
                                      # single nodes may omit dim (native 3072).
api_key_file = "/etc/ai-memory/keys/embed.key"   # mode 0400 enforced;
                                                 # or api_key_env
backfill_batch = 100

[reranker]
enabled = true                        # cross-encoder runs on-CPU
model   = "ms-marco-MiniLM-L-6-v2"

[storage]
default_namespace = "fleet"
archive_on_gc     = true

Shape B — self-hosted / airgapped embeddings

For airgapped sites, serve an Apache-2.0 USA model behind an OpenAI-compatible /v1/embeddings endpoint on a shared CPU inference node (or per-rack):

Model Dims License Note
nomic-embed-text-v1.5 768 Apache 2.0 Zero re-embed for existing corpora — it is the compiled default model, so vectors are unchanged.
Snowflake arctic-embed l / m / s 1024 / 768 / 384 Apache 2.0 Pick by recall-quality vs. CPU budget.
IBM granite-embedding 768 / 384 Apache 2.0 In KNOWN_EMBEDDING_DIMS (#1598).

Serving stacks exposing OpenAI-compatible /v1/embeddings: HuggingFace text-embeddings-inference (Apache 2.0), vLLM (Apache 2.0), llama.cpp server (MIT).

[embeddings]
backend  = "openai-compatible"        # generic escape hatch — base_url
base_url = "http://tei.internal:8080/v1"   # REQUIRED for this backend
model    = "nomic-embed-text-v1.5"    # 768d — zero re-embed migration
# keyless internal endpoint: omit api_key_env / api_key_file entirely

Env-var equivalents (highest precedence; useful for container fleets): AI_MEMORY_EMBED_BACKEND, AI_MEMORY_EMBED_BASE_URL, AI_MEMORY_EMBED_MODEL, AI_MEMORY_EMBED_API_KEY (secret).

Federation skeleton

Identical to the standard hardened mesh — this architecture changes only the embedder leg. Per docs/federation.md:

Embedding-dim consistency is fleet-critical. Every federated peer MUST resolve the same embedding model/dim (this architecture pins model = "google/gemini-embedding-2" + dim = 768 fleet-wide so pgvector vector(768) schemas and their HNSW indexes — capped at 2000 dims by pgvector — keep serving ANN recall) — vectors from different models do not compare. Roll out a model change fleet-wide: (1) update [embeddings] on every node, (2) run ai-memory reembed --dry-run then ai-memory reembed per node, (3) verify with ai-memory doctor (the "Embeddings Reachability (#1598)" section reports the resolved backend/model/provenance).

Sizing notes

Security posture

See also