Autonomous-tier intelligence.

Auto-tagging. Auto-consolidation. Query expansion. Contradiction detection. Memory reflection. Six features that turn ai-memory from a store into an agent: local-first by default (Gemma via Ollama) and, as of v0.7.0 (#1067 + #1146), provider-agnostic by config. Route any of these through xAI Grok, OpenAI, Anthropic, Gemini, DeepSeek, Kimi, Qwen, Mistral, Groq, Together, Cerebras, OpenRouter, Fireworks, LMStudio, vLLM, or any other OpenAI-compatible endpoint via an [llm] section in ~/.config/ai-memory/config.toml. Local stays the default for the privacy story below; cloud unlocks the CPU-only and cellphone postures. Canonical schema (single source of truth, every surface): CONFIG_SCHEMA.md. Per-vendor recipes (config.toml + MCP env-block override): integrations/llm-backends.md.

Ollama local (default) · 15+ cloud vendors via #1067 · CPU-only mode · no telemetry
▸ Thanks

Powered by Gemma 4 — open-sourced by Google.

Every autonomous feature on this page is made possible by Google's Gemma 4 family, released under an open weights license. Gemma 4 Effective 2B (~1 GB Q4) and Gemma 4 Effective 4B (~2.3 GB Q4) are the two models ai-memory targets — small enough to run locally on a laptop, capable enough to drive real agent reasoning. Thank you to the Gemma team and to Google for choosing to ship these models open. ai-memory is materially better because of it, and the entire local-first agent ecosystem stands on this contribution.

Gemma 4 E2B
~1 GB Q4 · Smart tier · Google · ai.google.dev/gemma
Gemma 4 E4B
~2.3 GB Q4 · Autonomous tier · Google · ai.google.dev/gemma
Ollama
Local LLM serving · MIT · ollama.com
nomic-embed-text
Embeddings · Apache 2.0 · Nomic AI · nomic.ai
MiniLM-L6-v2
Embeddings · Apache 2.0 · Hugging Face · model card
cross-encoder/ms-marco
Reranker · Apache 2.0 · Hugging Face · model card

See the credits page for the full open-source acknowledgement and license enumeration.

The four feature tiers

Pick the model size your hardware allows.

Autonomous features unlock as the operator allocates more memory to the daemon. Keyword tier needs zero extra RAM, just FTS5. Semantic tier loads embeddings (~256 MB). Smart tier adds Gemma 4 E2B (~1 GB). Autonomous tier upgrades to Gemma 4 E4B + cross-encoder reranker (~4 GB total). The RAM figures below apply to the local-model path (Ollama serving Gemma + a local embedder). When you point [llm] + [embeddings] at a remote API (see No GPU required below), the daemon host carries no LLM or embedding weights; the only local model is the ~90 MB CPU cross-encoder, and only if reranking is enabled.

Keyword
RAM: 0 MB extra · always available
FTS5 keyword search only. The lowest-overhead option — runs on a Raspberry Pi.
▸ keyword_search × semantic_search × auto_tag × consolidate × contradictions
Semantic
RAM: ~256 MB · MiniLM-L6-v2
Adds 384-dim embeddings + HNSW vector index. Hybrid 70/30 recall (FTS5 + semantic).
▸ keyword_search ▸ semantic_search ▸ hybrid_recall × auto_tag × consolidate
Smart
RAM: ~1 GB · nomic-embed + Gemma 4 E2B
Adds 768-dim embeddings + Google's Gemma 4 E2B for reasoning. Unlocks the LLM-driven features below.
▸ all of Semantic ▸ auto_tag ▸ consolidate ▸ expand_query ▸ contradiction
Autonomous
RAM: ~4 GB · nomic-embed + Gemma 4 E4B + cross-encoder
Top tier — Google's Gemma 4 E4B for stronger reasoning, plus cross-encoder reranking for top-k recall precision. The full agent intelligence stack.
▸ all of Smart ▸ cross_encoder_reranking ▸ memory_reflect (SHIPPED at v0.7.0 — recursive learning with reflection_depth cap) ▸ session_start boot recall
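The tier ladder above amounts to a capacity check. A minimal sketch in Python, assuming the approximate RAM figures quoted above can be treated as thresholds; the function name and the cut-off values are illustrative only, since the daemon takes its tier from config rather than probing memory.

```python
# Approximate extra RAM per tier, taken from the figures quoted above.
# Treating them as hard thresholds is an assumption made for illustration.
TIER_RAM_MB = [
    ("autonomous", 4096),  # nomic-embed + Gemma 4 E4B + cross-encoder
    ("smart", 1024),       # nomic-embed + Gemma 4 E2B
    ("semantic", 256),     # MiniLM-L6-v2 embeddings
    ("keyword", 0),        # FTS5 only
]


def highest_tier_for(available_mb: int) -> str:
    """Return the highest tier whose approximate RAM need fits the budget."""
    for tier, need_mb in TIER_RAM_MB:
        if available_mb >= need_mb:
            return tier
    return "keyword"  # keyword search needs no extra RAM
```

An operator would still write the chosen value into `tier = "…"` in config.toml; the sketch only helps pick it.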
The six autonomous features

What Gemma 4 unlocks.

memory_auto_tag · Smart tier+
LLM looks at a memory's title + content and proposes tags. New tags merge into the existing tag set (no overwrites). Operators use this to tag bulk-imported memories without writing rules.
    // MCP: memory_auto_tag
    {"id": "550e8400-e29b-41d4-a716-446655440000"}
    → {"id": "…",
       "new_tags": ["okr", "q3", "engineering"],
       "all_tags": ["draft", "okr", "q3", "engineering"]}
    // Gemma 4 reads title+content, returns 3-5 relevant tags. The new tags are
    // merged with whatever was already there.
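The merge rule (new tags are added, nothing is overwritten) can be sketched in a few lines of Python. This is an illustration, not ai-memory's implementation: `merge_tags` is a hypothetical helper, and it assumes `new_tags` reports only the tags that were not already present.

```python
def merge_tags(existing: list[str], proposed: list[str]) -> dict[str, list[str]]:
    """Merge proposed tags into the existing list without dropping or duplicating any."""
    seen = set(existing)
    new_tags = []
    for tag in proposed:
        if tag not in seen:
            seen.add(tag)
            new_tags.append(tag)
    # Existing tags keep their order; genuinely new ones are appended.
    return {"new_tags": new_tags, "all_tags": existing + new_tags}


result = merge_tags(["draft"], ["okr", "q3", "engineering"])
print(result["all_tags"])  # ['draft', 'okr', 'q3', 'engineering']
```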
memory_consolidate · Smart tier+
Bulk-collapses N memories (up to 100) into 1 derived summary. Source memories are linked to the consolidated output via derived_from KG relation, so provenance survives. The biological-memory analog of sleep-driven episodic-to-long-term consolidation.
    // MCP: memory_consolidate
    {
      "ids": ["id-1", "id-2", "id-3", …],   // 2-100 ids
      "title": "Q3 OKR - consolidated retrospective",
      "namespace": "alphaone/eng"
    }
    → {"consolidated_id": "…",
       "summary": "<Gemma 4 generates a coherent summary>",
       "source_count": 12,
       "links_created": 12}   // each source → derived_from edge
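The two mechanical parts of this call, the 2-100 id bound and one provenance edge per source, can be sketched as follows. `plan_consolidation` and the `(from, relation, to)` tuple layout are hypothetical, and the edge direction (consolidated memory derived_from each source) is an assumption about how the relation is stored.

```python
def plan_consolidation(ids: list[str], consolidated_id: str) -> list[tuple[str, str, str]]:
    """Check the 2-100 id bound and return one derived_from edge per source id."""
    if not 2 <= len(ids) <= 100:
        raise ValueError(f"memory_consolidate takes 2-100 ids, got {len(ids)}")
    # Assumed direction: the consolidated memory is derived_from each source.
    return [(consolidated_id, "derived_from", source_id) for source_id in ids]


edges = plan_consolidation([f"id-{n}" for n in range(1, 13)], "consolidated-1")
print(len(edges))  # 12
```

With 12 source ids the sketch yields 12 edges, matching the `source_count` and `links_created` values in the example response.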
memory_expand_query · Smart tier+
Takes a short user query and expands it into a richer set of related terms. Used to widen recall when the literal query doesn't match enough rows. Especially useful for vague natural-language queries against a corpus that uses precise jargon.
    // MCP: memory_expand_query
    {"query": "how do we deploy"}
    → {"original": "how do we deploy",
       "expanded_terms": ["deploy", "deployment", "release", "ship", "rollout",
                          "kubernetes", "ci pipeline", "container registry"]}
    // Caller can then run memory_search across the expanded set.
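One way a caller might use the expanded set is to fold it into a single keyword query. The sketch below builds an SQLite FTS5 MATCH expression from the terms; it assumes the caller can pass FTS5 syntax through to the keyword index, which the page does not state, and `fts5_or_query` is a hypothetical helper.

```python
def fts5_or_query(terms: list[str]) -> str:
    """Join terms into one FTS5 MATCH expression, quoting each as a string or phrase."""
    quoted = []
    for term in terms:
        term = term.strip()
        if term:
            # FTS5 strings are double-quoted; an embedded quote is written twice.
            quoted.append('"' + term.replace('"', '""') + '"')
    return " OR ".join(quoted)


print(fts5_or_query(["deploy", "ci pipeline", "rollout"]))
# "deploy" OR "ci pipeline" OR "rollout"
```

Quoting every term keeps multi-word expansions such as "ci pipeline" together as a phrase instead of letting them split into separate tokens.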
memory_detect_contradiction · Smart tier+
Compares two memories and tells you if they contradict. Powers the v0.6.3 KG contradicts relation: when the LLM flags a contradiction, the system can auto-link the pair so future recall surfaces the conflict.
    // MCP: memory_detect_contradiction
    {
      "id_a": "id of the older memory",
      "id_b": "id of the newer memory"
    }
    → {"contradicts": true,
       "memory_a": {"id": "…", "title": "We use Postgres"},
       "memory_b": {"id": "…", "title": "We migrated to MySQL"}}
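The auto-link step described above can be pictured as turning a positive result into a KG edge. A small sketch, assuming the response shape shown in the example; `contradiction_edge` and the tuple layout are hypothetical, not the stored representation.

```python
from typing import Optional


def contradiction_edge(result: dict) -> Optional[tuple[str, str, str]]:
    """Return a (memory_a, "contradicts", memory_b) edge when a conflict was flagged."""
    if not result.get("contradicts"):
        return None  # no conflict flagged, nothing to link
    return (result["memory_a"]["id"], "contradicts", result["memory_b"]["id"])


sample = {
    "contradicts": True,
    "memory_a": {"id": "a-1", "title": "We use Postgres"},
    "memory_b": {"id": "b-2", "title": "We migrated to MySQL"},
}
print(contradiction_edge(sample))  # ('a-1', 'contradicts', 'b-2')
```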
cross_encoder_reranking · Autonomous tier
Cross-encoder reranker scores top-K recall results against the query, reordering for precision. Where keyword + vector recall return a candidate set, the cross-encoder is the final pass that puts the best match first. Adds ~50ms to a recall but materially improves top-1 quality.
    // Implicit: automatically applied during memory_recall when:
    //   1. Autonomous tier is configured, AND
    //   2. cross-encoder model loaded successfully at startup
    //
    // Recall pipeline becomes:
    //   FTS5 70% ⊕ HNSW 30% → candidate set (top-100 typical) →
    //   Cross-encoder rerank → final top-K (default 10)
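The two-stage shape of that pipeline, a 70/30 blend followed by a rerank, can be sketched in Python. Assumptions are loud here: both input score maps are taken to be normalized to [0, 1] (the page does not say how ai-memory normalizes), and `overlap_score` is a toy word-overlap stand-in for the real cross-encoder model.

```python
def hybrid_candidates(fts_scores, vec_scores, top_n=100):
    """Blend keyword and vector scores 70/30 and keep the top_n memory ids."""
    ids = set(fts_scores) | set(vec_scores)
    blended = {
        mid: 0.7 * fts_scores.get(mid, 0.0) + 0.3 * vec_scores.get(mid, 0.0)
        for mid in ids
    }
    # Highest blended score first; ties broken by id for a stable order.
    return sorted(blended, key=lambda mid: (-blended[mid], mid))[:top_n]


def rerank(query, candidates, texts, score_pair, top_k=10):
    """Reorder candidates by a (query, text) relevance score and keep top_k."""
    return sorted(candidates, key=lambda mid: -score_pair(query, texts[mid]))[:top_k]


def overlap_score(query, text):
    """Toy stand-in for a cross-encoder: fraction of query words found in text."""
    words = set(query.lower().split())
    return len(words & set(text.lower().split())) / len(words) if words else 0.0
```

The first stage is cheap and recall-oriented; the second scores each (query, candidate) pair jointly, which is why it runs only on the reduced candidate set.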
memory_session_start · All tiers
Run at the start of an agent session. Auto-recalls the recent + high-priority memories (optionally scoped to a namespace) so the agent boots with context — the "morning briefing" without explicit recall calls peppered through the prompt. Default output is token-efficient toon_compact.
    // MCP: memory_session_start
    {
      "namespace": "alphaone/eng/leadership",   // optional filter
      "limit": 10,                              // default 10 · cap 50
      "format": "toon_compact"                  // json | toon | toon_compact
    }
    → count:10|mode:session_start
      memories[id|title|tier|namespace|priority|score|tags|agent_id]:
      …
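The limit handling (default 10, cap 50) and the "recent + high-priority" selection can be sketched as below. Only the default and the cap come from the page; the ordering (priority first, then recency), the namespace match on exact name or child path, and the `updated_at` field are assumptions made for the sketch.

```python
def clamp_limit(limit=None, default=10, cap=50):
    """Apply the documented default (10) and cap (50) to a requested limit."""
    if limit is None:
        return default
    return max(1, min(limit, cap))


def session_start_pick(memories, namespace=None, limit=None):
    """Pick boot-context memories: highest priority first, then most recent."""
    def in_scope(ns):
        # Assumed scoping rule: the namespace itself or any child path under it.
        return namespace is None or ns == namespace or ns.startswith(namespace + "/")

    pool = [m for m in memories if in_scope(m["namespace"])]
    pool.sort(key=lambda m: (-m["priority"], -m["updated_at"]))
    return pool[:clamp_limit(limit)]
```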
Why local matters

Every byte stays on the host.

No content leaves your machine. Auto-tag, consolidate, expand-query, contradiction-detect — all four prompt Gemma 4 with your memory contents. Because Ollama serves Gemma locally, the contents never leave the host. No SaaS provider sees your data; no API key risks leaking it.
No telemetry. ai-memory itself ships zero telemetry. Ollama itself ships zero telemetry. Your agent's memory operations stay between you and the daemon.
Air-gap compatible. Once Gemma 4 is pulled (one-time download), the daemon runs offline. Useful for regulated environments, classified work, or anywhere outbound network egress is restricted.
Cost-stable. No per-token billing. The model is yours. Auto-tagging 100 000 memories costs the same in API fees as auto-tagging none: zero.
Made possible by Google's open-weight Gemma 4. Without Google's choice to ship Gemma open, every LLM-driven feature on this page would require a paid hosted API. The local-first agent ecosystem would be much smaller without Gemma 4 in it.
Operator quick start

From zero to autonomous in 4 steps.

    # 1. Install Ollama
    $ brew install ollama
    $ ollama serve &

    # 2. Pull Gemma 4 (one-time, ~2.3 GB for E4B)
    $ ollama pull gemma4:e4b
    # or for the smaller Smart-tier model:
    $ ollama pull gemma4:e2b

    # 3. Configure ai-memory for autonomous tier (canonical v0.7.x schema-v2
    #    sectioned form, per CONFIG_SCHEMA.md / #1146). The legacy flat shape
    #    (`ollama_url = "..."` / `cross_encoder = true` at the top level) still
    #    parses for backward compat. Run `ai-memory config migrate` to rewrite
    #    in v2 shape (`.bak.<ts>` saved); legacy fields are removed in v0.8.0.
    $ cat > ~/.config/ai-memory/config.toml <<EOF
    schema_version = 2
    tier = "autonomous"

    [llm]
    backend = "ollama"
    model = "gemma4:e4b"        # the model pulled in step 2 (compiled default: gemma3:4b)
    base_url = "http://localhost:11434"

    [embeddings]
    backend = "ollama"          # #1598: or any API vendor alias / "openai-compatible"
    url = "http://localhost:11434"
    model = "nomic-embed-text-v1.5"

    [reranker]
    enabled = true
    model = "ms-marco-MiniLM-L-6-v2"
    EOF

    # 4. Start the daemon
    $ ai-memory serve

    # Verify the autonomous features came online (HTTP capabilities surface,
    # or the memory_capabilities MCP tool, or `ai-memory doctor`):
    $ curl -s http://127.0.0.1:9077/api/v1/capabilities
    → "tier": "autonomous"
    → "features": { "auto_tagging": true, "auto_consolidation": true,
                    "cross_encoder_reranking": true, "semantic_search": true, … }
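The verification step can be scripted rather than checked by eye. A small sketch that takes the capabilities response as a JSON string and reports which expected feature flags are missing or false. It assumes `tier` and `features` sit at the top level of the response, as the excerpt above suggests; `missing_features` is a hypothetical helper, and the full response shape is elided on this page.

```python
import json

# Feature flags shown in the capabilities excerpt above.
REQUIRED = (
    "auto_tagging",
    "auto_consolidation",
    "cross_encoder_reranking",
    "semantic_search",
)


def missing_features(capabilities_json: str, required=REQUIRED) -> list[str]:
    """Return the required feature flags that are absent or false."""
    caps = json.loads(capabilities_json)
    features = caps.get("features", {})
    return [name for name in required if not features.get(name, False)]


sample = '{"tier": "autonomous", "features": {"auto_tagging": true, "semantic_search": true}}'
print(missing_features(sample))  # ['auto_consolidation', 'cross_encoder_reranking']
```

In practice the input string would be the output of the `curl -s` call shown in the quick start.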

From this point on, the autonomous MCP tools are live. Wire them into your AI client (Claude Code, Cursor, Codex, Continue, etc.) — see integrations atlas.

No GPU required — any LLM backend

Autonomous mode on CPU-only or air-gapped hosts — point [llm] at any LLM you have API access to.

Nothing on this page is hard-wired to a GPU, to Ollama, or to Gemma. The Ollama + Gemma quick-start above is the local-first default, not a requirement. As of v0.7.0 (#1067 + #1146) any LLM reachable over an HTTP endpoint can drive every autonomous feature. That covers the case where you have API access to a model, your hosts have no GPU, and you want to run ai-memory in autonomous mode with --profile full. The backend is a config choice ([llm] in config.toml or the AI_MEMORY_LLM_* env vars), resolved through the canonical precedence ladder CLI > env > [llm] > legacy > default.
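The precedence ladder reads as "first layer that supplies a value wins". A minimal Python sketch of that rule; `resolve_setting` and the layer labels are illustrative, not ai-memory's resolver, and the model strings in the usage line are placeholders.

```python
def resolve_setting(cli=None, env=None, llm_section=None, legacy=None, default=None):
    """Return (value, layer) from the first layer that supplies a value.

    Order follows the ladder: CLI > env > [llm] > legacy > default.
    """
    layers = [
        ("cli", cli),
        ("env", env),
        ("[llm]", llm_section),
        ("legacy", legacy),
        ("default", default),
    ]
    for name, value in layers:
        if value is not None:
            return value, name
    return None, "unset"


print(resolve_setting(env="model-from-env", llm_section="model-from-config"))
# ('model-from-env', 'env')
```

Returning the winning layer alongside the value mirrors what a diagnostic such as `ai-memory doctor` reports when it confirms which layer won.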

Two CPU-only / air-gap postures below. The model names are examples, not requirements — substitute whatever model your provider or internal endpoint serves.

A. Cloud API — no local model, no GPU (OpenRouter shown as one low-cost example)

    # Any OpenAI-compatible vendor works (openai | xai | anthropic | gemini |
    # deepseek | kimi | qwen | mistral | groq | together | cerebras |
    # openrouter | fireworks | …). OpenRouter + a Gemma 4 chat model is shown
    # here only as a low-cost EXAMPLE; swap `model` for any model the vendor serves.
    $ cat > ~/.config/ai-memory/config.toml <<EOF
    schema_version = 2
    tier = "autonomous"

    [llm]
    backend = "openrouter"
    model = "google/gemma-4-26b-it"        # example; any served model is fine
    api_key_env = "OPENROUTER_API_KEY"     # env-var reference; never inline the key

    [embeddings]
    backend = "openai-compatible"          # CPU-only remote embeddings, no local model
    url = "https://api.openai.com/v1"
    model = "text-embedding-3-small"

    [reranker]
    enabled = true
    model = "ms-marco-MiniLM-L-6-v2"       # ~90 MB CPU cross-encoder; no GPU
    EOF
    $ export OPENROUTER_API_KEY="sk-or-…"  # secret via env, never written to config
    $ ai-memory serve --profile full
    $ ai-memory doctor                     # LLM Reachability (#1146): confirms which layer won
    → "tier": "autonomous"

B. Air-gapped — internal HA inference endpoint (no public internet, no GPU on the ai-memory host)

    # The ai-memory host needs no GPU and no public egress. It speaks HTTP to an
    # internal load-balanced inference endpoint (vLLM / llama.cpp server / TGI /
    # your own gateway). `openai-compatible` requires an explicit base_url.
    $ cat > ~/.config/ai-memory/config.toml <<EOF
    schema_version = 2
    tier = "autonomous"

    [llm]
    backend = "openai-compatible"
    model = "gemma-4-26b"                          # example; whatever your gateway serves
    base_url = "https://llm-ha.internal.corp/v1"   # internal HA VIP, mTLS terminated upstream
    api_key_env = "AI_MEMORY_LLM_API_KEY"

    [embeddings]
    backend = "openai-compatible"
    url = "https://llm-ha.internal.corp/v1"
    model = "nomic-embed-text-v1.5"

    [reranker]
    enabled = true
    model = "ms-marco-MiniLM-L-6-v2"
    EOF
    $ ai-memory serve --profile full

Secrets discipline: the key is referenced by env-var name (api_key_env) or by a mode-0400 file (api_key_file) — an inline api_key = "…" literal is rejected at parse time. Canonical schema + per-vendor recipes: CONFIG_SCHEMA.md · integrations/llm-backends.md.
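The secrets rule can be illustrated with a small resolver over an already-parsed [llm] table. This is a sketch of the policy as described, not ai-memory's parser: `resolve_api_key` is hypothetical, and requiring the file mode to be exactly 0400 is an assumption drawn from the "mode-0400 file" wording.

```python
import os
import stat


def resolve_api_key(llm: dict, environ=os.environ) -> str:
    """Resolve the API key from api_key_env or api_key_file; refuse inline keys."""
    if "api_key" in llm:
        raise ValueError("inline api_key is rejected; use api_key_env or api_key_file")
    if "api_key_env" in llm:
        name = llm["api_key_env"]
        if name not in environ:
            raise KeyError(f"environment variable {name} is not set")
        return environ[name]
    if "api_key_file" in llm:
        path = llm["api_key_file"]
        mode = stat.S_IMODE(os.stat(path).st_mode)
        if mode != 0o400:  # assumed strict check: owner read-only
            raise PermissionError(f"{path} has mode {oct(mode)}, expected 0o400")
        with open(path, encoding="utf-8") as handle:
            return handle.read().strip()
    raise ValueError("no api_key_env or api_key_file configured")
```

Keeping only the variable name or file path in config.toml means the file itself can be committed or shared without exposing the secret.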