Performance — measured, gated, public.

v0.6.3 ships a bench tool with three operator-friendly flags: --baseline for regression detection, --history for JSONL accumulation, and --update-performance-md for auto-spliced docs. CI runs the bench on every PR; regressions beyond a configurable threshold (default 10%) fail the gate. The numbers in PERFORMANCE.md are the public contract.

Public p95 budgets

The numbers, committed to source.

Every operation has a target p95 latency budget on a reference machine (Apple M2, 8GB free RAM, SQLite). PRs that regress beyond a configurable threshold fail the bench CI gate. v0.7's Apache AGE work will extend this with KG-query budgets; v0.8 adds task-queue p95.

Rows tagged [advisory] are published targets without a matching bench in src/bench.rs yet (Stream E follow-up — embedder-bound, background-job, or federation paths). Unmarked rows are exercised by ai-memory bench on every PR. PERFORMANCE.md is the canonical budget contract.

| Operation | Tier | p95 budget | Notes |
|---|---|---|---|
| memory_store | keyword | ≤ 5 ms | FTS5 only · single-row insert |
| memory_store | semantic | ≤ 25 ms | [advisory] + MiniLM embedding (384d) |
| memory_store | autonomous | ≤ 60 ms | [advisory] + nomic embedding (768d) |
| memory_get | any | ≤ 2 ms | [advisory] indexed PK lookup + access bump |
| memory_search | keyword | ≤ 8 ms | FTS5 query, top-20 |
| memory_recall | semantic | ≤ 35 ms | [advisory] FTS5 70% + HNSW 30% fusion |
| memory_recall | autonomous | ≤ 90 ms | [advisory] + cross-encoder rerank (top-100→top-10) |
| memory_link | any | ≤ 4 ms | [advisory] FK insert + idempotency check |
| memory_promote | any | ≤ 8 ms | [advisory] + governance verdict |
| memory_consolidate | smart | ≤ 1500 ms | [advisory] LLM-bound · Gemma 4 E2B |
| memory_kg_query | any | ≤ 50 ms | recursive CTE · depth 3 · < 1k edges |
| memory_get_taxonomy | any | ≤ 30 ms | [advisory] tree walk · default depth 8 · limit 1000 |
| memory_archive_purge | any | ≤ 200 ms | [advisory] per 1000 archived rows |
| sync_push (per row) | any | ≤ 15 ms | [advisory] peer-to-peer · TLS 1.3 |
| bulk_create | any | ≤ 2000 ms | [advisory] 100 rows + fanout · concurrent post W12 |
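The gate behind these rows is simple arithmetic: compare a run's measured p95 against the committed budget. A minimal Python sketch — the budget map and result shapes here are illustrative stand-ins, not the bench tool's actual formats:

```python
# Compare measured p95 latencies against published budgets.
# BUDGETS_MS and `run` are illustrative stand-ins for PERFORMANCE.md
# rows and a bench run's output — not the tool's real data shapes.

BUDGETS_MS = {
    ("memory_store", "keyword"): 5,
    ("memory_search", "keyword"): 8,
    ("memory_kg_query", "any"): 50,
}

def over_budget(results: dict[tuple[str, str], float]) -> list[str]:
    """Return the operations whose measured p95 exceeds the budget."""
    return [
        f"{op}/{tier}: {p95}ms > {BUDGETS_MS[(op, tier)]}ms"
        for (op, tier), p95 in results.items()
        if (op, tier) in BUDGETS_MS and p95 > BUDGETS_MS[(op, tier)]
    ]

run = {
    ("memory_store", "keyword"): 4.4,   # within budget
    ("memory_kg_query", "any"): 62.0,   # over budget
}
print(over_budget(run))  # → ['memory_kg_query/any: 62.0ms > 50ms']
```

Rows tagged [advisory] would simply be absent from the budget map until their bench lands.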

Reference machine: Apple M2, 16GB RAM, macOS 14, SQLite default settings, single-process daemon. Numbers degrade gracefully on lower-spec hardware; budgets are set with headroom in mind. Live numbers are published in PERFORMANCE.md.

The bench tool

Three flags that make it operator-friendly.

--baseline <path>
Compare this run against a saved baseline JSON. Flags any operation whose p95 increased by more than the configured threshold (default 10%). Used by the CI bench gate to refuse regressions before merge.
```
$ ai-memory bench --baseline ./bench/baseline-v0.6.3.json
→ memory_store keyword        baseline 4.2ms    current 4.4ms    ✓ within +5%
→ memory_recall semantic      baseline 31ms     current 33ms     ✓ within +6%
→ memory_consolidate smart    baseline 1100ms   current 1750ms   ✗ +59% — REGRESSION

Exit 1 — bench detected regressions, see report
```
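The comparison --baseline performs reduces to a per-operation percentage check against the saved run. A hypothetical sketch — the real tool reads JSON baselines, and the key names here are illustrative:

```python
def regressions(baseline: dict[str, float], current: dict[str, float],
                threshold_pct: float = 10.0) -> list[str]:
    """Flag operations whose p95 grew more than threshold_pct vs the baseline."""
    flagged = []
    for op, base_p95 in baseline.items():
        cur_p95 = current.get(op)
        if cur_p95 is None:
            continue  # operation not measured in this run
        delta_pct = (cur_p95 - base_p95) / base_p95 * 100
        if delta_pct > threshold_pct:
            flagged.append(f"{op}: +{delta_pct:.0f}%")
    return flagged

base = {"memory_store_keyword": 4.2, "memory_consolidate_smart": 1100}
cur  = {"memory_store_keyword": 4.4, "memory_consolidate_smart": 1750}
print(regressions(base, cur))  # → ['memory_consolidate_smart: +59%']
```

A non-empty result maps to the non-zero exit code the CI gate keys off.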
--history <path>
Append the current run as one JSONL row to a history file. Trends over time become a one-liner — jq, awk, or any chart tool reads the JSONL directly. Used to keep a rolling p95 trend that survives across releases.
```
$ ai-memory bench --history ./bench/history.jsonl
→ Appended run 2026-04-27T05:00:00Z to history.jsonl

# Each row in history.jsonl:
{"timestamp":"2026-04-27T05:00:00Z",
 "git_sha":"d4bc4b6",
 "results":{
   "memory_store_keyword":   {"p50":3.1, "p95":4.4, "p99":5.2},
   "memory_recall_semantic": {"p50":22,  "p95":33,  "p99":41},
   …}}
```
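Because each run is one JSONL row, extracting a trend really is a one-liner in jq and only a few lines anywhere else. A Python sketch, assuming the row shape shown above (the sample rows below are illustrative):

```python
import json

def p95_trend(jsonl_lines, op: str) -> list[tuple[str, float]]:
    """Extract (timestamp, p95) pairs for one operation from history JSONL rows."""
    trend = []
    for line in jsonl_lines:
        row = json.loads(line)
        if op in row["results"]:
            trend.append((row["timestamp"], row["results"][op]["p95"]))
    return trend

# Two illustrative history rows, mirroring the documented shape:
sample = [
    '{"timestamp":"2026-04-26T05:00:00Z","git_sha":"a1b2c3d","results":{"memory_recall_semantic":{"p50":21,"p95":31,"p99":39}}}',
    '{"timestamp":"2026-04-27T05:00:00Z","git_sha":"d4bc4b6","results":{"memory_recall_semantic":{"p50":22,"p95":33,"p99":41}}}',
]
print(p95_trend(sample, "memory_recall_semantic"))
# → [('2026-04-26T05:00:00Z', 31), ('2026-04-27T05:00:00Z', 33)]
```

Feeding the pairs to any plotting tool gives the rolling p95 trend across releases.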
--update-performance-md
Splice the latest measurements into the public PERFORMANCE.md file, in-place between the marker comments. The next git commit captures the doc update. Operators run this after a known-good release to keep the public numbers fresh.
```
$ ai-memory bench --update-performance-md ./PERFORMANCE.md
→ Updated 12 budget rows · staged for commit · git diff PERFORMANCE.md

# Workflow:
# 1. cut a release candidate
# 2. run bench on the reference machine
# 3. ai-memory bench --update-performance-md
# 4. git diff to review the splice
# 5. commit + ship
```
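An in-place splice between marker comments is a small regex replace. A sketch assuming hypothetical `<!-- bench:begin -->` / `<!-- bench:end -->` markers — the actual marker comments in PERFORMANCE.md may differ:

```python
import re

def splice(md: str, new_rows: str,
           begin: str = "<!-- bench:begin -->",
           end: str = "<!-- bench:end -->") -> str:
    """Replace everything between the marker comments, keeping the markers."""
    pattern = re.compile(re.escape(begin) + r".*?" + re.escape(end), re.DOTALL)
    return pattern.sub(begin + "\n" + new_rows + "\n" + end, md, count=1)

doc = "# Budgets\n<!-- bench:begin -->\nold rows\n<!-- bench:end -->\nfooter"
print(splice(doc, "| memory_store | keyword | 4.4 ms |"))
```

Keeping the markers in the output means the splice is idempotent: the next run replaces only the generated region, never the hand-written prose around it.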
CI bench gate

Regressions fail the build.

Every PR runs the bench job (.github/workflows/bench.yml) on a fixed-spec runner. The job pulls the baseline from the previous release tag, runs the bench, and compares. Regressions beyond the threshold fail the job and block merge.

```yaml
# .github/workflows/bench.yml — pseudocode
on:
  pull_request:
    branches: [main, "release/**"]
jobs:
  bench:
    runs-on: ubuntu-latest   # consistent runner — bench is comparative
    steps:
      - name: Checkout
      - name: Build release
      - name: Pull baseline
        run: gh release download v0.6.3 --pattern bench-baseline.json
      - name: Run bench
        run: ./target/release/ai-memory bench --baseline bench-baseline.json --history bench-history.jsonl
      - name: Upload history (artifact)
        uses: actions/upload-artifact@v5
```
Tracing — span-level attribution

Why was that one slow?

Every MCP tool dispatch is wrapped in a tracing::info_span!("mcp_tool_call") with attributes including tool, elapsed_ms, and outcome. Operators can chart per-tool p95/p99 against the public budgets, and when a single call regresses, the trace tree shows which sub-operation took the time. v0.7's tracing layer extends this with hook-pipeline spans.

```
# Example — find recall calls slower than 100ms in the last hour
$ grep "mcp_tool_call" daemon.log | jq 'select(.tool=="memory_recall" and .elapsed_ms > 100)'
{"timestamp":"…", "tool":"memory_recall", "elapsed_ms":142, "outcome":"ok",
 "namespace":"alphaone/eng/platform/team-a", "limit":50, "reranker_used":true}
```
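Charting per-tool p95 from these structured logs needs only a nearest-rank percentile over elapsed_ms. A sketch, assuming the log shape shown above (the sample entries are illustrative):

```python
import json
import math

def tool_p95(log_lines, tool: str) -> float:
    """Compute the p95 of elapsed_ms across mcp_tool_call entries for one tool."""
    samples = sorted(
        row["elapsed_ms"]
        for row in map(json.loads, log_lines)
        if row.get("tool") == tool
    )
    if not samples:
        raise ValueError(f"no samples for {tool}")
    idx = math.ceil(0.95 * len(samples)) - 1  # nearest-rank p95
    return samples[idx]

# Illustrative log entries:
logs = [json.dumps({"tool": "memory_recall", "elapsed_ms": ms, "outcome": "ok"})
        for ms in [20, 25, 30, 33, 142]]
print(tool_p95(logs, "memory_recall"))  # → 142
```

The result plots directly against the ≤ 35 ms / ≤ 90 ms recall budgets in the table above.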
Scaling envelope

When the numbers stop holding.

- Up to ~100k memories on SQLite — comfortable. FTS5 stays sub-10ms recall, HNSW vector index sub-30ms. The reference numbers above hold.
- 100k–1M on SQLite — still works; FTS5 + HNSW remain fast (both are O(log n) for typical queries), but consolidation and bulk_create see noticeable degradation. Recommend the Postgres SAL adapter (--features sal) past ~1M.
- 1M+ on Postgres — supported via SAL. Bench numbers shift; v0.7 Apache AGE wraps the KG schema for graph-query speedup at this scale. The dual-path SAL contract keeps SQLite and Postgres operations behaviorally identical.
- Federation soak (5+ peers, sustained burst writes) — v0.6.3 W12 hardening + S40 catchup batch close the gap from v0.6.2 (which dropped 1/500 rows under sustained mTLS load). The certification matrix on release-pipeline.html shows v0.6.2's a2a-gate certification streak; v0.6.3 builds on that with an additional 3.05% coverage gain across federation paths.