# Gemma 4 MTP — Empirical Bench (2026-05-17)

> Reproducibility doc for the v0.7.0 / RFC #651 inference-backend
> evaluation. Pins a one-day measurement on an M4 Mac Mini at the
> exact Ollama/model versions present 2026-05-17 so a future agent can
> re-run and compare.

## TL;DR

**On `gemma4:e4b` Q4_K_M (the model `ai-memory`'s autonomous tier uses), setting `OLLAMA_MLX_MTP_MAX_DRAFT_TOKENS=4` produces zero measurable speedup.** Decode rate stays at 31 tok/sec, p50 wall stays at 2.9 s for the standard 80-token completion the curator's auto-tag / contradiction passes drive. MTP needs the separately published draft model (`google/gemma-4-E4B-it-assistant`) to have anything to speculate against, and that model ships only as HF safetensors — Ollama's llama.cpp pull path returns `400: Repository is not GGUF or is not compatible with llama.cpp`. Until a GGUF-converted drafter lands (Ollama library, llama.cpp `convert_hf_to_gguf.py`, or a community port), MTP gating on this host is a no-op.

## Test environment

| Component | Value |
|---|---|
| Host | M4 Mac Mini 2026 (FROSTYi.local), macOS 26.4.1 |
| Ollama version (`ollama --version`) | 0.24.0 |
| Ollama serve binary | `/Applications/Ollama.app/Contents/Resources/ollama serve` |
| Ollama backend | Native MLX (`libmlx.dylib` loaded into PID per `lsof` — see `OllamaMLXBackendCheck` in §Caveats) |
| Model | `gemma4:e4b` Q4_K_M GGUF, 8.0 B params, 131k ctx, requires Ollama ≥ 0.20.0 |
| Draft model | **Absent** — pull of `hf.co/google/gemma-4-E4B-it-assistant` fails: `400: Repository is not GGUF or is not compatible with llama.cpp` |
| ai-memory | v0.7.0 + PR #820 (curator dispatch deadlock fix + #819 hermetic tests + clippy pedantic) |
| Curator config | `--interval-secs 300 --max-ops 100`, autonomous tier, gemma4:e4b for both auto-tag and contradiction-detect |

## Reproduction

```bash
# Path A — Direct Ollama latency bench (no curator)
cd /Users/fate/v07/v07-fixes/.local-runs/mtp-bench-2026-05-17
./bench.sh baseline-pre-mtp
# 20 calls, captures total/eval/prompt durations as JSONL

# Restart Ollama with MTP env (kills the .app's auto-spawned serve)
pkill -TERM -f 'Ollama|ollama serve'
sleep 4
# Still inside the bench dir from the cd above, so the log path is relative to it
nohup env OLLAMA_MLX_MTP_MAX_DRAFT_TOKENS=4 \
  OLLAMA_MODELS=/Users/fate/.ollama/models \
  /Applications/Ollama.app/Contents/Resources/ollama serve \
  > ./ollama-serve-mtp.log 2>&1 &
disown
sleep 4
ps eww -p "$(pgrep -f 'ollama serve' | head -1)" | tr ' ' '\n' | grep MTP   # confirm env
./bench.sh post-mtp   # same 20 calls under MTP
```

Both runs write `.jsonl` (per-call timings) and `.summary.txt` (model card + start/end wall times) to the same dir. Compare via:

```bash
jq -s '{
  calls: length,
  decode_tokens_per_sec: (([.[].eval_count] | add) * 1e9 / ([.[].eval_duration] | add)),
  p50_wall_ms: (sort_by(.wall_ns)[length/2|floor].wall_ns / 1000000),
  warm_avg_skip_outliers_ms: (sort_by(.wall_ns)[1:-1] | (map(.wall_ns) | add) / length / 1000000)
}' baseline-pre-mtp.jsonl
# repeat for post-mtp.jsonl
```

## Numbers (this run, 2026-05-17 11:48–11:59 UTC)

```
                              Baseline (no MTP)   MTP enabled   Δ
decode tokens/sec             31.35               31.01         −1% (noise)
p50 wall (ms)                 2897                2884          −0.4%
warm-avg skip outliers (ms)   2911                2991          +2.7%
total wall 20 calls (ms)      56401               68476         +21% (one 11.7 s outlier in MTP run)
```

1542 output tokens generated each side, same prompt set (verbatim from `bench.sh`). The 11.7 s spike on the 20th MTP-run call was a model unload/swap, unrelated to MTP behavior. Removing it pulls the warm averages closer (≤3% wall delta, ≤1% tok/sec delta — all within single-run noise).
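For readers who prefer to cross-check the summary outside `jq`, the same three metrics and the Δ column can be sketched in Python. This is a minimal sketch that assumes the per-call records carry the `eval_count`, `eval_duration`, and `wall_ns` fields used in the jq expression; the helper names `load_jsonl`, `summarize`, and `pct_delta` are hypothetical and not part of `bench.sh`.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Read one JSON object per non-empty line of a bench output file."""
    lines = Path(path).read_text().splitlines()
    return [json.loads(line) for line in lines if line.strip()]


def summarize(records):
    """Mirror the jq summary: decode rate, p50 wall, trimmed warm average."""
    walls = sorted(r["wall_ns"] for r in records)
    trimmed = walls[1:-1]  # drop the fastest and slowest call, like [1:-1] in jq
    total_tokens = sum(r["eval_count"] for r in records)
    total_eval_ns = sum(r["eval_duration"] for r in records)
    return {
        "calls": len(records),
        "decode_tokens_per_sec": total_tokens * 1e9 / total_eval_ns,
        "p50_wall_ms": walls[len(walls) // 2] / 1e6,
        "warm_avg_skip_outliers_ms": sum(trimmed) / len(trimmed) / 1e6,
    }


def pct_delta(baseline, candidate):
    """Signed percent change of candidate relative to baseline."""
    return (candidate - baseline) / baseline * 100.0


# Values copied from the table above as (baseline, MTP enabled).
reported = {
    "decode tokens/sec": (31.35, 31.01),
    "p50 wall (ms)": (2897, 2884),
    "warm-avg skip outliers (ms)": (2911, 2991),
    "total wall 20 calls (ms)": (56401, 68476),
}
for name, (base, mtp) in reported.items():
    print(f"{name}: {pct_delta(base, mtp):+.1f}%")
```

With the bench directory as the working directory, `summarize(load_jsonl("baseline-pre-mtp.jsonl"))` and the same call on `post-mtp.jsonl` should yield the two columns, assuming the files have the shape described above.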
## End-to-end curator cycle delta

Measured cycle wall times from `_curator/reports`:

```
created_at                 dur_ms  pg  tagged  contrad  errs  notes
2026-05-17T12:01:24+00:00   98904   0       0        0     3  POST-MTP, new ai-memory daemon binary (PR #820)
2026-05-17T11:53:59+00:00   97841   0       0        0     3  pre-MTP
2026-05-17T11:47:21+00:00  117069   0       0        0     5  pre-MTP
2026-05-17T11:40:24+00:00   97222   0       0        0     3  pre-MTP
2026-05-17T11:33:46+00:00   60132   0       0        0     6  pre-MTP
2026-05-17T11:27:45+00:00  100476   0       0        0     3  pre-MTP
```

Pre-MTP steady-state cycle band: 60 – 117 s (range ≈ 57 s, mean ≈ 94 s). Post-MTP cycle: 98.9 s — squarely inside the pre-MTP band. Same `personas_generated=0` (nothing new to mint), `errors_total=3` (typical Ollama contradiction-detect timeouts), `auto_tagged=0` (no new eligible memories on this idle DB).

**The end-to-end null result mirrors the per-call bench.** The MTP env var doesn't engage on this model + this Ollama, so no speedup propagates to the cycle path either.

A second validation: PR #820's dispatch-deadlock fix is working in production — without it the cycle would never have persisted (the daemon would deadlock on the spawn-blocking thread's first `tracing::info!`).

Re-run via:

```sql
SELECT created_at,
       json_extract(content,'$.cycle_duration_ms') AS dur_ms,
       json_extract(content,'$.personas_generated') AS pg,
       json_extract(content,'$.errors_total') AS errs
FROM memories
WHERE namespace='_curator/reports'
ORDER BY created_at DESC
LIMIT 10;
```

## Why MTP didn't help

`OLLAMA_MLX_MTP_MAX_DRAFT_TOKENS` gates Ollama's MLX speculative decoding path. Speculative decoding requires **both** a target model and a smaller draft model — the draft predicts N tokens autoregressively, the target verifies N predictions in one parallel forward pass. Without a draft, the env var has no effect.

Two possible draft paths exist:

1. **Integrated MTP heads on the target model** (DeepSeek-V3 style — single weight file, multiple prediction heads).
   Not present in `gemma4:e4b` GGUF on this host (`ollama show gemma4:e4b` lists no draft/speculative capability, and the bench shows zero speedup).
2. **A separate companion drafter.** `google/gemma-4-E4B-it-assistant` on HuggingFace fits this shape ("MTP is implemented by extending the base model with a smaller, faster draft model"), but it's safetensors-only — Ollama's pull path rejects it with `400: Repository is not GGUF or is not compatible with llama.cpp`. Ollama's library doesn't yet publish `gemma4:e4b-draft` or `gemma4:e4b-assistant` either.

So **enabling MTP via env on this Ollama + this model is a no-op**. The `ollama serve` process doesn't error or warn — it just runs vanilla decode.

## Paths forward

| Path | Level of effort | Outcome |
|---|---|---|
| **Wait for Ollama library to ship the drafter** | 0 (passive) | When `ollama pull gemma4:e4b-assistant` succeeds, re-run this bench and expect ~1.5–2× decode speedup per the #651 RFC analysis. |
| **Convert HF safetensors → GGUF via `llama.cpp/convert_hf_to_gguf.py`** | ~1 hr investigation, may fail on novel MTP-drafter architecture | Manual but unblocks today. Risk: llama.cpp may not recognize MTP head topology yet. |
| **Wait for #651 Phase 1 — `inference-mistralrs` or `inference-mlx` backends** | 1–2 weeks per #651 RFC | In-process MLX or mistralrs backends supporting MTP natively; bypasses Ollama's GGUF dependency for the drafter. |
| **Distilled hot-path model** | weeks (training/distill effort) | The #651 ULTRA-1 fallback strategy. <1B-param model for recall hot-path, full gemma4 reserved for autonomous tier. Reaches sub-50ms p95. |

For most ai-memory deployments today (curator cycles already ≤2 min at steady state, recall-tier doesn't hit Ollama), MTP is a nice-to-have, not load-bearing. The substantive curator throughput wins live in #651 Phase 1 backends (in-process Rust, no HTTP serialization overhead) and Phase 3 distillation.
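When a drafter does become available and this bench is re-run, it helps to keep in mind what the draft/verify loop described under "Why MTP didn't help" looks like. The following is a toy sketch of greedy speculative decoding in Python, not Ollama's MLX implementation: `speculative_decode`, `target_next`, and `draft_next` are hypothetical names, the "models" are plain functions over token lists, and the single batched verification pass is simulated with a loop. Passing `draft_next=None` reproduces the no-drafter case measured here, where every token costs one target pass.

```python
def speculative_decode(target_next, draft_next, prompt, max_new, k):
    """Greedy speculative decoding over toy next-token functions.

    Returns (new_tokens, target_passes). The output always equals what the
    target alone would produce; only the number of target passes changes.
    """
    out = list(prompt)
    goal = len(prompt) + max_new
    passes = 0
    while len(out) < goal:
        if draft_next is None:
            # No drafter: vanilla decode, one target pass per token.
            out.append(target_next(out))
            passes += 1
            continue
        # Draft proposes up to k tokens autoregressively.
        proposal = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft_next(out + proposal))
        # One target pass checks all proposed positions (simulated serially).
        passes += 1
        accepted = []
        for tok in proposal:
            expected = target_next(out + accepted)
            if tok != expected:
                accepted.append(expected)  # keep the target's token, stop here
                break
            accepted.append(tok)
        else:
            # Every proposal matched: the same pass also yields one bonus token.
            if len(out) + len(accepted) < goal:
                accepted.append(target_next(out + accepted))
        out.extend(accepted)
    return out[len(prompt):], passes


def toy_target(seq):
    """Toy target model: next token is previous token plus one, modulo 10."""
    return (seq[-1] + 1) % 10


vanilla_tokens, vanilla_passes = speculative_decode(toy_target, None, [0], 10, 4)
spec_tokens, spec_passes = speculative_decode(toy_target, toy_target, [0], 10, 4)
print("vanilla passes:", vanilla_passes)
print("speculative passes:", spec_passes)
print("same output:", vanilla_tokens == spec_tokens)
```

In this toy setup a perfect drafter with `k=4` produces the ten tokens in 2 target passes instead of 10, while a drafter that always guesses wrong falls back to one token per pass. Real speedups depend on the draft acceptance rate and on the relative cost of the draft model, which this sketch does not model.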
## Caveats

- **MLX backend check.** `lsof -p <pid> | grep mlx` (with the PID of the running `ollama serve`) must show `libmlx.dylib` / `libmlxc.dylib` loaded for MTP to be reachable at all. Apple Silicon hosts get MLX auto-selected on Metal-4-capable chips (M2 Pro and later); Intel Macs / Linux NVIDIA hosts won't have an MLX path regardless of env.
- **20-call sample size** isn't statistically rigorous — for a real benchmark, run ≥ 200 calls per condition with warm-up, randomized prompt order, and a coefficient-of-variation guard. The 20-call result here is sufficient to detect the null result we found (decode rates differ by about 1%) but wouldn't surface a 5% true speedup confidently.
- **Ollama 0.24.0** is the version we tested. The `OLLAMA_MLX_MTP_*` family of env vars was added in 0.23+; behavior may evolve in later Ollama releases. Re-run after any Ollama upgrade.
- **Project hard rule on `/tmp`** — this bench writes all artifacts under `/Users/fate/v07/v07-fixes/.local-runs/mtp-bench-2026-05-17/`, including the Ollama serve log when launched with the MTP env.

## Artifacts on disk

```
.local-runs/mtp-bench-2026-05-17/
├── bench.sh                       # the bench driver
├── baseline-pre-mtp.jsonl         # 20 calls, no MTP env
├── baseline-pre-mtp.summary.txt   # model card + wall times
├── post-mtp.jsonl                 # 20 calls, OLLAMA_MLX_MTP_MAX_DRAFT_TOKENS=4
├── post-mtp.summary.txt
└── ollama-serve-mtp.log           # MTP-env-enabled Ollama serve stdout/stderr
```

## Cross-references

- Issue [#651](https://github.com/alphaonedev/ai-memory-mcp/issues/651) — RFC: pluggable inference backend trait (multi-platform GPU acceleration for autonomous + ULTRA-1 tiers).
- PR [#820](https://github.com/alphaonedev/ai-memory-mcp/pull/820) — curator dispatch deadlock fix that landed in the same session; unblocks the end-to-end measurement (without that fix the curator daemon never reached `Connection::open` so cycle timings would never settle).
- v0.7.0 release notes: this doc joins the empirical-evidence set the v0.8+ inference RFC builds on.