# Gemma 4 MTP — Empirical Bench (2026-05-17)

> Reproducibility doc for the v0.7.0 / RFC #651 inference-backend
> evaluation. Pins a one-day measurement on an M4 Mac Mini at the
> exact Ollama/model versions present 2026-05-17 so a future agent can
> re-run and compare.

## TL;DR

**On `gemma4:e4b` Q4_K_M (the model `ai-memory`'s autonomous tier
uses), setting `OLLAMA_MLX_MTP_MAX_DRAFT_TOKENS=4` produces zero
measurable speedup.** Decode rate stays at 31 tok/sec, p50 wall stays
at 2.9 s for the standard 80-token completion the curator's
auto-tag / contradiction passes drive. MTP needs the separately
published draft model (`google/gemma-4-E4B-it-assistant`) to supply
candidate tokens for the target to verify, and that model ships only
as HF safetensors. Ollama's llama.cpp pull path rejects it with
`400: Repository is not GGUF or is not compatible with llama.cpp`.

Until a GGUF-converted drafter lands (Ollama library, llama.cpp
`convert_hf_to_gguf.py`, or a community port), MTP gating on this
host is a no-op.

## Test environment

| Component | Value |
|---|---|
| Host | M4 Mac Mini 2026 (FROSTYi.local), macOS 26.4.1 |
| Ollama version (`ollama --version`) | 0.24.0 |
| Ollama serve binary | `/Applications/Ollama.app/Contents/Resources/ollama serve` |
| Ollama backend | Native MLX (`libmlx.dylib` loaded into the serve PID per `lsof`; see "MLX backend check" in §Caveats) |
| Model | `gemma4:e4b` Q4_K_M GGUF, 8.0 B params, 131k ctx, requires Ollama ≥ 0.20.0 |
| Draft model | **Absent** — pull of `hf.co/google/gemma-4-E4B-it-assistant` fails: `400: Repository is not GGUF or is not compatible with llama.cpp` |
| ai-memory | v0.7.0 + PR #820 (curator dispatch deadlock fix + #819 hermetic tests + clippy pedantic) |
| Curator config | `--interval-secs 300 --max-ops 100`, autonomous tier, gemma4:e4b for both auto-tag and contradiction-detect |

## Reproduction

```bash
# Path A — Direct Ollama latency bench (no curator)
cd /Users/fate/v07/v07-fixes/.local-runs/mtp-bench-2026-05-17
./bench.sh baseline-pre-mtp        # 20 calls, captures total/eval/prompt durations as JSONL

# Restart Ollama with MTP env (kills the .app's auto-spawned serve)
pkill -TERM -f 'Ollama|ollama serve'
sleep 4
nohup env OLLAMA_MLX_MTP_MAX_DRAFT_TOKENS=4 \
          OLLAMA_MODELS=/Users/fate/.ollama/models \
          /Applications/Ollama.app/Contents/Resources/ollama serve \
          > ollama-serve-mtp.log 2>&1 &
disown
sleep 4
ps eww -p "$(pgrep -f 'ollama serve' | head -1)" | tr ' ' '\n' | grep MTP   # confirm env

./bench.sh post-mtp                # same 20 calls under MTP
```
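
`bench.sh` itself is not reproduced in this doc. As a rough sketch of what one timed call could look like, the Python below posts to Ollama's `/api/generate` endpoint with streaming disabled and records the fields the comparison query reads (`wall_ns`, `eval_count`, `eval_duration`). The default port, the 80-token `num_predict` cap, and the helper names `http_post_json` and `timed_call` are assumptions for illustration, not the script's actual contents.

```python
import json
import time
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed: Ollama's default local port


def http_post_json(url, payload):
    """POST a JSON payload and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())


def timed_call(prompt, model="gemma4:e4b", post=http_post_json):
    """Run one non-streaming generation and return a JSONL-ready record."""
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_predict": 80},  # assumed cap, matching the 80-token completion
    }
    start = time.perf_counter_ns()
    body = post(OLLAMA_URL, payload)
    wall_ns = time.perf_counter_ns() - start  # client-side wall time, includes HTTP overhead
    return {
        "wall_ns": wall_ns,
        "total_duration": body.get("total_duration"),
        "prompt_eval_duration": body.get("prompt_eval_duration"),
        "eval_count": body.get("eval_count"),
        "eval_duration": body.get("eval_duration"),
    }
```

Calling `print(json.dumps(timed_call(prompt)))` once per prompt and appending the output to `<label>.jsonl` would yield records of the shape the comparison step expects. The `post` parameter is injectable so the timing logic can be exercised without a running Ollama.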

Both runs write `<label>.jsonl` (per-call timings) and `<label>.summary.txt`
(model card + start/end wall times) to the same dir. Compare via:

```bash
jq -s '{
  calls: length,
  decode_tokens_per_sec: (([.[].eval_count] | add) * 1e9 / ([.[].eval_duration] | add)),
  p50_wall_ms: (sort_by(.wall_ns)[length/2|floor].wall_ns / 1000000),
  warm_avg_skip_outliers_ms: (sort_by(.wall_ns)[1:-1] | (map(.wall_ns) | add) / length / 1000000)
}' baseline-pre-mtp.jsonl
# repeat for post-mtp.jsonl
```
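
For readers without `jq`, the same summary can be computed in Python. This is a sketch that mirrors the query above, assuming each JSONL record carries `eval_count`, `eval_duration` (nanoseconds), and `wall_ns` as the query implies; `load_jsonl` and `summarize` are names introduced here.

```python
import json


def load_jsonl(path):
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def summarize(records):
    """Mirror the jq summary: decode rate, p50 wall, trimmed-mean wall.

    Needs at least three records so the trimmed slice is non-empty.
    """
    walls = sorted(r["wall_ns"] for r in records)
    total_tokens = sum(r["eval_count"] for r in records)
    total_eval_ns = sum(r["eval_duration"] for r in records)
    trimmed = walls[1:-1]  # drop the single fastest and single slowest call
    return {
        "calls": len(records),
        "decode_tokens_per_sec": total_tokens * 1e9 / total_eval_ns,
        "p50_wall_ms": walls[len(walls) // 2] / 1e6,  # upper median for even counts
        "warm_avg_skip_outliers_ms": sum(trimmed) / len(trimmed) / 1e6,
    }
```

Usage would be `summarize(load_jsonl("baseline-pre-mtp.jsonl"))`, repeated for `post-mtp.jsonl`.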

## Numbers (this run, 2026-05-17 11:48–11:59 UTC)

```
                           Baseline (no MTP)     MTP enabled         Δ
decode tokens/sec          31.35                 31.01               −1% (noise)
p50 wall (ms)              2897                  2884                −0.4%
warm-avg skip outliers     2911                  2991                +2.7%
total wall 20 calls (ms)   56401                 68476               +21% (one 11.7 s outlier in MTP run)
```

Each side generated 1542 output tokens from the same prompt set
(verbatim from `bench.sh`). The 11.7 s spike on the 20th MTP-run call
was a model unload/swap, unrelated to MTP behavior. It inflates only
the total-wall row: the warm average drops the fastest and slowest
call, so it already excludes the spike. The remaining deltas (≤3%
wall, ≈1% tok/sec) are all within single-run noise.
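
The Δ column is a plain relative change against the baseline. A quick check, using values copied from the table above (`pct_delta` is a helper name introduced here):

```python
def pct_delta(baseline, candidate):
    """Relative change of candidate versus baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


# (baseline, MTP) pairs copied from the Numbers table.
rows = {
    "decode tokens/sec": (31.35, 31.01),
    "p50 wall (ms)": (2897, 2884),
    "warm-avg skip outliers (ms)": (2911, 2991),
    "total wall 20 calls (ms)": (56401, 68476),
}

for name, (base, mtp) in rows.items():
    print(f"{name:<30} {pct_delta(base, mtp):+.1f}%")
```

For decode rate, a negative delta means MTP was slower; for the wall-time rows, a positive delta means MTP was slower.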

## End-to-end curator cycle delta

Measured cycle wall times from `_curator/reports`:

```
created_at                            dur_ms   pg  tagged contrad errs   notes
2026-05-17T12:01:24+00:00             98904    0   0      0       3      POST-MTP, new ai-memory daemon binary (PR #820)
2026-05-17T11:53:59+00:00             97841    0   0      0       3      pre-MTP
2026-05-17T11:47:21+00:00             117069   0   0      0       5      pre-MTP
2026-05-17T11:40:24+00:00             97222    0   0      0       3      pre-MTP
2026-05-17T11:33:46+00:00             60132    0   0      0       6      pre-MTP
2026-05-17T11:27:45+00:00             100476   0   0      0       3      pre-MTP
```

Pre-MTP cycles span 60–117 s (range ≈ 57 s, mean ≈ 94.5 s). The
post-MTP cycle, at 98.9 s, sits squarely inside that band. The other
counters are unchanged too: `personas_generated=0` (nothing new to
mint), `auto_tagged=0` (no new eligible memories on this idle DB),
and `errors_total=3`, the most common pre-MTP value (typical Ollama
contradiction-detect timeouts; two pre-MTP cycles logged 5 and 6).
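
The band arithmetic can be reproduced from the table rows. This is only a sanity check, since there is a single post-MTP sample; the durations are copied from the table and the variable names are illustrative.

```python
from statistics import mean

pre_mtp_ms = [97841, 117069, 97222, 60132, 100476]  # the five pre-MTP rows
post_mtp_ms = 98904                                  # the single post-MTP row

low, high = min(pre_mtp_ms), max(pre_mtp_ms)
print(f"band:  {low / 1000:.1f} to {high / 1000:.1f} s")
print(f"range: {(high - low) / 1000:.1f} s")
print(f"mean:  {mean(pre_mtp_ms) / 1000:.1f} s")
print(f"post-MTP inside band: {low <= post_mtp_ms <= high}")
```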

**The end-to-end null result mirrors the per-call bench.** The
MTP env var doesn't engage on this model + this Ollama, so no
speedup propagates to the cycle path either. A second validation:
PR #820's dispatch-deadlock fix is working in production —
without it the cycle would never have persisted (the daemon
would deadlock on the spawn-blocking thread's first
`tracing::info!`).

Re-run via:

```sql
SELECT created_at, json_extract(content,'$.cycle_duration_ms') AS dur_ms,
                   json_extract(content,'$.personas_generated') AS pg,
                   json_extract(content,'$.errors_total') AS errs
FROM memories WHERE namespace='_curator/reports'
ORDER BY created_at DESC LIMIT 10;
```

## Why MTP didn't help

`OLLAMA_MLX_MTP_MAX_DRAFT_TOKENS` gates Ollama's MLX speculative
decoding path. Speculative decoding requires **both** a target model
and a smaller draft model: the draft predicts N tokens
autoregressively, then the target verifies all N predictions in one
parallel forward pass and keeps the prefix it agrees with. Without a
draft there is nothing to verify, so the env var has no effect.
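
To make the draft/verify loop concrete, here is a toy sketch of the greedy-verification variant of speculative decoding. It is not Ollama's or MLX's implementation; the "models" are hypothetical next-token functions, and production systems that sample usually verify with a rejection-sampling rule rather than exact match.

```python
def speculative_decode(target_next, draft_next, prompt, max_new_tokens, k):
    """Greedy speculative decoding over toy next-token functions.

    target_next / draft_next map a token list to the next token.
    Returns (generated_tokens, target_passes).
    """
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < max_new_tokens:
        # 1. Draft proposes k tokens autoregressively (k cheap sequential steps).
        proposal = []
        for _ in range(k):
            proposal.append(draft_next(out + proposal))

        # 2. Target scores all k+1 positions. A real backend does this in one
        #    batched forward pass; the toy loops but counts it as a single pass.
        target_passes += 1
        accepted = []
        for i in range(k + 1):
            expected = target_next(out + proposal[:i])
            accepted.append(expected)  # the target's own token is always kept
            if i == k or proposal[i] != expected:
                break  # first mismatch, or bonus token after full acceptance
        out.extend(accepted)
    return out[len(prompt):len(prompt) + max_new_tokens], target_passes


def count_up(ctx):
    """Toy target model: next token is previous + 1, modulo 10."""
    return (ctx[-1] + 1) % 10


def never_right(ctx):
    """Toy drafter whose proposals the target always rejects."""
    return -1


for label, draft, k in [("good draft", count_up, 4),
                        ("bad draft", never_right, 4),
                        ("no draft", count_up, 0)]:
    tokens, passes = speculative_decode(count_up, draft, [0], 10, k)
    print(f"{label:<10} tokens={tokens} target_passes={passes}")
```

All three runs emit the same ten tokens, because verification never changes the target's output. With a drafter that always agrees and `k=4`, those tokens take 2 target passes; with a drafter that never agrees, or with `k=0`, they take 10, which is plain one-token-per-pass decode. The last case mirrors the behavior measured here.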

Two possible draft paths exist:

1. **Integrated MTP heads on the target model** (DeepSeek-V3 style —
   single weight file, multiple prediction heads). Not present in
   `gemma4:e4b` GGUF on this host (`ollama show gemma4:e4b` lists no
   draft/speculative capability, and the bench shows zero speedup).
2. **A separate companion drafter.** `google/gemma-4-E4B-it-assistant`
   on HuggingFace fits this shape ("MTP is implemented by extending
   the base model with a smaller, faster draft model"), but it's
   safetensors-only — Ollama's pull path rejects it with `400: not
   GGUF or not compatible with llama.cpp`. Ollama's Library doesn't
   yet publish `gemma4:e4b-draft` or `gemma4:e4b-assistant` either.

So **enabling MTP via env on this Ollama + this model is a no-op**.
`ollama serve` doesn't error or warn; it silently runs vanilla decode.

## Paths forward

| Path | LoE | Outcome |
|---|---|---|
| **Wait for Ollama library to ship the drafter** | 0 (passive) | When `ollama pull gemma4:e4b-assistant` succeeds, re-run this bench and expect ~1.5–2× decode speedup per the #651 RFC analysis. |
| **Convert HF safetensors → GGUF via `llama.cpp/convert_hf_to_gguf.py`** | ~1 hr investigation, may fail on novel MTP-drafter architecture | Manual but unblocks today. Risk: llama.cpp may not recognize MTP head topology yet. |
| **Wait for #651 Phase 1 — `inference-mistralrs` or `inference-mlx` backends** | 1–2 weeks per #651 RFC | In-process MLX or mistralrs backends supporting MTP natively; bypasses Ollama's GGUF dependency for the drafter. |
| **Distilled hot-path model** | weeks (training/distill effort) | The #651 ULTRA-1 fallback strategy. <1B-param model for recall hot-path, full gemma4 reserved for autonomous tier. Targets sub-50 ms p95. |

For most ai-memory deployments today (curator cycles already
≤2 min at steady state, recall-tier doesn't hit Ollama), MTP is a
nice-to-have, not load-bearing. The substantive curator throughput
wins live in #651 Phase 1 backends (in-process Rust, no HTTP
serialization overhead) and Phase 3 distillation.

## Caveats

- **MLX backend check.** `lsof -p <ollama-serve-pid> | grep mlx`
  must show `libmlx.dylib` / `libmlxc.dylib` loaded for MTP to be
  reachable at all. Apple Silicon hosts get MLX auto-selected on
  Metal-4-capable chips (M2 Pro and later); Intel Macs / Linux NVIDIA
  hosts won't have an MLX path regardless of env.
- **20-call sample size** isn't statistically rigorous. For a real
  benchmark, run ≥ 200 calls per condition with warm-up, randomized
  prompt order, and a coefficient-of-variation guard. The 20-call
  result here is enough to rule out the ~1.5–2× speedup a working
  MTP path is expected to deliver (decode rates differ by about 1%),
  but it could not confidently resolve a true 5% speedup.
- **Ollama 0.24.0** is the version we tested. The `OLLAMA_MLX_MTP_*`
  family of env vars was added in 0.23+; behavior may evolve in
  later Ollama releases. Re-run after any Ollama upgrade.
- **Project hard rule on `/tmp`** — this bench writes all artifacts
  under `/Users/fate/v07/v07-fixes/.local-runs/mtp-bench-2026-05-17/`,
  including the Ollama serve log when launched with the MTP env.
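
The coefficient-of-variation guard named in the sample-size caveat could look like the sketch below: compute CV = stdev / mean over each condition's per-call wall times and refuse to compare runs that are too noisy. The 5% threshold and the function names are illustrative assumptions, not project policy.

```python
from statistics import mean, stdev


def coefficient_of_variation(samples):
    """Sample standard deviation divided by the mean (dimensionless)."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    return stdev(samples) / mean(samples)


def comparable(baseline_ms, candidate_ms, max_cv=0.05):
    """True when both runs are stable enough for their means to be compared."""
    return (coefficient_of_variation(baseline_ms) <= max_cv
            and coefficient_of_variation(candidate_ms) <= max_cv)


# Hypothetical wall times in ms; the second list includes an unload-style spike.
steady = [2900, 2910, 2890, 2905, 2895]
spiky = [2900, 2910, 2890, 2905, 11700]
print(comparable(steady, steady))  # stable runs pass the guard
print(comparable(steady, spiky))   # the spike pushes CV far above the threshold
```

A guard like this would have flagged the 11.7 s unload spike automatically instead of relying on a manual read of the JSONL.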

## Artifacts on disk

```
.local-runs/mtp-bench-2026-05-17/
├── bench.sh                     # the bench driver
├── baseline-pre-mtp.jsonl       # 20 calls, no MTP env
├── baseline-pre-mtp.summary.txt # model card + wall times
├── post-mtp.jsonl               # 20 calls, OLLAMA_MLX_MTP_MAX_DRAFT_TOKENS=4
├── post-mtp.summary.txt
└── ollama-serve-mtp.log         # MTP-env-enabled Ollama serve stdout/stderr
```

## Cross-references

- Issue [#651](https://github.com/alphaonedev/ai-memory-mcp/issues/651)
  — RFC: pluggable inference backend trait (multi-platform GPU
  acceleration for autonomous + ULTRA-1 tiers).
- PR [#820](https://github.com/alphaonedev/ai-memory-mcp/pull/820)
  — curator dispatch deadlock fix that landed in the same session;
  unblocks the end-to-end measurement (without that fix the
  curator daemon never reached `Connection::open` so cycle timings
  would never settle).
- v0.7.0 release notes: this doc joins the empirical-evidence set
  the v0.8+ inference RFC builds on.
