Testing Corpus — every test, every result, every insight¶
01
What's under test¶
Subject: ai-memory 0.6.3+patch.1 — Apache-2.0 substrate for AI-to-AI memory + coordination. Schema v19. 1,886 library tests, 93.84% line coverage upstream. The job of this campaign is to put a working substrate under realistic agent-loop pressure and write down what survives and what doesn't.
Three agent frameworks exercise the substrate under three different transports:
| Role | Agent | Details |
|---|---|---|
| Primary cert agent | IronClaw (Rust) | First-class agent. DigitalOcean 4-node mesh. Drives the Phase 3 NHI behavioral playbook (scenarios A–J) at n=3 across four control arms. |
| Cross-framework counterpart | Hermes (Python) | Counterpart to IronClaw on the same DO topology. Proves the substrate scenarios are framework-agnostic: the same scenarios pass on a different agent runtime. |
| First-class third agent | OpenClaw (Node.js) | Workstation 4-node Docker mesh, 16 GB / `openclaw` container. Backed by xAI Grok 4.3. 3-green substrate streak + Tier 1–4 behavioral assessment (52 probes, 8 phases). |
02
The five evidence streams¶
Per Principle 1 (two truth-claims, two evidence streams, never conflated), the campaign produces independently auditable artifacts. The substrate side answers "does the code work?"; the behavioral side answers "do agents actually use it well?". Each stream is layered into tiers; higher-tier instruments produce stronger signal but cost more to run.
| Stream | Instrument | What it shows |
|---|---|---|
| 1 · Substrate cert | Phase 0 + 1, Testbook v3.0.0, 35 scenarios per run | Covers MCP stdio + HTTP REST + federation + audit + governance + KG. Three consecutive green runs is the cert criterion. 9/9 streaks complete this release (3 ironclaw + 3 hermes + 3 openclaw). |
| 2 · Phase 3 NHI playbook | Behavioral matrix, 4 arms × 10 scenarios × n=3 | Scenarios A–J at four control arms (cold / isolated / stubbed / treatment). Per-cell grounding rate, hallucination rate, recall hit rate, treatment-vs-control attribution. Phase 4 meta-analysis is independent (third Claude, no namespace access). |
| 3 · Forensic audit | Hash-chain + tamper detection + append-only | S25 (audit chain), S26 (byte-mutation tamper detection), S27 (OS append-only). Audit-log integrity proof, legally reproducible. |
| 4 · OpenClaw behavioral (Tier 1–4) | 52 probes × 3 agents × 8 phases | Qualitative awareness, quantitative recall@k, cross-session durability ablation, Byzantine peer trust calibration, tool-surface discovery, RoadMap recommendations, soft-restart + hard-restart context recovery. |
| 5 · Network + MCP roundtrip | DNS + TLS + xAI live + ai-memory MCP per node | Per-container egress through the CCC firewall. xAI Grok 4.3 → `openclaw agent --local` → MCP stdio → ai-memory write/read with `quorum_acks=2`. Roundtrip ~46 s, ~25K tokens. |
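Stream 3's integrity claim (S25 audit chain, S26 byte-mutation tamper detection) reduces to a verifiable hash chain: each entry commits to its predecessor's hash, so a single flipped byte breaks every later link. A minimal sketch, assuming an illustrative record shape (not ai-memory's actual audit schema):

```python
import hashlib
import json

def entry_hash(prev_hash: str, payload: dict) -> str:
    # Each entry commits to its predecessor's hash and its own payload.
    blob = prev_hash.encode() + json.dumps(payload, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

def verify_chain(entries: list[dict]) -> bool:
    # Recompute every link; any mutation invalidates the chain from that point on.
    prev = "0" * 64  # genesis value
    for e in entries:
        if e["hash"] != entry_hash(prev, e["payload"]):
            return False
        prev = e["hash"]
    return True

# Build a tiny valid chain, then tamper with one payload field (S26-style).
chain, prev = [], "0" * 64
for i in range(3):
    payload = {"seq": i, "op": "memory_store"}
    h = entry_hash(prev, payload)
    chain.append({"payload": payload, "hash": h})
    prev = h

ok_before = verify_chain(chain)          # True: untouched chain verifies
chain[1]["payload"]["op"] = "memory_X"   # byte-level mutation mid-chain
ok_after = verify_chain(chain)           # False: tamper detected
print(ok_before, ok_after)
```

The append-only property (S27) is enforced at the OS level and is orthogonal to this check.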
03
The headline numbers (with receipts)¶
| Metric | Result | Detail | Receipt |
|---|---|---|---|
| Recall fidelity | 1.000 | recall@1 over 18 trials (6 queries × 3 agents) against a 52-memory pre-seeded corpus | JSON |
| Cross-session durability | 1.000 | Token-keyed write in session α, recall in fresh session β; n=3 agents | Phase 3 |
| Trust calibration | 1.000 | Byzantine peer test: alice priority=10 / conf=1.0 (MongoDB) vs bob priority=3 / conf=0.4 (Cassandra); 3/3 agents picked correctly and cited trust signals | Phase 5 |
| Substrate cert (openclaw) | 35/35 | Three consecutive green runs, 0 failure reasons | Cert doc |
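The trust-calibration result can be reduced to a small scoring rule over the surfaced signals. The weighting below is an illustrative assumption: the campaign shows agents weighed priority and confidence when picking between conflicting peers, not that they computed this exact product.

```python
def trust_score(mem: dict) -> float:
    # Illustrative weighting (assumption): peer priority (1-10) scaled
    # to [0, 1], multiplied by the writer's stated confidence (0-1).
    return (mem["priority"] / 10.0) * mem["confidence"]

# The Phase 5 Byzantine setup: two peers assert conflicting datastore facts.
alice = {"agent_id": "ai:alice", "priority": 10, "confidence": 1.0, "claim": "MongoDB"}
bob   = {"agent_id": "ai:bob",   "priority": 3,  "confidence": 0.4, "claim": "Cassandra"}

winner = max((alice, bob), key=trust_score)
print(winner["claim"])  # MongoDB (score 1.0 vs 0.12)
```

Any monotone combination of the two signals picks the same winner here; the test is about whether agents consult the signals at all, not the exact formula.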
04
What the testing uncovered (the gap that informs v0.6.4)¶
After running concrete tasks — cross-session recall, multi-agent collaboration, conflicting-memory resolution, KG reasoning — three independent agents converged on the same top-3 capability gaps:
- **RoadMap signal #1 — auto-suggest `memory_link` during/after `memory_store`.** Manual linking is the biggest workflow friction in KG reasoning + multi-agent collab. Filed ai-memory-mcp #517; v0.6.4 Track G-AX (lightweight) + v0.7 Bucket 0 R3 (full daemon-mode hook).
- **RoadMap signal #2 — session-aware `memory_recall` defaults + auto-cue.** Closes the Phase 9 organic-no-cue failure case by converting the cue into a default: the agent runtime injects `memory_recall` results into the system prompt at session start, with no agent decision required. Filed #518.
- **RoadMap signal #3 — proactive conflict detection inside `memory_store`.** Surfaces conflicts at write time with `merge_strategy` suggestions (replace / link.supersedes / link.contradicts / consolidate), eliminating the post-hoc detect-and-resolve round trip. Filed #519.
All three issues are filed against ai-memory-mcp milestone v0.6.4 (sprint window 2026-05-04 → 2026-05-08). Behavioral evidence directly drove the v0.6.4 sprint scope (Track G-AX).
05
Methodology — how this campaign is run¶
Every claim above is reproducible. The methodology is documented at three depths; pick the one that fits:
| Audience | Read |
|---|---|
| Decision-maker (15 min) | Why this campaign exists, the 60-second pitch, and the headline numbers above |
| Reviewer (30 min) | Methodology, Scope, Governance — the First-Principles design |
| Operator (1 hour) | Reproducing on DO, Local Docker mesh reproducibility, Testbook v3.0.0, Every test performed |
Reproducibility floor: every campaign run lives under runs/ with a2a-summary.json, campaign.meta.json, a2a-baseline.json, f3-peer-a2a.json, per-scenario scenario-N.json + .log, and (Phase 3+ runs only) phase2-orchestration.json, phase3-*.json, phase4-analysis.json. The runs/ index now also surfaces per-framework subtotals + cross-framework instruments overview.
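The reproducibility floor above is mechanically checkable. A sketch of such a check (artifact filenames taken from the list above; the directory layout and helper name are assumptions):

```python
import tempfile
from pathlib import Path

# Required artifacts per the reproducibility floor, as listed above.
REQUIRED = {
    "a2a-summary.json", "campaign.meta.json",
    "a2a-baseline.json", "f3-peer-a2a.json",
}
PHASE3_EXTRA = {"phase2-orchestration.json", "phase4-analysis.json"}

def missing_artifacts(run_dir: Path, phase3: bool = False) -> set[str]:
    # Returns the names of required artifacts absent from a run directory.
    required = REQUIRED | (PHASE3_EXTRA if phase3 else set())
    present = {p.name for p in run_dir.glob("*.json")}
    gaps = required - present
    # Phase 3+ runs must also carry at least one phase3-*.json artifact.
    if phase3 and not list(run_dir.glob("phase3-*.json")):
        gaps.add("phase3-*.json")
    return gaps

# Demo against a throwaway run directory with only the summary present.
run = Path(tempfile.mkdtemp())
(run / "a2a-summary.json").write_text("{}")
gaps = missing_artifacts(run)
print(sorted(gaps))  # the three still-missing required files
```

Per-scenario `scenario-N.json` + `.log` pairs would need a scenario count to validate, so they are omitted from this sketch.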
Governance floor: scope-tagged artifacts (scope=ironclaw / scope=hermes / scope=openclaw) join the umbrella v0.6.3.1 release via release-tag linkage only. Cross-framework data is never collapsed into a single verdict per Principle 6 (scope discipline).
06
How this informs v0.6.4 + beyond¶
The substrate side is proven for the use cases tested; the agent side is gated on prompt design. The next level of value is lowering the cue threshold: making the agent reach for ai-memory more often, with less friction.
| Finding | RoadMap consequence | Where to track |
|---|---|---|
| recall + durability + trust calibration all = 1.000 | Substrate is production-ready for the agent-side use cases tested | Cert doc |
| Organic-no-cue recovery 0/1; cued recovery 1/1 | Highest-leverage RoadMap item — converts the failure case to a default success | ai-memory-mcp #518, v0.6.4-G2 |
| Three-agent unanimous: manual `memory_link` is the biggest workflow friction | v0.6.4 Track G-AX (lightweight, response-field) + v0.7 Bucket 0 R3 (full daemon-mode hook) | #517 |
| Trust signals (priority/confidence/agent_id/tier/tags) weighted correctly when surfaced | Surface them at write time instead of post-hoc | #519 |
| OpenClaw 2026.4.x → 2026.5.x config schema breaking change | Documented openly; cert harness updated; no production blocker | agents/openclaw |
| ai-memory-a2a-v0.6.3.1 mesh state-reset was non-functional (in-place `rm` during running daemon) | Replaced with `docker compose down -v && up -d`; documented in cert PR #46 | PR #46 |
For the full reconciled roadmap (substrate + behavioral findings + v0.6.4 sprint scope), see ai-memory-mcp/ROADMAP2.md.
07
Honest reality findings¶
Five things we wrote down honestly that an "everything is great" deck wouldn't:
- OpenClaw 2026.5.x has a breaking config-schema change. The repo's existing `entrypoint.sh` `openclaw.json` shape is rejected by both 2026.4.22 and 2026.5.2. The modern config is gateway-centric; we documented the validated `openclaw onboard --auth-choice xai-api-key` recipe rather than glossing over it.
- The fictional `openclaw run` flag set never existed, in neither 2026.4.22 nor 2026.5.2. The repo's `drive_agent.sh` openclaw branch silently fell back to HTTP. Substrate scenarios passed without a working openclaw runtime: true but uncomfortable, and now named.
- Identity propagation is not automatic. The container env carries `AGENT_ID=ai:alice|bob|charlie`, but the OpenClaw `agent --local` runtime does not read it. MCP write metadata is correct (the env in `mcpServers.memory.env` flows through), but the LLM's verbal self-reference can drift. Logged.
- Mesh state-reset via in-place `docker exec rm -f a2a.db*` does not actually clean the volume. `serve` holds open WAL handles; `rm` only unlinks directory entries; quorum-resync from peer nodes can repopulate. The first openclaw r1 attempt failed 21/35 from this; cleaned up via `docker compose down -v && up -d` and re-ran 35/35 GREEN.
- The substrate verdict in `releases/v0.6.3.1/summary.json` is `pending`, not `cert`. The campaign is not done. `expected_red` for S23 (#507) and S24 (#318) is documented; both flip to expected-green at v0.6.3.2 (Patch 2).
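The state-reset failure mode is reproducible outside the mesh entirely: on POSIX systems, unlinking a database file does not affect a process that already holds it open. A minimal sketch using plain sqlite3 (no ai-memory involved; the filename mirrors the mesh's `a2a.db` for flavor):

```python
import os
import sqlite3
import tempfile

# Stand-in for the daemon: a live connection to a WAL-mode database.
d = tempfile.mkdtemp()
conn = sqlite3.connect(os.path.join(d, "a2a.db"))
conn.execute("PRAGMA journal_mode=WAL")
conn.execute("CREATE TABLE mem (k TEXT, v TEXT)")
conn.execute("INSERT INTO mem VALUES ('alpha', '1')")
conn.commit()

# "In-place reset": unlink the db/-wal/-shm files while the connection lives,
# analogous to `docker exec rm -f a2a.db*` against a running serve process.
for f in os.listdir(d):
    os.remove(os.path.join(d, f))

# The open handle still sees the old state via its unlinked inodes.
rows = conn.execute("SELECT count(*) FROM mem").fetchone()[0]
print(rows)  # 1 -- the "reset" cleared nothing the daemon can see
conn.close()
```

This is exactly why `docker compose down -v && up -d` works: it stops the process (releasing the handles) and destroys the volume, rather than racing a live writer.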
If anything on this page contradicts the JSON in the run artifacts, trust the JSON. Open an issue.
Reading order¶
- New to the campaign → start at the home page
- Want the verdict → latest-run NHI insights + per-run NHI matrix
- Want the receipts → Campaign runs + per-run evidence pages
- Want to reproduce → Reproducing (DO) or Local Docker mesh (workstation)
- Want to read the design → Methodology, Scope, Governance
— Authored 2026-05-04 by AI NHI (Claude Opus 4.7 1M) on behalf of AlphaOne LLC.