Skip to content

Testing Corpus — every test, every result, every insight

CAMPAIGN-LEVEL ORIENTATION
This page is the single entry point to the v0.6.3.1 testing corpus. Every test we ran, what it measures, what it produced, where the receipts live. **Five evidence streams, three frameworks, one substrate under test.** Every claim links to a committed artifact.
9 / 9Substrate runs (3 frameworks × 3 streaks)
52Behavioral probes (Tier 1–4)
1.000recall@1 (n=18)
1.000Cross-session durability (n=3)
1.000Trust calibration (n=3)
3RoadMap signals filed → v0.6.4

01

What's under test

Subject: ai-memory 0.6.3+patch.1 — Apache-2.0 substrate for AI-to-AI memory + coordination. Schema v19. 1,886 library tests, 93.84% line coverage upstream. The job of this campaign is to put a working substrate under realistic agent-loop pressure and write down what survives and what doesn't.

Three agent frameworks exercise the substrate under three different transports:

Primary cert agent IronClaw (Rust) First-class agent. DigitalOcean 4-node mesh. Drives the Phase 3 NHI behavioral playbook (scenarios A–J) at n=3 across four control arms. Setup · Runs

Cross-framework counterpart Hermes (Python) Counterpart to IronClaw on the same DO topology. Proves substrate scenarios are framework-agnostic — same scenarios pass on different agent runtimes. Setup · Runs

First-class third agent OpenClaw (Node.js) Workstation 4-node Docker mesh, 16 GB / openclaw container. xAI Grok 4.3 backed. 3-green substrate streak + Tier 1–4 behavioral assessment (52 probes, 8 phases). Setup · Runs · Behavioral


02

The five evidence streams

Per Principle 1 (two truth-claims, two evidence streams, never conflated), the campaign produces independently-auditable artifacts. Substrate-side answers "does the code work?". Behavioral-side answers "do agents actually use it well?". Each stream is layered into tiers; higher-tier instruments produce stronger signal but cost more to run.

Stream 1 · Substrate cert Phase 0 + 1 — Testbook v3.0.0 substrate 35 scenarios per run covering MCP stdio + HTTP REST + federation + audit + governance + KG. 3-consecutive-green is the cert criterion. 9 / 9 streaks complete this release (3 ironclaw + 3 hermes + 3 openclaw). 3-GREEN × 3 Testbook v3.0.0 · Runs

Stream 2 · Phase 3 NHI playbook Behavioral matrix — 4 arms × 10 scenarios × n=3 Scenarios A–J at four control arms (cold / isolated / stubbed / treatment). Per-cell grounding rate, hallucination rate, recall hit rate, treatment-vs-control attribution. Phase 4 meta-analysis is independent (third Claude, no namespace access). Assessor · Per-run matrix · Insights

Stream 3 · Forensic audit Hash-chain + tamper detection + append-only S25 (audit chain), S26 (byte-mutation tamper detection), S27 (OS append-only). Audit log integrity proof — legally reproducible. Audit trail · Per-run forensics

Stream 4 · OpenClaw behavioral (Tier 1–4) 52 probes × 3 agents × 8 phases Qualitative awareness, quantitative recall@k, cross-session durability ablation, Byzantine peer trust calibration, tool-surface discovery, RoadMap recommendations, soft-restart + hard-restart context recovery. recall@1=1.000 Full report

Stream 5 · Network + MCP roundtrip DNS + TLS + xAI live + ai-memory MCP per node Per-container egress through CCC firewall. xAI Grok 4.3 → openclaw agent --local → MCP stdio → ai-memory write/read with quorum_acks=2. Roundtrip ~46 s, ~25K tokens. Tracking issue #45


03

The headline numbers (with receipts)

recall fidelity 1.000 recall@1 over 18 trials (6 queries × 3 agents) against a 52-memory pre-seeded corpus. JSON

cross-session durability 1.000 Token-keyed write in session α, recall in fresh session β. n=3 agents. Phase 3

trust calibration 1.000 Byzantine peer test: alice priority=10/conf=1.0 (MongoDB) vs bob priority=3/conf=0.4 (Cassandra). 3 / 3 agents picked correct + cited trust signals. Phase 5

substrate cert (openclaw) 35 / 35 Three consecutive green runs, 0 failure reasons. Cert doc


04

What the testing uncovered (the gap that informs v0.6.4)

THE SINGLE HIGHEST-LEVERAGE SIGNAL
Phase 9 organic-no-cue recovery: 0 / 1. Without an explicit cue ("memory_recall on namespace=…") the agent confabulated bootstrap activity instead of reaching for memory. Cued recovery: 100%. The data was always there. Cue language gates the agent's decision to invoke memory tools, not data availability.

After running concrete tasks — cross-session recall, multi-agent collaboration, conflicting-memory resolution, KG reasoning — three independent agents converged on the same top-3 capability gaps:

RoadMap signal #1 Auto-suggest memory_link during/after memory_store Manual linking is the biggest workflow friction in KG reasoning + multi-agent collab. Filed ai-memory-mcp #517; v0.6.4 Track G-AX (lightweight) + v0.7 Bucket 0 R3 (full daemon-mode hook). v0.6.4-G1

RoadMap signal #2 Session-aware memory_recall defaults + auto-cue Closes the Phase 9 organic-no-cue failure case by converting the cue to a default — agent runtime injects memory_recall results into the system prompt at session start, no agent decision required. Filed #518. v0.6.4-G2

RoadMap signal #3 Proactive conflict detection inside memory_store Surfaces conflicts at write time with merge_strategy suggestions (replace / link.supersedes / link.contradicts / consolidate). Eliminates the post-hoc detect+resolve round trip. Filed #519. v0.6.4-G3

All three issues are filed against ai-memory-mcp milestone v0.6.4 (sprint window 2026-05-04 → 2026-05-08). Behavioral evidence directly drove the v0.6.4 sprint scope Track G-AX.


05

Methodology — how this campaign is run

Every claim above is reproducible. The methodology is named in three places — pick the depth you want:

Audience Read
Decision-maker (15 min) Why this campaign exists, the 60-second pitch, and the headline numbers above
Reviewer (30 min) Methodology, Scope, Governance — the First-Principles design
Operator (1 hour) Reproducing on DO, Local Docker mesh reproducibility, Testbook v3.0.0, Every test performed

Reproducibility floor: every campaign run lives under runs/ with a2a-summary.json, campaign.meta.json, a2a-baseline.json, f3-peer-a2a.json, per-scenario scenario-N.json + .log, and (Phase 3+ runs only) phase2-orchestration.json, phase3-*.json, phase4-analysis.json. The runs/ index now also surfaces per-framework subtotals + cross-framework instruments overview.

Governance floor: scope-tagged artifacts (scope=ironclaw / scope=hermes / scope=openclaw) join the umbrella v0.6.3.1 release via release-tag linkage only. Cross-framework data is never collapsed into a single verdict per Principle 6 (scope discipline).


06

How this informs v0.6.4 + beyond

The substrate side is proven for the use cases tested. The agent-side is gated on prompt design. The next-level value is lowering the cue threshold — making the agent reach for ai-memory more often, with less friction.

Finding RoadMap consequence Where to track
recall + durability + trust calibration all = 1.000 Substrate is production-ready for the agent-side use cases tested Cert doc
Organic-no-cue recovery 0/1; cued recovery 1/1 Highest-leverage RoadMap item — converts the failure case to a default success ai-memory-mcp #518, v0.6.4-G2
Three-agent unanimous: manual memory_link is the biggest workflow friction v0.6.4 Track G-AX (lightweight, response-field) + v0.7 Bucket 0 R3 (full daemon-mode hook) #517
Trust signals (priority/confidence/agent_id/tier/tags) weighted correctly when surfaced Surface them at write time instead of post-hoc #519
OpenClaw 2026.4.x → 2026.5.x config schema breaking change Documented openly; cert harness updated; no production blocker agents/openclaw
ai-memory-a2a-v0.6.3.1 mesh state-reset was non-functional (in-place rm during running daemon) Replaced with docker compose down -v && up -d; documented in cert PR #46 PR #46

For the full reconciled roadmap (substrate + behavioral findings + v0.6.4 sprint scope), see ai-memory-mcp/ROADMAP2.md.


07

Honest reality findings

Five things we wrote down honestly that an "everything is great" deck wouldn't:

  1. OpenClaw 2026.5.x has a breaking config-schema change. The repo's existing entrypoint.sh openclaw.json shape is rejected by both 2026.4.22 and 2026.5.2. Modern config is gateway-centric; we documented the validated openclaw onboard --auth-choice xai-api-key recipe rather than glossing.
  2. The fictional openclaw run flag set never existed. Neither in 2026.4.22 nor 2026.5.2. The repo's drive_agent.sh openclaw branch silently fell back to HTTP. Substrate scenarios passed without a working openclaw runtime — true but uncomfortable, and now named.
  3. Identity propagation is not automatic. Container env carries AGENT_ID=ai:alice|bob|charlie but the OpenClaw agent --local runtime does not read it. MCP write metadata is correct (the env in mcpServers.memory.env flows through), but the LLM's verbal self-reference can drift. Logged.
  4. Mesh state-reset via in-place docker exec rm -f a2a.db* does not actually clean the volume. serve holds open WAL handles; rm only unlinks directory entries; quorum-resync from peer nodes can repopulate. First openclaw r1 attempt failed 21/35 from this; cleaned up via docker compose down -v && up -d and re-ran 35/35 GREEN.
  5. Substrate verdict in releases/v0.6.3.1/summary.json is pending. Not cert. The campaign is not done. expected_red for S23 (#507) and S24 (#318) is documented; both flip to expected-green at v0.6.3.2 (Patch 2).

If anything on this page contradicts the JSON in the run artifacts, trust the JSON. Open an issue.


Reading order

— Authored 2026-05-04 by AI NHI (Claude Opus 4.7 1M) on behalf of AlphaOne LLC.