Methodology
Test environment (intentionally narrow)
The discovery gate runs in a Docker mesh: the same containers that delivered the v0.6.3.1 A2A campaign's 9/9 substrate streaks (imported verbatim from alphaonedev/ai-memory-a2a-v0.6.3.1).
| Harness | Image | Memory | Bridge |
|---|---|---|---|
| OpenClaw | openclaw-discovery:v0.6.4 | 16GB | 10.88.1.0/24 |
That's it. OpenClaw only. IronClaw and Hermes are out of scope per the v0.6.4 product directive — the goal is not a massive epic of testing; it's a focused gate against the most common eager-loading harness combination.
LLM coverage (also tight)
| LLM | Provider | Endpoint |
|---|---|---|
| xAI Grok 4.3 | xAI | api.x.ai/v1 (OpenAI-Responses-compatible) |
xAI Grok 4.3 only. Claude / GPT / Gemini are out of scope for v0.6.4 — they're future-campaign work. The Grok 4.3 + OpenClaw pairing was chosen because:
- It's the simplest API to wire (OpenAI-Responses-compatible — no Anthropic-specific tool-use SDK, no Google-specific function-calling shape)
- It's already proven by the v0.6.3.1 A2A campaign; drive_agent.sh already speaks this combination
- xAI's adoption curve is the steepest among the eager-loading harnesses; the discovery dance must work there first
The API key is passed via the XAI_API_KEY environment variable. The compose stack reads .env at the repo root; the key never enters container layers or transcripts.
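The key-handling rule above can be sketched as a fail-fast loader plus a redaction step. This is only an illustration: `load_xai_key` and `redact` are hypothetical helper names, and redacting on write is an assumed way of keeping the key out of transcripts, not something the gate is documented to do.

```python
import os


def load_xai_key(env=None):
    """Return the key from XAI_API_KEY, failing fast if it is missing."""
    env = os.environ if env is None else env
    key = env.get("XAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("XAI_API_KEY is not set; add it to .env at the repo root")
    return key


def redact(text, key):
    """Replace every occurrence of the key before text is written to a transcript."""
    return text.replace(key, "[REDACTED]")
```

Passing `env` explicitly keeps the loader testable without touching the real process environment.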
DB baseline
fixtures/corpus/v0.6.3.1-baseline.db.gz — gzipped SQLite at schema v19. Restored to a fresh tempfile per cell, opened by the v0.6.4 binary, migrated to v20 on first open. Contents:
- 17 memories spanning Project Alpha / Project Beta / Project Gamma / Project Aurora namespaces
- 3 memory_links (alpha→beta, beta→gamma, alpha→gamma; gives T2 an actual graph path to find)
- Several near-duplicate Project Aurora memories (T3 consolidation)
- 9 memories tagged mesh-coordination-test (T4)
A green run validates migration v19 → v20 + discovery mechanisms in one pass.
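The per-cell restore step could look roughly like the sketch below, which decompresses the baseline into a fresh temporary file and counts rows as a sanity check. `restore_baseline` and `count_rows` are hypothetical names, and the table name `memories` in the usage note is an assumption (only `memory_links` is named above); the migration itself is performed by the v0.6.4 binary and is not modelled here.

```python
import gzip
import shutil
import sqlite3
import tempfile
from pathlib import Path


def restore_baseline(gz_path):
    """Decompress the gzipped baseline into a fresh temp file and return its path."""
    tmp = tempfile.NamedTemporaryFile(suffix=".db", delete=False)
    with gzip.open(gz_path, "rb") as src, tmp:
        shutil.copyfileobj(src, tmp)
    return Path(tmp.name)


def count_rows(db_path, table):
    """Count rows in one table; `table` must be a trusted, hard-coded name."""
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    finally:
        conn.close()
```

A cell might call `restore_baseline("fixtures/corpus/v0.6.3.1-baseline.db.gz")` and, if the table names match, check that `count_rows(path, "memories")` returns 17 and `count_rows(path, "memory_links")` returns 3 before handing the file to the binary.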
Pass criteria
T1 — Awareness
agent_called_capabilities = true AND
families_surfaced ≥ 6 of 8 AND
families_surfaced correctly distinguishes loaded vs not-loaded
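As a sketch, the T1 rule can be written as a predicate over one cell's observations. The `T1Observation` type and its field names are illustrative assumptions, not the gate's actual record format.

```python
from dataclasses import dataclass


@dataclass
class T1Observation:
    agent_called_capabilities: bool
    families_surfaced: int       # families the agent named, out of 8
    loaded_state_correct: bool   # loaded vs not-loaded reported correctly


def t1_pass(obs, min_families=6):
    """All three T1 conditions must hold for the cell to pass."""
    return (obs.agent_called_capabilities
            and obs.families_surfaced >= min_families
            and obs.loaded_state_correct)
```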
T2 — Reactive recovery
agent_received_tool_not_found = true AND
(agent_called_include_schema = true OR agent_completed_task_via_operator_action = true)
T3 — Proactive expansion
agent_called_capabilities BEFORE first power-family attempt = true AND
(agent_called_include_schema = true OR agent_completed_task_via_operator_action = true)
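The ordering requirement in T3 is the part that needs a transcript rather than a flag. A minimal sketch, assuming each cell yields an ordered list of `(kind, family)` events; the event kinds `"capabilities"` and `"tool_call"` and the family names are hypothetical.

```python
def t3_pass(events, power_families, called_include_schema, completed_via_operator):
    """events: ordered (kind, family) tuples taken from one cell transcript."""
    first_caps = next(
        (i for i, (kind, _) in enumerate(events) if kind == "capabilities"), None)
    first_power = next(
        (i for i, (kind, fam) in enumerate(events)
         if kind == "tool_call" and fam in power_families), None)
    if first_caps is None:
        return False
    # Assumption: with no power-family attempt at all, the ordering clause
    # is treated as satisfied.
    asked_first = first_power is None or first_caps < first_power
    return asked_first and (called_include_schema or completed_via_operator)
```

How a cell with no power-family attempt should be scored is not stated in the criteria; the sketch makes one choice and flags it in the comment.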
T4 — Mesh recovery
agents_completed_coordination_task ≥ 1 (mesh of 3 OpenClaw agents at mixed profiles)
no agent fabricated results
no agent gave up silently
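Reading the three T4 lines as conditions that must all hold, a check over the mesh could be sketched as follows. The per-agent dictionary keys are assumed names for illustration.

```python
def t4_pass(agents):
    """agents: one dict per mesh agent with boolean flags
    'completed', 'fabricated', and 'gave_up_silently'."""
    completed = sum(1 for a in agents if a["completed"])
    return (completed >= 1
            and not any(a["fabricated"] for a in agents)
            and not any(a["gave_up_silently"] for a in agents))
```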
Aggregate verdict
A run is GATE GREEN when:
T1 pass rate ≥ 90% across cells
T2 pass rate ≥ 80%
T3 pass rate ≥ 50%
T4 pass rate ≥ 66%
Per-cell verdict at docs/runs/<date>/cells/. Aggregate at docs/runs/<date>/verdict.md.
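The aggregation over cells can be sketched with integer arithmetic so the percentage bars are compared exactly. The cell record shape (`tier`, `passed`), the `"GATE RED"` label, and the choice to fail a tier that has no cells are all assumptions made for this example.

```python
from collections import Counter

# Pass bars from the aggregate verdict, in percent.
THRESHOLD_PCT = {"T1": 90, "T2": 80, "T3": 50, "T4": 66}


def gate_verdict(cells):
    """cells: dicts like {"tier": "T1", "passed": True}.
    Returns (label, list of tiers below their bar)."""
    total = Counter(c["tier"] for c in cells)
    passed = Counter(c["tier"] for c in cells if c["passed"])
    failing = [
        tier for tier, pct in THRESHOLD_PCT.items()
        # Assumption: a tier with no cells cannot count as green.
        if total[tier] == 0 or passed[tier] * 100 < pct * total[tier]
    ]
    return ("GATE GREEN" if not failing else "GATE RED"), failing
```

With three T4 cells, two passes clear the 66% bar (200 ≥ 198) while one pass does not.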
Gating into ai-memory-mcp release flow
For v0.6.4: the gate runs post-tag during the soak window. Public-channel announcement is blocked if any tier falls below its pass bar. For v0.6.5+: the gate runs pre-tag via .github/workflows/discovery-gate.yml.
Out of scope for this gate
- IronClaw + Hermes harness coverage (substrate cert in v0.6.3.1 A2A campaign already proved them; discovery dance is harness-agnostic at the protocol layer)
- Claude / GPT / Gemini coverage (multi-LLM gates are v0.6.5+ work)
- The 105-scenario substrate suite from v0.6.3.1 — substrate cert is the v0.6.3.1 A2A campaign's job
- Performance benchmarks: separate, live at ai-memory-mcp/benchmarks/
- Token-cost claims: pinned by ai-memory-mcp/benchmarks/v0.6.4-cross-harness.md
The gate's only question is: does Grok 4.3 + OpenClaw use the discovery mechanisms ai-memory v0.6.4 ships with?