A2A test book — full spectrum¶
The formal test plan for maximum-coverage AI-to-AI integration testing through ai-memory.
Covers same-node A2A (two agent identities sharing a single droplet's ai-memory) and between-node A2A (agents on distinct droplets in the W=2/N=4 federation mesh) across every ai-memory primitive relevant to agent-to-agent coordination.
Test book version: 3.0.0 (2026-04-21) Changelog: - 3.0.0 — 100% A2A coverage: transport security (S20-S21), adversarial + cross-framework (S22-S27), every uncovered MCP primitive (S28-S37), every HTTP-only endpoint (S38-S42). 42 scenarios across 9 suites. Baseline bumped to v1.4.0 (adds F6 TLS handshake + F7 mTLS enforcement probes). Tracked in EPIC #14. - 2.0.0 — full-spectrum expansion to 18 scenarios across 6 suites; adds mutation / lifecycle / resilience / observability / same-node A2A coverage - 1.0.0 — initial 8-scenario core
Baseline v1.4.0 must pass before any scenario runs. The test book assumes baseline green. F6 + F7 probes gate baseline_pass when tls_mode ≥ tls and tls_mode = mtls respectively.
1. Nine test suites, 42 scenarios¶
The scenario corpus is partitioned into suites so failures are diagnostically localized. A red result in Suite A tells a different story than a red in Suite G.
| Suite | Scenarios | What it proves |
|---|---|---|
| A. Core A2A | S1, S1b, S2 | Basic read/write visibility across agents; shared-context handoff |
| B. A2A Primitives | S3, S5, S6, S11 | memory_share, memory_consolidate, memory_detect_contradiction, memory_link — the tools agents reach for explicitly |
| C. Mutation + Lifecycle | S9, S10, S16 | memory_update, memory_delete / memory_forget, tier promotion across agents |
| D. Scope + Governance | S7, S8, S12 | Scope enforcement matrix, auto-tag pipeline, agent registration (Task 1.3) |
| E. Resilience + Observability | S13, S14, S15, S17, S18 | Concurrent contention, partition tolerance, read-your-writes, stats consistency, semantic query expansion |
| F. Topology variants | S4, S19 | Quorum-under-load, SAME-NODE A2A (two agents sharing one droplet's ai-memory) |
| G. Transport + Adversarial + Cross-framework (v3.0.0) | S20, S21, S22, S23, S24, S25, S26, S27 | mTLS happy-path, anonymous rejected, identity spoofing, malicious content fuzz, Byzantine peer, clock skew, mixed ironclaw+hermes, OpenClaw regression |
| H. Uncovered primitives — 100% MCP surface (v3.0.0) | S28, S29, S30, S31, S32, S33, S34, S35, S36, S37 | memory_search, archive lifecycle, capabilities, gc, inbox+notify, pub/sub, pending governance, namespace standards, session_start, get_links bidirectional |
| I. HTTP-only endpoints — 100% REST surface (v3.0.0) | S38, S39, S40, S41, S42 | /export+/import, /sync/since delta sync, /memories/bulk, /metrics, /namespaces |
Coverage scorecard (v3.0.0 target): 37/37 MCP tools exercised, 27/27 HTTP endpoints exercised, 3 transport modes (http/tls/mtls), 3 agent groupings (ironclaw/hermes/mixed). Suites A–F are the v2.0.0 baseline; G–I are new in v3.0.0.
2. Scenario register¶
Each row is a one-line contract. Full per-scenario plans follow in §4.
| # | Name | Suite | Runnable on v0.6.0? | Primary invariant |
|---|---|---|---|---|
| S1 | Per-agent write + read (MCP-stdio) | A | ✅ (RED expected until #318 fix ships) | MCP write visible on local node |
| S1b | Per-agent write + read (HTTP) | A | ✅ | Quorum-HTTP write visible on peers |
| S2 | Shared-context handoff | A | ✅ | Writer's memory readable by recipient within quorum settle |
| S3 | Targeted memory_share |
B | ⚠️ Requires v0.6.0.1 (#311) | Subset sync lands on exactly the targeted peer |
| S4 | Federation-aware concurrent writes | F | ✅ | Quorum preserved under N-agent concurrent write burst |
| S5 | Consolidation + curation | B | ✅ (HTTP-direct; LLM-consolidation variant deferred) | consolidated_from_agents metadata preserved |
| S6 | Contradiction detection | B | ✅ if memory_detect_contradiction in v0.6.0; else deferred |
Contradicting memories produce a link visible to third agent |
| S7 | Scope visibility matrix | D | ⚠️ Partial (Task 1.5 scope enforcement ongoing) | Each (scope, caller_scope) pair produces the correct visibility |
| S8 | Auto-tagging round-trip | D | ⚠️ Requires Ollama-backed droplets (s-4vcpu-16gb) | Agent writes without tags; tags appear; recall-by-tag works |
| S9 | Mutation round-trip | C | ✅ | memory_update from agent A visible with new content on agent B |
| S10 | Deletion propagation | C | ✅ | memory_delete / memory_forget propagates to all peers |
| S11 | Link integrity | B | ✅ if memory_link exposed via HTTP; else documented-only |
Linked memories returned together on peer query |
| S12 | Agent registration | D | ✅ (Task 1.3 shipped in v0.6.0) | memory_agent_register on A is visible to B's memory_agent_list |
| S13 | Concurrent write contention | E | ✅ | Two agents updating the same row converge to a consistent outcome |
| S14 | Partition tolerance | E | ✅ | Temporary peer loss → recovery → convergence within bounded time |
| S15 | Read-your-writes | E | ✅ | Writing agent sees its own write immediately (no settle required) |
| S16 | Tier promotion | C | ✅ if tier endpoint exposed | short → mid → long promotion visible to peers |
| S17 | Stats consistency | E | ✅ | memory_stats returns equal counts across peers post-settle |
| S18 | Semantic query expansion | E | ✅ | memory_expand_query returns semantically related memories across writers |
| S19 | Same-node A2A | F | ✅ (with topology override) | Two agents on ONE droplet share local ai-memory; writes + reads work without federation |
| S20 | mTLS happy-path | G | ✅ tls_mode=mtls | Write+read round-trip over HTTPS with client-cert on the allowlist |
| S21 | Anonymous client rejected | G | ✅ tls_mode=mtls | rustls rejects the TLS handshake when no client cert is presented |
| S22 | Identity spoofing resistance | G | ✅ any tls_mode | X-Agent-Id/body resolution precedence honored; transport identity not bypassable |
| S23 | Malicious content fuzz | G | ✅ any tls_mode | SQLi / NUL / oversize / unicode round-trip faithfully or reject cleanly |
| S24 | Byzantine peer | G | ✅ any tls_mode | Tampered sync_push — metadata.agent_id preserved as declared, no silent re-attribution |
| S25 | Clock skew (300s) | G | ✅ any tls_mode | Vector clock still converges with node-3 offset +300s |
| S26 | Mixed ironclaw + hermes on same VPC | G | ⚠️ requires agent_group=mixed |
Heterogeneous A2A through shared ai-memory |
| S27 | Legacy OpenClaw regression | G | ⚠️ requires agent_group=openclaw |
Pre-ironclaw regression lane |
| S28 | memory_search keyword A2A |
H | ✅ | Keyword search (distinct from recall/expand_query) consistent across peers |
| S29 | memory_archive lifecycle |
H | ✅ | archive_list/purge/restore/stats round-trip cross-peer |
| S30 | memory_capabilities handshake |
H | ✅ v0.6.2+ | Protocol version + tool surface match across peers |
| S31 | memory_gc quiescence |
H | ✅ | After forget+gc, non-deleted rows remain readable on all peers |
| S32 | memory_inbox + memory_notify |
H | ⚠️ requires inbox feature | Notify delivers to target's inbox; non-target cannot read |
| S33 | pub/sub via memory_subscribe |
H | ⚠️ requires sub feature | Subscribe→write→delivery→unsubscribe→no-delivery |
| S34 | memory_pending_* governance flow |
H | ⚠️ requires governance.write=approve | Approve vs reject yields correct downstream visibility |
| S35 | memory_namespace_*_standard rule layering |
H | ✅ | Parent chain rules merged into namespace standard |
| S36 | memory_session_start lifecycle |
H | ✅ | Session-tagged writes recall by session_id only |
| S37 | memory_get_links explicit bidirectional |
H | ✅ | Forward and reverse link traversal both return the pair |
| S38 | /export + /import round-trip |
I | ✅ | Export one peer → import elsewhere → stats match |
| S39 | /sync/since delta sync |
I | ✅ | Post-partition delta returns exactly the missed rows |
| S40 | /memories/bulk bulk write |
I | ✅ | 500-row bulk POST reaches all peers + aggregator |
| S41 | /metrics Prometheus shape |
I | ✅ | Required counters present and monotonic post-activity |
| S42 | /namespaces enumeration |
I | ✅ | Namespace list (with counts) equivalent across peers |
3. Coverage by ai-memory primitive¶
Which scenario tests which tool. Empty cells = not directly exercised.
| ai-memory primitive | Scenarios |
|---|---|
memory_store |
S1, S1b, S2, S4, S5, S9, S10, S11, S13, S14, S15, S18, S19 |
memory_recall |
S1, S1b, S2, S4, S5, S15, S18 |
memory_list |
S2, S6, S7, S11 |
memory_get |
S6, S11 |
memory_update |
S9, S13 |
memory_delete / memory_forget |
S10 |
memory_share |
S3 |
memory_link |
S11 |
memory_consolidate |
S5 |
memory_detect_contradiction |
S6 |
memory_auto_tag |
S8 |
memory_agent_register / memory_agent_list |
S12 |
memory_stats |
S17 |
memory_promote (tier) |
S16 |
memory_expand_query |
S18 |
| Federation: quorum write | S1b, S2, S4, S9, S10, S13 |
| Federation: partition tolerance | S14 |
| Federation: convergence | S17 |
| Scope enforcement (Task 1.5) | S7 |
metadata.agent_id immutability (Task 1.2) |
S1, S1b, S2, S5, S6, S7, S9 |
Saturation: every primitive listed in the ai-memory-mcp tool catalog is covered by at least one scenario.
4. Per-scenario plans¶
Each plan has the same six sections: Objective · Pre-conditions · Procedure · Pass criteria · Failure modes · Evidence.
Suite A — Core A2A¶
S1 — Per-agent write + read (MCP-stdio path)¶
- Objective: Proves that an agent framework can accept a natural-language prompt, select the
memory_store/memory_recalltool, invoke it via MCP stdio, and the memory lands on the local substrate with correctmetadata.agent_id. - Pre-conditions: Baseline v1.2.0 green. F2a passed (substrate up). F2b is the scoped-down version of this scenario.
- Procedure: Each of 3 agents drives 10
memory_storecalls via MCP intoscenario1-<agent-id>. After 15s settle, each agent recalls the other two namespaces. Node-4 aggregator is queried independently. - Pass criteria: Every reader sees ≥ 20 memories (10 from each of the other 2 writers).
metadata.agent_idstamped on every row matches the writer. - Failure modes:
recall 0on all readers + local HNSW index grew → MCP stdio writes bypass federation fanout (substrate bug #318)- Agent CLI crash → agent framework install gap
- Evidence:
scenario-1.json+scenario-1.logwith per-agent recall counts, identity-check dump from node-4.
S1b — Per-agent write + read (HTTP path)¶
- Objective: Same invariant as S1 but bypasses MCP stdio. Proves the federation layer works independent of the MCP-stdio substrate bug. Green-path counterpart to S1.
- Pre-conditions: Baseline v1.2.0 green. F3 green (federation live).
- Procedure: Each of 3 agents
POST /api/v1/memoriesto local serve HTTP intoscenario1b-<agent-id>. 15s settle. Each reader counts other-namespace rows via local HTTP. - Pass criteria: Every reader sees ≥ 20 rows with correct
metadata.agent_id. - Failure modes: Federation fanout broken (all 0 rows on peers) → PR #309 regression. Partial replication (1 peer sees, 2 don't) → quorum threshold issue.
- Evidence:
scenario-1b.json+scenario-1b.log.
S2 — Shared-context handoff¶
- Objective: Prove an agent can drop a handoff memory that another agent picks up within bounded settle. This is the "task handed from A to B" primitive.
- Pre-conditions: S1b green.
- Procedure: Agent A (
ai:aliceon node-1) writesscenario2-handoff/handoff-to-bobwith content describing a task. Settle 8s. Agent B (ai:bobon node-2) queries that namespace + title. Agent B writesscenario2-handoff/ack-from-bobwith its acknowledgement. Settle 8s. Agent A reads the ack. - Pass criteria: B sees the exact handoff payload with
metadata.agent_id = ai:alice. A sees the exact ack payload withmetadata.agent_id = ai:bob. - Failure modes: B sees nothing → settle too short OR fanout failed. B sees handoff but A doesn't see ack → asymmetric replication (one direction works, other doesn't).
- Evidence:
scenario-2.json.
Suite B — A2A Primitives¶
S3 — Targeted share¶
- Objective:
memory_sharedelivers a specific subset of memories from A to C without C querying (push semantics). - Pre-conditions: Requires ai-memory v0.6.0.1 (#311) — memory_share CLI + MCP tool not in v0.6.0. DEFERRED until v0.6.0.1 ships.
- Procedure (documented for when runnable): A writes 10 memories. A invokes
memory_share --target=ai:charlie --namespace=<X>. Settle. C lists namespace. C sees the 10 rows withmetadata.shared_from: ai:alice. - Pass criteria: C sees exactly the shared set. B (
ai:bob) does NOT see them (targeted, not broadcast). - Evidence:
scenario-3.json.
S5 — Consolidation + curation¶
- Objective:
memory_consolidateproduces a single memory whosemetadata.consolidated_from_agentslists all contributing agents in correct order, preserving provenance. - Pre-conditions: S1b green.
- Procedure: All 3 agents write 3 related memories to
scenario5-consolidate(9 rows total). Request consolidation via HTTP (POST /api/v1/memories/consolidateif exposed, else CLI). Read back the consolidated memory from a 4th peer. - Pass criteria: Consolidated memory has
metadata.consolidated_from_agentsset to[ai:alice, ai:bob, ai:charlie]. All three original memories still recoverable. - Failure modes: Consolidation drops provenance (empty
consolidated_from_agents). Consolidation deletes originals before consolidation row replicates. - Evidence:
scenario-5.json.
S6 — Contradiction detection¶
- Objective:
memory_detect_contradictionsurfaces conflicting claims from different agents to an uninvolved third agent, with acontradictslink between them. - Pre-conditions: S1b green. Requires
memory_detect_contradictionendpoint. - Procedure: Agent A writes "X is true". Agent B writes "X is false" (different-content, same-topic). Settle. Agent C calls
memory_detect_contradictionon topic X. Verify response includes both rows + acontradictslink between them. - Pass criteria: C's query returns the pair + the link; neither A nor B's row is silently overwritten.
- Failure modes: Only one memory returned (LWW overwrite). No link present (detection not running). Memory returns but with swapped
agent_id(provenance bug). - Evidence:
scenario-6.json.
S11 — Link integrity¶
- Objective:
memory_linkcreates a relationship between two memories, and that relationship is traversable from any peer. - Pre-conditions: S1b green.
- Procedure: Agent A writes memory M1. Agent B writes memory M2. Agent A invokes
memory_link --from M1 --to M2 --relation related_to. Settle. Agent C on node-3 queriesmemory_get_links M1, expects M2 in the response. - Pass criteria: C's link query returns M2. Inverse lookup (
memory_get_links M2) also returns M1. - Failure modes: Links don't replicate (only A's node sees them). Link direction not preserved (M1→M2 ≠ M2→M1 reflexive).
- Evidence:
scenario-11.json.
Suite C — Mutation + Lifecycle¶
S9 — Mutation round-trip¶
- Objective:
memory_updatefrom one agent is visible with the new content on other agents within quorum settle.metadata.agent_idof the ORIGINAL writer is preserved (Task 1.2 immutability) whileupdated_byreflects the mutator. - Pre-conditions: S1b green.
- Procedure: A writes M1 with content "v1". B issues
memory_update M1 content="v2". Settle. C reads M1. - Pass criteria: C sees content="v2".
metadata.agent_idis stillai:alice(immutable).metadata.updated_byor equivalent isai:bob. - Failure modes:
metadata.agent_idoverwritten toai:bob(Task 1.2 breach). Update doesn't propagate (C still sees "v1"). - Evidence:
scenario-9.json.
S10 — Deletion propagation¶
- Objective:
memory_delete/memory_forgetfrom one agent propagates to all peers and to the aggregator node-4 within quorum settle. - Pre-conditions: S1b green.
- Procedure: A writes M1. Settle. A lists M1 via HTTP, confirms present. A issues
memory_delete M1(ormemory_forget M1). Settle 15s. B, C, node-4 all query for M1. - Pass criteria: None of the 3 peers still have M1. No tombstone leak (the row's content is not retrievable via recall).
- Failure modes: Soft delete: row hidden on A but still present on peers (replication failure). Deletion creates a gap:
memory_listreturnsnullentries. - Evidence:
scenario-10.json.
S16 — Tier promotion¶
- Objective:
memory_promotefromshort→mid→longtier replicates across peers without losing content or provenance. - Pre-conditions: S1b green. Tier endpoint exposed.
- Procedure: A writes M1 with
tier: short. Settle. A issuesmemory_promote M1 --tier long. Settle. B reads M1 and checkstier. - Pass criteria: B sees
tier: long. Content identical.metadata.agent_idunchanged. - Evidence:
scenario-16.json.
Suite D — Scope + Governance¶
S7 — Scope visibility matrix¶
- Objective: Task 1.5 scope contract enforced across agents on different nodes. Every (caller_scope, target_scope) pair produces the documented visibility.
- Pre-conditions: S1b green. Task 1.5 scope enforcement shipped in v0.6.0.
- Procedure: A writes 5 memories at scopes
private,team,unit,org,collective. Settle. B queries each scope from differentas_agentvalues (same-team, different-team, different-unit, different-org). Verify the matrix: private→ visible only to A itselfteam→ visible to same-team agentsunit→ visible to same-unit agents (and same-team inherited)org→ visible to same-orgcollective→ globally visible- Pass criteria: Every cell of the visibility matrix matches the Task 1.5 contract.
- Failure modes: Private memory leaked across scope boundary = critical. Collective memory not visible globally = replication gap.
- Evidence:
scenario-7.jsonwith full (scope, caller_scope) matrix.
S8 — Auto-tagging round-trip¶
- Objective: Auto-tag pipeline generates tags for untagged memories and makes them recallable by tag.
- Pre-conditions: Requires Ollama on s-4vcpu-16gb droplets. Auto-tag pipeline daemon running. DEFERRED unless
auto_tag=truedispatch input is set. - Procedure: A writes memory without tags. Wait 30s for auto-tag pipeline. B recalls by one of the generated tags.
- Pass criteria: B's tag-based recall returns A's memory. Tags are non-empty, non-trivial (not just content truncation).
- Evidence:
scenario-8.json.
S12 — Agent registration¶
- Objective:
memory_agent_registeron one node makes the agent visible inmemory_agent_liston all peers. Task 1.3 shipped in v0.6.0. - Pre-conditions: S1b green.
- Procedure: A calls
memory_agent_register --agent-id=ai:dave --namespace=auto-registered. Settle. B, C, node-4 each callmemory_agent_list. All see ai:dave. - Pass criteria: Consistent view of registered agents across all nodes.
- Evidence:
scenario-12.json.
Suite E — Resilience + Observability¶
S13 — Concurrent write contention¶
- Objective: Two agents updating the same memory converge to a consistent outcome across peers — whatever the resolution semantics (LWW, CRDT, vector clock) — everyone agrees on which version won.
- Pre-conditions: S1b green, S9 green.
- Procedure: A writes M1 with "v0". Settle. A and B concurrently issue
memory_update M1 content="vA"andmemory_update M1 content="vB". Settle. All 3 peers + node-4 read M1. - Pass criteria: All 4 readers return the SAME content value (whether it's vA, vB, or a CRDT merge is less important than agreement). No split-brain.
- Failure modes: Different peers report different content → split-brain. Metadata partial update (content vA but
updated_byai:bob) → inconsistent transaction. - Evidence:
scenario-13.jsonwith the outcome vote per peer.
S14 — Partition tolerance¶
- Objective: Temporary loss of one peer doesn't permanently break the mesh; writes during partition converge after recovery (PR #309 + PR #312 regression test at A2A level).
- Pre-conditions: S1b green.
- Procedure: Suspend node-3 via
kill -STOPon its ai-memory process. A and B write memories. 10s settle. Resume node-3 viakill -CONT. 15s settle. Query node-3 for the memories written during partition. - Pass criteria: Node-3 eventually sees all writes made during its outage. Quorum (W=2) was satisfied during partition (A + B constituted quorum for each other).
- Failure modes: Writes during partition lost (W=2 not satisfied during outage). Node-3 partition prevents ALL writes (W=3 requirement bug).
- Evidence:
scenario-14.json.
S15 — Read-your-writes¶
- Objective: An agent reading immediately after its own write sees the write (no settle required locally).
- Pre-conditions: Baseline green.
- Procedure: A writes M1 at time T0. A issues recall at T0+1ms from the same node. A reads its own write.
- Pass criteria: The write is visible to the writer itself with zero-delay local read. (Inter-agent propagation delay is separate — tested elsewhere.)
- Failure modes: Writer doesn't see own write locally → local-write-not-persisted bug.
- Evidence:
scenario-15.json.
S17 — Stats consistency¶
- Objective:
memory_statsreturns the same counts across all 4 peers post-settle (total memories, per-namespace, per-agent). - Pre-conditions: S1b green.
- Procedure: All 3 agents write 5 memories each (15 total). Settle 15s. Query
memory_statson each of the 4 peers. - Pass criteria:
total_memories >= 15on all 4 peers AND equal across all 4 peers (±0 tolerance post-settle). - Failure modes: Count drift between peers → incomplete replication. Count mismatch with writer-count → write bug.
- Evidence:
scenario-17.json.
S18 — Semantic query expansion¶
- Objective:
memory_expand_queryreturns semantically related memories written by different agents, proving the HNSW/embedding layer works across the federated set. - Pre-conditions: S1b green. Semantic tier enabled (baseline.mcp args include
--tier semantic). - Procedure: A writes "Lyra prefers morning runs". B writes "Lyra loves breakfast before exercise". Settle. C calls
memory_expand_query "dawn activity routine"(semantically related but no literal keyword overlap). - Pass criteria: Query returns both A's and B's memories despite keyword mismatch. Both have the expected
metadata.agent_id. - Failure modes: Only keyword matches returned (semantic expansion broken). Neither returned (index fails across peers).
- Evidence:
scenario-18.json.
Suite F — Topology variants¶
S4 — Federation-aware concurrent writes¶
- Objective: Quorum semantics hold under concurrent write burst. PR #309 regression test at A2A level.
- Pre-conditions: S1b green.
- Procedure: All 3 agents concurrently write 50 memories each (150 total, ~5s burst) into distinct namespaces. Settle 20s. Query node-4 aggregator for total row count per namespace.
- Pass criteria: Node-4 sees 50 memories in each of the 3 namespaces. No row loss. No wrong
agent_id. - Failure modes: Row loss → W=2 fanout dropping under load. Wrong
agent_id→ race condition in metadata stamping (Task 1.2 breach under concurrency). - Evidence:
scenario-4.json.
S19 — SAME-NODE A2A¶
- Objective: Two agent identities running on the SAME droplet, sharing the same local
ai-memory serve, can write and read each other's memories. This is the on-device-multi-agent case (Claude Desktop + VS Code on one dev laptop, both talking to local ai-memory). Distinct from between-node federated A2A. - Pre-conditions: Baseline v1.2.0 green. A modified provision: node-1 runs TWO agent identities (
ai:alice-primaryandai:alice-assistant) both with MCP config pointing at the same/var/lib/ai-memory/a2a.db. - Procedure: Primary agent writes 10 memories to
scenario19-same-node-alice. Assistant agent (on same droplet, same serve) recalls the namespace immediately (no settle — same-process DB). Primary reads assistant's writes back. - Pass criteria: Both agents see each other's writes with correct
metadata.agent_id. No federation involvement — this is a pure local-shared-memory test. - Failure modes: Agent isolation bug (alice-primary can't see alice-assistant despite shared DB). MCP stdio session caches state across calls and misses new writes from sibling agents.
- Evidence:
scenario-19.json.
5. Dispatch plans¶
5.1 Standard campaign¶
gh workflow run a2a-gate.yml \
-f agent_group=openclaw \
-f scenarios="1 1b 2 4 5 9 10 12 13 15 17 18"
5.2 Deep resilience campaign¶
gh workflow run a2a-gate.yml \
-f agent_group=openclaw \
-f scenarios="13 14"
5.3 Auto-tag campaign (requires larger droplets)¶
gh workflow run a2a-gate.yml \
-f agent_group=openclaw \
-f scenarios="8" \
-f agent_droplet_size=s-4vcpu-16gb
5.4 Same-node A2A campaign (requires topology override)¶
Documented for future dispatch; requires a topology=same-node dispatch input (not yet implemented).
6. Reading the evidence¶
Every scenario leaves a scenario-N.json + scenario-N.log in runs/<campaign-id>/. The aggregator rolls them up into a2a-summary.json with overall_pass = all-scenarios-pass. The dashboard at https://alphaonedev.github.io/ai-memory-ai2ai-gate/runs/ surfaces PASS/FAIL per campaign; the per-run page shows PASS/FAIL per scenario with reasons.
Evidence is committed to git with redaction; no secret value reaches the repo (see baseline.md §9b).
7. Change control¶
Adding a scenario is a minor test-book version bump. Removing or relaxing a scenario is major and requires a narrative entry in analysis/run-insights.json. Every scenario added must include:
- Entry in §2 register
- Full §4 plan (Objective / Pre-conditions / Procedure / Pass / Failure / Evidence)
- Working script in
scripts/scenarios/<N>_<slug>.py(Python 3; see §7b) - Primitive coverage updated in §3
7b. Scenario scripting convention (v3.0.0+)¶
All scenarios are Python 3. Every script in scripts/scenarios/ is a .py file that imports scripts/a2a_harness.py (stdlib-only — no pip installs on the runner), runs its checks, and calls harness.emit(...) to produce the JSON report. The workflow dispatches via python3.
Contract:
- stdout — single-line JSON scenario report (consumed by aggregator)
- stderr — human-readable log lines
- exit 0 on a clean run (pass, fail, or skip); non-zero only on hard crash
Why Python, not bash: complex payload construction (fuzz / bulk / export-import), real concurrency (concurrent.futures), clean assertions, reusable fixtures, JSON handled as data instead of shell-quoting hell.
Reference implementation: scripts/scenarios/26_mixed_framework.py.
8. Read next¶
- Baseline configuration — what must be true before any scenario runs
- Methodology — the philosophy behind the invariants
- Reproducing — how to dispatch a campaign yourself
- Campaign runs — live evidence dashboard
9. Phase 2/3/4/5 inheritance — substrate cert is one layer of a five-phase campaign¶
The S1–S51 testbook scenarios in this document are Phase 1 — Substrate cert in the canonical First-Principles taxonomy. They are not the entire A2A campaign — they are the substrate-cert layer that the upper phases gate on.
Per-release campaigns (ai-memory-a2a-v<X.Y.Z>) now run a five-phase structure:
| Phase | Layer | Where it lives |
|---|---|---|
| 0 | Pre-flight (infrastructure baseline) | a2a-baseline.json per baseline.md |
| 1 | Substrate cert (this testbook, S1–S |
This repo — testbook scenarios |
| 2 | AI Orchestration Test (scripted bidirectional dry run between IronClaw + Hermes through ai-memory) | Per-release campaign repo, governed by canonical First-Principles |
| 3 | Autonomous NHI Playbook (four scenarios x four control arms x n=3 = 48 runs) | Per-release campaign repo |
| 4 | Meta-analysis (third Claude instance, no namespace access, computes grounding rate / cross-layer consistency from Phase 3 logs) | Per-release campaign repo |
| 5 | Verdict commit + findings sync (publishes substrate verdict and NHI behavioral verdict separately, populates next-patch candidate list) | Per-release campaign repo + ai-memory-mcp umbrella issue |
The canonical governance for Phases 2-5 lives in this repo at governance/META-GOVERNANCE.md, with the binding §7 JSON log schema at governance/phase-log.schema.json and the per-release instantiation procedure at governance/instantiation-guide.md.
The S1-S51 substrate scenarios in this testbook are not changed by the upper-phase governance. The substrate cert layer is the same scripted, binary-evidence layer it has always been; the upper phases run on top of a green-or-known-RED substrate. The substrate testbook continues to be the source of truth for what the A2A surfaces are required to do, and substrate-cert evidence remains the bedrock that NHI behavioral evidence sits on top of.
The reference per-release instantiation is the v0.6.3.1 campaign — see https://github.com/alphaonedev/ai-memory-a2a-v0.6.3.1/pull/2 for the pattern, and Appendix C of governance/META-GOVERNANCE.md for the list of fields each release must pin.