Campaign a2a-openclaw-v0.6.0-r5 FAIL

Agent group: openclaw (homogeneous)
ai-memory ref: v0.6.0
Completed at: 2026-04-20T23:44:58Z
Overall pass: false
Skipped reports: 1

Infrastructure

Provider: ?
Region: ?
Droplet size: ?
Topology: ?
Scenarios started: ?
Scenarios ended: ?
Dispatched by: a2a-gate-bot
Harness SHA: ?

Back-filled by scripts/backfill_legacy_runs.sh — historical run predates campaign.meta.json emission.

Run focus

First end-to-end run — infra GREEN, MCP writes succeeded on all 3 agents, scenario script had two measurement bugs

What this campaign tested: The full end-to-end provisioning flow for the first time: four DO droplets, ai-memory v0.6.0 binary install with federation mesh (W=2 / N=4), agent-specific grok-CLI installation from the grok-dev@1.6.0 release, MCP config at ~/.grok/user-settings.json pointing at local ai-memory with agent_id stamped, scenario 1 writes + reads through the agent MCP path.

What it demonstrated: (1) The infrastructure stack is correct — DO provisioning, SSH, ai-memory federation bootstrap, MCP config wiring, agent install. (2) The agent-driven MCP write path works: on every agent node, the grok-CLI binary accepted a prompt, chose the ai-memory memory_store tool, invoked it over MCP stdio, and ai-memory rebuilt its HNSW semantic index after each successful store. (3) The measurement bugs in phases B + C are in the scenario harness, NOT in the memory substrate or the agent path.

AI NHI analysis · Claude Opus 4.7

First end-to-end run — infra GREEN, MCP writes succeeded on all 3 agents, scenario script had two measurement bugs

MIXED — 15 m 23 s wall. Terraform green, SSH green, Provision green on all 4 nodes, Phase A writes succeeded (9 MCP writes observed per agent, HNSW index grew 1→9). Phase B row-count crashed on shell arithmetic over multi-line ssh stdout. Phase C crashed because the runner tried to reach node-4:9077 over the public IP, which the firewall (by design) blocks.

For three audiences

Non-technical end users

First end-to-end working run on live cloud. Three AI agents on three DigitalOcean servers each wrote memories through the shared memory system. Every write landed. The index rebuilt itself each time a memory came in. The only failures were in the scoring script that counts rows — the agents and the memory did their jobs correctly; the scoreboard just had some formatting bugs.

C-level decision makers

First concrete evidence that heterogeneous A2A works end-to-end on AlphaOne's memory substrate with a real LLM-backed agent (xAI Grok via grok-CLI) on disposable cloud infrastructure. Five iterations of controlled failure delivered this in under 2 hours of wall clock and under $0.30 of DO spend. The remaining work is harness polish, not product correctness. Green scoreboard is within 1–2 iterations.

Engineers & architects

Provision step is sequential (node-4 memory-only → node-1/2/3 agents). On each node: apt-get install base packages, UFW off (ship-gate lesson), shutdown -P +480 dead-man switch, curl ai-memory release tarball → /usr/local/bin, serve with --quorum-writes 2 --quorum-peers <other-3> on 0.0.0.0:9077, wait for /api/v1/health 200 OK. On agent nodes additionally: MCP config JSON at /etc/ai-memory-a2a/mcp-config/config.json, grok-CLI binary install, symlink /usr/local/bin/openclaw → grok, /root/.grok/user-settings.json with xAI key + ai-memory stdio MCP + AI_MEMORY_AGENT_ID env. Scenario 1 Phase A invoked grok -p "Store a memory in namespace scenario1-<agent> titled w<i>-<agent>..." via drive_agent.sh; each invocation spawned ai-memory mcp as a stdio child, handled one JSON-RPC call, persisted to /var/lib/ai-memory/a2a.db, rebuilt HNSW, exited. Phase B failure: ssh reader-node drive_agent.sh list <ns> | jq '.memories | length' produced newline-joined digits from the LLM's response formatting rather than a single int; shell arith $((sum + count)) crashed. Phase C failure: firewall source_addresses = digitalocean_vpc.a2a.ip_range blocks port 9077 from the runner's public IP.

What changes going into the next campaign

Dispatch r6 with scenario-1 harness fixes; scenario 1 should emit a clean pass/fail summary and evidence HTML should render with tri-audience blocks populated.

Tests performed in this run

Every scenario that produced a JSON report in this campaign, in testbook order. Click a row's scenario id to jump to its full report below. See the Every test performed page for the authoritative catalog.

ID	Title	Result	Reason