Campaign runs

Every A2A campaign run commits its scenario artifacts and generated evidence HTML here. Rows sort newest-first by completed_at from a2a-summary.json (runs without that field, including legacy runs, fall back to the directory mtime). PASS / FAIL reflect overall_pass; NO SUMMARY means the campaign ran but the aggregator couldn't recover any scenario JSON, so open the evidence page for the raw trace.
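The ordering and verdict rules above can be sketched in a few lines of Python. This is an illustrative sketch, not the aggregator's actual code: the function names are hypothetical, completed_at is assumed to be an ISO 8601 string, and NO SUMMARY is approximated here as "a2a-summary.json is missing or unreadable".

```python
import json
from datetime import datetime
from pathlib import Path


def run_sort_key(run_dir: Path) -> float:
    """Timestamp used for ordering: completed_at if present, else directory mtime."""
    try:
        data = json.loads((run_dir / "a2a-summary.json").read_text())
        completed_at = data["completed_at"]
        # Assumption: ISO 8601, e.g. "2026-04-24T12:00:00Z".
        return datetime.fromisoformat(completed_at.replace("Z", "+00:00")).timestamp()
    except (OSError, ValueError, KeyError, TypeError, AttributeError):
        # Absent or legacy summary: fall back to the directory mtime.
        return run_dir.stat().st_mtime


def verdict(run_dir: Path) -> str:
    """Map a run directory to PASS / FAIL / NO SUMMARY."""
    try:
        data = json.loads((run_dir / "a2a-summary.json").read_text())
    except (OSError, ValueError):
        return "NO SUMMARY"
    if not isinstance(data, dict):
        return "NO SUMMARY"
    return "PASS" if data.get("overall_pass") is True else "FAIL"


def newest_first(run_dirs):
    """Order run directories the way the table rows are described."""
    return sorted(run_dirs, key=run_sort_key, reverse=True)
```

Sorting with a single key function keeps the mtime fallback in one place, so legacy runs interleave with newer ones by whatever timestamp is available.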

Runs are grouped by agent framework. IronClaw (Rust, AlphaOne), Hermes (Python, NousResearch), and OpenClaw (Python, openclaw.ai) are all first-class agents in active per-release campaigns. OpenClaw runs labelled *-local-docker-* use the local Docker mesh (see docs/local-docker-mesh.md), which allocates 16 GB per container on a single workstation, so no DigitalOcean (DO) General Purpose tier is required.
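For readers scripting against the run directories, the grouping can be recovered from the campaign name itself. The sketch below assumes the naming pattern visible in the tables on this page (a2a-<framework>-v<version>[-local-docker]-r<N>); the regex and helper names are illustrative and not part of the campaign tooling.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern, inferred from names such as
# "a2a-ironclaw-v0.6.3.1-local-docker-r3" and "a2a-hermes-v0.6.3.1-r16".
CAMPAIGN_RE = re.compile(
    r"^a2a-(?P<framework>[a-z]+)-v(?P<version>\d+(?:\.\d+)*)"
    r"(?P<local>-local-docker)?-r(?P<run>\d+)$"
)


class Campaign(NamedTuple):
    framework: str
    version: str
    local_docker: bool
    run: int


def parse_campaign(name: str) -> Optional[Campaign]:
    """Split a campaign name into its parts, or return None if it does not match."""
    m = CAMPAIGN_RE.match(name)
    if m is None:
        return None
    return Campaign(m["framework"], m["version"], m["local"] is not None, int(m["run"]))


def group_by_framework(names):
    """Group parsed campaigns by agent framework, skipping unparseable names."""
    groups = {}
    for name in names:
        campaign = parse_campaign(name)
        if campaign is not None:
            groups.setdefault(campaign.framework, []).append(campaign)
    return groups
```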

Test playbooks executed per campaign

| Phase | Tier | What it measures | Artifact (per run) |
|---|---|---|---|
| Phase 0: Preflight | gate | Baseline attestations + F3 peer-A2A canary | a2a-baseline.json, f3-peer-a2a.json |
| Phase 1: Substrate testbook | substrate | 35 scenarios (S1-S42 minus skips); mTLS adds S20/S21 | a2a-summary.json, scenario-N.{json,log} |
| Phase 2: Scripted A2A dry run | NHI | 6 scripted exchanges via ai-memory (gates Phase 3) | phase2-orchestration.json |
| Phase 3: Autonomous NHI playbook | NHI | A-J scenarios × 4 arms × n=3 = up to 96 cells | phase3-summary.json, phase3-{A-J}-{cold,isolated,stubbed,treatment}-runN.json |
| Phase 4: Meta-analysis | NHI | Grounding rate, hallucination rate, recall hit rate, treatment-vs-control attribution | phase4-analysis.json |
| Phase 5: Verdict roll-up | release | Funnels findings into releases/v0.6.3.1/summary.json | phase5-findings.md |
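One way to see how far a given run progressed is to check which of the per-run artifacts listed in the table are present. The sketch below is a hedged illustration: it assumes the artifacts sit at the top level of the run directory, checks only the fixed-name files (not the scenario-N or phase3 per-cell files), and uses a hypothetical helper name.

```python
from pathlib import Path

# Fixed-name artifacts per phase, taken from the table above.
PHASE_ARTIFACTS = {
    "Phase 0": ["a2a-baseline.json", "f3-peer-a2a.json"],
    "Phase 1": ["a2a-summary.json"],
    "Phase 2": ["phase2-orchestration.json"],
    "Phase 3": ["phase3-summary.json"],
    "Phase 4": ["phase4-analysis.json"],
    "Phase 5": ["phase5-findings.md"],
}


def phases_reached(run_dir: Path) -> dict[str, bool]:
    """Report, per phase, whether all of its fixed-name artifacts exist.

    Assumption: artifacts live directly under run_dir.
    """
    return {
        phase: all((run_dir / name).is_file() for name in names)
        for phase, names in PHASE_ARTIFACTS.items()
    }
```

A present file only shows that the phase wrote its output; it says nothing about whether the phase passed.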

Cross-framework agent-side instruments

Behavioral evidence layered on top of the substrate testbook. Tier 1 = qualitative; Tiers 2-4 = quantitative + adversarial; Tier 5 = stress (deferred).

| Instrument | Subject | Methodology | Artifacts |
|---|---|---|---|
| Phase 3 NHI playbook (A-J) | IronClaw / Hermes | 48-cell behavioral matrix on DO mesh | per-run phase3-* + phase4-analysis.json (linked in the NHI column of the campaign tables) |
| OpenClaw v0.6.3.1 behavioral assessment (Tier 1-4) | OpenClaw 2026.5.x on xAI grok-4.3 | 52 probes across 8 phases (qualitative / recall@k / cross-session ablation / Byzantine peer / soft-restart / hard-restart) | behavioral page · openclaw-behavioral-assessment.json · v3.log · v4.log |
| Substrate cert (3-green streak) | OpenClaw on local-docker | 3 consecutive 35-scenario green runs | cert doc |
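The 3-green-streak criterion can be expressed as a small check over a run history. This is a minimal sketch under stated assumptions: a run is treated as "green" when all 35 scenarios pass (the cert doc may apply further conditions), and the function name and input shape are hypothetical.

```python
def has_green_streak(runs, streak=3, scenarios=35):
    """Return True if `runs` contains `streak` consecutive fully green runs.

    `runs` is a list of (passed, total) scenario counts, oldest first.
    Assumption: green means passed == total == scenarios.
    """
    consecutive = 0
    for passed, total in runs:
        if passed == total == scenarios:
            consecutive += 1
            if consecutive >= streak:
                return True
        else:
            # Any non-green run resets the streak.
            consecutive = 0
    return False
```

For example, three consecutive 35/35 runs satisfy the check, while a 34/35 run in the middle resets the count.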

IronClaw campaigns

All agent droplets in each campaign run IronClaw (Rust). Primary agent.

Campaign Verdict Baseline F3 Scenarios NHI Artifacts
a2a-ironclaw-v0.6.3.1-r2 ❌ FAIL 0/0 HTML · json · all
a2a-ironclaw-v0.6.3.1-local-docker-r3 ✅ PASS ✅ OK ✅ OK 35/35 HTML · json · baseline · all
a2a-ironclaw-v0.6.3.1-local-docker-r2 ✅ PASS ✅ OK ✅ OK 35/35 HTML · json · baseline · all
a2a-ironclaw-v0.6.3.1-local-docker-r1 ✅ PASS ❌ VIOLATION ✅ OK 35/35 HTML · json · baseline · all
a2a-ironclaw-v0.6.3.1-r27 ❌ FAIL ✅ OK ✅ OK 38/44 NHI HTML · json · baseline · phase4 · all
a2a-ironclaw-v0.6.3.1-r26 ❌ FAIL ✅ OK ✅ OK 39/44 NHI HTML · json · baseline · phase4 · all
a2a-ironclaw-v0.6.3.1-r25 ❌ FAIL ✅ OK ✅ OK 39/44 NHI HTML · json · baseline · phase4 · all
a2a-ironclaw-v0.6.3.1-r24 ❌ FAIL ✅ OK ✅ OK 39/44 HTML · json · baseline · all
a2a-ironclaw-v0.6.3.1-r23 ❌ FAIL ✅ OK ✅ OK 39/44 HTML · json · baseline · all
a2a-ironclaw-v0.6.3.1-r22 ❌ FAIL ✅ OK ✅ OK 39/44 HTML · json · baseline · all
a2a-ironclaw-v0.6.3.1-r21 ❌ FAIL ❌ VIOLATION 0/0 HTML · json · baseline · all
a2a-ironclaw-v0.6.3.1-r20 ❌ FAIL ❌ VIOLATION 0/0 HTML · json · baseline · all
a2a-ironclaw-v0.6.3.1-r19 ✅ PASS ✅ OK ✅ OK 39/39 NHI HTML · json · baseline · phase4 · all
a2a-ironclaw-v0.6.3.1-r18 ✅ PASS ✅ OK ✅ OK 39/39 NHI HTML · json · baseline · phase4 · all
a2a-ironclaw-v0.6.3.1-r17 ✅ PASS ✅ OK ✅ OK 39/39 HTML · json · baseline · all
a2a-ironclaw-v0.6.3.1-r16 ✅ PASS ✅ OK ✅ OK 39/39 NHI HTML · json · baseline · phase4 · all
a2a-ironclaw-v0.6.3.1-r15 ✅ PASS ✅ OK ✅ OK 39/39 NHI HTML · json · baseline · phase4 · all
a2a-ironclaw-v0.6.3.1-r14 ❌ FAIL ✅ OK ✅ OK 37/39 NHI HTML · json · baseline · phase4 · all
a2a-ironclaw-v0.6.3.1-r13 ✅ PASS ✅ OK ✅ OK 37/37 NHI HTML · json · baseline · phase4 · all
a2a-ironclaw-v0.6.3.1-r11 ❌ FAIL 0/0 HTML · json · all
a2a-ironclaw-v0.6.3.1-r10 ❌ FAIL 0/0 HTML · json · all
a2a-ironclaw-v0.6.3.1-r9 ✅ PASS ✅ OK ✅ OK 37/37 HTML · json · baseline · all
a2a-ironclaw-v0.6.3.1-r8 ✅ PASS ✅ OK ✅ OK 37/37 HTML · json · baseline · all
a2a-ironclaw-v0.6.3.1-r7 ❌ FAIL ✅ OK 0/0 HTML · json · baseline · all
a2a-ironclaw-v0.6.3.1-r6 ❌ FAIL 0/0 HTML · json · all
a2a-ironclaw-v0.6.3.1-r5 ❌ FAIL ❌ VIOLATION 0/0 HTML · json · baseline · all
a2a-ironclaw-v0.6.3.1-r4 ❌ FAIL ✅ OK 0/0 HTML · json · baseline · all
a2a-ironclaw-v0.6.3.1-r3 ❌ FAIL 0/0 HTML · json · all

Hermes campaigns

All agent droplets in each campaign run Hermes (Python). Primary agent.

Campaign Verdict Baseline F3 Scenarios NHI Artifacts
a2a-hermes-v0.6.3.1-local-docker-r3 ✅ PASS ✅ OK ✅ OK 35/35 HTML · json · baseline · all
a2a-hermes-v0.6.3.1-local-docker-r2 ✅ PASS ✅ OK ✅ OK 35/35 HTML · json · baseline · all
a2a-hermes-v0.6.3.1-local-docker-r1 ✅ PASS ✅ OK ✅ OK 35/35 HTML · json · baseline · all
a2a-hermes-v0.6.3.1-r16 ❌ FAIL ✅ OK ✅ OK 39/44 NHI HTML · json · baseline · phase4 · all
a2a-hermes-v0.6.3.1-r15 ❌ FAIL ✅ OK ✅ OK 39/44 NHI HTML · json · baseline · phase4 · all
a2a-hermes-v0.6.3.1-r14 ❌ FAIL ❌ VIOLATION 0/0 HTML · json · baseline · all
a2a-hermes-v0.6.3.1-r13 ❌ FAIL ❌ VIOLATION 0/0 HTML · json · baseline · all
a2a-hermes-v0.6.3.1-r12 ❌ FAIL ✅ OK ✅ OK 39/44 NHI HTML · json · baseline · phase4 · all
a2a-hermes-v0.6.3.1-r11 ❌ FAIL ❌ VIOLATION 0/0 HTML · json · baseline · all
a2a-hermes-v0.6.3.1-r10 ❌ FAIL ❌ VIOLATION 0/0 HTML · json · baseline · all
a2a-hermes-v0.6.3.1-r9 ❌ FAIL 0/0 HTML · json · all
a2a-hermes-v0.6.3.1-r8 ❌ FAIL ✅ OK ✅ OK 39/44 HTML · json · baseline · all
a2a-hermes-v0.6.3.1-r7 ❌ FAIL ✅ OK ✅ OK 0/0 HTML · json · baseline · all
a2a-hermes-v0.6.3.1-r6 ❌ FAIL ✅ OK ✅ OK 0/0 HTML · json · baseline · all
a2a-hermes-v0.6.3.1-r5 ❌ FAIL ✅ OK ✅ OK 0/0 HTML · json · baseline · all
a2a-hermes-v0.6.3.1-r4 ❌ FAIL ✅ OK ✅ OK 0/0 HTML · json · baseline · all
a2a-hermes-v0.6.3.1-r3 ❌ FAIL ❌ VIOLATION 0/0 HTML · json · baseline · all
a2a-hermes-v0.6.3.1-r2 ❌ FAIL ❌ VIOLATION 0/0 HTML · json · baseline · all
a2a-hermes-v0.6.3.1-r1 ❌ FAIL ❌ VIOLATION 0/0 HTML · json · baseline · all

OpenClaw campaigns

All agent droplets / containers in each campaign run OpenClaw (Python). First-class as of 2026-04-24.

⚠ Infrastructure note: campaigns suffixed -local-docker-* were run on a 4-node Docker mesh on a single workstation (3 openclaw containers + 1 memory-only aggregator, 16 GB per openclaw container). No DigitalOcean infrastructure was provisioned for these runs — they are fully local and reproducible per docs/local-docker-mesh.md. Other openclaw campaigns use DO General Purpose tier droplets.

Campaign Verdict Baseline F3 Scenarios NHI Artifacts
a2a-openclaw-v0.6.3.1-r3 ✅ PASS ✅ OK ✅ OK 35/35 HTML · json · baseline · all
a2a-openclaw-v0.6.3.1-r2 ✅ PASS ✅ OK ✅ OK 35/35 HTML · json · baseline · all
a2a-openclaw-v0.6.3.1-r1 ✅ PASS ✅ OK ✅ OK 35/35 HTML · json · baseline · all