Campaign runs
Every A2A campaign run commits its scenario artifacts and
generated evidence HTML here. Rows sort newest-first by
completed_at from a2a-summary.json (legacy runs, or runs
where that field is absent, fall back to the directory mtime).
PASS / FAIL reflect overall_pass. NO SUMMARY means the campaign
ran but the aggregator couldn't recover any scenario JSON;
open the evidence page for the raw trace.
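The ordering and verdict rules above can be sketched in a few lines of Python. This is an illustration, not the aggregator's actual code: the function names `run_sort_key` and `verdict` are hypothetical, `completed_at` is assumed to be an ISO 8601 string, and a missing or unreadable a2a-summary.json is assumed to be what surfaces as NO SUMMARY.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def run_sort_key(run_dir: Path) -> float:
    """Return a POSIX timestamp for ordering runs (hypothetical helper)."""
    summary = run_dir / "a2a-summary.json"
    try:
        completed = json.loads(summary.read_text())["completed_at"]
        # Assumption: completed_at is ISO 8601, possibly with a trailing "Z".
        stamp = datetime.fromisoformat(completed.replace("Z", "+00:00"))
        if stamp.tzinfo is None:
            stamp = stamp.replace(tzinfo=timezone.utc)
        return stamp.timestamp()
    except (OSError, ValueError, KeyError, TypeError, AttributeError):
        # Absent or legacy summary: fall back to the directory mtime.
        return run_dir.stat().st_mtime


def verdict(run_dir: Path) -> str:
    """Map a run directory to PASS, FAIL, or NO SUMMARY (hypothetical helper)."""
    summary = run_dir / "a2a-summary.json"
    try:
        data = json.loads(summary.read_text())
    except (OSError, ValueError):
        return "NO SUMMARY"
    if not isinstance(data, dict) or "overall_pass" not in data:
        # Assumption: a summary without overall_pass is treated as unrecoverable.
        return "NO SUMMARY"
    return "PASS" if data["overall_pass"] else "FAIL"


def newest_first(run_dirs):
    """Order run directories the way the tables on this page are ordered."""
    return sorted(run_dirs, key=run_sort_key, reverse=True)
```

Calling `newest_first(p for p in Path("runs").iterdir() if p.is_dir())` would yield directories in table order, where `runs` is a placeholder for wherever the campaign directories live.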
Runs are grouped by agent framework. IronClaw (Rust,
AlphaOne), Hermes (Python, NousResearch), and
OpenClaw (Python, openclaw.ai) are all first-class
agents in active per-release campaigns. OpenClaw runs
labelled *-local-docker-* use the local Docker mesh
(see docs/local-docker-mesh.md), which allocates 16 GB per
container on a single workstation, so no DigitalOcean (DO)
General Purpose tier is required.
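Grouping by framework can be derived from the campaign name alone. The sketch below infers the naming pattern (`a2a-<framework>-v<version>[-local-docker]-r<N>`) from the campaign names listed on this page; the regular expression, the `Campaign` tuple, and both function names are illustrative assumptions, not part of the campaign tooling.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Optional

# Pattern inferred from names such as a2a-ironclaw-v0.6.3.1-local-docker-r3.
CAMPAIGN_RE = re.compile(
    r"^a2a-(?P<framework>[a-z]+)-v(?P<version>\d+(?:\.\d+)*)"
    r"(?P<local>-local-docker)?-r(?P<run>\d+)$"
)


class Campaign(NamedTuple):
    framework: str
    version: str
    local_docker: bool
    run: int


def parse_campaign(name: str) -> Optional[Campaign]:
    """Split a campaign name into its parts, or return None if it does not match."""
    m = CAMPAIGN_RE.match(name)
    if m is None:
        return None
    return Campaign(m["framework"], m["version"], m["local"] is not None, int(m["run"]))


def group_by_framework(names: Iterable[str]) -> Dict[str, List[Campaign]]:
    """Bucket parsed campaigns by framework, skipping names that do not parse."""
    groups: Dict[str, List[Campaign]] = defaultdict(list)
    for name in names:
        campaign = parse_campaign(name)
        if campaign is not None:
            groups[campaign.framework].append(campaign)
    return dict(groups)
```

The `local_docker` flag is what separates local Docker mesh runs from droplet-based runs within a framework's group.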
Test playbooks executed per campaign
| Phase | Tier | What it measures | Artifact (per run) |
|---|---|---|---|
| Phase 0 — Preflight | gate | Baseline attestations + F3 peer-A2A canary | a2a-baseline.json, f3-peer-a2a.json |
| Phase 1 — Substrate testbook | substrate | 35 scenarios (S1-S42 minus skips); mTLS adds S20/S21 | a2a-summary.json, scenario-N.{json,log} |
| Phase 2 — Scripted A2A dry run | NHI | 6 scripted exchanges via ai-memory (gates Phase 3) | phase2-orchestration.json |
| Phase 3 — Autonomous NHI playbook | NHI | A-J scenarios × 4 arms × n=3 = up to 96 cells | phase3-summary.json, phase3-{A-J}-{cold,isolated,stubbed,treatment}-runN.json |
| Phase 4 — Meta-analysis | NHI | Grounding rate, hallucination rate, recall hit rate, treatment-vs-control attribution | phase4-analysis.json |
| Phase 5 — Verdict roll-up | release | Funnels findings into releases/v0.6.3.1/summary.json | phase5-findings.md |
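The Phase 3 artifact names in the table encode the scenario, arm, and run number of each cell. A minimal sketch of recovering those three fields from a filename follows. It assumes the names match the pattern shown in the table literally (`phase3-<A-J>-<arm>-run<N>.json`); the helper name `parse_phase3_cell` is hypothetical.

```python
import re
from typing import Optional, Tuple

# Arm names as listed in the Phase 3 artifact pattern.
ARMS = ("cold", "isolated", "stubbed", "treatment")

PHASE3_RE = re.compile(
    r"^phase3-(?P<scenario>[A-J])-(?P<arm>" + "|".join(ARMS) + r")-run(?P<run>\d+)\.json$"
)


def parse_phase3_cell(filename: str) -> Optional[Tuple[str, str, int]]:
    """Return (scenario, arm, run) for a per-cell artifact, else None."""
    m = PHASE3_RE.match(filename)
    if m is None:
        return None  # e.g. phase3-summary.json is not a per-cell artifact
    return m["scenario"], m["arm"], int(m["run"])
```

For example, `parse_phase3_cell("phase3-C-treatment-run2.json")` returns `("C", "treatment", 2)`, while `parse_phase3_cell("phase3-summary.json")` returns `None`.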
Cross-framework agent-side instruments
Behavioral evidence layered on top of the substrate testbook. Tier 1 = qualitative; Tiers 2-4 = quantitative + adversarial; Tier 5 = stress (deferred).
| Instrument | Subject | Methodology | Artifacts |
|---|---|---|---|
| Phase 3 NHI playbook (A-J) | IronClaw / Hermes | 48-cell behavioral matrix on DO mesh | per-run phase3-* + phase4-analysis.json (linked in the NHI column of the campaign tables below) |
| OpenClaw v0.6.3.1 behavioral assessment (Tier 1-4) | OpenClaw 2026.5.x on xAI grok-4.3 | 52 probes across 8 phases (qualitative / recall@k / cross-session ablation / Byzantine peer / soft-restart / hard-restart) | behavioral page · openclaw-behavioral-assessment.json · v3.log · v4.log |
| Substrate cert (3-green streak) | OpenClaw on local-docker | 3 consecutive 35-scenario green runs | cert doc |
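The 3-green streak criterion can be checked mechanically from the Scenarios column. The sketch below assumes a run counts as green when its cell reads 35/35 and that cells are supplied in chronological order. The real certification may also consider baseline and F3 status, so treat this as an illustration; both function names are hypothetical.

```python
from typing import Iterable


def is_green(cell: str, expected_total: int = 35) -> bool:
    """True when a 'passed/total' cell shows every expected scenario passing."""
    passed, total = (int(part) for part in cell.split("/"))
    return passed == total == expected_total


def has_green_streak(cells: Iterable[str], length: int = 3, expected_total: int = 35) -> bool:
    """True when `length` consecutive cells are green."""
    streak = 0
    for cell in cells:
        # Any non-green run resets the streak to zero.
        streak = streak + 1 if is_green(cell, expected_total) else 0
        if streak >= length:
            return True
    return False
```

With this definition, `has_green_streak(["35/35", "35/35", "35/35"])` is `True`, and a single `34/35` in the middle of the sequence resets the count.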
IronClaw campaigns
All agent droplets in each campaign run IronClaw (Rust). Primary agent.
| Campaign | Verdict | Baseline | F3 | Scenarios | NHI | Artifacts |
|---|---|---|---|---|---|---|
| a2a-ironclaw-v0.6.3.1-r2 | ❌ FAIL | — | — | 0/0 | — | HTML · json · all |
| a2a-ironclaw-v0.6.3.1-local-docker-r3 | ✅ PASS | ✅ OK | ✅ OK | 35/35 | — | HTML · json · baseline · all |
| a2a-ironclaw-v0.6.3.1-local-docker-r2 | ✅ PASS | ✅ OK | ✅ OK | 35/35 | — | HTML · json · baseline · all |
| a2a-ironclaw-v0.6.3.1-local-docker-r1 | ✅ PASS | ❌ VIOLATION | ✅ OK | 35/35 | — | HTML · json · baseline · all |
| a2a-ironclaw-v0.6.3.1-r27 | ❌ FAIL | ✅ OK | ✅ OK | 38/44 | NHI | HTML · json · baseline · phase4 · all |
| a2a-ironclaw-v0.6.3.1-r26 | ❌ FAIL | ✅ OK | ✅ OK | 39/44 | NHI | HTML · json · baseline · phase4 · all |
| a2a-ironclaw-v0.6.3.1-r25 | ❌ FAIL | ✅ OK | ✅ OK | 39/44 | NHI | HTML · json · baseline · phase4 · all |
| a2a-ironclaw-v0.6.3.1-r24 | ❌ FAIL | ✅ OK | ✅ OK | 39/44 | — | HTML · json · baseline · all |
| a2a-ironclaw-v0.6.3.1-r23 | ❌ FAIL | ✅ OK | ✅ OK | 39/44 | — | HTML · json · baseline · all |
| a2a-ironclaw-v0.6.3.1-r22 | ❌ FAIL | ✅ OK | ✅ OK | 39/44 | — | HTML · json · baseline · all |
| a2a-ironclaw-v0.6.3.1-r21 | ❌ FAIL | ❌ VIOLATION | — | 0/0 | — | HTML · json · baseline · all |
| a2a-ironclaw-v0.6.3.1-r20 | ❌ FAIL | ❌ VIOLATION | — | 0/0 | — | HTML · json · baseline · all |
| a2a-ironclaw-v0.6.3.1-r19 | ✅ PASS | ✅ OK | ✅ OK | 39/39 | NHI | HTML · json · baseline · phase4 · all |
| a2a-ironclaw-v0.6.3.1-r18 | ✅ PASS | ✅ OK | ✅ OK | 39/39 | NHI | HTML · json · baseline · phase4 · all |
| a2a-ironclaw-v0.6.3.1-r17 | ✅ PASS | ✅ OK | ✅ OK | 39/39 | — | HTML · json · baseline · all |
| a2a-ironclaw-v0.6.3.1-r16 | ✅ PASS | ✅ OK | ✅ OK | 39/39 | NHI | HTML · json · baseline · phase4 · all |
| a2a-ironclaw-v0.6.3.1-r15 | ✅ PASS | ✅ OK | ✅ OK | 39/39 | NHI | HTML · json · baseline · phase4 · all |
| a2a-ironclaw-v0.6.3.1-r14 | ❌ FAIL | ✅ OK | ✅ OK | 37/39 | NHI | HTML · json · baseline · phase4 · all |
| a2a-ironclaw-v0.6.3.1-r13 | ✅ PASS | ✅ OK | ✅ OK | 37/37 | NHI | HTML · json · baseline · phase4 · all |
| a2a-ironclaw-v0.6.3.1-r11 | ❌ FAIL | — | — | 0/0 | — | HTML · json · all |
| a2a-ironclaw-v0.6.3.1-r10 | ❌ FAIL | — | — | 0/0 | — | HTML · json · all |
| a2a-ironclaw-v0.6.3.1-r9 | ✅ PASS | ✅ OK | ✅ OK | 37/37 | — | HTML · json · baseline · all |
| a2a-ironclaw-v0.6.3.1-r8 | ✅ PASS | ✅ OK | ✅ OK | 37/37 | — | HTML · json · baseline · all |
| a2a-ironclaw-v0.6.3.1-r7 | ❌ FAIL | ✅ OK | — | 0/0 | — | HTML · json · baseline · all |
| a2a-ironclaw-v0.6.3.1-r6 | ❌ FAIL | — | — | 0/0 | — | HTML · json · all |
| a2a-ironclaw-v0.6.3.1-r5 | ❌ FAIL | ❌ VIOLATION | — | 0/0 | — | HTML · json · baseline · all |
| a2a-ironclaw-v0.6.3.1-r4 | ❌ FAIL | ✅ OK | — | 0/0 | — | HTML · json · baseline · all |
| a2a-ironclaw-v0.6.3.1-r3 | ❌ FAIL | — | — | 0/0 | — | HTML · json · all |
Hermes campaigns
All agent droplets in each campaign run Hermes (Python). Primary agent.
| Campaign | Verdict | Baseline | F3 | Scenarios | NHI | Artifacts |
|---|---|---|---|---|---|---|
| a2a-hermes-v0.6.3.1-local-docker-r3 | ✅ PASS | ✅ OK | ✅ OK | 35/35 | — | HTML · json · baseline · all |
| a2a-hermes-v0.6.3.1-local-docker-r2 | ✅ PASS | ✅ OK | ✅ OK | 35/35 | — | HTML · json · baseline · all |
| a2a-hermes-v0.6.3.1-local-docker-r1 | ✅ PASS | ✅ OK | ✅ OK | 35/35 | — | HTML · json · baseline · all |
| a2a-hermes-v0.6.3.1-r16 | ❌ FAIL | ✅ OK | ✅ OK | 39/44 | NHI | HTML · json · baseline · phase4 · all |
| a2a-hermes-v0.6.3.1-r15 | ❌ FAIL | ✅ OK | ✅ OK | 39/44 | NHI | HTML · json · baseline · phase4 · all |
| a2a-hermes-v0.6.3.1-r14 | ❌ FAIL | ❌ VIOLATION | — | 0/0 | — | HTML · json · baseline · all |
| a2a-hermes-v0.6.3.1-r13 | ❌ FAIL | ❌ VIOLATION | — | 0/0 | — | HTML · json · baseline · all |
| a2a-hermes-v0.6.3.1-r12 | ❌ FAIL | ✅ OK | ✅ OK | 39/44 | NHI | HTML · json · baseline · phase4 · all |
| a2a-hermes-v0.6.3.1-r11 | ❌ FAIL | ❌ VIOLATION | — | 0/0 | — | HTML · json · baseline · all |
| a2a-hermes-v0.6.3.1-r10 | ❌ FAIL | ❌ VIOLATION | — | 0/0 | — | HTML · json · baseline · all |
| a2a-hermes-v0.6.3.1-r9 | ❌ FAIL | — | — | 0/0 | — | HTML · json · all |
| a2a-hermes-v0.6.3.1-r8 | ❌ FAIL | ✅ OK | ✅ OK | 39/44 | — | HTML · json · baseline · all |
| a2a-hermes-v0.6.3.1-r7 | ❌ FAIL | ✅ OK | ✅ OK | 0/0 | — | HTML · json · baseline · all |
| a2a-hermes-v0.6.3.1-r6 | ❌ FAIL | ✅ OK | ✅ OK | 0/0 | — | HTML · json · baseline · all |
| a2a-hermes-v0.6.3.1-r5 | ❌ FAIL | ✅ OK | ✅ OK | 0/0 | — | HTML · json · baseline · all |
| a2a-hermes-v0.6.3.1-r4 | ❌ FAIL | ✅ OK | ✅ OK | 0/0 | — | HTML · json · baseline · all |
| a2a-hermes-v0.6.3.1-r3 | ❌ FAIL | ❌ VIOLATION | — | 0/0 | — | HTML · json · baseline · all |
| a2a-hermes-v0.6.3.1-r2 | ❌ FAIL | ❌ VIOLATION | — | 0/0 | — | HTML · json · baseline · all |
| a2a-hermes-v0.6.3.1-r1 | ❌ FAIL | ❌ VIOLATION | — | 0/0 | — | HTML · json · baseline · all |
OpenClaw campaigns
All agent droplets / containers in each campaign run OpenClaw (Python). First-class as of 2026-04-24.
⚠ Infrastructure note: campaigns suffixed -local-docker-* were run on a 4-node Docker mesh on a single workstation (3 openclaw containers + 1 memory-only aggregator, 16 GB per openclaw container). No DigitalOcean infrastructure was provisioned for these runs — they are fully local and reproducible per docs/local-docker-mesh.md. Other openclaw campaigns use DO General Purpose tier droplets.
| Campaign | Verdict | Baseline | F3 | Scenarios | NHI | Artifacts |
|---|---|---|---|---|---|---|
| a2a-openclaw-v0.6.3.1-r3 | ✅ PASS | ✅ OK | ✅ OK | 35/35 | — | HTML · json · baseline · all |
| a2a-openclaw-v0.6.3.1-r2 | ✅ PASS | ✅ OK | ✅ OK | 35/35 | — | HTML · json · baseline · all |
| a2a-openclaw-v0.6.3.1-r1 | ✅ PASS | ✅ OK | ✅ OK | 35/35 | — | HTML · json · baseline · all |