Skip to content

Incident log — what r2–r12 taught us

Every iteration of the A2A gate on 2026-04-20 / 2026-04-21 surfaced one new bug in the provisioning or test infrastructure. This log records each honestly so future operators (human or AI NHI) can see the lineage from first scaffolding dispatch to first green campaign.

Classic ship-gate pattern: the iterations ARE the diagnostic. Every red teaches something; every fix moves one step deeper into the stack.


Iteration index

Run Date Verdict Root cause Fix commit
r2 2026-04-20 RED VPC CIDR typo 10.260.0.0/24 (invalid IPv4) 18f05d2
r3 2026-04-20 RED Concurrent VPC CIDR collision (both groups requested same /24) + DO tag dots 82d9fe2, 4998ae2
r3-hermes 2026-04-20 RED Stale CIDR hardcoded in firewall inbound_rule source_addresses c7bffef
r4 2026-04-20 RED Stale .pub file → wrong SSH key fingerprint in repo secret key rotation
r5-openclaw 2026-04-20 MIXED Phase B shell arithmetic crash on multi-line output + Phase C firewall-blocked public IP ac6f56d (hermes) + scenario fixes
r5-hermes 2026-04-20 RED hermes -p is --profile, not --prompt ac6f56d
r6-openclaw 2026-04-20 RED (honest) MCP stdio writes bypass federation fanout coordinatorai-memory-mcp#318, targeted v0.6.0.1 substrate fix pending
r6-hermes 2026-04-20 RED hermes import fails with ModuleNotFoundError: dotenv — upstream install.sh gap e129b53
r7 2026-04-20 RED (a) openclaw install.sh exit-1 despite success (non-TTY wizard) (b) baseline PROBE F2 caught #318 dd49e5a (real openclaw install) + bd0db1b (baseline spec)
r7 (post-baseline) 2026-04-20 RED (honest) Baseline gate working — F2b canary correctly flagged #318 gate doing its job
r8 2026-04-21 RED openclaw pipe SIGPIPE on openclaw --version \| head -1 under set -o pipefail 195e349
r9 (hermes only) 2026-04-21 GREEN First successful campaign — baseline + F3 + S1b all pass
r9 (openclaw) 2026-04-21 RED Race: dispatched before install fix landed; install.sh exit-1 still bc5f025
r10 (openclaw) 2026-04-21 RED npm openclaw@2026.4.20 ETARGET — that's a display label, not an npm semver revert to install.sh + drift-capture
r11 2026-04-21 CANCELLED Provision step stalled 37+ min — no timeouts on F2b canary / install scripts 6face55 (timeout hardening)
r12 2026-04-21 in flight Full timeout-hardened pipeline

Key learnings

Lesson 1 — set -o pipefail + | head -N is a SIGPIPE landmine

In r7–r8, openclaw --version 2>&1 | head -1 returned non-zero despite openclaw succeeding — head closed stdin after 1 line, SIGPIPE'd openclaw, and pipefail propagated the non-zero exit. Fix: capture output to a variable, then echo | head from the variable. Commit 195e349.

Lesson 2 — Upstream install scripts exit non-zero on headless environments

Both openclaw.ai/install.sh and parts of hermes-agent's installer assume a TTY and exit non-zero when run from a GitHub Actions runner (no TTY). The binaries install correctly; the wizard fails. Fix: pipe install through | sed 'prefix' and ignore exit code; rely on presence check. Commit bc5f025.

Lesson 3 — Every long-running subprocess MUST have a timeout

r11 hung 37+ min on Provision because F2b agent canary had no timeout — if xAI Grok rate-limits or the agent CLI stdin-blocks, the script hangs until the 60-min workflow timeout. Fix: explicit timeout <sec> on every subprocess. Commit 6face55.

Lesson 4 — Display versions ≠ registry versions

"OpenClaw 2026.4.20 (f50202e)" is a git-commit-style label, NOT an npm semver. npm install -g openclaw@2026.4.20 fails with ETARGET. Pinning requires knowing the actual npm semver (currently unknown for openclaw — documented as an upstream gap in baseline.md §3b).

Lesson 5 — The gate catches real product bugs

r6-openclaw and r7-hermes both surfaced ai-memory-mcp#318 — MCP stdio writes don't trigger federation fanout. The gate's job is exactly this: to refuse to run scenarios when the substrate has gaps. F2b (non-gating in v1.2.0) correctly reports this bug on every dispatch while F2a (HTTP substrate, gating) passes — making the dashboard tell the honest story without halting CI.

Lesson 6 — jq '// empty' treats false as null

A subtle bug in the HTML generator: jq '.pass // empty' returned empty string when .pass: false because jq's // treats false as the null-alternative. Fix: jq 'if .pass == null then empty else .pass end'. This made r6-openclaw render "UNKNOWN" instead of "FAIL" on the dashboard. [Commit chain around e30586f].

Lesson 7 — VPC CIDR partitioning is mandatory for concurrent dispatch

Running both groups simultaneously requires distinct CIDRs; otherwise terraform race-conditions on VPC creation. local.vpc_cidr partitions to 10.251.0.0/24 (openclaw) vs 10.252.0.0/24 (hermes). [Commit 82d9fe2].

Lesson 8 — UFW on Ubuntu 24.04 is default-on and must be hard-disabled

Ship-gate r21/r23 lesson, re-confirmed here: Ubuntu 24.04 ships UFW enabled and configured, interfering with loopback + intra-VPC traffic the federation mesh needs. setup_node.sh does ufw --force reset && ufw --force disable twice (reset re-enables on some UFW builds), then iptables -P ACCEPT + iptables -F, then baseline-level verification with exit 3 on failure. Same belt-and-suspenders pattern as the secret redaction.

Lesson 9 — "Absolutely truthful reporting" is a code contract

Every scenario emits a JSON verdict BEFORE the exit code is set. Every failure lists specific reasons. Every attestation is computed from live droplet state, never from config intent. The dashboard renders what the JSON says, not what we wish it said. testbook.md §4.4 + baseline.md §8.


Forward-looking (not yet surfaced, documented for completeness)

Expected future incidents and their mitigations:

  • xAI rate limiting under S4 burst: scenario 4 writes 90 rows in ~5s with 3 agents. If xAI throttles the agent CLIs, F2b will fail cleanly (60s timeout). S4 itself doesn't use xAI — it's HTTP-direct — so rate limits don't block the scenario. Baseline attestation captures the issue as F2b red while S4 passes.

  • DO region capacity: a specific region running out of droplet capacity would fail terraform apply at step 3. region is a workflow input; operators can switch to sfo3, ams3, etc.

  • ai-memory v0.6.0 EOL: when v0.6.0.1 ships with the #318 fix, the workflow's ai_memory_git_ref default bumps to v0.6.0.1 + F2b moves from "attestation-only" to "gates baseline" (minor spec bump per baseline.md §12).

  • openclaw post-install wizard change: future openclaw releases may change the post-install behavior (could fail in a new way OR finally work headless). The drift-capture in framework_version attestation field would surface the version that's running.

  • Hermes breaking change on main ref: we pin the install script URL to main branch (unpinned at tag level — documented gap). A future hermes main could change the install surface. HERMES_INSTALL_REF env override exists to pin to a specific tag once upstream ships a stable release tag.


Change log of this document

  • 2026-04-21 — initial authoring after r12 dispatch. Captures r2–r11 full post-mortem.

Future entries go at the top; keep the table at the top sorted newest-first once r12+ land.