Incident log — what r2–r12 taught us¶
Every iteration of the A2A gate on 2026-04-20 / 2026-04-21 surfaced one new bug in the provisioning or test infrastructure. This log records each honestly so future operators (human or AI NHI) can see the lineage from first scaffolding dispatch to first green campaign.
Classic ship-gate pattern: the iterations ARE the diagnostic. Every red teaches something; every fix moves one step deeper into the stack.
Iteration index¶
| Run | Date | Verdict | Root cause | Fix commit |
|---|---|---|---|---|
| r2 | 2026-04-20 | RED | VPC CIDR typo 10.260.0.0/24 (invalid IPv4) |
18f05d2 |
| r3 | 2026-04-20 | RED | Concurrent VPC CIDR collision (both groups requested same /24) + DO tag dots |
82d9fe2, 4998ae2 |
| r3-hermes | 2026-04-20 | RED | Stale CIDR hardcoded in firewall inbound_rule source_addresses |
c7bffef |
| r4 | 2026-04-20 | RED | Stale .pub file → wrong SSH key fingerprint in repo secret |
key rotation |
| r5-openclaw | 2026-04-20 | MIXED | Phase B shell arithmetic crash on multi-line output + Phase C firewall-blocked public IP | ac6f56d (hermes) + scenario fixes |
| r5-hermes | 2026-04-20 | RED | hermes -p is --profile, not --prompt |
ac6f56d |
| r6-openclaw | 2026-04-20 | RED (honest) | MCP stdio writes bypass federation fanout coordinator → ai-memory-mcp#318, targeted v0.6.0.1 | substrate fix pending |
| r6-hermes | 2026-04-20 | RED | hermes import fails with ModuleNotFoundError: dotenv — upstream install.sh gap |
e129b53 |
| r7 | 2026-04-20 | RED | (a) openclaw install.sh exit-1 despite success (non-TTY wizard) (b) baseline PROBE F2 caught #318 | dd49e5a (real openclaw install) + bd0db1b (baseline spec) |
| r7 (post-baseline) | 2026-04-20 | RED (honest) | Baseline gate working — F2b canary correctly flagged #318 | gate doing its job |
| r8 | 2026-04-21 | RED | openclaw pipe SIGPIPE on openclaw --version \| head -1 under set -o pipefail |
195e349 |
| r9 (hermes only) | 2026-04-21 | GREEN | First successful campaign — baseline + F3 + S1b all pass | — |
| r9 (openclaw) | 2026-04-21 | RED | Race: dispatched before install fix landed; install.sh exit-1 still | bc5f025 |
| r10 (openclaw) | 2026-04-21 | RED | npm openclaw@2026.4.20 ETARGET — that's a display label, not an npm semver |
revert to install.sh + drift-capture |
| r11 | 2026-04-21 | CANCELLED | Provision step stalled 37+ min — no timeouts on F2b canary / install scripts | 6face55 (timeout hardening) |
| r12 | 2026-04-21 | in flight | Full timeout-hardened pipeline | — |
Key learnings¶
Lesson 1 — set -o pipefail + | head -N is a SIGPIPE landmine¶
In r7–r8, openclaw --version 2>&1 | head -1 returned non-zero despite openclaw succeeding — head closed stdin after 1 line, SIGPIPE'd openclaw, and pipefail propagated the non-zero exit. Fix: capture output to a variable, then echo | head from the variable. Commit 195e349.
Lesson 2 — Upstream install scripts exit non-zero on headless environments¶
Both openclaw.ai/install.sh and parts of hermes-agent's installer assume a TTY and exit non-zero when run from a GitHub Actions runner (no TTY). The binaries install correctly; the wizard fails. Fix: pipe install through | sed 'prefix' and ignore exit code; rely on presence check. Commit bc5f025.
Lesson 3 — Every long-running subprocess MUST have a timeout¶
r11 hung 37+ min on Provision because F2b agent canary had no timeout — if xAI Grok rate-limits or the agent CLI stdin-blocks, the script hangs until the 60-min workflow timeout. Fix: explicit timeout <sec> on every subprocess. Commit 6face55.
Lesson 4 — Display versions ≠ registry versions¶
"OpenClaw 2026.4.20 (f50202e)" is a git-commit-style label, NOT an npm semver. npm install -g openclaw@2026.4.20 fails with ETARGET. Pinning requires knowing the actual npm semver (currently unknown for openclaw — documented as an upstream gap in baseline.md §3b).
Lesson 5 — The gate catches real product bugs¶
r6-openclaw and r7-hermes both surfaced ai-memory-mcp#318 — MCP stdio writes don't trigger federation fanout. The gate's job is exactly this: to refuse to run scenarios when the substrate has gaps. F2b (non-gating in v1.2.0) correctly reports this bug on every dispatch while F2a (HTTP substrate, gating) passes — making the dashboard tell the honest story without halting CI.
Lesson 6 — jq '// empty' treats false as null¶
A subtle bug in the HTML generator: jq '.pass // empty' returned empty string when .pass: false because jq's // treats false as the null-alternative. Fix: jq 'if .pass == null then empty else .pass end'. This made r6-openclaw render "UNKNOWN" instead of "FAIL" on the dashboard. [Commit chain around e30586f].
Lesson 7 — VPC CIDR partitioning is mandatory for concurrent dispatch¶
Running both groups simultaneously requires distinct CIDRs; otherwise terraform race-conditions on VPC creation. local.vpc_cidr partitions to 10.251.0.0/24 (openclaw) vs 10.252.0.0/24 (hermes). [Commit 82d9fe2].
Lesson 8 — UFW on Ubuntu 24.04 is default-on and must be hard-disabled¶
Ship-gate r21/r23 lesson, re-confirmed here: Ubuntu 24.04 ships UFW enabled and configured, interfering with loopback + intra-VPC traffic the federation mesh needs. setup_node.sh does ufw --force reset && ufw --force disable twice (reset re-enables on some UFW builds), then iptables -P ACCEPT + iptables -F, then baseline-level verification with exit 3 on failure. Same belt-and-suspenders pattern as the secret redaction.
Lesson 9 — "Absolutely truthful reporting" is a code contract¶
Every scenario emits a JSON verdict BEFORE the exit code is set. Every failure lists specific reasons. Every attestation is computed from live droplet state, never from config intent. The dashboard renders what the JSON says, not what we wish it said. testbook.md §4.4 + baseline.md §8.
Forward-looking (not yet surfaced, documented for completeness)¶
Expected future incidents and their mitigations:
-
xAI rate limiting under S4 burst: scenario 4 writes 90 rows in ~5s with 3 agents. If xAI throttles the agent CLIs, F2b will fail cleanly (60s timeout). S4 itself doesn't use xAI — it's HTTP-direct — so rate limits don't block the scenario. Baseline attestation captures the issue as F2b red while S4 passes.
-
DO region capacity: a specific region running out of droplet capacity would fail terraform apply at step 3.
regionis a workflow input; operators can switch tosfo3,ams3, etc. -
ai-memory v0.6.0 EOL: when v0.6.0.1 ships with the #318 fix, the workflow's
ai_memory_git_refdefault bumps to v0.6.0.1 + F2b moves from "attestation-only" to "gates baseline" (minor spec bump per baseline.md §12). -
openclaw post-install wizard change: future openclaw releases may change the post-install behavior (could fail in a new way OR finally work headless). The drift-capture in
framework_versionattestation field would surface the version that's running. -
Hermes breaking change on
mainref: we pin the install script URL tomainbranch (unpinned at tag level — documented gap). A future hermesmaincould change the install surface.HERMES_INSTALL_REFenv override exists to pin to a specific tag once upstream ships a stable release tag.
Change log of this document¶
- 2026-04-21 — initial authoring after r12 dispatch. Captures r2–r11 full post-mortem.
Future entries go at the top; keep the table at the top sorted newest-first once r12+ land.