Operator runbook

Day-to-day operations for an operator running A2A gate campaigns. Every task below is a single named procedure, runnable in bounded time.

1. Dispatch a campaign

Dispatch one campaign per agent group, using the documented testbook defaults:

FORK=alphaonedev/ai-memory-ai2ai-gate   # or your fork

# IronClaw campaign (primary Rust agent — default since 2026-04-21)
gh workflow run a2a-gate.yml -R "$FORK" \
  -f ai_memory_git_ref=v0.6.1 \
  -f agent_group=ironclaw \
  -f campaign_id="a2a-ironclaw-v0.6.1-r$(date +%s)" \
  -f scenarios="1 1b 2 4 9 10 15 17"

# Hermes campaign
gh workflow run a2a-gate.yml -R "$FORK" \
  -f ai_memory_git_ref=v0.6.1 \
  -f agent_group=hermes \
  -f campaign_id="a2a-hermes-v0.6.1-r$(date +%s)" \
  -f scenarios="1 1b 2 4 9 10 15 17"

Both can run concurrently — they provision distinct VPCs (10.251/24 ironclaw vs 10.252/24 hermes). Legacy openclaw dispatches remain accepted (agent_group=openclaw) but require a 16GB droplet slug override + DO account-tier upgrade — see agents/ironclaw.md for the switch rationale.
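The two dispatches above differ only in agent_group, so they can be wrapped in one helper. The following is a minimal sketch and not part of the repository: the function name dispatch_cmd is hypothetical, and it only prints the gh command so that it can be reviewed before anything is dispatched.

```shell
# Hypothetical helper: print the dispatch command for one agent group.
# Usage: dispatch_cmd <agent_group> <git_ref> [scenarios]
FORK="${FORK:-alphaonedev/ai-memory-ai2ai-gate}"

dispatch_cmd() {
  group="$1"; ref="$2"; scenarios="${3:-1 1b 2 4 9 10 15 17}"
  # Only the agent groups named in this runbook are accepted.
  case "$group" in
    ironclaw|hermes|openclaw) ;;
    *) echo "unknown agent_group: $group" >&2; return 1 ;;
  esac
  # Same campaign_id shape as above: a2a-<group>-<ref>-r<epoch seconds>.
  id="a2a-${group}-${ref}-r$(date +%s)"
  printf 'gh workflow run a2a-gate.yml -R "%s" -f ai_memory_git_ref=%s -f agent_group=%s -f campaign_id="%s" -f scenarios="%s"\n' \
    "$FORK" "$ref" "$group" "$id" "$scenarios"
}
```

Running dispatch_cmd ironclaw v0.6.1 prints one command line; once it looks right, it can be pasted into the shell or piped to sh.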

2. Watch a running campaign

gh run list -R alphaonedev/ai-memory-ai2ai-gate --workflow=a2a-gate.yml --limit 3
gh run watch -R alphaonedev/ai-memory-ai2ai-gate <run-id>

Per-step progress:

# Get the job ID
JOB=$(gh run view <run-id> --repo alphaonedev/ai-memory-ai2ai-gate --json jobs --jq '.jobs[0].databaseId')
gh run view --job=$JOB --repo alphaonedev/ai-memory-ai2ai-gate

Log output of any failed steps (available once the job has finished):

gh run view <run-id> --repo alphaonedev/ai-memory-ai2ai-gate --log-failed
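To avoid copying the run ID by hand, the most recent run can be looked up first. This is a sketch under the assumption that the newest a2a-gate.yml run is the one of interest; the function names latest_run_id and watch_latest are hypothetical.

```shell
REPO="alphaonedev/ai-memory-ai2ai-gate"

# Print the database ID of the most recent a2a-gate.yml run.
latest_run_id() {
  gh run list -R "$REPO" --workflow=a2a-gate.yml --limit 1 \
    --json databaseId --jq '.[0].databaseId'
}

# Watch the most recent run; --exit-status makes gh exit non-zero if the run fails.
watch_latest() {
  run_id="$(latest_run_id)" || return 1
  [ -n "$run_id" ] || { echo "no runs found" >&2; return 1; }
  gh run watch -R "$REPO" "$run_id" --exit-status
}
```

If two campaigns were dispatched concurrently, the newest run may not be the one wanted; in that case pick the ID from gh run list instead.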

3. Expected timing

Step	OpenClaw	Hermes
Terraform apply	~60s	~60s
SSH wait	~30s	~30s
Provision (4 nodes)	~5-10 min	~12-20 min
Collect + enforce BASELINE	~10s	~10s
F3 peer A2A probe	~12s	~12s
Scenarios (8 default)	~10-15 min	~10-15 min
Aggregate + HTML + redact + commit + destroy	~90s	~90s
Total	15-25 min	25-40 min

If any step exceeds 1.5× its expected duration, investigate — either upstream (xAI, npm, GitHub, DO) has issues OR there's a harness regression.
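The 1.5× rule can be checked mechanically once a step's duration is known. The sketch below assumes durations in whole seconds and uses the upper bound of a range as the expected value; the function name over_budget is hypothetical.

```shell
# Return 0 (true) when actual exceeds 1.5x expected; both in whole seconds.
# Integer form of "actual > 1.5 * expected", since POSIX shell has no floats.
over_budget() {
  expected="$1"; actual="$2"
  [ $((actual * 2)) -gt $((expected * 3)) ]
}

# Example: Terraform apply is expected to take ~60s.
over_budget 60 95 && echo "Terraform apply is over budget, investigate"
```

A step that takes exactly 1.5× its expected duration (90s here) does not trigger the check; only a strictly longer one does.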

4. Triage a baseline violation

If baseline_ok=false at step 6 of the workflow:

1. git pull to get the committed runs/<campaign-id>/a2a-baseline.json
2. jq '.per_node[] | {node_index, baseline_pass, config_attestation, functional_probes, negative_invariants}' runs/<campaign-id>/a2a-baseline.json
3. Find the first field that is false for each node. That's the failing invariant.
4. Consult the violation table:

Field (false)	Class	Likely cause	First fix to try
`framework_is_authentic`	C1	Binary is a symlink to another CLI	Check `readlink -f $(which openclaw)` on the droplet; re-run install
`mcp_server_ai_memory_registered`	C2	Config file malformed	Inspect `~/.openclaw/openclaw.json` or `~/.hermes/config.yaml`
`llm_backend_is_xai_grok`	C3	Model string wrong in config	Check `default_model` field; spec says `grok-4-fast-non-reasoning`
`llm_is_default_provider`	C4	`defaultProvider` != `xai`	OpenClaw config only
`mcp_command_is_ai_memory`	C5	MCP server `command` not `ai-memory`	Config drift; re-run setup_node.sh
`agent_id_stamped`	C6	`AI_MEMORY_AGENT_ID` env not in MCP config	Check `AGENT_ID` was exported before setup_node.sh
`federation_live`	C7	Local `ai-memory serve` crashed or port 9077 unreachable	`/var/log/ai-memory-serve.log` on the droplet; UFW status
`ufw_disabled`	C8	UFW re-enabled itself	`ufw status verbose`; check for re-enable scripts
`iptables_flushed`	C9	Residual DROP rules	`iptables -L -v`
`dead_man_switch_scheduled`	C10	`shutdown -P +480` not running	`ps aux | grep shutdown`
`xai_grok_chat_reachable` (F1)	Functional	xAI key invalid / out of credit / network	`curl -v https://api.x.ai/v1/models -H "Authorization: Bearer $XAI_API_KEY"` from droplet
`substrate_http_canary_f2a` (F2a)	Functional	Local serve HTTP path broken — unusual	Check ai-memory serve log; spec bug if this fails
`agent_mcp_canary_f2b` (F2b)	Functional (non-gating)	Agent didn't invoke tool correctly OR #318 substrate bug	See `/tmp/canary-<agent_type>.log` for agent response; expected to fail until #318 ships
`a2a_protocol_off`	Negative	Config file lost the explicit disable	Re-run setup_node.sh
`tool_allowlist_is_memory_only`	Negative	Non-`memory_*` tool leaked into allowlist	Check config file; re-run
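Finding the false fields by eye gets tedious across four nodes. The following sketch lists them directly; it assumes that config_attestation, functional_probes, and negative_invariants are each flat objects of booleans, which the source does not spell out, and the function name failing_invariants is hypothetical.

```shell
# List every invariant that is false, one "node<TAB>group<TAB>field" line each.
# Assumes the three groups are flat objects whose values are booleans.
failing_invariants() {
  jq -r '
    .per_node[]
    | .node_index as $n
    | {config_attestation, functional_probes, negative_invariants}
    | to_entries[]
    | .key as $group
    | (.value // {})
    | to_entries[]
    | select(.value == false)
    | "\($n)\t\($group)\t\(.key)"
  ' "$1"
}
```

Usage: failing_invariants runs/<campaign-id>/a2a-baseline.json, then look up each printed field in the table above.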

5. Triage F3 failure

If baseline passes but F3 fails:

1. jq . runs/<campaign-id>/f3-peer-a2a.json
2. Note the canary_uuid and writer_agent.
3. F3 writes to node-1 and verifies on nodes 2, 3, 4. If F3 fails, federation fanout is broken from node-1: escalate to the ai-memory-mcp team.

6. Triage a hung campaign

A campaign stalled at the same step for >15 min beyond expected:

1. Check which step is running via gh run view --job=$JOB.
2. If it is "Provision all 4 nodes", something in setup_node.sh is hung. Every long-running subprocess has a timeout (see architecture.md §4); the worst case is a 600s install timeout + 60s agent canary + 15 min baseline collection = ~26 min ceiling per run.
3. If the run is truly stuck past that ceiling, run gh run cancel <run-id> to stop DO spend, then inspect the Actions log for the last [setup-node-N HH:MM:SS] timestamp emitted.
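Locating the last setup-node marker in a long log is easier with grep. This sketch assumes the log has been saved to a file first (for example with gh run view <run-id> --log > run.log) and that N in the marker is a plain number; the function name last_setup_marker is hypothetical.

```shell
# Print the last "[setup-node-N HH:MM:SS]" marker found in a saved log file.
last_setup_marker() {
  grep -oE '\[setup-node-[0-9]+ [0-9]{2}:[0-9]{2}:[0-9]{2}\]' "$1" | tail -n 1
}
```

The node number in the printed marker shows which droplet was last making progress, and the timestamp shows when it stopped.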

Historical incidents (see incidents.md):

- r11 stalled 37+ min on Provision → added timeouts to every subprocess in commit 6face55. Should not recur.

7. Cancel a campaign

gh run cancel <run-id> --repo alphaonedev/ai-memory-ai2ai-gate

terraform destroy runs automatically in the if: always() teardown step, so cancellation is safe — no orphan droplets. If for some reason destroy fails (DO API outage), the 8h dead-man switch on every droplet is the backstop.

Manual teardown (only if needed):

cd terraform
terraform destroy -auto-approve \
  -var "campaign_id=<id>" \
  -var "do_token=$DO_TOKEN" \
  -var "ssh_key_fingerprint=$SSH_FP"
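The manual teardown depends on DO_TOKEN and SSH_FP being set in the operator's shell. A small guard can fail early when one is missing; this is a sketch, and the function name require_env is hypothetical.

```shell
# Fail early if any named variable is unset or empty.
require_env() {
  for name in "$@"; do
    # Indirect lookup of the variable whose name is in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      return 1
    fi
  done
}
```

Usage: require_env DO_TOKEN SSH_FP && terraform destroy ..., so that destroy is attempted only when both values are present.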

8. Inspect evidence locally

git pull
cd runs/<campaign-id>

# Overall campaign verdict
jq '.overall_pass, .reasons' a2a-summary.json

# Baseline per-node view
jq '[.per_node[] | {node: .node_index, pass: .baseline_pass}]' a2a-baseline.json

# Per-scenario verdict grid
for f in scenario-*.json; do
  echo "=== $f ==="
  jq '{scenario, pass, reasons}' "$f"
done

# Open the human-readable page
open index.html     # macOS
xdg-open index.html # Linux

Or browse on Pages: https://alphaonedev.github.io/ai-memory-ai2ai-gate/evidence/<campaign-id>/
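For a quick pass/fail count before opening the HTML page, the per-scenario files can be tallied. The sketch below assumes each scenario-*.json carries a boolean pass field, as the jq filter above suggests; the function name scenario_tally is hypothetical.

```shell
# Count passing and failing scenario-*.json files in the current directory.
# Failing file names go to stderr; the tally goes to stdout.
scenario_tally() {
  pass=0; fail=0
  for f in scenario-*.json; do
    [ -e "$f" ] || continue   # no matches: the glob stays literal, skip it
    # jq -e exits 0 only when the expression evaluates to true here.
    if jq -e '.pass == true' "$f" >/dev/null; then
      pass=$((pass + 1))
    else
      fail=$((fail + 1)); echo "FAIL $f" >&2
    fi
  done
  echo "pass=$pass fail=$fail"
}
```

Run it from inside runs/<campaign-id>; a file that is missing the pass field is counted as a failure.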

9. Add a new scenario

1. Write scripts/scenarios/<N>_<slug>.sh following the pattern of 15_read_your_writes.sh (simplest).
2. Emit JSON on stdout + logs on stderr. Final JSON must match the contract in testbook.md §4.4.
3. chmod +x scripts/scenarios/<N>_<slug>.sh.
4. Add to the default dispatch scenarios in .github/workflows/a2a-gate.yml OR pass via -f scenarios="..." at dispatch.
5. Add a full test plan entry to docs/testbook.md §4 (Objective / Pre-conditions / Procedure / Pass criteria / Failure modes / Evidence).
6. Update the coverage matrix in testbook.md §3.
7. Bump the test book version (minor or major per §7 change-control).
8. Commit + push + dispatch.
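The stdout/stderr split in the steps above can be sketched as follows. This is a hypothetical skeleton, not a copy of 15_read_your_writes.sh: the field names scenario, pass, and reasons are taken from the jq filter used when inspecting evidence, and the real contract in testbook.md §4.4 may require more fields or different types.

```shell
# Hypothetical skeleton for a scenario script.
# Logs go to stderr; only the final JSON verdict goes to stdout.
log() { echo "[scenario] $*" >&2; }

run_scenario() {
  scenario="$1"
  log "starting scenario $scenario"
  # ... the real write/read checks against the nodes would go here ...
  pass=true
  reasons='[]'
  log "finished scenario $scenario pass=$pass"
  # Single JSON object on stdout, so callers can pipe it straight into jq.
  printf '{"scenario":"%s","pass":%s,"reasons":%s}\n' "$scenario" "$pass" "$reasons"
}
```

Keeping logs off stdout means run_scenario 18 2>/dev/null yields exactly one JSON line that a harness can parse.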

10. Tighten a baseline invariant

1. Update the invariant spec in docs/baseline.md §8.1.
2. Add the check logic to scripts/setup_node.sh (follow the baseline_check / jq -e patterns).
3. Add the field to the jq emit block.
4. Add the field to the baseline_pass conjunction.
5. Update scripts/generate_run_html.sh render_baseline() table to include the new column.
6. Bump baseline spec version in docs/baseline.md §0 changelog.
7. Commit. The pages workflow regenerates every run's HTML on next deploy so historical runs retroactively render with a — in the new column.
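The idea behind the baseline_pass conjunction is that one false field fails the whole node. The sketch below illustrates that logic only; it is not the code in setup_node.sh, which may compute the conjunction differently (for example inside jq), and the names all_true and my_new_invariant are hypothetical.

```shell
# Return 0 only if every argument is the literal string "true".
all_true() {
  for v in "$@"; do
    [ "$v" = "true" ] || return 1
  done
}

# Illustrative use: a newly added invariant joins the existing ones.
federation_live=true
ufw_disabled=true
my_new_invariant=false   # hypothetical new field
if all_true "$federation_live" "$ufw_disabled" "$my_new_invariant"; then
  baseline_pass=true
else
  baseline_pass=false
fi
echo "baseline_pass=$baseline_pass"
```

Forgetting to add the new field to the conjunction is the failure mode to watch for: the field would be emitted but could never fail the baseline.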

11. Rotate credentials

# xAI key
gh secret set XAI_API_KEY -R alphaonedev/ai-memory-ai2ai-gate

# DO token
gh secret set DIGITALOCEAN_TOKEN -R alphaonedev/ai-memory-ai2ai-gate

# SSH key (after registering new public half with DO)
gh secret set DIGITALOCEAN_SSH_PRIVATE_KEY -R alphaonedev/ai-memory-ai2ai-gate < ~/.ssh/id_ed25519
gh secret set DIGITALOCEAN_SSH_KEY_FINGERPRINT -R alphaonedev/ai-memory-ai2ai-gate

The redaction pass (see baseline.md §9b) catches any pre-rotation secret value that might remain in historical logs: it regex-masks anything matching xai-[A-Za-z0-9_-]{20,}, even when the specific old key isn't known to the workflow. Rotated-key safety is therefore automatic for xAI keys.
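The effect of that pattern can be reproduced locally with sed. This sketch uses the regex quoted above; the replacement text xai-REDACTED and the function name mask_xai_keys are hypothetical, and the actual redaction pass may substitute something else.

```shell
# Mask anything shaped like an xAI key; reads stdin, writes stdout.
# The pattern is the one quoted above; "xai-REDACTED" is an illustrative mask.
mask_xai_keys() {
  sed -E 's/xai-[A-Za-z0-9_-]{20,}/xai-REDACTED/g'
}
```

Strings shorter than 20 characters after the xai- prefix are left untouched, so ordinary words that happen to start with xai- are not masked.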

12. Investigate a past run

# Every artifact is in git history
git log -- runs/a2a-openclaw-v0.6.0-r7/
git show <commit-sha> -- runs/a2a-openclaw-v0.6.0-r7/a2a-summary.json

Every commit includes the campaign ID in the message. The actor + workflow URL are in the run's campaign.meta.json.
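When a campaign's directory was touched by several commits, a compact history helps pick the right one for git show. A minimal sketch, assuming it is run from the repository root; the function name campaign_history is hypothetical.

```shell
# List commits that touched one campaign's evidence directory, oldest first.
# Output per line: short hash, date (YYYY-MM-DD), subject.
campaign_history() {
  git log --reverse --format='%h %ad %s' --date=short -- "runs/$1/"
}
```

Usage: campaign_history a2a-openclaw-v0.6.0-r7, then pass the chosen hash to git show as in the commands above.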