Operator runbook
Day-to-day operations for an operator running A2A gate campaigns. Every task below is a single named procedure, runnable in bounded time.
1. Dispatch a campaign
Per-group, matching the documented testbook defaults:
FORK=alphaonedev/ai-memory-ai2ai-gate # or your fork
# IronClaw campaign (primary Rust agent — default since 2026-04-21)
gh workflow run a2a-gate.yml -R "$FORK" \
-f ai_memory_git_ref=v0.6.1 \
-f agent_group=ironclaw \
-f campaign_id="a2a-ironclaw-v0.6.1-r$(date +%s)" \
-f scenarios="1 1b 2 4 9 10 15 17"
# Hermes campaign
gh workflow run a2a-gate.yml -R "$FORK" \
-f ai_memory_git_ref=v0.6.1 \
-f agent_group=hermes \
-f campaign_id="a2a-hermes-v0.6.1-r$(date +%s)" \
-f scenarios="1 1b 2 4 9 10 15 17"
Both can run concurrently — they provision distinct VPCs (10.251/24 ironclaw vs 10.252/24 hermes). Legacy openclaw dispatches remain accepted (agent_group=openclaw) but require a 16GB droplet slug override + DO account-tier upgrade — see agents/ironclaw.md for the switch rationale.
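The two dispatches above differ only in `agent_group`, so a small wrapper can keep the campaign ID format consistent. The following is a minimal sketch: `campaign_id` and `dispatch_cmd` are hypothetical helper names, not part of the repository, and the second one only prints the `gh` command so it can be reviewed before it is run.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of the repository): build a campaign ID in the
# a2a-<group>-<ref>-r<epoch> form used above and print the matching dispatch
# command for review.

campaign_id() {
  # $1 = agent group, $2 = ai-memory git ref, $3 = optional epoch (defaults to now)
  local group="$1" ref="$2" epoch="${3:-$(date +%s)}"
  case "$group" in
    ironclaw|hermes|openclaw) ;;
    *) echo "unknown agent_group: $group" >&2; return 1 ;;
  esac
  printf 'a2a-%s-%s-r%s\n' "$group" "$ref" "$epoch"
}

dispatch_cmd() {
  # $1 = fork, $2 = agent group, $3 = git ref, $4 = optional epoch.
  # Prints the gh invocation instead of executing it.
  local fork="$1" group="$2" ref="$3" id
  id="$(campaign_id "$group" "$ref" "${4:-}")" || return 1
  printf 'gh workflow run a2a-gate.yml -R %s -f ai_memory_git_ref=%s -f agent_group=%s -f campaign_id=%s -f scenarios="1 1b 2 4 9 10 15 17"\n' \
    "$fork" "$ref" "$group" "$id"
}
```

Running `dispatch_cmd "$FORK" ironclaw v0.6.1` prints one command line; paste it into the shell once it looks right.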
2. Watch a running campaign
gh run list -R alphaonedev/ai-memory-ai2ai-gate --workflow=a2a-gate.yml --limit 3
gh run watch -R alphaonedev/ai-memory-ai2ai-gate <run-id>
Per-step progress:
# Get the job ID
JOB=$(gh run view <run-id> --repo alphaonedev/ai-memory-ai2ai-gate --json jobs --jq '.jobs[0].databaseId')
gh run view --job=$JOB --repo alphaonedev/ai-memory-ai2ai-gate
Log output of the failed steps (available once the run has completed):
gh run view <run-id> --repo alphaonedev/ai-memory-ai2ai-gate --log-failed
3. Expected timing
| Step | OpenClaw | Hermes |
|---|---|---|
| Terraform apply | ~60s | ~60s |
| SSH wait | ~30s | ~30s |
| Provision (4 nodes) | ~5-10 min | ~12-20 min |
| Collect + enforce BASELINE | ~10s | ~10s |
| F3 peer A2A probe | ~12s | ~12s |
| Scenarios (8 default) | ~10-15 min | ~10-15 min |
| Aggregate + HTML + redact + commit + destroy | ~90s | ~90s |
| Total | 15-25 min | 25-40 min |
If any step exceeds 1.5× its expected duration, investigate: either an upstream dependency (xAI, npm, GitHub, DO) is having issues or there is a harness regression.
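The 1.5× rule can be checked mechanically. The sketch below uses a hypothetical helper, `over_budget`, that takes the expected and observed durations in whole seconds. For steps listed as a range, using the upper bound as the expected value is an assumption, not something the testbook prescribes.

```shell
# Succeeds (exit 0) when an observed step duration exceeds 1.5x its expected
# duration. Both arguments are whole seconds. Integer-only arithmetic:
#   actual > 1.5 * expected   <=>   2 * actual > 3 * expected
over_budget() {
  local expected_s="$1" actual_s="$2"
  [ $(( 2 * actual_s )) -gt $(( 3 * expected_s )) ]
}

# Example: Terraform apply is expected to take ~60s, so the threshold is 90s.
if over_budget 60 95; then
  echo "terraform apply: 95s exceeds 1.5x of 60s, investigate"
fi
```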
4. Triage a baseline violation
If baseline_ok=false at step 6 of the workflow:
- `git pull` to get the committed `runs/<campaign-id>/a2a-baseline.json`
- `jq '.per_node[] | {node_index, baseline_pass, config_attestation, functional_probes, negative_invariants}' runs/<campaign-id>/a2a-baseline.json`
- Find the first field that is `false` for each node. That's the failing invariant.
- Consult the violation table:
| Field (false) | Class | Likely cause | First fix to try |
|---|---|---|---|
| `framework_is_authentic` | C1 | Binary is a symlink to another CLI | Check `readlink -f $(which openclaw)` on the droplet; re-run install |
| `mcp_server_ai_memory_registered` | C2 | Config file malformed | Inspect `~/.openclaw/openclaw.json` or `~/.hermes/config.yaml` |
| `llm_backend_is_xai_grok` | C3 | Model string wrong in config | Check `default_model` field; spec says `grok-4-fast-non-reasoning` |
| `llm_is_default_provider` | C4 | `defaultProvider != xai` | OpenClaw config only |
| `mcp_command_is_ai_memory` | C5 | MCP server command not `ai-memory` | Config drift; re-run `setup_node.sh` |
| `agent_id_stamped` | C6 | `AI_MEMORY_AGENT_ID` env not in MCP config | Check `AGENT_ID` was exported before `setup_node.sh` |
| `federation_live` | C7 | Local `ai-memory serve` crashed or port 9077 unreachable | `/var/log/ai-memory-serve.log` on the droplet; UFW status |
| `ufw_disabled` | C8 | UFW re-enabled itself | `ufw status verbose`; check for re-enable scripts |
| `iptables_flushed` | C9 | Residual DROP rules | `iptables -L -v` |
| `dead_man_switch_scheduled` | C10 | `shutdown -P +480` not running | `ps aux \| grep shutdown` |
| `xai_grok_chat_reachable` (F1) | Functional | xAI key invalid / out of credit / network | `curl -v https://api.x.ai/v1/models -H "Authorization: Bearer $XAI_API_KEY"` from droplet |
| `substrate_http_canary_f2a` (F2a) | Functional | Local serve HTTP path broken (unusual) | Check `ai-memory serve` log; spec bug if this fails |
| `agent_mcp_canary_f2b` (F2b) | Functional (non-gating) | Agent didn't invoke tool correctly OR #318 substrate bug | See `/tmp/canary-<agent_type>.log` for agent response; expected to fail until #318 ships |
| `a2a_protocol_off` | Negative | Config file lost the explicit disable | Re-run `setup_node.sh` |
| `tool_allowlist_is_memory_only` | Negative | Non-`memory_*` tool leaked into allowlist | Check config file; re-run |
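Finding the first false field per node can be scripted instead of read by eye. This is a sketch under one assumption that the runbook does not confirm: that `config_attestation`, `functional_probes` and `negative_invariants` are flat objects mapping field name to boolean. `first_false_per_node` is a hypothetical helper name.

```shell
# Print the first false field for each node in an a2a-baseline.json file.
# Assumption: the three sub-objects are flat {field: boolean} maps; anything
# that is not an object is skipped.
first_false_per_node() {
  jq -r '
    .per_node[]
    | .node_index as $n
    | [ (.config_attestation, .functional_probes, .negative_invariants)
        | select(type == "object")
        | to_entries[]
        | select(.value == false)
        | .key ]
    | "node \($n): \(.[0] // "no false fields")"
  ' "$1"
}
```

Call it as `first_false_per_node runs/<campaign-id>/a2a-baseline.json` and look up each reported field in the table above. Because `agent_mcp_canary_f2b` is non-gating, a node that reports only that field may still have passed.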
5. Triage F3 failure
If baseline passes but F3 fails:
- `jq . runs/<campaign-id>/f3-peer-a2a.json`
- Note the `canary_uuid` and `writer_agent`
- F3 writes to node-1 and verifies on nodes 2, 3, 4. If F3 fails, federation fanout is broken from node-1; escalate to the ai-memory-mcp team.
6. Triage a hung campaign
A campaign stalled at the same step for >15 min beyond expected:
- Check which step via `gh run view --job=$JOB`
- If "Provision all 4 nodes": something in `setup_node.sh` is hung. Every long-running subprocess has a timeout (see architecture.md §4); worst case, a 600s install timeout + 60s agent canary + 15 min baseline collection = ~26 min ceiling per run.
- If truly stuck past the ceiling: `gh run cancel <run-id>` to stop DO spend, then inspect the Actions log for the last `[setup-node-N HH:MM:SS]` timestamp emitted.
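Locating the last `[setup-node-N HH:MM:SS]` marker is a one-line filter over the Actions log. The helper name `last_setup_marker` is hypothetical. The pattern assumes the marker appears in exactly the bracketed form quoted above.

```shell
# Read an Actions log on stdin and print the last [setup-node-N HH:MM:SS]
# marker it contains (prints nothing if no marker is present).
last_setup_marker() {
  grep -oE '\[setup-node-[0-9]+ [0-9]{2}:[0-9]{2}:[0-9]{2}\]' | tail -n 1
}

# Usage, after the run has been cancelled and its log is available:
#   gh run view <run-id> --repo alphaonedev/ai-memory-ai2ai-gate --log | last_setup_marker
```

The node number in the marker shows which droplet was still provisioning, and the timestamp shows when it last made progress.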
Historical incidents (see incidents.md):
- r11 stalled 37+ min on Provision → added timeouts to every subprocess in commit 6face55. Should not recur.
7. Cancel a campaign
gh run cancel <run-id> --repo alphaonedev/ai-memory-ai2ai-gate
terraform destroy runs automatically in the if: always() teardown step, so cancellation is safe — no orphan droplets. If for some reason destroy fails (DO API outage), the 8h dead-man switch on every droplet is the backstop.
Manual teardown (only if needed):
cd terraform
terraform destroy -auto-approve \
-var "campaign_id=<id>" \
-var "do_token=$DO_TOKEN" \
-var "ssh_key_fingerprint=$SSH_FP"
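The manual teardown depends on `DO_TOKEN` and `SSH_FP` being set in the shell; an empty value would be passed to Terraform as an empty variable. A small preflight guard can catch that first. `require_env` is a hypothetical helper, shown here as a sketch rather than something the repository provides.

```shell
# Hypothetical preflight: fail if any named environment variable is unset or
# empty. Variable names are passed as arguments and must be trusted input,
# because they are expanded with eval.
require_env() {
  local name value missing=0
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage before the manual teardown:
#   require_env DO_TOKEN SSH_FP && terraform destroy -auto-approve ...
```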
8. Inspect evidence locally
git pull
cd runs/<campaign-id>
# Overall campaign verdict
jq '.overall_pass, .reasons' a2a-summary.json
# Baseline per-node view
jq '[.per_node[] | {node: .node_index, pass: .baseline_pass}]' a2a-baseline.json
# Per-scenario verdict grid
for f in scenario-*.json; do
echo "=== $f ==="
jq '{scenario, pass, reasons}' "$f"
done
# Open the human-readable page
open index.html # macOS
xdg-open index.html # Linux
Or browse on Pages: https://alphaonedev.github.io/ai-memory-ai2ai-gate/evidence/<campaign-id>/
9. Add a new scenario
- Write `scripts/scenarios/<N>_<slug>.sh` following the pattern of `15_read_your_writes.sh` (simplest).
- Emit JSON on stdout + logs on stderr. Final JSON must match the contract in testbook.md §4.4.
- `chmod +x scripts/scenarios/<N>_<slug>.sh`.
- Add to the default dispatch scenarios in `.github/workflows/a2a-gate.yml` OR pass via `-f scenarios="..."` at dispatch.
- Add a full test plan entry to `docs/testbook.md` §4 (Objective / Pre-conditions / Procedure / Pass criteria / Failure modes / Evidence).
- Update the coverage matrix in testbook.md §3.
- Bump the test book version (minor or major per §7 change-control).
- Commit + push + dispatch.
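The stdout/stderr split in the second step is sketched below. The authoritative output contract is testbook.md §4.4. This skeleton emits only `scenario`, `pass` and `reasons`, the three fields that the evidence loop in section 8 reads. The helper names, the scenario ID `99`, and treating `scenario` as a string are all assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical scenario skeleton: logs go to stderr, exactly one JSON object
# goes to stdout. Check testbook.md §4.4 for the full contract.

log() { printf '[scenario] %s\n' "$*" >&2; }

emit_result() {
  # $1 = scenario id, $2 = true|false, remaining args = reason strings.
  # No JSON escaping is done, so reasons must not contain double quotes or
  # backslashes.
  local scenario="$1" pass="$2" reasons="" r
  shift 2
  for r in "$@"; do
    reasons="${reasons:+$reasons,}\"$r\""
  done
  printf '{"scenario":"%s","pass":%s,"reasons":[%s]}\n' "$scenario" "$pass" "$reasons"
}

log "starting placeholder check"
emit_result "99" true
```

Keeping every diagnostic on stderr means the harness can parse stdout as JSON without filtering.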
10. Tighten a baseline invariant
- Update the invariant spec in `docs/baseline.md` §8.1.
- Add the check logic to `scripts/setup_node.sh` (follow the `baseline_check` / `jq -e` patterns).
- Add the field to the jq emit block.
- Add the field to the `baseline_pass` conjunction.
- Update `scripts/generate_run_html.sh` `render_baseline()` table to include the new column.
- Bump baseline spec version in `docs/baseline.md` §0 changelog.
- Commit. The pages workflow regenerates every run's HTML on next deploy so historical runs retroactively render with a `—` in the new column.
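The conjunction step is easy to forget: a field that is emitted but never added to it cannot fail the baseline. The sketch below illustrates the idea only. The real logic lives in `scripts/setup_node.sh` and may look quite different; `all_true` and `new_invariant` are hypothetical names.

```shell
# Hypothetical illustration: baseline_pass is true only when every per-field
# result passed to all_true is the string "true".
all_true() {
  local v
  for v in "$@"; do
    [ "$v" = "true" ] || { echo false; return 0; }
  done
  echo true
}

framework_is_authentic=true
ufw_disabled=true
new_invariant=false   # placeholder for the field being added

# Leaving "$new_invariant" out of this call would let the baseline pass.
baseline_pass="$(all_true "$framework_is_authentic" "$ufw_disabled" "$new_invariant")"
echo "baseline_pass=$baseline_pass"
```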
11. Rotate credentials
# xAI key
gh secret set XAI_API_KEY -R alphaonedev/ai-memory-ai2ai-gate
# DO token
gh secret set DIGITALOCEAN_TOKEN -R alphaonedev/ai-memory-ai2ai-gate
# SSH key (after registering new public half with DO)
gh secret set DIGITALOCEAN_SSH_PRIVATE_KEY -R alphaonedev/ai-memory-ai2ai-gate < ~/.ssh/id_ed25519
gh secret set DIGITALOCEAN_SSH_KEY_FINGERPRINT -R alphaonedev/ai-memory-ai2ai-gate
The redaction pass (see baseline.md §9b) catches any pre-rotation secret value that might be in historical logs: it regex-masks `xai-[A-Za-z0-9_-]{20,}` patterns even when the specific old key isn't known to the workflow, so rotated-key safety is automatic for xAI keys.
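To see what that pattern does and does not match, it can be tried locally with `sed`. This is a sketch, not the workflow's own redaction code: the helper name `redact_xai` and the replacement token `xai-[REDACTED]` are assumptions.

```shell
# Mask anything shaped like an xAI key, using the pattern quoted above.
# The replacement token is an assumption; the real redaction pass may emit a
# different marker.
redact_xai() {
  sed -E 's/xai-[A-Za-z0-9_-]{20,}/xai-[REDACTED]/g'
}

# Usage: some-command 2>&1 | redact_xai
```

Strings with fewer than 20 characters after the `xai-` prefix pass through unchanged, so short test fixtures are not masked.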
12. Investigate a past run
# Every artifact is in git history
git log -- runs/a2a-openclaw-v0.6.0-r7/
git show <commit-sha> -- runs/a2a-openclaw-v0.6.0-r7/a2a-summary.json
Every commit includes the campaign ID in the message. The actor + workflow URL are in the run's campaign.meta.json.