Operator runbook — v0.6.3.1 First-Principles campaign¶
End-to-end procedure for an operator running the v0.6.3.1 A2A gated testing campaign. The campaign is governed by docs/governance.md — that document is authoritative; this page is the operator's checklist.
The five phases (Phase 0 pre-flight → Phase 1 substrate → Phase 2 dry run → Phase 3 autonomous NHI → Phase 4 meta-analysis → Phase 5 verdict commit) execute mostly inside the a2a-gate.yml workflow. The operator's job is to dispatch, observe, and commit.
Prerequisites¶
| Item | Where it lives | Purpose |
|---|---|---|
DigitalOcean account + DIGITALOCEAN_TOKEN |
GitHub repo secret | Provision the 4-node mesh in Phase 0. |
| GitHub PAT | gh CLI auth (gh auth login) |
Dispatch workflows and post Phase 5 verdicts. |
XAI_API_KEY |
GitHub repo secret | Powers the LLM behind IronClaw / Hermes (grok-4-fast-non-reasoning per baseline.md). |
| SSH key | ~/.ssh/id_ed25519 (registered with DO; fingerprint in DIGITALOCEAN_SSH_KEY_FINGERPRINT repo secret) |
Provision and inspect mesh nodes. |
| Local clone of this repo | git clone git@github.com:alphaonedev/ai-memory-a2a-v0.6.3.1.git |
Pull run artifacts, push verdicts. |
jq + gh locally |
Homebrew / apt | Inspect run JSON; dispatch / watch workflows. |
Confirm before starting:
gh auth status
gh secret list -R alphaonedev/ai-memory-a2a-v0.6.3.1 \
| grep -E '(DIGITALOCEAN_TOKEN|XAI_API_KEY|DIGITALOCEAN_SSH_PRIVATE_KEY|DIGITALOCEAN_SSH_KEY_FINGERPRINT)'
Phase 0 — Pre-flight (dispatch + baseline check)¶
Dispatch the workflow¶
Per-group, one dispatch each. Both can run concurrently — they provision distinct VPCs:
REPO=alphaonedev/ai-memory-a2a-v0.6.3.1
# IronClaw campaign (primary cert agent)
gh workflow run a2a-gate.yml -R "$REPO" \
-f ai_memory_git_ref=v0.6.3.1 \
-f agent_group=ironclaw \
-f campaign_id="a2a-ironclaw-v0.6.3.1-r$(date +%s)"
# Hermes campaign (counterpart agent)
gh workflow run a2a-gate.yml -R "$REPO" \
-f ai_memory_git_ref=v0.6.3.1 \
-f agent_group=hermes \
-f campaign_id="a2a-hermes-v0.6.3.1-r$(date +%s)"
OpenClaw is in scope for v0.6.3.1 — but it's dispatched from a higher-resource node (workstation with 16+ GB available per container, via the local Docker mesh), not from the Basic-tier DigitalOcean droplets used here. From the workstation, dispatch via make a2a-openclaw (or run docker-compose directly per the local-docker-mesh page). Per Principle 6 (scope discipline), the OpenClaw run produces its own campaign_id and artifact set; the orchestrator joins IronClaw + Hermes + OpenClaw evidence via release=v0.6.3.1 linkage but never collapses them across framework boundaries.
Watch the run¶
gh run list -R "$REPO" --workflow=a2a-gate.yml --limit 3
gh run watch -R "$REPO" <run-id>
gh run view -R "$REPO" <run-id> --log-failed # if it fails
Read the Phase 0 baseline¶
Phase 0 emits runs/<campaign-id>/a2a-baseline.json and gates Phase 1 on overall_pass=true. Pull and inspect:
git pull --rebase
jq '.overall_pass, .reasons' runs/<campaign-id>/a2a-baseline.json
jq '[.per_node[] | {node: .node_index, pass: .baseline_pass}]' runs/<campaign-id>/a2a-baseline.json
overall_pass=false means the harness was not in a known-good state at Phase 0 entry — Phase 1 will not run. The most common per-node violations and first-fix:
| Violation field | Cause | First-fix |
|---|---|---|
framework_is_authentic |
Binary is a symlink to another CLI | ssh root@<node> and readlink -f $(which ironclaw); re-run install. |
mcp_server_ai_memory_registered |
Config malformed | Inspect ~/.ironclaw/ironclaw.json or ~/.hermes/config.yaml. |
llm_backend_is_xai_grok |
Wrong model SKU | Spec requires grok-4-fast-non-reasoning; fix default_model. |
mcp_command_is_ai_memory |
MCP command not ai-memory |
Re-run setup_node.sh. |
agent_id_stamped |
AI_MEMORY_AGENT_ID env not propagated |
Confirm AGENT_ID exported before setup_node.sh. |
federation_live |
Local ai-memory serve crashed or port unreachable |
Inspect /var/log/ai-memory-serve.log; UFW status. |
xai_grok_chat_reachable (F1) |
xAI key invalid / out of credit / network | curl -v https://api.x.ai/v1/models -H "Authorization: Bearer $XAI_API_KEY" from droplet. |
dead_man_switch_scheduled |
shutdown -P +480 not running |
ps aux \| grep shutdown on droplet. |
Re-run the workflow only after fixing the violation. Do not advance to Phase 1 with overall_pass=false.
Phase 1 — Substrate cert (S1–S24 on the 6-cell matrix)¶
The workflow runs Phase 1 automatically once Phase 0 baseline is GREEN. Per governance §4, the operator does not redesign Phase 1 — they observe and accept the published verdict.
What you should see in runs/<id>/¶
scenario-S1.json…scenario-S24.json— one per scenario, withpass,reasons, and a minimal repro hint for any RED.a2a-summary.json— aggregate verdict for the run.summary.jsonpartial update ofreleases/v0.6.3.1/summary.jsonsubstrate_verdictblock.
Required outcome to advance to Phase 2¶
Per governance §4:
substrate_verdict.value = "PARTIAL — pending Patch 2"
S1–S22 = GREEN
S23 = RED (issue #507 — known-open)
S24 = RED (issue #318 — known-open)
Failure modes¶
- Unexpected RED on
S1–S22. Regression on a carry-forward or v0.6.3.1-specific surface. Halt the campaign — Phase 2 must not run on a degraded substrate (Principle 2). File abug+v0.6.3.2-candidatechild issue under #511 and re-dispatch only after the regression is understood. - GREEN on
S23orS24. Harness integrity violation — the harness is lying. Halt immediately and file aharness_defectissue. Do not commit a Phase 5 verdict. - Cell other than
ironclaw / mTLSred on cert sweep. Surfaces as a finding and may downgrade the verdict, but does not halt by itself. Record and continue if the cert cell is clean.
Phase 2 — AI Orchestration Test (scripted dry run)¶
Per governance §5, Phase 2 is a six-step scripted dry-run of MCP wiring + namespace plumbing + JSON log sink before autonomy is enabled in Phase 3.
What you should see in runs/<id>/¶
phase2-orchestration.json— six exchange records + aggregatepass: bool+ SHA-256 of theaudit verifyoutput from step 5.
The six exchanges (in order): write round-trip → cross-agent recall → scope enforcement → tag write + tagged recall → audit verify hook → JSON log sink check.
Required outcome to advance to Phase 3¶
phase2-orchestration.json.pass == true and all six exchange records present and well-formed per the §7 schema.
Failure modes¶
- Any single exchange fails. Phase 3 does not run on a degraded Phase 2 — a Phase 3 failure with broken plumbing is indistinguishable from a Phase 3 failure caused by ai-memory itself. File a Phase 2 issue, fix, re-dispatch from Phase 0.
- Scope leak (exchange #3 returns a private memory across agents). Treat as a high-severity substrate defect even though Phase 1 was GREEN — it means S7 is undertesting. File
harness_defectandcarry_forward_patch2candidates. - Audit verify (exchange #5) returns dirty. Halt; the audit chain is not actually tamper-evident on this run. File
carry_forward_patch2against#511.
Phase 3 — Autonomous NHI playbook (4 scenarios × 4 arms × n=3 = 48 runs)¶
Per governance §6, Phase 3 runs the four scenarios (A/B/C/D) under each of the four control arms (Cold / Isolated / Stubbed / Treatment) three times each. The operator does not coach agents during Phase 3 — autonomy is the point. Caps: 12 turns, 50 ai-memory ops, 10 minutes per run.
What you should see in runs/<id>/¶
- 48 files:
phase3-<scenario>-<arm>-run<n>.json— each conforming to the §7 schema. - One aggregator:
phase3-summary.jsonwith run counts, completion outcomes (task_complete/cap_reached/refusal/error), and pointers to the individual logs.
Required outcome to advance to Phase 4¶
All 48 logs present and schema-valid. Malformed logs are rejected and the affected scenario × arm × run combination is re-dispatched.
Failure modes¶
sku_mismatchflag on any record. The runtime served a model SKU different from baseline.md. Re-dispatch the affected run after confirming the SKU pin.malformedflag on any record. Schema-invalid JSON; the Orchestrator rejects and re-runs.- Excess
cap_reached(>1/3 of runs in a cell). Surface to Phase 4 — likely indicates the cap is too tight for the scenario or the agent is looping. Not a halt; recorded as a finding. - Cross-layer inconsistency surfaces — e.g. substrate
S24RED but Phase 3 Scenario D Hermes recall returns the write within the settle window. Do not auto-resolve. Phase 4 will compute the cross-layer consistency table and flag the inconsistent row; record it as aharness_defectcandidate (the NHI test is broken, or substrate is mis-categorized — Phase 4 decides).
Phase 4 — Meta-analysis (third Claude instance, no namespace access)¶
Per governance §8, a separate Claude instance with read-only access to the JSON logs (and no namespace access, no ability to query agents directly) computes:
- Grounding rate, hallucination rate per scenario × arm × run.
- Cross-agent recall hit rate (scenarios A / B / C).
- Cross-layer consistency for scenario D vs substrate S24.
- Termination distribution.
- Treatment effect (Arm-T − each control arm), point estimates with min/max — no p-values (n=3 doesn't support them).
What you should see in runs/<id>/¶
phase4-analysis.json— all metrics + cross-layer consistency table + classified findings + narrative summary (≤2000 words) + SHA-256 manifest of every input log consumed.
Operator role¶
The operator does not author Phase 4 — the meta-analyst does. The operator's role is to confirm phase4-analysis.json exists and is signed (SHA-256 manifest of inputs), then proceed to Phase 5.
If the meta-analyst flags any finding as needs_review, escalate to the human maintainer (Jim) before Phase 5 commits.
Phase 5 — Verdict commit + findings sync¶
Per governance §9, the operator publishes the final artifact and fires the findings sync.
Step 1 — Update releases/v0.6.3.1/summary.json¶
Merge the Phase 1 substrate verdict block, the Phase 3 NHI verdict block, and the Phase 4 cross-layer consistency table into the schema-v2 surface. The two top-level verdicts (substrate_verdict, nhi_verdict) remain separate — never collapse them (Principle 1).
git pull --rebase
# (apply phase outputs to releases/v0.6.3.1/summary.json — substrate_verdict, nhi_verdict,
# cross_layer_consistency, campaign.last_run_id, campaign.updated_at)
git add releases/v0.6.3.1/summary.json runs/<campaign-id>/
git commit -m "phase5: commit v0.6.3.1 verdict for run <campaign-id>"
git push
The release-summary-gate.yml workflow verifies the schema and rejects malformed pushes.
Step 2 — Run findings-sync.yml¶
gh workflow run findings-sync.yml -R alphaonedev/ai-memory-a2a-v0.6.3.1
gh run watch -R alphaonedev/ai-memory-a2a-v0.6.3.1 <run-id>
The workflow reads phase4-analysis.json and opens / updates child issues on ai-memory-mcp for each finding tagged carry_forward_patch2 or carry_forward_v0_6_4. Each issue is parent-linked to umbrella #511 and labelled bug + the candidate label.
Step 3 — Update test-hub aggregator¶
Open a PR on the ai-memory-test-hub repo to bind the v0.6.3.1 row to the new summary.json. The substrate verdict cell is filled from substrate_verdict.value; the NHI cell from nhi_verdict.value.
The campaign is complete when the test-hub PR merges and the v0.6.3.1 cell renders the two verdicts correctly.
Patch 2 funnel¶
How findings make it into the v0.6.3.2 candidate list under #511:
- Phase 4 classifies the finding as
carry_forward_patch2(per findings.md). - Phase 5 step 2 (
findings-sync.yml) opens the child issue onai-memory-mcp, parent-linked to #511, labelledbug+v0.6.3.2-candidate, and assigned to thev0.6.3.2milestone. - The Patch 2 candidate list under #511 is the canonical view of everything Patch 2 needs to fix. The known seeds are #507 (S23) and #318 (S24).
- When every parent-linked issue's fix has merged on
release/v0.6.3.2and the corresponding scenarios are GREEN on the successorai-memory-a2a-v0.6.3.2campaign's first run, #511 closes and the umbrella tracking issue is satisfied.
A finding only enters Patch 2 if it has a scenario that reproduces it, a child issue on ai-memory-mcp, the v0.6.3.2-candidate label, a parent-link to #511, and the v0.6.3.2 milestone. Anything missing one of those does not count — re-classify under findings §classes.
Common operations¶
Cancel a run¶
terraform destroy runs in the if: always() teardown; cancellation is safe (no orphan droplets). 8h dead-man switch on every droplet is the backstop if DO API is out.
Inspect evidence locally¶
git pull --rebase
cd runs/<campaign-id>
jq '.overall_pass, .reasons' a2a-summary.json
jq '[.per_node[] | {node: .node_index, pass: .baseline_pass}]' a2a-baseline.json
for f in scenario-*.json; do echo "=== $f ===" ; jq '{scenario, pass, reasons}' "$f" ; done
for f in phase3-*-run*.json; do jq '{scenario_id, control_arm, run_index, termination_reason}' "$f" ; done
jq '.cross_layer_consistency' phase4-analysis.json
Rotate credentials¶
gh secret set XAI_API_KEY -R alphaonedev/ai-memory-a2a-v0.6.3.1
gh secret set DIGITALOCEAN_TOKEN -R alphaonedev/ai-memory-a2a-v0.6.3.1
gh secret set DIGITALOCEAN_SSH_PRIVATE_KEY -R alphaonedev/ai-memory-a2a-v0.6.3.1 < ~/.ssh/id_ed25519
gh secret set DIGITALOCEAN_SSH_KEY_FINGERPRINT -R alphaonedev/ai-memory-a2a-v0.6.3.1
The redaction pass on runs/ regex-masks xai-[A-Za-z0-9_-]{20,} so rotated-key safety is automatic for XAI keys in historical logs.
Cross-links¶
- Governance — authoritative
- Scope — what is in / out of cert
- Matrix — substrate cells + Phase 3 cells + cross-layer consistency
- Findings — finding classes + Patch 2 funnel
- Phase log schema:
scripts/schema/phase-log.schema.json - Umbrella tracking issue:
alphaonedev/ai-memory-mcp#511