Operator runbook — v0.6.3.1 First-Principles campaign¶

End-to-end procedure for an operator running the v0.6.3.1 A2A gated testing campaign. The campaign is governed by docs/governance.md — that document is authoritative; this page is the operator's checklist.

The five phases (Phase 0 pre-flight → Phase 1 substrate → Phase 2 dry run → Phase 3 autonomous NHI → Phase 4 meta-analysis → Phase 5 verdict commit) execute mostly inside the a2a-gate.yml workflow. The operator's job is to dispatch, observe, and commit.

Prerequisites¶

Item	Where it lives	Purpose
DigitalOcean account + `DIGITALOCEAN_TOKEN`	GitHub repo secret	Provision the 4-node mesh in Phase 0.
GitHub PAT	`gh` CLI auth (`gh auth login`)	Dispatch workflows and post Phase 5 verdicts.
`XAI_API_KEY`	GitHub repo secret	Powers the LLM behind IronClaw / Hermes (`grok-4-fast-non-reasoning` per baseline.md).
SSH key	`~/.ssh/id_ed25519` (registered with DO; fingerprint in `DIGITALOCEAN_SSH_KEY_FINGERPRINT` repo secret)	Provision and inspect mesh nodes.
Local clone of this repo	`git clone git@github.com:alphaonedev/ai-memory-a2a-v0.6.3.1.git`	Pull run artifacts, push verdicts.
`jq` + `gh` locally	Homebrew / apt	Inspect run JSON; dispatch / watch workflows.

Confirm before starting:

gh auth status
gh secret list -R alphaonedev/ai-memory-a2a-v0.6.3.1 \
  | grep -E '(DIGITALOCEAN_TOKEN|XAI_API_KEY|DIGITALOCEAN_SSH_PRIVATE_KEY|DIGITALOCEAN_SSH_KEY_FINGERPRINT)'

Phase 0 — Pre-flight (dispatch + baseline check)¶

Dispatch the workflow¶

Per-group, one dispatch each. Both can run concurrently — they provision distinct VPCs:

REPO=alphaonedev/ai-memory-a2a-v0.6.3.1

# IronClaw campaign (primary cert agent)
gh workflow run a2a-gate.yml -R "$REPO" \
  -f ai_memory_git_ref=v0.6.3.1 \
  -f agent_group=ironclaw \
  -f campaign_id="a2a-ironclaw-v0.6.3.1-r$(date +%s)"

# Hermes campaign (counterpart agent)
gh workflow run a2a-gate.yml -R "$REPO" \
  -f ai_memory_git_ref=v0.6.3.1 \
  -f agent_group=hermes \
  -f campaign_id="a2a-hermes-v0.6.3.1-r$(date +%s)"

OpenClaw is in scope for v0.6.3.1 — but it's dispatched from a higher-resource node (workstation with 16+ GB available per container, via the local Docker mesh), not from the Basic-tier DigitalOcean droplets used here. From the workstation, dispatch via make a2a-openclaw (or run docker-compose directly per the local-docker-mesh page). Per Principle 6 (scope discipline), the OpenClaw run produces its own campaign_id and artifact set; the orchestrator joins IronClaw + Hermes + OpenClaw evidence via release=v0.6.3.1 linkage but never collapses them across framework boundaries.

Watch the run¶

gh run list -R "$REPO" --workflow=a2a-gate.yml --limit 3
gh run watch -R "$REPO" <run-id>
gh run view  -R "$REPO" <run-id> --log-failed   # if it fails

Read the Phase 0 baseline¶

Phase 0 emits runs/<campaign-id>/a2a-baseline.json and gates Phase 1 on overall_pass=true. Pull and inspect:

git pull --rebase
jq '.overall_pass, .reasons' runs/<campaign-id>/a2a-baseline.json
jq '[.per_node[] | {node: .node_index, pass: .baseline_pass}]' runs/<campaign-id>/a2a-baseline.json

overall_pass=false means the harness was not in a known-good state at Phase 0 entry — Phase 1 will not run. The most common per-node violations and first-fix:

Violation field	Cause	First-fix
`framework_is_authentic`	Binary is a symlink to another CLI	`ssh root@<node>` and `readlink -f $(which ironclaw)`; re-run install.
`mcp_server_ai_memory_registered`	Config malformed	Inspect `~/.ironclaw/ironclaw.json` or `~/.hermes/config.yaml`.
`llm_backend_is_xai_grok`	Wrong model SKU	Spec requires `grok-4-fast-non-reasoning`; fix `default_model`.
`mcp_command_is_ai_memory`	MCP `command` not `ai-memory`	Re-run setup_node.sh.
`agent_id_stamped`	`AI_MEMORY_AGENT_ID` env not propagated	Confirm `AGENT_ID` exported before setup_node.sh.
`federation_live`	Local `ai-memory serve` crashed or port unreachable	Inspect `/var/log/ai-memory-serve.log`; UFW status.
`xai_grok_chat_reachable` (F1)	xAI key invalid / out of credit / network	`curl -v https://api.x.ai/v1/models -H "Authorization: Bearer $XAI_API_KEY"` from droplet.
`dead_man_switch_scheduled`	`shutdown -P +480` not running	`ps aux \\| grep shutdown` on droplet.

Re-run the workflow only after fixing the violation. Do not advance to Phase 1 with overall_pass=false.

Phase 1 — Substrate cert (S1–S24 on the 6-cell matrix)¶

The workflow runs Phase 1 automatically once Phase 0 baseline is GREEN. Per governance §4, the operator does not redesign Phase 1 — they observe and accept the published verdict.

What you should see in `runs/<id>/`¶

scenario-S1.json … scenario-S24.json — one per scenario, with pass, reasons, and a minimal repro hint for any RED.
a2a-summary.json — aggregate verdict for the run.
summary.json partial update of releases/v0.6.3.1/summary.json substrate_verdict block.

Required outcome to advance to Phase 2¶

Per governance §4:

substrate_verdict.value = "PARTIAL — pending Patch 2"
S1–S22                 = GREEN
S23                    = RED  (issue #507 — known-open)
S24                    = RED  (issue #318 — known-open)

Failure modes¶

Unexpected RED on S1 – S22. Regression on a carry-forward or v0.6.3.1-specific surface. Halt the campaign — Phase 2 must not run on a degraded substrate (Principle 2). File a bug + v0.6.3.2-candidate child issue under #511 and re-dispatch only after the regression is understood.
GREEN on S23 or S24. Harness integrity violation — the harness is lying. Halt immediately and file a harness_defect issue. Do not commit a Phase 5 verdict.
Cell other than ironclaw / mTLS red on cert sweep. Surfaces as a finding and may downgrade the verdict, but does not halt by itself. Record and continue if the cert cell is clean.

Phase 2 — AI Orchestration Test (scripted dry run)¶

Per governance §5, Phase 2 is a six-step scripted dry-run of MCP wiring + namespace plumbing + JSON log sink before autonomy is enabled in Phase 3.

What you should see in `runs/<id>/`¶

phase2-orchestration.json — six exchange records + aggregate pass: bool + SHA-256 of the audit verify output from step 5.

The six exchanges (in order): write round-trip → cross-agent recall → scope enforcement → tag write + tagged recall → audit verify hook → JSON log sink check.

Required outcome to advance to Phase 3¶

phase2-orchestration.json.pass == true and all six exchange records present and well-formed per the §7 schema.

Failure modes¶

Any single exchange fails. Phase 3 does not run on a degraded Phase 2 — a Phase 3 failure with broken plumbing is indistinguishable from a Phase 3 failure caused by ai-memory itself. File a Phase 2 issue, fix, re-dispatch from Phase 0.
Scope leak (exchange #3 returns a private memory across agents). Treat as a high-severity substrate defect even though Phase 1 was GREEN — it means S7 is undertesting. File harness_defect and carry_forward_patch2 candidates.
Audit verify (exchange #5) returns dirty. Halt; the audit chain is not actually tamper-evident on this run. File carry_forward_patch2 against #511.

Phase 3 — Autonomous NHI playbook (4 scenarios × 4 arms × n=3 = 48 runs)¶

Per governance §6, Phase 3 runs the four scenarios (A/B/C/D) under each of the four control arms (Cold / Isolated / Stubbed / Treatment) three times each. The operator does not coach agents during Phase 3 — autonomy is the point. Caps: 12 turns, 50 ai-memory ops, 10 minutes per run.

What you should see in `runs/<id>/`¶

48 files: phase3-<scenario>-<arm>-run<n>.json — each conforming to the §7 schema.
One aggregator: phase3-summary.json with run counts, completion outcomes (task_complete / cap_reached / refusal / error), and pointers to the individual logs.

Required outcome to advance to Phase 4¶

All 48 logs present and schema-valid. Malformed logs are rejected and the affected scenario × arm × run combination is re-dispatched.

Failure modes¶

sku_mismatch flag on any record. The runtime served a model SKU different from baseline.md. Re-dispatch the affected run after confirming the SKU pin.
malformed flag on any record. Schema-invalid JSON; the Orchestrator rejects and re-runs.
Excess cap_reached (>1/3 of runs in a cell). Surface to Phase 4 — likely indicates the cap is too tight for the scenario or the agent is looping. Not a halt; recorded as a finding.
Cross-layer inconsistency surfaces — e.g. substrate S24 RED but Phase 3 Scenario D Hermes recall returns the write within the settle window. Do not auto-resolve. Phase 4 will compute the cross-layer consistency table and flag the inconsistent row; record it as a harness_defect candidate (the NHI test is broken, or substrate is mis-categorized — Phase 4 decides).

Phase 4 — Meta-analysis (third Claude instance, no namespace access)¶

Per governance §8, a separate Claude instance with read-only access to the JSON logs (and no namespace access, no ability to query agents directly) computes:

Grounding rate, hallucination rate per scenario × arm × run.
Cross-agent recall hit rate (scenarios A / B / C).
Cross-layer consistency for scenario D vs substrate S24.
Termination distribution.
Treatment effect (Arm-T − each control arm), point estimates with min/max — no p-values (n=3 doesn't support them).

What you should see in `runs/<id>/`¶

phase4-analysis.json — all metrics + cross-layer consistency table + classified findings + narrative summary (≤2000 words) + SHA-256 manifest of every input log consumed.

Operator role¶

The operator does not author Phase 4 — the meta-analyst does. The operator's role is to confirm phase4-analysis.json exists and is signed (SHA-256 manifest of inputs), then proceed to Phase 5.

If the meta-analyst flags any finding as needs_review, escalate to the human maintainer (Jim) before Phase 5 commits.

Phase 5 — Verdict commit + findings sync¶

Per governance §9, the operator publishes the final artifact and fires the findings sync.

Step 1 — Update `releases/v0.6.3.1/summary.json`¶

Merge the Phase 1 substrate verdict block, the Phase 3 NHI verdict block, and the Phase 4 cross-layer consistency table into the schema-v2 surface. The two top-level verdicts (substrate_verdict, nhi_verdict) remain separate — never collapse them (Principle 1).

git pull --rebase
# (apply phase outputs to releases/v0.6.3.1/summary.json — substrate_verdict, nhi_verdict,
#  cross_layer_consistency, campaign.last_run_id, campaign.updated_at)
git add releases/v0.6.3.1/summary.json runs/<campaign-id>/
git commit -m "phase5: commit v0.6.3.1 verdict for run <campaign-id>"
git push

The release-summary-gate.yml workflow verifies the schema and rejects malformed pushes.

Step 2 — Run `findings-sync.yml`¶

gh workflow run findings-sync.yml -R alphaonedev/ai-memory-a2a-v0.6.3.1
gh run watch -R alphaonedev/ai-memory-a2a-v0.6.3.1 <run-id>

The workflow reads phase4-analysis.json and opens / updates child issues on ai-memory-mcp for each finding tagged carry_forward_patch2 or carry_forward_v0_6_4. Each issue is parent-linked to umbrella #511 and labelled bug + the candidate label.

Step 3 — Update test-hub aggregator¶

Open a PR on the ai-memory-test-hub repo to bind the v0.6.3.1 row to the new summary.json. The substrate verdict cell is filled from substrate_verdict.value; the NHI cell from nhi_verdict.value.

The campaign is complete when the test-hub PR merges and the v0.6.3.1 cell renders the two verdicts correctly.

Patch 2 funnel¶

How findings make it into the v0.6.3.2 candidate list under #511:

Phase 4 classifies the finding as carry_forward_patch2 (per findings.md).
Phase 5 step 2 (findings-sync.yml) opens the child issue on ai-memory-mcp, parent-linked to #511, labelled bug + v0.6.3.2-candidate, and assigned to the v0.6.3.2 milestone.
The Patch 2 candidate list under #511 is the canonical view of everything Patch 2 needs to fix. The known seeds are #507 (S23) and #318 (S24).
When every parent-linked issue's fix has merged on release/v0.6.3.2 and the corresponding scenarios are GREEN on the successor ai-memory-a2a-v0.6.3.2 campaign's first run, #511 closes and the umbrella tracking issue is satisfied.

A finding only enters Patch 2 if it has a scenario that reproduces it, a child issue on ai-memory-mcp, the v0.6.3.2-candidate label, a parent-link to #511, and the v0.6.3.2 milestone. Anything missing one of those does not count — re-classify under findings §classes.

Common operations¶

Cancel a run¶

gh run cancel <run-id> -R alphaonedev/ai-memory-a2a-v0.6.3.1

terraform destroy runs in the if: always() teardown; cancellation is safe (no orphan droplets). 8h dead-man switch on every droplet is the backstop if DO API is out.

Inspect evidence locally¶

git pull --rebase
cd runs/<campaign-id>

jq '.overall_pass, .reasons' a2a-summary.json
jq '[.per_node[] | {node: .node_index, pass: .baseline_pass}]' a2a-baseline.json
for f in scenario-*.json;       do echo "=== $f ===" ; jq '{scenario, pass, reasons}' "$f" ; done
for f in phase3-*-run*.json;    do jq '{scenario_id, control_arm, run_index, termination_reason}' "$f" ; done
jq '.cross_layer_consistency' phase4-analysis.json

Rotate credentials¶

gh secret set XAI_API_KEY                         -R alphaonedev/ai-memory-a2a-v0.6.3.1
gh secret set DIGITALOCEAN_TOKEN                  -R alphaonedev/ai-memory-a2a-v0.6.3.1
gh secret set DIGITALOCEAN_SSH_PRIVATE_KEY        -R alphaonedev/ai-memory-a2a-v0.6.3.1 < ~/.ssh/id_ed25519
gh secret set DIGITALOCEAN_SSH_KEY_FINGERPRINT    -R alphaonedev/ai-memory-a2a-v0.6.3.1

The redaction pass on runs/ regex-masks xai-[A-Za-z0-9_-]{20,} so rotated-key safety is automatic for XAI keys in historical logs.

Cross-links¶

Governance — authoritative
Scope — what is in / out of cert
Matrix — substrate cells + Phase 3 cells + cross-layer consistency
Findings — finding classes + Patch 2 funnel
Phase log schema: scripts/schema/phase-log.schema.json
Umbrella tracking issue: alphaonedev/ai-memory-mcp#511

Operator runbook — v0.6.3.1 First-Principles campaign¶

Prerequisites¶

Phase 0 — Pre-flight (dispatch + baseline check)¶

Dispatch the workflow¶

Watch the run¶

Read the Phase 0 baseline¶

Phase 1 — Substrate cert (S1–S24 on the 6-cell matrix)¶

What you should see in runs/<id>/¶

Required outcome to advance to Phase 2¶

Failure modes¶

Phase 2 — AI Orchestration Test (scripted dry run)¶

What you should see in runs/<id>/¶

Required outcome to advance to Phase 3¶

Failure modes¶

Phase 3 — Autonomous NHI playbook (4 scenarios × 4 arms × n=3 = 48 runs)¶

What you should see in runs/<id>/¶

Required outcome to advance to Phase 4¶

Failure modes¶

Phase 4 — Meta-analysis (third Claude instance, no namespace access)¶

What you should see in runs/<id>/¶

Operator role¶

Phase 5 — Verdict commit + findings sync¶

Step 1 — Update releases/v0.6.3.1/summary.json¶

Step 2 — Run findings-sync.yml¶

Step 3 — Update test-hub aggregator¶

Patch 2 funnel¶

Common operations¶

Cancel a run¶

Inspect evidence locally¶

Rotate credentials¶

Cross-links¶

What you should see in `runs/<id>/`¶

What you should see in `runs/<id>/`¶

What you should see in `runs/<id>/`¶

What you should see in `runs/<id>/`¶

Step 1 — Update `releases/v0.6.3.1/summary.json`¶

Step 2 — Run `findings-sync.yml`¶