Reproducing the baseline — operator playbook¶
Step-by-step recipe to run this exact baseline anywhere. Two paths: A — dispatch a full campaign on DigitalOcean via GitHub Actions (the intended operator path). B — provision a single baseline node locally for debugging or verification.
The authoritative spec is docs/baseline.md. This file is the repeatable operational recipe. If the two disagree, the spec file wins; please file an issue.
Path A — full campaign on DigitalOcean (15-25 minutes)¶
A.1 Prerequisites¶
- A DigitalOcean account with available resources for 4 × `s-2vcpu-4gb` droplets in your chosen region
- A GitHub fork of alphaonedev/ai-memory-ai2ai-gate (or direct access if you're an alphaonedev operator)
- An xAI API key with access to `grok-4-fast-non-reasoning` (get one at console.x.ai)
- `gh` CLI authenticated to your GitHub account
- An ed25519 SSH keypair whose public key is uploaded to DO and whose fingerprint is recorded
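The DO-registered fingerprint referenced below is, under DigitalOcean's usual convention, the MD5 fingerprint of the public key. A minimal sketch of computing it (a throwaway key is generated here purely for illustration; point `-lf` at your real `.pub` instead):

```shell
# Throwaway ed25519 key for demonstration only.
rm -f /tmp/demo_ed25519 /tmp/demo_ed25519.pub
ssh-keygen -t ed25519 -N '' -f /tmp/demo_ed25519 -q

# DigitalOcean records the MD5 fingerprint: colon-separated hex
# pairs, without the "MD5:" prefix that ssh-keygen prints.
FP=$(ssh-keygen -lf /tmp/demo_ed25519.pub -E md5 | awk '{print $2}' | sed 's/^MD5://')
echo "$FP"
```

The printed value is what you paste into `DIGITALOCEAN_SSH_KEY_FINGERPRINT` in A.2.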
A.2 Configure repository secrets¶
FORK=your-gh-user/ai-memory-ai2ai-gate # or alphaonedev/ai-memory-ai2ai-gate
# DigitalOcean
gh secret set DIGITALOCEAN_TOKEN -R "$FORK" # paste DO API token
gh secret set DIGITALOCEAN_SSH_PRIVATE_KEY -R "$FORK" < ~/.ssh/id_ed25519
gh secret set DIGITALOCEAN_SSH_KEY_FINGERPRINT -R "$FORK" # paste the DO-registered fingerprint
# xAI
gh secret set XAI_API_KEY -R "$FORK" # paste xAI key
None of these values touch the repository: they live in GitHub's encrypted secret store, and the workflow consumes them at dispatch time. The commit-pass redaction step blocks any that leak into artifacts — see baseline.md §9b.
A.3 Dispatch a campaign¶
# One dispatch per agent group. Both can run in parallel — distinct
# VPC CIDRs, distinct concurrency groups.
gh workflow run a2a-gate.yml -R "$FORK" \
-f ai_memory_git_ref=v0.6.1 \
-f agent_group=ironclaw \
-f campaign_id=a2a-ironclaw-v0.6.1-rN \
-f scenarios="1 1b"
gh workflow run a2a-gate.yml -R "$FORK" \
-f ai_memory_git_ref=v0.6.1 \
-f agent_group=hermes \
-f campaign_id=a2a-hermes-v0.6.1-rN \
-f scenarios="1 1b"
A.4 Watch the baseline gate + scenarios¶
gh run list -R "$FORK" --workflow=a2a-gate.yml --limit 2
gh run watch -R "$FORK" <run-id>
Typical timing:
- Terraform apply: ~60s
- SSH wait + provision: ~4-6 min (ironclaw — Rust binary install + postgres) / ~15-25 min (hermes — heavier Python install). Legacy openclaw was ~8-12 min but required DO tier upgrade (see agents/ironclaw.md).
- Per-node functional probes F1 + F2 (xAI chat + agent-driven MCP canary): part of provision, ~6s each
- Baseline enforcement (per-node attestation): ~5s (any node failing = scenarios skipped, job fails at that step)
- F3 peer A2A probe: ~12s (write + 8s settle + 3-peer verify)
- Scenarios: ~90s each
- Redaction pass: ~1s (fails build if any secret value leaks to artifacts)
- Terraform destroy: ~30s (runs via if: always())
A.5 Find the evidence¶
git pull
ls runs/a2a-*-v0.6.1-rN/
# a2a-baseline.json ← per-node C1-C8 + F1 + F2 + negative invariants
# f3-peer-a2a.json ← F3 cross-node peer A2A probe verdict
# a2a-summary.json ← scenario rollup + overall_pass
# baselines/node-N.json ← raw attestation from each node
# campaign.meta.json ← DO region, node IPs, actor, workflow URL
# scenario-1.json ← scenario 1 (MCP-native) verdict
# scenario-1.log ← scenario 1 full console trace (redacted)
# scenario-1b.json ← scenario 1b (serve-HTTP) verdict
# scenario-1b.log ← scenario 1b full console trace (redacted)
# index.html ← human-readable dashboard page
Dashboard (after the pages workflow completes): https://alphaonedev.github.io/ai-memory-ai2ai-gate/evidence/a2a-ironclaw-v0.6.1-rN/
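A first triage over a pulled evidence directory can be scripted with jq. A minimal sketch — the sample file below is an illustrative stand-in, and only `overall_pass` is documented above; substitute your real `runs/<campaign_id>/` path:

```shell
# Illustrative stand-in for a real campaign directory.
RUN_DIR=$(mktemp -d)
echo '{ "overall_pass": true }' > "$RUN_DIR/a2a-summary.json"

# The one field that decides the campaign verdict.
PASS=$(jq -r '.overall_pass' "$RUN_DIR/a2a-summary.json")
echo "overall_pass=$PASS"
```

Anything other than `true` means at least one node or scenario went red; drill into the per-file verdicts listed above.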
Path B — single-node baseline verification¶
For operators who want to verify the baseline recipe on a single machine before dispatching a full campaign, or who want to debug a baseline violation without paying DO costs.
B.1 Prerequisites¶
- Ubuntu 24.04 LTS host (local VM, dedicated box, single DO droplet)
- Root or sudo
- Outbound network access to `api.x.ai`, `github.com`, `raw.githubusercontent.com`, and `openclaw.ai` (for the openclaw group)
- An xAI API key
B.2 Clone the repo¶
git clone https://github.com/alphaonedev/ai-memory-ai2ai-gate.git
cd ai-memory-ai2ai-gate
B.3 Run setup_node.sh with baseline env¶
export NODE_INDEX=5
export ROLE=agent
export AGENT_TYPE=ironclaw # or hermes, or openclaw
export AGENT_ID=ai:dave # any ai:-prefixed id
export PEER_URLS="http://<peer-1>:9077,http://<peer-2>:9077,http://<peer-3>:9077"
export AI_MEMORY_VERSION=0.6.1
export XAI_API_KEY=<your-xAI-key>
bash scripts/setup_node.sh
The script will:
- Install base packages (curl, jq, python3, nodejs, npm, sqlite3)
- Disable UFW — belt-and-suspenders (`ufw --force reset && ufw --force disable`), verify, exit 3 on failure
- Flush iptables to ACCEPT
- Set an 8-hour `shutdown -P +480` dead-man switch (skip on local machines)
- Install the `ai-memory` v0.6.1 binary
- Start `ai-memory serve` in federation mode on `0.0.0.0:9077`
- Install the agent framework (authentic upstream — `nearai/ironclaw`, `NousResearch/hermes-agent`, or `openclaw/openclaw`)
- Write framework config with full baseline lock-down — xAI Grok as the only LLM, ai-memory as the only MCP server, every alternative A2A channel disabled
- PROBE F1 — xAI Grok reachability + auth (direct `POST /v1/chat/completions`, expects non-empty content)
- PROBE F2 — end-to-end agent → MCP → ai-memory canary (agent invokes `memory_store`, probe verifies via local HTTP + `metadata.agent_id`)
- Emit `/etc/ai-memory-a2a/baseline.json` with config_attestation + negative_invariants + functional_probes + baseline_pass
- Exit 2 if `baseline_pass` is false
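The aggregation semantics of `baseline_pass` amount to "every boolean in every section is true". A sketch of that assumed logic — not the script's actual implementation, and the sample attestation is heavily truncated:

```shell
# Truncated illustrative attestation (the real file carries C1-C8,
# all negative invariants, and F1 + F2).
cat > /tmp/baseline-sample.json <<'EOF'
{
  "config_attestation":  { "framework_is_authentic": true, "llm_backend_is_xai_grok": true },
  "negative_invariants": { "a2a_protocol_off": true },
  "functional_probes":   { "xai_grok_chat_reachable": true },
  "baseline_pass": true
}
EOF

# baseline_pass should hold iff every value in the three sections is true.
ALL_TRUE=$(jq '[.config_attestation, .negative_invariants, .functional_probes | .[]] | all' /tmp/baseline-sample.json)
echo "$ALL_TRUE"
```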
Note: setup_node.sh covers the per-node side of the baseline (C1–C8 + F1 + F2). The workflow-level probe F3 (peer A2A via shared memory) requires coordination across multiple nodes and runs from the GitHub Actions runner, in the a2a-gate.yml step named "Functional probe F3". On a single local host you can skip F3, but no campaign scenario will ever run without F3 green.
B.4 Verify¶
jq '.baseline_pass' /etc/ai-memory-a2a/baseline.json
# must print: true
If true — your node is baseline-equivalent to a campaign agent droplet. You can use it as a fourth peer for a live federation, or as a debug environment to iterate on scenario scripts.
If false — inspect:
jq '.config_attestation, .negative_invariants, .functional_probes' /etc/ai-memory-a2a/baseline.json
Common baseline violations + fixes¶
| Symptom | Field | Likely cause | Fix |
|---|---|---|---|
| `framework_is_authentic: false` | C1 | Binary is a symlink to another CLI | Re-install upstream binary; check `readlink -f $(which openclaw)` |
| `llm_backend_is_xai_grok: false` | C3 | Wrong `base_url` or `default_model` in config | Re-run setup_node.sh with correct `XAI_API_KEY` |
| `agent_id_stamped: false` | C6 | `AI_MEMORY_AGENT_ID` env missing in MCP env block | Check `AGENT_ID` was exported before setup_node.sh |
| `federation_live: false` | C7 | `ai-memory serve` crashed or port 9077 blocked | Check `/var/log/ai-memory-serve.log`; verify UFW off |
| `ufw_disabled: false` | C8 | UFW re-enabled itself after reset | Ship-gate r21/r23 lesson; `ufw status verbose` |
| `xai_grok_chat_reachable: false` | F1 | Key invalid, out of credit, or outbound HTTPS blocked | `curl -v https://api.x.ai/v1/models -H "Authorization: Bearer $XAI_API_KEY"` |
| `agent_mcp_ai_memory_canary: false` | F2 | MCP dispatch broken, tool misselection, agent reasoning failure | Check `/tmp/canary-$AGENT_TYPE.log` for the agent's actual output |
| `a2a_protocol_off: false` | Negative | Config file lost the explicit disable | Re-run setup_node.sh; verify with `jq '.agentToAgent'` |
| `tool_allowlist_is_memory_only: false` | Negative | A non-`memory_*` tool leaked into allowlist | Edit config back to spec; see baseline.md §6b.2 |
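To map a red attestation onto the table above, it can help to list every false boolean with its dotted path. A sketch (the failing sample is fabricated for illustration; run the jq filter against your real baseline.json):

```shell
# Fabricated failing attestation.
cat > /tmp/baseline-fail.json <<'EOF'
{ "config_attestation": { "federation_live": false, "ufw_disabled": true } }
EOF

# Print the dotted path of every boolean that is false.
FAILED=$(jq -r 'paths(type == "boolean" and . == false) | join(".")' /tmp/baseline-fail.json)
echo "$FAILED"
```

Each printed path corresponds to one Symptom row above.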
Cost envelope¶
| Path | Cost | Time |
|---|---|---|
| Path A full campaign (per group) | ~$0.20-0.30 per clean run | 15-25 min wall |
| Path B single-node on DO | ~$0.02/hour | 5-10 min |
| Baseline-violation triage | Campaign halts before scenarios, ~1 min of droplet time wasted | 1-3 min |
| Dead-man switch worst case | ~$1.30 per droplet × 4 = ~$5.20 | 8 hours |
Disputing a finding¶
If your fork's campaign passes and this repo's fails (or vice versa), open an issue citing:
- Both campaign IDs
- The specific scenario that disagrees
- The `ai_memory_git_ref` each ran against
- Any infra differences (region, droplet size, DO account tier)
- Both `a2a-baseline.json` files — if the negative_invariants differ, the gate wasn't running the same test
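Comparing the two baselines' negative_invariants is a one-liner with jq plus diff; `-S` sorts object keys so field ordering doesn't create noise. A sketch with fabricated files (substitute the two real a2a-baseline.json paths):

```shell
# Two fabricated baselines that disagree on one invariant.
echo '{ "negative_invariants": { "a2a_protocol_off": true  } }' > /tmp/ours.json
echo '{ "negative_invariants": { "a2a_protocol_off": false } }' > /tmp/theirs.json

# Identical normalized output means both campaigns enforced the same gate.
if diff <(jq -S '.negative_invariants' /tmp/ours.json) \
        <(jq -S '.negative_invariants' /tmp/theirs.json) >/dev/null; then
  VERDICT="same gate"
else
  VERDICT="different gate"
fi
echo "$VERDICT"
```

"different gate" means the disagreement is a harness bug, not a finding about ai-memory itself.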
Cross-fork comparison is the point. An unreproducible result is a bug in the harness, and we want to find it.
Change control¶
If you change ANY baseline configuration during a campaign's life, follow the semver rules in baseline.md §12:
- Adding a new invariant → minor bump
- Tightening an existing invariant → minor bump with migration note
- Relaxing or removing → major bump with an `analysis/run-insights.json` narrative entry
Every change must update:
1. docs/baseline.md (authoritative spec)
2. scripts/setup_node.sh (emit/check)
3. The aggregator + dashboard renderer (if visible)
4. analysis/run-insights.json for the first run under the new baseline