Reproducing the baseline — operator playbook¶
Step-by-step recipe to run this exact baseline anywhere. Two paths: A — dispatch a full campaign on DigitalOcean via GitHub Actions (the intended operator path). B — provision a single baseline node locally for debugging or verification.
The authoritative spec is docs/baseline.md. This file is the repeatable operational recipe. If the two disagree, the spec file wins; please file an issue.
Path A — full campaign on DigitalOcean (15-25 minutes)¶
A.1 Prerequisites¶
- A DigitalOcean account with available resources for 4 × `s-2vcpu-4gb` droplets in your chosen region
- A GitHub fork of alphaonedev/ai-memory-ai2ai-gate (or direct access if you're an alphaonedev operator)
- An xAI API key with access to `grok-4-fast-non-reasoning` (get one at console.x.ai)
- `gh` CLI authenticated to your GitHub account
- An ed25519 SSH keypair whose public key is uploaded to DO and whose fingerprint is recorded
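The DO-registered fingerprint referenced below is, under DigitalOcean's usual convention, the MD5 fingerprint of the public key. A minimal sketch of computing it (a throwaway key is generated here purely for illustration; point `-lf` at your real `.pub` instead):

```shell
# Throwaway ed25519 key for demonstration only.
rm -f /tmp/demo_ed25519 /tmp/demo_ed25519.pub
ssh-keygen -t ed25519 -N '' -f /tmp/demo_ed25519 -q

# DigitalOcean records the MD5 fingerprint: colon-separated hex
# pairs, without the "MD5:" prefix that ssh-keygen prints.
FP=$(ssh-keygen -lf /tmp/demo_ed25519.pub -E md5 | awk '{print $2}' | sed 's/^MD5://')
echo "$FP"
```

The printed value is what you paste into `DIGITALOCEAN_SSH_KEY_FINGERPRINT` in A.2.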
A.2 Configure repository secrets¶
FORK=your-gh-user/ai-memory-ai2ai-gate # or alphaonedev/ai-memory-ai2ai-gate
# DigitalOcean
gh secret set DIGITALOCEAN_TOKEN -R "$FORK" # paste DO API token
gh secret set DIGITALOCEAN_SSH_PRIVATE_KEY -R "$FORK" < ~/.ssh/id_ed25519
gh secret set DIGITALOCEAN_SSH_KEY_FINGERPRINT -R "$FORK" # paste the DO-registered fingerprint
# xAI
gh secret set XAI_API_KEY -R "$FORK" # paste xAI key
None of these values touch the repository: they live in GitHub's encrypted secret store, and the workflow consumes them at dispatch time. The commit-pass redaction step blocks any that leak into artifacts — see baseline.md §9b.
A.3 Dispatch a campaign¶
# One dispatch per agent group. Both can run in parallel — distinct
# VPC CIDRs, distinct concurrency groups.
gh workflow run a2a-gate.yml -R "$FORK" \
-f ai_memory_git_ref=v0.6.1 \
-f agent_group=ironclaw \
-f campaign_id=a2a-ironclaw-v0.6.1-rN \
-f scenarios="1 1b"
gh workflow run a2a-gate.yml -R "$FORK" \
-f ai_memory_git_ref=v0.6.1 \
-f agent_group=hermes \
-f campaign_id=a2a-hermes-v0.6.1-rN \
-f scenarios="1 1b"
A.4 Watch the baseline gate + scenarios¶
gh run list -R "$FORK" --workflow=a2a-gate.yml --limit 2
gh run watch -R "$FORK" <run-id>
Typical timing:
- Terraform apply: ~60s
- SSH wait + provision: ~4-6 min (ironclaw — Rust binary install + postgres) / ~15-25 min (hermes — heavier Python install). Legacy openclaw was ~8-12 min but required DO tier upgrade (see agents/ironclaw.md).
- Per-node functional probes F1 + F2 (xAI chat + agent-driven MCP canary): part of provision, ~6s each
- Baseline enforcement (per-node attestation): ~5s (any node failing = scenarios skipped, job fails at that step)
- F3 peer A2A probe: ~12s (write + 8s settle + 3-peer verify)
- Scenarios: ~90s each
- Redaction pass: ~1s (fails build if any secret value leaks to artifacts)
- Terraform destroy: ~30s (runs via if: always())
A.5 Find the evidence¶
git pull
ls runs/a2a-*-v0.6.1-rN/
# a2a-baseline.json ← per-node C1-C8 + F1 + F2 + negative invariants
# f3-peer-a2a.json ← F3 cross-node peer A2A probe verdict
# a2a-summary.json ← scenario rollup + overall_pass
# baselines/node-N.json ← raw attestation from each node
# campaign.meta.json ← DO region, node IPs, actor, workflow URL
# scenario-1.json ← scenario 1 (MCP-native) verdict
# scenario-1.log ← scenario 1 full console trace (redacted)
# scenario-1b.json ← scenario 1b (serve-HTTP) verdict
# scenario-1b.log ← scenario 1b full console trace (redacted)
# index.html ← human-readable dashboard page
Dashboard (after the pages workflow completes): https://alphaonedev.github.io/ai-memory-ai2ai-gate/evidence/a2a-ironclaw-v0.6.1-rN/
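A first triage over a pulled evidence directory can be scripted with jq. A minimal sketch — the sample file below is an illustrative stand-in, and only `overall_pass` is documented above; substitute your real `runs/<campaign_id>/` path:

```shell
# Illustrative stand-in for a real campaign directory.
RUN_DIR=$(mktemp -d)
echo '{ "overall_pass": true }' > "$RUN_DIR/a2a-summary.json"

# The one field that decides the campaign verdict.
PASS=$(jq -r '.overall_pass' "$RUN_DIR/a2a-summary.json")
echo "overall_pass=$PASS"
```

Anything other than `true` means at least one node or scenario went red; drill into the per-file verdicts listed above.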
Path B — single-node baseline verification¶
For operators who want to verify the baseline recipe on a single machine before dispatching a full campaign, or who want to debug a baseline violation without paying DO costs.
B.1 Prerequisites¶
- Ubuntu 24.04 LTS host (local VM, dedicated box, single DO droplet)
- Root or sudo
- Outbound network access to `api.x.ai`, `github.com`, `raw.githubusercontent.com`, and `openclaw.ai` (for the openclaw group)
- An xAI API key
B.2 Clone the repo¶
git clone https://github.com/alphaonedev/ai-memory-ai2ai-gate.git
cd ai-memory-ai2ai-gate
B.3 Run setup_node.sh with baseline env¶
export NODE_INDEX=5
export ROLE=agent
export AGENT_TYPE=ironclaw # or hermes, or openclaw
export AGENT_ID=ai:dave # any ai:-prefixed id
export PEER_URLS="http://<peer-1>:9077,http://<peer-2>:9077,http://<peer-3>:9077"
export AI_MEMORY_VERSION=0.6.1
export XAI_API_KEY=<your-xAI-key>
bash scripts/setup_node.sh
The script will:
- Install base packages (curl, jq, python3, nodejs, npm, sqlite3)
- Disable UFW — belt-and-suspenders (`ufw --force reset && ufw --force disable`), verify, exit 3 on failure
- Flush iptables to ACCEPT
- Set an 8-hour `shutdown -P +480` dead-man switch (skip on local machines)
- Install the `ai-memory` v0.6.1 binary
- Start `ai-memory serve` in federation mode on `0.0.0.0:9077`
- Install the agent framework (authentic upstream — `nearai/ironclaw`, `NousResearch/hermes-agent`, or `openclaw/openclaw`)
- Write framework config with full baseline lock-down — xAI Grok as the only LLM, ai-memory as the only MCP server, every alternative A2A channel disabled
- PROBE F1 — xAI Grok reachability + auth (direct `POST /v1/chat/completions`, expects non-empty content)
- PROBE F2 — end-to-end agent → MCP → ai-memory canary (agent invokes `memory_store`, probe verifies via local HTTP + `metadata.agent_id`)
- Emit `/etc/ai-memory-a2a/baseline.json` with config_attestation + negative_invariants + functional_probes + baseline_pass
- Exit 2 if `baseline_pass` is false
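The aggregation semantics of `baseline_pass` amount to "every boolean in every section is true". A sketch of that assumed logic — not the script's actual implementation, and the sample attestation is heavily truncated:

```shell
# Truncated illustrative attestation (the real file carries C1-C8,
# all negative invariants, and F1 + F2).
cat > /tmp/baseline-sample.json <<'EOF'
{
  "config_attestation":  { "framework_is_authentic": true, "llm_backend_is_xai_grok": true },
  "negative_invariants": { "a2a_protocol_off": true },
  "functional_probes":   { "xai_grok_chat_reachable": true },
  "baseline_pass": true
}
EOF

# baseline_pass should hold iff every value in the three sections is true.
ALL_TRUE=$(jq '[.config_attestation, .negative_invariants, .functional_probes | .[]] | all' /tmp/baseline-sample.json)
echo "$ALL_TRUE"
```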
Note: setup_node.sh covers the per-node side of the baseline (C1–C8 + F1 + F2). The workflow-level probe F3 (peer A2A via shared memory) requires coordination across multiple nodes and runs from the GitHub Actions runner, in the a2a-gate.yml step named "Functional probe F3". On a single local host you can skip F3, but no campaign scenario will ever run without F3 green.
B.4 Verify¶
jq '.baseline_pass' /etc/ai-memory-a2a/baseline.json
# must print: true
If true — your node is baseline-equivalent to a campaign agent droplet. You can use it as a fourth peer for a live federation, or as a debug environment to iterate on scenario scripts.
If false — inspect:
jq '.config_attestation, .negative_invariants, .functional_probes' /etc/ai-memory-a2a/baseline.json
Common baseline violations + fixes¶
| Symptom | Field | Likely cause | Fix |
|---|---|---|---|
| `framework_is_authentic: false` | C1 | Binary is a symlink to another CLI | Re-install upstream binary; check `readlink -f $(which openclaw)` |
| `llm_backend_is_xai_grok: false` | C3 | Wrong `base_url` or `default_model` in config | Re-run setup_node.sh with correct `XAI_API_KEY` |
| `agent_id_stamped: false` | C6 | `AI_MEMORY_AGENT_ID` env missing in MCP env block | Check `AGENT_ID` was exported before setup_node.sh |
| `federation_live: false` | C7 | `ai-memory serve` crashed or port 9077 blocked | Check `/var/log/ai-memory-serve.log`; verify UFW off |
| `ufw_disabled: false` | C8 | UFW re-enabled itself after reset | Ship-gate r21/r23 lesson; `ufw status verbose` |
| `xai_grok_chat_reachable: false` | F1 | Key invalid, out of credit, or outbound HTTPS blocked | `curl -v https://api.x.ai/v1/models -H "Authorization: Bearer $XAI_API_KEY"` |
| `agent_mcp_ai_memory_canary: false` | F2 | MCP dispatch broken, tool misselection, agent reasoning failure | Check `/tmp/canary-$AGENT_TYPE.log` for the agent's actual output |
| `a2a_protocol_off: false` | Negative | Config file lost the explicit disable | Re-run setup_node.sh; verify with `jq '.agentToAgent'` |
| `tool_allowlist_is_memory_only: false` | Negative | A non-`memory_*` tool leaked into allowlist | Edit config back to spec; see baseline.md §6b.2 |
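To map a red attestation onto the table above, it can help to list every false boolean with its dotted path. A sketch (the failing sample is fabricated for illustration; run the jq filter against your real baseline.json):

```shell
# Fabricated failing attestation.
cat > /tmp/baseline-fail.json <<'EOF'
{ "config_attestation": { "federation_live": false, "ufw_disabled": true } }
EOF

# Print the dotted path of every boolean that is false.
FAILED=$(jq -r 'paths(type == "boolean" and . == false) | join(".")' /tmp/baseline-fail.json)
echo "$FAILED"
```

Each printed path corresponds to one Symptom row above.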
Cost envelope¶
| Path | Cost | Time |
|---|---|---|
| Path A full campaign (per group) | ~$0.20-0.30 per clean run | 15-25 min wall |
| Path B single-node on DO | ~$0.02/hour | 5-10 min |
| Baseline-violation triage | Campaign halts before scenarios, ~1 min of droplet time wasted | 1-3 min |
| Dead-man switch worst case | ~$1.30 per droplet × 4 = ~$5.20 | 8 hours |
Disputing a finding¶
If your fork's campaign passes and this repo's fails (or vice versa), open an issue citing:
- Both campaign IDs
- The specific scenario that disagrees
- The `ai_memory_git_ref` each ran against
- Any infra differences (region, droplet size, DO account tier)
- Both `a2a-baseline.json` files — if the negative_invariants differ, the gate wasn't running the same test
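Comparing the two baselines' negative_invariants is a one-liner with jq plus diff; `-S` sorts object keys so field ordering doesn't create noise. A sketch with fabricated files (substitute the two real a2a-baseline.json paths):

```shell
# Two fabricated baselines that disagree on one invariant.
echo '{ "negative_invariants": { "a2a_protocol_off": true  } }' > /tmp/ours.json
echo '{ "negative_invariants": { "a2a_protocol_off": false } }' > /tmp/theirs.json

# Identical normalized output means both campaigns enforced the same gate.
if diff <(jq -S '.negative_invariants' /tmp/ours.json) \
        <(jq -S '.negative_invariants' /tmp/theirs.json) >/dev/null; then
  VERDICT="same gate"
else
  VERDICT="different gate"
fi
echo "$VERDICT"
```

"different gate" means the disagreement is a harness bug, not a finding about ai-memory itself.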
Cross-fork comparison is the point. An unreproducible result is a bug in the harness, and we want to find it.
Change control¶
If you change ANY baseline configuration during a campaign's life, follow the semver rules in baseline.md §12:
- Adding a new invariant → minor bump
- Tightening an existing invariant → minor bump with migration note
- Relaxing or removing → major bump with an `analysis/run-insights.json` narrative entry
Every change must update:
1. docs/baseline.md (authoritative spec)
2. scripts/setup_node.sh (emit/check)
3. The aggregator + dashboard renderer (if visible)
4. analysis/run-insights.json for the first run under the new baseline