Run focus
Federation convergence miss (B=89.5%, C=73%) + probe SSH 255 killed script
What this campaign set out to test: 4×50-write burst, 30-second settle, convergence assertion, two quorum probes (one-peer-down → 201, both-peers-down → 503).
What it demonstrated: Disproved that a 30s settle was sufficient for 200 burst writes to reach all three peers under the then-current federation implementation. On node-A (leader) every write landed. On node-B, 179 of 200. On node-C, 146 of 200. This is not an eventual-consistency "slow catch-up" pattern — it's a deterministic "each write is dropped on one peer" pattern. (Run 14 would later confirm this.) Also disproved that a shell-level `|| true` inside quoted remote ssh commands is enough to survive transport-level 255 exits.
Detailed tri-audience analysis is below, followed by per-phase test results for all four phases of the protocol — including any phase that did not run in this campaign.
AI NHI analysis · Claude Opus 4.7
Federation convergence miss (B=89.5%, C=73%) + probe SSH 255 killed script
First run where writes actually exercised federation. And federation was visibly wrong: non-leader peers sat well below the 95% convergence threshold even after 30 seconds of settle. Separately, the probe step exposed a harness-fragility sub-bug.
What this campaign tested
4×50-write burst, 30-second settle, convergence assertion, two quorum probes (one-peer-down → 201, both-peers-down → 503).
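The two quorum probes reduce to a replica-count comparison against W plus an HTTP status check on the leader. A minimal sketch — the host names, port, endpoint path, and stop command below are assumptions for illustration, not the harness's real values:

```shell
# Quorum math for W=2/N=3: a write is accepted iff live replicas >= W.
quorum_ok() { [ "$1" -ge "$2" ]; }   # $1 = live replicas, $2 = W

# HTTP status of a single write against the leader (hypothetical endpoint).
write_status() {
  curl -s -o /dev/null -w '%{http_code}' -X POST \
    'http://node-a:8080/memories' -d '{"content":"quorum-probe"}'
}

# Probe 1: stop one peer. Leader + remaining peer = 2 >= W, so expect 201.
#   ssh root@node-c 'systemctl stop ai-memory'   # hypothetical unit name
#   [ "$(write_status)" = '201' ]
# Probe 2: stop both peers. Leader alone = 1 < W, so expect 503.
#   ssh root@node-b 'systemctl stop ai-memory'
#   [ "$(write_status)" = '503' ]
```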
What it proved (or disproved)
Disproved that a 30s settle was sufficient for 200 burst writes to reach all three peers under the then-current federation implementation. On node-A (leader) every write landed. On node-B, 179 of 200. On node-C, 146 of 200. This is not an eventual-consistency "slow catch-up" pattern — it's a deterministic "each write is dropped on one peer" pattern. (Run 14 would later confirm this.) Also disproved that a shell-level `|| true` inside quoted remote ssh commands is enough to survive transport-level 255 exits.
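The convergence assertion itself boils down to an integer-percentage comparison against the 95% threshold. A sketch using this run's observed counts (the function name is illustrative, not the harness's):

```shell
# Did a peer receive >= 95% of the burst writes?
# Integer math floors the ratio: node-B's 179/200 (89.5%) reports as 89.
converged() {
  seen="$1"; written="$2"
  [ $(( seen * 100 / written )) -ge 95 ]
}

# Run 13 observations:
#   node-A 200/200 -> pass; node-B 179/200 -> fail; node-C 146/200 -> fail
```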
For three audiences
Non-technical end users
Two problems at once: the back-of-house syncing wasn't caught up before we measured (which could be explained by the test window being too short, or by something worse), and a cleanup step tripped over itself and killed the test before it could report. Both turn out to be fixable. The second one is a test-rig issue; the first turns out to be a real bug in the product, uncovered fully in the NEXT run.
Two problems at once: the back-of-house syncing wasn't caught up before we measured (which could be explained by the test window being too short, or by something worse), and a cleanup step tripped over itself and killed the test before it could report. Both turn out to be fixable. The second one is a test-rig issue; the first turns out to be a real bug in the product, uncovered fully in the NEXT run.
C-level decision makers
The campaign now has instrumented visibility into federation correctness. Two gaps identified: (a) the test window may be too tight OR the product convergence path is slower than expected (to be distinguished by r14); (b) the probe harness needs transport-level fault tolerance. Neither is a confirmed product regression yet — but the campaign is now exposing signal rather than hiding it. This is what a release gate is supposed to do.
Engineers & architects
Quorum write (W=2/N=3) guarantees leader + one peer synchronously. Third peer catches up via sync-daemon. 30s was THREE sync-daemon cycles at the then-default 10s cadence. Observed pattern: B=89.5%, C=73% with a convergence ratio that doesn't match a simple settle-time explanation (if it were purely time, we'd expect B and C to be roughly equal, both lagging). Separately, `ssh root@X "pkill -f 'ai-memory serve' || true"` returns SSH exit 255 when the remote sshd closes after pkill kills a process in the same session. `|| true` inside quotes shields the remote exit code; the SSH client's own 255 propagates under `set -e`. Shield fix: wrap the SSH call itself, plus ConnectTimeout + ServerAliveInterval for good measure.
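The shield fix described above can be sketched as follows — the host, kill command, and option values mirror the quoted snippet, with timeouts chosen for illustration:

```shell
# The harness runs under set -e.
set -e

# The inner `|| true` executes on the REMOTE side and only masks pkill's
# exit code. If the ssh client itself exits 255 (transport torn down when
# the session dies), that status still reaches the caller and aborts the
# script. Guarding the ssh invocation itself absorbs it.
remote_pkill() {
  ssh -o ConnectTimeout=5 -o ServerAliveInterval=5 \
    "root@$1" "pkill -f 'ai-memory serve' || true" || true   # outer guard
}

# The same outer-guard pattern, independent of ssh: tolerate any non-zero
# exit (including 255) from the wrapped command.
tolerate() { "$@" || true; }
```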
Bugs surfaced and where they were fixed
- Phase 2 30s settle insufficient for 200-write convergence (symptom, not root cause)
Impact: node-B at 89.5%, node-C at 73% — below the 95% pass threshold. Initial diagnosis was "settle too short"; the real cause emerged in r14.
Root cause: Quorum's synchronous replication only covers W-1 peers; the third relies on an async path. At the time we believed the async path was slow. It was actually broken.
Fixed in:
- Probe SSH returned 255 and killed Phase 2 under `set -e`
Impact: probe1 + probe2 never ran; Phase 2 produced no verdict JSON.
Root cause: `|| true` inside remote quotes only shields the remote command's exit code, not SSH transport failures.
Fixed in:
What changed going into the next campaign
r14 bumps the settle to 90s AND wraps the probe SSH in an outer `|| true`. Even then, convergence stayed below threshold — revealing that the real problem wasn't the test window but the product.
Phase 1 — functional (per-node) PASS
What this phase proves: Single-node CRUD, backup, curator dry-run, and MCP handshake on each of the three peer droplets. Establishes that ai-memory starts and is functional at the one-node level before federation is exercised.
Test results
node-a
- ✓ Stats total ≥ 1 (store + list + stats round-trip) — 1 memories
- ✓ Recall returned ≥ 1 hit — 1 hits
- ✓ Backup snapshot file emitted — 1 snapshot(s)
- ✓ Backup manifest file emitted — 1 manifest(s)
- ✓ MCP handshake advertises ≥ 30 tools — 36 tools
- ✓ Curator dry-run clean (Ollama-not-configured is accepted) — 1 errors
- ✓ Overall phase-1 pass flag
node-b
- ✓ Stats total ≥ 1 (store + list + stats round-trip) — 1 memories
- ✓ Recall returned ≥ 1 hit — 1 hits
- ✓ Backup snapshot file emitted — 1 snapshot(s)
- ✓ Backup manifest file emitted — 1 manifest(s)
- ✓ MCP handshake advertises ≥ 30 tools — 36 tools
- ✓ Curator dry-run clean (Ollama-not-configured is accepted) — 1 errors
- ✓ Overall phase-1 pass flag
node-c
- ✓ Stats total ≥ 1 (store + list + stats round-trip) — 1 memories
- ✓ Recall returned ≥ 1 hit — 1 hits
- ✓ Backup snapshot file emitted — 1 snapshot(s)
- ✓ Backup manifest file emitted — 1 manifest(s)
- ✓ MCP handshake advertises ≥ 30 tools — 36 tools
- ✓ Curator dry-run clean (Ollama-not-configured is accepted) — 1 errors
- ✓ Overall phase-1 pass flag
Raw evidence
phase1-node-a
{
"phase": 1,
"host": "aim-v0-6-0-0-final-r13-node-a",
"version": "ai-memory 0.6.0",
"pass": true,
"reasons": [
""
],
"stats": {
"total": 1,
"by_tier": [
{
"tier": "mid",
"count": 1
}
],
"by_namespace": [
{
"namespace": "ship-gate-phase1",
"count": 1
}
],
"expiring_soon": 0,
"links_count": 0,
"db_size_bytes": 139264
},
"curator": {
"started_at": "2026-04-20T04:26:49.272066563+00:00",
"completed_at": "2026-04-20T04:26:49.272521117+00:00",
"cycle_duration_ms": 0,
"memories_scanned": 1,
"memories_eligible": 1,
"auto_tagged": 0,
"contradictions_found": 0,
"operations_attempted": 0,
"operations_skipped_cap": 0,
"autonomy": {
"clusters_formed": 0,
"memories_consolidated": 0,
"memories_forgotten": 0,
"priority_adjustments": 0,
"rollback_entries_written": 0,
"errors": []
},
"errors": [
"no LLM client configured"
],
"dry_run": true
},
"mcp_tool_count": 36,
"recall_count": 1,
"snapshot_count": 1,
"manifest_count": 1
}
phase1-node-b
{
"phase": 1,
"host": "aim-v0-6-0-0-final-r13-node-b",
"version": "ai-memory 0.6.0",
"pass": true,
"reasons": [
""
],
"stats": {
"total": 1,
"by_tier": [
{
"tier": "mid",
"count": 1
}
],
"by_namespace": [
{
"namespace": "ship-gate-phase1",
"count": 1
}
],
"expiring_soon": 0,
"links_count": 0,
"db_size_bytes": 139264
},
"curator": {
"started_at": "2026-04-20T04:26:49.581991284+00:00",
"completed_at": "2026-04-20T04:26:49.582578735+00:00",
"cycle_duration_ms": 0,
"memories_scanned": 1,
"memories_eligible": 1,
"auto_tagged": 0,
"contradictions_found": 0,
"operations_attempted": 0,
"operations_skipped_cap": 0,
"autonomy": {
"clusters_formed": 0,
"memories_consolidated": 0,
"memories_forgotten": 0,
"priority_adjustments": 0,
"rollback_entries_written": 0,
"errors": []
},
"errors": [
"no LLM client configured"
],
"dry_run": true
},
"mcp_tool_count": 36,
"recall_count": 1,
"snapshot_count": 1,
"manifest_count": 1
}
phase1-node-c
{
"phase": 1,
"host": "aim-v0-6-0-0-final-r13-node-c",
"version": "ai-memory 0.6.0",
"pass": true,
"reasons": [
""
],
"stats": {
"total": 1,
"by_tier": [
{
"tier": "mid",
"count": 1
}
],
"by_namespace": [
{
"namespace": "ship-gate-phase1",
"count": 1
}
],
"expiring_soon": 0,
"links_count": 0,
"db_size_bytes": 139264
},
"curator": {
"started_at": "2026-04-20T04:26:49.263137618+00:00",
"completed_at": "2026-04-20T04:26:49.263610536+00:00",
"cycle_duration_ms": 0,
"memories_scanned": 1,
"memories_eligible": 1,
"auto_tagged": 0,
"contradictions_found": 0,
"operations_attempted": 0,
"operations_skipped_cap": 0,
"autonomy": {
"clusters_formed": 0,
"memories_consolidated": 0,
"memories_forgotten": 0,
"priority_adjustments": 0,
"rollback_entries_written": 0,
"errors": []
},
"errors": [
"no LLM client configured"
],
"dry_run": true
},
"mcp_tool_count": 36,
"recall_count": 1,
"snapshot_count": 1,
"manifest_count": 1
}