Campaign v0.6.0.0-final-r21 FAIL

ai-memory ref: release/v0.6.0
Completed at: 2026-04-20T15:41:12Z
Overall pass: FAIL

Run focus

Cancelled — Phase 4 hung 40+ min on PR #314's aggressive reqwest settings

What this campaign set out to test: Full four-phase protocol with PR #309 (fanout detach), PR #310 (chaos source allowlist), PR #312 (per-cycle harness + surviving-peer metric), PR #313 (3s post-write settle), and PR #314 (federation client keepalive tuning + harness hygiene) all landed on release/v0.6.0.

What it demonstrated: Proved that Phases 1, 2, and 3 remain green under the latest release-branch code. Disproved the assumption that PR #314's client-level tuning was safe: the settings meant to speed up partition recovery are the suspected cause of the Phase 4 hang on the loopback mesh. A real-infra campaign caught a real regression before it shipped, which is the ship-gate doing its job.

Detailed tri-audience analysis is below, followed by per-phase test results for all four phases of the protocol — including any phase that did not run in this campaign.

AI NHI analysis · Claude Opus 4.7

The operator cancelled the run after Phase 4 had run for more than 40 minutes without completing (baseline is ~10 min). Phases 1, 2, and 3 all passed cleanly before the hang; their results are in the artifacts. The suspect is PR #314's combination of a tight TCP keepalive and a 5s pool-idle-timeout, which is thought to have caused ephemeral-port exhaustion under the chaos harness's local connection churn. The settings were reverted in PR #316.

For three audiences

Non-technical end users

The run caught a problem we introduced trying to fix a different problem. Rather than pretending it didn't happen, we cancelled the test, reverted the change, and documented what went wrong. This is how you build trustworthy software: every regression gets caught, every cancellation gets a story attached, every fix gets its own test.

C-level decision makers

Zero customer impact — the suspect code never shipped. The campaign gate worked as designed: a change that looked correct in unit tests failed under real-infrastructure load, which is precisely what the ship-gate exists to expose. PR #316 reverted the offender. Time cost: one campaign run, ~$0.15 of DigitalOcean spend, ~60 minutes of engineering attention. No release delay beyond that.

Engineers & architects

The PR #314 client-builder settings on release/v0.6.0 at commit 971a5db were `.tcp_keepalive(Duration::from_secs(1)).pool_idle_timeout(Duration::from_secs(5)).http2_keep_alive_while_idle(true)`. The hypothesized interaction on the chaos-client's local 3-process mesh: the 5s pool-idle-timeout closed and reopened every idle fanout connection, the 1s tcp_keepalive generated probe traffic on every still-live connection, and the combined churn across 5000 writes likely exhausted the droplet's ~28k ephemeral ports (a closed connection holds its port for 60s in TIME_WAIT). Phase 4 passed the 40-minute mark without completing. The settings were reverted in PR #316, commit 710ad76.
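A back-of-envelope check of the port-exhaustion hypothesis, as a minimal sketch. It assumes the Linux default ephemeral range of 32768 to 60999 and the 60s TIME_WAIT retention mentioned above; the droplet's actual `net.ipv4.ip_local_port_range` is not recorded in this report, and the function name is illustrative.

```python
# Back-of-envelope: how fast can a client open short-lived connections to one
# destination (same dst IP and port) before it runs out of ephemeral ports?
# Assumption: Linux default ip_local_port_range 32768..60999, TIME_WAIT = 60 s.

EPHEMERAL_LOW = 32768
EPHEMERAL_HIGH = 60999
TIME_WAIT_SECS = 60


def max_sustained_conn_rate(low=EPHEMERAL_LOW, high=EPHEMERAL_HIGH,
                            time_wait_secs=TIME_WAIT_SECS):
    """Connections per second at which every port is always in TIME_WAIT."""
    ports = high - low + 1
    return ports / time_wait_secs


ports = EPHEMERAL_HIGH - EPHEMERAL_LOW + 1
rate = max_sustained_conn_rate()
print(f"{ports} ports, about {rate:.0f} new connections/s per destination")
```

Sustained connection churn above roughly that rate toward a single peer would leave no free source port. That is consistent with the hypothesis but does not confirm it; the campaign did not capture socket counts.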

Bugs surfaced and where they were fixed

PR #314's aggressive reqwest settings caused the Phase 4 hang

Impact: Phase 4 ran 40+ min against a ~10 min baseline before the operator cancelled it. Zero customer impact: the change never shipped.

Root cause: Suspected ephemeral-port exhaustion + kernel keepalive thrash from the combination of `tcp_keepalive(1s) + pool_idle_timeout(5s)` on a loopback mesh with continuous connection churn under chaos load.
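One way to test the suspected root cause on a future run would be to sample socket states on the chaos-client while Phase 4 is running. The sketch below counts states from the text of `/proc/net/tcp`; the sample rows are hypothetical and were not captured from the r21 droplet.

```python
from collections import Counter

# Subset of kernel TCP state codes as they appear in /proc/net/tcp.
TCP_STATES = {"01": "ESTABLISHED", "06": "TIME_WAIT", "0A": "LISTEN"}


def count_tcp_states(proc_net_tcp_text):
    """Count sockets per state from the text of /proc/net/tcp."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        # fields[3] is the hex state code; unknown codes are kept as-is.
        counts[TCP_STATES.get(fields[3], fields[3])] += 1
    return counts


# Hypothetical sample, not taken from this campaign.
SAMPLE = """\
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 0100007F:1F90 00000000:0000 0A 00000000:00000000 00:00000000 00000000     0        0 1001
   1: 0100007F:C350 0100007F:1F90 06 00000000:00000000 03:00000B2E 00000000     0        0 0
   2: 0100007F:C351 0100007F:1F90 06 00000000:00000000 03:00000B2F 00000000     0        0 0
   3: 0100007F:C352 0100007F:1F90 01 00000000:00000000 00:00000000 00000000     0        0 1002
"""

state_counts = count_tcp_states(SAMPLE)
print(dict(state_counts))
```

On a live host the input would be `open("/proc/net/tcp").read()`. A TIME_WAIT count approaching the size of the ephemeral range would support the exhaustion theory; a low count would point elsewhere.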

Fixed in:
- PR #316 (MERGED) — revert the three aggressive settings

What changed going into the next campaign

r22 was dispatched with the PR #316 revert applied and partition_minority moved to opt-in (phase4_chaos.sh now defaults to FAULTS = kill_primary_mid_write only). Partition recovery becomes a v0.6.0.1 investigation.

Phase 1 — functional (per-node) PASS

What this phase proves: Single-node CRUD, backup, curator dry-run, and MCP handshake on each of the three peer droplets. Establishes that ai-memory starts and is functional at the one-node level before federation is exercised.

Test results

node-a

✓ Stats total ≥ 1 (store + list + stats round-trip) — 1 memory
✓ Recall returned ≥ 1 hit — 1 hit
✓ Backup snapshot file emitted — 1 snapshot(s)
✓ Backup manifest file emitted — 1 manifest(s)
✓ MCP handshake advertises ≥ 30 tools — 36 tools
✓ Curator dry-run clean (Ollama-not-configured is accepted) — 1 error ("no LLM client configured")
✓ Overall phase-1 pass flag

node-b

✓ Stats total ≥ 1 (store + list + stats round-trip) — 1 memory
✓ Recall returned ≥ 1 hit — 1 hit
✓ Backup snapshot file emitted — 1 snapshot(s)
✓ Backup manifest file emitted — 1 manifest(s)
✓ MCP handshake advertises ≥ 30 tools — 36 tools
✓ Curator dry-run clean (Ollama-not-configured is accepted) — 1 error ("no LLM client configured")
✓ Overall phase-1 pass flag

node-c

✓ Stats total ≥ 1 (store + list + stats round-trip) — 1 memory
✓ Recall returned ≥ 1 hit — 1 hit
✓ Backup snapshot file emitted — 1 snapshot(s)
✓ Backup manifest file emitted — 1 manifest(s)
✓ MCP handshake advertises ≥ 30 tools — 36 tools
✓ Curator dry-run clean (Ollama-not-configured is accepted) — 1 error ("no LLM client configured")
✓ Overall phase-1 pass flag
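The per-node checks listed above can be sketched as a small evaluator over a phase-1 report. This is an illustration and not the harness's own code; the function name is hypothetical, and treating the string "no LLM client configured" as the accepted Ollama-not-configured error is an assumption based on the evidence in this report.

```python
# Assumption: the accepted curator error is matched by exact string.
ACCEPTED_CURATOR_ERRORS = {"no LLM client configured"}
MIN_MCP_TOOLS = 30


def phase1_checks(report):
    """Return {check name: bool} for one phase-1 node report."""
    curator_errors = set(report["curator"]["errors"])
    return {
        "stats_total": report["stats"]["total"] >= 1,
        "recall": report["recall_count"] >= 1,
        "snapshot": report["snapshot_count"] >= 1,
        "manifest": report["manifest_count"] >= 1,
        "mcp_tools": report["mcp_tool_count"] >= MIN_MCP_TOOLS,
        # Clean means: no curator errors other than the accepted ones.
        "curator_clean": curator_errors <= ACCEPTED_CURATOR_ERRORS,
    }


# Trimmed copy of the node-a values reported in this campaign.
node_a = {
    "stats": {"total": 1},
    "curator": {"errors": ["no LLM client configured"], "dry_run": True},
    "mcp_tool_count": 36,
    "recall_count": 1,
    "snapshot_count": 1,
    "manifest_count": 1,
}

results = phase1_checks(node_a)
print(all(results.values()), results)
```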

Raw evidence

phase1-node-a

{
	"phase": 1,
	"host": "aim-v0-6-0-0-final-r21-node-a",
	"version": "ai-memory 0.6.0",
	"pass": true,
	"reasons": [
		""
	],
	"stats": {
		"total": 1,
		"by_tier": [
			{
				"tier": "mid",
				"count": 1
			}
		],
		"by_namespace": [
			{
				"namespace": "ship-gate-phase1",
				"count": 1
			}
		],
		"expiring_soon": 0,
		"links_count": 0,
		"db_size_bytes": 139264
	},
	"curator": {
		"started_at": "2026-04-20T14:50:51.329608076+00:00",
		"completed_at": "2026-04-20T14:50:51.330145771+00:00",
		"cycle_duration_ms": 0,
		"memories_scanned": 1,
		"memories_eligible": 1,
		"auto_tagged": 0,
		"contradictions_found": 0,
		"operations_attempted": 0,
		"operations_skipped_cap": 0,
		"autonomy": {
			"clusters_formed": 0,
			"memories_consolidated": 0,
			"memories_forgotten": 0,
			"priority_adjustments": 0,
			"rollback_entries_written": 0,
			"errors": []
		},
		"errors": [
			"no LLM client configured"
		],
		"dry_run": true
	},
	"mcp_tool_count": 36,
	"recall_count": 1,
	"snapshot_count": 1,
	"manifest_count": 1
}

phase1-node-b

{
	"phase": 1,
	"host": "aim-v0-6-0-0-final-r21-node-b",
	"version": "ai-memory 0.6.0",
	"pass": true,
	"reasons": [
		""
	],
	"stats": {
		"total": 1,
		"by_tier": [
			{
				"tier": "mid",
				"count": 1
			}
		],
		"by_namespace": [
			{
				"namespace": "ship-gate-phase1",
				"count": 1
			}
		],
		"expiring_soon": 0,
		"links_count": 0,
		"db_size_bytes": 139264
	},
	"curator": {
		"started_at": "2026-04-20T14:50:51.703276453+00:00",
		"completed_at": "2026-04-20T14:50:51.703765091+00:00",
		"cycle_duration_ms": 0,
		"memories_scanned": 1,
		"memories_eligible": 1,
		"auto_tagged": 0,
		"contradictions_found": 0,
		"operations_attempted": 0,
		"operations_skipped_cap": 0,
		"autonomy": {
			"clusters_formed": 0,
			"memories_consolidated": 0,
			"memories_forgotten": 0,
			"priority_adjustments": 0,
			"rollback_entries_written": 0,
			"errors": []
		},
		"errors": [
			"no LLM client configured"
		],
		"dry_run": true
	},
	"mcp_tool_count": 36,
	"recall_count": 1,
	"snapshot_count": 1,
	"manifest_count": 1
}

phase1-node-c

{
	"phase": 1,
	"host": "aim-v0-6-0-0-final-r21-node-c",
	"version": "ai-memory 0.6.0",
	"pass": true,
	"reasons": [
		""
	],
	"stats": {
		"total": 1,
		"by_tier": [
			{
				"tier": "mid",
				"count": 1
			}
		],
		"by_namespace": [
			{
				"namespace": "ship-gate-phase1",
				"count": 1
			}
		],
		"expiring_soon": 0,
		"links_count": 0,
		"db_size_bytes": 139264
	},
	"curator": {
		"started_at": "2026-04-20T14:50:51.210863481+00:00",
		"completed_at": "2026-04-20T14:50:51.211338313+00:00",
		"cycle_duration_ms": 0,
		"memories_scanned": 1,
		"memories_eligible": 1,
		"auto_tagged": 0,
		"contradictions_found": 0,
		"operations_attempted": 0,
		"operations_skipped_cap": 0,
		"autonomy": {
			"clusters_formed": 0,
			"memories_consolidated": 0,
			"memories_forgotten": 0,
			"priority_adjustments": 0,
			"rollback_entries_written": 0,
			"errors": []
		},
		"errors": [
			"no LLM client configured"
		],
		"dry_run": true
	},
	"mcp_tool_count": 36,
	"recall_count": 1,
	"snapshot_count": 1,
	"manifest_count": 1
}

Phase 2 — multi-agent federation PASS

What this phase proves: 4 agents × 50 writes against the 3-node federation with W=2 quorum, followed by a 90s settle and a convergence count on every peer, plus two quorum probes (one peer down must return 201; both peers down must return 503). Catches silent-data-loss and quorum-misclassification regressions.
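The quorum rule and the convergence threshold described here can be sketched as follows. The sketch assumes the coordinating node's own write counts as one acknowledgement, which matches the two probe expectations but is not stated in this report; both function names are illustrative.

```python
def write_status(local_ok, peer_acks, w=2):
    """HTTP status for a write under a W-of-N quorum.

    Assumption: the coordinating node's own durable write counts as one ack.
    """
    acks = (1 if local_ok else 0) + peer_acks
    return 201 if acks >= w else 503


def convergence_threshold(ok_writes, percent=95):
    """Minimum per-node row count after the settle window (rounded up)."""
    return (ok_writes * percent + 99) // 100


# Probe 1: one peer down, the remaining peer acks.
print(write_status(local_ok=True, peer_acks=1))
# Probe 2: both peers down, only the local write succeeds.
print(write_status(local_ok=True, peer_acks=0))
# 95% of 200 successful writes.
print(convergence_threshold(200))
```

Integer arithmetic is used for the threshold so that the result does not depend on floating-point rounding.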

Test results

✓ Burst writes returned 201 — ok=200/200 (qnm=0, fail=0)
✓ node-A convergence ≥ 95% of ok — a=200 / threshold 190
✓ node-B convergence ≥ 95% of ok — b=200 / threshold 190
✓ node-C convergence ≥ 95% of ok — c=200 / threshold 190
✓ Probe 1: one peer down → 201 (quorum met via remaining peer) — got 201
✓ Probe 2: both peers down → 503 (quorum_not_met) — got 503
✓ Overall phase-2 pass flag

Raw evidence

phase2

{
	"phase": 2,
	"pass": true,
	"total_writes": 200,
	"ok": 200,
	"quorum_not_met": 0,
	"fail": 0,
	"counts": {
		"a": 200,
		"b": 200,
		"c": 200
	},
	"probe1_single_peer_down": "201",
	"probe2_both_peers_down": "503",
	"reasons": [
		""
	]
}

Phase 3 — cross-backend migration PASS

What this phase proves: 1000-memory round-trip: SQLite → Postgres, re-run for idempotency, Postgres → SQLite. Asserts zero errors and matching counts. Catches migration-correctness regressions in either direction of a production upgrade path.
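One way a migration re-run can be idempotent is to overwrite rows by key rather than append them, so a second pass reports writes without changing the row count. The sketch below shows that pattern with an in-memory SQLite destination. Whether the ai-memory migrator works this way is an assumption, and the table schema and function names are hypothetical.

```python
import sqlite3


def migrate(src_rows, dst):
    """Copy rows into dst keyed by id; re-running overwrites, never duplicates."""
    written = 0
    for mem_id, body in src_rows:
        dst.execute(
            "INSERT OR REPLACE INTO memories (id, body) VALUES (?, ?)",
            (mem_id, body),
        )
        written += 1
    dst.commit()
    return written


def count(db):
    """Number of rows currently in the destination table."""
    return db.execute("SELECT COUNT(*) FROM memories").fetchone()[0]


# Hypothetical seed data standing in for the 1000 source memories.
rows = [(f"mem-{i}", f"body {i}") for i in range(1000)]
dst = sqlite3.connect(":memory:")
dst.execute("CREATE TABLE memories (id TEXT PRIMARY KEY, body TEXT)")

first = migrate(rows, dst)
second = migrate(rows, dst)  # idempotent re-run
print(first, second, count(dst))
```

Both passes report 1000 writes while the destination stays at 1000 rows, which is the shape of result an idempotency check of this kind looks for.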

Test results

✓ Source SQLite has 1000 seed memories — src_count=1000
✓ Destination after reverse roundtrip has 1000 memories — dst_count=1000
✓ Forward migration SQLite → Postgres: errors=0 — errors=0
✓ Idempotent re-run adds no duplicate rows (existing rows are rewritten) — writes=1000
✓ Reverse migration Postgres → SQLite: errors=0 — errors=0
✓ Overall phase-3 pass flag

Raw evidence

phase3

{
	"phase": 3,
	"pass": true,
	"report_forward": {
		"batches": 1,
		"dry_run": false,
		"errors": [],
		"from_url": "sqlite:///tmp/phase3-source.db",
		"memories_read": 1000,
		"memories_written": 1000,
		"to_url": "postgres://ai_memory:ai_memory_test@127.0.0.1:5433/ai_memory_test"
	},
	"report_idempotent": {
		"batches": 1,
		"dry_run": false,
		"errors": [],
		"from_url": "sqlite:///tmp/phase3-source.db",
		"memories_read": 1000,
		"memories_written": 1000,
		"to_url": "postgres://ai_memory:ai_memory_test@127.0.0.1:5433/ai_memory_test"
	},
	"report_reverse": {
		"batches": 1,
		"dry_run": false,
		"errors": [],
		"from_url": "postgres://ai_memory:ai_memory_test@127.0.0.1:5433/ai_memory_test",
		"memories_read": 1000,
		"memories_written": 1000,
		"to_url": "sqlite:///tmp/phase3-roundtrip.db"
	},
	"src_count": 1000,
	"dst_count": 1000,
	"reasons": [
		""
	]
}

Phase 4 — chaos campaign FAIL

What this phase proves: Runs packaging/chaos/run-chaos.sh on the chaos-client droplet with 50 cycles × 100 writes per fault class and measures convergence_bound = min(count_node1, count_node2) / total_ok. Catches fault-tolerance regressions under SIGKILL of the primary, brief network partition, and related fault models.
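The convergence_bound metric can be computed as in this minimal sketch. The 0.995 pass threshold is the one named in the phase-4 test results; returning `None` when there are no successful writes is an assumption about how an empty cycle should be treated, and the function names are illustrative.

```python
def convergence_bound(count_node1, count_node2, total_ok):
    """min(count_node1, count_node2) / total_ok, or None if nothing was written."""
    if total_ok == 0:
        return None  # assumption: the metric is undefined with no successful writes
    return min(count_node1, count_node2) / total_ok


def passes(bound, threshold=0.995):
    """A fault class passes only if the metric exists and meets the threshold."""
    return bound is not None and bound >= threshold


# Hypothetical cycle: 100 acknowledged writes, one node is missing a row.
bound = convergence_bound(count_node1=100, count_node2=99, total_ok=100)
print(bound, passes(bound))
```

Taking the minimum across nodes means the slowest node sets the score, so a single lagging node fails the fault class.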

Test results

✗ phase4.json did not parse as JSON — the chaos-harness summary never wrote cleanly — see raw JSON below
✗ Per-fault convergence_bound ≥ 0.995 — metric unavailable
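A report generator can handle a missing or truncated summary without crashing by treating a parse failure as a failed check. The sketch below shows one way to do that; the function name and message wording are illustrative and are not taken from scripts/generate_run_html.sh.

```python
import json


def load_phase4_summary(text):
    """Return (summary, problem); exactly one of the two is None."""
    try:
        summary = json.loads(text)
    except json.JSONDecodeError as exc:
        return None, f"phase4.json did not parse as JSON: {exc.msg}"
    if not isinstance(summary, dict):
        return None, "phase4.json is not a JSON object"
    return summary, None


# A cancelled run can leave an empty or half-written file behind.
for raw in ("", '{"phase": 4, "pass"', '{"phase": 4, "pass": false}'):
    summary, problem = load_phase4_summary(raw)
    print(summary if problem is None else problem)
```

When the summary is unavailable, any metric derived from it (such as the per-fault convergence_bound) should be reported as unavailable and not as zero.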

Raw evidence

phase4

raw JSON

All artifacts

Every JSON committed to this campaign directory. Raw, machine-readable, and stable.

Generated by scripts/generate_run_html.sh. Campaign directory: alphaonedev/ai-memory-ship-gate/runs/v0.6.0.0-final-r21. Methodology: alphaonedev.github.io/ai-memory-ship-gate/methodology. Analysis data source: analysis/run-insights.json.