
Campaign v0.6.0.0-final-r21 FAIL

ai-memory ref
release/v0.6.0
Completed at
2026-04-20T15:41:12Z
Overall pass
FAIL

Run focus

Cancelled — Phase 4 hung 40+ min on PR #314's aggressive reqwest settings

What this campaign set out to test: Full four-phase protocol with PR #309 (fanout detach), PR #310 (chaos source allowlist), PR #312 (per-cycle harness + surviving-peer metric), PR #313 (3s post-write settle), and PR #314 (federation client keepalive tuning + harness hygiene) all landed on release/v0.6.0.

What it demonstrated: Proved Phases 1, 2, 3 remain green under the latest release-branch code. Disproved that PR #314's client-level tuning was safe — the settings that were supposed to speed up partition recovery instead caused Phase 4 to hang on the loopback mesh. A real-infra campaign found a real regression before it shipped; this is the ship-gate doing its job.

Detailed tri-audience analysis is below, followed by per-phase test results for all four phases of the protocol — including any phase that did not run in this campaign.

AI NHI analysis · Claude Opus 4.7


Operator-cancelled after Phase 4 ran 40+ minutes without completing (baseline is ~10 min). Phases 1, 2, and 3 all passed cleanly before the hang; the full 3-of-4 green story is in the artifacts. The suspect is PR #314's tight TCP keepalive + 5s pool-idle-timeout combination causing ephemeral-port exhaustion under the chaos harness's local connection churn. Reverted in PR #316.

For three audiences

Non-technical end users

The run caught a problem we introduced trying to fix a different problem. Rather than pretending it didn't happen, we cancelled the test, reverted the change, and documented what went wrong. This is how you build trustworthy software: every regression gets caught, every cancellation gets a story attached, every fix gets its own test.

C-level decision makers

Zero customer impact — the suspect code never shipped. The campaign gate worked as designed: a change that looked correct in unit tests failed under real-infrastructure load, which is precisely what the ship-gate exists to expose. PR #316 reverted the offender. Time cost: one campaign run, ~$0.15 of DigitalOcean spend, ~60 minutes of engineering attention. No release delay beyond that.

Engineers & architects

The PR #314 client-builder settings on release/v0.6.0 at commit 971a5db: `.tcp_keepalive(Duration::from_secs(1)).pool_idle_timeout(Duration::from_secs(5)).http2_keep_alive_while_idle(true)`. Hypothesized interaction on the chaos-client's local 3-process mesh: 5s pool-idle-timeout caused every idle fanout connection to close + reopen, 1s tcp_keepalive generated probe traffic on every still-live connection, and the combined churn across 5000 writes likely exhausted the droplet's ~28k ephemeral ports (60s TIME_WAIT retention). Phase 4 hit the 40-min wall. Reverted in PR #316 commit 710ad76.
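The port-exhaustion hypothesis is easy to sanity-check with back-of-envelope arithmetic. A minimal sketch using the figures quoted above (~28k usable ephemeral ports, 60s TIME_WAIT retention); the helper names are illustrative, not part of any harness:

```python
# Back-of-envelope model of ephemeral-port pressure under connection churn,
# using the figures from the analysis above: ~28k usable ports, 60s TIME_WAIT.
EPHEMERAL_PORTS = 28_000
TIME_WAIT_SECS = 60

def steady_state_time_wait(conns_per_sec: float) -> float:
    """Sockets parked in TIME_WAIT once churn reaches steady state."""
    return conns_per_sec * TIME_WAIT_SECS

def exhaustion_threshold() -> float:
    """Connection rate above which the pool drains faster than it recycles."""
    return EPHEMERAL_PORTS / TIME_WAIT_SECS

# ~467 new connections/sec is all the pool can sustain; a 5s pool-idle-timeout
# that reopens every idle fanout connection pushes churn toward that ceiling.
assert round(exhaustion_threshold(), 1) == 466.7
assert steady_state_time_wait(500) > EPHEMERAL_PORTS  # 500 conn/s overflows
```

The kernel recycles a port only after its 60s TIME_WAIT expires, so sustained churn above the threshold drains the pool no matter how fast individual requests complete.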

Bugs surfaced and where they were fixed

  1. PR #314's aggressive reqwest settings caused Phase 4 hang

    Impact: Phase 4 ran 40+ min vs the ~10 min baseline before the operator cancelled it. Zero customer impact — the code never shipped.

    Root cause: Suspected ephemeral-port exhaustion + kernel keepalive thrash from the combination of `tcp_keepalive(1s) + pool_idle_timeout(5s)` on a loopback mesh with continuous connection churn under chaos load.

    Fixed in: PR #316 (revert, commit 710ad76)

What changed going into the next campaign

r22 was dispatched with the PR #316 revert applied and partition_minority moved to opt-in (phase4_chaos.sh default FAULTS = kill_primary_mid_write only). Partition recovery becomes a v0.6.0.1 investigation.

Phase 1 — functional (per-node) PASS

What this phase proves: Single-node CRUD, backup, curator dry-run, and MCP handshake on each of the three peer droplets. Establishes that ai-memory starts and is functional at the one-node level before federation is exercised.

Test results

node-a

node-b

node-c

Raw evidence

phase1-node-a
{
	"phase": 1,
	"host": "aim-v0-6-0-0-final-r21-node-a",
	"version": "ai-memory 0.6.0",
	"pass": true,
	"reasons": [
		""
	],
	"stats": {
		"total": 1,
		"by_tier": [
			{
				"tier": "mid",
				"count": 1
			}
		],
		"by_namespace": [
			{
				"namespace": "ship-gate-phase1",
				"count": 1
			}
		],
		"expiring_soon": 0,
		"links_count": 0,
		"db_size_bytes": 139264
	},
	"curator": {
		"started_at": "2026-04-20T14:50:51.329608076+00:00",
		"completed_at": "2026-04-20T14:50:51.330145771+00:00",
		"cycle_duration_ms": 0,
		"memories_scanned": 1,
		"memories_eligible": 1,
		"auto_tagged": 0,
		"contradictions_found": 0,
		"operations_attempted": 0,
		"operations_skipped_cap": 0,
		"autonomy": {
			"clusters_formed": 0,
			"memories_consolidated": 0,
			"memories_forgotten": 0,
			"priority_adjustments": 0,
			"rollback_entries_written": 0,
			"errors": []
		},
		"errors": [
			"no LLM client configured"
		],
		"dry_run": true
	},
	"mcp_tool_count": 36,
	"recall_count": 1,
	"snapshot_count": 1,
	"manifest_count": 1
}

raw JSON

phase1-node-b
{
	"phase": 1,
	"host": "aim-v0-6-0-0-final-r21-node-b",
	"version": "ai-memory 0.6.0",
	"pass": true,
	"reasons": [
		""
	],
	"stats": {
		"total": 1,
		"by_tier": [
			{
				"tier": "mid",
				"count": 1
			}
		],
		"by_namespace": [
			{
				"namespace": "ship-gate-phase1",
				"count": 1
			}
		],
		"expiring_soon": 0,
		"links_count": 0,
		"db_size_bytes": 139264
	},
	"curator": {
		"started_at": "2026-04-20T14:50:51.703276453+00:00",
		"completed_at": "2026-04-20T14:50:51.703765091+00:00",
		"cycle_duration_ms": 0,
		"memories_scanned": 1,
		"memories_eligible": 1,
		"auto_tagged": 0,
		"contradictions_found": 0,
		"operations_attempted": 0,
		"operations_skipped_cap": 0,
		"autonomy": {
			"clusters_formed": 0,
			"memories_consolidated": 0,
			"memories_forgotten": 0,
			"priority_adjustments": 0,
			"rollback_entries_written": 0,
			"errors": []
		},
		"errors": [
			"no LLM client configured"
		],
		"dry_run": true
	},
	"mcp_tool_count": 36,
	"recall_count": 1,
	"snapshot_count": 1,
	"manifest_count": 1
}

raw JSON

phase1-node-c
{
	"phase": 1,
	"host": "aim-v0-6-0-0-final-r21-node-c",
	"version": "ai-memory 0.6.0",
	"pass": true,
	"reasons": [
		""
	],
	"stats": {
		"total": 1,
		"by_tier": [
			{
				"tier": "mid",
				"count": 1
			}
		],
		"by_namespace": [
			{
				"namespace": "ship-gate-phase1",
				"count": 1
			}
		],
		"expiring_soon": 0,
		"links_count": 0,
		"db_size_bytes": 139264
	},
	"curator": {
		"started_at": "2026-04-20T14:50:51.210863481+00:00",
		"completed_at": "2026-04-20T14:50:51.211338313+00:00",
		"cycle_duration_ms": 0,
		"memories_scanned": 1,
		"memories_eligible": 1,
		"auto_tagged": 0,
		"contradictions_found": 0,
		"operations_attempted": 0,
		"operations_skipped_cap": 0,
		"autonomy": {
			"clusters_formed": 0,
			"memories_consolidated": 0,
			"memories_forgotten": 0,
			"priority_adjustments": 0,
			"rollback_entries_written": 0,
			"errors": []
		},
		"errors": [
			"no LLM client configured"
		],
		"dry_run": true
	},
	"mcp_tool_count": 36,
	"recall_count": 1,
	"snapshot_count": 1,
	"manifest_count": 1
}

raw JSON

Phase 2 — multi-agent federation PASS

What this phase proves: 4 agents × 50 writes against the 3-node federation with W=2 quorum, then a 90s settle and a convergence count on every peer, plus two quorum probes (one-peer-down must return 201, both-peers-down must return 503). Catches silent-data-loss and quorum-misclassification regressions.
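The probe expectations fall directly out of W=2 quorum arithmetic on a 3-node federation. A minimal sketch of the classification (function name and ack accounting are illustrative, not the ai-memory API):

```python
def classify_write(acks: int, write_quorum: int = 2) -> int:
    """Map successful replica acks to the HTTP status the probes expect."""
    return 201 if acks >= write_quorum else 503

# The coordinating node always acks locally; each live peer adds one ack.
assert classify_write(acks=3) == 201  # all three nodes up
assert classify_write(acks=2) == 201  # one peer down: probe 1 expects 201
assert classify_write(acks=1) == 503  # both peers down: probe 2 expects 503
```

Probe 2 matters as much as probe 1: a node that returns 201 with both peers down is silently accepting writes below quorum, which is exactly the misclassification this phase exists to catch.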

Test results

Raw evidence

phase2
{
	"phase": 2,
	"pass": true,
	"total_writes": 200,
	"ok": 200,
	"quorum_not_met": 0,
	"fail": 0,
	"counts": {
		"a": 200,
		"b": 200,
		"c": 200
	},
	"probe1_single_peer_down": "201",
	"probe2_both_peers_down": "503",
	"reasons": [
		""
	]
}

raw JSON

Phase 3 — cross-backend migration PASS

What this phase proves: 1000-memory round-trip: SQLite → Postgres, re-run for idempotency, Postgres → SQLite. Asserts zero errors and counts match. Catches migration-correctness regressions in either direction of a production upgrade path.
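The idempotency requirement is easiest to see as an upsert keyed by memory id: re-running the same migration rewrites the same rows and changes nothing. A toy model, with dicts standing in for the two backends (not the actual migrator):

```python
def migrate(src: dict, dst: dict) -> int:
    """Upsert every memory by id; a second run is a no-op by construction."""
    for mem_id, memory in src.items():
        dst[mem_id] = memory
    return len(src)

source = {i: f"memory-{i}" for i in range(1000)}
target = {}
assert migrate(source, target) == 1000  # forward: 1000 read, 1000 written
assert migrate(source, target) == 1000  # re-run: same counts, no duplicates
assert len(target) == 1000              # dst_count matches src_count
```

This is why the phase can assert identical read/write counts on the idempotent re-run: an upsert-style migrator reports the same 1000/1000 without inflating the destination.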

Test results

Raw evidence

phase3
{
	"phase": 3,
	"pass": true,
	"report_forward": {
		"batches": 1,
		"dry_run": false,
		"errors": [],
		"from_url": "sqlite:///tmp/phase3-source.db",
		"memories_read": 1000,
		"memories_written": 1000,
		"to_url": "postgres://ai_memory:ai_memory_test@127.0.0.1:5433/ai_memory_test"
	},
	"report_idempotent": {
		"batches": 1,
		"dry_run": false,
		"errors": [],
		"from_url": "sqlite:///tmp/phase3-source.db",
		"memories_read": 1000,
		"memories_written": 1000,
		"to_url": "postgres://ai_memory:ai_memory_test@127.0.0.1:5433/ai_memory_test"
	},
	"report_reverse": {
		"batches": 1,
		"dry_run": false,
		"errors": [],
		"from_url": "postgres://ai_memory:ai_memory_test@127.0.0.1:5433/ai_memory_test",
		"memories_read": 1000,
		"memories_written": 1000,
		"to_url": "sqlite:///tmp/phase3-roundtrip.db"
	},
	"src_count": 1000,
	"dst_count": 1000,
	"reasons": [
		""
	]
}

raw JSON

Phase 4 — chaos campaign FAIL

What this phase proves: packaging/chaos/run-chaos.sh on the chaos-client droplet with 50 cycles × 100 writes per fault class. Measures convergence_bound = min(count_node1, count_node2) / total_ok. Catches fault-tolerance regressions under SIGKILL of the primary, brief network partition, and related fault models.
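The metric itself is a one-liner; a minimal sketch with made-up illustrative counts (the real values come from the surviving peers after each fault cycle):

```python
def convergence_bound(count_node1: int, count_node2: int, total_ok: int) -> float:
    """Fraction of acknowledged writes guaranteed present on both surviving peers."""
    return min(count_node1, count_node2) / total_ok

# Illustrative: 5000 acknowledged writes; surviving peers hold 4990 and 5000.
assert convergence_bound(4990, 5000, 5000) == 0.998
assert convergence_bound(5000, 5000, 5000) == 1.0  # full convergence
```

Taking the min makes the bound conservative: it reflects the worst-off surviving peer, so a regression on either node drags the metric down.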

Test results

Raw evidence

phase4

(no JSON artifact: Phase 4 was operator-cancelled before the chaos harness wrote its report)

raw JSON

All artifacts

Every JSON artifact is committed to this campaign directory — raw, machine-readable, and stable.