
Campaign v0.6.0.0-final-r18 FAIL

ai-memory ref: release/v0.6.0
Completed at: 2026-04-20T12:59:06Z
Overall pass: FAIL

Run focus

Metric returned 6.286 (impossibly > 1.0) and partition crashed at cycle 13

What this campaign set out to test: Phase 4 with the surviving-peer metric (PR #312 applied in parts), same two fault classes.

What it demonstrated: Disproved that run-chaos.sh's cycle isolation was clean. Cycles reused `node-0.db`, `node-1.db`, `node-2.db` AND the `chaos` namespace across all 50 iterations. The count-query returned cumulative memories (cycle N's count = 3 × N instead of 3), so the new `min(.)/ok` formula summed into numbers > 1.0. Separately disproved that SIGTERM at teardown reliably releases the listen socket before the next cycle spawns — ai-memory's 30-second graceful WAL-checkpoint path holds the socket.

Detailed tri-audience analysis is below, followed by per-phase test results for all four phases of the protocol — including any phase that did not run in this campaign.

AI NHI analysis · Claude Opus 4.7

Two more chaos harness defects surfaced in one run: per-cycle DB + namespace reuse polluted the convergence counts, and SIGTERM graceful-shutdown races with the next cycle's spawn under partition_minority.

For three audiences

Non-technical end users

We kept finding problems in the way we RAN the chaos tests. None of them were in the product; they were test-setup issues. Each cycle needs to be a clean slate, or the counts bleed across cycles and the numbers become nonsense. And each cycle needs to fully tear down its processes before the next one starts, or the ports stay occupied. Both fixed in the same PR.

C-level decision makers

Chaos harness hardening continues to surface AHEAD of product assertions. That's exactly the posture a ship-gate should take: if the test rig is suspect, we don't trust the test, we fix the rig first. Nothing yet in the chaos campaign has pointed at a real product regression beyond the run-14 federation fanout fix. Phase 4 is a week of rig work away from producing trustworthy signal. Phases 1-3 are already green. Release decision is: ship v0.6.0 on a conservative chaos floor (kill_primary_mid_write only) and iterate partition_minority as informational until the fault-injection timing issues are solved.

Engineers & architects

Two stacked harness defects. (1) The DB path was `WORKDIR/node-N.db`, shared across cycles, so the abandoned WAL from one cycle's SIGKILL'd primary raced the next cycle's replay on spawn. On top of that, the cumulative namespace `"chaos"` made count_node1 return the running total across all prior cycles. Over a full 50-cycle run the new metric's sum of per-cycle mins would be `3 + 6 + 9 + ... + 150 = 3 × 50 × 51 / 2 = 3825`, and `3825 / (3 × 50) = 25.5`; the reported value came out around 6.286 instead, depending on whatever the cycle loop's partial state was. Fix: per-cycle DB files `c${n}-node-${idx}.db` + per-cycle namespace `chaos-c${n}`. (2) SIGTERM at teardown kicked ai-memory's graceful-shutdown WAL-checkpoint path (red-team #233), and the listen socket was held past `wait`'s return. Fix: SIGKILL + 100ms settle. Dirty WAL in the abandoned per-cycle DB doesn't matter because nothing reads it again. Both closed in PR #312.
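
The cumulative-count effect described above can be reproduced with a small simulation. This is a sketch rather than the harness code: `simulate_bound` and its parameters are hypothetical names, and it assumes each cycle lands exactly 3 acknowledged writes (150 total_ok over 50 cycles), that the surviving peers fully converge, and that the metric is the sum of per-cycle min counts divided by total acknowledged writes.

```python
def simulate_bound(cycles: int, ok_per_cycle: int, shared_namespace: bool) -> float:
    """Sum of per-cycle min counts divided by total acknowledged writes."""
    stored = 0        # rows the count query can see in the namespace
    sum_of_mins = 0
    total_ok = 0
    for _ in range(cycles):
        if not shared_namespace:
            stored = 0            # a fresh per-cycle namespace starts empty
        stored += ok_per_cycle    # this cycle's acknowledged writes land
        sum_of_mins += stored     # what the count query reports for this cycle
        total_ok += ok_per_cycle
    return sum_of_mins / total_ok if total_ok else 0.0


print(simulate_bound(50, 3, shared_namespace=True))   # 25.5
print(simulate_bound(50, 3, shared_namespace=False))  # 1.0
```

With a shared namespace the ratio grows with the number of completed cycles, which is why any value above 1.0 points at the rig rather than the product.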

Bugs surfaced and where they were fixed

  1. Per-cycle DB + namespace reuse polluted counts and crashed WAL recovery

    Impact: Metric output 6.286 (impossible, since the bound cannot exceed 1.0). partition_minority died at cycle 13 with `node-0 failed to start`. Phase 4 unusable.

    Root cause: run-chaos.sh reused `node-0.db`/`node-1.db`/`node-2.db` and `namespace="chaos"` across all 50 cycles. Abandoned WAL recovery raced with fresh spawn.

    Fixed in:

  2. SIGTERM teardown race with next cycle's spawn (partition_minority)

    Impact: Under partition_minority, listen socket held by gracefully-shutting-down process collided with next cycle's spawn. Campaign aborted mid-run.

    Root cause: ai-memory's graceful-shutdown path holds the socket during a WAL checkpoint that can take seconds. `wait` returns before the socket actually releases.

    Fixed in:
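
The teardown fix for bug 2 (SIGKILL, then a short settle) can be sketched as follows. This is an illustration in Python, not the contents of run-chaos.sh: `hard_teardown` and `port_is_free` are hypothetical helpers, the port re-check after the settle is an extra safety step that the source does not describe, and POSIX socket semantics are assumed.

```python
import socket
import subprocess
import time


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a new listener could bind host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        # SO_REUSEADDR so lingering TIME_WAIT connections are not counted as "in use".
        probe.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            probe.bind((host, port))
        except OSError:
            return False
    return True


def hard_teardown(procs, ports, settle_s=0.1, timeout_s=5.0):
    """SIGKILL every node, reap it, settle, then confirm the ports are released."""
    for proc in procs:
        if proc.poll() is None:
            proc.kill()      # SIGKILL on POSIX: skips the graceful WAL checkpoint
    for proc in procs:
        proc.wait()          # returns once the process has exited and been reaped
    time.sleep(settle_s)     # the 100ms settle from the fix
    deadline = time.monotonic() + timeout_s
    while not all(port_is_free(p) for p in ports):
        if time.monotonic() >= deadline:
            raise RuntimeError(f"ports still held after teardown: {ports}")
        time.sleep(0.05)
```

Calling `hard_teardown(node_procs, node_ports)` at the end of each cycle would make a held socket fail loudly at teardown instead of surfacing as a start failure in the next cycle.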

What changed going into the next campaign

r19 uses per-cycle `c${n}-node-${idx}.db` files, per-cycle `chaos-c${n}` namespaces, and SIGKILL teardown with a 100ms settle. The kill_primary_mid_write metric reaches 1.0; partition_minority still missed (different root cause; see r19 + r20).
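
As an illustration of the per-cycle naming scheme, the sketch below derives the DB paths and namespace for one cycle. The patterns `c${n}-node-${idx}.db` and `chaos-c${n}` come from the report; the function name `cycle_layout`, the dict shape, and the three-node default are assumptions made for the example.

```python
from pathlib import Path


def cycle_layout(workdir: str, cycle: int, nodes: int = 3) -> dict:
    """Return the isolated namespace and DB paths for a single chaos cycle."""
    return {
        "namespace": f"chaos-c{cycle}",
        "db_paths": [Path(workdir) / f"c{cycle}-node-{idx}.db" for idx in range(nodes)],
    }


layout = cycle_layout("/tmp/phase4-partition_minority", 13)
print(layout["namespace"])         # chaos-c13
print(layout["db_paths"][0].name)  # c13-node-0.db
```

Because both the file names and the namespace carry the cycle number, no cycle can read another cycle's rows or replay another cycle's WAL.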

Phase 1 — functional (per-node) PASS

What this phase proves: Single-node CRUD, backup, curator dry-run, and MCP handshake on each of the three peer droplets. Establishes that ai-memory starts and is functional at the one-node level before federation is exercised.

Test results

node-a

node-b

node-c

Raw evidence

phase1-node-a
{
	"phase": 1,
	"host": "aim-v0-6-0-0-final-r18-node-a",
	"version": "ai-memory 0.6.0",
	"pass": true,
	"reasons": [
		""
	],
	"stats": {
		"total": 1,
		"by_tier": [
			{
				"tier": "mid",
				"count": 1
			}
		],
		"by_namespace": [
			{
				"namespace": "ship-gate-phase1",
				"count": 1
			}
		],
		"expiring_soon": 0,
		"links_count": 0,
		"db_size_bytes": 139264
	},
	"curator": {
		"started_at": "2026-04-20T12:50:05.564550624+00:00",
		"completed_at": "2026-04-20T12:50:05.565276174+00:00",
		"cycle_duration_ms": 0,
		"memories_scanned": 1,
		"memories_eligible": 1,
		"auto_tagged": 0,
		"contradictions_found": 0,
		"operations_attempted": 0,
		"operations_skipped_cap": 0,
		"autonomy": {
			"clusters_formed": 0,
			"memories_consolidated": 0,
			"memories_forgotten": 0,
			"priority_adjustments": 0,
			"rollback_entries_written": 0,
			"errors": []
		},
		"errors": [
			"no LLM client configured"
		],
		"dry_run": true
	},
	"mcp_tool_count": 36,
	"recall_count": 1,
	"snapshot_count": 1,
	"manifest_count": 1
}


phase1-node-b
{
	"phase": 1,
	"host": "aim-v0-6-0-0-final-r18-node-b",
	"version": "ai-memory 0.6.0",
	"pass": true,
	"reasons": [
		""
	],
	"stats": {
		"total": 1,
		"by_tier": [
			{
				"tier": "mid",
				"count": 1
			}
		],
		"by_namespace": [
			{
				"namespace": "ship-gate-phase1",
				"count": 1
			}
		],
		"expiring_soon": 0,
		"links_count": 0,
		"db_size_bytes": 139264
	},
	"curator": {
		"started_at": "2026-04-20T12:50:05.696760663+00:00",
		"completed_at": "2026-04-20T12:50:05.697295428+00:00",
		"cycle_duration_ms": 0,
		"memories_scanned": 1,
		"memories_eligible": 1,
		"auto_tagged": 0,
		"contradictions_found": 0,
		"operations_attempted": 0,
		"operations_skipped_cap": 0,
		"autonomy": {
			"clusters_formed": 0,
			"memories_consolidated": 0,
			"memories_forgotten": 0,
			"priority_adjustments": 0,
			"rollback_entries_written": 0,
			"errors": []
		},
		"errors": [
			"no LLM client configured"
		],
		"dry_run": true
	},
	"mcp_tool_count": 36,
	"recall_count": 1,
	"snapshot_count": 1,
	"manifest_count": 1
}


phase1-node-c
{
	"phase": 1,
	"host": "aim-v0-6-0-0-final-r18-node-c",
	"version": "ai-memory 0.6.0",
	"pass": true,
	"reasons": [
		""
	],
	"stats": {
		"total": 1,
		"by_tier": [
			{
				"tier": "mid",
				"count": 1
			}
		],
		"by_namespace": [
			{
				"namespace": "ship-gate-phase1",
				"count": 1
			}
		],
		"expiring_soon": 0,
		"links_count": 0,
		"db_size_bytes": 139264
	},
	"curator": {
		"started_at": "2026-04-20T12:50:06.432899797+00:00",
		"completed_at": "2026-04-20T12:50:06.433428136+00:00",
		"cycle_duration_ms": 0,
		"memories_scanned": 1,
		"memories_eligible": 1,
		"auto_tagged": 0,
		"contradictions_found": 0,
		"operations_attempted": 0,
		"operations_skipped_cap": 0,
		"autonomy": {
			"clusters_formed": 0,
			"memories_consolidated": 0,
			"memories_forgotten": 0,
			"priority_adjustments": 0,
			"rollback_entries_written": 0,
			"errors": []
		},
		"errors": [
			"no LLM client configured"
		],
		"dry_run": true
	},
	"mcp_tool_count": 36,
	"recall_count": 1,
	"snapshot_count": 1,
	"manifest_count": 1
}


Phase 2 — multi-agent federation PASS

What this phase proves: 4 agents × 50 writes against the 3-node federation with W=2 quorum, then 90s settle and convergence count on every peer. Plus two quorum probes (one-peer-down must 201, both-peers-down must 503). Catches silent-data-loss and quorum-misclassification regressions.
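
The pass conditions described above can be expressed as a small check over the phase 2 report. This is a sketch under assumptions: the field names are taken from the raw evidence for this phase, but the function `phase2_reasons` and the exact pass rule (every peer's count equals `ok`, probes return "201" and "503") are inferred from the description, not taken from the harness.

```python
def phase2_reasons(report: dict) -> list:
    """Return a list of failure reasons; an empty list means the phase passes."""
    reasons = []
    ok = report["ok"]
    for peer, count in sorted(report["counts"].items()):
        if count != ok:
            reasons.append(f"peer {peer} converged to {count}, expected {ok}")
    if report["probe1_single_peer_down"] != "201":
        reasons.append("one-peer-down write was not accepted with 201")
    if report["probe2_both_peers_down"] != "503":
        reasons.append("both-peers-down write was not rejected with 503")
    return reasons


sample = {
    "ok": 200,
    "counts": {"a": 200, "b": 200, "c": 200},
    "probe1_single_peer_down": "201",
    "probe2_both_peers_down": "503",
}
print(phase2_reasons(sample))  # []
```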

Test results

Raw evidence

phase2
{
	"phase": 2,
	"pass": true,
	"total_writes": 200,
	"ok": 200,
	"quorum_not_met": 0,
	"fail": 0,
	"counts": {
		"a": 200,
		"b": 200,
		"c": 200
	},
	"probe1_single_peer_down": "201",
	"probe2_both_peers_down": "503",
	"reasons": [
		""
	]
}


Phase 3 — cross-backend migration PASS

What this phase proves: 1000-memory round-trip: SQLite → Postgres, re-run for idempotency, Postgres → SQLite. Asserts zero errors and counts match. Catches migration-correctness regressions in either direction of a production upgrade path.
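
The round-trip assertions described above might look like the following. It is only a sketch: the field names match the raw evidence for this phase, while `phase3_reasons` and the precise rules (no errors in any leg, read equals written in each leg, source count equals destination count) are assumptions drawn from the description.

```python
def phase3_reasons(report: dict) -> list:
    """Return a list of failure reasons; an empty list means the round-trip is clean."""
    reasons = []
    for leg in ("report_forward", "report_idempotent", "report_reverse"):
        r = report[leg]
        if r["errors"]:
            reasons.append(f"{leg}: {len(r['errors'])} error(s)")
        if r["memories_read"] != r["memories_written"]:
            reasons.append(f"{leg}: read {r['memories_read']}, wrote {r['memories_written']}")
    if report["src_count"] != report["dst_count"]:
        reasons.append(f"count mismatch: src {report['src_count']}, dst {report['dst_count']}")
    return reasons


leg = {"errors": [], "memories_read": 1000, "memories_written": 1000}
sample = {
    "report_forward": leg,
    "report_idempotent": leg,
    "report_reverse": leg,
    "src_count": 1000,
    "dst_count": 1000,
}
print(phase3_reasons(sample))  # []
```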

Test results

Raw evidence

phase3
{
	"phase": 3,
	"pass": true,
	"report_forward": {
		"batches": 1,
		"dry_run": false,
		"errors": [],
		"from_url": "sqlite:///tmp/phase3-source.db",
		"memories_read": 1000,
		"memories_written": 1000,
		"to_url": "postgres://ai_memory:ai_memory_test@127.0.0.1:5433/ai_memory_test"
	},
	"report_idempotent": {
		"batches": 1,
		"dry_run": false,
		"errors": [],
		"from_url": "sqlite:///tmp/phase3-source.db",
		"memories_read": 1000,
		"memories_written": 1000,
		"to_url": "postgres://ai_memory:ai_memory_test@127.0.0.1:5433/ai_memory_test"
	},
	"report_reverse": {
		"batches": 1,
		"dry_run": false,
		"errors": [],
		"from_url": "postgres://ai_memory:ai_memory_test@127.0.0.1:5433/ai_memory_test",
		"memories_read": 1000,
		"memories_written": 1000,
		"to_url": "sqlite:///tmp/phase3-roundtrip.db"
	},
	"src_count": 1000,
	"dst_count": 1000,
	"reasons": [
		""
	]
}


Phase 4 — chaos campaign FAIL

What this phase proves: packaging/chaos/run-chaos.sh on the chaos-client droplet with 50 cycles × 100 writes per fault class. Measures convergence_bound = min(count_node1, count_node2) / total_ok. Catches fault-tolerance regressions under SIGKILL of the primary, brief network partition, and related fault models.
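
A minimal sketch of the formula as stated, with a range guard added. The formula `min(count_node1, count_node2) / total_ok` is from the description above; the function name and the guard are assumptions, included because a value above 1.0 can only mean the counts include rows that were not written in the measured window.

```python
def convergence_bound(count_node1: int, count_node2: int, total_ok: int) -> float:
    """Fraction of acknowledged writes present on the worse of the two surviving peers."""
    if total_ok <= 0:
        raise ValueError("total_ok must be positive")
    bound = min(count_node1, count_node2) / total_ok
    if bound > 1.0:
        # More rows than acknowledged writes: the count query is seeing stale data.
        raise ValueError(f"convergence_bound {bound:.3f} > 1.0; counts are polluted")
    return bound


print(convergence_bound(3, 3, 3))  # 1.0
print(convergence_bound(2, 3, 4))  # 0.5
```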

Test results

Raw evidence

phase4
[chaos] chaos campaign: fault=kill_primary_mid_write cycles=50 writes/cycle=100
[chaos] workdir: /tmp/phase4-kill_primary_mid_write
[chaos] binary: /usr/local/bin/ai-memory
[chaos] cycle 1: nodes ready (pids 4420 4422 4424)
[chaos] cycle 2: nodes ready (pids 4792 4794 4796)
[chaos] cycle 3: nodes ready (pids 5134 5136 5138)
[chaos] cycle 4: nodes ready (pids 5476 5478 5480)
[chaos] cycle 5: nodes ready (pids 5818 5820 5822)
[chaos] cycle 6: nodes ready (pids 6161 6163 6165)
[chaos] cycle 7: nodes ready (pids 6503 6505 6507)
[chaos] cycle 8: nodes ready (pids 6845 6847 6849)
[chaos] cycle 9: nodes ready (pids 7187 7189 7191)
[chaos] cycle 10: nodes ready (pids 7529 7531 7533)
[chaos] cycle 11: nodes ready (pids 7871 7873 7875)
[chaos] cycle 12: nodes ready (pids 8211 8213 8215)
[chaos] cycle 13: nodes ready (pids 8553 8555 8557)
[chaos] cycle 14: nodes ready (pids 8895 8897 8899)
[chaos] cycle 15: nodes ready (pids 9237 9239 9241)
[chaos] cycle 16: nodes ready (pids 9579 9581 9583)
[chaos] cycle 17: nodes ready (pids 9921 9923 9925)
[chaos] cycle 18: nodes ready (pids 10263 10265 10267)
[chaos] cycle 19: nodes ready (pids 10605 10607 10609)
[chaos] cycle 20: nodes ready (pids 10947 10949 10951)
[chaos] cycle 21: nodes ready (pids 11289 11291 11293)
[chaos] cycle 22: nodes ready (pids 11631 11633 11635)
[chaos] cycle 23: nodes ready (pids 11973 11975 11977)
[chaos] cycle 24: nodes ready (pids 12315 12317 12319)
[chaos] cycle 25: nodes ready (pids 12657 12659 12661)
[chaos] cycle 26: nodes ready (pids 13002 13004 13006)
[chaos] cycle 27: nodes ready (pids 13349 13351 13353)
[chaos] cycle 28: nodes ready (pids 13693 13695 13697)
[chaos] cycle 29: nodes ready (pids 14037 14039 14041)
[chaos] cycle 30: nodes ready (pids 14381 14383 14385)
[chaos] cycle 31: nodes ready (pids 14725 14727 14729)
[chaos] cycle 32: nodes ready (pids 15069 15071 15073)
[chaos] cycle 33: nodes ready (pids 15413 15415 15417)
[chaos] cycle 34: nodes ready (pids 15757 15759 15761)
[chaos] cycle 35: nodes ready (pids 16101 16103 16105)
[chaos] cycle 36: nodes ready (pids 16445 16447 16449)
[chaos] cycle 37: nodes ready (pids 16789 16791 16793)
[chaos] cycle 38: nodes ready (pids 17135 17137 17139)
[chaos] cycle 39: nodes ready (pids 17481 17483 17485)
[chaos] cycle 40: nodes ready (pids 17827 17829 17831)
[chaos] cycle 41: nodes ready (pids 18173 18175 18177)
[chaos] cycle 42: nodes ready (pids 18519 18521 18523)
[chaos] cycle 43: nodes ready (pids 18865 18867 18869)
[chaos] cycle 44: nodes ready (pids 19211 19213 19215)
[chaos] cycle 45: nodes ready (pids 19557 19559 19561)
[chaos] cycle 46: nodes ready (pids 19903 19905 19907)
[chaos] cycle 47: nodes ready (pids 20249 20251 20253)
[chaos] cycle 48: nodes ready (pids 20595 20597 20599)
[chaos] cycle 49: nodes ready (pids 20941 20943 20945)
[chaos] cycle 50: nodes ready (pids 21289 21291 21293)
[chaos] ---- summary ----
{
  "total_cycles": 50,
  "total_writes": 5000,
  "total_ok": 150,
  "total_quorum_not_met": 0,
  "total_fail": 4850,
  "convergence_bound": 0.03
}
[chaos] per-cycle JSONL: /tmp/phase4-kill_primary_mid_write/chaos-report.jsonl
[chaos] chaos campaign: fault=partition_minority cycles=50 writes/cycle=100
[chaos] workdir: /tmp/phase4-partition_minority
[chaos] binary: /usr/local/bin/ai-memory
[chaos] cycle 1: nodes ready (pids 21643 21645 21647)
[chaos] cycle 2: nodes ready (pids 21995 21997 21999)
[chaos] cycle 3: nodes ready (pids 22347 22349 22351)
[chaos] cycle 4: nodes ready (pids 22705 22707 22709)
[chaos] cycle 5: nodes ready (pids 23069 23071 23073)
[chaos] cycle 6: nodes ready (pids 23439 23441 23443)
[chaos] cycle 7: nodes ready (pids 23815 23817 23819)
[chaos] cycle 8: nodes ready (pids 24197 24199 24201)
[chaos] cycle 9: nodes ready (pids 24587 24589 24591)
[chaos] cycle 10: nodes ready (pids 24984 24986 24988)
[chaos] cycle 11: nodes ready (pids 25386 25388 25390)
[chaos] cycle 12: nodes ready (pids 25796 25798 25800)
[chaos] FATAL: node-0 failed to start (see /tmp/phase4-partition_minority/node-0.log)


All artifacts

Every JSON committed to this campaign directory. Raw, machine-readable, and stable.