../ runs index · rendered on Pages

Campaign v0.6.0.0-final-r14 FAIL

ai-memory ref
release/v0.6.0
Completed at
2026-04-20T11:29:27Z
Overall pass
FAIL

Run focus

Convergence STILL missed with 90s settle — silent data-loss bug identified in federation fanout

What this campaign set out to test: Same burst + probes, with 3× the settle time (30s → 90s) to rule out test-window explanations.

What it demonstrated: Proved that extending the settle window from 30s to 90s did NOT materially improve convergence (B moved from 89.5% to 83%; C moved from 73% to 82% — inside statistical noise). Proved that the convergence miss is not a timing problem but a correctness problem. Disproved the hypothesis that async sync-daemon catch-up would eventually close the gap. Identified the specific race: under `JoinSet::shutdown().await` in federation.rs::broadcast_store_quorum, whichever peer's fanout POST was still in-flight when quorum met got its reqwest task ABORTED mid-POST — often before the peer's axum handler committed.

Detailed tri-audience analysis is below, followed by per-phase test results for all four phases of the protocol — including any phase that did not run in this campaign.

AI NHI analysis · Claude Opus 4.7

Convergence STILL missed with 90s settle — silent data-loss bug identified in federation fanout

Product bug uncovered. The harness is now correct; the product itself was dropping writes to one of the two non-leader peers on every burst, depending on which peer won the quorum-ack race. This is exactly the class of bug the ship-gate exists to catch.

What this campaign tested

Same burst + probes, with 3× the settle time (30s → 90s) to rule out test-window explanations.

What it proved (or disproved)

Proved that extending the settle window from 30s to 90s did NOT materially improve convergence (B moved from 89.5% to 83%; C moved from 73% to 82% — inside statistical noise). Proved that the convergence miss is not a timing problem but a correctness problem. Disproved the hypothesis that async sync-daemon catch-up would eventually close the gap. Identified the specific race: under `JoinSet::shutdown().await` in federation.rs::broadcast_store_quorum, whichever peer's fanout POST was still in-flight when quorum met got its reqwest task ABORTED mid-POST — often before the peer's axum handler committed.

For three audiences

Non-technical end users

Found a real bug in the product, not just the test. Imagine sending a critical document to two warehouses for redundant safekeeping. The courier stops as soon as one warehouse signs for it, leaving the other courier still stuck in traffic with your document. One document, one warehouse. If that warehouse burns down before your second courier arrives, you lose the document. This is what silent data loss looks like: the product was telling the customer "committed" while actually only fully committing to leader + one peer. If a node failed during the burst, writes could be lost. The ship-gate catching this BEFORE release is exactly the outcome the whole investment in real-infrastructure testing was designed to produce.

C-level decision makers

Silent data-loss risk identified and SCOPED with zero customer impact. A customer running the 3-node federation (our flagship multi-agent topology) would have had ~50% of writes only single-redundant, not double-redundant as the SLA implied. Discovered pre-tag; no release shipped with this defect. Cost of discovery: roughly $0.60 of cumulative DigitalOcean spend across the investigation (r11-r14) plus engineering time. Cost had it shipped: indeterminate but nontrivial — affected writes would be silently recoverable only by re-establishing quorum AFTER a node outage, and we'd be explaining the pattern to a customer after they'd lost data. The ship-gate's raison d'être just paid for itself.

Engineers & architects

`src/federation.rs::broadcast_store_quorum` spawns fanout POSTs into a `tokio::task::JoinSet`. Main loop waits for W-1 acks then calls `joins.shutdown().await`, which ABORTS still-in-flight tasks. Under W=2/N=3 with two concurrent fanouts racing, the loser's reqwest POST was cancelled mid-flight — frequently BEFORE the receiving axum handler's `db::insert_if_newer` had committed. Correct fix: detach the remaining tasks into a background `tokio::spawn` after the quorum condition is met so they run to completion naturally. Arc<Mutex<AckTracker>> `try_unwrap` still succeeds immediately because the spawned tasks never held the tracker (they only captured client/url/payload/id). Test added: `federation::tests::post_quorum_fanout_reaches_all_peers` with W=2/N=3 against two Ack mock peers, asserts both call-counts == 1 within 200 ms of broadcast_store_quorum returning.

Bugs surfaced and where they were fixed

  1. Federation post-quorum fanout aborted mid-POST (silent data-loss)

    Impact: Under 3-node federation with W=2, ~50% of writes landed on leader + ONE peer rather than all three. Customer-visible only after a node failure: writes attributed to the killed peer that hadn't yet caught up would be missing from the cluster. This is a canonical silent data-loss pattern — the write returned success.

    Root cause: `tokio::task::JoinSet::shutdown().await` after early-exit cancelled in-flight spawned tasks. The tasks didn't hold the tracker Arc, so detaching is safe.

    Fixed in:

What changed going into the next campaign

PR #309 lands on release/v0.6.0. The following campaign (r15) validates under real load: if the fix is correct, node-B and node-C should both reach 200/200. If they don't, the root-cause analysis was wrong.

Phase 1 — functional (per-node) PASS

What this phase proves: Single-node CRUD, backup, curator dry-run, and MCP handshake on each of the three peer droplets. Establishes that ai-memory starts and is functional at the one-node level before federation is exercised.

Test results

node-a

node-b

node-c

Raw evidence

phase1-node-a
{
	"phase": 1,
	"host": "aim-v0-6-0-0-final-r14-node-a",
	"version": "ai-memory 0.6.0",
	"pass": true,
	"reasons": [
		""
	],
	"stats": {
		"total": 1,
		"by_tier": [
			{
				"tier": "mid",
				"count": 1
			}
		],
		"by_namespace": [
			{
				"namespace": "ship-gate-phase1",
				"count": 1
			}
		],
		"expiring_soon": 0,
		"links_count": 0,
		"db_size_bytes": 139264
	},
	"curator": {
		"started_at": "2026-04-20T11:27:34.937250072+00:00",
		"completed_at": "2026-04-20T11:27:34.937535771+00:00",
		"cycle_duration_ms": 0,
		"memories_scanned": 1,
		"memories_eligible": 1,
		"auto_tagged": 0,
		"contradictions_found": 0,
		"operations_attempted": 0,
		"operations_skipped_cap": 0,
		"autonomy": {
			"clusters_formed": 0,
			"memories_consolidated": 0,
			"memories_forgotten": 0,
			"priority_adjustments": 0,
			"rollback_entries_written": 0,
			"errors": []
		},
		"errors": [
			"no LLM client configured"
		],
		"dry_run": true
	},
	"mcp_tool_count": 36,
	"recall_count": 1,
	"snapshot_count": 1,
	"manifest_count": 1
}

raw JSON

phase1-node-b
{
	"phase": 1,
	"host": "aim-v0-6-0-0-final-r14-node-b",
	"version": "ai-memory 0.6.0",
	"pass": true,
	"reasons": [
		""
	],
	"stats": {
		"total": 1,
		"by_tier": [
			{
				"tier": "mid",
				"count": 1
			}
		],
		"by_namespace": [
			{
				"namespace": "ship-gate-phase1",
				"count": 1
			}
		],
		"expiring_soon": 0,
		"links_count": 0,
		"db_size_bytes": 139264
	},
	"curator": {
		"started_at": "2026-04-20T11:27:35.617185397+00:00",
		"completed_at": "2026-04-20T11:27:35.617694143+00:00",
		"cycle_duration_ms": 0,
		"memories_scanned": 1,
		"memories_eligible": 1,
		"auto_tagged": 0,
		"contradictions_found": 0,
		"operations_attempted": 0,
		"operations_skipped_cap": 0,
		"autonomy": {
			"clusters_formed": 0,
			"memories_consolidated": 0,
			"memories_forgotten": 0,
			"priority_adjustments": 0,
			"rollback_entries_written": 0,
			"errors": []
		},
		"errors": [
			"no LLM client configured"
		],
		"dry_run": true
	},
	"mcp_tool_count": 36,
	"recall_count": 1,
	"snapshot_count": 1,
	"manifest_count": 1
}

raw JSON

phase1-node-c
{
	"phase": 1,
	"host": "aim-v0-6-0-0-final-r14-node-c",
	"version": "ai-memory 0.6.0",
	"pass": true,
	"reasons": [
		""
	],
	"stats": {
		"total": 1,
		"by_tier": [
			{
				"tier": "mid",
				"count": 1
			}
		],
		"by_namespace": [
			{
				"namespace": "ship-gate-phase1",
				"count": 1
			}
		],
		"expiring_soon": 0,
		"links_count": 0,
		"db_size_bytes": 139264
	},
	"curator": {
		"started_at": "2026-04-20T11:27:34.890954960+00:00",
		"completed_at": "2026-04-20T11:27:34.891462844+00:00",
		"cycle_duration_ms": 0,
		"memories_scanned": 1,
		"memories_eligible": 1,
		"auto_tagged": 0,
		"contradictions_found": 0,
		"operations_attempted": 0,
		"operations_skipped_cap": 0,
		"autonomy": {
			"clusters_formed": 0,
			"memories_consolidated": 0,
			"memories_forgotten": 0,
			"priority_adjustments": 0,
			"rollback_entries_written": 0,
			"errors": []
		},
		"errors": [
			"no LLM client configured"
		],
		"dry_run": true
	},
	"mcp_tool_count": 36,
	"recall_count": 1,
	"snapshot_count": 1,
	"manifest_count": 1
}

raw JSON

Phase 2 — multi-agent federation FAIL

What this phase proves: 4 agents × 50 writes against the 3-node federation with W=2 quorum, then 90s settle and convergence count on every peer. Plus two quorum probes (one-peer-down must 201, both-peers-down must 503). Catches silent-data-loss and quorum-misclassification regressions.

Test results

Raw evidence

phase2
{
	"phase": 2,
	"pass": false,
	"total_writes": 200,
	"ok": 200,
	"quorum_not_met": 0,
	"fail": 0,
	"counts": {
		"a": 200,
		"b": 166,
		"c": 164
	},
	"probe1_single_peer_down": "201",
	"probe2_both_peers_down": "503",
	"reasons": [
		"node-B count 166 < 95% of 200",
		"node-C count 164 < 95% of 200"
	]
}

raw JSON

Phase 3 — cross-backend migration NOT REACHED

What this phase proves: 1000-memory round-trip: SQLite → Postgres, re-run for idempotency, Postgres → SQLite. Asserts zero errors and counts match. Catches migration-correctness regressions in either direction of a production upgrade path.

This phase did not run because an earlier phase failed and the campaign aborted. Evidence from the phases that did run is above; the protocol would have exercised this phase next if the prior step had passed.

Phase 4 — chaos campaign NOT REACHED

What this phase proves: packaging/chaos/run-chaos.sh on the chaos-client droplet with 50 cycles × 100 writes per fault class. Measures convergence_bound = min(count_node1, count_node2) / total_ok. Catches fault-tolerance regressions under SIGKILL of the primary, brief network partition, and related fault models.

This phase did not run because an earlier phase failed and the campaign aborted. Evidence from the phases that did run is above; the protocol would have exercised this phase next if the prior step had passed.

All artifacts

Every JSON committed to this campaign directory. Raw, machine-readable, and stable.