Per-run NHI matrix

Every campaign run that produced a phase4-analysis.json lands here with its NHI-layer verdict rendered alongside the substrate-layer verdict already shown on Campaign runs. One row per run, sorted newest-first.

Each row carries:

  • Run ID — the campaign directory under runs/.
  • Substrate verdict — from a2a-summary.json (this is the same value rendered on the Campaign runs dashboard).
  • NHI verdict — derived from phase4-analysis.json per governance §11.
  • Scenario × arm grounding-rate matrix: per_cell.<scenario>/<arm>.grounding_rate_mean.
  • Top finding — the highest-severity findings[*] entry, with its classification (governance §8.4).
  • Cross-layer row outcome — the consistency cell for substrate finding S24 (#318) vs scenario D (governance §8.3).

Rows where phase4-analysis.json is absent (older or interrupted runs) are omitted from this view; their substrate verdict still renders on Campaign runs.
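
As a rough illustration of how one row could be assembled from the fields listed above, here is a minimal Python sketch. It is not the dashboard's actual code. It assumes that per_cell is a JSON object keyed by "<scenario>/<arm>" strings, that each findings entry carries a severity field, and that severities order as low < medium < high < critical; the function names and the SEVERITY_RANK table are hypothetical. The substrate verdict, the NHI verdict and the cross-layer outcome are left out because their derivation rules live in the governance document.

```python
import json
from pathlib import Path

# Assumed severity ordering; only "high" appears in the table on this page.
SEVERITY_RANK = {"low": 0, "medium": 1, "high": 2, "critical": 3}


def top_finding(findings):
    """Return the highest-severity finding, or None if there are none."""
    if not findings:
        return None
    # Unknown or missing severities rank below every known one.
    return max(findings, key=lambda f: SEVERITY_RANK.get(f.get("severity"), -1))


def grounding_matrix(per_cell):
    """Map (scenario, arm) -> grounding_rate_mean from "<scenario>/<arm>" keys."""
    matrix = {}
    for key, cell in per_cell.items():
        scenario, _, arm = key.partition("/")
        matrix[(scenario, arm)] = cell.get("grounding_rate_mean")
    return matrix


def build_row(run_dir):
    """Build one row for a run directory, or None if the run is omitted."""
    run_dir = Path(run_dir)
    analysis_path = run_dir / "phase4-analysis.json"
    if not analysis_path.exists():
        return None  # older or interrupted run: not shown in this view
    analysis = json.loads(analysis_path.read_text())
    return {
        "run_id": run_dir.name,
        "matrix": grounding_matrix(analysis.get("per_cell", {})),
        "top_finding": top_finding(analysis.get("findings", [])),
    }
```

Calling build_row on each directory under runs/ and dropping the None results would give the set of rows shown on this page, under the assumptions above.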


Per-run NHI verdict

Scenario columns A through J each show T grounding · ΔvsCold.

Run | Substrate | NHI | A | B | C | D | E | F | G | H | I | J | Top finding | Cross-layer (S24/D)
a2a-hermes-v0.6.3.1-r16 | ❌ FAIL | ⚠️ INCONCLUSIVE | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | weak-treatment-effect-A (high, needs_review) | ⚠️ UNKNOWN
a2a-hermes-v0.6.3.1-r15 | ❌ FAIL | ⚠️ INCONCLUSIVE | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | weak-treatment-effect-A (high, needs_review) | ⚠️ UNKNOWN
a2a-hermes-v0.6.3.1-r12 | ❌ FAIL | ⚠️ INCONCLUSIVE | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | weak-treatment-effect-A (high, needs_review) | ⚠️ UNKNOWN
a2a-ironclaw-v0.6.3.1-r27 | ❌ FAIL | ⚠️ INCONCLUSIVE | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | weak-treatment-effect-A (high, needs_review) | ⚠️ UNKNOWN
a2a-ironclaw-v0.6.3.1-r26 | ❌ FAIL | ⚠️ INCONCLUSIVE | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | weak-treatment-effect-A (high, needs_review) | ⚠️ UNKNOWN
a2a-ironclaw-v0.6.3.1-r25 | ❌ FAIL | ⚠️ INCONCLUSIVE | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | weak-treatment-effect-A (high, needs_review) | ⚠️ UNKNOWN
a2a-ironclaw-v0.6.3.1-r19 | ✅ PASS | ⚠️ INCONCLUSIVE | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | — · Δ— | — · Δ— | — · Δ— | — · Δ— | — · Δ— | — · Δ— | weak-treatment-effect-A (high, needs_review) | ⚠️ UNKNOWN
a2a-ironclaw-v0.6.3.1-r18 | ✅ PASS | ⚠️ INCONCLUSIVE | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | — · Δ— | — · Δ— | — · Δ— | — · Δ— | — · Δ— | — · Δ— | weak-treatment-effect-A (high, needs_review) | ⚠️ UNKNOWN
a2a-ironclaw-v0.6.3.1-r16 | ✅ PASS | ⚠️ INCONCLUSIVE | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | — · Δ— | — · Δ— | — · Δ— | — · Δ— | — · Δ— | — · Δ— | weak-treatment-effect-A (high, needs_review) | ⚠️ UNKNOWN
a2a-ironclaw-v0.6.3.1-r15 | ✅ PASS | ⚠️ INCONCLUSIVE | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | — · Δ— | — · Δ— | — · Δ— | — · Δ— | — · Δ— | — · Δ— | weak-treatment-effect-A (high, needs_review) | ⚠️ UNKNOWN
a2a-ironclaw-v0.6.3.1-r14 | ❌ FAIL | ⚠️ INCONCLUSIVE | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | — · Δ— | — · Δ— | — · Δ— | — · Δ— | — · Δ— | — · Δ— | weak-treatment-effect-A (high, needs_review) | ⚠️ UNKNOWN
a2a-ironclaw-v0.6.3.1-r13 | ✅ PASS | ⚠️ INCONCLUSIVE | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | — · Δ— | — · Δ— | — · Δ— | — · Δ— | — · Δ— | — · Δ— | weak-treatment-effect-A (high, needs_review) | ⚠️ UNKNOWN

Total runs with phase4-analysis.json: 12


Reading the matrix

  • A green grounding-rate cell (≥ 0.50) means that real agent claims in that scenario × arm trace back to retrieved memory ops at least half the time. A near-zero cell has one of three causes: the scenario didn't drive enough agent traffic, the agent didn't retrieve, or the retrievals didn't bind to claims. The §7 logs of the corresponding run show which one applies.
  • The cross-layer column is the headline. YES = substrate and NHI layers agree on the known gap. UNKNOWN = scenario D didn't produce data. NO = the campaign found a contradiction between the layers, which is the highest-value output of the entire harness.
  • Top finding: severity high with class needs_review typically means Phase 3 produced no usable agent traffic for that cell, so the fix belongs in the Phase 3 driver, not in ai-memory.
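
The reading rules above can be condensed into a small Python sketch. This is only an illustration, not the dashboard's rendering logic: the 0.50 threshold and the three cross-layer outcomes come from the text, while the function names, the constant names, and the treatment of a missing cell (None) as not green are assumptions.

```python
GREEN_THRESHOLD = 0.50  # from the text: a green cell is >= 0.50

# Meanings of the cross-layer column, paraphrased from the list above.
CROSS_LAYER_MEANING = {
    "YES": "substrate and NHI layers agree on the known gap",
    "UNKNOWN": "scenario D didn't produce data",
    "NO": "the campaign found a contradiction between the layers",
}


def cell_is_green(rate):
    """True when a grounding rate meets the green threshold.

    A missing cell (None) is treated as not green; that choice is an
    assumption of this sketch.
    """
    return rate is not None and rate >= GREEN_THRESHOLD


def describe_cross_layer(outcome):
    """Return the documented meaning of a cross-layer outcome label."""
    return CROSS_LAYER_MEANING.get(outcome, "unrecognised outcome")
```

With the values currently in the table, cell_is_green(0.00) is False for every populated cell, and describe_cross_layer("UNKNOWN") applies to every row.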

For the written interpretation of the most-recent run, see NHI insights. For the explainer on what the scenarios, arms, and metrics are, see NHI assessments.