Per-run NHI matrix¶
Every campaign run that produced a phase4-analysis.json lands here
with its NHI-layer verdict rendered alongside the substrate-layer
verdict already shown on Campaign runs. One row per run,
sorted newest-first.
Each row carries:
- Run ID — the campaign directory under
runs/. - Substrate verdict — from
a2a-summary.json(this is the same value rendered on the Campaign runs dashboard). - NHI verdict — derived from
phase4-analysis.jsonper governance §11. - Scenario × arm grounding-rate matrix —
per_cell.<scenario>/<arm>.grounding_rate_mean. - Top finding — the highest-severity
findings[*]entry, with its classification (governance §8.4). - Cross-layer row outcome — the consistency cell for substrate finding S24 (#318) vs scenario D (governance §8.3).
Rows where phase4-analysis.json is absent (older or interrupted
runs) are omitted from this view; their substrate verdict still
renders on Campaign runs.
Per-run NHI verdict¶
| Run | Substrate | NHI | A (T grounding · ΔvsCold) | B (T grounding · ΔvsCold) | C (T grounding · ΔvsCold) | D (T grounding · ΔvsCold) | E (T grounding · ΔvsCold) | F (T grounding · ΔvsCold) | G (T grounding · ΔvsCold) | H (T grounding · ΔvsCold) | I (T grounding · ΔvsCold) | J (T grounding · ΔvsCold) | Top finding | Cross-layer (S24/D) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
a2a-hermes-v0.6.3.1-r16 |
❌ FAIL | ⚠️ INCONCLUSIVE | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | weak-treatment-effect-A (high, needs_review) |
⚠️ UNKNOWN |
a2a-hermes-v0.6.3.1-r15 |
❌ FAIL | ⚠️ INCONCLUSIVE | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | weak-treatment-effect-A (high, needs_review) |
⚠️ UNKNOWN |
a2a-hermes-v0.6.3.1-r12 |
❌ FAIL | ⚠️ INCONCLUSIVE | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | weak-treatment-effect-A (high, needs_review) |
⚠️ UNKNOWN |
a2a-ironclaw-v0.6.3.1-r27 |
❌ FAIL | ⚠️ INCONCLUSIVE | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | weak-treatment-effect-A (high, needs_review) |
⚠️ UNKNOWN |
a2a-ironclaw-v0.6.3.1-r26 |
❌ FAIL | ⚠️ INCONCLUSIVE | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | weak-treatment-effect-A (high, needs_review) |
⚠️ UNKNOWN |
a2a-ironclaw-v0.6.3.1-r25 |
❌ FAIL | ⚠️ INCONCLUSIVE | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | weak-treatment-effect-A (high, needs_review) |
⚠️ UNKNOWN |
a2a-ironclaw-v0.6.3.1-r19 |
✅ PASS | ⚠️ INCONCLUSIVE | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | — · Δ— | — · Δ— | — · Δ— | — · Δ— | — · Δ— | — · Δ— | weak-treatment-effect-A (high, needs_review) |
⚠️ UNKNOWN |
a2a-ironclaw-v0.6.3.1-r18 |
✅ PASS | ⚠️ INCONCLUSIVE | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | — · Δ— | — · Δ— | — · Δ— | — · Δ— | — · Δ— | — · Δ— | weak-treatment-effect-A (high, needs_review) |
⚠️ UNKNOWN |
a2a-ironclaw-v0.6.3.1-r16 |
✅ PASS | ⚠️ INCONCLUSIVE | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | — · Δ— | — · Δ— | — · Δ— | — · Δ— | — · Δ— | — · Δ— | weak-treatment-effect-A (high, needs_review) |
⚠️ UNKNOWN |
a2a-ironclaw-v0.6.3.1-r15 |
✅ PASS | ⚠️ INCONCLUSIVE | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | — · Δ— | — · Δ— | — · Δ— | — · Δ— | — · Δ— | — · Δ— | weak-treatment-effect-A (high, needs_review) |
⚠️ UNKNOWN |
a2a-ironclaw-v0.6.3.1-r14 |
❌ FAIL | ⚠️ INCONCLUSIVE | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | — · Δ— | — · Δ— | — · Δ— | — · Δ— | — · Δ— | — · Δ— | weak-treatment-effect-A (high, needs_review) |
⚠️ UNKNOWN |
a2a-ironclaw-v0.6.3.1-r13 |
✅ PASS | ⚠️ INCONCLUSIVE | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | 0.00 · Δ0.00 | — · Δ— | — · Δ— | — · Δ— | — · Δ— | — · Δ— | — · Δ— | weak-treatment-effect-A (high, needs_review) |
⚠️ UNKNOWN |
Total runs with phase4-analysis.json: 12
Reading the matrix¶
- A green grounding-rate cell (≥ 0.50) means real agent claims in that scenario × arm trace back to retrieved memory ops at least half the time. A near-zero cell means either the scenario didn't drive enough agent traffic, the agent didn't retrieve, or the retrievals didn't bind to claims — which one is true is in the §7 logs of the corresponding run.
- The cross-layer column is the headline. YES = substrate and NHI layers agree on the known gap. UNKNOWN = scenario D didn't produce data. NO = the campaign found a contradiction between the layers, which is the highest-value output of the entire harness.
- Top finding
severity: highwithclass: needs_reviewtypically means Phase 3 produced no usable agent traffic for that cell — the fix is in the phase 3 driver, not ai-memory.
For the written interpretation of the most-recent run, see NHI insights. For the explainer on what the scenarios, arms, and metrics are, see NHI assessments.