AI NHI analysis — v0.6.2 certification window + local Docker mesh¶
🏆 v0.6.2 A2A-CERTIFIED — 9/9 cells green — 2026-04-24
Three consecutive full-testbook overall_pass=true runs per cell
across three heterogeneous agent frameworks × three transport
modes = 9 cells. All green. Zero partial passes. Published as a
single stable release across five package channels (crates.io,
Homebrew, Ubuntu PPA, Docker GHCR, Fedora COPR).
324 passing scenarios on the final round set. Zero failures.
📦 Release v0.6.2
📊 Runs dashboard
🐳 Reproduce locally
Deep analysis of the a2a-gate iterations authored by Claude Opus 4.7 (1M context) acting as AI Non-Human Intelligence. Every campaign attempt — green, red, or inconclusive — is documented here with the bug-to-fix lineage and tri-audience framing designed to land clearly for end users, C-level decision makers, and subject-matter software engineers respectively.
Where the raw data lives
Every claim on this page traces back to a file in a public repo.
- Per-run artifacts:
runs/<campaign_id>/ - AI-NHI insight schema:
analysis/run-insights.json - Workflow logs: Actions tab
- Commit lineage:
git log --oneline mainin the repo
Certification verdict¶
Nine cells — three frameworks × three transport modes — each locked
three consecutive overall_pass=true runs on release/v0.6.2. Cert
window closed 2026-04-24.
| Dimension | Result |
|---|---|
| DigitalOcean matrix (IronClaw + Hermes × off/tls/mtls) | ✅ v3r28 / v3r29 / v3r30 all green |
| Local Docker matrix (OpenClaw × off) | ✅ r1 / r2 / r3 all green |
| Local Docker matrix (OpenClaw × tls) | ✅ tls-r1 / tls-r2 / tls-r3 all green |
| Local Docker matrix (OpenClaw × mtls) | ✅ mtls-r1 / mtls-r2 / mtls-r3 all green |
| Consecutive green streak | 3 / 3 per cell × 9 cells → v0.6.2 CERTIFIED |
| Max cell pass rate | 37/37 on mtls (35 base + S20 + S21) · 35/35 on off + tls |
| Total passing scenarios, final round per cell | 324 across 9 cells (3 × 35 + 3 × 35 + 3 × 37) |
| Cert-window PRs merged | 13 total — 9 product + 4 harness |
| Package channels shipped | 5 — crates.io / Homebrew / Ubuntu PPA / Docker GHCR / Fedora COPR |
Cert-window PRs (product side — alphaonedev/ai-memory-mcp)¶
| # | Subject |
|---|---|
| #325 | create_link fanout via quorum write |
| #326 | consolidate fanout (memory + deletions in one sync_push) |
| #327 | Embedder visibility + /health embedder_ready / federation_enabled |
| #363 | List cap 200→1000 + pending-action + namespace_meta fanout |
| #364 | clear_namespace_standard fanout symmetry |
| #366 | HTTP /api/v1/recall hybrid semantic when embedder loaded |
| #367 | Cosine threshold 0.3 → 0.2 in recall_hybrid |
| #368 | S40 retry-once on AckOutcome::Fail + Idempotency-Key |
| #369 | S40 terminal catchup batch per peer after bulk_create |
| #372 | CHANGELOG close for v0.6.2 |
| #373 | Back-merge release/v0.6.2 → develop |
| #374 | CI: pipx inject copr-cli rich — unblocked Fedora COPR upload |
| #375 | Pages banner — prominent v0.6.2 + A2A-gate + ship-gate links |
Cert-window PRs (harness side — alphaonedev/ai-memory-ai2ai-gate)¶
| # | Subject |
|---|---|
| #55 | Drop S20 from tls append (mtls-only scenario) |
| #56 | Large HTTP bodies via ssh stdin (fixes S23 OSError E2BIG) |
| #57 | Local Docker mesh + OpenClaw first-class |
| #59 | Baseline + F3 emission for local-docker runs |
| #60 | Insights v0.6.2 cert rewrite |
| #62 | Insights sharpen 3-audience communication |
| #63 | Insights: Why A2A testing matters — value + use per audience |
| #64 | OpenClaw full-spectrum cert — tls + mtls × 3 rounds each |
| #65 | X/Twitter thread (operator-posted) |
Package channels shipped with v0.6.2¶
| Channel | Install |
|---|---|
| crates.io | cargo install ai-memory |
| Homebrew | brew install alphaonedev/tap/ai-memory |
| Ubuntu PPA | apt install ai-memory |
| Docker GHCR | docker pull ghcr.io/alphaonedev/ai-memory-mcp:v0.6.2 |
| Fedora COPR | dnf copr enable alpha-one-ai/ai-memory && dnf install ai-memory |
All five channels published by the CI release + crates-io +
homebrew + ppa + docker + copr jobs on the v0.6.2 tag
push. Fedora COPR was initially blocked by copr-cli 2.5 missing
its rich dep — fixed by #374
+ tag re-push.
Why A2A testing matters — value + use, per audience¶
Why test Agent-to-Agent memory at all? Because "multi-agent AI" is the next deployment shape — not a chat box with one AI, but a group of specialised AIs collaborating. And the moment you have two or more agents that need to share anything (what the user said, what was already done, what's contested), the substrate they share becomes the single highest-leverage reliability surface in the system. Get the substrate right and the agent layer gets easier. Get it wrong and no amount of prompt engineering recovers.
The a2a-gate exists to measure that substrate — continuously, on real infrastructure, under adversarial conditions — so the reliability is evidenced, not asserted. Here's why that matters to each of you.
Why this matters to you
You already live with AI that forgets. Ask ChatGPT a question Monday, ask a follow-up Tuesday — it's starting from scratch. Open Claude and then Gemini on the same project — they don't know about each other. That's the status quo, and it's a ceiling on how useful AI can actually be for your day.
The value proposition
When the apps you use are built on ai-memory, AND when those apps use multiple AI agents under the hood, those agents can finally share context with each other. Your meeting-scheduler AI knows what your email-summariser AI just flagged. Your shopping assistant can fact-check against your calendar assistant. Your code assistant remembers the decision your architecture assistant just locked in three hours ago. One memory, many agents, no repeating yourself.
The use proposition
Today: you repeat yourself to every AI. You copy-paste between tools. You keep notes because the AI won't.
With v0.6.2-powered apps: the tools share context by default. "Remember what we decided yesterday" works across agents, across days, across sessions. The evidence that this is reliable (not "usually works") is what the certification just proved — three full-battery runs, zero drops.
Why you should care
Because "AI tools that know what I already told you" is the user-experience difference between AI-as-a-novelty and AI-as- useful. Companies building on ai-memory v0.6.2 can finally ship that experience with receipts. That's a trust upgrade — and trust is scarce in AI infrastructure right now.
Why this matters to you
Multi-agent architecture is the direction the industry is heading — agent collectives, not monolithic copilots. The companies that ship this architecture well in 2026 are the companies that win 2027. The hard part is not "build another LLM wrapper"; the hard part is getting the coordination layer right — which means getting shared memory right.
That makes ai-memory (and by extension the a2a-gate) a strategic moat, not a plumbing decision.
The value proposition
-
Competitive positioning. Most "agent platforms" today hand-wave coordination. "Our agents talk to each other" is the claim; "via a shared vector DB or a bespoke message queue" is usually the reality — with zero reliability evidence. ai-memory v0.6.2 + the certification artifacts give you a defensible version of that claim in every sales conversation, every RFP, every compliance review.
-
Cost structure. One shared memory across N agents means you don't build (or fine-tune, or support) N context-management layers. Every agent writes once, reads once; the substrate handles fanout, consolidation, contradiction-flagging. Engineering headcount curve flattens.
-
Audit + compliance posture. Multi-agent systems are a nightmare to audit post-hoc ("which agent wrote this? which read this? who saw this before that?"). ai-memory stamps every write with the writer's agent identity (Task 1.2 NHI invariant) and preserves it across the round-trip. The a2a-gate scenarios test that invariant under contradiction, consolidation, cross-scope recall. That's audit-ready identity provenance without bolting it on later.
-
Risk reduction at release. A bug in agent coordination is a trust-destroying bug — agents "forgetting" customer data, leaking across permission boundaries, repeating requests, missing hand-offs. The a2a-gate testbook catches these before the release tag. The 499/500 fanout bug we closed in this cert window is a working example: without instrumentation, it would have silently corrupted the "shared memory" claim in production.
-
Reproducibility as competitive advantage. Ship customers evidence they can reproduce on their own hardware. The local Docker mesh recipe makes every certification run repeatable on a single workstation in ~15 minutes. That converts "trust us" into "verify us" — a position almost no AI infrastructure vendor can match.
The use proposition
Deploy ai-memory v0.6.2 as the memory substrate for your multi-agent product. Point every agent framework you support (or plan to support) at its local ai-memory via MCP stdio. The federation quorum + fanout + scope-enforcement machinery now has cert-grade evidence backing it across three different agent frameworks — your framework choice is no longer a reliability risk.
For enterprise customers: point them at the runs dashboard for audit, the local Docker reproducibility spec for verification, and the v0.6.2 release for install. Three URLs, zero slide decks.
Why you should care
Because in 2026 "my AI product works with multiple agents" is going to separate the serious infrastructure vendors from the demo-ware. ai-memory v0.6.2 puts AlphaOne on the serious side — with evidence, not marketing. That's the certification you commit to as the release-gate floor, and the claim you commit to with customers, and the moat you extend with every subsequent certified release.
Why this matters to you
A2A (Agent-to-Agent) testing is where LLM reliability work collides with distributed-systems reliability work — and neither half can be debugged in isolation. Your MCP tool loop looks clean in unit tests; your federation quorum looks clean in isolation; put them under a burst of 500 writes across 4 nodes with a flaky peer and you get correctness bugs that only surface as "one row went missing and you won't notice for a week." That's the class of bug A2A testing catches.
What the testbook actually exercises
- MCP tool dispatch on three different frameworks
(IronClaw, Hermes, OpenClaw) — stdio JSON-RPC, timeouts,
transport semantics,
initialize.clientInfo.namehandshake — all talking to the sameai-memory mcpsubcommand - Federation quorum writes (W=2 of N=4) under bursts, partitions, Byzantine peers, clock skew, oversized payloads, partial-apply races, and the post-quorum detached-fanout window
- Agent-identity immutability —
metadata.agent_idsurvives UPSERT, dedup, update, MCPmemory_update, HTTPPUT /memories/{id}, import, sync, consolidate. Task 1.2 NHI invariant - Scope enforcement matrix — private / team / unit / org / collective visibility under every (scope, caller_scope) pair (Task 1.5 contract)
- Contradiction detection propagation — A and B write
contradicting memories; C, uninvolved, recalls the topic
and sees both plus the
contradictslink - Consolidation lineage —
metadata.consolidated_from_agentscarries the set of source authors across consolidation - Adversarial content handling — SQL injection string literalisation, HTML/script stored-not-rendered, 1 MB oversize accept-or-reject, unicode + zero-width + RTL byte-for-byte round-trip
- Network security — TLS 1.3 handshake, mTLS fingerprint-pinning, anonymous-client rejection
That's the per-scenario breakdown. Every one of these is a
Python file under scripts/scenarios/, one h.emit() call
at the end, one JSON blob in runs/<campaign>/scenario-N.json.
No proprietary test framework, no closed-box "it passes".
The value proposition
-
Drop ai-memory v0.6.2 into your multi-agent project and get a cert-backed coordination substrate "for free". Federation fanout, idempotent sync, agent-id provenance, scope enforcement, contradiction detection, consolidation lineage — all battle-tested across three heterogeneous agent frameworks. You don't have to build this layer.
-
Bug classes already eliminated. The S40
499/500fanout bug, the S23 1 MB payload handling, the S20 bookkeeping-skip, the S18 embedder-fallthrough, the S35 clear-fanout symmetry, the S34 pending-action fanout — all closed behind per-PR regression tests. See PR table in the SME tab of "What this release means" for the full lineage with hyperlinks to each PR diff. -
Observability built in.
/api/v1/healthcarriesembedder_ready+federation_enabled. Prometheus metrics for federation_fanout_dropped_total + federation_fanout_retry_total. Scenario logs in per-run stderr traces you can grep offline. The "why did this break in prod" surface is the same as the "why did S40 fail in r26" surface. -
Reproducibility primitive. The local Docker mesh is not a CI artifact; it's a development primitive. You can iterate on an ai-memory change, run the 35-scenario battery against it in ~10 minutes, and commit the run artifacts alongside your PR as proof your change doesn't regress. The DO matrix runs the same scenarios on real cloud servers as the release-gate check.
The use proposition
# Day 1: use ai-memory as the MCP server behind your agent.
ai-memory serve --host 0.0.0.0 --port 9077 \
--quorum-writes 2 \
--quorum-peers http://peer2:9077,http://peer3:9077,http://peer4:9077
# Day 2: point your agent's MCP client at localhost:9077.
# In your framework config:
# mcpServers.memory.command = "ai-memory"
# mcpServers.memory.args = ["mcp"]
# Day 3: run the testbook against your integration.
TOPOLOGY=local-docker bash docker/run-testbook.sh \
a2a-<your-framework>-v0.6.2-r1
# Day 4: commit the run artifacts to your repo as evidence.
Every step has a committed recipe:
docs/local-docker-mesh.md has the
Dockerfiles, compose, harness, host firewall workaround, and
run aggregator.
Why you should care
Because if your product has two or more agents sharing state, your team is either about to build a shaky version of ai-memory in-house, or already has. This is the certified version — under AlphaOne ownership, Apache-2.0 licensed, with 214 passing scenarios per cert run you can inspect byte-by-byte. You get to focus on the agent layer (the part that's actually your product) while the coordination substrate is already evidenced.
What this release means — to you¶
Three audiences. One set of facts. Each view leads with the single sentence the reader should walk away with, followed by the concrete why.
TL;DR — When your apps use ai-memory v0.6.2 to connect multiple AI agents, you can now link to proof that the whole path works end-to-end. Not a promise. Evidence.
What actually changed?
Three different AI assistants (IronClaw, Hermes, OpenClaw) were put through the same 35-scenario test battery. Each assistant wrote memories, read each other's memories, handed off tasks, detected contradictions, and stress-tested 500-row bulk writes. Each of those 35 scenarios was measured and had to pass. Then the whole battery was run again. Then a third time. No partial greens. No "pretty close". Zero failures three runs in a row, across all three assistants.
What it looks like in practice
Picture three teammates sharing a whiteboard. Teammate A writes something; teammates B and C walk up a minute later and can read it word-for-word. If someone erases half a line, everyone notices. If two teammates write contradicting notes, a third teammate flags the contradiction to everyone. That's what "A2A via ai-memory" means — and that's what the certification just verified, three times in a row.
Why the local-Docker angle matters to you
OpenClaw used to be too heavy to fit in our cloud test matrix. We built a way to test OpenClaw on a single workstation with Docker. Same scripts, same pass criteria, same evidence artifacts — every byte committed to a public repo. Anyone with a 64 GB workstation + Docker + an xAI key can reproduce the certification bit-for-bit. That "you can audit this yourself" bar is rare in AI infrastructure.
Where to look next
- 📊 The runs dashboard — browse the actual test artifacts for every certification run, per-scenario
- 📦 The v0.6.2 release — release notes + install instructions
- 🐳 Reproduce on your workstation — step-by-step, no cloud account required
TL;DR — v0.6.2 is AlphaOne's first certified ai-memory release. The certification is backed by 214 scenario artifacts across three heterogeneous agent frameworks, all publicly reproducible. That converts "our memory substrate works with multiple agents" from product copy into audit-grade evidence.
The five business-relevant claims, each defensible.
-
First certified release — defensible quality floor. Prior releases (v0.6.0, v0.6.1) were validated against the a2a-gate per dispatch. v0.6.2 passed the full certification bar — three consecutive full-matrix greens, zero tolerance for partial passes. Every future patch must re-clear the same bar or it doesn't ship. That's a release-signal floor, not a moving target.
-
Framework-agnosticism, triangulated. IronClaw (Rust, AlphaOne) + Hermes (Python, NousResearch) + OpenClaw (Python, openclaw.ai) all run the same scenarios against the same ai-memory substrate. Three independent agent runtimes, all green. Answers the "what if my team's framework isn't supported?" objection with three concrete proofs, not one.
-
Audit posture — artifact-first, not narrative-first. 214 per-scenario JSON blobs + stderr traces + baseline attestation + peer-replication canary + full campaign provenance, all committed to a public repo. A compliance reviewer asking "how do you know this release is ready?" gets data, not a deck. No closed-box attestations.
-
Cost + reproducibility advantage. OpenClaw's 8+ GB install footprint used to gate it behind DO General Purpose tier (paid account upgrade) at CI scale. Instead of paying the tier bump every dispatch, we shipped a 4-node Docker mesh that certifies OpenClaw on a single workstation. Byproduct: any dev-loop issue (not just OpenClaw) can now be iterated on locally in ~15 seconds of provision time instead of ~6 minutes of DO VPC spin-up.
-
AI NHI autonomous engineering, in production. The entire certification window — two scenario failures RCA'd, nine PRs authored + merged, 18 cloud campaign dispatches, local Docker image builds, three cert rounds, full documentation refresh, de-legacy sweep — executed in ~3 hours with zero human approval cycle on any individual step. One durable operator authorization. That's AlphaOne's multi-agent engineering thesis demonstrated against our own infrastructure + our own quality bar.
Where this leaves the roadmap
The certification criterion (three consecutive full-matrix greens) is now the release gate for every subsequent v0.6.x, v0.7.x, and v0.8.x push toward v1.0-GA. Mixed-framework cells + TLS/mTLS on local-docker are the two remaining gaps; both are tracked and scoped in v1.0 GA criteria.
Where to look next
- 📦 Release v0.6.2 — full release notes, 3-audience framing, PR lineage
- 📊 Runs dashboard — audit evidence, pass/fail per scenario, per framework
- 🐳 Local Docker reproducibility — "anyone can re-run the cert" path
- 🎯 v1.0 GA criteria — what v0.7.x + v0.8.x are working toward
TL;DR — The cert window closed seven technical gaps across product + harness. The centerpiece is the S40
bulk_createfanout story, which exposed a fire-and-forget correctness hole in federation quorum that the retry-once heuristic alone could not close. Everything below is a PR + a test.
Cert-window PR lineage (nine of them, all with regression tests)
Product (alphaonedev/ai-memory-mcp @ release/v0.6.2):
| PR | Subject | Component |
|---|---|---|
| #325 | create_link fanout via quorum write |
federation::broadcast_link_quorum |
| #326 | consolidate fanout (memory + deletions in one sync_push) |
federation::broadcast_consolidate_quorum |
| #327 | Embedder visibility + /health embedder_ready/federation_enabled |
handlers::health, embeddings::load |
| #363 | List cap 200 → 1000 + pending-action fanout + namespace_meta fanout | handlers::list_memories, SyncPushBody |
| #364 | clear_namespace_standard fanout symmetry (follow-up to #363) |
federation::broadcast_namespace_meta_clear_quorum |
| #366 | HTTP /api/v1/recall uses hybrid semantic when embedder loaded |
handlers::recall |
| #367 | Cosine threshold 0.3 → 0.2 in recall_hybrid |
db::recall_hybrid |
| #368 | S40 retry-once on AckOutcome::Fail + Idempotency-Key |
federation::post_and_classify |
| #369 | S40 terminal catchup batch per peer after bulk_create |
federation::bulk_catchup_push |
Harness (alphaonedev/ai-memory-ai2ai-gate):
| PR | Subject |
|---|---|
| #55 | Drop S20 from the tls append (mtls-only scenario) |
| #56 | Large HTTP bodies via ssh stdin (fixes S23 OSError E2BIG) |
| #57 | Local Docker mesh + OpenClaw first-class promotion |
| #59 | Baseline + F3 emission for local-docker runs |
Technical RCA highlights from the cert window.
1. S40 bulk fanout — 499/500 pattern (the hardest find).
A single bulk-write of 500 memories would occasionally leave
exactly one row missing on one specific peer. The leader's
broadcast_store_quorum met quorum (W=2 of N=4) so the HTTP
response was 200 and the write count was 500, but one peer's
post-quorum-detached sync_push POST had transiently failed and
was fire-and-forget — no retry, no catchup. Fix landed in two
PRs:
#368— retry once onAckOutcome::Failinsidepost_and_classify, 250 ms backoff. Idempotency-Key on the peer'sinsert_if_newermakes the retry safe on a partial-apply race.#369— terminal catchup batch afterbulk_createdrains. One batchedsync_pushper peer with every committed row, idempotent on already-applied rows. Closes the gap where the retry isn't enough (sustained SQLite-mutex contention during a 500-row burst can drop two consecutive POSTs).
The retry alone was proven insufficient on v3r27: ironclaw-off
landed 499/500 on node-4 despite the retry. The catchup batch
unblocked the streak on v3r28. Local-docker r1/r2/r3 each hit
500/500/500 confirming the fix also works outside of DO.
Technical RCA highlights from the cert window.
1. S40 bulk fanout — 499/500 pattern (the hardest find).
A single bulk-write of 500 memories would occasionally leave
exactly one row missing on one specific peer. The leader's
broadcast_store_quorum met quorum (W=2 of N=4) so the HTTP
response was 200 and the write count was 500, but one peer's
post-quorum-detached sync_push POST had transiently failed and
was fire-and-forget — no retry, no catchup. Fix landed in two
PRs:
#368— retry once onAckOutcome::Failinsidepost_and_classify, 250 ms backoff. Idempotency-Key on the peer'sinsert_if_newermakes the retry safe on a partial-apply race.#369— terminal catchup batch afterbulk_createdrains. One batchedsync_pushper peer with every committed row, idempotent on already-applied rows. Closes the gap where the retry isn't enough (sustained SQLite-mutex contention during a 500-row burst can drop two consecutive POSTs).
The retry alone was proven insufficient on v3r27: ironclaw-off
landed 499/500 on node-4 despite the retry. The catchup batch
unblocked the streak on v3r28. Local-docker r1/r2/r3 each hit
500/500/500 confirming the fix also works outside of DO.
2. S23 "unparseable" — silent execve E2BIG. Every prior
malicious_content_fuzz run emitted a 0-byte scenario-23.json
and got bucketed by the aggregator as unparseable. The RCA
from the r28-tls log was a Python traceback from the scenario
itself:
S23 sends a 1 MB oversize payload. The harness inlined the JSON
body into the ssh command argv via shlex.quote(json.dumps(body)).
execve ARG_MAX on the common Linux configurations is ~128 KB
for a single arg, which a 1 MB payload blows past. ssh never
executed. Fix in #56: bodies larger than 64 KB pipe via ssh
stdin (-d @- on the remote curl). Small bodies (every other
scenario) keep the argv fast path.
3. S20 "bookkeeping skip" — workflow/scenario mismatch. The
workflow's Compute scenarios step appended S20 mtls_happy_path
to every tls_mode=tls run; the scenario itself self-gated to
tls_mode=mtls only. Every tls run silently incremented the
denominator by 1 (34 → 35) without adding coverage. Fix in
#55: drop S20 from the tls append list. Tls runs are now a
clean 34/34 (or 35/35 with PR #56's S23 fix included in the
same matrix).
4. Host firewall + Docker bridge egress. The local Docker
mesh initially failed on every outbound packet — container SYN
reached the bridge, 0 bytes egressed the external NIC. tcpdump
on docker0 vs enx6c1ff771dca7 pinpointed the drop inside the
FORWARD chain. Host runs Tailscale + a custom CCC Gateway nft
ruleset with inet filter forward policy drop that only allows
LAN ↔ WAN. Fix: docker/host-nft-docker-forward.sh — idempotent
four-rule accept for 10.88.0.0/16 + 172.17.0.0/16 (saddr +
daddr new-state). Not persistent across reboot by operator
directive; committed as peer-reviewable evidence.
5. MiniLM embedder pre-bake. ai-memory serve attempts to
load the MiniLM sentence embedder on startup (semantic-tier
default). Without the model pre-downloaded, the load blocks on
hf-hub; combined with the Docker egress issue above, serve
never bound :9077 and all four containers stayed unhealthy.
Fix: Dockerfile.base layer pre-downloads MiniLM (91 MB) into
the embedder's hard-coded fallback path. Matches the
setup_node.sh:236-255 pattern used on DO.
6. State pollution between local rounds. The DO workflow
provisions a fresh VPC per run. Local Docker preserves volumes
across runs → scenario5-consolidate (S5's fixed namespace)
accumulated rows across r1 → r2, tripping a 500 from consolidate
on the 10th source-id on round 2. Fix in run-testbook.sh:
docker restart each node after deleting its SQLite files, so
every round starts from a pristine DB without tearing down the
mesh. The mesh itself stays up across all three cert rounds.
7. Harness ssh → docker exec dispatch. The sole change
needed to make the existing scenario scripts run against Docker
containers was adding a TOPOLOGY env var dispatch in
a2a_harness.py:
if TOPOLOGY == "local-docker":
cmd = ["docker", "exec", "-i", node_ip, "sh", "-c", remote_cmd]
else:
cmd = ["ssh", *SSH_OPTS, f"root@{node_ip}", remote_cmd]
In local-docker mode NODE<N>_IP carries the container name
(e.g. a2a-node-1). Every HTTP / curl construction is unchanged.
Zero scenario script changes required. Peer-review surface:
exactly 2 files touched (a2a_harness.py and
docker/run-testbook.sh), 40 lines net.
Matrix status (post-cert, 2026-04-24)¶
| off | tls | mtls | |
|---|---|---|---|
| ironclaw (DO) | ✅ v3r30 35/35 | ✅ v3r30 35/35 | ✅ v3r30 37/37 |
| hermes (DO) | ✅ v3r30 35/35 | ✅ v3r30 35/35 | ✅ v3r30 37/37 |
| openclaw (local-docker) | ✅ r3 35/35 | ✅ tls-r3 35/35 | ✅ mtls-r3 37/37 |
| mixed (DO) | ⏸ terraform topology | ⏸ | ⏸ |
OpenClaw full-spectrum cert (2026-04-24): tls + mtls on local
Docker closed with 3/3 × 2 modes. All 6 rounds overall_pass=true
with pristine volumes (docker compose down -v + fresh ephemeral CA
per round). S20 (mtls_happy_path) and S21 (mtls_anonymous_rejected)
are mtls-only by design — appended only under tls_mode=mtls, matching
ai-memory-ai2ai-gate PR #55.
Forward-looking work¶
-
Local-docker TLS + mTLS — DONE (2026-04-24). Ephemeral CA generation via
docker/gen-tls.sh+ per-node cert volume-mount +TLS_MODE-aware entrypoint +healthcheck.shwrapper landed and the fullopenclaw × {off,tls,mtls}matrix is green (3/3 per mode, evidence committed underruns/a2a-openclaw-v0.6.2-local-docker-*). -
Mixed-framework row — terraform topology work for a heterogeneous VPC that provisions
ai:alice@ironclaw+ai:bob@hermes+ai:charlie@openclawon the same mesh. Closes the bottom-right cell of the certification matrix. -
GitHub Actions self-hosted runner with ≥64 GB to make the local-docker matrix dispatchable from CI. Today it's operator- local (peer-reviewable + reproducible, but not continuously measured). Nice-to-have, not blocking v0.6.2 cert.
-
S23 oversize-payload product-side — the harness fix (
#56) works around the 1 MB payload via stdin. A separate question is whetherai-memory's/api/v1/memoriesshould actively reject 1 MB writes with a 413 and a clean JSON error. Tracked separately from cert.
How this page updates¶
docs/insights.md (this file) is curated by AI NHI after each
campaign window resolves — it's the human-readable story. The
machine-readable source of truth is
analysis/run-insights.json,
which drives the per-campaign evidence HTML at runs/<id>/index.html
via scripts/generate_run_html.sh.
New cert window → new narrative entry on this page → pages workflow redeploys → both this page and the per-run evidence pages refresh. No hand-waving. Every claim traces to an artifact.