Every test performed

The canonical list of every AI-to-AI integration test the a2a-gate runs against ai-memory. Two tiers: baseline probes gate scenario execution; scenarios exercise the full A2A surface.

Authoritative sources:

Baseline probes → scripts/setup_node.sh
Scenario runners → scripts/scenarios/
Per-scenario full plans → docs/testbook.md

If this page and the testbook disagree, the testbook (v3.0.0+) is the current contract; this page is the index / summary view. Every scenario is implemented in Python 3 (testbook v3.0.0 convention); the shared harness lives at scripts/a2a_harness.py.

1. Baseline probes (8)

Every agent droplet runs these probes before any scenario is allowed to execute. baseline_pass=true requires every gate-marked probe to be green. Every campaign is gated on the three-node baseline (3 agent nodes × 8 probes) plus the fourth, memory-only node.

Probe	What it exercises	Gates `baseline_pass`?	Failure indicates
F1 xAI Grok reachability	Direct HTTPS POST to `api.x.ai/v1/chat/completions`; expects `READY` literal reply	yes	LLM backend down or API key revoked
F2a HTTP substrate canary	Direct `POST /api/v1/memories` + `GET` on local `ai-memory serve`; write+read roundtrip	yes	ai-memory HTTP daemon not running or SQLite broken
F2b Agent-driven MCP canary	Framework LLM prompted to use `memory_store` via MCP stdio; deterministic retrieval verification	no (observation only — LLM-dependent)	Framework-MCP loop broken; baseline still passes if F2a + F5 pass
F3 Peer A2A canary	One node writes via MCP; other two nodes + node-4 aggregator must see it via HTTP	yes (separate step in workflow)	Federation fanout broken; S1b regression
F4 Mesh directional reachability	Every node runs `GET /health` + `POST /sync/push?dry_run=true` against every peer. Each node covers its N-1 outbound edges; the aggregator ANDs the results across all N nodes, which covers the full mesh of N*(N-1) directed edges	yes	VPC firewall/routing broken, `ai-memory serve` not listening, or quorum peers misconfigured
F5 ai-memory MCP stdio handshake	Spawns `ai-memory mcp --tier semantic` with the exact invocation each framework uses; sends MCP 2024-11-05 `initialize` + `tools/list`; verifies `memory_store`, `memory_recall`, `memory_list` in the response	yes	ai-memory binary missing, stdio protocol broken, or framework invocation mismatched
F6 TLS handshake (Tranche 2, tls_mode ≥ tls)	Verify `ai-memory serve` presents a server cert and completes a TLS 1.3 handshake on every peer edge	yes (when tls_mode ≥ tls)	Cert material missing on disk or rustls config rejected
F7 mTLS enforcement (Tranche 2, tls_mode=mtls)	Anonymous client (no client cert) MUST be rejected. Client with off-allowlist fingerprint MUST be rejected. Client with on-allowlist fingerprint MUST succeed	yes (when tls_mode=mtls)	`--mtls-allowlist` ignored or fingerprint verifier bypassed
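The F4 edge arithmetic can be checked with a few lines of Python. This is a minimal sketch, not the logic in scripts/setup_node.sh: the node names and the `directed_edges` and `mesh_ok` helpers are hypothetical and exist only to show how per-node results of N-1 edges combine into N*(N-1) directed edges.

```python
from itertools import permutations

def directed_edges(nodes):
    """Return every ordered (source, peer) pair: N*(N-1) edges for N nodes."""
    return list(permutations(nodes, 2))

def mesh_ok(results, nodes):
    """AND the per-edge probe results; a missing edge counts as a failure."""
    return all(results.get(edge, False) for edge in directed_edges(nodes))

# Hypothetical node names, used only for illustration.
nodes = ["node-1", "node-2", "node-3"]
edges = directed_edges(nodes)
print(len(edges))  # 3 * 2 = 6 directed edges

# One unreachable direction (node-3 -> node-1) fails the whole mesh.
results = {edge: True for edge in edges}
results[("node-3", "node-1")] = False
print(mesh_ok(results, nodes))
```

Treating a missing edge as a failure keeps the check conservative: a node that never reported cannot make the mesh look healthy.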

Per-probe implementation

All probes live in scripts/setup_node.sh, and each one emits a single line under functional_probes in /etc/ai-memory-a2a/baseline.json. Each node's baseline.json is copied back to the runner with scp and aggregated into runs/<campaign_id>/a2a-baseline.json, which is rendered on the campaign run pages.
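The aggregation step can be pictured as follows. This is a sketch under stated assumptions: the shape of functional_probes (a mapping from probe name to a boolean), the node names, and the `node_passes` and `aggregate` helpers are all hypothetical, and F3 is assumed to be folded into the same document even though the workflow runs it as a separate step.

```python
# Probes the table marks as gating; F2b is observation-only and is skipped.
GATING_PROBES = ("F1", "F2a", "F3", "F4", "F5")

def node_passes(baseline, gating=GATING_PROBES):
    """True when every gating probe in one node's baseline is green."""
    probes = baseline.get("functional_probes", {})
    return all(probes.get(name) is True for name in gating)

def aggregate(baselines):
    """Combine per-node baselines into one campaign-level summary."""
    per_node = {node: node_passes(doc) for node, doc in baselines.items()}
    # An empty campaign must not pass by default.
    return {"nodes": per_node, "baseline_pass": bool(per_node) and all(per_node.values())}

# Hypothetical per-node documents, as if parsed from each baseline.json.
baselines = {
    "node-1": {"functional_probes": {"F1": True, "F2a": True, "F2b": False,
                                     "F3": True, "F4": True, "F5": True}},
    "node-2": {"functional_probes": {"F1": True, "F2a": True, "F2b": True,
                                     "F3": True, "F4": False, "F5": True}},
}
summary = aggregate(baselines)
print(summary)
```

In this example node-1 passes despite a red F2b, because F2b is observation-only, while node-2 fails on F4 and takes the campaign-level baseline_pass down with it.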

F6 and F7 ship with Tranche 2 — TLS/mTLS.

2. Config + negative invariants (10)

These are attestations, not probes — each one reads a config file or runs a static check and records the result. All ten must be true for baseline_pass=true.

Invariant	What it asserts	Per-framework evidence
`framework_is_authentic`	Binary is the upstream one (not a same-named stub)	`readlink -f $(which <framework>)` contains framework name
`mcp_server_ai_memory_registered`	The `memory` MCP server is registered with the framework	IronClaw: `ironclaw mcp list` contains `memory`; Hermes: YAML `mcp_servers.memory`; OpenClaw: JSON `mcpServers.memory`
`llm_backend_is_xai_grok`	LLM provider is xAI Grok	IronClaw: `.env` has `LLM_BASE_URL=https://api.x.ai/v1`; Hermes: `XAI_API_KEY` in hermes.env; OpenClaw: `providers.xai` in JSON
`llm_is_default_provider`	xAI Grok is the default (not a second fallback)	Per-framework config default-provider field
`mcp_command_is_ai_memory`	MCP server command resolves to `ai-memory` binary	Grep config for `command: ai-memory` (or equivalent)
`agent_id_stamped`	Every write carries this node's `AI_MEMORY_AGENT_ID`	Env/config contains `AI_MEMORY_AGENT_ID=ai:<alice|bob|charlie>`
`federation_live`	Local `ai-memory serve` is listening + has ≥1 peer	`GET /api/v1/health` returns `{"status":"ok"}`
`ufw_disabled`	Ubuntu UFW is OFF (ship-gate lesson — blocks intra-VPC)	`ufw status` contains `inactive` or UFW not installed
`iptables_flushed`	iptables policies are ACCEPT on INPUT/OUTPUT/FORWARD	`iptables -S` shows 3 `-P … ACCEPT` lines
`dead_man_switch_scheduled`	`shutdown -P +480` scheduled at boot (8hr cap on spend)	`shutdown -c` dry-run or `/run/systemd/shutdown/scheduled` check
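The two firewall attestations reduce to simple text checks on command output. The sketch below assumes the output has already been captured as a string; the function names are hypothetical and this is not the code in scripts/setup_node.sh.

```python
def ufw_disabled(status_text):
    """True when UFW is absent (status_text is None) or reports 'inactive'."""
    return status_text is None or "inactive" in status_text

def iptables_policies_accept(rules_text):
    """True when INPUT, OUTPUT and FORWARD all have a default ACCEPT policy."""
    policies = {}
    for line in rules_text.splitlines():
        parts = line.split()
        # Policy lines in `iptables -S` output look like: -P <CHAIN> <TARGET>
        if len(parts) == 3 and parts[0] == "-P":
            policies[parts[1]] = parts[2]
    return all(policies.get(chain) == "ACCEPT"
               for chain in ("INPUT", "OUTPUT", "FORWARD"))

sample = "-P INPUT ACCEPT\n-P FORWARD ACCEPT\n-P OUTPUT ACCEPT\n"
print(iptables_policies_accept(sample))
print(ufw_disabled("Status: inactive"))
```

Keying on the three `-P` lines mirrors the evidence column: the attestation is about default policies, so additional rule lines are ignored here.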

Thesis-preserving negative invariants (5)

Invariant	What it asserts	Enforcement
`a2a_protocol_off`	No direct agent-to-agent RPC channel (ACP, sessions, etc.)	Per-framework config flag(s)
`sub_agent_or_sessions_spawn_off`	No parent/child agent hierarchy or session-spawn tool	Framework config + tool allowlist
`alternative_channels_off`	No Telegram / Discord / Slack / Moltbook / gateway / execution_backends	Per-framework disable blocks
`tool_allowlist_is_memory_only`	Only `memory_` (or namespaced `mcp_memory_memory_`) tools available to the agent	Hermes: `tool_allowlist` YAML list — entries may be bare or hermes-prefixed; IronClaw: only one MCP server registered (provisioning-control); OpenClaw: `toolAllowlist` JSON
`a2a_gate_profile_locked`	`a2a_gate_profile: shared-memory-only` tag present	Per-framework config set

Each invariant is phrased so that true means the channel is disabled and the thesis is preserved; any false fails the gate. baseline_pass requires all five to be true. Source: docs/baseline.md §6b.
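For the list-style allowlists (Hermes and OpenClaw), `tool_allowlist_is_memory_only` amounts to a prefix check over tool names. A minimal sketch, assuming the allowlist has already been loaded into a Python list; the helper name is hypothetical, and treating an empty list as a failure is an assumption of this sketch.

```python
# The two prefixes named in the table: bare and namespaced memory tools.
ALLOWED_PREFIXES = ("memory_", "mcp_memory_memory_")

def allowlist_is_memory_only(tools):
    """True when the list is non-empty and every tool name has an allowed prefix."""
    return bool(tools) and all(name.startswith(ALLOWED_PREFIXES) for name in tools)

print(allowlist_is_memory_only(["memory_store", "mcp_memory_memory_recall"]))
print(allowlist_is_memory_only(["memory_store", "shell_exec"]))
```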

3. Scenarios

3.1 Suite A — Core A2A (3 scenarios)

#	Name	What it proves	Primary primitives	Runner
S1	Per-agent write + read (MCP stdio)	Framework can accept prompt → choose `memory_store` tool → invoke via MCP stdio → memory lands with correct `metadata.agent_id`	`memory_store`, `memory_recall`	`1_write_read_mcp.py`
S1b	Per-agent write + read (HTTP direct)	Green-path counterpart: federation + substrate work independent of the MCP-stdio path	`memory_store`, `memory_list`	`1b_write_read_http.py`
S2	Shared-context handoff	Agent A writes a handoff memory; agent B picks it up within quorum settle; round-trips back to A	`memory_store`, `memory_recall`, `memory_list`	`2_handoff.py`
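A handoff check such as S2 needs to wait for a memory to appear on the peer within the settle window rather than assert immediately. The helper below is a hypothetical sketch of that polling pattern; `wait_until_visible`, the `fetch` callable, and the default timings are assumptions and are not taken from scripts/a2a_harness.py.

```python
import time

def wait_until_visible(fetch, memory_id, settle_seconds=5.0, interval=0.25,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll fetch(memory_id) until it returns a record or the settle window ends."""
    deadline = clock() + settle_seconds
    while True:
        record = fetch(memory_id)
        if record is not None:
            return record
        if clock() >= deadline:
            return None  # never became visible within the settle window
        sleep(interval)

# Usage with an in-memory stand-in for the peer's store.
peer_store = {"m1": {"id": "m1", "content": "handoff"}}
print(wait_until_visible(peer_store.get, "m1"))
```

Injecting `clock` and `sleep` keeps the helper testable without real delays.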

3.2 Suite B — A2A primitives (4 scenarios)

#	Name	What it proves	Primary primitives	Runner
S3	Targeted `memory_share`	Subset of memories lands on exactly the targeted peer (not broadcast)	`memory_share`	(deferred until v0.6.0.1 / #311)
S5	Consolidation + curation	`memory_consolidate` preserves `consolidated_from_agents` metadata	`memory_consolidate`	`5_consolidation.py`
S6	Contradiction detection	Contradicting memories produce a `contradicts` link visible to third agent	`memory_detect_contradiction`, `memory_link`	`6_contradiction.py`
S11	Link integrity	Linked memories returned together on peer query	`memory_link`, `memory_get`	`11_link_integrity.py`

3.3 Suite C — Mutation + lifecycle (3 scenarios)

#	Name	What it proves	Primary primitives	Runner
S9	Mutation round-trip	`memory_update` from agent A is visible with new content on agent B	`memory_update`	`9_mutation.py`
S10	Deletion propagation	`memory_delete` / `memory_forget` propagates to all peers	`memory_delete`, `memory_forget`	`10_deletion.py`
S16	Tier promotion	`short` → `mid` → `long` promotion visible to peers	`memory_promote`	`16_tier_promotion.py`

3.4 Suite D — Scope + governance (3 scenarios)

#	Name	What it proves	Primary primitives	Runner
S7	Scope visibility matrix	Each (scope, caller_scope) pair produces correct visibility	`as_agent` filter, `scope` metadata	(partial — Task 1.5 ongoing)
S8	Auto-tagging round-trip	Agent writes without tags; tags appear; recall-by-tag works	`memory_auto_tag`	(requires Ollama-backed droplets)
S12	Agent registration (Task 1.3)	`memory_agent_register` on A visible to B's `memory_agent_list`	`memory_agent_register`, `memory_agent_list`	`12_agent_register.py`

3.5 Suite E — Resilience + observability (5 scenarios)

#	Name	What it proves	Primary primitives	Runner
S13	Concurrent write contention	Two agents updating the same row converge to a consistent outcome	`memory_update`, `memory_store`	`13_concurrent_contention.py`
S14	Partition tolerance	Temporary peer loss → recovery → convergence within bounded time	federation sync	`14_partition_tolerance.py`
S15	Read-your-writes	Writing agent sees its own write immediately (no settle required)	`memory_store`, `memory_recall`	`15_read_your_writes.py`
S17	Stats consistency	`memory_stats` returns equal counts across peers post-settle	`memory_stats`	`17_stats_consistency.py`
S18	Semantic query expansion	Semantic recall surfaces memories written under synonyms, across writers	`memory_expand_query`, `memory_recall`	`18_query_expansion.py`

3.6 Suite F — Topology variants (2 scenarios)

#	Name	What it proves	Primary primitives	Runner
S4	Federation-aware concurrent writes (quorum burst)	Quorum preserved under N-agent concurrent write burst	federation quorum	`4_federation_burst.py`
S19	Same-node A2A	Two agents on ONE droplet share local ai-memory without federation	`memory_store`, `memory_recall` (same-node)	(planned, testbook v2.0.0)

4. Tranche 2 — TLS / mTLS (shipped v3.0.0)

Enabled via the tls_mode workflow input (tls or mtls). Adds the F6 and F7 baseline probes and the scenarios below. See ai-memory integration for the cert material layout.

#	Name	What it proves	Gates
F6	Server TLS handshake	Every peer presents a valid server cert; rustls completes TLS 1.3	baseline (when `tls_mode ≥ tls`)
F7	mTLS client-cert enforcement	Anonymous client must be rejected; off-allowlist fingerprint must be rejected; on-allowlist must succeed	baseline (when `tls_mode = mtls`)
S20	mTLS happy-path	Agent with valid client cert writes/reads across the federation	scenario
S21	Anonymous client rejected	POST `/api/v1/memories` without client cert → handshake rejected	scenario
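The three F7 outcomes (anonymous, off-allowlist, on-allowlist) can be illustrated with a small decision function. This is only a sketch: it assumes fingerprints are SHA-256 digests of the DER-encoded certificate in lowercase hex, which may not match the format `--mtls-allowlist` actually expects, and the real enforcement happens inside the TLS handshake rather than in application code like this.

```python
import hashlib

def fingerprint(cert_der):
    """SHA-256 of the certificate bytes, as lowercase hex (assumed format)."""
    return hashlib.sha256(cert_der).hexdigest()

def admit(cert_der, allowlist):
    """Reject anonymous clients (no cert) and any fingerprint not on the allowlist."""
    if cert_der is None:
        return False
    return fingerprint(cert_der) in allowlist

# Placeholder byte strings stand in for real DER certificates.
alice_cert = b"placeholder-cert-alice"
mallory_cert = b"placeholder-cert-mallory"
allowlist = {fingerprint(alice_cert)}

print(admit(alice_cert, allowlist))    # on-allowlist
print(admit(mallory_cert, allowlist))  # off-allowlist
print(admit(None, allowlist))          # anonymous
```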

5. Tranche 3 — Adversarial + cross-framework (shipped v3.0.0)

#	Name	What it proves	Category
S22	Identity spoofing	`X-Agent-Id` vs body.metadata.agent_id precedence honored; stored identity is one of the declared values, never a silent third	identity
S23	Malicious content fuzz	SQL-like, XSS, NUL bytes, oversize (~1 MB), unicode+RTL: no crash, no injection, oversize cleanly rejected or round-trip faithful, others byte-for-byte preserved	robustness
S24	Byzantine peer	Node-2 crafts a sync_push claiming sender=ai:alice; node-3 preserves declared `metadata.agent_id` (no silent re-attribution) or rejects	federation integrity
S25	Clock skew tolerance	Node-3 offset +300 s; alice's write from node-1 still converges to node-3 via vector clocks	time
S26	Mixed-framework campaign	IronClaw + Hermes on the same VPC; each framework's writes are readable by the other	cross-stack
S27	OpenClaw legacy regression	openclaw-only campaign regression lane (skipped unless `agent_group=openclaw`)	legacy
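The S23 pass criteria can be expressed as a small verdict function. The sketch below is illustrative: the payload strings, the `fuzz_payloads` and `verdict` helpers, and the exact 1,000,000-character oversize value are assumptions rather than the contents of the actual runner.

```python
def fuzz_payloads(oversize_chars=1_000_000):
    """Build one payload per S23 category (contents are illustrative)."""
    return {
        "sql_like": "'; DROP TABLE memories; --",
        "xss": "<script>alert(1)</script>",
        "nul_bytes": "before\x00after",
        "unicode_rtl": "caf\u00e9 \u202e\u05e9\u05dc\u05d5\u05dd",
        "oversize": "A" * oversize_chars,
    }

def preserved(sent, received):
    """Byte-for-byte comparison after UTF-8 encoding."""
    return sent.encode("utf-8") == received.encode("utf-8")

def verdict(name, sent, received, rejected):
    """Oversize may be rejected or preserved; every other payload must be preserved."""
    if name == "oversize" and rejected:
        return True
    return (not rejected) and preserved(sent, received)

payloads = fuzz_payloads()
xss = payloads["xss"]
print(verdict("xss", xss, xss, rejected=False))
print(verdict("xss", xss, xss.replace("<", "&lt;"), rejected=False))
```

The second call shows why the comparison is byte-for-byte: a store that silently escapes content would fail even though nothing crashed.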

6. Tranche 4 — Uncovered primitive coverage (shipped v3.0.0)

#	Name	What it proves	Primary primitives
S28	memory_search keyword A2A	Keyword search (distinct from `/recall` semantic) consistent across peers	`memory_search`
S29	memory_archive lifecycle	archive → archive_list → archive_restore → archive_stats round-trip	`memory_archive_*`
S30	memory_capabilities handshake	Protocol version + tool surface match across peers	`memory_capabilities`
S31	memory_gc quiescence	After forget+gc, non-deleted rows remain readable on all peers	`memory_gc`, `memory_forget`
S32	memory_inbox + memory_notify	Notify delivers to target's inbox; non-target cannot read	`memory_notify`, `memory_inbox`
S33	memory_subscribe pub/sub	subscribe → write → deliver → unsubscribe → no-deliver	`memory_subscribe`, `memory_unsubscribe`, `memory_list_subscriptions`
S34	memory_pending governance	`governance.write=approve` → pending → approve/reject visibility	`memory_pending_{list,approve,reject}`
S35	memory_namespace standards	Parent-chain rules merged into namespace standard	`memory_namespace_{get,set,clear}_standard`
S36	memory_session_start lifecycle	Session-tagged writes recall by session_id only	`memory_session_start`
S37	memory_get_links bidirectional	Both forward and reverse traversal resolve the pair	`memory_get_links`

7. Tranche 5 — HTTP-only endpoint coverage (shipped v3.0.0)

#	Name	What it proves	Endpoint
S38	export + import round-trip	Export one peer's namespace → import elsewhere → stats match	`/api/v1/export`, `/api/v1/import`
S39	sync/since delta	Post-partition delta returns exactly the missed rows	`/api/v1/sync/since`
S40	bulk write	500-row `/bulk` POST reaches every peer + aggregator	`/api/v1/memories/bulk`
S41	metrics Prometheus	Required counters present and monotonic post-activity	`/api/v1/metrics`
S42	namespaces enumeration	Namespace list (with counts) equivalent across peers	`/api/v1/namespaces`
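The S41 check ("present and monotonic") can be sketched as a comparison of two scrapes of the Prometheus text format. The parser below is deliberately simplified (it assumes no trailing timestamps), and the metric names in the example are hypothetical, since this page does not list the required counters.

```python
def parse_samples(text):
    """Parse Prometheus text exposition into {series: value}; assumes no timestamps."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples

def counters_ok(before, after, required):
    """Every required counter exists in both scrapes and never decreases."""
    a, b = parse_samples(before), parse_samples(after)
    return all(name in a and name in b and b[name] >= a[name] for name in required)

# Hypothetical metric names, for illustration only.
before = """# TYPE ai_memory_store_total counter
ai_memory_store_total 10
ai_memory_recall_total{tier="semantic"} 4
"""
after = """ai_memory_store_total 12
ai_memory_recall_total{tier="semantic"} 4
"""
required = ["ai_memory_store_total", 'ai_memory_recall_total{tier="semantic"}']
print(counters_ok(before, after, required))
```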

8. Dispatch matrix (what runs in a given campaign)

The default dispatch runs the v3.0.0 always-on set (35 scenarios):

Baseline probes F1, F2a, F2b, F3, F4, F5 on every agent node; + F6 when tls_mode ≥ tls; + F7 when tls_mode = mtls.
Scenarios: S1, S1b, S2, S4, S5, S6, S9, S10, S11, S12, S13, S14, S15, S16, S17, S18, S22, S23, S24, S25, S28, S29, S30, S31, S32, S33, S34, S35, S36, S37, S38, S39, S40, S41, S42.

Auto-appended conditionally by the workflow's Compute scenarios list step:

S20 when tls_mode ∈ {tls, mtls}
S21 when tls_mode = mtls
S26 when agent_group = mixed
S27 when agent_group = openclaw

agent_group selects the framework:

ironclaw (primary as of 2026-04-21)
hermes (primary)
openclaw (legacy — explicit dispatch only)
mixed — heterogeneous agents in one campaign (S26)
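The dispatch rules above can be sketched as a single function. This is not the workflow's Compute scenarios list step itself: `ALWAYS_ON`, `scenarios_for`, and the `"off"` value for tls_mode are hypothetical names chosen for illustration.

```python
# The 35 always-on scenarios listed for the default dispatch.
ALWAYS_ON = (
    ["S1", "S1b", "S2", "S4", "S5", "S6"]
    + [f"S{n}" for n in range(9, 19)]    # S9..S18
    + [f"S{n}" for n in range(22, 26)]   # S22..S25
    + [f"S{n}" for n in range(28, 43)]   # S28..S42
)

def scenarios_for(tls_mode="off", agent_group="ironclaw"):
    """Return the always-on set plus the conditionally appended scenarios."""
    selected = list(ALWAYS_ON)
    if tls_mode in ("tls", "mtls"):
        selected.append("S20")
    if tls_mode == "mtls":
        selected.append("S21")
    if agent_group == "mixed":
        selected.append("S26")
    if agent_group == "openclaw":
        selected.append("S27")
    return selected

print(len(scenarios_for()))                      # 35
print(scenarios_for("mtls", "mixed")[-3:])       # ['S20', 'S21', 'S26']
```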

9. Behavioral assessments (Tier 1–4, agent-side evidence)

Substrate scenarios above prove ai-memory functions correctly under load. They do not prove ai-memory is useful to working LLMs in the loop — that is the agent-side question, and it requires a different instrument.

Assessment	Tier	Subject	Methodology	Output
Phase 3 NHI playbook (A–J)	2–4	IronClaw / Hermes	4-arm × 4-scenario × n=3 = 48 runs, grounding-rate metrics, treatment-vs-control attribution	`runs/<id>/phase3-*.json` + `phase4-analysis.json`; rendered on Per-run NHI matrix
OpenClaw behavioral assessment	1–4	OpenClaw 2026.5.x on `xai/grok-4.3`	8-phase suite: bootstrap → qualitative → recall@k → cross-session durability → team chain → adversarial → tool-surface discovery → roadmap	`docs/nhi/openclaw-behavioral-v0.6.3.1.md` + `releases/v0.6.3.1/openclaw-behavioral-assessment.json`
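The Phase 3 run count follows directly from the design: 4 arms × 4 scenarios × 3 repeats. The sketch below enumerates that matrix; the arm and scenario labels are placeholders, because this page does not name them.

```python
from itertools import product

# Placeholder labels; the real arm and scenario names live in the playbook.
arms = [f"arm-{i}" for i in range(1, 5)]
scenarios = [f"scenario-{i}" for i in range(1, 5)]
repeats = range(1, 4)  # n=3

runs = [
    {"arm": arm, "scenario": scenario, "repeat": repeat}
    for arm, scenario, repeat in product(arms, scenarios, repeats)
]
print(len(runs))  # 4 * 4 * 3 = 48
```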

Both assessments are scope-tagged (scope=ironclaw, scope=hermes, scope=openclaw) and joined to the umbrella v0.6.3.1 release through the release=v0.6.3.1 linkage at the verdict layer, per Principle 6. They never collapse cross-framework data.

10. Read next

Baseline configuration — every invariant this gate defends
ai-memory integration (IronClaw + Hermes) — the authoritative config standard
Reproducing — run it yourself
Campaign runs — live evidence dashboard
OpenClaw v0.6.3.1 behavioral assessment — Tier 1–4 agent-side evidence