ai-memory · Zero-touch trust — CA-rooted federation identity at scale

Adding a node reconfigures nobody. A node proves who it is via attestation, a CA mints it a short-lived credential, and receivers verify that credential against a small trust bundle holding only the CA's key. No peer-to-peer .pub copy. No restart fan-out. The trust domain grows by trusting the CA — not by introducing every node to every other node.

This is the shape of SPIFFE/SPIRE (CA-rooted, attestation-issued, short-lived, auto-rotating) but first-party: it composes the Ed25519 sign/verify and canonical-CBOR primitives already in ai-memory. No new dependency — no rcgen, no openssl, no X.509.

Full operator + configuration reference: docs/federation-identity.md. Transport hardening beneath it (mTLS, X-API-Key, peer attestation): docs/federation.md.

The credential

What a node presents.

A FederationCredential is the node's Ed25519 public key bound to its agent-id and a short validity window, signed by a CA key (canonical-CBOR, the same encoder as link signing). It travels base64 next to X-Memory-Sig under x-memory-cred: v1=<base64>.

Field	Meaning
`subject_agent_id`	The node identity (SPIFFE-style path allowed)
`subject_pubkey`	The node's Ed25519 verifying key (32 bytes)
`issuer_id`	The CA / intermediate that minted it
`not_before` / `not_after`	Unix-seconds validity window (short TTL — default 1h)
`trust_domain`	Fleet namespace for multi-tenant isolation
`cred_version`	Wire/format version (`CRED_VERSION = 1`)

Negotiated, no partition: if a credential header is present and a trust bundle is configured, the receiver takes the credential path; otherwise it falls back to the legacy per-peer .pub path. An un-upgraded peer keeps working; an upgraded peer accepts both. Every phase ships independently and a mixed fleet stays safe.

Configuration

Every `AI_MEMORY_FED_*` knob.

Federation identity is configured entirely through the AI_MEMORY_FED_* env-var family:

Env var	Effect
`AI_MEMORY_FED_IDENTITY`	Overrides the node's federation identity. Highest precedence; blank is ignored. Default: `host:<hostname>`.
`AI_MEMORY_FED_TRUST_DOMAIN`	The trust domain a receiver's bundle is scoped to. Cross-domain creds are rejected.
`AI_MEMORY_FED_TRUST_BUNDLE_DIR`	Directory of trusted issuer verifying keys — the O(1) enrollment surface.
`AI_MEMORY_FED_CRED_PATH`	Path to this node's issued leaf credential (what it presents).
`AI_MEMORY_FED_CRED_CHAIN_PATH`	Path to the anchor-first intermediate chain (hierarchical trust).
`AI_MEMORY_FED_INVENTORY_PATH`	Path to the declarative GitOps inventory YAML.
`AI_MEMORY_FED_REQUIRE_PEER_ENROLLMENT`	Fail-closed gate. When `=1`, a receiver rejects any peer with no valid CA-signed credential — `401 peer_not_enrolled`. The switch that makes enrollment mandatory.
`AI_MEMORY_KEY_DIR`	Directory holding this node's signing keypair at `<key_dir>/<federation-identity>.{pub,priv}`, loaded by the resolved (slashed) identity.
`AI_MEMORY_FED_REQUIRE_SIG`	Receivers reject unsigned posts (default-on; inherited enforcement gate).

Compiled defaults (SSOT constants): leaf TTL 1h, intermediate TTL 1d, clock skew 30s, max chain depth 2, renewal interval 60s, renewal lead 15m.

GitOps

Declarative inventory.

A repo-reviewed YAML file is the source of truth for fleet membership, trust topology, and enforcement. Parsed strictly (deny_unknown_fields) so a typo is a load-time error, never a silently-weakened posture. Point the daemon at it with AI_MEMORY_FED_INVENTORY_PATH.

trust_domain: fleet.example
root_ca: root/ca
regions:
  - name: nyc
    intermediate_ca: region/nyc/ca   # optional — omit for single-tier
    nodes:
      - id: region/nyc/node-1        # SPIFFE-style
        attestor: mtls-cert          # mtls-cert | node-plugin
        cred_ttl: 1h
        renew_before: 15m            # must be < cred_ttl
        roles: [writer]
quorum:
  width: 2                           # W-of-N; >= 1
enforcement:
  require_sig: true                  # maps to AI_MEMORY_FED_REQUIRE_SIG

Operate

Reconciler, rollout & SLOs.

The reconciler is a pure diff of desired inventory vs. observed state → a partition-safe ReconcilePlan (strict-enforcement actions are emitted last, gated on observed sign-capability). The side-effecting Apply half — scripts/federation-rollout.sh — does health-gated, mTLS-verified binary swaps with automatic rollback; if both new and previous binaries fail health it emits a loud MANUAL INTERVENTION block rather than leaving the fleet dark.

Four Prometheus SLO series surface the trust path's health, refreshed every renewal tick:

Metric	SLO
`ai_memory_federation_cred_verify_total{result}`	verify-failure-rate = fail / (ok + fail)
`ai_memory_federation_inbound_cred_total{presence}`	signed-vs-unsigned ratio (climbs to 1.0 as peers upgrade)
`ai_memory_federation_cred_max_age_seconds`	max-cred-age — alert as it nears the leaf TTL
`ai_memory_federation_renewal_lag_seconds`	renewal-lag — alert when it exceeds the refresh interval

Every issue / renew / revoke is also recorded on the signed_events audit chain.

Platform

OS-agnostic by design.

The daemon and the entire federation-identity core are pure Rust with no platform-bound logic in the trust path. The credential format, issuer, trust bundle, chain verification, inventory parsing, renewal worker, and reconciler behave identically on Linux, Windows, and macOS — the CI matrix runs the full suite on all three on every change. A credential minted on a Linux root CA verifies on a Windows node and a macOS node with byte-identical results. A federated fleet can be heterogeneous across operating systems in one trust domain.

Support focus is tiered by where production enterprise fleets actually run — a priority ranking, not a capability limit:

Priority	Platform	Default shell	Notes
Primary	Linux (x86_64 / aarch64)	Bash	Reference enterprise target — systemd rollout, Unix key-mode enforcement, container / Kubernetes substrate.
Primary	Windows (x86_64)	PowerShell	First-class daemon + federation node. Key-directory hardening via NTFS ACLs (POSIX mode bits don't apply).
Tertiary	macOS (Apple Silicon / x86_64)	Zsh	Startup / small-enterprise niche — e.g. Mac Mini clusters. Fully functional node; Unix mode enforcement as on Linux.

The ai-memory binary is shell-agnostic; only the env-var-setting syntax differs per platform's default shell. The same AI_MEMORY_FED_* knob, set three ways:

# Linux (Bash) / macOS (Zsh)
export AI_MEMORY_FED_TRUST_DOMAIN="fleet.example"
export AI_MEMORY_FED_IDENTITY="region/nyc/node-1"

# Windows (PowerShell)
$env:AI_MEMORY_FED_TRUST_DOMAIN = "fleet.example"
$env:AI_MEMORY_FED_IDENTITY     = "region/nyc/node-1"

The only OS-specific differences are operational plumbing — the service supervisor (systemd / launchd / Windows Service or WSL2) and the key-directory permission mechanism (POSIX mode bits on Unix, NTFS ACLs on Windows). The trust model itself is identical everywhere.

Reproducible tooling

The first-party issuer.

All CA / node / credential material is minted by examples/fed_issue.rs — a standalone first-party cargo example that composes the Ed25519 + canonical-CBOR primitives already in ai-memory. It is never linked into the golden binary: there is no Cargo.toml change and the pinned sha256 is unchanged. Build + run it on demand — the first call compiles, the rest are cache hits. Every verb is idempotent, so a re-run yields a stable trust topology.

Verb	What it does
`mint-ca`	Generate the campaign CA keypair (`--key-dir`, `--issuer-id`).
`export-bundle`	Publish only the CA verifying key to the trust bundle dir — the sole anchor a receiver needs.
`gen-node`	Generate a node signing keypair (`--agent-id` may be a slashed federation identity).
`issue`	Mint a short-lived CA-signed credential binding the node's identity → its verifying key, scoped to the trust domain (`--ttl-secs` default 3600).

One footgun: issuer-id MUST be slash-free. TrustBundle::from_dir is non-recursive, so a slashed id would nest the published <issuer-id>.pub in a skipped sub-dir → empty bundle → UnknownIssuer. The tool's validate_issuer_id rejects it at the boundary; node subject-agent-ids, by contrast, may be slashed.

The push-based deploy/hive-1461/provision/45_zero_touch.sh drives these four verbs per peer: mint the CA once, then per peer generate a keypair, issue a credential, fan out keys/ + trust/ + credential to /etc/ai-memory (private key 0400, public 0444), and append the five daemon env vars. Agents and ctrl are pure mTLS API clients, not mesh members, and are skipped. Step-by-step: docs/zero-touch-quickstart.md.

Validated

Proven on a live fleet.

The hive-1461 reproducible baseline runs this arm as provisioning step 45 and the full-spectrum P3 suite exercises it over the real TLS+mTLS path against the live mesh. The committed canonical report is TOTAL=26 PASS=26 FAIL=0, including four dedicated zerotouch checks:

Check	Proves
`enrolled_write`	An enrolled peer writes a collective memory using only its CA-signed credential.
`enrolled_converge`	That memory converges on a federated peer via the CA credential — no operator-pushed pubkey.
`unenrolled_status`	A peer with valid api-key + mTLS but no enrollment is refused `401` on `/sync/since`.
`unenrolled_reason`	The refusal is `peer_not_enrolled` — the `AI_MEMORY_FED_REQUIRE_PEER_ENROLLMENT=1` fail-closed gate.

Both the environment and the results are peer-reproducible: make seed up provision validate test stands the fleet up and re-proves all 26 checks. The report ships at deploy/hive-1461/results/test-full-spectrum.{json,tsv}.

Zero-touch trust

O(N²) doesn't scale.

What a node presents.

Every `AI_MEMORY_FED_*` knob.

Declarative inventory.

Mint · rotate · revoke.

Reconciler, rollout & SLOs.

OS-agnostic by design.

The first-party issuer.

Proven on a live fleet.

Where to go.

Zero-touch trust

O(N²) doesn't scale.

What a node presents.

Every AI_MEMORY_FED_* knob.

Declarative inventory.

Mint · rotate · revoke.

Reconciler, rollout & SLOs.

OS-agnostic by design.

The first-party issuer.

Proven on a live fleet.

Where to go.

Every `AI_MEMORY_FED_*` knob.