Fleet readiness

A KEK rotation only works if every running process has already loaded the new version. The fleet-readiness check is what guarantees that. Each process heartbeats into MongoDB every 30 s, reporting its loaded versions and its current version; kek verify --target=N reports NOT READY if any live process is missing the target key material.

01 Why a pre-rotation gate

A rotation reads each envelope, unwraps with the old KEK, then rewraps with the new KEK using an explicit target-version wrap (crypto.EncryptWithVersion) so the outer doc.kek_version field and the inner envelope's kv tag are always consistent. For that rewrap to succeed on every process that may run a batch, every process must have the target key material loaded. If even one lagging process (one missing version N+1) picked up a batch, it would fail at the wrap step, and the rotation would stall at that collection.
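The rewrap step can be illustrated with a toy model. This is a minimal sketch, not the product's implementation: XOR stands in for the real KEK wrap (it is not secure), and `encrypt_with_version`, `decrypt`, and `rewrap` are hypothetical Python stand-ins for the behaviour the text attributes to crypto.EncryptWithVersion. It shows why a process without the target key fails at the wrap step, and how an explicit target version keeps the two tags in sync.

```python
# Toy model only: XOR stands in for the real KEK wrap and is NOT secure.
def _xor(data: bytes, key: bytes) -> bytes:
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


def encrypt_with_version(keys: dict, version: int, dek: bytes) -> dict:
    """Wrap dek under an explicit KEK version; raises KeyError if not loaded."""
    kek = keys[version]
    return {"kv": version, "wrapped": _xor(dek, kek)}


def decrypt(keys: dict, envelope: dict) -> bytes:
    """Unwrap using whichever KEK version the envelope's kv tag names."""
    return _xor(envelope["wrapped"], keys[envelope["kv"]])


def rewrap(keys: dict, doc: dict, to_version: int) -> dict:
    """Unwrap with the old KEK, wrap with the target, keep both tags equal."""
    dek = decrypt(keys, doc["envelope"])
    envelope = encrypt_with_version(keys, to_version, dek)
    # The outer tag is copied from the inner one, so they cannot diverge.
    return {**doc, "envelope": envelope, "kek_version": envelope["kv"]}
```

A process holding only `{1: ...}` can still unwrap the document, but `rewrap(..., to_version=2)` raises `KeyError` at the wrap step, which is the failure the readiness gate is meant to prevent.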

The readiness gate also covers a subtler case: processes that never loaded the new key would continue writing new secrets with the old KEK version during rotation. Each such write produces a document tagged with the old version that the rotation filter ({kek_version: fromVersion}) later picks up — so no envelope is orphaned — but the operator still has to wait for those tail rewrites to settle. Keeping every node at least LOADED on the target avoids that churn; the per-node current is flipped atomically at rotation finalize (SwapActive in the keystore) and each node hot-reloads it on its next heartbeat tick without a restart.

02 The heartbeat mechanism

  • Every CertAutoPilot process (API / worker / scheduler) writes a record to the process_heartbeats collection every 30 s.
  • The record carries: hostname + PID (or pod name for K8s), loaded KEK versions, current version, provider, process role, last start time.
  • A TTL index expires records 120 s after last write. So a process that crashed 2 minutes ago drops off the roster.
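As a rough picture of what one heartbeat record might hold, here is a minimal sketch. The field names `loaded_kek_versions`, `current_kek_version`, `provider`, `last_heartbeat`, and `started_at` are borrowed from the cluster-instances API rows described later in this page; `instance_id`, `role`, and the `is_expired` helper are illustrative assumptions, and the real schema may differ.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List

HEARTBEAT_INTERVAL_S = 30  # from the text: one write every 30 s
TTL_S = 120                # from the text: expiry 120 s after the last write


@dataclass
class Heartbeat:
    instance_id: str                # assumed: "hostname:pid" or a pod name
    role: str                       # "api", "worker" or "scheduler"
    provider: str                   # e.g. "env" or "pkcs11"
    loaded_kek_versions: List[int]
    current_kek_version: int
    started_at: datetime
    last_heartbeat: datetime


def is_expired(hb: Heartbeat, now: datetime) -> bool:
    """Approximate the TTL index: True once the last write is older than TTL_S."""
    return now - hb.last_heartbeat > timedelta(seconds=TTL_S)
```

With a 30 s interval and a 120 s TTL, a healthy process refreshes its record about four times before it could expire, so only a process that has stopped writing drops off the roster.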

03 kek verify --target=N

sudo certautopilot kek verify --target=2
  • Reads the heartbeat collection.
  • Applies a 60-second freshness window: records older than 60 s are treated as stale, not as laggards. This protects against a restart in progress.
  • For each fresh record, checks two things: does loaded_versions contain the target, and is the reported provider the same as the local CLI's? current_version is NOT required to match the target, because the keystore's active version only flips at rotation finalize, and every live node then hot-reloads it within one heartbeat tick (~30 s).
  • Exit codes:
    • 0 — READY. Every fresh process has the target version loaded and reports the same provider as the local CLI.
    • 2 — NOT READY. One or more processes lag (missing the target version, or on a different provider). The laggard list is printed on stderr.
    • 1 — error (DB unreachable, invalid target).
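The decision described in the steps above can be sketched as a pure function. This is an illustration under stated assumptions, not the CLI's actual code: records are plain dicts using the field names from the cluster-instances API, `check_readiness` is a hypothetical name, and the behaviour for an empty roster (reported here as READY) is not specified by the text.

```python
from datetime import datetime, timedelta, timezone

FRESHNESS_S = 60  # from the text: older records are stale, not laggards
EXIT_READY, EXIT_ERROR, EXIT_NOT_READY = 0, 1, 2


def check_readiness(records, target, local_provider, now):
    """Return (exit_code, laggards) for a list of heartbeat dicts."""
    window = timedelta(seconds=FRESHNESS_S)
    # Stale records are ignored entirely; they never count as laggards.
    fresh = [r for r in records if now - r["last_heartbeat"] <= window]
    laggards = [
        r for r in fresh
        if target not in r["loaded_kek_versions"]  # missing target key
        or r["provider"] != local_provider         # provider mismatch
    ]
    # current_kek_version is deliberately not consulted here.
    # Assumption: an empty fresh roster is reported as READY.
    return (EXIT_NOT_READY if laggards else EXIT_READY), laggards
```

Exit code 1 is reserved for errors such as an unreachable database, which a caller would raise before this function is reached.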

04 Interpreting the laggard list

NOT READY: target=2

LAGGARDS:
  host=cap-api-1 pid=1234   loaded=[1]   current=1

Provision CERTAUTOPILOT_ENCRYPTION_ENV_KEK_V2 (or run
`kek pkcs11-init --version=2`) on the lagging host and restart it,
then retry `kek verify --target=2`.
  • loaded=[1] current=1: the process never loaded V2. The env var is missing or misspelled on that host; for HSM tokens, check that the key label exists in the token.
  • loaded=[1,2] current=1: NOT a laggard. V2 is loaded; current being V1 reflects the keystore's still-active V1, which is exactly the pre-rotation state. kek verify --target=2 passes on this row.
  • provider=env while the CLI reports provider=pkcs11 (or vice versa): the fleet is mid-provider-migration. Rotation refuses until every process matches; finish the migration first (see provider migration).
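For automation, the laggard rows can be pulled out of the captured stderr. The sketch below assumes the line layout is exactly the one in the sample output above (`host=… pid=… loaded=[…] current=…`); `parse_laggards` is a hypothetical helper, and real output may carry extra fields (for example a provider or a pod name) that this pattern does not handle.

```python
import re

# Pattern derived from the sample laggard line; an assumption, not a spec.
LAGGARD_RE = re.compile(
    r"host=(?P<host>\S+)\s+pid=(?P<pid>\d+)\s+"
    r"loaded=\[(?P<loaded>[\d,\s]*)\]\s+current=(?P<current>\d+)"
)


def parse_laggards(stderr_text: str) -> list:
    """Extract one dict per laggard row from kek verify's stderr text."""
    rows = []
    for m in LAGGARD_RE.finditer(stderr_text):
        loaded = [int(v) for v in m.group("loaded").split(",") if v.strip()]
        rows.append({
            "host": m.group("host"),
            "pid": int(m.group("pid")),
            "loaded": loaded,
            "current": int(m.group("current")),
        })
    return rows
```

A caller would typically run this only when the command exits with code 2, then restart or reprovision each listed host.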

05 Edge cases

--local

kek verify --local skips the fleet check entirely and only verifies the current CLI's ability to round-trip wrap/unwrap. Useful for confirming a single-host standalone has the new KEK loaded.

Newly-started pod not yet visible

A fresh process must heartbeat once before verify sees it. If you restart a pod at T=0 and run verify at T=5, it's not in the roster yet. Wait about 30 seconds (one heartbeat interval) and run verify again.

Long-lived job

Heartbeats come from the process, not from each job. A worker running a 10-minute distribution is still heartbeating in the background; verify reflects its current state.

Read-only replicas

If verify reads from a MongoDB secondary (read preference secondaryPreferred or similar), freshly written heartbeats may become visible slightly late. The 60-second freshness window absorbs normal replication lag.

06 Readiness during rotation

kek rotate --from-version=M --to-version=N runs the same readiness check internally as kek verify, then starts a background job. The per-doc UpdateOne filter ({kek_version: fromVersion}) makes the rotation tolerant to mid-rotation topology changes: a process that restarts or crashes never half-rotates a document (each rewrap is atomic across all encrypted fields on a doc), and stale processes writing new secrets during the rotation with the old version simply get picked up by subsequent batches — no straggler envelopes.

If a process comes back mid-rotation with a missing target key (operator typo on restart), its batches fail at the wrap step and the rotation's failed_records counter increments. The rotation record stays in in_progress status so you can see it; restoring the expected fleet state and letting the job queue retry the failing batches resolves it. Re-running kek rotate from scratch is also safe (idempotent), because all already-rotated docs are filtered out by their kek_version tag.
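The idempotency argument can be checked with a small in-memory model. This is a sketch only: a list of dicts stands in for the collection, the `if` guard mimics the per-doc {kek_version: fromVersion} filter, and `rotate_pass` is a hypothetical name rather than anything in the product.

```python
def rotate_pass(docs, from_version, to_version, batch_size=2):
    """One rotation pass over docs; returns how many were rewrapped."""
    rotated = 0
    pending = [d for d in docs if d["kek_version"] == from_version]
    for start in range(0, len(pending), batch_size):
        for doc in pending[start:start + batch_size]:
            # Mirrors the update filter: apply only while the tag is still
            # from_version, so an already-rotated doc is never touched twice.
            if doc["kek_version"] == from_version:
                doc["kek_version"] = to_version
                rotated += 1
    return rotated


docs = [{"_id": 1, "kek_version": 1}, {"_id": 2, "kek_version": 1}]
first = rotate_pass(docs, 1, 2)    # rewraps both documents
second = rotate_pass(docs, 1, 2)   # nothing left to match the filter
docs.append({"_id": 3, "kek_version": 1})  # late write by a stale process
third = rotate_pass(docs, 1, 2)    # the straggler is picked up
```

Because selection depends only on the version tag, a re-run after a failure resumes from whatever is still tagged with the old version.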

Post-rotation: every live node's heartbeat tick (~30 s) detects the SwapActive in the keystore and hot-reloads its in-memory current KEK. kek remove --version=OLD includes a built-in safety check that refuses while any live node still reports OLD as current — combined with the auto-reload, it becomes observably safe to retire the old version as soon as the heartbeat convergence window passes.
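The safety condition for retiring the old version can be expressed as a short check. This sketch assumes that "live" means a heartbeat within the same 60-second freshness window that kek verify uses, which the text does not state explicitly; `blockers_for_removal` is a hypothetical helper and the field names follow the cluster-instances API rows.

```python
from datetime import datetime, timedelta, timezone


def blockers_for_removal(records, old_version, now, live_window_s=60):
    """Return live nodes that still report old_version as their current KEK."""
    window = timedelta(seconds=live_window_s)
    live = [r for r in records if now - r["last_heartbeat"] <= window]
    return [r for r in live if r["current_kek_version"] == old_version]
```

An empty result corresponds to the state in which kek remove --version=OLD would be expected to proceed; any entry names a node that has not yet hot-reloaded the new current version.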

07 Programmatic access

The heartbeat roster is exposed via the cluster-instances API (GET /api/v1/settings/cluster-instances, admin role) and rendered on the Settings → Cluster page. Each row carries loaded_kek_versions, current_kek_version, provider, last_heartbeat, and started_at. The Settings → KEK versions page adds a fleet-readiness summary (N/N instances on v<active>). For alerting, poll the API and diff against the expected target version; a future release will add a dedicated Prometheus gauge for laggard count.
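A polling check for alerting might look like the following. Several details are assumptions rather than documented behaviour: bearer-token authentication, a response body that is a JSON array of instance rows, and the helper names `fetch_instances` and `laggard_count`. Only the endpoint path and the `loaded_kek_versions` field come from the description above.

```python
import json
import urllib.request


def fetch_instances(base_url: str, token: str) -> list:
    """GET the cluster-instances roster (auth scheme and shape are assumed)."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/api/v1/settings/cluster-instances",
        headers={"Authorization": "Bearer " + token},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def laggard_count(rows: list, target: int) -> int:
    """Count instances whose loaded_kek_versions lacks the target version."""
    return sum(
        1 for row in rows
        if target not in row.get("loaded_kek_versions", [])
    )
```

An alerting job could call `laggard_count(fetch_instances(url, token), target)` on a schedule and fire when the result is non-zero ahead of a planned rotation.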

08 Troubleshooting

"Ghost" laggard that isn't actually running

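The heartbeat TTL is 120 s, so a process that crashed should disappear from the roster within about 2 minutes (MongoDB's TTL monitor runs roughly once a minute, so expiry can lag the TTL slightly). If you still see one after that, check MongoDB cluster health: replication between the primary and the secondaries may be stalling.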

Verify always reports one process lagging

This is often caused by a stuck init container or a one-off CronJob that starts with an older image. In K8s: kubectl get pods -L certautopilot.version --all-namespaces.