Fleet readiness
A KEK rotation only works if every running process has
already loaded the new version, and the fleet-readiness check is
what enforces that. Each process heartbeats into MongoDB every
30 s, reporting its loaded versions and current version;
kek verify --target=N refuses to report READY if any
live process is missing the target key material.
01 Why a pre-rotation gate
A rotation reads each envelope, unwraps with the old KEK, then
rewraps with the new KEK using an explicit target-version wrap
(crypto.EncryptWithVersion) so the outer
doc.kek_version field and the inner envelope's
kv tag are always consistent. For that rewrap to
succeed on every process that may run a batch, every process
must have the target key material loaded. If even one lagging
process (one missing V(N+1)) picked up a batch, it would
fail at the wrap step, and the rotation would stall at
that collection.
The readiness gate also covers a subtler case: processes that
never loaded the new key would continue writing new secrets
with the old KEK version during rotation. Each such
write produces a document tagged with the old version that the
rotation filter ({kek_version: fromVersion}) later
picks up — so no envelope is orphaned — but the operator still
has to wait for those tail rewrites to settle. Keeping every
node at least LOADED on the target avoids that churn; the
per-node current is flipped atomically at rotation
finalize (SwapActive in the keystore) and each node hot-reloads
it on its next heartbeat tick without a restart.
02 The heartbeat mechanism
- Every CertAutoPilot process (API / worker / scheduler) writes a record to the process_heartbeats collection every 30 s.
- The record carries: hostname + PID (or pod name for K8s), loaded KEK versions, current version, provider, process role, last start time.
- A TTL index expires records 120 s after last write. So a process that crashed 2 minutes ago drops off the roster.
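
As a rough illustration of what one heartbeat record might look like, here is a minimal Python sketch. The class name, the field names (borrowed from the cluster-instances API rows described later) and the `instance_id` format are assumptions for illustration, not the product's actual schema; only the 30 s interval and the 120 s TTL come from the text above.

```python
import os
import socket
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

HEARTBEAT_INTERVAL_S = 30  # from the text: one write every 30 s
TTL_S = 120                # from the text: TTL index expires 120 s after last write


@dataclass
class Heartbeat:
    # Field names are illustrative assumptions, not the real document schema.
    instance_id: str              # hostname + PID (a pod name would go here on K8s)
    role: str                     # "api", "worker" or "scheduler"
    provider: str                 # e.g. "env" or "pkcs11"
    loaded_kek_versions: list
    current_kek_version: int
    started_at: datetime
    last_heartbeat: datetime


def build_heartbeat(role, provider, loaded, current, started_at, now=None):
    """Assemble the record a process would write on each heartbeat tick."""
    now = now or datetime.now(timezone.utc)
    instance_id = f"{socket.gethostname()}:{os.getpid()}"
    return Heartbeat(instance_id, role, provider, sorted(loaded),
                     current, started_at, now)


def missed_ticks_before_expiry():
    """How many heartbeat intervals fit inside the TTL."""
    return TTL_S // HEARTBEAT_INTERVAL_S
```

With these numbers a process has to miss four consecutive ticks before its record becomes eligible for TTL expiry; `asdict(hb)` gives a plain dictionary suitable for an upsert keyed on the instance identifier.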
03 kek verify --target=N
sudo certautopilot kek verify --target=2
- Reads the heartbeat collection.
- Applies a 60-second freshness window — records older than 60 s are treated as stale (not as laggard). This protects against a restart in progress.
- For each fresh record, checks: does loaded_versions contain target? Is the reported provider the same as the local CLI's? current_version is NOT required to match the target: the keystore's active version only flips at rotation finalize, and every live node then hot-reloads it within one heartbeat tick (~30 s).
- Exit codes:
  - 0: READY. Every fresh process has the target version loaded and reports the same provider as the local CLI.
  - 2: NOT READY. One or more processes lag (missing the target version, or on a different provider). The laggard list is printed on stderr.
  - 1: error (DB unreachable, invalid target).
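
The decision rule above can be sketched in a few lines of Python. This is an illustrative model, not the CLI's implementation: the `Record` type and its field names are assumptions, and the sketch returns READY for an empty roster, which the real command may treat differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

FRESHNESS_WINDOW = timedelta(seconds=60)  # from the text
EXIT_READY, EXIT_ERROR, EXIT_NOT_READY = 0, 1, 2  # exit codes from the text


@dataclass
class Record:
    # Hypothetical in-memory view of one heartbeat record.
    host: str
    provider: str
    loaded_versions: list
    current_version: int
    last_heartbeat: datetime


def verify(records, target, local_provider, now):
    """Return (exit_code, laggards) for the given roster snapshot."""
    # Records older than the freshness window are stale and ignored,
    # not counted as laggards.
    fresh = [r for r in records if now - r.last_heartbeat <= FRESHNESS_WINDOW]
    # A laggard is missing the target version or reports another provider.
    # current_version is deliberately not compared with the target.
    laggards = [
        r for r in fresh
        if target not in r.loaded_versions or r.provider != local_provider
    ]
    return (EXIT_NOT_READY if laggards else EXIT_READY), laggards
```

A row with `loaded_versions=[1, 2]` and `current_version=1` passes for `target=2`, matching the rule that only the loaded set and the provider are checked.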
04 Interpreting the laggard list
NOT READY: target=2
LAGGARDS:
host=cap-api-1 pid=1234 loaded=[1] current=1
Provision CERTAUTOPILOT_ENCRYPTION_ENV_KEK_V2 (or run
`kek pkcs11-init --version=2`) on the lagging host and restart it,
then retry `kek verify --target=2`.
- loaded=[1] current=1: the process never loaded V2. The env var is missing or mistyped on that host; for HSM tokens, check that the key label exists in the token.
- loaded=[1,2] current=1: NOT a laggard. V2 is loaded; current being V1 reflects the keystore's still-active V1, which is exactly the pre-rotation state. kek verify --target=2 passes on this row.
- provider=env while the CLI reports provider=pkcs11 (or vice versa): the fleet is mid-provider-migration. Rotation refuses until every process matches; finish the migration first (see provider migration).
05 Edge cases
--local
kek verify --local skips the fleet check entirely and only verifies the current CLI's ability to round-trip wrap/unwrap. Useful for confirming a single-host standalone has the new KEK loaded.
Newly-started pod not yet visible
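A fresh process must heartbeat once before verify sees it. If you restart a pod at T=0 and run verify at T=5, the pod may not be in the roster yet, so verify cannot check it. Wait at least one heartbeat interval (30 seconds) after the restart, then run verify again.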
Long-lived job
Heartbeats come from the process, not from each job. A worker running a 10-minute distribution is still heartbeating in the background; verify reflects its current state.
Read-only replicas
If the roster is read from a MongoDB secondary (secondaryPreferred or similar), the heartbeat records it returns may lag slightly behind the primary. The 60-second freshness window absorbs normal replication lag.
06 Readiness during rotation
kek rotate --from-version=M --to-version=N runs the
same readiness check internally as kek verify, then
starts a background job. The per-doc UpdateOne filter
({kek_version: fromVersion}) makes the rotation
tolerant to mid-rotation topology changes: a process that
restarts or crashes never half-rotates a document (each rewrap
is atomic across all encrypted fields on a doc), and documents
that stale processes write with the old version during the
rotation are simply picked up by subsequent batches,
leaving no straggler envelopes.
If a process comes back mid-rotation with a missing target key
(operator typo on restart), its batches fail at the wrap
step and the rotation's failed_records counter
increments. The rotation record stays in in_progress
status so you can see it; restoring the expected fleet state
and letting the job queue retry the failing batches resolves it.
Re-running kek rotate from scratch is also safe
(idempotent) — all already-rotated docs are filtered out by
their kek_version tag.
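
To see why the version filter makes re-runs safe, consider this small in-memory model. It is only a sketch: the document layout (`kek_version`, `envelope.kv`, `envelope.ct`) follows the field names used on this page, `rotate_batch` is a hypothetical helper, and the real rewrap (unwrap with the old KEK, wrap with the new one) is reduced to a tag change.

```python
def rotate_batch(docs, from_version, to_version):
    """Rewrap every doc still tagged with from_version; return how many changed."""
    rotated = 0
    for doc in docs:
        # Mirrors the per-doc filter {kek_version: fromVersion}:
        # anything already on another version is skipped.
        if doc["kek_version"] != from_version:
            continue
        # Stand-in for the real rewrap. The inner kv tag and the outer
        # kek_version field are updated together so they stay consistent.
        doc["envelope"] = {"kv": to_version, "ct": doc["envelope"]["ct"]}
        doc["kek_version"] = to_version
        rotated += 1
    return rotated
```

Running `rotate_batch` twice over the same documents changes nothing the second time, and a document written later with the old version is picked up by the next pass, which is the behaviour described above.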
Post-rotation: every live node's heartbeat tick (~30 s) detects
the SwapActive in the keystore and hot-reloads its in-memory
current KEK. kek remove --version=OLD includes a
built-in safety check that refuses while any live node still
reports OLD as current — combined with the auto-reload, it
becomes observably safe to retire the old version as soon as
the heartbeat convergence window passes.
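
The removal guard can be pictured as a scan of the live roster. The sketch below is an assumption-laden model, not the code behind kek remove: the row keys `host` and `current_version` are illustrative, and the caller is assumed to pass only live (fresh) rows.

```python
def can_remove(live_rows, old_version):
    """Return (ok, blockers): ok is False while any live node is still on old_version."""
    blockers = [
        row["host"] for row in live_rows
        if row["current_version"] == old_version
    ]
    return (not blockers), blockers
```

Once every node has hot-reloaded the new active version, `blockers` is empty and retiring the old version is allowed under this model.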
07 Programmatic access
The heartbeat roster is exposed via the cluster-instances API
(GET /api/v1/settings/cluster-instances, admin role)
and rendered on the Settings → Cluster page. Each row carries loaded_kek_versions,
current_kek_version, provider, last_heartbeat,
and started_at. The Settings → KEK versions
page adds a fleet-readiness summary (N/N instances on v<active>).
For alerting, poll the API and diff against the expected target
version; a future release will add a dedicated Prometheus gauge
for laggard count.
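
A polling check along these lines could feed an alert. Treat it as a sketch under assumptions: the endpoint path and the row field names come from this page, but the bearer-token header, the response being a bare JSON array of rows, and the helper names are guesses that should be checked against the actual API.

```python
import json
import urllib.request


def fetch_instances(base_url, token):
    """GET the cluster-instances roster (auth scheme and response shape are assumed)."""
    req = urllib.request.Request(
        f"{base_url}/api/v1/settings/cluster-instances",
        headers={"Authorization": f"Bearer {token}"},  # assumption
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def find_laggards(instances, target, expected_provider=None):
    """Rows missing the target version, or on an unexpected provider."""
    laggards = []
    for row in instances:
        missing = target not in row.get("loaded_kek_versions", [])
        wrong_provider = (
            expected_provider is not None
            and row.get("provider") != expected_provider
        )
        if missing or wrong_provider:
            laggards.append(row)
    return laggards
```

An alerting job would call `fetch_instances` on a schedule and fire when `find_laggards` returns a non-empty list for the expected target version.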
08 Troubleshooting
"Ghost" laggard that isn't actually running
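The heartbeat TTL is 120 s, so a process that crashed should disappear within about 2 minutes (MongoDB's TTL monitor runs roughly once a minute, so allow a little extra). If a record lingers well beyond that, check MongoDB cluster health: TTL deletes run on the primary and replicate to secondaries, so a struggling primary or replication lag can keep an expired record visible.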
Verify always reports one process lagging
Often a stuck init container or a one-off CronJob that starts with an older image. In K8s: kubectl get pods -L certautopilot.version --all-namespaces.