KEK rotation

Rotate the key-encryption key (KEK) without downtime, even across a multi-replica cluster. Fleet-aware rotation is the headline feature in 1.4.x.

01 Why rotate

Most compliance regimes (SOC 2, ISO 27001, PCI DSS) require periodic rotation of long-lived encryption keys. A 12-month rotation is typical; 6 months for high-assurance environments. CertAutoPilot makes it cheap so you can do it more often without operational fear.

02 What rotation actually changes

Reminder: every secret field is encrypted with a per-field data-encryption key (DEK), and the DEK is wrapped by a versioned KEK. Rotation does four things:

  1. Generates KEK_v(N+1) and stores it alongside KEK_vN.
  2. Re-wraps every record's DEK with KEK_v(N+1) — the small wrapper, not the field ciphertext itself.
  3. Updates each record's kek_version field.
  4. When all records are at v(N+1), deletes KEK_vN.

Because the bulk ciphertext doesn't change, rotation is bounded by the number of records, not their size. Even a million-record installation finishes in minutes.
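The re-wrap step can be illustrated with a minimal sketch. It is not CertAutoPilot's implementation: the Record class, the rewrap function, and the toy XOR "wrap" are all hypothetical stand-ins (a real system would use a proper key-wrap algorithm such as AES Key Wrap). The point it shows is that only the small wrapped DEK and the kek_version change, while the field ciphertext is never touched.

```python
import hashlib
import hmac
import os
from dataclasses import dataclass


def toy_wrap(key: bytes, data: bytes) -> bytes:
    """TOY stand-in for a key-wrap primitive: XOR with an HMAC-derived
    keystream. NOT secure; it only keeps this sketch stdlib-only.
    Applying it twice with the same key returns the original bytes."""
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hmac.new(key, counter.to_bytes(4, "big"), hashlib.sha256).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


@dataclass
class Record:
    """Hypothetical record layout: one wrapped DEK plus the bulk ciphertext."""
    kek_version: int
    wrapped_dek: bytes
    field_ciphertext: bytes


def rewrap(record: Record, keks: dict[int, bytes], target: int) -> None:
    """Re-wrap the record's DEK under keks[target]; leave the ciphertext alone."""
    if record.kek_version == target:
        return  # already rotated, so re-running the job is harmless
    dek = toy_wrap(keks[record.kek_version], record.wrapped_dek)  # unwrap with old KEK
    record.wrapped_dek = toy_wrap(keks[target], dek)              # wrap with new KEK
    record.kek_version = target


# Usage: a large field costs the same to rotate as a small one.
keks = {1: os.urandom(32), 2: os.urandom(32)}
dek = os.urandom(32)
rec = Record(kek_version=1, wrapped_dek=toy_wrap(keks[1], dek),
             field_ciphertext=b"\x00" * 1_000_000)
rewrap(rec, keks, target=2)
```

Because rewrap skips records already at the target version, an interrupted run can be restarted without double-wrapping anything.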

03 Fleet-aware rotation

In a multi-replica cluster, every API and worker pod has the KEK in memory. Rotation must guarantee:

  • No pod ever encrypts under v(N+1) before every pod knows v(N+1).
  • No pod ever fails to decrypt because it doesn't have a KEK version some record references.

The scheduler runs a 4-phase rotation, with phases numbered 0 to 3. A phase transition happens only once every pod has acknowledged the previous phase via a heartbeat in MongoDB.

Phase         Pods know     Pods write under
0 (steady)    vN            vN
1 (announce)  vN, v(N+1)    vN
2 (re-wrap)   vN, v(N+1)    v(N+1)
3 (settle)    v(N+1)        v(N+1)
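The phase table and the acknowledgement gate can be modelled in a few lines. This is a hedged sketch rather than the scheduler's actual code: known_versions, write_version, and next_phase are hypothetical names, and the acks dictionary stands in for the per-pod heartbeat documents stored in MongoDB.

```python
FINAL_PHASE = 3


def known_versions(phase: int, n: int) -> set[int]:
    """KEK versions a pod holds during `phase`, where n is the old version."""
    if phase == 0:
        return {n}
    if phase in (1, 2):
        return {n, n + 1}
    return {n + 1}


def write_version(phase: int, n: int) -> int:
    """KEK version a pod wraps new DEKs under during `phase`."""
    return n if phase < 2 else n + 1


def next_phase(current: int, fleet: set[str], acks: dict[str, int]) -> int:
    """Advance by one phase only if every pod in the fleet has acknowledged
    `current`. `acks` maps pod name -> highest phase that pod acknowledged;
    a pod missing from `acks` counts as not having acknowledged anything."""
    if current >= FINAL_PHASE:
        return current
    if all(acks.get(pod, -1) >= current for pod in fleet):
        return current + 1
    return current


# Usage: one lagging worker holds the whole fleet in phase 1, so nobody
# starts writing under v(N+1) while a pod might not know it yet.
fleet = {"api-0", "api-1", "worker-0"}
acks = {"api-0": 1, "api-1": 1, "worker-0": 0}
phase = next_phase(1, fleet, acks)
```

The gate is what enforces the two guarantees above: writes move to v(N+1) only in phase 2, which cannot start until every pod has acknowledged phase 1 and therefore already knows v(N+1).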

04 Triggering a rotation

Rotation is initiated from the CLI on a host with KEK file access. Before you run it, every process in the fleet must be restarted with the new KEK loaded and CERTAUTOPILOT_ENCRYPTION_CURRENT_VERSION set to the target version; otherwise the rotate command's preflight refuses to proceed. Confirm readiness first, then rotate:

certautopilot kek verify --target=2
certautopilot kek rotate --to-version=2

Optional flags: --from-version=N (an explicit source version, for when CERTAUTOPILOT_ENCRYPTION_CURRENT_VERSION in your shell disagrees with the running service's), --batch-size, --concurrency.

Settings → KEK Versions in the UI is a read-only monitoring surface — it shows the version table, rotation history, in-progress phase counters, and fleet drift / process-vs-keystore mismatches. There is no "Rotate" button, no in-UI cancel, and no UI scheduling; every transition is operator-initiated via CLI.

Back the new KEK up immediately

Once rotation enters phase 2, the old KEK is still valid for decryption, but as soon as phase 3 runs, only the new KEK can decrypt anything. Lose the new KEK between those phases and you lose all secrets. Sync the new KEK to your secret store before triggering rotation, and verify the sync.
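One way to verify the sync is to read the backup back out of the secret store and compare digests with the local key file. The sketch below assumes both copies are available as files; sha256_file, backup_matches, and the paths in the usage comment are hypothetical, and how you fetch the restored copy depends on your secret store.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_file(path: Path) -> str:
    """Hex SHA-256 of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches(local_kek: Path, restored_copy: Path) -> bool:
    """True only if the restored backup is byte-identical to the local KEK.
    Digests are compared in constant time; the key bytes are never printed."""
    return hmac.compare_digest(sha256_file(local_kek), sha256_file(restored_copy))


# Usage (hypothetical paths): refuse to start rotation unless this holds.
# ok = backup_matches(Path("/etc/certautopilot/kek_v2.key"),
#                     Path("/tmp/restored_kek_v2.key"))
```

Comparing a copy that was actually fetched back from the store, rather than the file you uploaded, is what makes this a check of the backup and not just of the upload command.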

05 Cancel & reverse rotation

An in-progress rotation can be halted with certautopilot kek rotate --cancel. Running batches finish their current item, stop at the next boundary, and exit cleanly. Already-rewrapped envelopes stay on the new version and un-rewrapped envelopes stay on the old one. The database is left mid-rotation, which is safe to read from (both KEKs are still loaded) but should be resolved by either resuming or reversing.

To reverse a rotation that's already partially or fully complete, run a fresh rotation back to the previous version (you must still have both KEKs loaded):

certautopilot kek rotate --to-version=<old> --from-version=<new>

This re-wraps everything back. There is no dedicated kek rollback sub-command — reverse rotation is just another rotation in the opposite direction.

06 Scheduled rotation

Rotation is currently a deliberate, operator-driven action; there is no built-in scheduler. To rotate periodically, wrap certautopilot kek rotate --to-version=<next> in your existing job runner (cron, systemd timer, Kubernetes CronJob) on the host that holds KEK file access. Remember to bump CERTAUTOPILOT_ENCRYPTION_CURRENT_VERSION on every fleet pod first; otherwise the preflight will refuse. Each phase emits audit events the same way an interactive run does, so SIEM correlation and ownership-notification rules work identically.
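A job-runner wrapper might look like the sketch below, which runs the readiness check first and rotates only if it passes. The two command lines are the ones shown under "Triggering a rotation"; build_commands and run_rotation are hypothetical helper names, and the sketch assumes the CLI signals failure with a non-zero exit status, which this page does not state.

```python
import subprocess
import sys


def build_commands(target: int) -> list[list[str]]:
    """Readiness check first, then the rotation itself."""
    return [
        ["certautopilot", "kek", "verify", f"--target={target}"],
        ["certautopilot", "kek", "rotate", f"--to-version={target}"],
    ]


def run_rotation(target: int, runner=subprocess.run) -> int:
    """Run each command in order and stop at the first failure.
    ASSUMPTION: a non-zero exit status means the step failed.
    `runner` is injectable so the flow can be dry-run without the CLI."""
    for cmd in build_commands(target):
        result = runner(cmd)
        if result.returncode != 0:
            print(f"stopping: {' '.join(cmd)} exited {result.returncode}",
                  file=sys.stderr)
            return result.returncode
    return 0


# Usage from a cron job or CronJob entrypoint (the target version is supplied
# by the operator; the fleet must already be restarted with the new KEK):
# sys.exit(run_rotation(2))
```

Passing the commands as argument lists avoids shell quoting issues, and returning the failing exit status lets the job runner record the failure.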

07 HSM-backed KEKs

If your KEK lives in an HSM (PKCS#11), rotation generates the new key inside the HSM and never exports it. All wrap and unwrap operations happen inside the HSM, so performance is bounded by HSM throughput; expect rotation to take longer (5–60 minutes for typical inventories).