High availability

Replicate every layer: a 3-member MongoDB replica set, multi-replica API, leader-elected scheduler, and rolling upgrades that never drop in-flight ACME orders.

01 Availability targets

  • Survive a single zone failure with no data loss.
  • Survive a rolling restart of any one workload type with zero failed renewal jobs.
  • API availability: 99.95% monthly. Background jobs are eventually consistent within 5 minutes.
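To make the API target concrete, the 99.95% figure can be turned into a monthly downtime budget. The sketch below assumes a 30-day month; the function name is illustrative and not part of CertAutoPilot.

```python
def downtime_budget_minutes(availability_pct: float, days: int = 30) -> float:
    """Minutes of downtime allowed by an availability target over `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - availability_pct / 100)


budget = downtime_budget_minutes(99.95)
print(f"{budget:.1f} minutes per 30-day month")  # → 21.6 minutes per 30-day month
```

A budget of roughly 21.6 minutes per month leaves little room for manual failover, which is why each layer below is replicated.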

02 MongoDB replica set

Run three voting members, one per availability zone. Use a managed offering (Atlas, or AWS DocumentDB with the replica-set protocol) or self-host with the Mongo Operator. The connection string lists all members.

CAP_MONGO_URI="mongodb://m0,m1,m2/certautopilot?replicaSet=rs0&readPreference=primaryPreferred&w=majority"

CertAutoPilot writes use w=majority. With readPreference=primaryPreferred, reads go to the primary and fall back to a secondary only when no primary is available. Job claims run in a transaction, so a worker that crashes mid-claim does not lose the job.
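A pre-deploy check can catch a connection string that silently weakens these guarantees. The following is a minimal sketch using only the Python standard library; check_mongo_uri is a hypothetical helper, not something CertAutoPilot ships, and it inspects only the three properties described above.

```python
from urllib.parse import parse_qs, urlsplit


def check_mongo_uri(uri: str) -> list[str]:
    """Return a list of HA problems found in a MongoDB connection string."""
    parts = urlsplit(uri)
    # Drop optional "user:pass@" credentials, then split the seed list.
    hosts = [h for h in parts.netloc.rsplit("@", 1)[-1].split(",") if h]
    # parse_qs returns lists; keep the last value given for each option.
    opts = {k: v[-1] for k, v in parse_qs(parts.query).items()}

    problems = []
    if len(hosts) < 3:
        problems.append(f"expected 3 members in the seed list, found {len(hosts)}")
    if "replicaSet" not in opts:
        problems.append("replicaSet option is missing")
    if opts.get("w") != "majority":
        problems.append("w=majority is not set")
    return problems


uri = "mongodb://m0,m1,m2/certautopilot?replicaSet=rs0&readPreference=primaryPreferred&w=majority"
print(check_mongo_uri(uri))  # → []
```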

03 API replicas

Run at least three replicas. Point Ingress or load balancer health checks at GET /healthz. GET /readyz is the deep readiness check: it verifies that MongoDB and the KEK are reachable.
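The difference between the two endpoints can be sketched as two small functions. This is an illustration of the shallow-versus-deep pattern only: the status codes, the response shape, and the check names "mongo" and "kek" are assumptions, not CertAutoPilot's actual responses.

```python
from typing import Callable


def healthz() -> tuple[int, str]:
    """Shallow liveness: the process is up and can answer HTTP."""
    return 200, "ok"


def readyz(checks: dict[str, Callable[[], bool]]) -> tuple[int, dict[str, bool]]:
    """Deep readiness: every dependency check must pass, else report 503."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises counts as a failed dependency.
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, results


# Hypothetical probes: MongoDB answers, the KEK backend does not.
status, detail = readyz({"mongo": lambda: True, "kek": lambda: False})
print(status, detail)  # → 503 {'mongo': True, 'kek': False}
```

Keeping /healthz shallow means a MongoDB outage takes pods out of rotation through /readyz without also triggering restarts.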

04 Worker replicas

Workers scale horizontally. They claim jobs from a MongoDB collection, and only one worker holds the claim on a given job at a time. The HPA target metric is queue depth; the chart ships a custom metrics adapter for it.
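For capacity planning, the scaling decision can be approximated by hand. The sketch below follows the average-value form of the Kubernetes HPA calculation (total metric divided by the per-pod target, rounded up, then clamped). The target of 10 queued jobs per worker and the replica bounds are assumptions for illustration, and the HPA's tolerance and stabilization behaviour are not modelled.

```python
import math


def desired_workers(queue_depth: int, target_per_worker: int,
                    min_replicas: int = 2, max_replicas: int = 20) -> int:
    """Approximate the replica count an HPA would request for a queue depth."""
    if target_per_worker <= 0:
        raise ValueError("target_per_worker must be positive")
    raw = math.ceil(queue_depth / target_per_worker)
    # Clamp to the configured bounds, as the HPA does.
    return max(min_replicas, min(max_replicas, raw))


print(desired_workers(95, 10))    # → 10
print(desired_workers(0, 10))     # → 2  (never below min_replicas)
print(desired_workers(1000, 10))  # → 20 (capped at max_replicas)
```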

05 Scheduler & leader election

Run two scheduler pods. They acquire a lease in MongoDB; only the leader enqueues time-driven work (renewal windows, drift scans, KEK rotation steps). The standby promotes within 30 seconds if the leader stops renewing the lease.
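The lease mechanics can be illustrated with an in-memory stand-in for the lease document. This is a sketch of the general pattern, not CertAutoPilot's implementation: the class and field names are hypothetical, the 30-second TTL is chosen to match the failover window above, and a real store would perform the compare-and-set as a single atomic conditional update.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    holder: Optional[str] = None
    expires_at: float = 0.0


class LeaseStore:
    """In-memory stand-in for a single lease document."""

    def __init__(self, ttl: float = 30.0):
        self.ttl = ttl
        self.lease = Lease()

    def try_acquire(self, candidate: str, now: float) -> bool:
        """Acquire or renew: succeeds if the lease is free, expired, or already ours."""
        lease = self.lease
        if lease.holder in (None, candidate) or lease.expires_at <= now:
            self.lease = Lease(candidate, now + self.ttl)
            return True
        return False


store = LeaseStore()
print(store.try_acquire("sched-a", now=0))   # → True  (becomes leader)
print(store.try_acquire("sched-b", now=10))  # → False (lease still held)
print(store.try_acquire("sched-a", now=20))  # → True  (renewal, now valid until 50)
print(store.try_acquire("sched-b", now=50))  # → True  (leader stopped renewing)
```

Only the pod whose latest try_acquire call returned True should enqueue time-driven work.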

Why not 3 schedulers?

Two is enough. The lease itself is the correctness boundary; more replicas buy only marginal failover speed and add cost.

06 Rolling upgrade

  1. Migrations run as a Helm pre-upgrade Job.
  2. API rolls one pod at a time (PDB caps unavailable to 1).
  3. Worker rolls — in-flight jobs that lose their pod are re-claimed by a peer after the lease expires (default 60s).
  4. Scheduler rolls last; the leader steps down so the new pod can pick up the lease.
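Step 3 relies on job leases expiring. The following in-memory sketch shows how a peer can re-claim a job after its original worker disappears. The Job fields and claim_next are hypothetical names; the 60-second lease matches the default quoted above, and a real implementation would do the check-and-claim atomically in the database.

```python
from dataclasses import dataclass
from typing import Optional

LEASE_SECONDS = 60.0  # default job lease from step 3


@dataclass
class Job:
    job_id: str
    owner: Optional[str] = None
    lease_until: float = 0.0
    done: bool = False


def claim_next(jobs: list[Job], worker: str, now: float) -> Optional[Job]:
    """Claim the first unfinished job that is unowned or whose lease has expired."""
    for job in jobs:
        if job.done:
            continue
        if job.owner is None or job.lease_until <= now:
            job.owner = worker
            job.lease_until = now + LEASE_SECONDS
            return job
    return None


jobs = [Job("renew-1")]
first = claim_next(jobs, "worker-a", now=0)    # worker-a claims the job
blocked = claim_next(jobs, "worker-b", now=30)  # None: lease is still live
# worker-a's pod is replaced during the roll and never finishes the job.
retry = claim_next(jobs, "worker-b", now=60)    # lease expired, worker-b takes over
print(first.job_id, blocked, retry.owner)  # → renew-1 None worker-b
```

Because a job may run twice under this scheme, job handlers need to be safe to retry.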

07 Backup & DR

  • At least daily, snapshot MongoDB at the storage layer or run mongodump. Store the backups offsite.
  • Back up the KEK separately. Without it the snapshot is useless.
  • Audit log forwarding to SIEM is your second line of defence — even if the primary site is unrecoverable, the audit trail of operations survives.

08 DR test

Quarterly: restore the most recent snapshot to a clean cluster, supply the KEK, and run certautopilot audit verify and certautopilot health. The fixture project should issue a test certificate against Let's Encrypt staging end-to-end.
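The two verification commands can be wrapped in a small script so the drill produces a pass/fail record. This sketch assumes both commands exit with status 0 on success, which this page does not state; run_dr_checks and the injectable runner are illustrative. The example at the end uses a fake runner so it works without the binary installed.

```python
import subprocess
from typing import Callable, Sequence

DR_CHECKS = [
    ["certautopilot", "audit", "verify"],
    ["certautopilot", "health"],
]


def _run_subprocess(cmd: Sequence[str]) -> int:
    # Raises FileNotFoundError if the certautopilot binary is not on PATH.
    return subprocess.run(list(cmd)).returncode


def run_dr_checks(run: Callable[[Sequence[str]], int] = _run_subprocess) -> dict[str, bool]:
    """Run each DR check and record whether it exited with status 0 (assumed success)."""
    return {" ".join(cmd): run(cmd) == 0 for cmd in DR_CHECKS}


# Dry run with a fake runner that pretends the health check failed.
fake = lambda cmd: 1 if cmd[-1] == "health" else 0
report = run_dr_checks(fake)
print(report)  # → {'certautopilot audit verify': True, 'certautopilot health': False}
```

On the restored cluster, call run_dr_checks() with no arguments and treat any False value as a failed drill.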