High availability

Replicate every layer: a three-member MongoDB replica set, a multi-replica API, and a leader-elected scheduler, with rolling upgrades that never drop in-flight ACME orders.

01 Availability targets

Survive a single zone failure with no data loss.
Survive a rolling restart of any one workload type with zero failed renewal jobs.
API availability of 99.95% per month. Background jobs are eventually consistent within 5 minutes.
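To make the API target concrete, the arithmetic below converts 99.95% into a monthly downtime budget. It is a rough sketch that assumes a 30-day month; the function name is illustrative and not part of CertAutoPilot.

```python
def downtime_budget_minutes(slo_percent: float, days_in_month: int = 30) -> float:
    """Minutes of allowed unavailability in one month for a given SLO."""
    total_minutes = days_in_month * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (100.0 - slo_percent) / 100.0


budget = downtime_budget_minutes(99.95)
print(f"{budget:.1f} minutes per 30-day month")  # prints "21.6 minutes per 30-day month"
```

A budget of roughly 21.6 minutes leaves little room for a slow manual failover, which is why the sections that follow rely on automatic failover at every layer.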

02 MongoDB replica set

Three voting members, one per AZ. Use a managed offering (Atlas, or AWS DocumentDB speaking the replica-set protocol) or self-host with the Mongo Operator. The connection string lists all members.

CAP_MONGO_URI="mongodb://m0,m1,m2/certautopilot?replicaSet=rs0&readPreference=primaryPreferred&w=majority"

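CertAutoPilot writes with w=majority. Reads go to the primary for consistency; with readPreference=primaryPreferred they fall back to a secondary only when no primary is available. Job claims run in a transaction, so a worker that crashes mid-claim does not lose the job.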
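A connection string that lists only one member, or drops w=majority, undermines the guarantees described above. The following is a minimal pre-deploy check using only the Python standard library. The function name and the specific rules are assumptions for illustration; CertAutoPilot is not known to ship such a check.

```python
from urllib.parse import parse_qs, urlsplit


def check_mongo_uri(uri: str) -> list[str]:
    """Return a list of problems found in a replica-set connection string."""
    parts = urlsplit(uri)
    # Drop any "user:password@" prefix, then split the comma-separated seed list.
    hosts = [h for h in parts.netloc.split("@")[-1].split(",") if h]
    opts = {key: values[0] for key, values in parse_qs(parts.query).items()}

    problems = []
    if parts.scheme != "mongodb":
        problems.append("scheme is not mongodb://")
    if len(hosts) < 3:
        problems.append(f"expected 3 seed hosts, found {len(hosts)}")
    if "replicaSet" not in opts:
        problems.append("replicaSet option is missing")
    if opts.get("w") != "majority":
        problems.append("w=majority is not set")
    return problems


uri = "mongodb://m0,m1,m2/certautopilot?replicaSet=rs0&readPreference=primaryPreferred&w=majority"
print(check_mongo_uri(uri) or "ok")  # prints "ok"
```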

03 API replicas

Run at least three replicas. Point the Ingress or load balancer health checks at GET /healthz. GET /readyz is the deep readiness check and verifies that MongoDB and the KEK are reachable.
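The split between a shallow health check and a deep readiness check can be sketched as below. Only the two paths and the Mongo and KEK dependencies come from the text above. The status codes, response bodies, and function name are assumptions, not the actual CertAutoPilot handlers.

```python
from typing import Callable, Dict, Tuple


def handle_probe(path: str, checks: Dict[str, Callable[[], bool]]) -> Tuple[int, str]:
    """Return (HTTP status, body) for a probe request."""
    if path == "/healthz":
        # Shallow check: the process is up and serving; no dependency calls.
        return 200, "ok"
    if path == "/readyz":
        failed = []
        for name, check in checks.items():
            try:
                ok = check()
            except Exception:
                ok = False  # a check that raises counts as a failed dependency
            if not ok:
                failed.append(name)
        if failed:
            return 503, "not ready: " + ", ".join(failed)
        return 200, "ready"
    return 404, "not found"


# Hypothetical dependency checks: Mongo reachable, KEK unreachable.
checks = {"mongo": lambda: True, "kek": lambda: False}
print(handle_probe("/healthz", checks))  # (200, 'ok')
print(handle_probe("/readyz", checks))   # (503, 'not ready: kek')
```

Keeping /healthz free of dependency calls means a MongoDB outage takes pods out of rotation through /readyz rather than causing them to be restarted.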

04 Worker replicas

Workers scale horizontally. They claim jobs from a MongoDB collection, and only one worker holds the claim on a given job at any time. The HPA target metric is queue depth; the chart ships a custom metric adapter for it.
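For an average-value target, the Kubernetes HPA sizes the deployment as roughly ceil(total metric / target per pod), clamped to the configured bounds. The sketch below applies that to queue depth. The target of 10 jobs per worker and the replica bounds are made-up values, and the real controller also applies a tolerance band and stabilization windows that are omitted here.

```python
import math


def desired_workers(queue_depth: int, target_per_worker: int,
                    min_replicas: int = 2, max_replicas: int = 20) -> int:
    """Approximate the HPA replica count for an average-value queue-depth target."""
    raw = math.ceil(queue_depth / target_per_worker)
    return max(min_replicas, min(max_replicas, raw))


for depth in (0, 45, 500):
    print(depth, desired_workers(depth, target_per_worker=10))
# 0 -> 2 (floor), 45 -> 5, 500 -> 20 (ceiling)
```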

05 Scheduler & leader election

Run two scheduler pods. They acquire a lease in MongoDB; only the leader enqueues time-driven work (renewal windows, drift scans, KEK rotation steps). The standby promotes within 30 seconds if the leader stops renewing the lease.

Why not 3 schedulers?

Two is enough. The lease itself is the correctness boundary; additional replicas buy only marginal failover speed and add cost.
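The lease behaviour described in this section can be modelled in a few lines. This is an in-memory sketch, not the CertAutoPilot implementation. The 30-second TTL is an assumption chosen to match the promotion window, and in MongoDB the acquire-or-renew step would need to be a single conditional update on the lease document so that two pods cannot both win.

```python
from dataclasses import dataclass
from typing import Optional

LEASE_TTL_SECONDS = 30  # assumption: matches the 30-second promotion window


@dataclass
class Lease:
    holder: Optional[str] = None
    expires_at: float = 0.0


def try_acquire(lease: Lease, candidate: str, now: float) -> bool:
    """Acquire or renew the lease; return True if candidate leads afterwards."""
    if lease.holder == candidate or now >= lease.expires_at:
        lease.holder = candidate
        lease.expires_at = now + LEASE_TTL_SECONDS
        return True
    return False


lease = Lease()
print(try_acquire(lease, "scheduler-a", now=0))   # True: a takes the free lease
print(try_acquire(lease, "scheduler-b", now=10))  # False: a still holds it
print(try_acquire(lease, "scheduler-a", now=20))  # True: a renews until t=50
print(try_acquire(lease, "scheduler-b", now=50))  # True: a stopped renewing, b takes over
```

Because only the current holder passes the check, a third scheduler would sit idle in exactly the same way as the second one.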

06 Rolling upgrade

Migrations run as a Helm pre-upgrade Job.
API rolls one pod at a time (PDB caps unavailable to 1).
Workers roll next; in-flight jobs that lose their pod are re-claimed by a peer after the job lease expires (default 60s).
Scheduler rolls last; the leader steps down so the new pod can pick up the lease.
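The worker step depends on a claim predicate that treats an expired lease as claimable again. The sketch below builds a filter and an update document of the kind that could be passed to pymongo's Collection.find_one_and_update. All field names (status, claimed_by, lease_expires_at) are hypothetical, and the transaction that CertAutoPilot wraps around the claim is not shown.

```python
from datetime import datetime, timedelta, timezone


def build_claim(worker_id: str, now: datetime, lease_seconds: int = 60):
    """Return (filter, update) documents for claiming one runnable job."""
    claim_filter = {
        "$or": [
            {"status": "pending"},
            # A running job whose lease has lapsed belonged to a pod that went away.
            {"status": "running", "lease_expires_at": {"$lte": now}},
        ]
    }
    claim_update = {
        "$set": {
            "status": "running",
            "claimed_by": worker_id,
            "lease_expires_at": now + timedelta(seconds=lease_seconds),
        }
    }
    return claim_filter, claim_update


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
flt, upd = build_claim("worker-7", now)
print(upd["$set"]["lease_expires_at"].isoformat())  # 2024-01-01T00:01:00+00:00
```

With a 60-second lease, a job orphaned by a rolling restart waits at most about a minute before a peer picks it up, which fits within the 5-minute background job target.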

07 Backup & DR

Snapshot MongoDB at the storage layer or use mongodump at least daily. Store offsite.
Back up the KEK separately. Without it the snapshot is useless.
Forwarding the audit log to a SIEM is your second line of defence: even if the primary site is unrecoverable, the audit trail of operations survives.
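For the mongodump route, a daily job might assemble its command as below. This is a sketch: the output directory, file naming, and function name are assumptions, while --uri, --gzip, and --archive are standard mongodump options. Copying the archive offsite and backing up the KEK remain separate steps.

```python
from datetime import date


def mongodump_argv(uri: str, out_dir: str, day: date) -> list[str]:
    """Build the argument list for one compressed, dated archive dump."""
    archive = f"{out_dir}/certautopilot-{day.isoformat()}.archive.gz"
    return ["mongodump", f"--uri={uri}", "--gzip", f"--archive={archive}"]


argv = mongodump_argv(
    "mongodb://m0,m1,m2/certautopilot?replicaSet=rs0",
    "/backups",  # hypothetical local staging directory
    date(2024, 1, 1),
)
print(" ".join(argv))
```

The list can be handed to subprocess.run(argv, check=True) from a scheduled job, so that a failed dump raises an error instead of passing silently.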

08 DR test

Quarterly: restore the most recent snapshot to a clean cluster, supply the KEK, and run certautopilot audit verify and certautopilot health. The fixture project should then issue a test certificate against Let's Encrypt staging end-to-end.
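The verification step of the drill is easy to script so that every quarter runs the same checks. The sketch below uses the two commands named above and assumes, without confirmation from the CLI documentation, that each exits with status 0 on success. The runner is injected so the logic can be exercised without the CLI installed.

```python
import subprocess
from typing import Callable, List, Sequence

# Commands taken from the DR procedure; the exit-code convention is an assumption.
DR_CHECKS: List[List[str]] = [
    ["certautopilot", "audit", "verify"],
    ["certautopilot", "health"],
]


def run_dr_checks(run: Callable[[Sequence[str]], int]) -> List[str]:
    """Run every check in order and return the commands that failed."""
    failures = []
    for argv in DR_CHECKS:
        if run(argv) != 0:
            failures.append(" ".join(argv))
    return failures


def run_real(argv: Sequence[str]) -> int:
    """Execute a command on the restored cluster's admin host."""
    return subprocess.run(list(argv)).returncode


# Dry run with a fake runner that reports the health check as failing.
fake = lambda argv: 1 if argv[-1] == "health" else 0
print(run_dr_checks(fake))  # ['certautopilot health']
```

On a restored cluster, run_dr_checks(run_real) returns an empty list when both checks pass; the staging certificate issuance is then verified separately.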