Operations
The day-2 surface: leader-elected scheduler, MongoDB-backed job queue with retries and dead-letter, the cluster instances view that joins heartbeats with locks, domain tracking for registrar-side expiry, and an incident runbook for the common failures.
01 Scheduler
The scheduler is a leader-elected cron. Exactly one instance is active at any moment; the rest heartbeat and wait. On leader failure the lock expires and a standby takes over. Sweeps run on a configurable cadence and enqueue jobs for the worker pool.
scheduler:
  interval: 1h # sweep cadence
  leader_lock_ttl: 90s # lock lifetime
  heartbeat_interval: 30s # leader re-asserts the lock every N
heartbeat_interval should be ≤ leader_lock_ttl / 2 so a short network blip doesn't expire the lock. Defaults satisfy this.
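The ratio rule can be checked mechanically before a config is rolled out. The sketch below is illustrative only: `parse_duration` and `heartbeat_ratio_ok` are hypothetical helper names, and the parser handles only the simple `<number><s|m|h>` form used in the example config.

```python
import re

_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def parse_duration(text: str) -> int:
    """Parse a simple duration such as '90s', '30s' or '1h' into seconds."""
    match = re.fullmatch(r"(\d+)([smh])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNIT_SECONDS[match.group(2)]


def heartbeat_ratio_ok(heartbeat_interval: str, leader_lock_ttl: str) -> bool:
    """True when heartbeat_interval <= leader_lock_ttl / 2."""
    return 2 * parse_duration(heartbeat_interval) <= parse_duration(leader_lock_ttl)
```

With the defaults, `heartbeat_ratio_ok("30s", "90s")` holds, so the leader gets at least two renewal attempts inside each lock lifetime.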
Leader election
- A MongoDB findAndModify on the locks collection atomically acquires the lock.
- The instance holding the lock runs all sweeps; standbys watch for it to expire.
- On graceful shutdown (SIGTERM), the leader releases the lock explicitly so the standby picks up immediately. On crash, wait leader_lock_ttl.
- A stale leader that can't reach MongoDB realises it lost the lock and steps down; the standby takes over. Split-brain is prevented by MongoDB's single-writer semantics on the findAndModify.
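The acquire, renew, and release rules above can be modelled without a database. This is a minimal in-memory sketch, not the product's implementation: the `Lock` fields (`owner`, `expires_at`) and the `InMemoryLockStore` class are assumptions, and each method stands in for what would be a single atomic findAndModify against the locks collection.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Lock:
    owner: str
    expires_at: float  # epoch seconds


class InMemoryLockStore:
    """Stand-in for the locks collection; each method models one atomic write."""

    def __init__(self) -> None:
        self._locks: Dict[str, Lock] = {}

    def try_acquire(self, name: str, owner: str, now: float, ttl: float) -> bool:
        """Acquire or renew `name` if it is free, expired, or already ours."""
        current = self._locks.get(name)
        held_by_other = (
            current is not None
            and current.owner != owner
            and current.expires_at > now
        )
        if held_by_other:
            return False
        self._locks[name] = Lock(owner=owner, expires_at=now + ttl)
        return True

    def release(self, name: str, owner: str) -> bool:
        """Graceful shutdown: drop the lock only if we still own it."""
        current = self._locks.get(name)
        if current is None or current.owner != owner:
            return False
        del self._locks[name]
        return True

    def holder(self, name: str, now: float) -> Optional[str]:
        """Current live owner, or None when the lock is free or expired."""
        current = self._locks.get(name)
        if current is None or current.expires_at <= now:
            return None
        return current.owner
```

A standby calling `try_acquire` keeps getting `False` until the leader either releases the lock or stops renewing it for a full TTL, which matches the "immediate on SIGTERM, leader_lock_ttl on crash" behaviour described above.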
What the scheduler does
Every scheduler.interval tick, the leader runs:
- Renewal sweep: find certs whose not_after is inside the renewal window (default 30 days, adjustable via ARI). Enqueue renew_certificate / msca_renew_certificate.
- Expiration sweep: emit cert.expiring_soon / cert.expired events at 30 / 14 / 7 / 3 / 1 day thresholds.
- ARI refresh: for participating CAs (Let's Encrypt Prod), poll /acme/renewal-info and cache the hint on the certificate.
- Domain expiration sweep: WHOIS / RDAP lookup for each tracked domain.
- Discovery sweep: for every discovery source with a non-manual schedule, enqueue discovery_execute when the next-run time has passed.
- Auto-distribution sweep: for certs that renewed recently AND have an auto-trigger distribution, enqueue distribution_execute. Covers certs that renewed while the distribution module was temporarily degraded.
- Rate-limit reset: decrement the rolling 7-day issuance count for each zone.
- Approval expiry sweep: flip pending approval requests past their TTL to expired.
- Cleanup sweeps: consumed download tokens, orphaned audit-log exports, stale process_heartbeats.
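The date arithmetic behind the renewal and expiration sweeps is simple enough to sketch. The function names below are hypothetical, and the sketch ignores ARI hints and per-threshold deduplication, which the real sweeps would need to handle.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

THRESHOLD_DAYS = (30, 14, 7, 3, 1)  # thresholds listed for the expiration sweep


def in_renewal_window(not_after: datetime, now: datetime, window_days: int = 30) -> bool:
    """True when the certificate expires within the renewal window."""
    return not_after - now <= timedelta(days=window_days)


def tightest_threshold(not_after: datetime, now: datetime) -> Optional[int]:
    """Smallest threshold (in days) the certificate has crossed, or None."""
    remaining = not_after - now
    crossed = [d for d in THRESHOLD_DAYS if remaining <= timedelta(days=d)]
    return min(crossed) if crossed else None


def expiry_event(not_after: datetime, now: datetime) -> Optional[str]:
    """Event name the expiration sweep would emit for this certificate, if any."""
    if not_after <= now:
        return "cert.expired"
    if tightest_threshold(not_after, now) is not None:
        return "cert.expiring_soon"
    return None
```

A certificate with ten days left has crossed the 30-day and 14-day thresholds, so `tightest_threshold` reports 14 and the event is `cert.expiring_soon`.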
02 Job queue
Every asynchronous operation runs as a job in the MongoDB-backed queue, with one handler per job type. Workers claim jobs atomically and retry failures with backoff; a job that exhausts its attempts lands in dead-letter.
Job types
| Type | Purpose |
|---|---|
| issue_certificate / renew_certificate / reissue_certificate / revoke_certificate | ACME lifecycle. |
| validate_manual_dns / cleanup_dns | DNS-01 manual flow + post-issuance cleanup. |
| msca_issue_certificate / msca_renew_certificate / msca_poll_pending | Microsoft AD CS lifecycle (CES enrollment, renewal, pending-approval polling). |
| certificate_expiration_check / domain_expiration_check | Scheduler-driven sweeps. |
| distribution_execute / distribution_rollback | Module execution / rollback. |
| notification.send | Channel delivery. |
| discovery_execute | Scan / CT monitor / OCSP / CRL fetch. |
| kek_rotation_* | Orchestrator + per-collection re-wrap. |
States
pending → in_progress → succeeded / partial / failed / cancelled → dead_letter on terminal failure.
Retry ladder (default): 1 m → 5 m → 15 m → 1 h → 4 h → 12 h, up to 6 attempts. Per-job-type overrides exist; renewal uses a longer ladder (up to 7 days) so transient provider outages don't expire certs. Error classification (network / io_transient / auth / validation / io_permanent) informs whether to retry — auth and validation go straight to failed without burning retries.
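The interaction between the ladder and error classification can be sketched as a single decision function. Several points here are assumptions: the name `next_step`, the reading that each rung is the delay before the next attempt (so the job dead-letters on the failure after the last rung), and the choice to treat only auth and validation as non-retryable because those are the only two classes the text names.

```python
from typing import Optional, Tuple

# Default ladder from the text, in seconds: 1 m, 5 m, 15 m, 1 h, 4 h, 12 h.
DEFAULT_LADDER = (60, 300, 900, 3600, 14400, 43200)
NON_RETRYABLE = {"auth", "validation"}  # go straight to failed


def next_step(
    error_class: str,
    failed_attempts: int,
    ladder: Tuple[int, ...] = DEFAULT_LADDER,
) -> Tuple[str, Optional[int]]:
    """Decide what happens after the Nth failed attempt (1-based).

    Returns ("retry", delay_seconds), ("failed", None) or ("dead_letter", None).
    """
    if error_class in NON_RETRYABLE:
        return ("failed", None)
    if failed_attempts > len(ladder):
        return ("dead_letter", None)
    return ("retry", ladder[failed_attempts - 1])
```

A per-job-type override, such as the longer renewal ladder, would be passed in through the `ladder` argument.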
Queue lanes
Distribution jobs (distribution_execute, distribution_rollback) run on a dedicated distWorker. Main workers explicitly filter them out via UseRegisteredTypesAsFilter(). Benefit: a 500-target fan-out doesn't starve ACME / MSCA / notification jobs.
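The lane split amounts to a type filter. The sketch below only illustrates the routing idea; `lane_for` and `split_lanes` are hypothetical names, and the actual filtering is done by the workers themselves rather than by a central router.

```python
from typing import Dict, List, Set

DIST_TYPES: Set[str] = {"distribution_execute", "distribution_rollback"}


def lane_for(job_type: str) -> str:
    """Distribution jobs go to the dedicated lane, everything else to main."""
    return "dist" if job_type in DIST_TYPES else "main"


def split_lanes(job_types: List[str]) -> Dict[str, List[str]]:
    """Group pending job types by the lane that would claim them."""
    lanes: Dict[str, List[str]] = {"main": [], "dist": []}
    for job_type in job_types:
        lanes[lane_for(job_type)].append(job_type)
    return lanes
```

However many distribution jobs are pending, the main lane's backlog contains only non-distribution work.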
Browse, retry, cancel
- Jobs in the left nav. Filter by type / status / certificate / actor / time.
- Click for detail — timeline, payload, attempts (each with error class + message + duration), live log stream over WebSocket while running, paginated logs after.
- Retry re-enqueues a failed / dead-letter job. Operator role required. Retry doesn't change the payload — fix the underlying cause first (expired credential, etc.) then retry.
- Cancel moves a pending / in-progress job to cancelled; in-flight work finishes its current step then stops.
Idempotency & dead-letter
Job handlers are idempotent against their natural key — a second issue_certificate for the same cert while one is already in progress is dedup'd at the enqueue site. A distribution executed twice in parallel is guarded by the aggregate state machine.
Jobs that exhaust their retry ladder transition to dead_letter and emit job.dead_letter. Wire it to your ops channel. On Jobs, switch the status tab to dead to inspect — fix the underlying cause then click Retry on the row (which re-enqueues the job and resets the attempt counter). The same Retry action is available on failed jobs.
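Enqueue-site deduplication and the operator Retry action can be illustrated with an in-memory queue. This is a sketch under stated assumptions: the `JobQueue` class, its field names, and the use of a (type, key) pair as the natural key are all hypothetical, and a real queue would enforce the uniqueness check atomically in MongoDB.

```python
from typing import Dict, Tuple

ACTIVE_STATES = {"pending", "in_progress"}
RETRYABLE_STATES = {"failed", "dead_letter"}


class JobQueue:
    """In-memory stand-in for the job collection (illustrative only)."""

    def __init__(self) -> None:
        self.jobs: Dict[int, dict] = {}
        self._next_id = 1

    def enqueue(self, job_type: str, natural_key: str) -> Tuple[int, bool]:
        """Return (job_id, created); reuse an active job with the same key."""
        for job_id, job in self.jobs.items():
            same_key = (job["type"], job["key"]) == (job_type, natural_key)
            if same_key and job["status"] in ACTIVE_STATES:
                return job_id, False
        job_id = self._next_id
        self._next_id += 1
        self.jobs[job_id] = {
            "type": job_type,
            "key": natural_key,
            "status": "pending",
            "attempts": 0,
        }
        return job_id, True

    def retry(self, job_id: int) -> None:
        """Operator retry: back to pending with the attempt counter reset."""
        job = self.jobs[job_id]
        if job["status"] not in RETRYABLE_STATES:
            raise ValueError("only failed or dead_letter jobs can be retried")
        job["status"] = "pending"
        job["attempts"] = 0
```

A second `enqueue("issue_certificate", "cert-1")` while the first is still active returns the existing job ID with `created` set to `False` instead of creating a duplicate.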
03 Cluster instances
A consolidated view (Settings → Cluster) of every live CertAutoPilot process — identity, leader roles, uptime, loaded KEK versions — joined in one table so operators can answer who is doing what without touching mongosh.
Each row is one process, joined from two collections:
- process_heartbeats: emitted every 30 s by every process regardless of run mode (api, worker, scheduler, all). Records liveness, PID, hostname, binary version, loaded KEK versions, and instance_name.
- locks: distributed leader locks for the scheduler, domain checker, and discovery checker. Each active lock has a matching heartbeat owner; the UI stamps role badges onto that row.
Identity triple resolved at startup:
- Name: operator-facing display. Resolution chain: server.instance_name → CERTAUTOPILOT_INSTANCE_NAME → POD_NAME → HOSTNAME → os.Hostname() → literal "unknown".
- ID: <Name>/<short-uuid>, authoritative, used as the lock owner + heartbeat _id. The UUID suffix prevents collisions if two pods accidentally share a Name, and avoids resurrecting an old document on restart.
- Mode: api / worker / scheduler / all.
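The resolution chain reads naturally as "first non-empty value wins". The sketch below mirrors it in Python for illustration; the function names are hypothetical, `socket.gethostname` stands in for the OS hostname call, and the 8-character suffix length is an assumption because the text only says "short-uuid".

```python
import socket
import uuid
from typing import Callable, Mapping, Optional

ENV_CHAIN = ("CERTAUTOPILOT_INSTANCE_NAME", "POD_NAME", "HOSTNAME")


def resolve_name(
    configured: Optional[str],
    env: Mapping[str, str],
    os_hostname: Callable[[], str] = socket.gethostname,
) -> str:
    """Walk the resolution chain and return the first non-empty value."""
    candidates = [configured] + [env.get(key) for key in ENV_CHAIN]
    for value in candidates:
        if value:
            return value
    try:
        return os_hostname() or "unknown"
    except OSError:
        return "unknown"


def make_instance_id(name: str) -> str:
    """<Name>/<short-uuid>; the 8-hex-character suffix is an assumption."""
    return f"{name}/{uuid.uuid4().hex[:8]}"
```

Because the suffix is regenerated on every start, a restarted pod with the same Name gets a new ID rather than reusing its previous heartbeat document.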
API: GET /api/v1/settings/cluster-instances (admin-only). Returns a list of instances with roles (e.g. scheduler_leader, domain_leader) and orphan_leaders (locks whose owner heartbeat has aged out — useful for spotting stale locks during a primary failover).
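The join that produces roles and orphan_leaders can be sketched as follows. The input and output field names (`id`, `name`, `owner`, `role`) are assumptions made for illustration; the actual response schema is not specified here.

```python
from typing import Dict, List


def cluster_view(heartbeats: List[dict], locks: List[dict]) -> Dict[str, list]:
    """Join live heartbeats with leader locks by owner ID."""
    instances = {hb["id"]: {**hb, "roles": []} for hb in heartbeats}
    orphan_leaders = []
    for lock in locks:
        role = f"{lock['name']}_leader"
        owner = instances.get(lock["owner"])
        if owner is None:
            # The lock outlived its owner's heartbeat: report it as an orphan.
            orphan_leaders.append({"role": role, "owner": lock["owner"]})
        else:
            owner["roles"].append(role)
    return {"instances": list(instances.values()), "orphan_leaders": orphan_leaders}
```

A lock whose owner has no live heartbeat ends up in `orphan_leaders` rather than being silently dropped, which is what makes stale locks visible during a failover.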
04 Domain tracking
A cert can be perfectly valid while the underlying domain expires out from under you. Domain tracking watches each registered domain's WHOIS / RDAP expiry and fires alerts on the configured thresholds. Complementary to cert expiry — different time horizon, different responsible party.
Add a tracker: Domains → Add domain. Apex or any FQDN; check interval (default daily); alert thresholds in days (default 30, 7). An initial WHOIS lookup runs immediately.
Events: domain.expiring at each crossed threshold (severity escalates with proximity); domain.expired past the date; domain.whois_lookup_failed when the lookup is throttled or the TLD serves thin WHOIS (.ai / .gg / .io return no expiry date, so expiry shows as unknown).
Modern gTLDs prefer RDAP (HTTPS/JSON) over WHOIS — CertAutoPilot tries RDAP first, falls back to WHOIS. The outbound network policy allows TCP/43 to WHOIS hosts by default.
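The RDAP-first fallback and the threshold check can be sketched with the network calls injected as plain functions, which keeps the example runnable offline. `lookup_expiry`, `crossed_thresholds`, and the `Lookup` type are hypothetical names; treating any exception as "try the next source" is a simplifying assumption.

```python
from datetime import date
from typing import Callable, List, Optional, Tuple

Lookup = Callable[[str], Optional[date]]


def lookup_expiry(domain: str, rdap: Lookup, whois: Lookup) -> Tuple[str, Optional[date]]:
    """Try RDAP first, then WHOIS. Returns (source, expiry)."""
    for source, lookup in (("rdap", rdap), ("whois", whois)):
        try:
            expiry = lookup(domain)
        except Exception:
            continue  # throttling or network error: fall through to the next source
        if expiry is not None:
            return source, expiry
    return "unknown", None


def crossed_thresholds(expiry: date, today: date, thresholds=(30, 7)) -> List[int]:
    """Alert thresholds (in days) that the domain has already crossed."""
    remaining = (expiry - today).days
    return [t for t in thresholds if remaining <= t]
```

When both sources return no date, as with thin-WHOIS TLDs, the result is `("unknown", None)` and no threshold evaluation is possible.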
05 Runbook
Playbooks for the common incidents.
Renewal failing for a specific cert
- Open the cert detail; scroll to Timeline. The cert.renewal_failed event carries error_class + attempt.
- Click through to the failed job; inspect logs.
- By class:
  - network: ACME directory or DNS provider unreachable. Test with curl from the worker host.
  - auth: DNS credential or ACME account revoked / expired. Re-issue, update credential.
  - validation: DNS TXT record didn't propagate. Check zone TTL, registrar console for stale records.
  - io_permanent: CA rejected the request (rate limit, policy). Read the CA's error verbatim in the log.
- Fix the root cause; click Renew on the cert to retry immediately, or wait for the next scheduler sweep.
Dead-letter jobs piling up
- Open Jobs; switch the status tab to dead.
- Look for patterns by type — one type failing en masse = systemic (credential rotation, CA outage, MongoDB index missing).
- Resolve the common cause; click Retry on each affected row (no bulk-retry button — fix the cause once, retry the rows individually or via the API).
- For stragglers, inspect individually. Sometimes the right answer is Cancel (cert was revoked, no point retrying).
Scheduler won't elect a leader
- Check cap_scheduler_is_leader across pods. Should be exactly one at 1.
- All zeros: no instance can write to locks. MongoDB primary probably unreachable.
- Two ones: caching glitch; wait one leader_lock_ttl, one should drop.
- Logs showing context deadline exceeded on lock renewal indicate slow MongoDB writes.
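The first three checks reduce to counting pods that report the gauge at 1. A minimal sketch, assuming the gauge values have already been scraped into a pod-to-value mapping; `diagnose_leader` and its return labels are hypothetical.

```python
from typing import Mapping


def diagnose_leader(gauges: Mapping[str, int]) -> str:
    """Classify cap_scheduler_is_leader readings collected across pods."""
    leaders = sum(1 for value in gauges.values() if value == 1)
    if leaders == 1:
        return "ok"
    if leaders == 0:
        return "no_leader"  # nobody can write to locks; check the MongoDB primary
    return "multiple_leaders"  # wait one leader_lock_ttl, then re-check
```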
MongoDB replica-set failover
- Pods see connection errors for a few seconds during primary election.
- Driver retries; most in-flight requests recover.
- Scheduler loses its lock; standby re-acquires once the new primary accepts writes.
- Workers mid-job: current attempt may fail; next retry succeeds.
- If primary election takes longer than leader_lock_ttl, sweeps are delayed correspondingly. Nothing to do otherwise.
Suspected KEK compromise
Treat as urgent. The KEK wraps every secret; if it leaks, every secret is at risk. Full procedure: KEK rotation.
- Generate a new KEK version (openssl rand -hex 32).
- Add to env / HSM. Restart the fleet so both old and new are loaded.
- certautopilot kek verify --target=<new>: confirm all processes loaded.
- certautopilot kek rotate --from-version=<current> --to-version=<new>. Background job rewraps every envelope.
- Once kek status shows complete, kek remove --version=<old>. Drop the old key from env / delete the HSM key.
- Audit-log review: cross-reference admin actions in the compromise window. Consider rotating dependent secrets too (JWT secret, API-key pepper, module credentials).
ACME provider outage
- Renewals + issuances for that CA fail with network errors.
- CertAutoPilot retries on the configured ladder; transient blips are absorbed.
- Extended outage (hours): create an account on a fallback CA (Sectigo / Google / ZeroSSL); reissue urgent certs manually.
- Once the CA recovers, retry the dead-letter jobs from the outage window. There is no bulk-retry button, so retry the rows individually on Jobs or script the retries via the API.
License expired with no replacement
- At exp the backend enters a 7-day grace window. Cert lifecycle (issue, renew, distribute, revoke) keeps running; the UI shows an expired-banner and the license status endpoint reports in_grace_period: true.
- If usage is below the plan's cert-count cap, you can keep issuing new certs. If the cert cap is hit (e.g. starter at 5/5), issuance blocks until you free up slots or upgrade.
- After the 7-day grace window, enterprise feature gates (LDAP login, OTP policy enforcement, Syslog forwarding) close until a renewed license is uploaded. serve keeps running; cert renewals always continue regardless of license state.
- Obtain a new license and upload it under Settings → License.
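The grace-window rules can be summarised as a small state function. This is an illustrative sketch only: apart from `in_grace_period`, every key and the function name are hypothetical, and it assumes enterprise features stay open during the grace window because the text says the gates close after it.

```python
from datetime import datetime, timedelta, timezone

GRACE = timedelta(days=7)


def license_state(exp: datetime, now: datetime, cert_count: int, cert_cap: int) -> dict:
    """Summarise what keeps working for a given license expiry and usage."""
    expired = now >= exp
    in_grace = expired and now < exp + GRACE
    return {
        "in_grace_period": in_grace,
        "enterprise_features": (not expired) or in_grace,
        "renewals": True,  # renewals continue regardless of license state
        "new_issuance": cert_count < cert_cap,  # blocked only at the cert cap
    }
```

Three days after exp on a starter plan at 5/5, the function reports the grace period as active, renewals running, and new issuance blocked by the cap rather than by the expired license.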
Fan-out distribution stuck
- Jobs → filter distribution_execute + in_progress.
- Many stuck: distWorker isn't running. Check worker pod health.
- One stuck: inspect its log. Usually a specific target hanging on connect; the TCP timeout will eventually resolve it.
- Cancel the parent distribution to mark all pending children cancelled. Retry after fixing.
Lost both MongoDB primary AND secret store
Catastrophic — you've lost the data and the ability to decrypt it.
- Restore MongoDB from the latest backup.
- Restore the secret store (secrets.env or K8s Secret) from the same time window; KEK and envelopes must match.
- Start the service.
- Any certs issued between backup time and loss are gone. Re-issue.
- Post-mortem: why are backups not synchronised? Fix the procedure.
06 Observability hooks
Operations alerting hangs off the metrics / events surface. The high-value alerts:
- cap_scheduler_is_leader: must be exactly one.
- cap_scheduler_leader_elections_total: incrementing frequently → MongoDB write instability or bad config ratio.
- cap_jobs_queue_depth{type}: growing means workers can't keep up.
- cap_jobs_dead_letter_total{type}: any non-zero increase warrants triage.
- scheduler.leader_election event: wire to ops channel to notice failovers.
- job.dead_letter event: wire to triage channel.
- domain.expiring / domain.expired: wire to whoever owns registrar renewals.
Full metric reference: Observability.
07 Upgrades & backups
- Helm: helm upgrade performs a rolling restart. Leader election handles the scheduler transition cleanly.
- Standalone: upgrade.sh. Service bounce; config preserved.
- Backups: snapshot MongoDB + the secret store on the same schedule. Restore drills should rehearse the joint restore, because an out-of-sync KEK and envelopes are unrecoverable.