Operations

The day-2 surface: leader-elected scheduler, MongoDB-backed job queue with retries and dead-letter, the cluster instances view that joins heartbeats with locks, domain tracking for registrar-side expiry, and an incident runbook for the common failures.

01 Scheduler

The scheduler is a leader-elected cron. Exactly one instance is active at any moment; the rest heartbeat and wait. On leader failure the lock expires and a standby takes over. Sweeps run on a configurable cadence and enqueue jobs for the worker pool.

scheduler:
  interval: 1h               # sweep cadence
  leader_lock_ttl: 90s       # lock lifetime
  heartbeat_interval: 30s    # leader re-asserts the lock every N

heartbeat_interval should be ≤ leader_lock_ttl / 2 so a short network blip doesn't expire the lock. Defaults satisfy this.
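The ratio rule can be checked before rollout. A minimal sketch, assuming durations are written as an integer plus a single unit suffix (as in the config above); `parse_duration` and `heartbeat_is_safe` are illustrative helpers, not part of CertAutoPilot.

```python
import re

# Seconds per unit suffix; only the suffixes used in this page plus days.
_UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_duration(text: str) -> int:
    """Convert a duration such as '90s' or '1h' into seconds."""
    match = re.fullmatch(r"(\d+)([smhd])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    value, unit = match.groups()
    return int(value) * _UNITS[unit]


def heartbeat_is_safe(heartbeat_interval: str, leader_lock_ttl: str) -> bool:
    """True when heartbeat_interval <= leader_lock_ttl / 2."""
    return 2 * parse_duration(heartbeat_interval) <= parse_duration(leader_lock_ttl)


print(heartbeat_is_safe("30s", "90s"))  # True: the defaults, 30 <= 45
print(heartbeat_is_safe("60s", "90s"))  # False: one missed heartbeat expires the lock
```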

Leader election

A MongoDB findAndModify on the locks collection atomically acquires the lock.
The instance holding the lock runs all sweeps; standbys watch for it to expire.
On graceful shutdown (SIGTERM), the leader releases the lock explicitly so the standby picks up immediately. On crash, the standby takes over only after leader_lock_ttl elapses.
A stale leader that can't reach MongoDB realises it lost the lock and steps down; the standby takes over. Split-brain is prevented by MongoDB's single-writer semantics on the findAndModify.
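The acquire / renew / release rules above can be modelled in memory. This is a sketch of the predicate only: in the real system the check and the write happen inside one atomic findAndModify on the server, and the `Lock` fields and function names here are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lock:
    owner: str
    expires_at: float  # seconds on whatever clock the caller uses


def try_acquire(lock: Optional[Lock], me: str, now: float, ttl: float) -> Optional[Lock]:
    """Return the new lock if `me` may hold it, else None.

    Models an update whose filter is: no lock yet, OR I already own it
    (a heartbeat renewal), OR the current lock has expired.
    """
    if lock is None or lock.owner == me or lock.expires_at <= now:
        return Lock(owner=me, expires_at=now + ttl)
    return None


def release(lock: Optional[Lock], me: str) -> Optional[Lock]:
    """Graceful shutdown: only the current owner may delete the lock."""
    if lock is not None and lock.owner == me:
        return None
    return lock


lock = try_acquire(None, "pod-a", now=0, ttl=90)          # pod-a becomes leader
print(try_acquire(lock, "pod-b", now=30, ttl=90))         # None: lock still held
lock = try_acquire(lock, "pod-a", now=30, ttl=90)         # heartbeat renews until t=120
print(try_acquire(lock, "pod-b", now=121, ttl=90).owner)  # pod-b: the lock expired
```

A leader that cannot renew simply stops extending expires_at, which is why a crash costs at most one leader_lock_ttl of delay.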

What the scheduler does

Every scheduler.interval tick, the leader runs:

Renewal sweep — find certs whose not_after is inside the renewal window (default 30 days, adjustable via ARI). Enqueue renew_certificate / msca_renew_certificate.
Expiration sweep — emit cert.expiring_soon / cert.expired events at 30 / 14 / 7 / 3 / 1 day thresholds.
ARI refresh — for participating CAs (Let's Encrypt Prod), poll /acme/renewal-info and cache the hint on the certificate.
Domain expiration sweep — WHOIS / RDAP lookup for each tracked domain.
Discovery sweep — for every discovery source with a non-manual schedule, enqueue discovery_execute when the next-run time has passed.
Auto-distribution sweep — for certs that renewed recently AND have an auto-trigger distribution, enqueue distribution_execute. Covers certs that renewed while the distribution module was temporarily degraded.
Rate-limit reset — decrement the rolling 7-day issuance count for each zone.
Approval expiry sweep — flip pending approval requests past their TTL to expired.
Cleanup sweeps — consumed download tokens, orphaned audit-log exports, stale process_heartbeats.
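As an illustration of one sweep, the expiration thresholds can be treated as crossings between two consecutive sweeps. A sketch under the assumption that the sweep remembers the days-left value it saw last time; how CertAutoPilot actually de-duplicates these events is not stated here, and the function names are hypothetical.

```python
THRESHOLDS_DAYS = (30, 14, 7, 3, 1)  # from the expiration sweep description


def crossed_thresholds(prev_days_left: int, days_left: int) -> list[int]:
    """Thresholds passed since the previous sweep, widest first."""
    return [t for t in THRESHOLDS_DAYS if days_left <= t < prev_days_left]


def events_for(prev_days_left: int, days_left: int) -> list[str]:
    """Event names to emit for one certificate on this sweep."""
    if days_left <= 0:
        # Emit cert.expired once, on the sweep that first sees the expiry.
        return ["cert.expired"] if prev_days_left > 0 else []
    return ["cert.expiring_soon"] * len(crossed_thresholds(prev_days_left, days_left))


print(crossed_thresholds(31, 30))  # [30]
print(crossed_thresholds(15, 6))   # [14, 7]: a long gap between sweeps crosses two
print(events_for(1, 0))            # ['cert.expired']
```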

02 Job queue

Every asynchronous operation runs as a job in the MongoDB-backed queue. One handler per type. Workers claim jobs atomically and retry failures with backoff; a job that exhausts its attempts lands in dead-letter.

Job types

Type	Purpose
`issue_certificate` / `renew_certificate` / `reissue_certificate` / `revoke_certificate`	ACME lifecycle.
`validate_manual_dns` / `cleanup_dns`	DNS-01 manual flow + post-issuance cleanup.
`msca_issue_certificate` / `msca_renew_certificate` / `msca_poll_pending`	Microsoft AD CS lifecycle (CES enrollment, renewal, pending-approval polling).
`certificate_expiration_check` / `domain_expiration_check`	Scheduler-driven sweeps.
`distribution_execute` / `distribution_rollback`	Module execution / rollback.
`notification.send`	Channel delivery.
`discovery_execute`	Scan / CT monitor / OCSP / CRL fetch.
`kek_rotation_*`	Orchestrator + per-collection re-wrap.

States

pending → in_progress → succeeded / partial / failed / cancelled. A job that exhausts its retry ladder moves on to dead_letter.

Retry ladder (default): 1 m → 5 m → 15 m → 1 h → 4 h → 12 h, up to 6 attempts. Per-job-type overrides exist; renewal uses a longer ladder (up to 7 days) so transient provider outages don't expire certs. Error classification (network / io_transient / auth / validation / io_permanent) informs whether to retry — auth and validation go straight to failed without burning retries.
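The decision after a failed attempt can be sketched as a lookup on the ladder. Assumptions: "up to 6 attempts" is read as one retry per rung, and only auth and validation skip retries, since the handling of io_permanent is not spelled out above. `next_step` is a hypothetical name.

```python
from typing import Optional

# Default ladder in seconds: 1 m, 5 m, 15 m, 1 h, 4 h, 12 h.
DEFAULT_LADDER = (60, 300, 900, 3600, 4 * 3600, 12 * 3600)
NON_RETRYABLE = {"auth", "validation"}


def next_step(error_class: str, failed_attempts: int,
              ladder: tuple = DEFAULT_LADDER) -> tuple[str, Optional[int]]:
    """Return (next_state, delay_seconds) after a failure.

    failed_attempts counts failures so far, including the one just seen.
    """
    if error_class in NON_RETRYABLE:
        return ("failed", None)       # retrying cannot help; do not burn rungs
    if failed_attempts > len(ladder):
        return ("dead_letter", None)  # ladder exhausted
    return ("pending", ladder[failed_attempts - 1])


print(next_step("network", 1))  # ('pending', 60)
print(next_step("network", 7))  # ('dead_letter', None)
print(next_step("auth", 1))     # ('failed', None)
```

A per-job-type override would pass a different `ladder`, for example a longer one for renewals.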

Queue lanes

Distribution jobs (distribution_execute, distribution_rollback) run on a dedicated distWorker. Main workers explicitly filter them out via UseRegisteredTypesAsFilter(). Benefit: a 500-target fan-out doesn't starve ACME / MSCA / notification jobs.
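The effect of the two lanes can be shown with a type filter. This is only an illustration of the idea, not the UseRegisteredTypesAsFilter() implementation; the lane names and job dictionaries are hypothetical.

```python
DIST_TYPES = {"distribution_execute", "distribution_rollback"}


def lane_for(job_type: str) -> str:
    """Distribution jobs go to the dedicated lane; everything else to main."""
    return "dist" if job_type in DIST_TYPES else "main"


def claimable(worker_lane: str, pending: list[dict]) -> list[dict]:
    """Jobs a worker on the given lane is allowed to claim, in queue order."""
    return [job for job in pending if lane_for(job["type"]) == worker_lane]


# A 500-target fan-out queued ahead of one renewal.
pending = [{"type": "distribution_execute", "target": i} for i in range(500)]
pending.append({"type": "renew_certificate"})

print(len(claimable("dist", pending)))     # 500
print(claimable("main", pending)[0])       # {'type': 'renew_certificate'}
```

The renewal is first in line for main workers even though 500 jobs were enqueued before it.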

Browse, retry, cancel

Jobs in the left nav. Filter by type / status / certificate / actor / time.
Click for detail — timeline, payload, attempts (each with error class + message + duration), live log stream over WebSocket while running, paginated logs after.
Retry re-enqueues a failed / dead-letter job. Operator role required. Retry doesn't change the payload — fix the underlying cause first (expired credential, etc.) then retry.
Cancel moves a pending / in-progress job to cancelled; in-flight work finishes its current step then stops.

Idempotency & dead-letter

Job handlers are idempotent against their natural key — a second issue_certificate for the same cert while one is already in progress is dedup'd at the enqueue site. A distribution executed twice in parallel is guarded by the aggregate state machine.
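Enqueue-site dedup can be sketched with an in-memory queue. The `Queue` class and its fields are hypothetical; a MongoDB-backed version would need the check and the insert to be a single atomic operation, which this sketch does not attempt.

```python
ACTIVE_STATES = {"pending", "in_progress"}


class Queue:
    """Toy queue that refuses a second active job for the same natural key."""

    def __init__(self) -> None:
        self.jobs: list[dict] = []

    def enqueue(self, job_type: str, natural_key: str) -> dict:
        for job in self.jobs:
            same = (job["type"], job["key"]) == (job_type, natural_key)
            if same and job["state"] in ACTIVE_STATES:
                return job  # dedup: hand back the job already in flight
        job = {"id": len(self.jobs) + 1, "type": job_type,
               "key": natural_key, "state": "pending"}
        self.jobs.append(job)
        return job


q = Queue()
first = q.enqueue("issue_certificate", "cert-42")
second = q.enqueue("issue_certificate", "cert-42")
print(first is second, len(q.jobs))  # True 1

first["state"] = "succeeded"         # once finished, a new job is allowed
third = q.enqueue("issue_certificate", "cert-42")
print(third["id"], len(q.jobs))      # 2 2
```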

Jobs that exhaust their retry ladder transition to dead_letter and emit job.dead_letter. Wire it to your ops channel. On Jobs, switch the status tab to dead to inspect — fix the underlying cause then click Retry on the row (which re-enqueues the job and resets the attempt counter). The same Retry action is available on failed jobs.

03 Cluster instances

A consolidated view (Settings → Cluster) of every live CertAutoPilot process — identity, leader roles, uptime, loaded KEK versions — joined in one table so operators can answer who is doing what without touching mongosh.

Each row is one process, joined from two collections:

process_heartbeats — emitted every 30 s by every process regardless of run mode (api, worker, scheduler, all). Records liveness, PID, hostname, binary version, loaded KEK versions, and instance_name.
locks — distributed leader locks for the scheduler, domain checker, and discovery checker. Each active lock has a matching heartbeat owner; the UI stamps role badges onto that row.

Identity triple resolved at startup:

Name — operator-facing display. Resolution chain: server.instance_name → CERTAUTOPILOT_INSTANCE_NAME → POD_NAME → HOSTNAME → os.Hostname() → literal "unknown".
ID — <Name>/<short-uuid>, authoritative, used as the lock owner + heartbeat _id. The UUID suffix prevents collisions if two pods accidentally share a Name, and avoids resurrecting an old document on restart.
Mode — api / worker / scheduler / all.
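The Name resolution chain is a first-non-empty lookup. A sketch in Python, where socket.gethostname() stands in for os.Hostname() and the 8-character suffix length is an assumption; the function names are illustrative.

```python
import os
import socket
import uuid
from typing import Mapping, Optional


def resolve_name(config_name: Optional[str] = None,
                 env: Optional[Mapping[str, str]] = None) -> str:
    """Walk the resolution chain and return the first non-empty value."""
    env = os.environ if env is None else env
    candidates = [
        config_name,                             # server.instance_name
        env.get("CERTAUTOPILOT_INSTANCE_NAME"),
        env.get("POD_NAME"),
        env.get("HOSTNAME"),
        socket.gethostname(),                    # stand-in for os.Hostname()
    ]
    for candidate in candidates:
        if candidate:
            return candidate
    return "unknown"


def make_instance_id(name: str) -> str:
    """<Name>/<short-uuid>; a fresh suffix on every start avoids reusing old documents."""
    return f"{name}/{uuid.uuid4().hex[:8]}"


name = resolve_name(None, {"POD_NAME": "certautopilot-0", "HOSTNAME": "node-3"})
print(name)                    # certautopilot-0
print(make_instance_id(name))  # certautopilot-0/ followed by 8 random hex characters
```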

API: GET /api/v1/settings/cluster-instances (admin-only). Returns a list of instances with roles (e.g. scheduler_leader, domain_leader) and orphan_leaders (locks whose owner heartbeat has aged out — useful for spotting stale locks during a primary failover).
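The join behind this view can be sketched as follows. Field names (`_id`, `last_seen`, `owner`, `role`) and the 90-second staleness cutoff are assumptions for illustration; the real documents and response shape may differ.

```python
def cluster_view(heartbeats: list[dict], locks: list[dict],
                 now: float, max_age: float = 90.0) -> dict:
    """Join live heartbeats with leader locks.

    A lock whose owner has no live heartbeat is reported as an orphan leader.
    """
    live = {
        hb["_id"]: {**hb, "roles": []}
        for hb in heartbeats
        if now - hb["last_seen"] <= max_age
    }
    orphan_leaders = []
    for lock in locks:
        owner = live.get(lock["owner"])
        if owner is None:
            orphan_leaders.append(lock)
        else:
            owner["roles"].append(lock["role"])
    return {"instances": list(live.values()), "orphan_leaders": orphan_leaders}


heartbeats = [
    {"_id": "pod-a/1f2e3d4c", "last_seen": 100.0},
    {"_id": "pod-b/9a8b7c6d", "last_seen": 10.0},   # aged out
]
locks = [
    {"role": "scheduler_leader", "owner": "pod-a/1f2e3d4c"},
    {"role": "domain_leader", "owner": "pod-b/9a8b7c6d"},
]
view = cluster_view(heartbeats, locks, now=110.0)
print(view["instances"][0]["roles"])       # ['scheduler_leader']
print(view["orphan_leaders"][0]["role"])   # domain_leader
```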

04 Domain tracking

A cert can be perfectly valid while the underlying domain expires out from under you. Domain tracking watches each registered domain's WHOIS / RDAP expiry and fires alerts on the configured thresholds. Complementary to cert expiry — different time horizon, different responsible party.

Add a tracker: Domains → Add domain. Apex or any FQDN; check interval (default daily); alert thresholds in days (default 30, 7). An initial WHOIS lookup runs immediately.

Events: domain.expiring at each crossed threshold (severity escalates with proximity); domain.expired past the date; domain.whois_lookup_failed when the lookup is throttled or the TLD serves thin WHOIS (.ai / .gg / .io publish no expiry, so expiry shows as unknown).

Modern gTLDs prefer RDAP (HTTPS/JSON) over WHOIS — CertAutoPilot tries RDAP first, falls back to WHOIS. The outbound network policy allows TCP/43 to WHOIS hosts by default.
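The RDAP-first, WHOIS-fallback order can be sketched with injected lookup functions, so no network access is needed to follow the control flow. `LookupFailed`, `domain_expiry`, and the stub lookups are hypothetical names.

```python
from datetime import date
from typing import Callable, Optional

Lookup = Callable[[str], Optional[date]]


class LookupFailed(Exception):
    """Raised by a lookup when the server is unreachable or throttles us."""


def domain_expiry(domain: str, rdap: Lookup, whois: Lookup) -> tuple[Optional[date], str]:
    """Try RDAP first, then WHOIS; return (expiry, source)."""
    for source, lookup in (("rdap", rdap), ("whois", whois)):
        try:
            expiry = lookup(domain)
        except LookupFailed:
            continue              # fall through to the next protocol
        if expiry is not None:
            return expiry, source
    return None, "unknown"        # thin WHOIS or both lookups failed


def rdap_down(domain: str) -> Optional[date]:
    raise LookupFailed("no RDAP service for this TLD")


def whois_ok(domain: str) -> Optional[date]:
    return date(2026, 3, 1)


print(domain_expiry("example.com", rdap_down, whois_ok))
# (datetime.date(2026, 3, 1), 'whois')
```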

05 Runbook

Playbooks for the common incidents.

Renewal failing for a specific cert

Open the cert detail; scroll to Timeline. The cert.renewal_failed event carries error_class + attempt.
Click through to the failed job; inspect logs.
By class:
- network — ACME directory or DNS provider unreachable. curl from the worker host.
- auth — DNS credential or ACME account revoked / expired. Re-issue, update credential.
- validation — DNS TXT record didn't propagate. Check zone TTL, registrar console for stale records.
- io_permanent — CA rejected the request (rate limit, policy). Read the CA's error verbatim in the log.
Fix the root cause; click Renew on the cert to retry immediately, or wait for the next scheduler sweep.

Dead-letter jobs piling up

Open Jobs; switch the status tab to dead.
Look for patterns by type — one type failing en masse = systemic (credential rotation, CA outage, MongoDB index missing).
Resolve the common cause; click Retry on each affected row (no bulk-retry button — fix the cause once, retry the rows individually or via the API).
For stragglers, inspect individually. Sometimes the right answer is Cancel (cert was revoked, no point retrying).
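Spotting a systemic failure amounts to counting dead-letter jobs by type. A sketch over a list of job dictionaries such as an API export might provide; the `min_count` and `min_share` cutoffs are arbitrary illustration values, not product defaults.

```python
from collections import Counter


def systemic_types(dead_jobs: list[dict], min_count: int = 5,
                   min_share: float = 0.5) -> list[str]:
    """Job types that dominate the dead-letter list, most frequent first."""
    counts = Counter(job["type"] for job in dead_jobs)
    total = sum(counts.values())
    return [
        job_type
        for job_type, n in counts.most_common()
        if n >= min_count and n / total >= min_share
    ]


dead = [{"type": "renew_certificate"}] * 8 + [{"type": "notification.send"}] * 2
print(systemic_types(dead))  # ['renew_certificate']
```

A type that clears both cutoffs points at a shared cause; anything left over is a straggler to inspect individually.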

Scheduler won't elect a leader

Check cap_scheduler_is_leader across pods. Should be exactly one at 1.
All zeros: no instance can write to locks. MongoDB primary probably unreachable.
Two ones: most likely a stale reading rather than true split-brain, since the lock itself has a single owner. Wait one leader_lock_ttl; one should drop.
Logs showing context deadline exceeded on lock renewal indicate slow MongoDB writes.
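The first three checks reduce to counting how many pods report 1. A sketch that takes the metric values already scraped into a dictionary; how you collect them (Prometheus query, per-pod scrape) is up to your setup, and `diagnose_leadership` is a hypothetical helper.

```python
def diagnose_leadership(is_leader_by_pod: dict[str, int]) -> str:
    """Map cap_scheduler_is_leader values per pod to a runbook hint."""
    leaders = sorted(pod for pod, value in is_leader_by_pod.items() if value == 1)
    if len(leaders) == 1:
        return f"ok: {leaders[0]} is leader"
    if not leaders:
        return "no leader: check that pods can write to the locks collection"
    return "multiple leaders reported: wait one leader_lock_ttl and re-check"


print(diagnose_leadership({"pod-a": 1, "pod-b": 0}))  # ok: pod-a is leader
print(diagnose_leadership({"pod-a": 0, "pod-b": 0}))
print(diagnose_leadership({"pod-a": 1, "pod-b": 1}))
```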

MongoDB replica-set failover

Pods see connection errors for a few seconds during primary election.
Driver retries; most in-flight requests recover.
Scheduler loses its lock; standby re-acquires once the new primary accepts writes.
Workers mid-job: current attempt may fail; next retry succeeds.
If primary election takes longer than leader_lock_ttl, sweeps are delayed correspondingly. Nothing to do otherwise.

Suspected KEK compromise

Treat as urgent. The KEK wraps every secret; if it leaks, every secret is at risk. Full procedure: KEK rotation.

Generate a new KEK version (openssl rand -hex 32).
Add to env / HSM. Restart the fleet so both old and new are loaded.
certautopilot kek verify --target=<new> — confirm all processes loaded.
certautopilot kek rotate --from-version=<current> --to-version=<new>. Background job rewraps every envelope.
Once kek status shows complete, kek remove --version=<old>. Drop the old key from env / delete the HSM key.
Audit-log review: cross-reference admin actions in the compromise window. Consider rotating dependent secrets too (JWT secret, API-key pepper, module credentials).

ACME provider outage

Renewals + issuances for that CA fail with network errors.
CertAutoPilot retries on the configured ladder — transient blips are absorbed.
Extended outage (hours): create an account on a fallback CA (Sectigo / Google / ZeroSSL); reissue urgent certs manually.
Once the CA recovers, retry the dead-letter jobs from the outage window: row by row in Jobs, or scripted via the API (there is no bulk-retry button).

License expired with no replacement

At exp the backend enters a 7-day grace window. Cert lifecycle (issue, renew, distribute, revoke) keeps running; the UI shows an expired-banner and the license status endpoint reports in_grace_period: true.
If your cert count is below the plan's cert-count cap, you can keep issuing new certs. If the cap is hit (e.g. starter at 5/5), issuance blocks until you free up slots or upgrade.
After the 7-day grace window, enterprise feature gates (LDAP login, OTP policy enforcement, Syslog forwarding) close until a renewed license is uploaded. serve keeps running; cert renewals always continue regardless of license state.
Obtain a new license and upload it under Settings → License.
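The timeline above can be summarised as a small state function. This is a reading of the prose, not the backend's logic: the result keys other than in_grace_period are hypothetical, and whether new issuance stays open after the grace window is not stated, so the sketch gates it on the cert cap only.

```python
from datetime import datetime, timedelta, timezone

GRACE = timedelta(days=7)


def license_state(exp: datetime, now: datetime,
                  cert_count: int, cert_cap: int) -> dict:
    """Summarise what keeps working around license expiry."""
    grace_end = exp + GRACE
    return {
        "in_grace_period": exp <= now < grace_end,
        "enterprise_features": now < grace_end,  # LDAP login, OTP policy, Syslog
        "can_issue_new": cert_count < cert_cap,  # blocked only by the cap
        "renewals_run": True,                    # renewals continue regardless
    }


exp = datetime(2025, 1, 1, tzinfo=timezone.utc)
print(license_state(exp, exp + timedelta(days=3), cert_count=5, cert_cap=5))
# {'in_grace_period': True, 'enterprise_features': True,
#  'can_issue_new': False, 'renewals_run': True}
```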

Fan-out distribution stuck

Jobs → filter distribution_execute + in_progress.
Many stuck: distWorker isn't running. Check worker pod health.
One stuck: inspect its log. Usually a specific target hanging on connect — TCP timeout will eventually resolve it.
Cancel the parent distribution to mark all pending children cancelled. Retry after fixing.

Lost both MongoDB primary AND secret store

Catastrophic — you've lost the data and the ability to decrypt it.

Restore MongoDB from the latest backup.
Restore the secret store (secrets.env or K8s Secret) from the same time window — KEK and envelopes must match.
Start the service.
Any certs issued between backup time and loss are gone. Re-issue.
Post-mortem: why are backups not synchronised? Fix the procedure.

06 Observability hooks

Operations alerting hangs off the metrics / events surface. The high-value alerts:

cap_scheduler_is_leader — must be exactly one.
cap_scheduler_leader_elections_total — incrementing frequently → MongoDB write instability, or a heartbeat_interval too close to leader_lock_ttl (keep it ≤ leader_lock_ttl / 2).
cap_jobs_queue_depth{type} — growing means workers can't keep up.
cap_jobs_dead_letter_total{type} — any non-zero increase warrants triage.
scheduler.leader_election event — wire to ops channel to notice failovers.
job.dead_letter event — wire to triage channel.
domain.expiring / domain.expired — wire to whoever owns registrar renewals.

Full metric reference: Observability.

07 Upgrades & backups

Helm: helm upgrade performs a rolling restart. Leader election handles the scheduler transition cleanly.
Standalone: upgrade.sh. Service bounce; config preserved.
Backups: snapshot MongoDB + the secret store on the same schedule. Restore drills should rehearse the joint restore — out-of-sync KEK + envelopes is unrecoverable.