Troubleshooting

A symptom-driven catalog of the things that go wrong in production CertAutoPilot installs. Find the failure mode, then jump to the fix.

01 Install & bootstrap

Installer fails: "port 443 in use"

Another web server is already bound to port 443. Stop it (sudo systemctl stop apache2 or sudo systemctl stop nginx), or re-run the installer with --port=8443.
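Before stopping anything, it can help to confirm which process actually holds the port. The following is a minimal sketch, assuming a Linux host with ss (from iproute2) installed; the helper name listeners_on_port is hypothetical and not part of CertAutoPilot.

```shell
# Filter `ss -ltnp` output (read from stdin) down to sockets listening on
# the given port. Prints the local address and the last column, which is
# the owning process when ss was run with sufficient privileges.
# $1: port number to look for.
listeners_on_port() {
  awk -v port="$1" '$1 == "LISTEN" && $4 ~ (":" port "$") { print $4, $NF }'
}
```

Typical usage is sudo ss -ltnp | listeners_on_port 443. Without sudo, ss may omit the process column for sockets owned by other users, in which case the second field shown is the peer address instead of a process name.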

Setup token lost

Setup tokens are single-use. If the token has not been used yet, recover it from the journal:

sudo journalctl -u certautopilot --since "30 minutes ago" | grep "setup token"

If the token has already been consumed and no admin user exists, reset the setup:

sudo certautopilot setup --reset

MongoDB won't start

This is usually a permission issue on /var/lib/certautopilot/mongo. The directory must be owned by certautopilot:certautopilot with mode 0750. To restore the expected ownership and mode, run sudo certautopilot setup --fix-permissions.
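The ownership and mode can be checked by hand before running the fix. This is a small sketch, assuming GNU stat (coreutils) as found on typical Linux hosts; check_dir_perms is a hypothetical helper name, not a CertAutoPilot command.

```shell
# Succeed only if DIR is owned by OWNER:GROUP and has the given octal MODE.
# Note that GNU stat prints the mode without a leading zero (750, not 0750).
# $1: directory, $2: expected owner:group, $3: expected mode (e.g. 750).
check_dir_perms() {
  dir=$1
  want_owner=$2
  want_mode=$3
  actual=$(stat -c '%U:%G %a' "$dir") || return 2
  if [ "$actual" = "$want_owner $want_mode" ]; then
    echo "ok: $dir is $actual"
  else
    echo "mismatch: $dir is $actual, expected $want_owner $want_mode"
    return 1
  fi
}
```

For the case described above, the call would be check_dir_perms /var/lib/certautopilot/mongo certautopilot:certautopilot 750, run with sudo if the parent directory is not readable by your user.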

02 ACME & CA

ACME account registration fails

The backend cannot reach the directory URL. Test from the host:

curl -sI https://acme-v02.api.letsencrypt.org/directory

If the host is behind a proxy, set HTTPS_PROXY in /etc/certautopilot/secrets.env and restart the service.
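As an illustration, the entry in secrets.env might look like the fragment below. The proxy host and port are placeholders, and the plain KEY=value layout is an assumption about how the file is read.

```shell
# /etc/certautopilot/secrets.env
# Hypothetical proxy address; substitute the one used in your network.
HTTPS_PROXY=http://proxy.example.internal:3128
```

To confirm that the proxy itself can reach the directory, the earlier test can be repeated with the variable set inline, since curl honors HTTPS_PROXY: HTTPS_PROXY=http://proxy.example.internal:3128 curl -sI https://acme-v02.api.letsencrypt.org/directory. The restart would then be sudo systemctl restart certautopilot, assuming the systemd unit carries the same name as in the journalctl command used elsewhere on this page.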

"Too many certificates already issued"

You have hit one of Let's Encrypt's weekly rate limits: either the per-registered-domain limit or the duplicate-certificate limit. Switch to the staging environment while iterating, and consolidate redundant certificate requests.

EAB binding error

The CA returned urn:ietf:params:acme:error:externalAccountRequired or unauthorized. Double-check the kid and HMAC: each EAB pair is single-use, and many providers regenerate them on the fly. See EAB-bound CAs.

03 DNS-01

DNS challenge never resolves

See DNS-01 troubleshooting. The most common cause is a stale TXT record with a long TTL.
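A quick way to look for a stale record is to inspect the TTL and value a resolver currently returns for the challenge name. The sketch below assumes dig is installed; parse_txt_answers is a hypothetical helper, and _acme-challenge.example.com stands in for your own domain.

```shell
# Read `dig +noall +answer` output from stdin and print "TTL VALUE" for
# each TXT record. Answer lines have the form: NAME TTL CLASS TYPE RDATA.
# ACME challenge values contain no spaces, so the fifth field is the
# complete (quoted) value.
parse_txt_answers() {
  awk '$4 == "TXT" { printf "%s %s\n", $2, $5 }'
}
```

For example: dig +noall +answer TXT _acme-challenge.example.com @1.1.1.1 | parse_txt_answers. If the value shown is not the token from the current order and the TTL is large, a cached stale answer is a likely culprit. A recursive resolver reports the remaining cache lifetime, while querying the zone's authoritative nameserver shows the configured TTL.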

"Insufficient DNS permissions"

The credential's scope is narrower than the zone. Re-issue the credential with at least the minimum permission set that your DNS provider requires for the zone.

04 Distribution

Fan-out partial failure

Some targets succeeded and others failed. Open the certificate's Distribution tab, where each target shows its own status and logs. Common causes: the target is offline, the credential has expired, or the post-deploy reload command exited non-zero.

Rollback could not restore

The Validate step failed and Rollback also failed (e.g. SSH host went unreachable mid-deploy). The target ends in status broken; the previous cert backup is preserved on the worker filesystem at /var/lib/certautopilot/dist-backups/<job-id>/ for manual recovery.

05 Audit

Audit verify reports broken chain

This is serious: it indicates either a bug or tampering. First capture the broken entry's id and the previous entry's hash, then escalate.

certautopilot audit verify --since 2026-01-01 --json > audit-verify.json

Forward the JSON to support@cloudnativeworks.com along with the deployment version.

06 Upgrade

Migration failed mid-upgrade

Migrations are forward-only and idempotent, so re-running the same upgrade is safe and is the recommended fix. If a migration fails consistently, restore the pre-upgrade Mongo snapshot and contact support; do not install the next minor version on top of the failed migration.

07 KEK

"Cannot decrypt: unknown KEK version"

A pod has encountered records encrypted under a KEK version that it does not have in memory. Most often, the pod was started before a rotation completed and never picked up the new KEK file. Restart the pod. If the issue persists, the KEK file itself is missing that version; restore it from your secret store.
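On Kubernetes, the restart can be done as a rolling restart so that every replica reloads the KEK file. This is a sketch only: it assumes the pods are managed by a Deployment, and the namespace and Deployment name are placeholders that will differ in your cluster.

```shell
# Trigger a rolling restart of a Deployment and wait for it to finish.
# $1: namespace, $2: Deployment name (both hypothetical; use your own).
restart_certautopilot_pods() {
  kubectl -n "$1" rollout restart "deployment/$2" &&
    kubectl -n "$1" rollout status "deployment/$2" --timeout=120s
}
```

For instance, restart_certautopilot_pods certautopilot certautopilot. If the workload is a StatefulSet or DaemonSet instead, the same kubectl rollout subcommands apply with the resource type changed.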