Troubleshooting
A symptom-driven catalog of what goes wrong in production CertAutoPilot installs. Find the failure mode, then jump to the fix.
01 Install & bootstrap
Installer fails: "port 443 in use"
Another web server is already bound to port 443. Stop it (sudo systemctl stop apache2 or sudo systemctl stop nginx), or re-run the installer with --port=8443.
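Before stopping anything, it can help to confirm which process actually holds the port. A minimal sketch, assuming the ss tool from iproute2 is available on the host; the helper name listeners_on_port is hypothetical and not part of CertAutoPilot:

```shell
# Hypothetical helper: read `ss -ltnp` output on stdin and print only the
# LISTEN lines whose local address ends in the given TCP port.
listeners_on_port() {
  awk -v port="$1" '
    /LISTEN/ {
      for (i = 1; i <= NF; i++) {
        if ($i ~ (":" port "$")) { print; next }
      }
    }
  '
}

# Usage (root is needed for ss to show the owning process):
#   sudo ss -ltnp | listeners_on_port 443
```

The match is anchored on the colon, so a listener on 8443 is not reported when you ask for 443.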
Setup token lost
Setup tokens are single-use. If the token has not been used yet, recover it from the journal:
sudo journalctl -u certautopilot --since "30 minutes ago" | grep "setup token"
If the token has already been consumed and no admin user exists, reset the setup:
sudo certautopilot setup --reset
MongoDB won't start
Usually a permission issue on /var/lib/certautopilot/mongo. The directory must be owned by certautopilot:certautopilot with mode 0750. To repair it, run sudo certautopilot setup --fix-permissions.
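To confirm the diagnosis before repairing anything, the ownership and mode can be read directly. A minimal sketch, assuming a Linux host with GNU stat; the helper name has_owner_mode is hypothetical:

```shell
# Hypothetical helper: succeed only if PATH has the expected owner:group and
# octal mode. Relies on GNU stat's -c option, so it assumes a Linux host.
has_owner_mode() {
  path=$1 want_owner=$2 want_mode=$3
  actual=$(stat -c '%U:%G %a' "$path") || return 2
  [ "$actual" = "$want_owner $want_mode" ]
}

# Usage (stat prints mode 0750 as 750):
#   has_owner_mode /var/lib/certautopilot/mongo certautopilot:certautopilot 750 \
#     || echo "ownership or mode is wrong"
```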
02 ACME & CA
ACME account registration fails
The backend cannot reach the directory URL. Test from the host:
curl -sI https://acme-v02.api.letsencrypt.org/directory
If the host is behind a proxy, set HTTPS_PROXY in /etc/certautopilot/secrets.env and restart the service.
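When the curl test fails, its exit code usually narrows down where the connection breaks. A minimal sketch that maps curl's documented exit codes to a likely cause; the helper name curl_exit_hint and the hint wording are illustrative, not CertAutoPilot output:

```shell
# Hypothetical helper: translate a curl exit code into a likely cause.
# The numeric codes are curl's documented exit statuses; the hints are guesses
# at the most common reason on a server host.
curl_exit_hint() {
  case "$1" in
    0)     echo "reachable" ;;
    5)     echo "proxy hostname did not resolve: check HTTPS_PROXY" ;;
    6)     echo "DNS resolution failed for the directory host" ;;
    7)     echo "connection failed: check firewall or proxy settings" ;;
    28)    echo "timed out: outbound traffic is likely filtered" ;;
    35|60) echo "TLS failure: a proxy may be intercepting HTTPS" ;;
    *)     echo "curl exit code $1: see the curl man page" ;;
  esac
}

# Usage:
#   curl -sI --max-time 10 https://acme-v02.api.letsencrypt.org/directory >/dev/null
#   curl_exit_hint $?
```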
"Too many certificates already issued"
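You have hit one of Let's Encrypt's rate limits: either the cap on new certificates per registered domain or the cap on duplicate certificates (the same set of names), both counted over a 7-day window. Switch to the staging environment (https://acme-staging-v02.api.letsencrypt.org/directory) while iterating, and consolidate redundant certificate requests.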
EAB binding error
The CA returned urn:ietf:params:acme:error:externalAccountRequired or unauthorized. Double-check the key identifier (kid) and the HMAC key: each EAB pair is single-use, and many providers regenerate them on the fly, so a pair copied earlier may already be invalid. See EAB-bound CAs.
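A copy-paste error in the HMAC key is easy to miss. RFC 8555 recommends that CAs hand out the MAC key in base64url form, so a key containing +, /, = or whitespace often means it was re-encoded or truncated on the way. A minimal sketch of a sanity check, assuming your CA follows that recommendation; the helper name is_base64url is hypothetical:

```shell
# Hypothetical helper: succeed only if the argument is non-empty and uses
# nothing but the base64url alphabet (A-Z, a-z, 0-9, "-", "_").
is_base64url() {
  case "$1" in
    ''|*[!A-Za-z0-9_-]*) return 1 ;;
    *) return 0 ;;
  esac
}

# Usage (EAB_HMAC_KEY is a placeholder variable name):
#   is_base64url "$EAB_HMAC_KEY" || echo "HMAC key has unexpected characters"
```

This only checks the character set. It cannot tell whether the pair is still valid on the CA side.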
03 DNS-01
DNS challenge never resolves
See DNS-01 troubleshooting. The most common cause is a stale TXT record with a long TTL.
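To check for a stale record, query the validation name directly and look at what resolvers currently return. A minimal sketch, assuming dig is installed; the helper name acme_challenge_name is hypothetical, and example.com stands in for your domain:

```shell
# Hypothetical helper: build the DNS-01 validation record name for a
# certificate identifier. A wildcard such as *.example.com is validated at
# _acme-challenge.example.com.
acme_challenge_name() {
  name=${1#\*.}   # drop a leading "*." if present
  name=${name%.}  # drop a trailing dot if present
  printf '_acme-challenge.%s\n' "$name"
}

# Usage: the second column of the answer section is the remaining TTL in seconds.
#   dig +noall +answer TXT "$(acme_challenge_name '*.example.com')"
#   dig +short TXT "$(acme_challenge_name example.com)" @1.1.1.1
```

If a public resolver still returns an old value with a large remaining TTL, the new challenge value will not be visible there until that TTL expires.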
"Insufficient DNS permissions"
The credential's scope is narrower than the zone it needs to modify. Re-issue the credential with the minimum permission set required for your DNS provider, making sure it covers the whole zone.
04 Distribution
Fan-out partial failure
Some targets succeeded, others failed. Open the certificate's Distribution tab: each target shows its own status and logs. Common causes are an offline target, an expired credential, or a post-deploy reload command that exited non-zero.
Rollback could not restore
The Validate step failed and Rollback also failed (e.g. SSH host went unreachable mid-deploy). The target ends in status broken; the previous cert backup is preserved on the worker filesystem at /var/lib/certautopilot/dist-backups/<job-id>/ for manual recovery.
05 Audit
Audit verify reports broken chain
This is serious: it indicates either a bug or tampering. First capture the broken entry's id and the previous entry's hash, then escalate.
certautopilot audit verify --since 2026-01-01 --json > audit-verify.json
Forward the JSON to support@cloudnativeworks.com along with the deployment version.
06 Upgrade
Migration failed mid-upgrade
Migrations are forward-only and idempotent — re-running the same upgrade is safe and is the recommended fix. If a migration is consistently failing, restore the pre-upgrade Mongo snapshot and contact support; do not run the next minor version on top.
07 KEK
"Cannot decrypt: unknown KEK version"
A pod is encountering records encrypted under a KEK version it doesn't have in memory. Most often: the pod was started before a rotation completed and never picked up the new KEK file. Restart the pod. If the issue persists, the KEK file is missing the version — restore from your secret store.