Risk scoring
Every time you open a certificate's detail page, CertAutoPilot computes a risk score from its current state — how close to expiry, renewal health, distribution coverage, issuer health, chain integrity. The score is computed on-read (no background job), aggregates weighted factors, and maps to one of four levels: critical, high, medium, low.
01 Why on-read
Risk is a view over current state. Caching it in the database would mean invalidating it on every renewal, every distribution, and every OCSP check, which is fragile. Computing it at read time from state that is already kept current takes less code and cannot go stale. The UI renders lazily so the operator doesn't wait for risk on list views unless they expand.
02 Levels
| Level | Score | Meaning |
|---|---|---|
| Critical | ≥ 60 | Act now. Revoked, expired, or key compromise territory. |
| High | 35–59 | Intervene this week. Near-term expiry or repeated renewal failures. |
| Medium | 15–34 | Watch. Renewal deferred, distribution drift, old issuer. |
| Low | 0–14 | Nominal. Normal operational cert. |
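The thresholds in the table can be sketched as a small lookup. This is a minimal illustration in Python, not the backend's code; the function name `risk_level` is hypothetical.

```python
def risk_level(score: int) -> str:
    """Map a 0-100 risk score to one of the four documented levels."""
    if score >= 60:
        return "critical"
    if score >= 35:
        return "high"
    if score >= 15:
        return "medium"
    return "low"
```

Checking from the highest threshold down means each score matches exactly one level.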
03 Risk factors
Each factor that applies contributes its weight to the score. The active factors are shown sorted by weight so you see the biggest contributors first.
| Factor | Weight | When it fires |
|---|---|---|
| Revoked | 60 | Cert has been revoked at the CA (ours or discovered via OCSP). |
| Expired | 60 | Past not_after. |
| Expiry < 7 days | 50 | Fires when the remaining lifetime is under 7 days — mutually exclusive with Expired. |
| Renewal failed (terminal) | 45 | All retries exhausted on the last renewal attempt. |
| Expiry 7–14 days | 35 | Inside the "act this week" window. |
| Renewal failed (retrying) | 30 | Failed at least once; the retry ladder is still climbing. |
| Distribution failed | 25 | Most recent distribution ended in failed. |
| Auto-renew disabled | 20 | Cert relies on manual renewal (explicit operator choice, but a risk signal). |
| Issuer unhealthy | 20 | The ACME account / MSCA connection is in error state. |
| Chain integrity broken | 20 | A previously-trusted intermediate is no longer verifiable. |
| Expiry 14–30 days | 15 | Inside the standard renewal window — normal but surfaced so operators see it. |
| Distribution partial | 10 | Most recent distribution ended in partial_failure. |
The score is capped at 100; sums above 100 saturate at that value.
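The aggregation described above (sum the weights of the factors that fire, list them heaviest first, cap at 100) can be sketched as follows. The weights are copied from the factor table. The string keys and the `score` function are hypothetical names chosen for illustration and are not taken from the backend.

```python
# Weights copied from the factor table; the keys are illustrative names only.
FACTOR_WEIGHTS = {
    "revoked": 60,
    "expired": 60,
    "expiry_lt_7d": 50,
    "renewal_failed_terminal": 45,
    "expiry_7_14d": 35,
    "renewal_failed_retrying": 30,
    "distribution_failed": 25,
    "auto_renew_disabled": 20,
    "issuer_unhealthy": 20,
    "chain_integrity_broken": 20,
    "expiry_14_30d": 15,
    "distribution_partial": 10,
}
MAX_SCORE = 100


def score(active):
    """Return (capped score, active factors sorted heaviest first)."""
    # Deduplicate, then sort by descending weight (name breaks ties).
    ranked = sorted(set(active), key=lambda f: (-FACTOR_WEIGHTS[f], f))
    total = sum(FACTOR_WEIGHTS[f] for f in ranked)
    return min(total, MAX_SCORE), ranked
```

For example, an expired cert with a terminal renewal failure and a failed distribution sums to 60 + 45 + 25 = 130, which saturates at 100.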
04 Attention signals
Four additional signals surface as amber warnings on the detail page without contributing to the score — informational rather than risk-driving:
- PQC vulnerable — key type (RSA ≤ 2048) classed vulnerable by the PQC analysis.
- Wildcard deployed — cert contains a wildcard SAN; compliance teams may require explicit tracking.
- Short validity — the template's configured validity is less than 30 days (common in AD CS short-lived templates).
- No distributions attached — the cert is issued but never deployed anywhere (possible orphan).
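As a rough sketch, the four signals can be read as independent predicates that produce warnings and never touch the score. Everything named here is an assumption for illustration: the function, its parameters, and the signal keys are hypothetical. The PQC check covers only the RSA ≤ 2048 example given above, and "short validity" is read as a template validity under 30 days.

```python
def attention_signals(key_type, key_bits, sans, template_validity_days,
                      distribution_count):
    """Return the informational signals that apply; none of them add to the score."""
    signals = []
    # Only the RSA <= 2048 example from the text; the real PQC analysis may cover more.
    if key_type.upper() == "RSA" and key_bits <= 2048:
        signals.append("pqc_vulnerable")
    # A wildcard SAN has "*" as its leftmost label.
    if any(san.startswith("*.") for san in sans):
        signals.append("wildcard_deployed")
    # Assumption: the template's configured validity is under 30 days.
    if template_validity_days < 30:
        signals.append("short_validity")
    # Issued but never deployed anywhere.
    if distribution_count == 0:
        signals.append("no_distributions")
    return signals
```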
05 Mutual exclusion
Some factors would double-count if both applied:
- Revoked takes precedence over Expired (a revoked-then-expired cert scores as revoked).
- Among expiry buckets, only the worst applicable one fires — a cert 5 days out doesn't also trigger the 7–14 and 14–30 buckets.
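The two rules above can be sketched as a single selection step. The function names and factor keys are hypothetical. The bucket edges are an assumption, because the table does not say which bucket owns exactly 14 days: this sketch uses 7 ≤ days < 14 and 14 ≤ days ≤ 30. It also suppresses only Expired when a cert is revoked, since that is the only precedence stated here.

```python
from typing import List, Optional


def expiry_factor(days_left: float) -> Optional[str]:
    """Pick the single worst expiry bucket, or None if more than 30 days remain."""
    if days_left < 0:          # past not_after
        return "expired"
    if days_left < 7:
        return "expiry_lt_7d"
    if days_left < 14:         # assumed half-open bucket: 7 <= days < 14
        return "expiry_7_14d"
    if days_left <= 30:        # assumed bucket: 14 <= days <= 30
        return "expiry_14_30d"
    return None


def lifecycle_factors(revoked: bool, days_left: float) -> List[str]:
    """Apply the Revoked-over-Expired rule on top of the expiry bucket."""
    factors = []
    bucket = expiry_factor(days_left)
    if revoked:
        factors.append("revoked")
        if bucket == "expired":
            bucket = None      # a revoked-then-expired cert scores as revoked only
    if bucket is not None:
        factors.append(bucket)
    return factors
```

Because `expiry_factor` returns at most one bucket, two expiry factors can never be counted together.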
06 Where you see risk
- Certificate list: per-row icon (hover for factors).
- Certificate detail: full factor list with weights, sorted by weight in descending order.
- Dashboard: aggregate counts per level.
- Notifications: rules can filter on risk level (route "critical" to an urgent channel).
07 Tuning
Weights are currently fixed in the backend (internal/service/cert_risk_service.go). They are product defaults informed by operational experience and by the CA/B Forum's trend toward shorter certificate lifetimes. If you need different weights for internal compliance reporting, export the factor data via the API and recompute in your own dashboards. Runtime-configurable weights are on the roadmap.
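A recompute step for your own reporting might look like the sketch below. The export shape is an assumption, not the documented API response: each record is taken to carry a `cert_id` and a list of active factors with their product weights. The `rescore` function and the override values are hypothetical.

```python
# Hypothetical overrides for internal reporting; factors not listed keep their exported weight.
CUSTOM_WEIGHTS = {"auto_renew_disabled": 40, "distribution_partial": 5}


def rescore(records, overrides, cap=100):
    """Return {cert_id: score} using override weights where given."""
    result = {}
    for record in records:
        total = sum(
            overrides.get(factor["key"], factor["weight"])
            for factor in record["factors"]
        )
        result[record["cert_id"]] = min(total, cap)  # same saturation rule as the product
    return result


# Example input in the assumed export shape.
exported = [
    {"cert_id": "a", "factors": [
        {"key": "auto_renew_disabled", "weight": 20},
        {"key": "expiry_14_30d", "weight": 15},
    ]},
    {"cert_id": "b", "factors": [
        {"key": "revoked", "weight": 60},
        {"key": "renewal_failed_terminal", "weight": 45},
    ]},
]
custom_scores = rescore(exported, CUSTOM_WEIGHTS)
```

With these overrides, cert `a` moves from 35 to 55 while cert `b` stays saturated at 100. Keeping the exported weight as the fallback means new product factors are still counted before you have assigned them a custom weight.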
08 Troubleshooting
"Cert shows as critical but looks fine"
Click to expand the factor list — the specific reason is shown. Most common surprise: OCSP showed revoked during a discovery probe (another team revoked it independently).
"Risk level didn't update after I fixed the issue"
Risk is on-read — the next page load recomputes. If you see a stale value, the page you're looking at is cached client-side; refresh.