Observability
Three signal pipes: Prometheus metrics for fleet health and SLOs, OpenTelemetry traces for end-to-end latency and external-API correlation, and syslog forwarding for SIEM archival of every audit-relevant event. The audit log covers the chain-integrity and verification path.
01 Prometheus metrics
Every CertAutoPilot process exposes /metrics in the Prometheus text format. The endpoint is unauthenticated but carries no secrets — it returns Prometheus counters, gauges, and histograms only. If you terminate outside-world traffic on the same listener, block /metrics at the LB level or expose it on an internal listener.
Scrape — Helm chart
# values.yaml
serviceMonitor:
  enabled: true
  interval: 30s
  labels:
    release: prometheus # match your Prometheus operator's selector
prometheusRule:
  enabled: true # ships recommended alerts (see below)
Scrape — standalone
# /etc/prometheus/conf.d/certautopilot.yml
- job_name: certautopilot
  scrape_interval: 30s
  static_configs:
    - targets: ["cap-host.internal:18181"]
  metrics_path: /metrics
02 Core metrics
HTTP
cap_http_requests_total{method,path,status} counter
cap_http_request_duration_seconds{method,path} histogram
cap_http_in_flight_requests gauge
Path labels are normalised to route patterns (e.g. /projects/:projectId/certificates/:id) — no cardinality explosion from cert IDs.
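To illustrate why route-pattern labels keep cardinality bounded, here is a minimal sketch of path normalisation. The route table and the `unmatched` fallback label are hypothetical; the actual router and its patterns are not shown in this document.

```python
import re

# Hypothetical route table, most specific pattern first.
ROUTES = [
    "/projects/:projectId/certificates/:id",
    "/projects/:projectId/certificates",
    "/projects/:projectId",
]


def _compile(pattern):
    # Each ":name" segment becomes a wildcard for exactly one path segment.
    parts = [
        r"[^/]+" if seg.startswith(":") else re.escape(seg)
        for seg in pattern.split("/")
    ]
    return re.compile("^" + "/".join(parts) + "$")


_COMPILED = [(p, _compile(p)) for p in ROUTES]


def normalise_path(path):
    """Return the route pattern for a concrete path, or a fixed fallback."""
    for pattern, regex in _COMPILED:
        if regex.match(path):
            return pattern
    # A single fallback value keeps unknown paths from creating new label values.
    return "unmatched"
```

Any number of distinct certificate IDs collapses into one label value, so the series count depends on the number of routes rather than the number of certificates.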
Job queue
cap_jobs_queued_total{type} counter
cap_jobs_completed_total{type,result} counter # result: success|failed|partial
cap_jobs_retry_total{type} counter
cap_jobs_dead_letter_total{type} counter
cap_jobs_duration_seconds{type} histogram
cap_jobs_queue_depth{type} gauge # pending count
Certificates
cap_certificates_total{status,issuer_type} gauge # status: active|expired|revoked|renewal_failed
cap_certificates_expiring_in_days{threshold} gauge # count of certs under each threshold (7/14/30/90)
cap_certificates_issued_total{issuer_type} counter
cap_certificates_renewed_total{issuer_type,trigger} counter
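As a rough illustration of how a per-threshold gauge such as cap_certificates_expiring_in_days could be derived, the sketch below counts certificates under each threshold. It assumes the counts are cumulative (a certificate expiring in 3 days is counted under 7, 14, 30, and 90) and that already-expired certificates are excluded; neither detail is confirmed here, and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone

THRESHOLDS_DAYS = (7, 14, 30, 90)


def expiring_counts(not_after_list, now):
    """Count certificates whose remaining lifetime is under each threshold."""
    counts = {t: 0 for t in THRESHOLDS_DAYS}
    for not_after in not_after_list:
        remaining = not_after - now
        if remaining < timedelta(0):
            continue  # assumption: expired certs are reported via status="expired"
        for t in THRESHOLDS_DAYS:
            if remaining < timedelta(days=t):
                counts[t] += 1  # cumulative: one cert can fall under several thresholds
    return counts


now = datetime(2026, 1, 1, tzinfo=timezone.utc)
certs = [now + timedelta(days=d) for d in (3, 10, 45, 200, -1)]
print(expiring_counts(certs, now))
```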
Distribution
cap_distribution_executions_total{module,status} counter
cap_distribution_target_result_total{module,error_class} counter
cap_distribution_fanout_total counter
cap_distribution_fanout_children_active gauge
cap_distribution_validation_total{status} counter
KEK / encryption
cap_kek_process_loaded_versions{host,version} gauge # 1 if loaded, 0 otherwise
cap_kek_process_current_version{host} gauge
cap_kek_cluster_laggards_total{target} gauge # non-zero => NOT READY for target
cap_kek_rotation_active{rotation_id} gauge # 1 while rotation is running
cap_kek_rotation_progress_ratio{rotation_id} gauge # 0..1
cap_kek_wrap_total{provider,result} counter
cap_kek_unwrap_total{provider,result} counter
Notifications · discovery · external · license
cap_notification_delivery_total{channel_type,status} counter
cap_notification_delivery_duration_seconds{channel_type} histogram
cap_discovery_sources_total gauge
cap_discovery_endpoints_total{source_id} gauge
cap_discovery_findings_open{severity} gauge
cap_external_request_duration_seconds{provider,endpoint,status} histogram
cap_external_request_total{provider,endpoint,result} counter
# provider = acme, msca, <dns provider>, <module name>
cap_license_valid gauge # 0 if invalid/expired
cap_license_expires_in_seconds gauge
cap_license_cert_usage gauge
cap_license_cert_limit gauge # 0 = unlimited
03 Recommended alerts
The Helm PrometheusRule ships these by default; copy them into your standalone Prometheus rules file otherwise:
groups:
  - name: certautopilot # group name is arbitrary
    rules:
      - alert: CertRenewalFailingAndExpiring
        # on() is required: the two metrics carry different label sets
        expr: >-
          cap_certificates_total{status="renewal_failed"} > 0
          and on() cap_certificates_expiring_in_days{threshold="7"} > 0
        for: 1h
        labels:
          severity: critical
      - alert: KEKClusterNotReady
        expr: cap_kek_cluster_laggards_total > 0
        for: 5m
        labels:
          severity: warning
      - alert: JobDeadLetterRising
        expr: rate(cap_jobs_dead_letter_total[15m]) > 0
        for: 15m
        labels:
          severity: warning
      - alert: LicenseExpiringSoon
        expr: cap_license_expires_in_seconds < 7 * 24 * 3600
        labels:
          severity: warning
04 OpenTelemetry tracing
CertAutoPilot emits OTLP traces for every HTTP request and every job execution. Spans include cert IDs, project IDs, and external-API latencies — correlate "this user's renewal failed" to "the ACME provider returned 5xx twice" in one view.
Enable:
telemetry:
  tracing:
    enabled: true
    endpoint: http://otel-collector.observability:4318
    sample_rate: 0.1 # 1.0 for debug runs
sample_rate is parent-based — incoming requests with an upstream sampling decision are honoured; otherwise the probabilistic sampler kicks in at the configured rate.
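The parent-based decision described above can be sketched as follows. This is an illustrative model, not CertAutoPilot's implementation: the function name is hypothetical, and the ratio check on the low 64 bits of the trace ID mirrors the general approach of trace-ID-ratio samplers.

```python
from typing import Optional


def should_sample(trace_id: int, parent_sampled: Optional[bool], sample_rate: float) -> bool:
    """Honour an upstream decision if present, else sample by trace-ID ratio."""
    if parent_sampled is not None:
        # An upstream service already decided; keep the trace consistent.
        return parent_sampled
    # Deterministic ratio check: the same trace ID always gets the same answer.
    bound = int(sample_rate * (1 << 64))
    return (trace_id & ((1 << 64) - 1)) < bound
```

Because the fallback decision depends only on the trace ID, every process that sees the same root trace reaches the same verdict without coordination.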
Transports: OTLP/HTTP (port 4318, default) or OTLP/gRPC (port 4317); set the endpoint scheme to grpc:// for the latter. TLS is verified against the system trust store when the endpoint is https:// or grpcs://.
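A small sketch of how the endpoint scheme maps to transport and TLS, as described above. The function name and the returned tuple shape are hypothetical, and falling back to 4318 / 4317 when the URL omits a port is an assumption based on the ports listed in this section.

```python
from urllib.parse import urlparse

# scheme -> (transport, tls_enabled)
_SCHEMES = {
    "http": ("otlp/http", False),
    "https": ("otlp/http", True),
    "grpc": ("otlp/grpc", False),
    "grpcs": ("otlp/grpc", True),
}
# Assumed fallbacks when the endpoint has no explicit port.
_DEFAULT_PORTS = {"otlp/http": 4318, "otlp/grpc": 4317}


def resolve_endpoint(endpoint):
    """Return (transport, tls, host, port) for a tracing endpoint URL."""
    parsed = urlparse(endpoint)
    if parsed.scheme not in _SCHEMES:
        raise ValueError(f"unsupported scheme: {parsed.scheme!r}")
    transport, tls = _SCHEMES[parsed.scheme]
    port = parsed.port or _DEFAULT_PORTS[transport]
    return transport, tls, parsed.hostname, port
```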
What gets traced
- HTTP requests: method, route pattern, status, actor, org, project, child spans for DB queries / job enqueues / service calls.
- Jobs: job type, ID, attempt number, certificate / distribution ID, child spans for external-API calls (ACME, MSCA, DNS, module target), DB writes, event emissions. Errors set the span to Error with the classification tag (network / auth / etc.).
- Propagation: W3C Trace Context + Baggage on inbound and outbound. A webhook receiver that propagates the context back produces end-to-end traces across the system boundary.
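For a webhook receiver that wants to propagate context back, the W3C Trace Context `traceparent` header is the piece to read and re-emit. The sketch below handles only the version 00 layout; the helper names are hypothetical, and a real service would normally rely on an OpenTelemetry SDK propagator instead.

```python
import re

# version - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Parse a traceparent header; return None if it is malformed."""
    m = _TRACEPARENT.match(header.strip())
    if not m:
        return None
    version, trace_id, parent_id, flags = m.groups()
    # The spec forbids version ff and all-zero trace or parent IDs.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(parent, new_span_id):
    """Build the header for an outbound call made under the parsed parent."""
    flags = "01" if parent["sampled"] else "00"
    return f"00-{parent['trace_id']}-{new_span_id}-{flags}"
```

The trace ID is carried through unchanged while the span ID is replaced, which is what lets the tracing backend stitch both sides into one trace.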
Span attributes worth filtering on
| Attribute | Values |
|---|---|
| cap.tenant.org_id | opaque id |
| cap.tenant.project_id | opaque id |
| cap.actor.type | user / api_key / system |
| cap.cert.id | opaque id |
| cap.job.type | job type name |
| cap.job.attempt | 1..N |
| cap.external.provider | acme, msca, cloudflare, … |
| cap.error.class | network / auth / io_transient / io_permanent / validation |
Minimal collector config
receivers:
  otlp:
    protocols:
      http: { endpoint: 0.0.0.0:4318 }
      grpc: { endpoint: 0.0.0.0:4317 }
processors:
  batch:
exporters:
  otlp/tempo:
    endpoint: tempo.observability:4317
    tls: { insecure: true }
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]
05 Syslog forwarding
Every audit event — login, cert issuance, revocation, role change, KEK rotation — forwards to an external syslog relay when enabled. RFC 5424 format over UDP, TCP, or TCP+TLS. Pair with your SIEM (Splunk, ELK / Logstash, Graylog, Datadog, Azure Sentinel) for long-term retention and cross-system correlation.
Configure: Settings → Syslog (org admin role).
- Host + port: typical values are 514 (UDP), 601 (TCP), 6514 (TCP+TLS).
- Transport: udp / tcp / tcp+tls.
- Facility: local0–local7 (default local6).
- Hostname override: the RFC 5424 HOSTNAME field; defaults to the pod / host name.
- App name: defaults to certautopilot.
- TLS CA cert (tcp+tls only): PEM trust anchor for private CAs.
Click Test after saving: a dummy syslog message is sent to confirm that the relay is reachable and parses the message.
Message format
<182>1 2026-04-21T14:23:45.123Z cap-api-1 certautopilot 12345 - [cap@0 event_id="..." event_type="cert.issued" org_id="..." project_id="..." actor="user:alice@example.com" resource="cert:..."] Certificate issued for api.example.com
<182> = facility 22 (local6) × 8 + severity 6 (info). Structured data [cap@0 ...] carries the event payload; the trailing free text summarises the event.
Severity mapping
| CertAutoPilot severity | Syslog severity | Numeric |
|---|---|---|
| critical | crit | 2 |
| error | err | 3 |
| warn | warning | 4 |
| info | info | 6 |
| debug | debug | 7 |
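The PRI arithmetic and the severity table above can be sketched together. The facility codes follow RFC 5424 (local0 through local7 are 16 through 23); the function names are illustrative only.

```python
# RFC 5424 numeric facility codes: local0..local7 are 16..23.
FACILITIES = {f"local{n}": 16 + n for n in range(8)}

# CertAutoPilot severity -> syslog numeric severity, per the table above.
SEVERITIES = {"critical": 2, "error": 3, "warn": 4, "info": 6, "debug": 7}


def encode_pri(facility, severity):
    """PRI = facility code * 8 + severity code."""
    return FACILITIES[facility] * 8 + SEVERITIES[severity]


def decode_pri(pri):
    """Split a PRI value back into (facility code, severity code)."""
    return divmod(pri, 8)


print(encode_pri("local6", "info"))  # 22 * 8 + 6
print(decode_pri(182))
```

Decoding the PRI on the relay side is a quick way to confirm which facility a message was actually sent with.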
Delivery reliability
- UDP: fire-and-forget. Lost packets are lost. Fine for best-effort feeds.
- TCP: reliable, ordered delivery while the connection is up (syslog over plain TCP is described in RFC 6587; RFC 5425 covers the TLS transport). If the relay is unreachable, the backend queues up to syslog.buffer_size events (default 10 k) and drops the oldest when the buffer fills; a warning surfaces in the UI.
- TCP + TLS: same as TCP plus transport encryption. Required for external SIEMs across untrusted networks.
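The drop-oldest buffering described for the TCP transports can be modelled with a bounded deque. This is a sketch of the behaviour only: the class name and the `dropped` counter are hypothetical, and the backend's actual queue is not shown in this document.

```python
from collections import deque


class SyslogBuffer:
    """Bounded FIFO that evicts the oldest event once it is full."""

    def __init__(self, size=10_000):
        self._events = deque(maxlen=size)
        self.dropped = 0  # could back the UI warning mentioned above

    def push(self, event):
        if len(self._events) == self._events.maxlen:
            self.dropped += 1  # deque discards the oldest entry on append
        self._events.append(event)

    def drain(self):
        """Yield buffered events oldest-first, e.g. after the relay reconnects."""
        while self._events:
            yield self._events.popleft()


buf = SyslogBuffer(size=3)
for n in range(1, 6):
    buf.push(n)
print(list(buf.drain()), buf.dropped)
```

Dropping the oldest entries favours recent events during a long outage, which is acceptable here because the audit log itself is kept independently of syslog delivery.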
Audit logs themselves stay in MongoDB regardless of syslog delivery — syslog is a secondary export path, not the primary record. See Audit & SIEM for the chain-integrity model.
SIEM integration
- Splunk: use a syslog: 5424 sourcetype. The structured data becomes parsed fields. Splunk Add-on for Unix and Linux handles 5424 natively.
- ELK / Logstash: syslog input on TCP 6514 with TLS, kv filter on the structured-data prefix.
- Graylog: Syslog TCP or Syslog UDP input. Enable Parse structured data under input settings.
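If your SIEM does not parse RFC 5424 structured data for you, the key/value pairs can be pulled out with a small parser. The sketch below looks only for the cap@0 element and handles the three escapes RFC 5424 defines inside values (`\"`, `\\`, `\]`); the function name is hypothetical.

```python
import re

# One PARAM-NAME="PARAM-VALUE" pair; the value may contain backslash escapes.
_PARAM = re.compile(r'([A-Za-z0-9_.\-]+)="((?:\\.|[^"\\\]])*)"')


def parse_structured_data(message, sd_id="cap@0"):
    """Return the key/value pairs of one structured-data element as a dict."""
    start = message.find("[" + sd_id + " ")
    if start == -1:
        return {}
    fields = {}
    pos = start + len(sd_id) + 2  # skip "[", the SD-ID, and the space
    while True:
        m = _PARAM.match(message, pos)
        if not m:
            break
        # Undo the escaping of '"', '\' and ']' inside the value.
        fields[m.group(1)] = re.sub(r'\\(["\\\]])', r"\1", m.group(2))
        pos = m.end()
        if message.startswith(" ", pos):
            pos += 1
    return fields
```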
06 Egress & SSRF guards
Outbound network policy (SSRF guard) blocks link-local and cloud-metadata addresses by default. The metrics scrape, OTLP collector, and syslog relay must be reachable from the backend's namespace / VPC; Kubernetes NetworkPolicies often block egress by default — add explicit rules. Allowlist private endpoints at the network-policy layer if the relay or collector is intentionally on a private network.
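To make the default block list concrete, here is a minimal sketch of a link-local / metadata address check using Python's standard `ipaddress` module. The function name and the explicit metadata set are assumptions for illustration; the real guard's rule set is not documented here, and a production check would also need to resolve hostnames before testing the address.

```python
import ipaddress

# Assumed metadata endpoints. 169.254.169.254 is already link-local;
# fd00:ec2::254 is the AWS IPv6 metadata address and is not.
METADATA_ADDRESSES = {
    ipaddress.ip_address("169.254.169.254"),
    ipaddress.ip_address("fd00:ec2::254"),
}


def is_blocked(addr):
    """True if the literal IP address is link-local or a known metadata endpoint."""
    ip = ipaddress.ip_address(addr)
    return ip.is_link_local or ip in METADATA_ADDRESSES
```

Note that ordinary private ranges such as 10.0.0.0/8 pass this check, which matches the text above: private relays and collectors are controlled at the network-policy layer rather than by this guard.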
07 Troubleshooting
Prometheus shows the target as DOWN
Check that /metrics isn't blocked by your nginx / LB. Hit it directly from the scrape host — should return text starting with # HELP. If you see HTML or 401, the path is being intercepted; expose a separate internal listener.
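The manual check above can be turned into a small classifier for scripted diagnostics. It only inspects a status code and body that you have already fetched (for example with curl from the scrape host); the function name and the returned labels are hypothetical.

```python
def diagnose_metrics_response(status, body):
    """Classify a /metrics response as 'ok', 'intercepted', or 'unknown'."""
    if status == 401:
        return "intercepted"  # an auth layer sits in front of the endpoint
    text = body.lstrip()
    if text[:15].lower().startswith(("<!doctype", "<html")):
        return "intercepted"  # a proxy or LB answered with an HTML page
    if status == 200 and text.startswith("# HELP"):
        return "ok"  # Prometheus text exposition format
    return "unknown"
```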
No traces appear in the backend
Sample rate is 0 or the endpoint is wrong. Set sample_rate: 1.0 for a debug run and tail the collector's logs for POST /v1/traces. Check that the namespace egress allows TCP to the collector port.
"Syslog Test" passes but nothing arrives at the SIEM
Almost always a relay-side parser problem (the SIEM rejected the structured-data block). Capture on the relay with tcpdump -i any port 6514 (or a relay-specific debug log) and confirm the message hits it. If the message is on the wire but the SIEM dropped it, the structured-data SD-ID (cap@0) isn't whitelisted in your input config.
TCP+TLS syslog: "x509: certificate signed by unknown authority"
The relay's CA isn't in CertAutoPilot's trust set. Paste the CA PEM into the TLS CA cert field, or add it to the system trust store on the host running the backend.