Observability

CertAutoPilot exposes three signal pipes: Prometheus metrics for fleet health and SLOs, OpenTelemetry traces for end-to-end latency and external-API correlation, and syslog forwarding for SIEM archival of every audit-relevant event. The chain-integrity and verification path of the audit log itself is covered in Audit & SIEM.

01 Prometheus metrics

Every CertAutoPilot process exposes /metrics in the Prometheus text format. The endpoint is unauthenticated but carries no secrets — it returns Prometheus counters, gauges, and histograms only. If you terminate outside-world traffic on the same listener, block /metrics at the LB level or expose it on an internal listener.

Scrape — Helm chart

# values.yaml
serviceMonitor:
  enabled: true
  interval: 30s
  labels:
    release: prometheus   # match your Prometheus operator's selector
prometheusRule:
  enabled: true           # ships recommended alerts (see below)

Scrape — standalone

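# /etc/prometheus/conf.d/certautopilot.yml
# Merge into scrape_configs in prometheus.yml, or load this file via scrape_config_files.
scrape_configs:
  - job_name: certautopilot
    scrape_interval: 30s
    metrics_path: /metrics
    static_configs:
      - targets: ["cap-host.internal:18181"]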

02 Core metrics

HTTP

cap_http_requests_total{method,path,status} counter
cap_http_request_duration_seconds{method,path} histogram
cap_http_in_flight_requests gauge

Path labels are normalised to route patterns (e.g. /projects/:projectId/certificates/:id) — no cardinality explosion from cert IDs.
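As an illustration of what this normalisation involves, here is a minimal Python sketch. The route table and the "unmatched" fallback label are assumptions made for the example; the real patterns live in the CertAutoPilot router and may be matched differently.

```python
import re

# Hypothetical route table, for illustration only.
ROUTE_PATTERNS = [
    "/projects/:projectId/certificates/:id",
    "/projects/:projectId/certificates",
    "/projects/:projectId",
]


def _compile(pattern: str) -> re.Pattern:
    # ":name" segments become single-segment wildcards; literal segments are escaped.
    parts = [
        "[^/]+" if seg.startswith(":") else re.escape(seg)
        for seg in pattern.split("/")
    ]
    return re.compile("^" + "/".join(parts) + "$")


_COMPILED = [(p, _compile(p)) for p in ROUTE_PATTERNS]


def path_label(path: str) -> str:
    """Return the route pattern for a concrete path, or a catch-all label."""
    for pattern, regex in _COMPILED:
        if regex.match(path):
            return pattern
    return "unmatched"  # assumed fallback so unknown paths share one label value
```

Every certificate under every project collapses to the same label value, so the number of series grows with the number of routes rather than the number of certificates.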

Job queue

cap_jobs_queued_total{type} counter
cap_jobs_completed_total{type,result} counter      # result: success|failed|partial
cap_jobs_retry_total{type} counter
cap_jobs_dead_letter_total{type} counter
cap_jobs_duration_seconds{type} histogram
cap_jobs_queue_depth{type} gauge                    # pending count

Certificates

cap_certificates_total{status,issuer_type} gauge   # status: active|expired|revoked|renewal_failed
cap_certificates_expiring_in_days{threshold} gauge # count of certs under each threshold (7/14/30/90)
cap_certificates_issued_total{issuer_type} counter
cap_certificates_renewed_total{issuer_type,trigger} counter
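
One plausible reading of cap_certificates_expiring_in_days is a cumulative count per threshold: a certificate expiring in 3 days is counted under 7, 14, 30, and 90. The Python sketch below shows that reading. Treating the buckets as cumulative and leaving out already-expired certificates are both assumptions, not confirmed behaviour.

```python
from datetime import datetime, timedelta, timezone

THRESHOLDS_DAYS = (7, 14, 30, 90)  # thresholds named in the metric comment


def expiring_counts(not_after, now):
    """Count certificates that expire within each threshold (cumulative)."""
    counts = {}
    for days in THRESHOLDS_DAYS:
        cutoff = now + timedelta(days=days)
        # Assumption: certificates already past expiry are not counted here.
        counts[days] = sum(1 for t in not_after if now <= t < cutoff)
    return counts
```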

Distribution

cap_distribution_executions_total{module,status} counter
cap_distribution_target_result_total{module,error_class} counter
cap_distribution_fanout_total counter
cap_distribution_fanout_children_active gauge
cap_distribution_validation_total{status} counter

KEK / encryption

cap_kek_process_loaded_versions{host,version} gauge  # 1 if loaded, 0 otherwise
cap_kek_process_current_version{host} gauge
cap_kek_cluster_laggards_total{target} gauge         # non-zero => NOT READY for target
cap_kek_rotation_active{rotation_id} gauge           # 1 while rotation is running
cap_kek_rotation_progress_ratio{rotation_id} gauge   # 0..1
cap_kek_wrap_total{provider,result} counter
cap_kek_unwrap_total{provider,result} counter

Notifications · discovery · external · license

cap_notification_delivery_total{channel_type,status} counter
cap_notification_delivery_duration_seconds{channel_type} histogram

cap_discovery_sources_total gauge
cap_discovery_endpoints_total{source_id} gauge
cap_discovery_findings_open{severity} gauge

cap_external_request_duration_seconds{provider,endpoint,status} histogram
cap_external_request_total{provider,endpoint,result} counter
# provider = acme, msca, <dns provider>, <module name>

cap_license_valid gauge                          # 0 if invalid/expired
cap_license_expires_in_seconds gauge
cap_license_cert_usage gauge
cap_license_cert_limit gauge                     # 0 = unlimited
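
Because cap_license_cert_limit uses 0 to mean unlimited, any usage ratio built from the two license gauges has to special-case it. A small Python sketch, with a hypothetical helper name:

```python
def license_usage_ratio(cert_usage: float, cert_limit: float):
    """Fraction of the licensed certificate quota in use, or None if unlimited."""
    if cert_limit == 0:  # the metric comment defines 0 as "unlimited"
        return None
    return cert_usage / cert_limit
```

The same guard applies in a dashboard or alert expression, where dividing by an unlimited (zero) limit would otherwise yield +Inf.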

03 Recommended alerts

The Helm PrometheusRule ships these by default; copy them into your standalone Prometheus rules file otherwise:

groups:
  - name: certautopilot
    rules:
      - alert: CertRenewalFailingAndExpiring
        # sum() drops the differing labels so both sides of `and` can match
        expr: >-
          sum(cap_certificates_total{status="renewal_failed"}) > 0
          and sum(cap_certificates_expiring_in_days{threshold="7"}) > 0
        for: 1h
        labels:
          severity: critical

      - alert: KEKClusterNotReady
        expr: cap_kek_cluster_laggards_total > 0
        for: 5m
        labels:
          severity: warning

      - alert: JobDeadLetterRising
        expr: rate(cap_jobs_dead_letter_total[15m]) > 0
        for: 15m
        labels:
          severity: warning

      - alert: LicenseExpiringSoon
        expr: cap_license_expires_in_seconds < 7 * 24 * 3600
        labels:
          severity: warning

04 OpenTelemetry tracing

CertAutoPilot emits OTLP traces for every HTTP request and every job execution. Spans include cert IDs, project IDs, and external-API latencies — correlate "this user's renewal failed" to "the ACME provider returned 5xx twice" in one view.

Enable:

telemetry:
  tracing:
    enabled: true
    endpoint: http://otel-collector.observability:4318
    sample_rate: 0.1   # 1.0 for debug runs

sample_rate is parent-based — incoming requests with an upstream sampling decision are honoured; otherwise the probabilistic sampler kicks in at the configured rate.
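
The decision described above can be sketched in Python as follows. This is an illustration, not CertAutoPilot's sampler: the function name is hypothetical, and the ratio check on the low 64 bits of the trace ID is an assumption modelled loosely on common trace-ID-ratio samplers.

```python
import re

_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def should_sample(traceparent, new_trace_id_hex, sample_rate):
    """Parent-based decision: honour the upstream flag, else sample by ratio."""
    if traceparent:
        m = _TRACEPARENT.match(traceparent.strip().lower())
        if m:
            flags = int(m.group(4), 16)
            return bool(flags & 0x01)  # bit 0 is "sampled" in W3C Trace Context
    # No valid parent: deterministic ratio decision on the new trace ID.
    threshold = int(sample_rate * (1 << 64))
    return int(new_trace_id_hex[-16:], 16) < threshold
```

With sample_rate: 0.1, a request that arrives with an upstream "sampled" flag is still traced, which is why end-to-end traces started by a sampled caller stay complete.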

Transports: OTLP/HTTP (port 4318, default) or OTLP/gRPC (port 4317) — set the endpoint scheme to grpc:// for the latter. TLS via system trust if the endpoint is https:// or grpcs://.

What gets traced

  • HTTP requests — method, route pattern, status, actor, org, project, child spans for DB queries / job enqueues / service calls.
  • Jobs — job type, ID, attempt number, certificate / distribution ID, child spans for external-API calls (ACME, MSCA, DNS, module target), DB writes, event emissions. Errors set the span to Error with the classification tag (network / auth / etc.).
  • Propagation — W3C Trace Context + Baggage on inbound and outbound. A webhook receiver that propagates the context back produces end-to-end traces across the system boundary.
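
For a webhook receiver that wants to continue the trace, propagation comes down to keeping the trace ID and flags from the incoming traceparent header and minting a new span ID. A minimal Python sketch, assuming version 00 headers; the function name is hypothetical:

```python
import secrets


def child_traceparent(incoming: str) -> str:
    """Build the outbound traceparent for a new child span of `incoming`."""
    # Version 00 layout: version-traceid-parentid-flags
    version, trace_id, _parent_span_id, flags = incoming.strip().split("-")
    new_span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    return f"{version}-{trace_id}-{new_span_id}-{flags}"
```

In practice an OpenTelemetry SDK propagator does this for you; the sketch only shows which parts of the header must survive the round trip.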

Span attributes worth filtering on

Attribute                 Values
cap.tenant.org_id         opaque id
cap.tenant.project_id     opaque id
cap.actor.type            user / api_key / system
cap.cert.id               opaque id
cap.job.type              job type name
cap.job.attempt           1..N
cap.external.provider     acme, msca, cloudflare, …
cap.error.class           network / auth / io_transient / io_permanent / validation

Minimal collector config

receivers:
  otlp:
    protocols:
      http: { endpoint: 0.0.0.0:4318 }
      grpc: { endpoint: 0.0.0.0:4317 }
processors:
  batch:
exporters:
  otlp/tempo:
    endpoint: tempo.observability:4317
    tls: { insecure: true }
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]

05 Syslog forwarding

Every audit event — login, cert issuance, revocation, role change, KEK rotation — forwards to an external syslog relay when enabled. RFC 5424 format over UDP, TCP, or TCP+TLS. Pair with your SIEM (Splunk, ELK / Logstash, Graylog, Datadog, Azure Sentinel) for long-term retention and cross-system correlation.

Configure: Settings → Syslog (org admin role).

  • Host + port — typical: 514 (UDP), 601 (TCP), 6514 (TCP+TLS).
  • Transport: udp / tcp / tcp+tls.
  • Facility: local0 to local7 (default local6).
  • Hostname override — the RFC 5424 HOSTNAME field; defaults to the pod / host name.
  • App name — defaults to certautopilot.
  • TLS CA cert (tcp+tls only) — PEM trust anchor for private CAs.

Click Test after saving. CertAutoPilot sends a dummy syslog message so you can confirm that the relay is reachable and that it parses the message.

Message format

<182>1 2026-04-21T14:23:45.123Z cap-api-1 certautopilot 12345 - [cap@0 event_id="..." event_type="cert.issued" org_id="..." project_id="..." actor="user:alice@example.com" resource="cert:..."] Certificate issued for api.example.com

<182> = facility 22 (local6) × 8 + severity 6 (info). Structured data [cap@0 ...] carries the event payload; the trailing free text summarises the event.

Severity mapping

CertAutoPilot severity   Syslog severity   Numeric
critical                 crit              2
error                    err               3
warn                     warning           4
info                     info              6
debug                    debug             7
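
The PRI arithmetic and the severity table can be combined into a few lines of Python. RFC 5424 numbers local0 through local7 as facility codes 16 through 23; the helper names below are hypothetical.

```python
# CertAutoPilot severity -> syslog numeric severity, as in the table above.
SEVERITY = {"critical": 2, "error": 3, "warn": 4, "info": 6, "debug": 7}


def facility_code(name: str) -> int:
    """Map local0..local7 to the RFC 5424 facility codes 16..23."""
    suffix = name[5:]
    if not (name.startswith("local") and suffix.isdigit() and int(suffix) <= 7):
        raise ValueError(f"unsupported facility: {name!r}")
    return 16 + int(suffix)


def pri(facility: str, severity: str) -> int:
    """PRI value that goes between the angle brackets."""
    return facility_code(facility) * 8 + SEVERITY[severity]


def split_pri(value: int):
    """Inverse of pri(): return (facility code, numeric severity)."""
    return divmod(value, 8)
```

split_pri is handy on the relay side: it tells you which facility and severity a captured message actually carried.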

Delivery reliability

  • UDP: fire-and-forget. Lost packets are lost. Fine for best-effort feeds.
  • TCP: connection-oriented delivery (syslog over plain TCP is described in RFC 6587). If the relay is unreachable, the backend queues up to syslog.buffer_size events (default 10 k) and drops the oldest when the buffer fills; a warning surfaces in the UI.
  • TCP + TLS: same as TCP plus transport encryption (RFC 5425). Required for external SIEMs across untrusted networks.
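
The drop-oldest behaviour of the TCP buffer can be pictured with a bounded deque. This is a sketch of the policy only, not the backend's implementation; the class name and the dropped counter are assumptions.

```python
from collections import deque


class DropOldestBuffer:
    """Bounded queue that evicts the oldest event once it is full."""

    def __init__(self, size: int = 10_000):  # mirrors the syslog.buffer_size default
        self._events = deque(maxlen=size)
        self.dropped = 0

    def push(self, event) -> None:
        if len(self._events) == self._events.maxlen:
            self.dropped += 1  # the append below evicts the oldest entry
        self._events.append(event)

    def drain(self):
        """Yield buffered events oldest-first, e.g. after the relay reconnects."""
        while self._events:
            yield self._events.popleft()
```

The consequence for operators: after a long relay outage the SIEM sees the most recent events, and the gap has to be backfilled from the primary audit record.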

Audit logs themselves stay in MongoDB regardless of syslog delivery — syslog is a secondary export path, not the primary record. See Audit & SIEM for the chain-integrity model.

SIEM integration

  • Splunk: use a syslog: 5424 sourcetype. The structured data becomes parsed fields. Splunk Add-on for Unix and Linux handles 5424 natively.
  • ELK / Logstash: syslog input on TCP 6514 with TLS, kv filter on the structured-data prefix.
  • Graylog: Syslog TCP or Syslog UDP input. Enable Parse structured data under input settings.
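
If your SIEM input does not parse RFC 5424 structured data for you, the cap@0 element can be extracted with a short parser. The Python sketch below follows the RFC 5424 escaping rules, where a quote, backslash, or closing bracket inside a value is backslash-escaped. The function name is hypothetical.

```python
import re

# One SD-ELEMENT: [id name="value" name="value" ...]
_SD_ELEMENT = re.compile(
    r'\[(?P<id>[^\s\]=]+)(?P<params>(?: [^\s=\]"]+="(?:\\.|[^"\\\]])*")*)\]'
)
_SD_PARAM = re.compile(r' ([^\s=\]"]+)="((?:\\.|[^"\\\]])*)"')


def parse_structured_data(message: str, sd_id: str = "cap@0") -> dict:
    """Return the name/value pairs of the given SD-ELEMENT, or {} if absent."""
    for m in _SD_ELEMENT.finditer(message):
        if m.group("id") == sd_id:
            return {
                name: re.sub(r'\\(["\\\]])', r"\1", value)  # undo RFC 5424 escapes
                for name, value in _SD_PARAM.findall(m.group("params"))
            }
    return {}
```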

06 Egress & SSRF guards

Outbound network policy (SSRF guard) blocks link-local and cloud-metadata addresses by default. The metrics scrape, OTLP collector, and syslog relay must be reachable from the backend's namespace / VPC; Kubernetes NetworkPolicies often block egress by default — add explicit rules. Allowlist private endpoints at the network-policy layer if the relay or collector is intentionally on a private network.
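
To see which destinations fall under the default block, the check can be approximated with Python's ipaddress module. This sketch models only the two categories named above. The extra metadata address (the AWS IPv6 metadata endpoint) is an illustrative assumption, not CertAutoPilot's actual list.

```python
import ipaddress

# Assumption: metadata endpoints that are not already link-local.
EXTRA_METADATA_ADDRS = {ipaddress.ip_address("fd00:ec2::254")}


def is_blocked(addr: str) -> bool:
    """True if the address is link-local or a known metadata endpoint."""
    ip = ipaddress.ip_address(addr)
    # Unwrap IPv4-mapped IPv6 (::ffff:a.b.c.d) so it cannot bypass the check.
    if isinstance(ip, ipaddress.IPv6Address) and ip.ipv4_mapped is not None:
        ip = ip.ipv4_mapped
    return ip.is_link_local or ip in EXTRA_METADATA_ADDRS
```

A real guard also has to apply the check to the addresses a hostname resolves to at connection time, not only to literal IPs in the configuration.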

07 Troubleshooting

Prometheus shows the target as DOWN

Check that /metrics isn't blocked by your nginx / LB. Hit it directly from the scrape host — should return text starting with # HELP. If you see HTML or 401, the path is being intercepted; expose a separate internal listener.
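
The same check can be scripted from the scrape host. The Python sketch below uses only the standard library; the function names are hypothetical and the URL in the usage note is the example target from the standalone scrape config.

```python
import urllib.error
import urllib.request


def looks_like_prometheus_text(body: str) -> bool:
    """True if the first non-blank line is a Prometheus metadata comment."""
    first = next((ln for ln in body.splitlines() if ln.strip()), "")
    # "# TYPE" is accepted too, since HELP lines are optional in the text format.
    return first.startswith("# HELP") or first.startswith("# TYPE")


def check_metrics(url: str, timeout: float = 5.0) -> str:
    """Fetch the endpoint and classify the result."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as exc:  # e.g. 401 from a proxy in front
        return f"intercepted: HTTP {exc.code}"
    except urllib.error.URLError as exc:
        return f"unreachable: {exc.reason}"
    if looks_like_prometheus_text(body):
        return "ok"
    return "intercepted: body is not Prometheus text"
```

For example, check_metrics("http://cap-host.internal:18181/metrics") separates a network problem from a proxy that answers with a login page.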

No traces appear in the backend

The usual causes are a sample rate of 0 or a wrong endpoint. Set sample_rate: 1.0 for a debug run and tail the collector's logs for POST /v1/traces. Also check that the namespace egress policy allows TCP to the collector port.

"Syslog Test" passes but nothing arrives at the SIEM

Almost always a relay-side parser problem (the SIEM rejected the structured-data block). Capture on the relay with tcpdump -i any port 6514 (or a relay-specific debug log) and confirm the message hits it. If the message is on the wire but the SIEM dropped it, the structured-data SD-ID (cap@0) isn't whitelisted in your input config.

TCP+TLS syslog: "x509: certificate signed by unknown authority"

The relay's CA isn't in CertAutoPilot's trust set. Paste the CA PEM into the TLS CA cert field, or add it to the system trust store on the host running the backend.
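
Before pasting a CA into the settings, you can confirm from the backend host that it really verifies the relay's certificate. A Python sketch using the standard ssl module; the function names and the CA file path are placeholders.

```python
import socket
import ssl


def relay_tls_context(ca_pem_path=None) -> ssl.SSLContext:
    """Trust only the given CA file, or the system store when no path is given."""
    return ssl.create_default_context(cafile=ca_pem_path)


def probe_relay(host: str, port: int = 6514, ca_pem_path=None, timeout: float = 5.0) -> str:
    """Attempt a TLS handshake with the relay and report the outcome."""
    ctx = relay_tls_context(ca_pem_path)
    try:
        with socket.create_connection((host, port), timeout=timeout) as raw:
            with ctx.wrap_socket(raw, server_hostname=host):
                return "ok"
    except ssl.SSLCertVerificationError as exc:
        return f"verification failed: {exc.verify_message}"
```

If probe_relay(host) fails with the system store but succeeds once ca_pem_path points at your private CA, that PEM is the one to paste into the TLS CA cert field.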