Prometheus scrape endpoint at GET /metrics — always-on, no
configuration needed. Counters + gauges become visible after first
increment.
OTLP export to an OpenTelemetry collector — activated when
OTEL_EXPORTER_OTLP_ENDPOINT is set. Runs alongside the Prometheus
reader; both see the same instruments.
Every counter is labeled with pid at minimum so cardinality is
bounded and per-project queries are trivial.
Returns Prometheus text format. No auth today (the counters are
operational, not data). The path is unversioned per Prometheus
convention — operators scrape one URL regardless of /api/v1
changes.
The endpoint issues a 307 redirect from /metrics to /metrics/ —
standard FastAPI mount behavior. Any normal Prometheus scraper
follows the redirect; manual curl users want curl -L.
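
For ad-hoc checks without a Prometheus server, the endpoint can be read with the Python standard library. The sketch below is a minimal illustration, not part of the backend: `parse_prometheus_text` and `scrape` are hypothetical names, and the parser handles only plain sample lines (no escaping inside label values). `urllib` follows the 307 redirect for GET requests without extra flags.

```python
import urllib.request


def parse_prometheus_text(text: str) -> dict[str, float]:
    """Parse Prometheus text exposition into {series: value}.

    Minimal sketch: skips comments and blank lines and ignores the
    optional trailing timestamp.
    """
    samples: dict[str, float] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # Series with labels: everything up to the closing brace.
            end = line.rindex("}") + 1
            series, rest = line[:end], line[end:]
        else:
            series, _, rest = line.partition(" ")
        samples[series] = float(rest.split()[0])
    return samples


def scrape(base_url: str) -> dict[str, float]:
    """Fetch /metrics; the redirect to /metrics/ is followed automatically."""
    with urllib.request.urlopen(f"{base_url}/metrics", timeout=5) as resp:
        return parse_prometheus_text(resp.read().decode("utf-8"))
```

Calling `scrape("http://localhost:8000")` would return a flat dict of series to values, assuming the backend listens on that (hypothetical) address.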
Master elections decided. Increments only when the call actually decides the master seat — joining as peer doesn’t count.
| Counter / gauge | Labels | What it means |
|---|---|---|
| controller_stream_events_total | pid, event_type | ServerEvents pushed out through a CoordinationStream (SPEC-030 Phase 3). Event types: master_preempted, lease_contention, approval_requested, state_change, heartbeat_ack, nudge_push. |
| controller_lease_grants_total | pid, resource_type | Leases granted (successful grants only; contention/rejections live on stream_events). |
| controller_approval_requests_total | pid, outcome | Approval lifecycle events. Outcomes: submitted, approved, rejected, timed_out. Each approval flows through multiple outcome buckets over its life. |
| controller_stream_drops_total | pid | EventBus overflow drops: the oldest event is discarded to make room. Surfacing this helps operators notice when the event rate exceeds the bus capacity. |
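
The drop-oldest behavior counted by controller_stream_drops_total can be pictured with a small bounded buffer. This is a sketch under assumptions: `BoundedEventBus` is a hypothetical stand-in, not the project's EventBus, and the plain `drops` integer stands in for the real counter.

```python
from collections import deque


class BoundedEventBus:
    """Hypothetical drop-oldest buffer; not the project's actual EventBus."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._events: deque = deque(maxlen=capacity)
        self.drops = 0  # stand-in for controller_stream_drops_total

    def publish(self, event) -> None:
        if len(self._events) == self._events.maxlen:
            # deque(maxlen=...) evicts the oldest item on append,
            # so count the drop before it happens.
            self.drops += 1
        self._events.append(event)

    def drain(self) -> list:
        """Return buffered events oldest-first and empty the buffer."""
        items = list(self._events)
        self._events.clear()
        return items
```

A steadily rising drop count in a model like this means producers outpace consumers, which is the condition the real counter is meant to surface.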
SPEC-101 Marconi (signal-mesh hot path + audit fan-out)
Marconi is the in-memory signal-mesh switch (SPEC-101). The legacy signals_sent_total counter remains for backward compatibility; Marconi adds a parallel marconi_* namespace that disambiguates hot-path delivery from audit-pipeline durability. Every metric below MUST exist; the Stage 5 hot-path cutover required them in place before the flag flipped on server1 (2026-05-11). See Marconi for the architectural context.

Hot-path counters (per tenant)
| Counter / gauge | Labels | What it means |
|---|---|---|
| marconi_signals_received_total | tenant, signal_type | Counted at API entry, before routing lookup. |
| marconi_signals_delivered_total | tenant, signal_type, outcome | outcome ∈ {pushed, queued_offline, no_subscriber, dropped}. pushed means Marconi found a live WS handle and pushed; queued_offline means the recipient had no live entry; no_subscriber means the identity didn’t resolve; dropped is the loss-event terminal. |
| marconi_signals_acked_total | tenant, signal_type, ack_kind | Final-delivery evidence, labeled by ack_kind. |
| marconi_signal_send_duration_seconds | tenant | Histogram. p50/p95/p99 of prism_signal end-to-end latency. v0.4 target: p99 < 5 ms on the same-instance hot path. |
| marconi_routing_table_size | tenant | Gauge: count of (tenant_id, project_id, identity) entries in Marconi’s routing table. |
- pid: project identifier like PID-PGR01. Low cardinality (typically < 100 per install). In hot paths where only a UUID is available at metric-emit time, we fall back to the project_id[:8] shorthand; it is still bounded, and operators can grep for the prefix.
- signal_type: the §5.2 type string literally. For broadcasts (to="*"), the value is broadcast to distinguish them from targeted types with the same payload.
- outcome: disjoint per-event terminal state. A single POST /signal lands in exactly one outcome bucket.
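
The label rules above can be sketched as two small helpers. The function names, the sample UUID, and the `nudge` signal type are hypothetical and chosen only for illustration; the rules themselves (the `project_id[:8]` fallback and `broadcast` for `to="*"`) come from the text.

```python
from typing import Optional


def pid_label(pid: Optional[str], project_id: str) -> str:
    """Prefer the readable pid; fall back to the first 8 chars of the UUID."""
    return pid if pid else project_id[:8]


def signal_type_label(signal_type: str, to: str) -> str:
    """Broadcasts (to == "*") are labeled 'broadcast' rather than by type."""
    return "broadcast" if to == "*" else signal_type


# Hypothetical values for illustration only.
labels = {
    "pid": pid_label(None, "3f2b8c1e-9d47-4c1a-8e55-0a1b2c3d4e5f"),
    "signal_type": signal_type_label("nudge", "*"),
}
```

Both helpers return values from a bounded set per install, which keeps series cardinality predictable.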
- alert: PrismSignalsMostlyQueueing
  expr: |
    sum by (pid) (rate(signals_sent_total{outcome="queued"}[15m]))
      / sum by (pid) (rate(signals_sent_total[15m])) > 0.5
  for: 15m
  annotations:
    summary: "{{ $labels.pid }}: >50% of signals are queueing instead of delivering (target offline?)"
Marconi loss event — any audit-queue overwrite is a paging incident
- alert: MarconiAuditQueueOverwrite
  expr: sum by (tenant) (rate(marconi_audit_queue_overwrite_total[5m])) > 0
  for: 1m
  annotations:
    summary: "{{ $labels.tenant }}: Marconi audit queue overwrote entries before reaching Redis Stream — durable signal loss (Redis writer lagging or Redis down longer than queue holds)"
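
To make the arithmetic in the PrismSignalsMostlyQueueing expression concrete, here is a rough Python model of the ratio for a single pid. It is a simplification under stated assumptions: `counter_rate` and `mostly_queueing` are hypothetical helpers, the rate uses only the first and last sample in the window, counter resets are not handled, and the `for: 15m` hold period is not modeled.

```python
def counter_rate(first: float, last: float, window_seconds: float) -> float:
    """Per-second increase of a counter over a window (no reset handling)."""
    return max(last - first, 0.0) / window_seconds


def mostly_queueing(
    queued: tuple[float, float],
    total: tuple[float, float],
    window_seconds: float = 900.0,  # 15m, as in the alert expression
    threshold: float = 0.5,
) -> bool:
    """True when queued signals exceed the threshold share of all signals.

    `queued` and `total` are (first, last) counter readings for one pid.
    """
    queued_rate = counter_rate(*queued, window_seconds)
    total_rate = counter_rate(*total, window_seconds)
    if total_rate == 0:
        # PromQL yields NaN for 0/0, and NaN > 0.5 never matches.
        return False
    return queued_rate / total_rate > threshold
```

With 60 queued out of 100 signals sent in the window, the ratio is 0.6 and the condition holds; with 20 out of 100 it does not.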
The backend uses OpenTelemetry’s Python SDK. Counters are
create_counter instruments on a shared meter. Two readers can be
attached: a PrometheusMetricReader (writing to
prometheus_client.REGISTRY) and, when configured, a
PeriodicExportingMetricReader (OTLP gRPC). Instruments behave
identically regardless of which readers are active; with no export
path active, recording is a silent no-op.
Metric emission is lazy-imported and wrapped in try/except in every
service. A broken record_* call logs at DEBUG and returns; it
never fails the caller. This runs counter to “fail fast” but matches
the principle that observability bugs should never break the
thing being observed.
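
The emission pattern described above might look roughly like the following in a service. This is a sketch, not the backend's code: the import path `app.observability.metrics`, the helper `record_signal_sent`, and the `send_signal` service function are all assumptions made for illustration.

```python
import logging

logger = logging.getLogger("metrics.emit")


def emit_signal_sent(pid: str, outcome: str) -> None:
    """Record one signal; never raise into the caller."""
    try:
        # Lazy import: a missing or broken metrics module must not
        # break the service at import time. (Hypothetical module path.)
        from app.observability.metrics import record_signal_sent

        record_signal_sent(pid=pid, outcome=outcome)
    except Exception:
        # Swallow everything, including ImportError; log quietly.
        logger.debug("metric emission failed", exc_info=True)


def send_signal(pid: str, payload: dict) -> dict:
    """Hypothetical service call: business logic first, metrics best-effort."""
    result = {"pid": pid, "payload": payload, "status": "ok"}
    emit_signal_sent(pid, "delivered")
    return result
```

If the metrics module is absent or the helper raises, `send_signal` still returns normally and the failure shows up only in DEBUG logs.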
Counter names + labels are load-bearing for dashboards. Changing
them is a breaking change for anyone scraping us.
See backend/app/observability/metrics.py for the source of
truth. Adding a counter is a 2-change diff: (1) new
_meter.create_counter(...) at module level, (2) a record_X(...)
helper to be called from services.