
Metrics & Observability

Prism’s backend exports metrics in two ways:
  • Prometheus scrape endpoint at GET /metrics — always-on, no configuration needed. Counters + gauges become visible after first increment.
  • OTLP export to an OpenTelemetry collector — activated when OTEL_EXPORTER_OTLP_ENDPOINT is set. Runs alongside the Prometheus reader; both see the same instruments.
Most counters are labeled with pid at minimum (the Marconi counters listed below use tenant or hook instead), so cardinality stays bounded and per-project queries are trivial.

Endpoint

GET http://<backend>/metrics
Deployment | URL
Personal-local | http://127.0.0.1:8765/metrics
LAN (e.g. server1) | http://server1.home.lan:41765/metrics
Cloud | https://<cloud-host>/metrics
Returns Prometheus text format. No auth today (the counters are operational, not data). The path is unversioned per Prometheus convention — operators scrape one URL regardless of /api/v1 changes.
The endpoint issues a 307 redirect from /metrics to /metrics/ — standard FastAPI mount behavior. Any normal Prometheus scraper follows the redirect; manual curl users want curl -L.
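For scripted checks, the scrape can be reproduced with the Python standard library. The sketch below assumes the personal-local URL listed above; the helper names parse_samples and fetch_metrics are hypothetical. urllib follows the 307 redirect on GET by default, so no extra flag is needed. The parser assumes sample lines carry no trailing timestamp, so treat it as an inspection aid rather than a complete text-format parser.

```python
import urllib.request


def parse_samples(text: str) -> dict[str, float]:
    """Map each 'name{labels}' series string to its value, skipping comments."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # blank lines and HELP/TYPE comments
            continue
        # The value is the last space-separated token (assumes no timestamp).
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


def fetch_metrics(url: str = "http://127.0.0.1:8765/metrics") -> dict[str, float]:
    # urlopen follows the /metrics -> /metrics/ redirect automatically.
    with urllib.request.urlopen(url, timeout=5) as resp:
        return parse_samples(resp.read().decode("utf-8"))
```

Calling fetch_metrics() against a running backend returns a dictionary keyed by the full series string, which is convenient for quick assertions in smoke tests.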

Counter catalog

SPEC-030 controller plane

Counter | Labels | What it means
controller_elections_total | pid, winner_surface | Master elections decided. Increments only when the call actually decides the master seat; joining as peer doesn’t count.
controller_stream_events_total | pid, event_type | ServerEvents pushed out through a CoordinationStream (SPEC-030 Phase 3). Event types: master_preempted, lease_contention, approval_requested, state_change, heartbeat_ack, nudge_push.
controller_lease_grants_total | pid, resource_type | Leases granted (successful grants only; contention and rejections live on stream_events).
controller_approval_requests_total | pid, outcome | Approval lifecycle events. Outcomes: submitted, approved, rejected, timed_out. Each approval flows through multiple outcome buckets over its life.
controller_stream_drops_total | pid | EventBus overflow drops: the oldest event is discarded to make room. Surfacing this helps operators notice when the event rate exceeds the bus capacity.
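To make the drop-oldest behavior behind controller_stream_drops_total concrete, here is a minimal sketch of a bounded buffer that discards its oldest entry on overflow and counts each discard. The BoundedBus class and its attributes are hypothetical and only illustrate the semantics described above; they are not Prism’s actual EventBus.

```python
from collections import deque


class BoundedBus:
    """Hypothetical drop-oldest buffer; not Prism's actual EventBus."""

    def __init__(self, capacity: int) -> None:
        self._events: deque[str] = deque()
        self._capacity = capacity
        self.drops = 0  # stands in for controller_stream_drops_total

    def publish(self, event: str) -> None:
        if len(self._events) >= self._capacity:
            self._events.popleft()  # discard the oldest event to make room
            self.drops += 1
        self._events.append(event)

    def pending(self) -> list[str]:
        return list(self._events)
```

With a capacity of 2, publishing three events leaves the two newest in the buffer and a drop count of 1, which is the situation the stream-drops alert is meant to surface.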

SPEC-030 gauges

Gauge | Labels | What it means
controller_registrations_active | pid | Currently-active controller registrations (released_at IS NULL). Observable via periodic polling of Postgres.

SPEC-034 agent-to-agent signals

Counter | Labels | What it means
signals_sent_total | pid, signal_type, outcome | Signals accepted by POST /signal, labeled by signal_type and outcome.
signals_drained_on_startup_total | pid, to_identity | Signals delivered via prism_start’s drain path (SPEC-034 §6.3). Increments by the number of rows drained per call.
system_signals_emitted_total | pid, signal_type, target | Controller-emitted system signals (not via the prism_signal verb), labeled by signal_type and target.

SPEC-101 Marconi (signal-mesh hot path + audit fan-out)

Marconi is the in-memory signal-mesh switch (SPEC-101). The legacy signals_sent_total counter remains for backward compatibility; Marconi adds a parallel marconi_* namespace that disambiguates hot-path delivery from audit-pipeline durability. Every metric below MUST exist; the Stage 5 hot-path cutover required them in place before the flag flipped on server1 (2026-05-11). See Marconi for the architectural context.

Hot-path counters (per tenant)

Counter / gauge | Labels | What it means
marconi_signals_received_total | tenant, signal_type | At API entry, before routing lookup.
marconi_signals_delivered_total | tenant, signal_type, outcome | Outcome values: pushed means Marconi found a live WS handle and pushed; queued_offline means the recipient had no live entry; no_subscriber means the identity didn’t resolve; dropped is the loss-event terminal.
marconi_signals_acked_total | tenant, signal_type, ack_kind | Final-delivery evidence, labeled by ack_kind.
marconi_signal_send_duration_seconds | tenant | Histogram. p50/p95/p99 of prism_signal end-to-end latency. v0.4 target: p99 < 5ms on the same-instance hot path.
marconi_routing_table_size | tenant | Gauge: count of (tenant_id, project_id, identity) entries in Marconi’s routing table.
marconi_routing_cache_hits_total / marconi_routing_cache_misses_total | tenant | Routing lookup cache hits vs SessionStore-fallback misses. Steady-state hit-rate ≥ 99%.

Audit queue + fan-out counters

Counter / gauge | Labels | What it means
marconi_audit_queue_depth | tenant | Gauge: in-memory audit queue depth. Sized per max_entries.
marconi_audit_queue_overwrite_total | tenant | Counter: entries overwritten before reaching Redis Stream. Loss event. Non-zero is a paging incident under the v0.4 loss budget.
marconi_redis_writer_lag_seconds | tenant | Gauge: Redis Stream writer lag from audit-queue head.
marconi_redis_writer_errors_total | tenant, reason | Counter: Redis writer failures.
marconi_pg_archiver_lag_seconds | tenant | Gauge: PG archiver lag from Marconi Cache. Near-zero in steady state (“once it hits cache it immediately goes to PG”).
marconi_pg_archiver_errors_total | tenant, reason | Counter: PG archiver failures.

Cache-invalidator counters (Stage 2)

Counter | Labels | What it means
marconi_invalidator_calls_total | hook | Labeled by hook. Tracks every direct write-through hook firing.
marconi_invalidator_errors_total | hook, reason | Non-fatal hook errors. Non-zero in steady state is a paging incident, because invalidation gaps cause stale-route delivery.

Obligation counters

Counter / gauge | Labels | What it means
marconi_obligations_open | tenant, kind | Gauge: open obligations awaiting ack or terminal.
marconi_obligations_durable_total | tenant, kind | Counter: obligations whose envelope reached PG audit.
marconi_obligations_degraded_not_durable_total | tenant, kind | Counter: obligations delivered but not yet durably persisted (recoverable from upstream tier).
marconi_obligation_sla_violation_total | tenant, kind, sla | Counter: SLA breach (ack_sla_seconds or terminal_sla_seconds exceeded).

Unknown-recipient rejections

Counter | Labels | What it means
unknown_recipient_rejections_total | tenant, signal_type | Increments on every publish_path=rejected_unknown. Surfaces typo’d identities and stale routing without log inspection. SPEC-071 §5.

Label conventions

  • pid — project identifier like PID-PGR01. Low cardinality (typically < 100 per install). In hot paths where only a UUID is available at metric-emit time, we fall back to project_id[:8] shorthand — still bounded, operators can grep for the prefix.
  • signal_type — the §5.2 type string literally. For broadcasts (to="*"), the value is broadcast to distinguish from targeted types with the same payload.
  • outcome — disjoint per-event terminal state. A single POST /signal lands in exactly one outcome bucket.
  • winner_surface / agent_surface: claude_desktop, claude_code, codex, cursor, or other. Bounded.
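The pid fallback and the broadcast rule above can be expressed as two small helpers. This is a sketch under the stated conventions only; the function names pid_label and signal_type_label are hypothetical and may not match anything in the backend.

```python
from typing import Optional


def pid_label(pid: Optional[str], project_id: str) -> str:
    """Prefer the readable PID; otherwise use the first 8 chars of the UUID."""
    return pid if pid else project_id[:8]


def signal_type_label(signal_type: str, to: str) -> str:
    """Broadcasts (to="*") are reported as "broadcast" instead of their type."""
    return "broadcast" if to == "*" else signal_type
```

Both helpers keep the label space bounded: the first never emits a full UUID, and the second collapses every broadcast into a single value.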

Suggested Prometheus queries

sum by (pid) (rate(signals_sent_total[5m]))
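When reading the result of that query, keep in mind that rate() reports an average per-second increase over the window, so any threshold compared against it is a per-second value. The sketch below shows the simplified arithmetic; it deliberately ignores the counter-reset handling and window extrapolation that Prometheus applies, and simple_rate is a hypothetical name.

```python
def simple_rate(samples: list[tuple[float, float]]) -> float:
    """samples: (unix_seconds, counter_value) pairs, oldest first."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    # Increase across the window divided by the window length in seconds.
    return (v1 - v0) / (t1 - t0)


# 30 signals across a 300-second window averages 0.1 signals per second.
window = [(0.0, 100.0), (300.0, 130.0)]
per_second = simple_rate(window)
```

Multiplying the per-second value by the window length recovers the raw increase over that window.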

Suggested alerts (baseline)

Alert thresholds are deployment-specific; these are starting points for personal / small-team installs.

Stream drops — anything > 0 is bad

- alert: PrismStreamDrops
  expr: rate(controller_stream_drops_total[5m]) > 0
  for: 2m
  annotations:
    summary: "Prism EventBus dropped events — bus capacity may be undersized"

Election churn — more than one election per hour per PID suggests instability

- alert: PrismElectionChurn
  expr: sum by (pid) (increase(controller_elections_total[1h])) > 1
  for: 10m
  annotations:
    summary: "Controller election churn on {{ $labels.pid }} — master may be flapping"

Approval timeouts — any approval timing out warrants attention

- alert: PrismApprovalTimeout
  expr: rate(controller_approval_requests_total{outcome="timed_out"}[10m]) > 0
  annotations:
    summary: "Approval request timed out on {{ $labels.pid }} — decider offline?"

Signal queueing rate: more than half of sent signals are queued rather than delivered

- alert: PrismSignalsMostlyQueueing
  expr: |
    sum by (pid) (rate(signals_sent_total{outcome="queued"}[15m]))
      /
    sum by (pid) (rate(signals_sent_total[15m])) > 0.5
  for: 15m
  annotations:
    summary: "{{ $labels.pid }}: >50% of signals are queueing instead of delivering (target offline?)"

Marconi loss event — any audit-queue overwrite is a paging incident

- alert: MarconiAuditQueueOverwrite
  expr: sum by (tenant) (rate(marconi_audit_queue_overwrite_total[5m])) > 0
  for: 1m
  annotations:
    summary: "{{ $labels.tenant }}: Marconi audit queue overwrote entries before reaching Redis Stream — durable signal loss (Redis writer lagging or Redis down longer than queue holds)"

Marconi invalidator errors — stale-route delivery risk

- alert: MarconiInvalidatorErrors
  expr: sum by (hook) (rate(marconi_invalidator_errors_total[5m])) > 0
  for: 5m
  annotations:
    summary: "Marconi invalidator hook {{ $labels.hook }} failing — routing cache may drift from session state"

Marconi PG archiver lag — audit pipeline backing up

- alert: MarconiPGArchiverLag
  expr: max by (tenant) (marconi_pg_archiver_lag_seconds) > 30
  for: 5m
  annotations:
    summary: "{{ $labels.tenant }}: Marconi PG archiver lag > 30s — cache-to-audit pipeline backing up"

Marconi hot-path p99 latency: target is < 5ms, alert fires above 50ms

- alert: MarconiHotPathP99Slow
  expr: |
    histogram_quantile(0.99,
      sum by (le, tenant) (rate(marconi_signal_send_duration_seconds_bucket[5m]))
    ) > 0.05
  for: 10m
  annotations:
    summary: "{{ $labels.tenant }}: Marconi prism_signal p99 > 50ms (target < 5ms) — hot-path slow"
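histogram_quantile estimates the quantile by locating the bucket that contains the requested rank and interpolating linearly inside it, so the reported p99 is only as precise as the bucket boundaries allow. The following is a simplified sketch of that estimation, assuming classic cumulative buckets, at least one finite bucket, and 0 < q <= 1; the bucket bounds in the usage note are made up for illustration and are not Prism’s configured buckets.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """buckets: (upper_bound, cumulative_count), sorted by bound, ending with +Inf."""
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for i, (bound, count) in enumerate(buckets):
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[i - 1][0]
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan
```

For example, with hypothetical buckets at 5ms, 50ms, and +Inf, a p99 that lands anywhere in the second bucket is interpolated between 5ms and 50ms. Judging the 5ms target precisely therefore depends on having bucket boundaries near that value.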

Scraper setup

Prometheus

scrape_configs:
  - job_name: prism
    static_configs:
      - targets: ['server1.home.lan:41765']
    metrics_path: /metrics
    # Per-target defaults are fine; scrape_interval can be 30s.

Grafana Cloud Agent

metrics:
  configs:
    - name: prism
      scrape_configs:
        - job_name: prism
          static_configs:
            - targets: ['server1.home.lan:41765']
          metrics_path: /metrics

curl (one-shot inspection)

curl -sL http://server1.home.lan:41765/metrics \
  | grep -E '^(signals_|controller_|system_signals_)'

Implementation notes

  • The backend uses OpenTelemetry’s Python SDK. Counters are create_counter instruments on a shared meter. Both a PrometheusMetricReader (writing to prometheus_client.REGISTRY) and, when configured, a PeriodicExportingMetricReader (OTLP gRPC) are attached. Instruments behave identically regardless of which readers are active; with no export path configured, recording is a silent no-op.
  • Metric emission is lazy-imported + try/except-wrapped in every service. A broken record_* call logs at DEBUG and returns; it never fails the caller. This runs counter to “fail fast” but matches the principle that observability bugs should never break the thing being observed.
  • Counter names + labels are load-bearing for dashboards. Changing them is a breaking change for anyone scraping us.
See backend/app/observability/metrics.py for the source of truth. Adding a counter is a 2-change diff: (1) new _meter.create_counter(...) at module level, (2) a record_X(...) helper to be called from services.
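As an illustration of the record_* pattern described in the notes above, here is a minimal sketch of a helper that never raises into its caller. It is not the contents of metrics.py: the logger name, the meter name, and the lazily cached _counter function are assumptions made for this example (the real module creates counters at module level), and only the public OpenTelemetry API calls get_meter, create_counter, and add are used.

```python
import logging
from functools import lru_cache

log = logging.getLogger("prism.metrics.sketch")  # hypothetical logger name


@lru_cache(maxsize=None)
def _counter(name: str):
    # Lazy import: OpenTelemetry is only touched on first use, so a missing
    # or broken install surfaces inside the try/except below.
    from opentelemetry import metrics

    return metrics.get_meter("prism.sketch").create_counter(name)


def record_signal_sent(pid: str, signal_type: str, outcome: str) -> None:
    """Hypothetical record_* helper: log at DEBUG on failure, never raise."""
    try:
        _counter("signals_sent_total").add(
            1, {"pid": pid, "signal_type": signal_type, "outcome": outcome}
        )
    except Exception:
        log.debug("metric emission failed", exc_info=True)
```

A service would call record_signal_sent("PID-PGR01", "nudge", "queued") after handling a request; if the import or the add call fails, the request path is unaffected and the failure is visible only at DEBUG level.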
Last modified on May 13, 2026