Metrics & Observability

Prism’s backend exports metrics in two ways:

Prometheus scrape endpoint at GET /metrics: always on, no configuration needed. Counters and gauges become visible only after they first record a value.
OTLP export to an OpenTelemetry collector: activated when OTEL_EXPORTER_OTLP_ENDPOINT is set. It runs alongside the Prometheus reader, and both readers see the same instruments.

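Every SPEC-030 and SPEC-034 counter carries at least a pid label, which keeps cardinality bounded and makes per-project queries trivial. Marconi metrics are labeled by tenant (or by hook for the cache invalidator) instead.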

Endpoint

GET http://<backend>/metrics

Deployment	URL
Personal-local	`http://127.0.0.1:8765/metrics`
LAN (e.g. server1)	`http://server1.home.lan:41765/metrics`
Cloud	`https://<cloud-host>/metrics`

Returns Prometheus text format. No auth today (the counters are operational, not data). The path is unversioned per Prometheus convention — operators scrape one URL regardless of /api/v1 changes.

The endpoint issues a 307 redirect from /metrics to /metrics/ — standard FastAPI mount behavior. Any normal Prometheus scraper follows the redirect; manual curl users want curl -L.

Counter catalog

SPEC-030 controller plane

Counter	Labels	What it means
`controller_elections_total`	`pid`, `winner_surface`	Master elections decided. Increments only when the call actually decides the master seat — joining as peer doesn’t count.
`controller_stream_events_total`	`pid`, `event_type`	ServerEvents pushed out through a CoordinationStream (SPEC-030 Phase 3). Event types: `master_preempted|lease_contention|approval_requested|state_change|heartbeat_ack|nudge_push`.
`controller_lease_grants_total`	`pid`, `resource_type`	Leases granted (successful grants only — contention/rejections live on stream_events).
`controller_approval_requests_total`	`pid`, `outcome`	Approval lifecycle events. Outcomes: `submitted|approved|rejected|timed_out`. Each approval flows through multiple outcome buckets over its life.
`controller_stream_drops_total`	`pid`	EventBus overflow drops — oldest event discarded to make room. Surfacing this helps operators notice when the event rate exceeds the bus capacity.
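
The drop-oldest behaviour behind `controller_stream_drops_total` can be pictured with a small bounded buffer. This is only a sketch: `BoundedEventBus`, its `capacity` argument, and the `drops` attribute are hypothetical names standing in for the real EventBus and counter, whose implementation is not shown in this page.

```python
from collections import deque


class BoundedEventBus:
    """Hypothetical bounded buffer that discards the oldest event on overflow."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._capacity = capacity
        self._events = deque()
        # Stand-in for controller_stream_drops_total (assumption, not the real counter).
        self.drops = 0

    def publish(self, event) -> None:
        # When full, evict the oldest event and count the drop before appending.
        if len(self._events) >= self._capacity:
            self._events.popleft()
            self.drops += 1
        self._events.append(event)

    def drain(self) -> list:
        # Hand back everything currently buffered, oldest first.
        items = list(self._events)
        self._events.clear()
        return items


bus = BoundedEventBus(capacity=2)
for name in ("state_change", "heartbeat_ack", "nudge_push"):
    bus.publish(name)
remaining = bus.drain()
```

In this sketch the third publish evicts the first event, so a non-zero drop count directly reflects producers outpacing the buffer.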

SPEC-030 gauges

Gauge	Labels	What it means
`controller_registrations_active`	`pid`	Currently-active controller registrations (`released_at IS NULL`). Observable via periodic polling of Postgres.

SPEC-034 agent-to-agent signals

Counter	Labels	What it means
`signals_sent_total`	`pid`, `signal_type`, `outcome`	Signals accepted by `POST /signal`. `signal_type` ∈ . `outcome` ∈ .
`signals_drained_on_startup_total`	`pid`, `to_identity`	Signals delivered via `prism_start`’s drain path (SPEC-034 §6.3). Increments by the number of rows drained per call.
`system_signals_emitted_total`	`pid`, `signal_type`, `target`	Controller-emitted system signals (not via the `prism_signal` verb). `signal_type` ∈ . `target` ∈ .

SPEC-101 Marconi (signal-mesh hot path + audit fan-out)

Marconi is the in-memory signal-mesh switch (SPEC-101). The legacy signals_sent_total counter remains for backward-compat; Marconi adds a parallel marconi_* namespace that disambiguates hot-path delivery from audit-pipeline durability. Every metric below MUST exist; the Stage 5 hot-path cutover required them in place before the flag flipped on server1 (2026-05-11). See Marconi for the architectural context.

Hot-path counters (per tenant)

Counter / gauge	Labels	What it means
`marconi_signals_received_total`	`tenant`, `signal_type`	At API entry — before routing lookup.
`marconi_signals_delivered_total`	`tenant`, `signal_type`, `outcome`	`outcome` ∈ {`pushed`, `queued_offline`, `no_subscriber`, `dropped`}. `pushed` means Marconi found a live WS handle and pushed; `queued_offline` means recipient had no live entry; `no_subscriber` means identity didn’t resolve; `dropped` is the loss-event terminal.
`marconi_signals_acked_total`	`tenant`, `signal_type`, `ack_kind`	Final-delivery evidence. `ack_kind` ∈ .
`marconi_signal_send_duration_seconds`	`tenant`	Histogram. p50/p95/p99 of `prism_signal` end-to-end latency. v0.4 target: p99 `< 5ms` on the same-instance hot path.
`marconi_routing_table_size`	`tenant`	Gauge — count of `(tenant_id, project_id, identity)` entries in Marconi’s routing table.
`marconi_routing_cache_hits_total` / `marconi_routing_cache_misses_total`	`tenant`	Routing lookup cache hits vs SessionStore-fallback misses. Steady-state hit-rate ≥ 99%.
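
The hit-rate target above is a ratio of the two routing-cache counters. A minimal sketch of that arithmetic follows; the function names and the sample counts are illustrative assumptions, not values from a real deployment.

```python
from typing import Optional


def cache_hit_rate(hits: int, misses: int) -> Optional[float]:
    """Return hits / (hits + misses), or None when there were no lookups."""
    total = hits + misses
    if total == 0:
        return None
    return hits / total


def meets_target(hits: int, misses: int, target: float = 0.99) -> bool:
    """True when the observed hit rate is at or above the steady-state target."""
    rate = cache_hit_rate(hits, misses)
    return rate is not None and rate >= target


healthy = meets_target(hits=990, misses=10)
degraded = meets_target(hits=980, misses=20)
```

In practice the inputs would be counter increases over the same time window rather than lifetime totals, so that an old warm-up period does not mask a recent drop.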

Audit queue + fan-out counters

Counter / gauge	Labels	What it means
`marconi_audit_queue_depth`	`tenant`	Gauge — in-memory audit queue depth. Sized per `max_entries`.
`marconi_audit_queue_overwrite_total`	`tenant`	Counter — entries overwritten before reaching Redis Stream. Loss event. Non-zero is a paging incident under v0.4 loss budget.
`marconi_redis_writer_lag_seconds`	`tenant`	Gauge — Redis Stream writer lag from audit-queue head.
`marconi_redis_writer_errors_total`	`tenant`, `reason`	Counter — Redis writer failures.
`marconi_pg_archiver_lag_seconds`	`tenant`	Gauge — PG archiver lag from Marconi Cache. Near-zero in steady state (“once it hits cache it immediately goes to PG”).
`marconi_pg_archiver_errors_total`	`tenant`, `reason`	Counter — PG archiver failures.

Cache-invalidator counters (Stage 2)

Counter	Labels	What it means
`marconi_invalidator_calls_total`	`hook`	`hook` ∈ . Tracks every direct write-through hook firing.
`marconi_invalidator_errors_total`	`hook`, `reason`	Non-fatal hook errors. Non-zero in steady state is a paging incident — invalidation gaps cause stale-route delivery.

Obligation counters

Counter / gauge	Labels	What it means
`marconi_obligations_open`	`tenant`, `kind`	Gauge — open obligations awaiting ack or terminal.
`marconi_obligations_durable_total`	`tenant`, `kind`	Counter — obligations whose envelope reached PG audit.
`marconi_obligations_degraded_not_durable_total`	`tenant`, `kind`	Counter — obligations delivered but not yet durably persisted (recoverable from upstream tier).
`marconi_obligation_sla_violation_total`	`tenant`, `kind`, `sla`	Counter — SLA breach (`ack_sla_seconds` or `terminal_sla_seconds` exceeded).

Unknown-recipient rejections

Counter	Labels	What it means
`unknown_recipient_rejections_total`	`tenant`, `signal_type`	Increments on every `publish_path=rejected_unknown`. Surfaces typo’d identities and stale routing without log inspection. SPEC-071 §5.

Label conventions

pid — project identifier like PID-PGR01. Low cardinality (typically < 100 per install). In hot paths where only a UUID is available at metric-emit time, we fall back to project_id[:8] shorthand — still bounded, operators can grep for the prefix.
signal_type — the §5.2 type string literally. For broadcasts (to="*"), the value is broadcast to distinguish from targeted types with the same payload.
outcome — disjoint per-event terminal state. A single POST /signal lands in exactly one outcome bucket.
winner_surface / agent_surface — claude_desktop, claude_code, codex, cursor, or other. Bounded.
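
The conventions above could be applied at emit time roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, and mapping any unlisted surface to `other` is a guess at how the bounded set is enforced.

```python
from typing import Optional

# Surfaces named in the label conventions; anything else is folded into "other" (assumption).
KNOWN_SURFACES = frozenset({"claude_desktop", "claude_code", "codex", "cursor"})


def pid_label(pid: Optional[str], project_id: str) -> str:
    """Prefer the project identifier; fall back to the first 8 chars of the UUID."""
    if pid:
        return pid
    return project_id[:8]


def signal_type_label(signal_type: str, to: str) -> str:
    """Broadcasts (to == "*") are labeled 'broadcast' regardless of the type string."""
    return "broadcast" if to == "*" else signal_type


def surface_label(surface: str) -> str:
    """Keep the label set bounded by collapsing unknown surfaces."""
    return surface if surface in KNOWN_SURFACES else "other"


labels = {
    "pid": pid_label(None, "3f2a9c1e-7b44-4c1d-9a10-0123456789ab"),
    "signal_type": signal_type_label("nudge", to="*"),
    "winner_surface": surface_label("some_new_editor"),
}
```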

Suggested Prometheus queries

sum by (pid) (rate(signals_sent_total[5m]))

Suggested alerts (baseline)

Alert thresholds are deployment-specific; these are starting points for personal / small-team installs.

Stream drops — anything > 0 is bad

- alert: PrismStreamDrops
  expr: rate(controller_stream_drops_total[5m]) > 0
  for: 2m
  annotations:
    summary: "Prism EventBus dropped events — bus capacity may be undersized"

Election churn — more than one election per hour per PID suggests instability

- alert: PrismElectionChurn
  expr: sum by (pid) (increase(controller_elections_total[1h])) > 1
  for: 10m
  annotations:
    summary: "Controller election churn on {{ $labels.pid }} — master may be flapping"
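
In PromQL, rate() yields a per-second average over the range while increase() yields the total growth over it, so for the same range the two differ by a factor of the window length. The arithmetic below sketches what a threshold of 1 means under each; the helper names are illustrative, and Prometheus's extrapolation at window edges is ignored.

```python
WINDOW_SECONDS = 3600  # the [1h] range used by the election-churn rule


def approx_rate(increase: float, window_seconds: int = WINDOW_SECONDS) -> float:
    """Per-second average implied by a total increase over the window."""
    return increase / window_seconds


def approx_increase(rate: float, window_seconds: int = WINDOW_SECONDS) -> float:
    """Total increase implied by a per-second average over the window."""
    return rate * window_seconds


# Two elections in an hour is already "more than one per hour"...
two_per_hour_as_rate = approx_rate(2)
# ...whereas a per-second rate of 1 corresponds to 3600 elections in the hour.
rate_of_one_as_hourly_count = approx_increase(1.0)
```

A per-hour threshold therefore has to be compared against the windowed increase, or against a rate scaled by 3600.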

Approval timeouts — any approval timing out warrants attention

- alert: PrismApprovalTimeout
  expr: rate(controller_approval_requests_total{outcome="timed_out"}[10m]) > 0
  annotations:
    summary: "Approval request timed out on {{ $labels.pid }} — decider offline?"

Signal drop-rate — high queued/delivered ratio

- alert: PrismSignalsMostlyQueueing
  expr: |
    sum by (pid) (rate(signals_sent_total{outcome="queued"}[15m]))
      /
    sum by (pid) (rate(signals_sent_total[15m])) > 0.5
  for: 15m
  annotations:
    summary: "{{ $labels.pid }}: >50% of signals are queueing instead of delivering (target offline?)"

Marconi loss event — any audit-queue overwrite is a paging incident

- alert: MarconiAuditQueueOverwrite
  expr: sum by (tenant) (rate(marconi_audit_queue_overwrite_total[5m])) > 0
  for: 1m
  annotations:
    summary: "{{ $labels.tenant }}: Marconi audit queue overwrote entries before reaching Redis Stream — durable signal loss (Redis writer lagging or Redis down longer than queue holds)"

Marconi invalidator errors — stale-route delivery risk

- alert: MarconiInvalidatorErrors
  expr: sum by (hook) (rate(marconi_invalidator_errors_total[5m])) > 0
  for: 5m
  annotations:
    summary: "Marconi invalidator hook {{ $labels.hook }} failing — routing cache may drift from session state"

Marconi PG archiver lag — audit pipeline backing up

- alert: MarconiPGArchiverLag
  expr: max by (tenant) (marconi_pg_archiver_lag_seconds) > 30
  for: 5m
  annotations:
    summary: "{{ $labels.tenant }}: Marconi PG archiver lag > 30s — cache-to-audit pipeline backing up"

Marconi hot-path p99 latency — should be < 5ms

- alert: MarconiHotPathP99Slow
  expr: |
    histogram_quantile(0.99,
      sum by (le, tenant) (rate(marconi_signal_send_duration_seconds_bucket[5m]))
    ) > 0.05
  for: 10m
  annotations:
    summary: "{{ $labels.tenant }}: Marconi prism_signal p99 > 50ms (target < 5ms) — hot-path slow"
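
For readers unfamiliar with how a p99 is derived from cumulative histogram buckets, here is a simplified Python sketch of the linear-interpolation idea behind histogram_quantile. The bucket boundaries and counts are made-up illustration values, not Prism's actual bucket layout, and the function omits several edge cases that the real PromQL function handles.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    The list must include a +Inf bucket, as Prometheus histograms do.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    ordered = sorted(buckets)
    if len(ordered) < 2 or not math.isinf(ordered[-1][0]):
        return math.nan
    total = ordered[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in ordered:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return ordered[-2][0]
            # Interpolate linearly inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan


# Hypothetical cumulative counts for send durations, in seconds.
sample = [(0.001, 50), (0.005, 90), (0.05, 99), (math.inf, 100)]
p50 = histogram_quantile(0.5, sample)
p99 = histogram_quantile(0.99, sample)
```

Because the estimate is interpolated within a bucket, its precision is limited by the bucket boundaries: a 5 ms target can only be checked accurately if a boundary sits at or near 0.005.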

Scraper setup

Prometheus

scrape_configs:
  - job_name: prism
    static_configs:
      - targets: ['server1.home.lan:41765']
    metrics_path: /metrics
    # Per-target defaults are fine; scrape_interval can be 30s.

Grafana Cloud Agent

metrics:
  configs:
    - name: prism
      scrape_configs:
        - job_name: prism
          static_configs:
            - targets: ['server1.home.lan:41765']
          metrics_path: /metrics

curl (one-shot inspection)

curl -sL http://server1.home.lan:41765/metrics \
  | grep -E '^(signals_|controller_|system_signals_)'
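
The same one-shot inspection can be done from Python using only the standard library. This is a sketch, not a supported client: `fetch_metrics` and `filter_metric_lines` are hypothetical helper names, and the URL is the LAN example from the table above. `urllib.request.urlopen` follows the 307 redirect for GET requests by default.

```python
from urllib.request import urlopen

# Same prefixes as the grep pattern in the curl example.
PREFIXES = ("signals_", "controller_", "system_signals_")


def fetch_metrics(url: str, timeout: float = 5.0) -> str:
    """Fetch the Prometheus text exposition; redirects are followed automatically."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def filter_metric_lines(text: str, prefixes: tuple = PREFIXES) -> list:
    """Keep only sample lines whose metric name starts with one of the prefixes."""
    return [line for line in text.splitlines() if line.startswith(prefixes)]


if __name__ == "__main__":
    body = fetch_metrics("http://server1.home.lan:41765/metrics")
    for line in filter_metric_lines(body):
        print(line)
```

Like the grep pattern, the prefix match skips `# HELP` and `# TYPE` comment lines and any metric outside the listed namespaces.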

Implementation notes

The backend uses OpenTelemetry’s Python SDK. Counters are create_counter instruments on a shared meter. Both a PrometheusMetricReader (writing to prometheus_client.REGISTRY) and, when configured, a PeriodicExportingMetricReader (OTLP gRPC) are attached. Instruments behave identically regardless of which readers are active; with no export path attached, recording is a silent no-op.
Metric emission is lazy-imported and wrapped in try/except in every service. A broken record_* call logs at DEBUG and returns; it never fails the caller. This runs counter to “fail fast” but matches the principle that observability bugs should never break the thing being observed.
Counter names and labels are load-bearing for dashboards. Changing them is a breaking change for anyone scraping the endpoint.

See backend/app/observability/metrics.py for the source of truth. Adding a counter is a 2-change diff: (1) new _meter.create_counter(...) at module level, (2) a record_X(...) helper to be called from services.
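
As a rough illustration of that two-change pattern combined with the never-fail-the-caller rule, consider the sketch below. It is not the contents of metrics.py: the counter name `widgets_processed`, the meter name, the `_NoopCounter` fallback, and the `counter` override argument are all hypothetical and exist only so the example runs on its own, with or without OpenTelemetry installed.

```python
import logging

logger = logging.getLogger("prism.observability.sketch")


class _NoopCounter:
    """Fallback used when OpenTelemetry is unavailable (assumption for this sketch)."""

    def add(self, amount, attributes=None):
        pass


def _make_counter(name: str, description: str):
    try:
        # Lazy import so a missing or broken dependency cannot break the service.
        from opentelemetry import metrics

        return metrics.get_meter("prism.sketch").create_counter(
            name, description=description
        )
    except Exception:
        logger.debug("metrics unavailable; using no-op counter", exc_info=True)
        return _NoopCounter()


# Change (1): a new module-level instrument.
_widgets_processed = _make_counter("widgets_processed", "Hypothetical example counter")


# Change (2): a record_* helper that services call.
def record_widget_processed(pid: str, outcome: str, counter=None) -> None:
    target = counter if counter is not None else _widgets_processed
    try:
        target.add(1, {"pid": pid, "outcome": outcome})
    except Exception:
        # Observability failures are logged at DEBUG and swallowed.
        logger.debug("metric emit failed", exc_info=True)


record_widget_processed("PID-PGR01", "ok")
```

The `counter` parameter is only a testing seam for this sketch; it lets a caller inject a fake instrument and confirm that a failing `add` never propagates.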
