Status: accepted · Version 0.1 · Filed 2026-05-01

SPEC-064 v0.1 — OTel Collector & Telemetry Surface

Status: draft · Version: 0.1 · Authors: Porsche (Claude Code, mini3), Frank Tewksbury · Date: 2026-05-01 · Supersedes: none (net-new)

1. Summary

Add a single shared OpenTelemetry Collector container (gateway pattern) to every Prism deployment profile — dev, personal, server. All Prism services that already hold the OTel SDK (backend HTTP, backend gRPC, session-manager) point OTEL_EXPORTER_OTLP_ENDPOINT at the in-network collector at http://otelcol:4317. The collector batches traces / metrics / logs and writes them as JSON-lines to a named volume mounted at /var/lib/otelcol/ inside the container; the host volume is the local dump directory operators can tarball and ship later when log shipping returns. The dashboard probes the collector like any other service (topology + health), scrapes its self-monitoring endpoint at :8888/metrics, exposes a new Telemetry page showing pipeline state and dump file inventory, and ships a Mintlify “Operations Docs” link from the Overview page header.

2. Goals

  • Stop dropping spans. Today every span the backend creates is discarded — OTEL_EXPORTER_OTLP_ENDPOINT is unset everywhere. After SPEC-064 every span lands in a queryable file on disk.
  • One shared collector, not per-service sidecars. Sidecar pattern is wrong shape for docker-compose; a single gateway is canonical and saves ~750 MB RAM in this stack.
  • Default-on, no operator action required. Frank’s mandate: “yes I want the collector”. Personal install gets it; server install gets it; no opt-in flag.
  • File exporter is the default destination. Subsumes the deferred log-shipping work — operators tarball the dump dir to ship later. When SaaS endpoints come back into scope, swap one exporter; pipeline stays the same. Aligns with feedback_same_code_everywhere.
  • Backend tolerates the collector being down. OTLP exporter retries internally and drops on ceiling. Collector outage degrades observability, not availability.
  • Dashboard surface is real. Telemetry page shows pipeline health, ingestion rates, exporter state, dump file inventory with download. Topology shows the collector as a node. Service probe inventory includes it.

3. Non-goals

  • No remote / SaaS exporter targets in v0.1 (operator can swap; default ships file exporter).
  • No log receiver from container stdout in v0.1 (only OTLP-in). Container-log ingestion is a v0.2 follow-up if logs prove valuable.
  • No tail-sampling / probabilistic sampling. Backend volume is low (single-operator); we capture everything.
  • No Tempo / Jaeger / Loki / Grafana service. File on disk is the only sink in v0.1.
  • No exporter-target picker UI. v0.1 ships one config; swap is a compose edit + redeploy.
  • No “destructive ops” against the collector from the dashboard (start/stop/clear). Read-only surface; lifecycle through prism_upgrade_lan.

4. Architecture

                              ┌──────────────────────────┐
backend (HTTP)   OTLP/grpc ──>│                          │
backend (gRPC)   OTLP/grpc ──>│   otelcol (gateway)      │── file/traces.jsonl
session-manager  OTLP/grpc ──>│                          │── file/metrics.jsonl
                              │   :4317 OTLP gRPC        │── file/logs.jsonl
                              │   :4318 OTLP HTTP        │
                              │   :8888 self /metrics    │
                              │   :13133 health_check    │
                              └──────────────────────────┘

                                         │ scrape :8888 + probe :13133
                                  dashboard service
  • Image: otel/opentelemetry-collector-contrib:0.116.0 (pinned). Contrib distro is required for the file exporter.
  • Network: in-compose only. No host-port published in personal mode. Server mode publishes nothing externally either — operators only reach the collector via the dashboard’s Telemetry page.
  • Volume: named volume prism_{profile}_otel_data mounted at /var/lib/otelcol/. Survives container restarts; included in prism_backup (already volume-driven).
  • Config file: otelcol/config.yaml mounted as a read-only file from the repo. Cleaner upgrade story (no rebuild on config tweak).

5. Components

5.1 otelcol container

In every compose profile:
otelcol:
  image: otel/opentelemetry-collector-contrib:0.116.0
  container_name: prism-{profile}-otelcol
  restart: unless-stopped
  command: ["--config=/etc/otelcol/config.yaml"]
  volumes:
    - ./otelcol/config.yaml:/etc/otelcol/config.yaml:ro
    - prism_{profile}_otel_data:/var/lib/otelcol
  healthcheck:
    test: ["CMD", "wget", "-qO-", "http://127.0.0.1:13133/"]
    interval: 10s
    timeout: 3s
    retries: 3
    start_period: 10s
No published ports. The dashboard reaches it via the compose network DNS name otelcol.

5.2 otelcol config

  • Receivers: otlp with both gRPC (:4317) and HTTP (:4318) endpoints.
  • Processors: batch (5s timeout, 512 batch size), attributes/prism (sets deployment.environment from ${env:PRISM_ENV}).
  • Exporters: file/traces, file/metrics, file/logs writing to /var/lib/otelcol/{traces,metrics,logs}.jsonl with rotation (50 MB max per file, 5 backups, 7 days).
  • Extensions: health_check (:13133), pprof (:1777, internal), zpages (:55679, internal).
  • Service.telemetry: self-metrics on :8888, info-level logs to stdout.
  • Pipelines: three independent — traces, metrics, logs — each receivers→processors→exporters.

5.3 Backend OTLP wiring

Set in every compose profile, on every service that currently calls setup_telemetry:
environment:
  OTEL_EXPORTER_OTLP_ENDPOINT: http://otelcol:4317
  OTEL_EXPORTER_OTLP_PROTOCOL: grpc
  PRISM_ENV: ${PRISM_MODE:-personal}
Services touched: backend, backend-grpc, and session-manager (server only). Including session-manager closes a small gap: it doesn’t currently call setup_telemetry and will start doing so (see §5.7).
depends_on ordering: backend services depend on otelcol with condition service_started, not service_healthy. A slow collector start must not block the backend, and the backend tolerates the collector being down (§2, §7).

5.4 Dashboard probe + topology

dashboard/src/probes/registry.ts adds:
{
  containerName: `${prefix}-otelcol`,
  serviceName: 'otelcol',
  healthUrl: 'http://otelcol:13133/',
  critical: false,  // observability outage is degraded, not unhealthy
}
dashboard/src/topology/builder.ts adds backbone edges from each emitter (backend, backend-grpc, session-manager) to svc:otelcol so topology shows the telemetry data path.
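
The effect of critical: false on the overall stack status can be sketched as a small roll-up function. This is a minimal illustration, not the dashboard's actual code: the ProbeResult shape, the StackStatus names, and rollUpStackStatus are hypothetical, and only the critical flag comes from the registry entry above.

```typescript
// Hypothetical probe result; only `critical` mirrors the registry entry.
type ProbeResult = { serviceName: string; critical: boolean; up: boolean };
type StackStatus = 'healthy' | 'degraded' | 'unhealthy';

// Roll individual probe results up into one stack-level status.
function rollUpStackStatus(probes: ProbeResult[]): StackStatus {
  let rolled: StackStatus = 'healthy';
  for (const probe of probes) {
    if (probe.up) continue;
    if (probe.critical) return 'unhealthy'; // any critical outage wins
    rolled = 'degraded'; // a non-critical outage only degrades the stack
  }
  return rolled;
}
```

With the backend up and otelcol down, this returns 'degraded'; it returns 'unhealthy' only when a critical service is down.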

5.5 Dashboard Telemetry page

New route /telemetry. Shows:
  • Collector status card — up/down, version (from :13133/), uptime.
  • Pipeline ingestion — spans/sec, log records/sec, metric points/sec from the collector’s own self-metrics (otelcol_processor_batch_batch_send_size, otelcol_receiver_accepted_spans, etc.).
  • Exporter status — destination path, bytes written (cumulative), errors. From otelcol_exporter_sent_* self-metrics.
  • Dump file inventory — list of /var/lib/otelcol/*.jsonl{,.<n>} files with size + mtime; “Download tarball” button.
  • Pipeline diagram — a static SVG showing receivers → processors → exporters wired as configured.
The page subscribes to a new SSE channel telemetry (see §5.6) for live updates.
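
The dump file inventory and the size bound on the tarball download (§5.6) could be handled by two small pure functions. The sketch below assumes that modifiedAt is an ISO-8601 string and that the bound is enforced by taking the newest files until the byte budget would be exceeded; neither is stated in this spec, and the function names are hypothetical.

```typescript
// Field names follow the dumpFiles entries of the telemetry API.
type DumpFile = { name: string; sizeBytes: number; modifiedAt: string };

// Matches traces.jsonl, traces.jsonl.1, ... per the *.jsonl{,.<n>} pattern.
const DUMP_FILE_PATTERN = /^[^/]+\.jsonl(\.\d+)?$/;

// Keep only dump files, newest first (assumes ISO-8601 timestamps).
function listDumpFiles(entries: DumpFile[]): DumpFile[] {
  return entries
    .filter((f) => DUMP_FILE_PATTERN.test(f.name))
    .sort((a, b) => Date.parse(b.modifiedAt) - Date.parse(a.modifiedAt));
}

// Assumed policy: take newest files until the next one would exceed the budget.
function selectWithinBudget(files: DumpFile[], budgetBytes: number): DumpFile[] {
  const picked: DumpFile[] = [];
  let usedBytes = 0;
  for (const f of files) {
    if (usedBytes + f.sizeBytes > budgetBytes) break;
    picked.push(f);
    usedBytes += f.sizeBytes;
  }
  return picked;
}
```

Stopping at the first file that does not fit keeps the tarball a contiguous "most recent" window. Skipping oversized files and continuing would be an equally valid policy.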

5.6 Dashboard backend telemetry routes

  • GET /dashboard/api/telemetry — returns { status, version, uptimeSeconds, ingestion: { spansPerSec, logsPerSec, metricsPerSec }, exporter: { destination, bytesWritten, errors }, dumpFiles: [{ name, sizeBytes, modifiedAt }] }. Read-only, session-gated.
  • GET /dashboard/api/telemetry/dump.tar.gz — streams a gzipped tar of the dump directory. Session-gated. Implementation: read dump files from a read-only volume mount on the dashboard container, stream via tar -czf - piped through Express response. Bounded to ~250 MB to avoid runaway download.
  • New OtelcolScraper (sibling of MetricsScraper) polls http://otelcol:8888/metrics every 10s, parses Prometheus text, computes deltas, broadcasts on the SSE telemetry channel.
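
The scrape, parse, and delta steps of OtelcolScraper can be sketched with two pure functions. This is a simplified illustration under stated assumptions: it sums all samples of a metric regardless of labels, ignores timestamps and escaping inside label values, and treats a decreasing counter as a collector restart. The function names are hypothetical, and the exact self-metric names exposed by the pinned collector version should be checked against a real scrape.

```typescript
// Sum all samples of each metric from Prometheus text exposition output.
function sumSamplesByName(text: string): Record<string, number> {
  const totals: Record<string, number> = {};
  const sampleLine = /^([A-Za-z_:][A-Za-z0-9_:]*)(\{.*\})?\s+(\S+)/;
  for (const rawLine of text.split('\n')) {
    const line = rawLine.trim();
    if (line === '' || line.charAt(0) === '#') continue; // skip HELP/TYPE comments
    const m = sampleLine.exec(line);
    if (m === null) continue;
    const metricName = String(m[1]);
    const value = Number(m[3]);
    if (!Number.isFinite(value)) continue; // ignore NaN / +Inf samples
    totals[metricName] = (totals[metricName] ?? 0) + value;
  }
  return totals;
}

// Per-second rate between two scrapes of a cumulative counter.
function counterRatePerSec(prev: number | undefined, curr: number, intervalMs: number): number {
  if (prev === undefined || intervalMs <= 0) return 0; // first scrape: no rate yet
  const delta = curr >= prev ? curr - prev : curr; // counter reset: count from zero
  return delta / (intervalMs / 1000);
}
```

In a polling loop, the totals from the previous scrape would be kept in memory and passed as prev, with intervalMs taken from the configured scrape cadence (10000 by default).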

5.7 Backend session-manager telemetry call

backend/app/services/session_manager/standalone.py adds the same setup_telemetry(service_name="prism-session-manager") call already used by main.py and grpc_server.py. This is a tiny gap-fix, included here because the OTLP wiring is moot for SM if it never registers a TracerProvider.

5.8 Operations Docs link

Add an “Operations Docs” link in the dashboard TopBar pointing at the Mintlify ops dashboard URL. Single link, header-style placement, opens in a new tab. The exact URL is operator-configurable via the PRISM_OPS_DOCS_URL env var on the dashboard container; it defaults to https://docs.prism.local/operations (placeholder until the Mintlify URL is finalized).

6. Configuration

New env vars:
| Var | Service | Default | Purpose |
| --- | --- | --- | --- |
| OTEL_EXPORTER_OTLP_ENDPOINT | backend, backend-grpc, session-manager | http://otelcol:4317 | OTLP gRPC target |
| OTEL_EXPORTER_OTLP_PROTOCOL | same | grpc | OTLP transport |
| PRISM_ENV | otelcol, services | ${PRISM_MODE} | Tagged on every span/metric/log |
| PRISM_DASHBOARD_OTELCOL_URL | dashboard | http://otelcol:8888/metrics | otelcol self-metrics endpoint |
| PRISM_DASHBOARD_OTELCOL_HEALTH_URL | dashboard | http://otelcol:13133/ | health_check endpoint |
| PRISM_DASHBOARD_TELEMETRY_INTERVAL_MS | dashboard | 10000 | self-metrics scrape cadence |
| PRISM_DASHBOARD_OTEL_DUMP_PATH | dashboard | /var/lib/otelcol | RO mount to list dump files |
| PRISM_OPS_DOCS_URL | dashboard | https://docs.prism.local/operations | Mintlify link |
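
One way the dashboard could resolve its five variables with the defaults listed above is sketched here. TelemetryConfig and loadTelemetryConfig are hypothetical names, and falling back to the default when the interval is missing, non-numeric, or non-positive is an assumption rather than specified behaviour.

```typescript
// Hypothetical config shape for the dashboard's telemetry features.
type TelemetryConfig = {
  metricsUrl: string;
  healthUrl: string;
  intervalMs: number;
  dumpPath: string;
  opsDocsUrl: string;
};

// Resolve dashboard telemetry settings from an env-like object.
function loadTelemetryConfig(env: Record<string, string | undefined>): TelemetryConfig {
  const interval = Number(env.PRISM_DASHBOARD_TELEMETRY_INTERVAL_MS);
  return {
    metricsUrl: env.PRISM_DASHBOARD_OTELCOL_URL || 'http://otelcol:8888/metrics',
    healthUrl: env.PRISM_DASHBOARD_OTELCOL_HEALTH_URL || 'http://otelcol:13133/',
    // Assumed: invalid or non-positive values fall back to the 10 s default.
    intervalMs: Number.isFinite(interval) && interval > 0 ? interval : 10000,
    dumpPath: env.PRISM_DASHBOARD_OTEL_DUMP_PATH || '/var/lib/otelcol',
    opsDocsUrl: env.PRISM_OPS_DOCS_URL || 'https://docs.prism.local/operations',
  };
}
```

In the dashboard process this would typically be called once at startup with process.env.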

7. Failure modes

  • Collector container down: backend’s OTLP exporter retries with exponential backoff, eventually drops. Backend health unaffected. Dashboard topology shows otelcol node as unreachable; Telemetry page shows “Collector offline”. Critical=false on the probe → overall stack stays “degraded”, not “unhealthy”.
  • Disk full at /var/lib/otelcol: file exporter logs errors. Self-metrics surface otelcol_exporter_send_failed_*. Telemetry page shows error count. The operator trims the dump directory manually, either through a prism_upgrade_lan follow-up or by shelling in.
  • Collector misconfigured at start: the container exits on startup and restart: unless-stopped puts it in a restart loop. prism_upgrade_lan deploy verification asserts collector health via :13133 before declaring success, so a bad config rolls back the deployment.
  • Backend → collector network partition (compose-internal): Same as collector down. Exporter retries.
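
The "retry with exponential backoff, then drop" behaviour in the first and last cases can be illustrated with a small schedule calculator. This is a generic sketch, not the OTel SDK's implementation: the function name and the initial delay, growth factor, and ceiling used in the example are hypothetical and do not describe the SDK's actual defaults.

```typescript
// Delays (ms) an exporter would wait between attempts before giving up.
function backoffDelaysMs(initialMs: number, factor: number, ceilingMs: number): number[] {
  if (initialMs <= 0 || factor < 1) return []; // guard against a non-terminating loop
  const delays: number[] = [];
  let nextDelay = initialMs;
  let elapsedMs = 0;
  while (elapsedMs + nextDelay <= ceilingMs) {
    delays.push(nextDelay);
    elapsedMs += nextDelay;
    nextDelay *= factor;
  }
  return delays; // once the schedule is exhausted, the batch is dropped
}

// Example: 1 s initial delay, doubling, 64 s total ceiling.
const exampleSchedule = backoffDelaysMs(1000, 2, 64000);
// exampleSchedule is [1000, 2000, 4000, 8000, 16000, 32000]
```

The point for this spec is that the schedule is finite: a prolonged collector outage costs dropped telemetry, not unbounded memory or blocked requests in the backend.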

8. Migration path

  • Personal install (mini3, fresh): prism install_local brings up the new compose with otelcol included. No state migration; new named volume created.
  • Server install (server1, existing): prism_upgrade_lan rsyncs the compose + config, runs docker compose up -d which creates the new prism-server-otelcol container alongside everything else. Backend services restart to pick up the new env vars (this disrupts active controllers — they re-register on next heartbeat). Total disruption: ~30s.
  • Backend image rebuild: not required. SDK is already in the image.
  • Schema migration: none. No DB tables added.

9. Testing

9.1 Unit / integration

  • backend/tests/test_observability_otel_setup.py — asserts setup_telemetry reads OTEL_EXPORTER_OTLP_ENDPOINT and configures OTLP path; asserts no-op when env is unset.
  • dashboard/tests/probes_registry.test.ts (new vitest) — asserts otelcol appears in inventory with critical=false, healthUrl present.
  • dashboard/tests/api_telemetry.test.ts — integration with mock otelcol; asserts route shape, 503 on collector unreachable.

9.2 Smoke

  • scripts/smoke_otelcol.sh — boots the personal compose stack, sends a prism_signal via the deployed backend’s MCP path, polls /var/lib/otelcol/traces.jsonl until a span with service.name=prism-backend-http appears (60s ceiling). Asserts the JSON record contains the operation name and resource attributes.
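
The core assertion of the smoke test, finding a span for a given service.name in traces.jsonl, can be sketched as a pure function over the file contents. It assumes each line is one OTLP JSON document in the standard resourceSpans → scopeSpans → spans layout, which is how the file exporter writes JSON by default. The function names and the prism_signal span name in the usage are hypothetical.

```typescript
type OtlpAttr = { key: string; value?: { stringValue?: string } };
type OtlpTraceLine = {
  resourceSpans?: {
    resource?: { attributes?: OtlpAttr[] };
    scopeSpans?: { spans?: { name?: string }[] }[];
  }[];
};

// Parse one JSONL line; returns null for partial or non-object lines.
function parseTraceLine(line: string): OtlpTraceLine | null {
  try {
    const parsed: unknown = JSON.parse(line);
    return typeof parsed === 'object' && parsed !== null ? (parsed as OtlpTraceLine) : null;
  } catch {
    return null; // e.g. a line the exporter is still writing
  }
}

// Collect span names whose resource carries the given service.name.
function spanNamesForService(jsonl: string, serviceName: string): string[] {
  const found: string[] = [];
  for (const line of jsonl.split('\n')) {
    const doc = parseTraceLine(line);
    if (doc === null) continue;
    for (const rs of doc.resourceSpans ?? []) {
      const attrs = rs.resource?.attributes ?? [];
      const matches = attrs.some(
        (a) => a.key === 'service.name' && a.value?.stringValue === serviceName,
      );
      if (!matches) continue;
      for (const ss of rs.scopeSpans ?? []) {
        for (const span of ss.spans ?? []) {
          if (span.name !== undefined) found.push(span.name);
        }
      }
    }
  }
  return found;
}
```

A polling loop would re-read the file and call spanNamesForService(contents, 'prism-backend-http') until the result is non-empty or the 60s ceiling is reached.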

9.3 Deploy verification

prism_upgrade_lan post-deploy checks (in order):
  1. docker ps shows otelcol running and healthy.
  2. curl -fsS http://otelcol:13133/ returns 200 (run inside the dashboard container).
  3. curl -fsS http://otelcol:8888/metrics | grep -q otelcol_receiver_accepted_spans returns true.
  4. Existing checks (backend health, db_health, dashboard /health) all pass.
Failure of any step fails the deploy and surfaces in the verb response.

10. Security

  • No exposed ports outside the compose network. Both gRPC (4317) and HTTP (4318) are docker-network-only. Self-metrics (8888), health (13133), pprof (1777), zpages (55679) all internal. The dashboard is the only externally-reachable surface that touches any of them.
  • Read-only dump access: dashboard mounts /var/lib/otelcol as :ro. The dump tarball endpoint is session-gated and bounded.
  • No secrets in spans: backend instrumentation already redacts known secret-bearing fields. SPEC-064 doesn’t change instrumentation; if a redaction gap exists today it persists. Tracked separately if discovered.
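  • Volume retention: file exporter rotates and keeps at most 5 backups × 50 MB = 250 MB of rotated files per signal (traces / metrics / logs), plus the active file of up to 50 MB. That bounds disk use at roughly 300 MB per signal, or about 900 MB across all three signals.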
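
The retention arithmetic can be written out as a short worked calculation. It assumes the rotation settings from §5.2 (50 MB per file, 5 backups) and that the active file counts toward disk use in addition to the rotated backups; the function name is hypothetical.

```typescript
// Upper bound (in MB) on disk used by one signal's file set.
function rotationCeilingMB(maxFileMB: number, maxBackups: number, countActiveFile: boolean): number {
  const fileCount = maxBackups + (countActiveFile ? 1 : 0);
  return maxFileMB * fileCount;
}

const SIGNAL_NAMES = ['traces', 'metrics', 'logs'];
const rotatedOnlyMB = rotationCeilingMB(50, 5, false);        // 250: backups only
const perSignalMB = rotationCeilingMB(50, 5, true);           // 300: backups + active file
const allSignalsMB = perSignalMB * SIGNAL_NAMES.length;       // 900: three signals
```

The 7-day age limit can only lower these figures, since it removes backups earlier than the count limit would.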

11. Acceptance criteria

  • otelcol container runs in dev, personal, server compose profiles.
  • Backend HTTP, backend-grpc, session-manager all set OTLP env and emit traces.
  • After a single prism_signal round-trip, a span lands in traces.jsonl (smoke passes).
  • Dashboard topology shows otelcol node with edges from backend / backend-grpc / session-manager.
  • Dashboard /telemetry page renders with live collector status + dump file list.
  • Dashboard “Operations Docs” link visible from Overview header.
  • prism_upgrade_lan deploys to server1 and post-deploy verification passes.
  • PR merged to main; SPEC-064 status moves to accepted on merge.

12. Out-of-scope / future work

  • v0.2: container-stdout log receiver (filelog or docker_container) so operator logs unify with structured backend logs in one pipeline.
  • v0.2: SaaS exporter target picker (Honeycomb / Grafana Cloud) — operator-configurable, no compose edit required.
  • v0.3: tail-sampling for high-volume installs.
  • v0.3: live trace explorer in the dashboard (read directly from traces.jsonl).
  • Independent: Prom bridge in-process bug (TODO #103) — orthogonal; the dashboard’s existing app-counter scrape is unaffected by SPEC-064 because we kept the Prom reader on the backend’s own /metrics. Bridge fix tracked separately.
  • Independent: Prism Console (broader maintenance verbs — restart single service, drain stale signals, capacity planning, etc.) — separate SPEC-065+ thread; SPEC-064 scoped to telemetry only.
Last modified on May 3, 2026