Status: accepted · Version 0.1 · Filed 2026-05-01
SPEC-064 v0.1 — OTel Collector & Telemetry Surface
Status: draft
Version: 0.1
Authors: Porsche (Claude Code, mini3), Frank Tewksbury
Date: 2026-05-01
Supersedes: none (net-new)
1. Summary
Add a single shared OpenTelemetry Collector container (gateway pattern) to every Prism deployment profile: dev, personal, server. All Prism services that already hold the OTel SDK (backend HTTP, backend gRPC, session-manager) point OTEL_EXPORTER_OTLP_ENDPOINT at the in-network collector at http://otelcol:4317. The collector batches traces / metrics / logs and writes them as JSON-lines to a named volume mounted at /var/lib/otelcol/ inside the container; the host volume is the local dump directory operators can tarball and ship later when log shipping returns.
The dashboard probes the collector like any other service (topology + health), scrapes its self-monitoring endpoint at :8888/metrics, exposes a new Telemetry page showing pipeline state and dump file inventory, and ships a Mintlify “Operations Docs” link from the Overview page header.
2. Goals
- Stop dropping spans. Today every span the backend creates is discarded because OTEL_EXPORTER_OTLP_ENDPOINT is unset everywhere. After SPEC-064 every span lands in a queryable file on disk.
- One shared collector, not per-service sidecars. The sidecar pattern is the wrong shape for docker-compose; a single gateway is canonical and saves ~750 MB RAM in this stack.
- Default-on, no operator action required. Frank’s mandate: “yes I want the collector”. Personal install gets it; server install gets it; no opt-in flag.
- File exporter is the default destination. Subsumes the deferred log-shipping work: operators tarball the dump dir to ship later. When SaaS endpoints come back into scope, swap one exporter; pipeline stays the same. Aligns with feedback_same_code_everywhere.
- Backend tolerates the collector being down. The OTLP exporter retries internally and drops the batch once its retry ceiling is reached. Collector outage degrades observability, not availability.
- Dashboard surface is real. Telemetry page shows pipeline health, ingestion rates, exporter state, dump file inventory with download. Topology shows the collector as a node. Service probe inventory includes it.
3. Non-goals
- No remote / SaaS exporter targets in v0.1 (operator can swap; default ships file exporter).
- No log receiver from container stdout in v0.1 (only OTLP-in). Container-log ingestion is a v0.2 follow-up if logs prove valuable.
- No tail-sampling / probabilistic sampling. Backend volume is low (single-operator); we capture everything.
- No Tempo / Jaeger / Loki / Grafana service. File on disk is the only sink in v0.1.
- No exporter-target picker UI. v0.1 ships one config; swap is a compose edit + redeploy.
- No “destructive ops” against the collector from the dashboard (start/stop/clear). Read-only surface; lifecycle through prism_upgrade_lan.
4. Architecture
- Image: otel/opentelemetry-collector-contrib:0.116.0 (pinned). Contrib distro is required for the file exporter.
- Network: in-compose only. No host-port published in personal mode. Server mode publishes nothing externally either; operators only reach the collector via the dashboard’s Telemetry page.
- Volume: named volume prism_{profile}_otel_data mounted at /var/lib/otelcol/. Survives container restarts; included in prism_backup (already volume-driven).
- Config file: otelcol/config.yaml mounted as a read-only file from the repo. Cleaner upgrade story (no rebuild on config tweak).
5. Components
5.1 otelcol container
In every compose profile: otelcol.
5.2 otelcol config
- Receivers: otlp with both gRPC (:4317) and HTTP (:4318) endpoints.
- Processors: batch (5s timeout, 512 batch size), attributes/prism (sets deployment.environment from ${env:PRISM_ENV}).
- Exporters: file/traces, file/metrics, file/logs writing to /var/lib/otelcol/{traces,metrics,logs}.jsonl with rotation (50 MB max per file, 5 backups, 7 days).
- Extensions: health_check (:13133), pprof (:1777, internal), zpages (:55679, internal).
- Service.telemetry: self-metrics on :8888, info-level logs to stdout.
- Pipelines: three independent (traces, metrics, logs), each receivers→processors→exporters.
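The layout above can be mirrored as a plain data structure to check that every pipeline only names components that are actually declared, which is the kind of mistake behind the "misconfigured at start" failure mode. This is a minimal Python sketch, not the shipped otelcol/config.yaml: the key names follow my understanding of the Collector config schema and should be verified against the pinned 0.116.0 release, the processor order and the 0.0.0.0 bind addresses are assumptions, and undeclared_components is a hypothetical helper.

```python
SIGNALS = ("traces", "metrics", "logs")

# Section 5.2 expressed as a dict. Key names are assumptions to be checked
# against the pinned collector release; they are not copied from the repo.
CONFIG = {
    "receivers": {
        "otlp": {"protocols": {"grpc": {"endpoint": "0.0.0.0:4317"},
                               "http": {"endpoint": "0.0.0.0:4318"}}},
    },
    "processors": {
        "batch": {"timeout": "5s", "send_batch_size": 512},
        "attributes/prism": {"actions": [{
            "key": "deployment.environment",
            "value": "${env:PRISM_ENV}",
            "action": "upsert",  # assumed action; the spec only says "sets"
        }]},
    },
    "exporters": {
        f"file/{signal}": {
            "path": f"/var/lib/otelcol/{signal}.jsonl",
            "rotation": {"max_megabytes": 50, "max_backups": 5, "max_days": 7},
        }
        for signal in SIGNALS
    },
    "service": {
        "pipelines": {
            signal: {
                "receivers": ["otlp"],
                "processors": ["attributes/prism", "batch"],  # assumed order
                "exporters": [f"file/{signal}"],
            }
            for signal in SIGNALS
        },
    },
}


def undeclared_components(config):
    """Return (pipeline, kind, name) for each pipeline reference that has no declaration."""
    missing = []
    for pipeline, stages in config["service"]["pipelines"].items():
        for kind in ("receivers", "processors", "exporters"):
            for name in stages.get(kind, []):
                if name not in config.get(kind, {}):
                    missing.append((pipeline, kind, name))
    return missing
```

An empty result means the three pipelines are internally consistent; it says nothing about whether the collector accepts the individual settings.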
5.3 Backend OTLP wiring
Set in every compose profile, on every service that currently calls setup_telemetry:
backend, backend-grpc, session-manager (server only — closes a small gap; SM doesn’t currently call setup_telemetry and will start doing so).
depends_on ordering: backend services depend on otelcol: service_started (not service_healthy). We don’t want a slow collector start to block the backend; the backend tolerates the collector being down per §2 (see also §7).
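The behaviour this wiring relies on (export when the endpoint is set, do nothing when it is not) can be sketched as a small pure function. The spec does not show the body of setup_telemetry, so resolve_otlp_settings and OtlpSettings below are hypothetical names used only for illustration; the environment variable names and the grpc default come from §6.

```python
import os
from typing import Mapping, NamedTuple, Optional


class OtlpSettings(NamedTuple):
    endpoint: str
    protocol: str


def resolve_otlp_settings(env: Optional[Mapping[str, str]] = None) -> Optional[OtlpSettings]:
    """Return exporter settings, or None when no endpoint is configured."""
    env = os.environ if env is None else env
    endpoint = env.get("OTEL_EXPORTER_OTLP_ENDPOINT", "").strip()
    if not endpoint:
        # No endpoint: the caller skips exporter setup entirely (the no-op case).
        return None
    protocol = env.get("OTEL_EXPORTER_OTLP_PROTOCOL", "").strip() or "grpc"
    return OtlpSettings(endpoint=endpoint, protocol=protocol)
```

Taking the environment as a parameter keeps the function easy to exercise from a unit test without mutating os.environ.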
5.4 Dashboard probe + topology
dashboard/src/probes/registry.ts adds:
dashboard/src/topology/builder.ts adds backbone edges from each emitter (backend, backend-grpc, session-manager) to svc:otelcol so topology shows the telemetry data path.
5.5 Dashboard Telemetry page
New route /telemetry. Shows:
- Collector status card: up/down, version (from :13133/), uptime.
- Pipeline ingestion: spans/sec, log records/sec, metric points/sec from the collector’s own self-metrics (otelcol_processor_batch_batch_send_size, otelcol_receiver_accepted_spans, etc.).
- Exporter status: destination path, bytes written (cumulative), errors. From otelcol_exporter_sent_* self-metrics.
- Dump file inventory: list of /var/lib/otelcol/*.jsonl{,.<n>} files with size + mtime; “Download tarball” button.
- Pipeline diagram: a static SVG showing receivers → processors → exporters wired as configured.
Subscribes to the SSE telemetry channel (see §5.6) for live updates.
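The dump file inventory amounts to a directory listing over the read-only mount. The sketch below shows one way to produce entries shaped like the dumpFiles array in §5.6; it is written in Python for illustration although the dashboard itself is TypeScript, list_dump_files is a hypothetical name, and the filename filter assumes rotated files keep ".jsonl" somewhere in their name, which depends on how the file exporter names its backups.

```python
from datetime import datetime, timezone
from pathlib import Path


def list_dump_files(dump_dir):
    """List dump files as dicts with name, sizeBytes and modifiedAt (UTC ISO 8601)."""
    entries = []
    for path in sorted(Path(dump_dir).iterdir()):
        # Assumption: active and rotated files all contain ".jsonl" in the name.
        if not path.is_file() or ".jsonl" not in path.name:
            continue
        stat = path.stat()
        modified = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
        entries.append({
            "name": path.name,
            "sizeBytes": stat.st_size,
            "modifiedAt": modified.isoformat(),
        })
    return entries
```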
5.6 Dashboard backend telemetry routes
- GET /dashboard/api/telemetry: returns { status, version, uptimeSeconds, ingestion: { spansPerSec, logsPerSec, metricsPerSec }, exporter: { destination, bytesWritten, errors }, dumpFiles: [{ name, sizeBytes, modifiedAt }] }. Read-only, session-gated.
- GET /dashboard/api/telemetry/dump.tar.gz: streams a gzipped tar of the dump directory. Session-gated. Implementation: read dump files from a read-only volume mount on the dashboard container, stream via tar -czf - piped through the Express response. Bounded to ~250 MB to avoid runaway download.
- New OtelcolScraper (sibling of MetricsScraper) polls http://otelcol:8888/metrics every 10s, parses Prometheus text, computes deltas, broadcasts on the SSE telemetry channel.
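The scraper's parse-and-delta step can be sketched as follows. This is an illustration in Python rather than the dashboard's TypeScript, and it is not the OtelcolScraper implementation: parse_prometheus_text and per_second_rates are hypothetical names, the parser only handles plain sample lines (it sums all label combinations of a metric into one total), and treating a negative delta as a counter reset is my assumption about the desired behaviour.

```python
def parse_prometheus_text(text):
    """Sum sample values per metric name, ignoring labels, comments and blank lines."""
    totals = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # A sample line is "name{labels} value [timestamp]"; labels are optional.
        head, _, rest = line.partition("{")
        if rest:
            name = head
            value_part = rest.rpartition("}")[2]
        else:
            name, _, value_part = line.partition(" ")
        fields = value_part.split()
        if not fields:
            continue
        try:
            value = float(fields[0])
        except ValueError:
            continue  # skip lines this simple parser does not understand
        totals[name] = totals.get(name, 0.0) + value
    return totals


def per_second_rates(previous, current, interval_seconds):
    """Turn two counter snapshots into per-second rates."""
    rates = {}
    for name, value in current.items():
        # A metric seen for the first time contributes a zero delta.
        delta = value - previous.get(name, value)
        # A negative delta means the counter reset (collector restart); report 0.
        rates[name] = max(delta, 0.0) / interval_seconds
    return rates
```

With the 10s cadence from §6, a counter moving from 120 to 170 between two scrapes yields 5 spans/sec.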
5.7 Backend session-manager telemetry call
backend/app/services/session_manager/standalone.py adds the same setup_telemetry(service_name="prism-session-manager") call already used by main.py and grpc_server.py. Tiny gap-fix; included here because the OTLP wiring is moot for SM if it never registers a TracerProvider.
5.8 Mintlify docs link on Overview
Add an “Operations Docs” link in the dashboard TopBar pointing at the Mintlify ops dashboard URL. Single link, header-style placement, opens in new tab. The exact URL is operator-configurable via PRISM_OPS_DOCS_URL env on the dashboard container; default to https://docs.prism.local/operations (placeholder until Mintlify URL is finalized).
6. Configuration
New env vars:
| Var | Service | Default | Purpose |
|---|---|---|---|
| OTEL_EXPORTER_OTLP_ENDPOINT | backend, backend-grpc, session-manager | http://otelcol:4317 | OTLP gRPC target |
| OTEL_EXPORTER_OTLP_PROTOCOL | same | grpc | OTLP transport |
| PRISM_ENV | otelcol, services | ${PRISM_MODE} | Tagged on every span/metric/log |
| PRISM_DASHBOARD_OTELCOL_URL | dashboard | http://otelcol:8888/metrics | otelcol self-metrics endpoint |
| PRISM_DASHBOARD_OTELCOL_HEALTH_URL | dashboard | http://otelcol:13133/ | health_check endpoint |
| PRISM_DASHBOARD_TELEMETRY_INTERVAL_MS | dashboard | 10000 | self-metrics scrape cadence |
| PRISM_DASHBOARD_OTEL_DUMP_PATH | dashboard | /var/lib/otelcol | RO mount to list dump files |
| PRISM_OPS_DOCS_URL | dashboard | https://docs.prism.local/operations | Mintlify link |
7. Failure modes
- Collector container down: backend’s OTLP exporter retries with exponential backoff, eventually drops. Backend health unaffected. Dashboard topology shows otelcol node as unreachable; Telemetry page shows “Collector offline”. Critical=false on the probe → overall stack stays “degraded”, not “unhealthy”.
- Disk full at /var/lib/otelcol: file exporter logs errors. Self-metrics surface otelcol_exporter_send_failed_*. Telemetry page shows error count. Operator manually trims with prism_upgrade_lan follow-up or shells in.
- Collector misconfigured at start: container enters a crash loop. prism_upgrade_lan deploy verification asserts collector health via :13133 before declaring success; a bad config rolls back the deployment.
- Backend → collector network partition (compose-internal): Same as collector down. Exporter retries.
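The "retry with exponential backoff, then drop" behaviour in the first and last cases can be pictured with a short sketch. The numbers below (1s initial wait, doubling, 30s cap, 60s total) are illustrative assumptions and not the OTel SDK's actual retry parameters, and backoff_schedule and export_with_retry are hypothetical names rather than SDK functions.

```python
def backoff_schedule(initial_s=1.0, multiplier=2.0, max_interval_s=30.0, max_elapsed_s=60.0):
    """Yield the wait before each retry until the elapsed-time ceiling is reached."""
    elapsed = 0.0
    wait = initial_s
    while elapsed + wait <= max_elapsed_s:
        yield wait
        elapsed += wait
        wait = min(wait * multiplier, max_interval_s)


def export_with_retry(send, batch, sleep, **schedule):
    """Try to send a batch; return False (batch dropped) once retries are exhausted."""
    if send(batch):
        return True
    for wait in backoff_schedule(**schedule):
        sleep(wait)
        if send(batch):
            return True
    # Ceiling reached: the batch is dropped and the caller carries on.
    return False
```

The point of the shape is that a failed export returns control to the caller instead of raising, which is why a collector outage costs telemetry but not request handling.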
8. Migration path
- Personal install (mini3, fresh): prism install_local brings up the new compose with otelcol included. No state migration; new named volume created.
- Server install (server1, existing): prism_upgrade_lan rsyncs the compose + config, runs docker compose up -d which creates the new prism-server-otelcol container alongside everything else. Backend services restart to pick up the new env vars (this disrupts active controllers; they re-register on next heartbeat). Total disruption: ~30s.
- Backend image rebuild: not required. SDK is already in the image.
- Schema migration: none. No DB tables added.
9. Testing
9.1 Unit / integration
- backend/tests/test_observability_otel_setup.py: asserts setup_telemetry reads OTEL_EXPORTER_OTLP_ENDPOINT and configures OTLP path; asserts no-op when env is unset.
- dashboard/tests/probes_registry.test.ts (new vitest): asserts otelcol appears in inventory with critical=false, healthUrl present.
- dashboard/tests/api_telemetry.test.ts: integration with mock otelcol; asserts route shape, 503 on collector unreachable.
9.2 Smoke
- scripts/smoke_otelcol.sh: boots the personal compose stack, sends a prism_signal via the deployed backend’s MCP path, polls /var/lib/otelcol/traces.jsonl until a span with service.name=prism-backend-http appears (60s ceiling). Asserts the JSON record contains the operation name and resource attributes.
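The polling step of the smoke test can be sketched like this. The smoke script itself is a shell script, so this Python version is only an illustration of the matching logic; it assumes the file exporter writes one OTLP/JSON export per line (resourceSpans, then scopeSpans, then spans) and that service.name is a string-valued resource attribute. find_span_names and wait_for_span are hypothetical names.

```python
import json
import time
from pathlib import Path


def resource_attributes(resource_spans):
    """Flatten OTLP/JSON resource attributes into {key: string value}."""
    attrs = resource_spans.get("resource", {}).get("attributes", [])
    return {a["key"]: a.get("value", {}).get("stringValue") for a in attrs}


def find_span_names(lines, service_name):
    """Return the names of all spans emitted under the given service.name."""
    names = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # a partially written last line is expected while polling
        for rs in record.get("resourceSpans", []):
            if resource_attributes(rs).get("service.name") != service_name:
                continue
            for scope in rs.get("scopeSpans", []):
                names.extend(span.get("name", "") for span in scope.get("spans", []))
    return names


def wait_for_span(path, service_name, timeout_s=60.0, poll_s=1.0):
    """Poll the dump file until a matching span appears or the ceiling is hit."""
    deadline = time.monotonic() + timeout_s
    while True:
        dump = Path(path)
        if dump.exists():
            names = find_span_names(dump.read_text().splitlines(), service_name)
            if names:
                return names
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no span from {service_name} in {path}")
        time.sleep(poll_s)
```

Because the batch processor flushes on a 5s timeout, a span may take several seconds to reach the file, so a poll interval of about a second under a 60s ceiling leaves a comfortable margin.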
9.3 Deploy verification
prism_upgrade_lan post-deploy checks (in order):
- docker ps shows otelcol running and healthy.
- curl -fsS http://otelcol:13133/ returns 200 (run inside the dashboard container).
- curl -fsS http://otelcol:8888/metrics | grep -q otelcol_receiver_accepted_spans returns true.
- Existing checks (backend health, db_health, dashboard /health) all pass.
10. Security
- No exposed ports outside the compose network. Both gRPC (4317) and HTTP (4318) are docker-network-only. Self-metrics (8888), health (13133), pprof (1777), zpages (55679) all internal. The dashboard is the only externally-reachable surface that touches any of them.
- Read-only dump access: dashboard mounts /var/lib/otelcol as :ro. The dump tarball endpoint is session-gated and bounded.
- No secrets in spans: backend instrumentation already redacts known secret-bearing fields. SPEC-064 doesn’t change instrumentation; if a redaction gap exists today it persists. Tracked separately if discovered.
- Volume retention: file exporter rotates and limits to 5 backups × 50 MB = 250 MB ceiling per signal (traces / metrics / logs). Hard ceiling on disk use.
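The retention arithmetic can be written out to see the whole-directory figure as well as the per-signal one. The 250 MB figure above counts the five backups only; whether the file currently being written adds a sixth 50 MB file depends on how the exporter's rotation counts backups, which is an open assumption here and worth checking against the pinned release. ceiling_mb is a hypothetical helper.

```python
MAX_FILE_MB = 50
MAX_BACKUPS = 5
SIGNALS = ("traces", "metrics", "logs")


def ceiling_mb(count_active_file):
    """Return (per-signal ceiling, whole-directory ceiling) in MB."""
    files_per_signal = MAX_BACKUPS + (1 if count_active_file else 0)
    per_signal = files_per_signal * MAX_FILE_MB
    return per_signal, per_signal * len(SIGNALS)


print(ceiling_mb(False))  # backups only: (250, 750)
print(ceiling_mb(True))   # backups plus the active file: (300, 900)
```

Under either reading the directory total across three signals exceeds the ~250 MB bound on the tarball route in §5.6, so a full dump directory would hit that bound.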
11. Acceptance criteria
- otelcol container runs in dev, personal, server compose profiles.
- Backend HTTP, backend-grpc, session-manager all set OTLP env and emit traces.
- After a single prism_signal round-trip, a span lands in traces.jsonl (smoke passes).
- Dashboard topology shows otelcol node with edges from backend / backend-grpc / session-manager.
- Dashboard /telemetry page renders with live collector status + dump file list.
- Dashboard “Operations Docs” link visible from Overview header.
- prism_upgrade_lan deploys to server1 and post-deploy verification passes.
- PR merged to main; SPEC-064 status moves to accepted on merge.
12. Out-of-scope / future work
- v0.2: container-stdout log receiver (filelog or docker_container) so operator logs unify with structured backend logs in one pipeline.
- v0.2: SaaS exporter target picker (Honeycomb / Grafana Cloud): operator-configurable, no compose edit required.
- v0.3: tail-sampling for high-volume installs.
- v0.3: live trace explorer in the dashboard (read directly from traces.jsonl).
- Independent: Prom bridge in-process bug (TODO #103). Orthogonal; the dashboard’s existing app-counter scrape is unaffected by SPEC-064 because we kept the Prom reader on the backend’s own /metrics. Bridge fix tracked separately.
- Independent: Prism Console (broader maintenance verbs: restart single service, drain stale signals, capacity planning, etc.). Separate SPEC-065+ thread; SPEC-064 scoped to telemetry only.

