SPEC-050 — Prism Dashboard — Org-Scoped Observability Service
Status: draft · Version 0.3 · Filed 2026-04-28
Changelog
- 0.3: Final architecture — dashboard is a full observability service. SM pushes agent state, dashboard actively probes containers, reads logs on-demand. Separate Postgres schema. Probe adapter pattern for cloud readiness. Redis Streams for SM→dashboard and service→dashboard event delivery. BFF pattern killed. Porsche’s recommendations ratified by Frank. Ready for implementation.
- 0.2: Standalone Node/Express BFF proxy. Rejected — created unwanted coupling between dashboard and backend.
- 0.1: Static SPA on FastAPI mount. Rejected — not its own service.
1. Problem
Prism operators have no persistent visual surface for monitoring backend health. The existing tools require either manual curl, a separate Grafana stack, or an active AI session. The operator needs a single persistent entry point, pinned to their org and reachable from any browser, that shows whether things are healthy, what’s happening, and what’s gone wrong across all three deployment modes.
2. Goals
- Single URL, any mode. Works whether the backend runs on localhost, a LAN server, or in the cloud.
- Org-scoped. Pinned to a Prism org (tenant). Shows health, metrics, and issues across all projects.
- Full observability service. Its own container with its own responsibilities: container health probing, log reading, event classification, and persistence. Not a proxy, not a thin display.
- Three panels: Backend Health, Metrics, Issue Log.
- SM-push for agent state. Agents only know about the Session Manager. All agent observability flows SM → dashboard. Dashboard never polls for agent state.
- Cloud-ready from day one. Probe adapter pattern ensures the architecture adapts to cloud deployment without structural changes.
3. Architecture
3.1 Data Flow
3.2 Key Rules
- Agents only know about the SM. No agent-to-dashboard path. All agent observability flows SM → dashboard via Redis Streams.
- SM pushes to dashboard. Dashboard never polls SM or backend for agent state. SM publishes to the agent_events:{tenant_id} stream on every state change.
- Dashboard actively health-checks all containers. It’s the watchdog. Probes backend, backend-grpc, session-manager, postgres, redis, neo4j on a 15s tick.
- Log reading is on-demand. Operator investigates an issue → dashboard reads bounded log lines from the relevant container. Not a continuous firehose. We are not recreating Datadog.
- Container impact monitoring. Dashboard tracks CPU, memory, disk pressure across all containers. Practical ops monitoring for local and LAN modes.
- Service-internal failures flow via Redis Streams. Backend services emit to service_events:{tenant_id} via a slim log_issue() helper. Dashboard subscribes. Services don’t know about the dashboard; they publish to a shared transport.
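The tenant-scoped stream naming in the rules above can be illustrated with a minimal publisher sketch. It is written in Python for brevity and assumes a client exposing redis-py's `xadd(name, fields)` call. The names `StreamClient`, `stream_key`, and `publish_event`, and the field names `event_type` and `payload`, are hypothetical placeholders and not the event schema defined by this spec.

```python
import json
from typing import Any, Mapping, Protocol


class StreamClient(Protocol):
    """The one Redis call this sketch needs (redis-py's client provides xadd)."""

    def xadd(self, name: str, fields: Mapping[str, str]) -> Any: ...


def stream_key(kind: str, tenant_id: str) -> str:
    """Build a tenant-scoped stream name such as agent_events:{tenant_id}."""
    if kind not in ("agent_events", "service_events"):
        raise ValueError(f"unknown stream kind: {kind}")
    return f"{kind}:{tenant_id}"


def publish_event(client: StreamClient, kind: str, tenant_id: str,
                  event_type: str, payload: Mapping[str, Any]) -> Any:
    """Append one event to the tenant's stream and return the entry ID.

    The field names ("event_type", "payload") are placeholders, not the
    spec's event schema. Stream field values must be flat, so the payload
    is serialized to a JSON string.
    """
    fields = {
        "event_type": event_type,
        "payload": json.dumps(payload, sort_keys=True),
    }
    return client.xadd(stream_key(kind, tenant_id), fields)
```

Because the publisher only needs the stream name, neither the SM nor a backend service has to know that the dashboard exists, which is the decoupling the last rule asks for.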
3.3 Container Topology (8 containers)
3.4 What the Dashboard Is NOT
- NOT a BFF proxy to the Prism backend
- NOT a log aggregation firehose (reads on-demand, not continuous)
- NOT Datadog/Grafana — practical ops monitoring for a Prism deployment
- NOT a polling consumer of backend APIs
4. Event Delivery — Three Streams
The dashboard consumes events from three sources. The two external sources (§4.1, §4.2) use the same Redis Streams pattern, with consumer groups for at-least-once delivery and backlog drain on restart; the dashboard’s own probes (§4.3) write directly to Postgres.
4.1 Agent Events (SM → Dashboard)
Stream: agent_events:{tenant_id}
Publisher: Session Manager (on every state change)
Events: registration, deregistration, election (master claim/preempt), heartbeat freshness changes, identity conflicts, session lifecycle (start/wrap/checkpoint)
Each event carries:
4.2 Service Events (Backend Services → Dashboard)
Stream: service_events:{tenant_id}
Publisher: Any backend service via log_issue() helper
Events: signal delivery failures, auth failures, election anomalies, health check failures, wrap discipline breaches, migration drift, deploy drift, gRPC disconnects, dashboard auth failures
Each event carries:
log_issue() helper signature:
4.3 Dashboard’s Own Probes
The dashboard generates its own events from active health checking and container monitoring. These are written directly to prism_dashboard.log_event (no Redis stream needed, since the dashboard is both producer and consumer).
4.4 Committed v1 log_issue() Call Sites (10 points)
- signal_service.send → delivery failure after retry exhaustion (error)
- controller_service → election anomalies, master preemption conflicts (warning)
- controller_service → identity_conflict per SPEC-038 (warning)
- auth/* → failed OAuth callbacks, API-key auth failures (warning)
- health probes → any dependency check failing (error)
- wrap discipline → rate-floor breach below 0.60 (info)
- migrations → Alembic drift detected at startup (error)
- deploy/upgrade → container image SHA mismatch (warning)
- gRPC stream → disconnect/reconnect events (warning)
- dashboard auth → auth failures in the dashboard service itself (warning)
5. Probe Adapter Pattern (Cloud-Ready)
The dashboard probes containers through an abstract adapter interface, never coupled directly to Docker socket code paths.
5.1 v1: DockerSocketAdapter + HttpHealthAdapter
- DockerSocketAdapter: mounts /var/run/docker.sock, provides container status, CPU/memory/disk stats, restart counts, bounded log reads
- HttpHealthAdapter: calls each service’s /health/liveness endpoint for application-level health confirmation
5.2 Future: Cloud Adapters
When deploying to cloud, swap in KubernetesAdapter (kubelet API + metrics-server) or ECSAdapter (CloudWatch + ECS API). Single factory line change. No structural rewrite.
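To make the "single factory line change" concrete, here is a minimal sketch of the adapter pattern. It is written in Python for brevity even though the dashboard service is Node/Express. The method name `container_status`, the `ContainerStatus` fields, the `ADAPTERS` registry, and the in-memory `StaticAdapter` are all assumptions for illustration; the spec only names the adapters, not their interface.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class ContainerStatus:
    name: str
    running: bool
    restart_count: int


class ProbeAdapter(ABC):
    """What the dashboard needs from any container runtime (method name assumed)."""

    @abstractmethod
    def container_status(self, name: str) -> ContainerStatus: ...


class StaticAdapter(ProbeAdapter):
    """In-memory stand-in used here instead of a real Docker or Kubernetes client."""

    def __init__(self, statuses: Dict[str, ContainerStatus]) -> None:
        self._statuses = statuses

    def container_status(self, name: str) -> ContainerStatus:
        return self._statuses[name]


# Registry of adapter constructors keyed by runtime name. A cloud deployment
# would register its own adapter here; callers of make_adapter never change.
ADAPTERS: Dict[str, Callable[[], ProbeAdapter]] = {}


def make_adapter(runtime: str) -> ProbeAdapter:
    """The single place that decides which adapter the probe loop uses."""
    try:
        return ADAPTERS[runtime]()
    except KeyError:
        raise ValueError(f"no probe adapter registered for {runtime!r}") from None


def probe_all(adapter: ProbeAdapter, names: List[str]) -> List[ContainerStatus]:
    """One probe tick: ask the adapter about every monitored container."""
    return [adapter.container_status(n) for n in names]
```

The probe loop depends only on `ProbeAdapter`, so swapping runtimes means registering a different constructor rather than touching the loop.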
5.3 On-Demand Log Reading
Bounded reads only. The adapter enforces since, until, and limit parameters. Dashboard never loads full container logs into memory. UI sends time-bounded requests: “give me 5 min around this issue, max 500 lines.”
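A bounded read of this kind might look like the sketch below, written in Python for brevity. The 500-line ceiling comes from the example request above; treating "5 min around" as a window centered on the issue timestamp is an assumption, as are the helper names `window_around` and `read_bounded`.

```python
from datetime import datetime, timedelta, timezone
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

MAX_LINES = 500  # hard ceiling, taken from the "max 500 lines" example

LogLine = Tuple[datetime, str]


def window_around(ts: datetime, minutes: int = 5) -> Tuple[datetime, datetime]:
    """Return (since, until) centered on ts. Centering is an assumption."""
    half = timedelta(minutes=minutes) / 2
    return ts - half, ts + half


def read_bounded(lines: Iterable[LogLine], since: datetime, until: datetime,
                 limit: int = MAX_LINES) -> List[LogLine]:
    """Return at most `limit` lines with since <= timestamp <= until.

    `lines` may be a lazy iterator over a container's log stream; islice
    stops pulling from it once the limit is reached, so the full log is
    never held in memory.
    """
    if since > until:
        raise ValueError("since must not be after until")
    limit = max(0, min(limit, MAX_LINES))  # callers cannot exceed the ceiling
    in_window: Iterator[LogLine] = (
        line for line in lines if since <= line[0] <= until
    )
    return list(islice(in_window, limit))
```

Clamping `limit` inside the read function, rather than trusting the caller, keeps the bound in force even if a UI request asks for more.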
5.4 Container Impact Monitoring
Dashboard tracks per-container: CPU %, memory usage/limit, disk I/O, restart count, uptime. Surfaced as resource cards in the Health panel. Probed on the same 15s tick as health checks.
6. Data Store — Separate Postgres Schema
Dashboard uses the same Postgres instance but its own prism_dashboard schema. Independent migrations, independent test cycles, no rebooting the whole stack.
6.1 Schema: prism_dashboard
6.2 Schema Isolation
- Dashboard’s Alembic chain lives in dashboard/migrations/ with version_table_schema='prism_dashboard'
- No cross-schema foreign keys to public.*. Agent identity stored as plain VARCHAR, not FK to public.personas. Historical name preserved even if persona is renamed (correct audit behavior).
- Schema-scoped DB user: USAGE + CREATE on prism_dashboard only. No rights to public. Defense in depth: dashboard can’t read tenants, projects, API keys, or any backend data.
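The schema-scoped user could be provisioned with statements along these lines. This is a sketch only: the helper name `schema_scoped_grants` is hypothetical, the role name is supplied by the caller, and the exact statements a deployment needs depend on how privileges on the public schema were set up.

```python
import re
from typing import List

_IDENT = re.compile(r"^[a-z_][a-z0-9_]*$")


def schema_scoped_grants(role: str, schema: str = "prism_dashboard") -> List[str]:
    """Return PostgreSQL statements confining `role` to a single schema.

    The role name is a placeholder chosen by the caller. Identifiers are
    validated rather than quoted because they are interpolated into DDL.
    """
    for ident in (role, schema):
        if not _IDENT.match(ident):
            raise ValueError(f"unsafe identifier: {ident!r}")
    return [
        f"CREATE SCHEMA IF NOT EXISTS {schema}",
        f"GRANT USAGE, CREATE ON SCHEMA {schema} TO {role}",
        # Removes grants made directly to the role. Privileges that the
        # built-in PUBLIC pseudo-role holds on schema public are separate
        # and would need their own REVOKE ... FROM PUBLIC.
        f"REVOKE ALL ON SCHEMA public FROM {role}",
        # Unqualified table names resolve against the dashboard schema,
        # not public.
        f"ALTER ROLE {role} SET search_path = {schema}",
    ]
```

Generating the statements from one function keeps the install path and the test fixtures in agreement about what the dashboard role is allowed to do.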
7. Auth (v1 — API Key Only)
OAuth deferred to v2.
| Mode | Flow |
|---|---|
| local | Dashboard reads API key from shared credentials volume. No login screen. |
| lan | Login screen. Operator pastes API key → encrypted session cookie (iron-session, AES-GCM, 24h TTL). |
| cloud | Same as LAN for v1. OAuth replaces this in v2. |
The session secret is generated at install time by prism install. Non-local modes refuse to start if the secret is changeme or empty.
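The startup guard described above (non-local modes refuse to start when the secret is changeme or empty) can be sketched as follows. It is written in Python for brevity; the function name `check_session_secret` and the choice of exception types are assumptions.

```python
from typing import Optional

INSECURE_SECRETS = {"", "changeme"}


def check_session_secret(mode: str, secret: Optional[str]) -> None:
    """Raise at startup if a non-local mode has a placeholder or empty secret."""
    if mode not in ("local", "lan", "cloud"):
        raise ValueError(f"unknown mode: {mode!r}")
    if mode == "local":
        # Per the mode table, local mode has no login screen, so the guard
        # applies to lan and cloud only.
        return
    if (secret or "").strip() in INSECURE_SECRETS:
        raise RuntimeError(
            f"refusing to start in {mode} mode: session secret is unset or 'changeme'"
        )
```

Calling this before the HTTP listener is bound means a misconfigured LAN or cloud deployment fails loudly at boot rather than serving sessions encrypted with a known value.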
7.1 Dashboard Health Endpoints
- GET /dashboard/health: lightweight, for Docker healthcheck / orchestrator probes
- GET /dashboard/api/health/current: comprehensive, for the operator UI (latest probe results across all monitored services + agent roster + container resource stats)
7.2 Session Expiry UX
When session cookie expires mid-page: page stays rendered, re-auth banner appears, writes (acknowledge) require re-auth. Read-mostly UX preserved.
8. SPA Frontend
Vanilla HTML + Chart.js. No React, no Vite, no build pipeline. Three files served by the dashboard backend: index.html, chart.umd.min.js (CDN or bundled), app.js.
8.1 Panel: Backend Health
- Overall status badge (green/yellow/red) from dashboard’s own probes
- Individual service cards (Backend, Backend-gRPC, Session Manager, Postgres, Redis, Neo4j) with status, latency, CPU/memory/disk
- Active agent roster with identity / surface / machine / master status (from SM-pushed state)
- Container restart counts and uptime
- Auto-refresh via SSE from dashboard backend
8.2 Panel: Metrics
- Signal rate (sent/delivered/queued) — line chart, 1h/6h/24h toggle
- Election events — bar chart
- Active registrations — live counter
- Wrap discipline rate — gauge with 0.60 floor line
- Container resource trends (CPU/memory over time)
Charts are live-only in v1; nothing is persisted beyond the health_history table.
8.3 Panel: Issue Log
- Severity-colored rows (red/amber/blue for ERROR/WARNING/INFO)
- Filter by: severity, category, project, agent identity, time range
- Sort by: time, severity, agent, category
- Search by title text
- Acknowledge button (records who + when)
- New issues appear at top via SSE
- Expandable detail showing full JSONB payload
- On-demand log drill-down: click an issue → fetch bounded log lines from the source container around that timestamp
9. Docker Integration
9.1 Compose Block (all three compose files)
9.2 Side-Effect: Fix SM Compose Comment
The session-manager service comment in docker-compose.server.yml currently says “optional / opt-in”. Per SPEC-049 §2, the SM is required infrastructure. Comment corrected to: “Phase 4 standalone deployment. Remove this block to use the in-tree manager path (Phases 1-3). The session manager as a concept is required — see SPEC-049 §2.”
10. Implementation Phases
Phase 1 — Health Probing + Health Panel (~1-2 weeks)
- dashboard/ directory: Node/Express service, Dockerfile, compose blocks
- prism_dashboard Postgres schema + Alembic chain (health_history, log_event, agent_state_snapshot)
- ProbeAdapter interface + DockerSocketAdapter + HttpHealthAdapter
- Periodic health probe loop (15s tick) writing to health_history
- Container resource monitoring (CPU/memory/disk/restart counts)
- Vanilla HTML SPA — Health panel only — reading current state via SSE
- Mode-aware auth (local auto, LAN/cloud login with key paste, iron-session)
- prism install integration (image build, env-secret generation, post-install URL)
Phase 2 — SM Push + Issue Log Panel
- SM publishes to agent_events:{tenant_id} Redis Stream (coordinate with Donna for SM-side PR)
- log_issue() helper in backend services emitting to service_events:{tenant_id}
- Dashboard subscribes to both streams (consumer group dashboard)
- Issue log panel with severity/category/agent/project filtering + acknowledge
- Agent roster in Health panel populated from SM-pushed state
- On-demand log drill-down via Docker socket
- Instrument v1 call sites (10 points per §4.4)
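The stream subscription in this phase relies on consumer-group semantics (at-least-once delivery, backlog drain on restart, per §4). A minimal sketch of that read loop follows, in Python for brevity. It assumes a client with redis-py's `xgroup_create`, `xreadgroup`, and `xack` methods; the function names `ensure_group` and `drain` and the `handle` callback are hypothetical.

```python
from typing import Any, Callable, Dict, List, Tuple

Entry = Tuple[str, Dict[str, str]]


def ensure_group(client: Any, stream: str, group: str = "dashboard") -> None:
    """Create the consumer group (and the stream) if it does not exist yet."""
    try:
        # id="0" makes a brand-new group start from the oldest entry.
        client.xgroup_create(stream, group, id="0", mkstream=True)
    except Exception as exc:
        # Redis answers BUSYGROUP when the group already exists; that is fine.
        if "BUSYGROUP" not in str(exc):
            raise


def drain(client: Any, stream: str, consumer: str,
          handle: Callable[[str, Dict[str, str]], None],
          group: str = "dashboard", count: int = 100) -> int:
    """Process the pending backlog, then undelivered entries, acking each one.

    Reading from ID "0" returns entries this consumer received earlier but
    never acknowledged (for example after a crash). Reading from ">" returns
    entries not yet delivered to the group. An entry is acknowledged only
    after `handle` returns, so a failure leaves it pending for the next run:
    at-least-once delivery.
    """
    processed = 0
    for start_id in ("0", ">"):
        while True:
            reply = client.xreadgroup(group, consumer, {stream: start_id}, count=count)
            entries: List[Entry] = reply[0][1] if reply else []
            if not entries:
                break
            for entry_id, fields in entries:
                handle(entry_id, fields)
                client.xack(stream, group, entry_id)
                processed += 1
    return processed
```

Pending entries are tracked per consumer name, so the dashboard would need to reuse the same consumer name across restarts for the "0" pass to find its own backlog. A long-running subscriber would follow `drain` with a blocking read on ">" rather than returning.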
Phase 3 — Metrics Panel + Polish
- Metrics panel with live charts (Chart.js)
- Container resource trend charts
- Dark/light mode (system preference)
- SSE startup stagger (health t+0, metrics t+5s, issues t+10s)
- Metrics-parsed endpoint with 5s server-side cache
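The 5s server-side cache and the SSE stagger from this phase can be sketched as below, in Python for brevity. The class name `TTLCache`, the injectable `clock`, and the `SSE_START_OFFSETS` mapping are illustrative assumptions; only the 5s TTL and the t+0 / t+5s / t+10s offsets come from the list above.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple

# Delay in seconds before each panel's SSE stream starts, per the Phase 3 list.
SSE_START_OFFSETS: Dict[str, float] = {"health": 0.0, "metrics": 5.0, "issues": 10.0}


class TTLCache:
    """Cache one computed value and recompute it once it is older than ttl."""

    def __init__(self, compute: Callable[[], Any], ttl: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._compute = compute
        self._ttl = ttl
        self._clock = clock
        self._entry: Optional[Tuple[float, Any]] = None  # (stored_at, value)

    def get(self) -> Any:
        """Return the cached value, refreshing it first if it has expired."""
        now = self._clock()
        if self._entry is None or now - self._entry[0] >= self._ttl:
            self._entry = (now, self._compute())
        return self._entry[1]
```

With the cache in front of the metrics parser, any number of connected browsers trigger at most one parse per 5 seconds. A monotonic clock is used so that wall-clock adjustments cannot extend or shorten the TTL.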
Phase 4 — Hardening + Docs
- Rate limiting on dashboard endpoints
- CSP headers
- Cloud adapter scaffolding (KubernetesAdapter interface)
- docs/dashboard.mdx
- OAuth v2 prep (GitHub/Google callback handling in dashboard service)
11. Future: prism_issues MCP Verb
Out of scope for v1, but the /dashboard/api/issues read API is designed for dual-consumer support from day one. Clean filter params (severity, category, agent_identity, project_id, since, until, acknowledged, limit, offset), no UI-coupled response shape. A future prism_issues MCP verb taps the same data source so agents can query “why did the deploy fail at 3am” without opening a browser.
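The filter parameters listed above map naturally onto a single query function that either the UI or a future MCP verb could call. The sketch below applies them to an in-memory list, in Python for brevity. The `Issue` fields mirror the parameter names and are assumptions about the log_event shape; the default `limit` of 50 and the newest-first ordering are also assumptions.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Issue:
    # Field names mirror the filter params; the real log_event columns may differ.
    created_at: datetime
    severity: str  # "ERROR", "WARNING" or "INFO"
    category: str
    agent_identity: Optional[str]
    project_id: Optional[str]
    acknowledged: bool
    title: str


def query_issues(issues: Iterable[Issue], *, severity: Optional[str] = None,
                 category: Optional[str] = None,
                 agent_identity: Optional[str] = None,
                 project_id: Optional[str] = None,
                 since: Optional[datetime] = None,
                 until: Optional[datetime] = None,
                 acknowledged: Optional[bool] = None,
                 limit: int = 50, offset: int = 0) -> List[Issue]:
    """Apply every supplied filter, order newest first, then paginate."""

    def keep(i: Issue) -> bool:
        return ((severity is None or i.severity == severity)
                and (category is None or i.category == category)
                and (agent_identity is None or i.agent_identity == agent_identity)
                and (project_id is None or i.project_id == project_id)
                and (since is None or i.created_at >= since)
                and (until is None or i.created_at <= until)
                and (acknowledged is None or i.acknowledged == acknowledged))

    matched = sorted((i for i in issues if keep(i)),
                     key=lambda i: i.created_at, reverse=True)
    return matched[offset:offset + limit]
```

Every parameter is optional and independent of any panel state, which is what lets a non-browser consumer reuse the same read path.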
12. Non-Goals (Explicit Deferrals)
- OAuth (GitHub/Google) — v2
- Continuous log streaming / firehose ingestion — on-demand only
- Datadog-scale APM — practical ops monitoring, not distributed tracing
- Metrics persistence beyond health_history — live-only charts in v1
- User management UI — v2
- Project-level drill-down dashboards — v1 is org-scoped
- Mobile-first layout — responsive enough for tablets
13. References
- SPEC-019 — Environment Resolution Contract (modes: local/lan/cloud)
- SPEC-030 — Multi-Prism Controller (registrations, elections, metrics)
- SPEC-032 — Redis Session Plane (SM owns Redis, backend-only access)
- SPEC-045 — Unified Session+Coordination Stream (WebSocket data plane)
- SPEC-049 — Identity & Session Manager (single-writer rule, §2 non-negotiable)
- ADR-027 — SM owns realtime state; Redis-fronted; Postgres is audit/durable
- docs/metrics.mdx — existing counter catalog
14. Decisions Log
| # | Decision | Source | Date |
|---|---|---|---|
| D1 | Dashboard is a full observability service, not a thin display | Frank | 2026-04-28 |
| D2 | Agents only know about SM; all agent observability flows SM → dashboard | Frank | 2026-04-28 |
| D3 | SM pushes via Redis Streams; dashboard never polls | Frank + Porsche | 2026-04-28 |
| D4 | Service-internal failures emit via log_issue() to service_events Redis stream | Porsche (B2), ratified Frank | 2026-04-28 |
| D5 | Probe adapter pattern from day one for cloud readiness | Porsche, ratified Frank | 2026-04-28 |
| D6 | Separate Postgres schema (prism_dashboard) for independent dev/test | Frank | 2026-04-28 |
| D7 | On-demand log reading, not continuous streaming | Frank | 2026-04-28 |
| D8 | Container impact monitoring (CPU/memory/disk) across local and LAN | Frank | 2026-04-28 |
| D9 | agent_identity as first-class column on every log_event, sortable/filterable | Frank | 2026-04-28 |
| D10 | Severity enum: ERROR / WARNING / INFO only | Frank | 2026-04-28 |
| D11 | Vanilla HTML + Chart.js for SPA (no React/Vite build pipeline) | Porsche, ratified Frank | 2026-04-28 |
| D12 | Port 43000 for dashboard | Lola + Porsche, ratified Frank | 2026-04-28 |
| D13 | API-key auth v1; OAuth deferred to v2 | Frank | 2026-04-28 |
| D14 | Session cookie encrypted via iron-session (AES-GCM) | Porsche, ratified Frank | 2026-04-28 |
| D15 | Session secret generated at install time; refuse to start if changeme | Porsche, ratified Frank | 2026-04-28 |
| D16 | Session expiry UX: keep page rendered, show re-auth banner | Porsche, ratified Frank | 2026-04-28 |
| D17 | BFF proxy pattern rejected — creates coupling | Frank | 2026-04-28 |

