Status: draft · Version 0.3 · Filed 2026-04-28

SPEC-050 — Prism Dashboard — Org-Scoped Observability Service

Version

0.3

Status

draft

Changelog

0.3: Final architecture — dashboard is a full observability service. SM pushes agent state, dashboard actively probes containers, reads logs on-demand. Separate Postgres schema. Probe adapter pattern for cloud readiness. Redis Streams for SM→dashboard and service→dashboard event delivery. BFF pattern killed. Porsche’s recommendations ratified by Frank. Ready for implementation.
0.2: Standalone Node/Express BFF proxy. Rejected — created unwanted coupling between dashboard and backend.
0.1: Static SPA on FastAPI mount. Rejected — not its own service.

1. Problem

Prism operators have no persistent visual surface for monitoring backend health. The existing tools require either manual curl, a separate Grafana stack, or an active AI session. The operator needs a single persistent entry point — pinned to their org, reachable from any browser — that shows whether things are healthy, what’s happening, and what’s gone wrong across all three deployment modes.

2. Goals

Single URL, any mode. Works whether the backend runs on localhost, a LAN server, or in the cloud.
Org-scoped. Pinned to a Prism org (tenant). Shows health, metrics, and issues across all projects.
Full observability service. Its own container with its own responsibilities: container health probing, log reading, event classification, and persistence. Not a proxy, not a thin display.
Three panels: Backend Health, Metrics, Issue Log.
SM-push for agent state. Agents only know about the Session Manager. All agent observability flows SM → dashboard. Dashboard never polls for agent state.
Cloud-ready from day one. Probe adapter pattern ensures the architecture adapts to cloud deployment without structural changes.

3. Architecture

3.1 Data Flow

Agents (Donna, Candi, Lafonda, etc.)
    │
    │ prism_start / wrap / heartbeat / all verbs
    ▼
Session Manager (single source of truth for agent/session state)
    │
    │ Redis Stream: agent_events:{tenant_id}
    │ (registration, deregistration, election, heartbeat, lifecycle)
    ▼
Dashboard Backend (observability service — own container)
    │
    ├── Subscribes to agent_events stream from SM
    ├── Subscribes to service_events stream from backend services
    ├── Actively probes container health (all 6 services)
    ├── Reads service logs ON DEMAND (not continuous)
    ├── Monitors container resource impact (CPU/memory/disk)
    ├── Classifies events as ERROR / WARNING / INFO
    ├── Persists to its own Postgres schema (prism_dashboard)
    │
    │ Serves SPA + pushes live state via SSE
    ▼
Browser (renders what it receives)

3.2 Key Rules

Agents only know about the SM. No agent-to-dashboard path. All agent observability flows SM → dashboard via Redis Streams.
SM pushes to dashboard. Dashboard never polls SM or backend for agent state. SM publishes to agent_events:{tenant_id} stream on every state change.
Dashboard actively health-checks all containers. It’s the watchdog. Probes backend, backend-grpc, session-manager, postgres, redis, neo4j on a 15s tick.
Log reading is on-demand. Operator investigates an issue → dashboard reads bounded log lines from the relevant container. Not a continuous firehose. We are not recreating Datadog.
Container impact monitoring. Dashboard tracks CPU, memory, disk pressure across all containers. Practical ops monitoring for local and LAN modes.
Service-internal failures flow via Redis Streams. Backend services emit to service_events:{tenant_id} via a slim log_issue() helper. Dashboard subscribes. Services don’t know about the dashboard — they publish to a shared transport.

3.3 Container Topology (8 containers)

┌──────────────────────────────────────────────────────────────────┐
│  Docker Compose Stack                                            │
│                                                                  │
│  ┌───────────────┐                                               │
│  │  Dashboard     │◀── Redis Streams (agent_events,              │
│  │  :3000         │    service_events)                            │
│  │  (observability│──▶ Docker socket (probes, logs, stats)        │
│  │   service)     │──▶ Postgres prism_dashboard schema            │
│  └───────────────┘                                               │
│        ▲                                                         │
│   Browser (SPA)                                                  │
│                                                                  │
│  ┌───────────────┐  ┌───────────────┐  ┌───────────────────────┐ │
│  │  Backend       │  │  Backend-gRPC  │  │  Session Manager      │ │
│  │  :8000         │  │  :50051        │  │  :41766               │ │
│  └───────────────┘  └───────────────┘  └───────────────────────┘ │
│         │                  │                     │                │
│  ┌──────┴──────────────────┴─────────────────────┘               │
│  │                                                               │
│  │  ┌─────────┐  ┌─────────┐  ┌─────────┐                       │
│  │  │Postgres  │  │ Redis   │  │ Neo4j   │                       │
│  │  │:5432     │  │ :6379   │  │ :7687   │                       │
│  │  └─────────┘  └─────────┘  └─────────┘                       │
│  │                                                               │
└──┴───────────────────────────────────────────────────────────────┘

Session Manager is required infrastructure per SPEC-049 §2. The compose comment describing it as “optional” is a documentation bug, corrected in §9.2.

3.4 What the Dashboard Is NOT

NOT a BFF proxy to the Prism backend
NOT a log aggregation firehose (reads on-demand, not continuous)
NOT Datadog/Grafana — practical ops monitoring for a Prism deployment
NOT a polling consumer of backend APIs

4. Event Delivery — Three Streams

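The dashboard consumes events from three sources. The first two (agent events and service events) use the same Redis Streams pattern with consumer groups for at-least-once delivery and backlog drain on restart. The third, the dashboard’s own probes, is written directly to Postgres and needs no stream (§4.3).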
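The backlog drain on restart could look roughly like the sketch below. It assumes a redis-py style asyncio client (`xreadgroup`, `xack`); the function names, the consumer name, and the batch size are illustrative, not part of this spec. Reading with ID "0" returns entries that were delivered to this consumer but never acknowledged; live consumption would then continue with ID ">".

```python
import asyncio

async def drain_once(client, stream, group, consumer, handle, start_id, count=100):
    """Read one batch, then handle and ack each entry. Returns entries processed."""
    batches = await client.xreadgroup(group, consumer, {stream: start_id}, count=count)
    processed = 0
    for _stream_name, entries in batches or []:
        for entry_id, fields in entries:
            # Ack only after the handler succeeds: a crash in handle() leaves the
            # entry pending, which is what gives at-least-once delivery.
            await handle(entry_id, fields)
            await client.xack(stream, group, entry_id)
            processed += 1
    return processed

async def drain_backlog(client, stream, group, consumer, handle):
    """On restart, replay this consumer's pending entries until none remain."""
    total = 0
    while True:
        n = await drain_once(client, stream, group, consumer, handle, start_id="0")
        if n == 0:
            return total
        total += n
```

Because acknowledgement happens after handling, a handler must tolerate seeing the same entry twice.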

4.1 Agent Events (SM → Dashboard)

Stream: agent_events:{tenant_id}
Publisher: Session Manager (on every state change)
Events: registration, deregistration, election (master claim/preempt), heartbeat freshness changes, identity conflicts, session lifecycle (start/wrap/checkpoint)
Each event carries:

{
  "event_type": "registration | deregistration | election | heartbeat_stale | ...",
  "agent_identity": "Donna",
  "agent_surface": "claude_code",
  "machine_id": "mini3.home.lan",
  "session_id": "...",
  "project_id": "...",
  "pid": "PID-PGR01",
  "timestamp": "2026-04-28T12:00:00Z",
  "detail": { ... }
}
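As an illustration of how an agent event might be classified and mapped onto the log_event columns from §6.1, here is a minimal sketch. The severity rules, the category choice, and the "session_manager" source label are assumptions; the spec fixes the three severity levels but not the mapping.

```python
import json

# Hypothetical rule set: which event types count as warnings.
WARNING_EVENT_TYPES = {"heartbeat_stale", "identity_conflict"}

def classify_agent_event(event: dict) -> str:
    """Return 'warning' or 'info' for an agent_events payload (assumed rules)."""
    return "warning" if event.get("event_type") in WARNING_EVENT_TYPES else "info"

def to_log_event_row(tenant_id: str, event: dict) -> dict:
    """Map an agent_events payload onto prism_dashboard.log_event columns."""
    event_type = event.get("event_type", "unknown")
    return {
        "tenant_id": tenant_id,
        "project_id": event.get("project_id"),
        "agent_identity": event.get("agent_identity"),
        "severity": classify_agent_event(event),
        # Assumed: elections get their own category, everything else is 'session'.
        "category": "election" if event_type == "election" else "session",
        "title": f"{event_type}: {event.get('agent_identity', 'unknown agent')}",
        "detail": json.dumps(event.get("detail", {})),
        "source": "session_manager",  # assumed label for SM-originated rows
        "source_stream": "agent_events",
    }
```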

4.2 Service Events (Backend Services → Dashboard)

Stream: service_events:{tenant_id}
Publisher: Any backend service via log_issue() helper
Events: signal delivery failures, auth failures, election anomalies, health check failures, wrap discipline breaches, migration drift, deploy drift, gRPC disconnects, dashboard auth failures
Each event carries:

{
  "severity": "error | warning | info",
  "category": "signal | auth | election | health | drift | deploy | session",
  "agent_identity": "Donna",
  "title": "Signal delivery failed after 3 retries",
  "detail": { "signal_id": "...", "to": "Candi", "last_error": "..." },
  "source": "signal_service.send",
  "project_id": "...",
  "pid": "PID-PGR01",
  "timestamp": "2026-04-28T12:00:00Z"
}

Example log_issue() call (keyword arguments mirror the event fields above):

await log_issue(
    tenant_id=ctx.tenant_id,
    agent_identity="Donna",
    severity="error",
    category="signal",
    title="Signal delivery failed after 3 retries",
    detail={"signal_id": sid, "to": target, "last_error": str(e)},
    source="signal_service.send",
    project_id=pid,
)

Services don’t know about the dashboard. They XADD to a Redis stream. Dashboard happens to be a subscriber.
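A minimal sketch of what the helper might do internally, assuming a redis-py style asyncio client whose `xadd(name, fields, maxlen=..., approximate=...)` call is used as-is. The explicit `redis` argument, the `now` override, the stream length cap, and the name `log_issue_sketch` are assumptions made to keep the example testable; the real helper presumably obtains its client from service context.

```python
import json
from datetime import datetime, timezone

SEVERITIES = {"error", "warning", "info"}

def build_issue_fields(*, agent_identity, severity, category, title, detail,
                       source, project_id=None, pid=None, now=None):
    """Flatten an issue into string fields suitable for a Redis stream entry."""
    if severity not in SEVERITIES:
        raise ValueError(f"unknown severity: {severity!r}")
    ts = (now or datetime.now(timezone.utc)).strftime("%Y-%m-%dT%H:%M:%SZ")
    fields = {
        "severity": severity,
        "category": category,
        "agent_identity": agent_identity,
        "title": title,
        # Stream fields are flat strings, so the nested detail is JSON-encoded.
        "detail": json.dumps(detail),
        "source": source,
        "project_id": project_id,
        "pid": pid,
        "timestamp": ts,
    }
    # Redis cannot store None, so optional fields are omitted when unset.
    return {k: v for k, v in fields.items() if v is not None}

async def log_issue_sketch(redis, *, tenant_id, **issue):
    """XADD the issue to service_events:{tenant_id}; returns the entry ID."""
    fields = build_issue_fields(**issue)
    # Assumed cap so an unconsumed stream cannot grow without bound.
    return await redis.xadd(f"service_events:{tenant_id}", fields,
                            maxlen=10_000, approximate=True)
```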

4.3 Dashboard’s Own Probes

The dashboard generates its own events from active health checking and container monitoring. These are written directly to prism_dashboard.log_event (no Redis stream needed — dashboard is both producer and consumer).

4.4 Committed v1 log_issue() Call Sites (10 points)

signal_service.send → delivery failure after retry exhaustion (error)
controller_service → election anomalies, master preemption conflicts (warning)
controller_service → identity_conflict per SPEC-038 (warning)
auth/* → failed OAuth callbacks, API-key auth failures (warning)
health probes → any dependency check failing (error)
wrap discipline → rate-floor breach below 0.60 (info)
migrations → Alembic drift detected at startup (error)
deploy/upgrade → container image SHA mismatch (warning)
gRPC stream → disconnect/reconnect events (warning)
dashboard auth → auth failures in the dashboard service itself (warning)

5. Probe Adapter Pattern (Cloud-Ready)

The dashboard probes containers through an abstract adapter interface — never coupled directly to Docker socket code paths.

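from typing import Protocol

class ProbeAdapter(Protocol):
    """Abstract probe interface; the return types are defined by the dashboard service."""
    async def container_status(self, name: str) -> "ContainerStatus": ...
    async def container_stats(self, name: str) -> "ResourceStats": ...
    async def container_logs(self, name: str, since: str, until: str, limit: int) -> "list[LogLine]": ...
    async def host_stats(self) -> "HostStats": ...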

5.1 v1: DockerSocketAdapter + HttpHealthAdapter

DockerSocketAdapter — mounts /var/run/docker.sock, provides container status, CPU/memory/disk stats, restart counts, bounded log reads
HttpHealthAdapter — calls each service’s /health/liveness endpoint for application-level health confirmation

5.2 Future: Cloud Adapters

When deploying to cloud, swap in KubernetesAdapter (kubelet API + metrics-server) or ECSAdapter (CloudWatch + ECS API). Single factory line change. No structural rewrite.
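The factory mentioned above might be as small as the following sketch. The adapter classes are empty stand-ins, and the PRISM_PROBE_ADAPTER environment variable and registry keys are hypothetical names chosen for illustration.

```python
import os
from typing import Optional

class DockerSocketAdapter:
    """Stand-in for the v1 adapter (§5.1)."""

class KubernetesAdapter:
    """Stand-in for a future cloud adapter."""

class ECSAdapter:
    """Stand-in for a future cloud adapter."""

# Registry of adapter kinds; adding a cloud target means adding one entry.
ADAPTERS = {
    "docker": DockerSocketAdapter,
    "kubernetes": KubernetesAdapter,
    "ecs": ECSAdapter,
}

def make_probe_adapter(kind: Optional[str] = None):
    """Pick an adapter by name, falling back to a (hypothetical) env var."""
    kind = kind or os.environ.get("PRISM_PROBE_ADAPTER", "docker")
    try:
        return ADAPTERS[kind]()
    except KeyError:
        raise ValueError(f"unknown probe adapter: {kind!r}") from None
```

Callers depend only on the ProbeAdapter interface, so swapping the deployment target does not touch the probe loop or the panels.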

5.3 On-Demand Log Reading

Bounded reads only. The adapter enforces since, until, and limit parameters. Dashboard never loads full container logs into memory. UI sends time-bounded requests: “give me 5 min around this issue, max 500 lines.”
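One way to turn a UI request into bounded adapter parameters is sketched below. It assumes the "5 min around this issue" window is centered on the issue timestamp and that 500 lines is a hard ceiling; both readings, and the function name, are assumptions.

```python
from datetime import datetime, timedelta

MAX_LOG_LINES = 500  # ceiling taken from the example request in the text

def log_window(issue_ts: datetime, minutes: float = 5, limit: int = MAX_LOG_LINES):
    """Return (since, until, limit) for a bounded log read around issue_ts."""
    half = timedelta(minutes=minutes) / 2
    since = issue_ts - half
    until = issue_ts + half
    # Clamp the caller's limit so a request can never exceed the ceiling.
    bounded = max(1, min(limit, MAX_LOG_LINES))
    return since.isoformat(), until.isoformat(), bounded
```

The three values map directly onto the since, until, and limit parameters of container_logs().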

5.4 Container Impact Monitoring

Dashboard tracks per-container: CPU %, memory usage/limit, disk I/O, restart count, uptime. Surfaced as resource cards in the Health panel. Probed on the same 15s tick as health checks.
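The shared 15s tick could be driven by a loop like this sketch. The `probe` and `record` callables are placeholders for the adapter calls and the health_history write, and reporting a failed probe as "unreachable" is an assumed policy.

```python
import asyncio

SERVICES = ("backend", "backend-grpc", "session-manager", "postgres", "redis", "neo4j")
TICK_SECONDS = 15

async def probe_all(probe, services=SERVICES):
    """Probe every service concurrently; a raising probe is reported as unreachable."""
    results = await asyncio.gather(*(probe(s) for s in services), return_exceptions=True)
    return {
        name: ("unreachable" if isinstance(result, Exception) else result)
        for name, result in zip(services, results)
    }

async def probe_loop(probe, record, tick=TICK_SECONDS, max_ticks=None):
    """Run probe_all every `tick` seconds, handing each result set to `record`."""
    ticks = 0
    while max_ticks is None or ticks < max_ticks:
        await record(await probe_all(probe))
        ticks += 1
        # Skip the final sleep when a tick budget is set (used in tests).
        if max_ticks is None or ticks < max_ticks:
            await asyncio.sleep(tick)
```

Running the probes concurrently keeps one slow dependency from delaying the others within a tick.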

6. Data Store — Separate Postgres Schema

Dashboard uses the same Postgres instance but its own prism_dashboard schema. This gives it independent migrations and independent test cycles, with no need to restart the whole stack.

6.1 Schema: prism_dashboard

CREATE SCHEMA prism_dashboard;

-- Classified events from all three sources
CREATE TABLE prism_dashboard.log_event (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id       UUID NOT NULL,
    project_id      UUID,              -- nullable (some events are org-wide)
    agent_identity  VARCHAR(128),      -- required when agent-originated
    severity        VARCHAR(8) NOT NULL CHECK (severity IN ('error', 'warning', 'info')),
    category        VARCHAR(64) NOT NULL,
    title           VARCHAR(256) NOT NULL,
    detail          JSONB,
    source          VARCHAR(128) NOT NULL,
    source_stream   VARCHAR(32),       -- 'agent_events' | 'service_events' | 'probe'
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    acknowledged    BOOLEAN DEFAULT false,
    acknowledged_by VARCHAR(128),
    acknowledged_at TIMESTAMPTZ
);

CREATE INDEX idx_log_event_tenant_created ON prism_dashboard.log_event (tenant_id, created_at DESC);
CREATE INDEX idx_log_event_severity ON prism_dashboard.log_event (tenant_id, severity, created_at DESC);
CREATE INDEX idx_log_event_agent ON prism_dashboard.log_event (tenant_id, agent_identity, created_at DESC);
CREATE INDEX idx_log_event_category ON prism_dashboard.log_event (tenant_id, category, created_at DESC);
CREATE INDEX idx_log_event_project ON prism_dashboard.log_event (tenant_id, project_id, created_at DESC);

-- Health probe history
CREATE TABLE prism_dashboard.health_history (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id       UUID NOT NULL,
    service_name    VARCHAR(64) NOT NULL,
    status          VARCHAR(16) NOT NULL,  -- ok | degraded | unhealthy | unreachable
    latency_ms      INTEGER,
    cpu_percent     REAL,
    memory_mb       REAL,
    memory_limit_mb REAL,
    disk_usage_mb   REAL,
    restart_count   INTEGER,
    detail          JSONB,
    probed_at       TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX idx_health_history_service ON prism_dashboard.health_history (tenant_id, service_name, probed_at DESC);

-- Agent state snapshots (populated from SM push)
CREATE TABLE prism_dashboard.agent_state_snapshot (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id       UUID NOT NULL,
    agent_identity  VARCHAR(128) NOT NULL,
    agent_surface   VARCHAR(64),
    machine_id      VARCHAR(128),
    session_id      UUID,
    project_id      UUID,
    pid             VARCHAR(32),
    is_master       BOOLEAN DEFAULT false,
    event_type      VARCHAR(32) NOT NULL,  -- registration | deregistration | election | ...
    captured_at     TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX idx_agent_snapshot_identity ON prism_dashboard.agent_state_snapshot (tenant_id, agent_identity, captured_at DESC);

6.2 Schema Isolation

Dashboard’s Alembic chain lives in dashboard/migrations/ with version_table_schema='prism_dashboard'
No cross-schema foreign keys to public.*. Agent identity stored as plain VARCHAR, not FK to public.personas. Historical name preserved even if persona is renamed (correct audit behavior).
Schema-scoped DB user: USAGE + CREATE on prism_dashboard only. No rights to public. Defense in depth — dashboard can’t read tenants, projects, API keys, or any backend data.

7. Auth (v1 — API Key Only)

OAuth deferred to v2.

Mode	Flow
local	Dashboard reads API key from shared credentials volume. No login screen.
lan	Login screen. Operator pastes API key → encrypted session cookie (iron-session, AES-GCM, 24h TTL).
cloud	Same as LAN for v1. OAuth replaces this in v2.

Session secret: generated at install time by prism install. Non-local modes refuse to start if secret is changeme or empty.
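The startup guard could be sketched as below, assuming the mode strings local/lan/cloud from SPEC-019. The function name and the error type are illustrative.

```python
from typing import Optional

def check_session_secret(mode: str, secret: Optional[str]) -> None:
    """Raise at startup if a non-local mode has a missing or placeholder secret."""
    if mode == "local":
        return  # local mode has no login screen, so no session secret is required
    if secret is None or secret.strip() == "" or secret == "changeme":
        raise RuntimeError(
            f"refusing to start in {mode!r} mode: session secret is empty or 'changeme'"
        )
```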

7.1 Dashboard Health Endpoints

GET /dashboard/health — lightweight, for Docker healthcheck / orchestrator probes
GET /dashboard/api/health/current — comprehensive, for the operator UI (latest probe results across all monitored services + agent roster + container resource stats)

7.2 Session Expiry UX

When session cookie expires mid-page: page stays rendered, re-auth banner appears, writes (acknowledge) require re-auth. Read-mostly UX preserved.

8. SPA Frontend

Vanilla HTML + Chart.js. No React, no Vite, no build pipeline. Three files served by the dashboard backend: index.html, chart.umd.min.js (CDN or bundled), app.js.

8.1 Panel: Backend Health

Overall status badge (green/yellow/red) from dashboard’s own probes
Individual service cards (Backend, Backend-gRPC, Session Manager, Postgres, Redis, Neo4j) with status, latency, CPU/memory/disk
Active agent roster with identity / surface / machine / master status (from SM-pushed state)
Container restart counts and uptime
Auto-refresh via SSE from dashboard backend
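The overall badge could be derived from the per-service statuses stored in health_history (ok, degraded, unhealthy, unreachable). The roll-up rule below is an assumption, since the spec does not define it: any hard failure is red, any degradation is yellow, and no data is shown as yellow rather than green.

```python
RED_STATUSES = {"unhealthy", "unreachable"}

def overall_badge(statuses: dict) -> str:
    """Roll per-service statuses up into green / yellow / red (assumed rule)."""
    if not statuses:
        return "yellow"  # assumed: no probe data yet is not reported as healthy
    values = set(statuses.values())
    if values & RED_STATUSES:
        return "red"
    if values != {"ok"}:
        return "yellow"  # degraded, or any status this sketch does not recognize
    return "green"
```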

8.2 Panel: Metrics

Signal rate (sent/delivered/queued) — line chart, 1h/6h/24h toggle
Election events — bar chart
Active registrations — live counter
Wrap discipline rate — gauge with 0.60 floor line
Container resource trends (CPU/memory over time)

Data: SSE pushes from dashboard backend. Client accumulates points in-memory. No historical persistence beyond health_history table in v1.
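For reference, each SSE push is a small text frame. A sketch of the server-side framing follows; the event name "health" and the payload shape in the usage note are hypothetical.

```python
import json
from typing import Optional

def sse_frame(event: str, data: dict, event_id: Optional[str] = None) -> str:
    """Format one Server-Sent Events frame with a named event and JSON data."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    lines.append(f"event: {event}")
    # json.dumps escapes newlines, so the payload always fits on one data line.
    lines.append(f"data: {json.dumps(data, separators=(',', ':'))}")
    # A blank line terminates the frame.
    return "\n".join(lines) + "\n\n"
```

For example, sse_frame("health", {"status": "ok"}) yields the three lines `event: health`, `data: {"status":"ok"}`, and a terminating blank line, which the browser receives as one "health" event.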

8.3 Panel: Issue Log

Severity-colored rows (red/amber/blue for ERROR/WARNING/INFO)
Filter by: severity, category, project, agent identity, time range
Sort by: time, severity, agent, category
Search by title text
Acknowledge button (records who + when)
New issues appear at top via SSE
Expandable detail showing full JSONB payload
On-demand log drill-down: click an issue → fetch bounded log lines from the source container around that timestamp

9. Docker Integration

9.1 Compose Block (all three compose files)

  dashboard:
    build:
      context: ./dashboard
      dockerfile: Dockerfile
    image: prism-dashboard:server
    container_name: prism-server-dashboard
    restart: unless-stopped
    environment:
      PRISM_MODE: ${PRISM_MODE:-personal}
      PRISM_DASHBOARD_PORT: ${PRISM_DASHBOARD_PORT:-3000}
      PRISM_DASHBOARD_SESSION_SECRET: ${PRISM_DASHBOARD_SESSION_SECRET}
      PRISM_REDIS_URL: redis://:${PRISM_REDIS_PASSWORD:-prism_server}@redis:6379/0
      DATABASE_URL: postgresql+asyncpg://prism_dashboard:${PRISM_DASHBOARD_DB_PASSWORD:-prism_dashboard}@postgres:5432/prism
      PRISM_CREDENTIALS_PATH: /root/.prism/credentials.json
    ports:
      - "${PRISM_BIND_ADDR:-0.0.0.0}:${PRISM_DASHBOARD_HOST_PORT:-43000}:3000"
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
    volumes:
      - prism_server_credentials:/root/.prism:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro

9.2 Side-Effect: Fix SM Compose Comment

The session-manager service comment in docker-compose.server.yml currently says “optional / opt-in”. Per SPEC-049 §2, the SM is required infrastructure. Comment corrected to: “Phase 4 standalone deployment. Remove this block to use the in-tree manager path (Phases 1-3). The session manager as a concept is required — see SPEC-049 §2.”

10. Implementation Phases

Phase 1 — Health Probing + Health Panel (~1-2 weeks)

dashboard/ directory: Node/Express service, Dockerfile, compose blocks
prism_dashboard Postgres schema + Alembic chain (health_history, log_event, agent_state_snapshot)
ProbeAdapter interface + DockerSocketAdapter + HttpHealthAdapter
Periodic health probe loop (15s tick) writing to health_history
Container resource monitoring (CPU/memory/disk/restart counts)
Vanilla HTML SPA — Health panel only — reading current state via SSE
Mode-aware auth (local auto, LAN/cloud login with key paste, iron-session)
prism install integration (image build, env-secret generation, post-install URL)

Phase 2 — SM Push + Issue Log Panel

SM publishes to agent_events:{tenant_id} Redis Stream (coordinate with Donna for SM-side PR)
log_issue() helper in backend services emitting to service_events:{tenant_id}
Dashboard subscribes to both streams (consumer group dashboard)
Issue log panel with severity/category/agent/project filtering + acknowledge
Agent roster in Health panel populated from SM-pushed state
On-demand log drill-down via Docker socket
Instrument v1 call sites (10 points per §4.4)

Phase 3 — Metrics Panel + Polish

Metrics panel with live charts (Chart.js)
Container resource trend charts
Dark/light mode (system preference)
SSE startup stagger (health t+0, metrics t+5s, issues t+10s)
Metrics-parsed endpoint with 5s server-side cache
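The 5s server-side cache in the last item could be as simple as the single-value sketch below. The class name and the injectable clock are illustrative; the clock exists only to make expiry testable.

```python
import time
from typing import Any, Callable

class TTLCache:
    """Cache one computed value for ttl_seconds, recomputing after expiry."""

    def __init__(self, ttl_seconds: float = 5.0,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = None
        self._expires_at = None  # None means nothing has been cached yet

    def get(self, compute: Callable[[], Any]) -> Any:
        """Return the cached value, calling compute() only when stale."""
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            self._value = compute()
            self._expires_at = now + self._ttl
        return self._value
```

With this in front of the metrics parser, many open browser tabs cost at most one parse every five seconds.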

Phase 4 — Hardening + Docs

Rate limiting on dashboard endpoints
CSP headers
Cloud adapter scaffolding (KubernetesAdapter interface)
docs/dashboard.mdx
OAuth v2 prep (GitHub/Google callback handling in dashboard service)

11. Future: prism_issues MCP Verb

Out of scope for v1, but the /dashboard/api/issues read API is designed for dual-consumer support from day one. Clean filter params (severity, category, agent_identity, project_id, since, until, acknowledged, limit, offset), no UI-coupled response shape. A future prism_issues MCP verb taps the same data source so agents can query “why did the deploy fail at 3am” without opening a browser.
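A sketch of how the listed filter params might translate into a single parameterized query against log_event. It assumes asyncpg-style `$n` placeholders (suggested by the DATABASE_URL in §9.1); the page-size ceiling of 500 and the function name are assumptions.

```python
EQUALITY_FILTERS = ("severity", "category", "agent_identity", "project_id", "acknowledged")
MAX_LIMIT = 500  # hypothetical ceiling on page size

def build_issues_query(tenant_id, *, since=None, until=None, limit=100, offset=0, **filters):
    """Return (sql, params) for a filtered, tenant-scoped issue listing."""
    unknown = set(filters) - set(EQUALITY_FILTERS)
    if unknown:
        raise ValueError(f"unsupported filters: {sorted(unknown)}")
    clauses, params = ["tenant_id = $1"], [tenant_id]
    # Column names come only from the fixed tuple above, never from user input.
    for column in EQUALITY_FILTERS:
        if filters.get(column) is not None:
            params.append(filters[column])
            clauses.append(f"{column} = ${len(params)}")
    if since is not None:
        params.append(since)
        clauses.append(f"created_at >= ${len(params)}")
    if until is not None:
        params.append(until)
        clauses.append(f"created_at < ${len(params)}")
    params.append(max(1, min(limit, MAX_LIMIT)))
    limit_pos = len(params)
    params.append(max(0, offset))
    offset_pos = len(params)
    sql = (
        "SELECT * FROM prism_dashboard.log_event WHERE "
        + " AND ".join(clauses)
        + f" ORDER BY created_at DESC LIMIT ${limit_pos} OFFSET ${offset_pos}"
    )
    return sql, params
```

The same function could back both the UI endpoint and a future MCP verb, since it returns plain SQL and parameters with no UI-specific shaping.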

12. Non-Goals (Explicit Deferrals)

OAuth (GitHub/Google) — v2
Continuous log streaming / firehose ingestion — on-demand only
Datadog-scale APM — practical ops monitoring, not distributed tracing
Metrics persistence beyond health_history — live-only charts in v1
User management UI — v2
Project-level drill-down dashboards — v1 is org-scoped
Mobile-first layout — responsive enough for tablets

13. References

SPEC-019 — Environment Resolution Contract (modes: local/lan/cloud)
SPEC-030 — Multi-Prism Controller (registrations, elections, metrics)
SPEC-032 — Redis Session Plane (SM owns Redis, backend-only access)
SPEC-045 — Unified Session+Coordination Stream (WebSocket data plane)
SPEC-049 — Identity & Session Manager (single-writer rule, §2 non-negotiable)
ADR-027 — SM owns realtime state; Redis-fronted; Postgres is audit/durable
docs/metrics.mdx — existing counter catalog

14. Decisions Log

#	Decision	Source	Date
D1	Dashboard is a full observability service, not a thin display	Frank	2026-04-28
D2	Agents only know about SM; all agent observability flows SM → dashboard	Frank	2026-04-28
D3	SM pushes via Redis Streams; dashboard never polls	Frank + Porsche	2026-04-28
D4	Service-internal failures emit via `log_issue()` to `service_events` Redis stream	Porsche (B2), ratified Frank	2026-04-28
D5	Probe adapter pattern from day one for cloud readiness	Porsche, ratified Frank	2026-04-28
D6	Separate Postgres schema (prism_dashboard) for independent dev/test	Frank	2026-04-28
D7	On-demand log reading, not continuous streaming	Frank	2026-04-28
D8	Container impact monitoring (CPU/memory/disk) across local and LAN	Frank	2026-04-28
D9	agent_identity as first-class column on every log_event, sortable/filterable	Frank	2026-04-28
D10	Severity enum: ERROR / WARNING / INFO only	Frank	2026-04-28
D11	Vanilla HTML + Chart.js for SPA (no React/Vite build pipeline)	Porsche, ratified Frank	2026-04-28
D12	Port 43000 for dashboard	Lola + Porsche, ratified Frank	2026-04-28
D13	API-key auth v1; OAuth deferred to v2	Frank	2026-04-28
D14	Session cookie encrypted via iron-session (AES-GCM)	Porsche, ratified Frank	2026-04-28
D15	Session secret generated at install time; refuse to start if changeme	Porsche, ratified Frank	2026-04-28
D16	Session expiry UX: keep page rendered, show re-auth banner	Porsche, ratified Frank	2026-04-28
D17	BFF proxy pattern rejected — creates coupling	Frank	2026-04-28
