Status: draft · Version 0.3 · Filed 2026-04-28

SPEC-050 — Prism Dashboard — Org-Scoped Observability Service

Changelog

  • 0.3: Final architecture — dashboard is a full observability service. SM pushes agent state, dashboard actively probes containers, reads logs on-demand. Separate Postgres schema. Probe adapter pattern for cloud readiness. Redis Streams for SM→dashboard and service→dashboard event delivery. BFF pattern killed. Porsche’s recommendations ratified by Frank. Ready for implementation.
  • 0.2: Standalone Node/Express BFF proxy. Rejected — created unwanted coupling between dashboard and backend.
  • 0.1: Static SPA on FastAPI mount. Rejected — not its own service.

1. Problem

Prism operators have no persistent visual surface for monitoring backend health. The existing tools require either manual curl, a separate Grafana stack, or an active AI session. The operator needs a single persistent entry point — pinned to their org, reachable from any browser — that shows whether things are healthy, what’s happening, and what’s gone wrong across all three deployment modes.

2. Goals

  1. Single URL, any mode. Works whether the backend runs on localhost, a LAN server, or in the cloud.
  2. Org-scoped. Pinned to a Prism org (tenant). Shows health, metrics, and issues across all projects.
  3. Full observability service. Its own container with its own responsibilities: container health probing, log reading, event classification, and persistence. Not a proxy, not a thin display.
  4. Three panels: Backend Health, Metrics, Issue Log.
  5. SM-push for agent state. Agents only know about the Session Manager. All agent observability flows SM → dashboard. Dashboard never polls for agent state.
  6. Cloud-ready from day one. Probe adapter pattern ensures the architecture adapts to cloud deployment without structural changes.

3. Architecture

3.1 Data Flow

Agents (Donna, Candi, Lafonda, etc.)

    │ prism_start / wrap / heartbeat / all verbs

Session Manager (single source of truth for agent/session state)

    │ Redis Stream: agent_events:{tenant_id}
    │ (registration, deregistration, election, heartbeat, lifecycle)

Dashboard Backend (observability service — own container)

    ├── Subscribes to agent_events stream from SM
    ├── Subscribes to service_events stream from backend services
    ├── Actively probes container health (all 6 services)
    ├── Reads service logs ON DEMAND (not continuous)
    ├── Monitors container resource impact (CPU/memory/disk)
    ├── Classifies events as ERROR / WARNING / INFO
    ├── Persists to its own Postgres schema (prism_dashboard)

    │ Serves SPA + pushes live state via SSE

Browser (renders what it receives)

3.2 Key Rules

  1. Agents only know about the SM. No agent-to-dashboard path. All agent observability flows SM → dashboard via Redis Streams.
  2. SM pushes to dashboard. Dashboard never polls SM or backend for agent state. SM publishes to agent_events:{tenant_id} stream on every state change.
  3. Dashboard actively health-checks all containers. It’s the watchdog. Probes backend, backend-grpc, session-manager, postgres, redis, neo4j on a 15s tick.
  4. Log reading is on-demand. Operator investigates an issue → dashboard reads bounded log lines from the relevant container. Not a continuous firehose. We are not recreating Datadog.
  5. Container impact monitoring. Dashboard tracks CPU, memory, disk pressure across all containers. Practical ops monitoring for local and LAN modes.
  6. Service-internal failures flow via Redis Streams. Backend services emit to service_events:{tenant_id} via a slim log_issue() helper. Dashboard subscribes. Services don’t know about the dashboard — they publish to a shared transport.

3.3 Container Topology (8 containers)

┌──────────────────────────────────────────────────────────────────┐
│  Docker Compose Stack                                            │
│                                                                  │
│  ┌───────────────┐                                               │
│  │  Dashboard     │◀── Redis Streams (agent_events,              │
│  │  :3000         │    service_events)                            │
│  │  (observability│──▶ Docker socket (probes, logs, stats)        │
│  │   service)     │──▶ Postgres prism_dashboard schema            │
│  └───────────────┘                                               │
│        ▲                                                         │
│   Browser (SPA)                                                  │
│                                                                  │
│  ┌───────────────┐  ┌───────────────┐  ┌───────────────────────┐ │
│  │  Backend       │  │  Backend-gRPC  │  │  Session Manager      │ │
│  │  :8000         │  │  :50051        │  │  :41766               │ │
│  └───────────────┘  └───────────────┘  └───────────────────────┘ │
│         │                  │                     │                │
│  ┌──────┴──────────────────┴─────────────────────┘               │
│  │                                                               │
│  │  ┌─────────┐  ┌─────────┐  ┌─────────┐                       │
│  │  │Postgres  │  │ Redis   │  │ Neo4j   │                       │
│  │  │:5432     │  │ :6379   │  │ :7687   │                       │
│  │  └─────────┘  └─────────┘  └─────────┘                       │
│  │                                                               │
└──┴───────────────────────────────────────────────────────────────┘
Session Manager is required infrastructure per SPEC-049 §2. The compose comment describing it as “optional” is a documentation bug, corrected in §9.2.

3.4 What the Dashboard Is NOT

  • NOT a BFF proxy to the Prism backend
  • NOT a log aggregation firehose (reads on-demand, not continuous)
  • NOT Datadog/Grafana — practical ops monitoring for a Prism deployment
  • NOT a polling consumer of backend APIs

4. Event Delivery — Three Streams

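The dashboard consumes events from three sources. The first two (agent events and service events) arrive over Redis Streams and are read through consumer groups, which gives at-least-once delivery and backlog drain on restart. The third (the dashboard’s own probes) is written directly to Postgres and needs no stream.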
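The consumer-group pattern for the two Redis streams can be sketched as follows. This is a minimal illustration, not the dashboard's implementation: `drain_stream` and its parameters are hypothetical names, the `client` is assumed to expose redis-py style `xreadgroup` and `xack` calls, and the reply shape assumed is the RESP2 list form (`[[stream, [(id, fields), ...]]]`).

```python
from typing import Any, Callable


def drain_stream(client: Any, stream: str, group: str, consumer: str,
                 handle: Callable[[str, dict], None], count: int = 100) -> int:
    """Process this consumer's unacknowledged backlog, then one batch of new entries."""
    processed = 0
    # "0" replays entries that were delivered to this consumer but never
    # acknowledged (the backlog left by a crash or restart).
    # ">" asks for entries that no consumer in the group has seen yet.
    for cursor in ("0", ">"):
        while True:
            reply = client.xreadgroup(group, consumer, {stream: cursor}, count=count)
            entries = reply[0][1] if reply else []
            if not entries:
                break
            for entry_id, fields in entries:
                handle(entry_id, fields)
                # Acknowledge only after the handler returns: if it raises,
                # the entry stays pending and is replayed on the next drain.
                client.xack(stream, group, entry_id)
                processed += 1
            if cursor == ">":
                break  # one batch of new entries per call
    return processed
```

A long-running subscriber would call this in a loop (typically with a blocking read for the `">"` phase) against `agent_events:{tenant_id}` and `service_events:{tenant_id}`. Because acknowledgement follows handling, a handler may see the same entry twice and should be idempotent.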

4.1 Agent Events (SM → Dashboard)

Stream: agent_events:{tenant_id}
Publisher: Session Manager (on every state change)
Events: registration, deregistration, election (master claim/preempt), heartbeat freshness changes, identity conflicts, session lifecycle (start/wrap/checkpoint)
Each event carries:
{
  "event_type": "registration | deregistration | election | heartbeat_stale | ...",
  "agent_identity": "Donna",
  "agent_surface": "claude_code",
  "machine_id": "mini3.home.lan",
  "session_id": "...",
  "project_id": "...",
  "pid": "PID-PGR01",
  "timestamp": "2026-04-28T12:00:00Z",
  "detail": { ... }
}

4.2 Service Events (Backend Services → Dashboard)

Stream: service_events:{tenant_id}
Publisher: Any backend service via log_issue() helper
Events: signal delivery failures, auth failures, election anomalies, health check failures, wrap discipline breaches, migration drift, deploy drift, gRPC disconnects, dashboard auth failures
Each event carries:
{
  "severity": "error | warning | info",
  "category": "signal | auth | election | health | drift | deploy | session",
  "agent_identity": "Donna",
  "title": "Signal delivery failed after 3 retries",
  "detail": { "signal_id": "...", "to": "Candi", "last_error": "..." },
  "source": "signal_service.send",
  "project_id": "...",
  "pid": "PID-PGR01",
  "timestamp": "2026-04-28T12:00:00Z"
}
The log_issue() helper takes keyword arguments. A typical call:
await log_issue(
    tenant_id=ctx.tenant_id,
    agent_identity="Donna",
    severity="error",
    category="signal",
    title="Signal delivery failed after 3 retries",
    detail={"signal_id": sid, "to": target, "last_error": str(e)},
    source="signal_service.send",
    project_id=pid,
)
Services don’t know about the dashboard. They XADD to a Redis stream. Dashboard happens to be a subscriber.
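One way the helper itself could look is sketched below. This is an assumption-laden illustration rather than the committed implementation: the explicit `redis` parameter is hypothetical (the call shown above does not pass a client, so the real helper presumably obtains one elsewhere), and the client is assumed to expose a redis-py asyncio style `xadd(name, fields)`.

```python
import json
from datetime import datetime, timezone
from typing import Any, Optional

SEVERITIES = ("error", "warning", "info")


async def log_issue(redis: Any, *, tenant_id: str, severity: str, category: str,
                    title: str, source: str, agent_identity: Optional[str] = None,
                    detail: Optional[dict] = None, project_id: Optional[str] = None) -> str:
    """Publish one issue to service_events:{tenant_id}; returns the stream entry ID."""
    if severity not in SEVERITIES:
        raise ValueError(f"severity must be one of {SEVERITIES}, got {severity!r}")
    fields = {
        "severity": severity,
        "category": category,
        "title": title,
        "source": source,
        "agent_identity": agent_identity,
        "project_id": project_id,
        # Stream fields are flat values, so the nested detail is JSON-encoded.
        "detail": json.dumps(detail) if detail is not None else None,
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    # Redis cannot store None, so optional fields that were not given are dropped.
    fields = {key: value for key, value in fields.items() if value is not None}
    return await redis.xadd(f"service_events:{tenant_id}", fields)
```

The helper only names the stream; nothing in it refers to the dashboard, which keeps the publisher side decoupled as described above.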

4.3 Dashboard’s Own Probes

The dashboard generates its own events from active health checking and container monitoring. These are written directly to prism_dashboard.log_event (no Redis stream needed — dashboard is both producer and consumer).

4.4 Committed v1 log_issue() Call Sites (10 points)

  1. signal_service.send → delivery failure after retry exhaustion (error)
  2. controller_service → election anomalies, master preemption conflicts (warning)
  3. controller_service → identity_conflict per SPEC-038 (warning)
  4. auth/* → failed OAuth callbacks, API-key auth failures (warning)
  5. health probes → any dependency check failing (error)
  6. wrap discipline → rate-floor breach below 0.60 (info)
  7. migrations → Alembic drift detected at startup (error)
  8. deploy/upgrade → container image SHA mismatch (warning)
  9. gRPC stream → disconnect/reconnect events (warning)
  10. dashboard auth → auth failures in the dashboard service itself (warning)

5. Probe Adapter Pattern (Cloud-Ready)

The dashboard probes containers through an abstract adapter interface — never coupled directly to Docker socket code paths.
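from __future__ import annotations  # result types (ContainerStatus, ...) are defined elsewhere
from typing import Protocol

class ProbeAdapter(Protocol):
    """Interface every probe backend (Docker socket, HTTP, future cloud adapters) implements."""
    async def container_status(self, name: str) -> ContainerStatus: ...
    async def container_stats(self, name: str) -> ResourceStats: ...
    async def container_logs(self, name: str, since: str, until: str, limit: int) -> list[LogLine]: ...
    async def host_stats(self) -> HostStats: ...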

5.1 v1: DockerSocketAdapter + HttpHealthAdapter

  • DockerSocketAdapter — mounts /var/run/docker.sock, provides container status, CPU/memory/disk stats, restart counts, bounded log reads
  • HttpHealthAdapter — calls each service’s /health/liveness endpoint for application-level health confirmation
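As an illustration of what the DockerSocketAdapter has to do with raw stats, the sketch below converts one Docker Engine stats payload into the CPU and memory figures stored in health_history. The function name `resource_stats` is hypothetical; the field names (`cpu_stats`, `precpu_stats`, `system_cpu_usage`, `online_cpus`, `memory_stats`) follow the Docker Engine API stats response, and the CPU formula is the delta-based one described in that API's documentation. Cache subtraction for memory is deliberately left out to keep the sketch short.

```python
def resource_stats(stats: dict) -> dict:
    """Derive cpu_percent, memory_mb and memory_limit_mb from one stats sample."""
    cpu, pre = stats["cpu_stats"], stats["precpu_stats"]
    cpu_delta = cpu["cpu_usage"]["total_usage"] - pre["cpu_usage"]["total_usage"]
    system_delta = cpu.get("system_cpu_usage", 0) - pre.get("system_cpu_usage", 0)

    cpu_percent = 0.0
    if cpu_delta > 0 and system_delta > 0:
        # online_cpus can be missing on older daemons; fall back to the
        # per-CPU usage list, then to a single CPU.
        online = (cpu.get("online_cpus")
                  or len(cpu["cpu_usage"].get("percpu_usage") or [])
                  or 1)
        cpu_percent = cpu_delta / system_delta * online * 100.0

    memory = stats.get("memory_stats", {})
    mib = 1024 * 1024
    return {
        "cpu_percent": cpu_percent,
        "memory_mb": memory.get("usage", 0) / mib,
        "memory_limit_mb": memory.get("limit", 0) / mib,
    }
```

A container that used half of the elapsed system CPU time on a 2-CPU host reports 100%, matching the convention used by `docker stats`.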

5.2 Future: Cloud Adapters

When deploying to cloud, swap in KubernetesAdapter (kubelet API + metrics-server) or ECSAdapter (CloudWatch + ECS API). Single factory line change. No structural rewrite.
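The "single factory line" could look roughly like the sketch below. Everything beyond the adapter class names is an assumption made for illustration: the `PRISM_PROBE_ADAPTER` environment variable, the registry keys, and the stub class bodies are all hypothetical.

```python
import os
from typing import Optional


class DockerSocketAdapter:
    """Stub standing in for the v1 Docker socket adapter."""


class KubernetesAdapter:
    """Stub standing in for a future Kubernetes adapter."""


class ECSAdapter:
    """Stub standing in for a future ECS adapter."""


# Registry of known adapters; adding a cloud target means adding one entry.
ADAPTERS = {
    "docker": DockerSocketAdapter,
    "kubernetes": KubernetesAdapter,
    "ecs": ECSAdapter,
}


def make_probe_adapter(kind: Optional[str] = None):
    """Instantiate the adapter named by `kind` or by PRISM_PROBE_ADAPTER (hypothetical)."""
    kind = kind or os.environ.get("PRISM_PROBE_ADAPTER", "docker")
    try:
        return ADAPTERS[kind]()
    except KeyError:
        raise ValueError(f"unknown probe adapter {kind!r}; expected one of {sorted(ADAPTERS)}")
```

The rest of the service depends only on the ProbeAdapter interface, so the choice made here is the only place a deployment target is named.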

5.3 On-Demand Log Reading

Bounded reads only. The adapter enforces since, until, and limit parameters. Dashboard never loads full container logs into memory. UI sends time-bounded requests: “give me 5 min around this issue, max 500 lines.”
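A small sketch of how a time-bounded request might be derived from an issue timestamp. The helper name `log_window` and the `MAX_LOG_LINES` constant are hypothetical, and the sketch assumes that "5 min around this issue" means a 5-minute window centered on the issue (2.5 minutes either side); the spec does not pin this down.

```python
from datetime import datetime, timedelta, timezone

MAX_LOG_LINES = 500  # hard cap taken from the "max 500 lines" example above


def log_window(issue_ts: datetime, minutes: int = 5, limit: int = MAX_LOG_LINES):
    """Return (since, until, limit) arguments for a bounded container_logs() call."""
    half = timedelta(minutes=minutes) / 2
    # Clamp whatever the UI asks for so a request can never exceed the cap.
    bounded_limit = max(1, min(limit, MAX_LOG_LINES))
    return (issue_ts - half).isoformat(), (issue_ts + half).isoformat(), bounded_limit


since, until, limit = log_window(datetime(2026, 4, 28, 12, 0, tzinfo=timezone.utc))
```

The three values map directly onto the `since`, `until`, and `limit` parameters of `ProbeAdapter.container_logs`.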

5.4 Container Impact Monitoring

Dashboard tracks per-container: CPU %, memory usage/limit, disk I/O, restart count, uptime. Surfaced as resource cards in the Health panel. Probed on the same 15s tick as health checks.
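The 15s tick could be driven by a loop like the one sketched below. It is illustrative only: `probe_once`, `probe_loop`, the `record` callback, and the `ticks` parameter are hypothetical names, the service names are taken from the probe list in §3.2, and recording a failed probe as `unreachable` is an assumption based on the status values in the health_history schema.

```python
import asyncio
from typing import Any, Awaitable, Callable, Optional

SERVICES = ("backend", "backend-grpc", "session-manager", "postgres", "redis", "neo4j")
Recorder = Callable[[str, Any], Awaitable[None]]


async def probe_once(adapter: Any, record: Recorder) -> None:
    """Probe every service concurrently; one failing probe never blocks the others."""
    async def probe(name: str) -> None:
        try:
            status = await adapter.container_status(name)
        except Exception:
            status = "unreachable"  # the probe itself failed
        await record(name, status)

    await asyncio.gather(*(probe(name) for name in SERVICES))


async def probe_loop(adapter: Any, record: Recorder, interval: float = 15.0,
                     ticks: Optional[int] = None) -> None:
    """Run probe_once every `interval` seconds; `ticks` bounds the loop for tests."""
    loop = asyncio.get_running_loop()
    done = 0
    while ticks is None or done < ticks:
        started = loop.time()
        await probe_once(adapter, record)
        done += 1
        # Sleep only for the remainder of the tick so slow probes do not drift.
        await asyncio.sleep(max(0.0, interval - (loop.time() - started)))
```

In the service, `record` would write a row to prism_dashboard.health_history; resource stats can be gathered in the same pass by also calling `container_stats`.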

6. Data Store — Separate Postgres Schema

Dashboard uses the same Postgres instance but its own prism_dashboard schema. Independent migrations, independent test cycles, no rebooting the whole stack.

6.1 Schema: prism_dashboard

CREATE SCHEMA prism_dashboard;

-- Classified events from all three sources
CREATE TABLE prism_dashboard.log_event (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id       UUID NOT NULL,
    project_id      UUID,              -- nullable (some events are org-wide)
    agent_identity  VARCHAR(128),      -- required when agent-originated
    severity        VARCHAR(8) NOT NULL CHECK (severity IN ('error', 'warning', 'info')),
    category        VARCHAR(64) NOT NULL,
    title           VARCHAR(256) NOT NULL,
    detail          JSONB,
    source          VARCHAR(128) NOT NULL,
    source_stream   VARCHAR(32),       -- 'agent_events' | 'service_events' | 'probe'
    created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
    acknowledged    BOOLEAN DEFAULT false,
    acknowledged_by VARCHAR(128),
    acknowledged_at TIMESTAMPTZ
);

CREATE INDEX idx_log_event_tenant_created ON prism_dashboard.log_event (tenant_id, created_at DESC);
CREATE INDEX idx_log_event_severity ON prism_dashboard.log_event (tenant_id, severity, created_at DESC);
CREATE INDEX idx_log_event_agent ON prism_dashboard.log_event (tenant_id, agent_identity, created_at DESC);
CREATE INDEX idx_log_event_category ON prism_dashboard.log_event (tenant_id, category, created_at DESC);
CREATE INDEX idx_log_event_project ON prism_dashboard.log_event (tenant_id, project_id, created_at DESC);

-- Health probe history
CREATE TABLE prism_dashboard.health_history (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id       UUID NOT NULL,
    service_name    VARCHAR(64) NOT NULL,
    status          VARCHAR(16) NOT NULL,  -- ok | degraded | unhealthy | unreachable
    latency_ms      INTEGER,
    cpu_percent     REAL,
    memory_mb       REAL,
    memory_limit_mb REAL,
    disk_usage_mb   REAL,
    restart_count   INTEGER,
    detail          JSONB,
    probed_at       TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX idx_health_history_service ON prism_dashboard.health_history (tenant_id, service_name, probed_at DESC);

-- Agent state snapshots (populated from SM push)
CREATE TABLE prism_dashboard.agent_state_snapshot (
    id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    tenant_id       UUID NOT NULL,
    agent_identity  VARCHAR(128) NOT NULL,
    agent_surface   VARCHAR(64),
    machine_id      VARCHAR(128),
    session_id      UUID,
    project_id      UUID,
    pid             VARCHAR(32),
    is_master       BOOLEAN DEFAULT false,
    event_type      VARCHAR(32) NOT NULL,  -- registration | deregistration | election | ...
    captured_at     TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE INDEX idx_agent_snapshot_identity ON prism_dashboard.agent_state_snapshot (tenant_id, agent_identity, captured_at DESC);

6.2 Schema Isolation

  • Dashboard’s Alembic chain lives in dashboard/migrations/ with version_table_schema='prism_dashboard'
  • No cross-schema foreign keys to public.*. Agent identity stored as plain VARCHAR, not FK to public.personas. Historical name preserved even if persona is renamed (correct audit behavior).
  • Schema-scoped DB user: USAGE + CREATE on prism_dashboard only. No rights to public. Defense in depth — dashboard can’t read tenants, projects, API keys, or any backend data.

7. Auth (v1 — API Key Only)

OAuth deferred to v2.
| Mode | Flow |
| --- | --- |
| local | Dashboard reads API key from shared credentials volume. No login screen. |
| lan | Login screen. Operator pastes API key → encrypted session cookie (iron-session, AES-GCM, 24h TTL). |
| cloud | Same as LAN for v1. OAuth replaces this in v2. |
Session secret: generated at install time by prism install. Non-local modes refuse to start if secret is changeme or empty.
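The startup guard can be sketched in a few lines. The function name `check_session_secret` is hypothetical, and the sketch assumes the mode strings are `local`, `lan`, and `cloud` as listed in the table above; any other value is treated as non-local.

```python
from typing import Optional

INSECURE_SECRETS = {"", "changeme"}


def check_session_secret(mode: str, secret: Optional[str]) -> None:
    """Raise at startup if a non-local mode is configured with a placeholder secret."""
    if mode == "local":
        return  # local mode has no login screen and no session cookie
    if (secret or "").strip() in INSECURE_SECRETS:
        raise RuntimeError(
            f"PRISM_DASHBOARD_SESSION_SECRET is empty or 'changeme' in {mode!r} mode; "
            "refusing to start"
        )
```

Calling this before the HTTP listener is bound makes a misconfigured LAN or cloud deployment fail fast instead of serving sessions signed with a guessable secret.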

7.1 Dashboard Health Endpoints

  • GET /dashboard/health — lightweight, for Docker healthcheck / orchestrator probes
  • GET /dashboard/api/health/current — comprehensive, for the operator UI (latest probe results across all monitored services + agent roster + container resource stats)

7.2 Session Expiry UX

When session cookie expires mid-page: page stays rendered, re-auth banner appears, writes (acknowledge) require re-auth. Read-mostly UX preserved.

8. SPA Frontend

Vanilla HTML + Chart.js. No React, no Vite, no build pipeline. Three files served by the dashboard backend: index.html, chart.umd.min.js (CDN or bundled), app.js.

8.1 Panel: Backend Health

  • Overall status badge (green/yellow/red) from dashboard’s own probes
  • Individual service cards (Backend, Backend-gRPC, Session Manager, Postgres, Redis, Neo4j) with status, latency, CPU/memory/disk
  • Active agent roster with identity / surface / machine / master status (from SM-pushed state)
  • Container restart counts and uptime
  • Auto-refresh via SSE from dashboard backend
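For reference, a server-sent event is a small text frame. The sketch below shows one way the backend could serialize a state update; `format_sse` and the `health` event name are hypothetical, while the `id:` / `event:` / `data:` field layout and the blank-line terminator follow the server-sent events wire format.

```python
import json
from typing import Optional


def format_sse(event: str, data: dict, event_id: Optional[str] = None) -> str:
    """Serialize one update as a server-sent event frame."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")  # lets the browser resume via Last-Event-ID
    lines.append(f"event: {event}")
    # json.dumps escapes newlines, so the payload always fits on one data line.
    lines.append("data: " + json.dumps(data, separators=(",", ":")))
    return "\n".join(lines) + "\n\n"  # a blank line terminates the frame


frame = format_sse("health", {"service": "redis", "status": "ok"})
```

On the browser side, an `EventSource` listener registered for the same event name receives the JSON string in `event.data`.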

8.2 Panel: Metrics

  • Signal rate (sent/delivered/queued) — line chart, 1h/6h/24h toggle
  • Election events — bar chart
  • Active registrations — live counter
  • Wrap discipline rate — gauge with 0.60 floor line
  • Container resource trends (CPU/memory over time)
Data: SSE pushes from dashboard backend. Client accumulates points in-memory. No historical persistence beyond health_history table in v1.

8.3 Panel: Issue Log

  • Severity-colored rows (red/amber/blue for ERROR/WARNING/INFO)
  • Filter by: severity, category, project, agent identity, time range
  • Sort by: time, severity, agent, category
  • Search by title text
  • Acknowledge button (records who + when)
  • New issues appear at top via SSE
  • Expandable detail showing full JSONB payload
  • On-demand log drill-down: click an issue → fetch bounded log lines from the source container around that timestamp

9. Docker Integration

9.1 Compose Block (all three compose files)

  dashboard:
    build:
      context: ./dashboard
      dockerfile: Dockerfile
    image: prism-dashboard:server
    container_name: prism-server-dashboard
    restart: unless-stopped
    environment:
      PRISM_MODE: ${PRISM_MODE:-personal}
      PRISM_DASHBOARD_PORT: ${PRISM_DASHBOARD_PORT:-3000}
      PRISM_DASHBOARD_SESSION_SECRET: ${PRISM_DASHBOARD_SESSION_SECRET}
      PRISM_REDIS_URL: redis://:${PRISM_REDIS_PASSWORD:-prism_server}@redis:6379/0
      DATABASE_URL: postgresql+asyncpg://prism_dashboard:${PRISM_DASHBOARD_DB_PASSWORD:-prism_dashboard}@postgres:5432/prism
      PRISM_CREDENTIALS_PATH: /root/.prism/credentials.json
    ports:
      - "${PRISM_BIND_ADDR:-0.0.0.0}:${PRISM_DASHBOARD_HOST_PORT:-43000}:3000"
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy
    volumes:
      - prism_server_credentials:/root/.prism:ro
      - /var/run/docker.sock:/var/run/docker.sock:ro

9.2 Side-Effect: Fix SM Compose Comment

The session-manager service comment in docker-compose.server.yml currently says “optional / opt-in”. Per SPEC-049 §2, the SM is required infrastructure. Comment corrected to: “Phase 4 standalone deployment. Remove this block to use the in-tree manager path (Phases 1-3). The session manager as a concept is required — see SPEC-049 §2.”

10. Implementation Phases

Phase 1 — Health Probing + Health Panel (~1-2 weeks)

  • dashboard/ directory: Node/Express service, Dockerfile, compose blocks
  • prism_dashboard Postgres schema + Alembic chain (health_history, log_event, agent_state_snapshot)
  • ProbeAdapter interface + DockerSocketAdapter + HttpHealthAdapter
  • Periodic health probe loop (15s tick) writing to health_history
  • Container resource monitoring (CPU/memory/disk/restart counts)
  • Vanilla HTML SPA — Health panel only — reading current state via SSE
  • Mode-aware auth (local auto, LAN/cloud login with key paste, iron-session)
  • prism install integration (image build, env-secret generation, post-install URL)

Phase 2 — SM Push + Issue Log Panel

  • SM publishes to agent_events:{tenant_id} Redis Stream (coordinate with Donna for SM-side PR)
  • log_issue() helper in backend services emitting to service_events:{tenant_id}
  • Dashboard subscribes to both streams (consumer group dashboard)
  • Issue log panel with severity/category/agent/project filtering + acknowledge
  • Agent roster in Health panel populated from SM-pushed state
  • On-demand log drill-down via Docker socket
  • Instrument v1 call sites (10 points per §4.4)

Phase 3 — Metrics Panel + Polish

  • Metrics panel with live charts (Chart.js)
  • Container resource trend charts
  • Dark/light mode (system preference)
  • SSE startup stagger (health t+0, metrics t+5s, issues t+10s)
  • Metrics-parsed endpoint with 5s server-side cache

Phase 4 — Hardening + Docs

  • Rate limiting on dashboard endpoints
  • CSP headers
  • Cloud adapter scaffolding (KubernetesAdapter interface)
  • docs/dashboard.mdx
  • OAuth v2 prep (GitHub/Google callback handling in dashboard service)

11. Future: prism_issues MCP Verb

Out of scope for v1, but the /dashboard/api/issues read API is designed for dual-consumer support from day one. Clean filter params (severity, category, agent_identity, project_id, since, until, acknowledged, limit, offset), no UI-coupled response shape. A future prism_issues MCP verb taps the same data source so agents can query “why did the deploy fail at 3am” without opening a browser.
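To show how the filter parameters could map onto prism_dashboard.log_event without coupling to the UI, here is a hedged sketch of a query builder. `build_issue_query` and the 500-row cap are hypothetical; the column names come from the schema in §6.1, and the `:name` placeholders assume SQLAlchemy `text()` style bind parameters, which is a guess based on the `postgresql+asyncpg` URL in the compose block.

```python
from typing import Any, Dict, Optional, Tuple

# Only these filters are accepted; column names never come from user input.
FILTER_COLUMNS = ("severity", "category", "agent_identity", "project_id", "acknowledged")
MAX_LIMIT = 500  # hypothetical page-size cap


def build_issue_query(tenant_id: str, *, since: Optional[str] = None,
                      until: Optional[str] = None, limit: int = 100, offset: int = 0,
                      **filters: Any) -> Tuple[str, Dict[str, Any]]:
    """Return (sql, params) for a tenant-scoped, filtered read of log_event."""
    clauses = ["tenant_id = :tenant_id"]
    params: Dict[str, Any] = {"tenant_id": tenant_id}
    for name, value in filters.items():
        if name not in FILTER_COLUMNS:
            raise ValueError(f"unsupported filter: {name}")
        if value is not None:
            clauses.append(f"{name} = :{name}")
            params[name] = value
    if since is not None:
        clauses.append("created_at >= :since")
        params["since"] = since
    if until is not None:
        clauses.append("created_at < :until")
        params["until"] = until
    params["limit"] = max(1, min(limit, MAX_LIMIT))
    params["offset"] = max(0, offset)
    sql = ("SELECT * FROM prism_dashboard.log_event WHERE " + " AND ".join(clauses)
           + " ORDER BY created_at DESC LIMIT :limit OFFSET :offset")
    return sql, params
```

Because the builder returns plain SQL and parameters, the same function could back both the HTTP endpoint and a future MCP verb.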

12. Non-Goals (Explicit Deferrals)

  • OAuth (GitHub/Google) — v2
  • Continuous log streaming / firehose ingestion — on-demand only
  • Datadog-scale APM — practical ops monitoring, not distributed tracing
  • Metrics persistence beyond health_history — live-only charts in v1
  • User management UI — v2
  • Project-level drill-down dashboards — v1 is org-scoped
  • Mobile-first layout — responsive enough for tablets

13. References

  • SPEC-019 — Environment Resolution Contract (modes: local/lan/cloud)
  • SPEC-030 — Multi-Prism Controller (registrations, elections, metrics)
  • SPEC-032 — Redis Session Plane (SM owns Redis, backend-only access)
  • SPEC-045 — Unified Session+Coordination Stream (WebSocket data plane)
  • SPEC-049 — Identity & Session Manager (single-writer rule, §2 non-negotiable)
  • ADR-027 — SM owns realtime state; Redis-fronted; Postgres is audit/durable
  • docs/metrics.mdx — existing counter catalog

14. Decisions Log

| # | Decision | Source | Date |
| --- | --- | --- | --- |
| D1 | Dashboard is a full observability service, not a thin display | Frank | 2026-04-28 |
| D2 | Agents only know about SM; all agent observability flows SM → dashboard | Frank | 2026-04-28 |
| D3 | SM pushes via Redis Streams; dashboard never polls | Frank + Porsche | 2026-04-28 |
| D4 | Service-internal failures emit via log_issue() to service_events Redis stream | Porsche (B2), ratified Frank | 2026-04-28 |
| D5 | Probe adapter pattern from day one for cloud readiness | Porsche, ratified Frank | 2026-04-28 |
| D6 | Separate Postgres schema (prism_dashboard) for independent dev/test | Frank | 2026-04-28 |
| D7 | On-demand log reading, not continuous streaming | Frank | 2026-04-28 |
| D8 | Container impact monitoring (CPU/memory/disk) across local and LAN | Frank | 2026-04-28 |
| D9 | agent_identity as first-class column on every log_event, sortable/filterable | Frank | 2026-04-28 |
| D10 | Severity enum: ERROR / WARNING / INFO only | Frank | 2026-04-28 |
| D11 | Vanilla HTML + Chart.js for SPA (no React/Vite build pipeline) | Porsche, ratified Frank | 2026-04-28 |
| D12 | Port 43000 for dashboard | Lola + Porsche, ratified Frank | 2026-04-28 |
| D13 | API-key auth v1; OAuth deferred to v2 | Frank | 2026-04-28 |
| D14 | Session cookie encrypted via iron-session (AES-GCM) | Porsche, ratified Frank | 2026-04-28 |
| D15 | Session secret generated at install time; refuse to start if changeme | Porsche, ratified Frank | 2026-04-28 |
| D16 | Session expiry UX: keep page rendered, show re-auth banner | Porsche, ratified Frank | 2026-04-28 |
| D17 | BFF proxy pattern rejected: creates coupling | Frank | 2026-04-28 |
Last modified on April 29, 2026