Status: draft · Version 0.2 · Filed 2026-05-01

SPEC-061 v0.2 — Drill-Down Detail Panels

Version

0.2

Status

draft

Parent / Depends-On

  • SPEC-061 v0.1 — RED+USE Saturation & DB Performance (the cards this SPEC adds drill-downs to).
  • SPEC-050 v0.3 — Prism Dashboard observability service (the dashboard this lives in).

Changelog

  • 0.2: Added drill-down panels for every Saturation row + a top-level “current incidents” view triggered by the rollup header. Each panel fetches detail on demand from a new family of backend endpoints under /db_health/<topic>. v0.1 left rows mute except for hover tooltips; this closes the question “which service is being warned about and what should I do?” with a one-click answer.

1. Problem

The v0.1 Saturation card surfaces that something is hot but answers no follow-on question:
  • Signal pipeline · 7 · oldest 140082s — which signals? from whom, to whom, why are they stuck?
  • Postgres · pool 60% — which connections are open, what queries are they running, is autovacuum behind?
  • Heartbeat lag · p95 41s — which controllers are stale?
Today the operator’s only recourse is to ssh into server1 and run psql / redis-cli / kubectl logs, which is exactly the kind of friction the dashboard exists to remove. Tooltips help but are not enough; sustained operator focus belongs in a panel, not a 200 ms hover.

2. Goals

  1. Every Saturation row is clickable. Click → modal panel with row-specific detail.
  2. Header rollup is clickable. Click → “current incidents” view: every warn/hot row, what each means in plain English, and the suggested next action.
  3. Detail data fetched on demand. No new periodic load on the backend; the modal makes one fetch when opened.
  4. Plain-English explanations. Each panel includes a one-paragraph “what this means and why it matters” header (so an operator who hasn’t read SPEC-061 can act on it).
  5. No destructive actions in v0.2. Force-drain, force-expire, and similar operator tools are scoped but not built here — they go to SPEC-061 v0.3 with proper guardrails (capability tokens per SPEC-049).

Non-Goals

  • Live-streaming detail (each open is a one-shot fetch; close-and-reopen to refresh).
  • Mobile layout polish for the modals.
  • Cross-tenant aggregation (same scope as SPEC-061 v0.1).

3. Architecture

3.1 Backend additions

Five new GET endpoints, root-mounted alongside /db_health (same auth posture):
Path | Purpose
/db_health/signals | List of signal_queue rows where delivered_at IS NULL. Each row: signal_id, type, category, from→to, age_sec, recipient_state (active / unregistered / paused), payload_kind.
/db_health/postgres | pg_stat_activity snapshot (LIMIT 200) + top 10 hot tables by n_dead_tup + slowest 5 queries from pg_stat_statements if installed.
/db_health/redis | Selected INFO blocks (server, memory, clients, stats, persistence) + LATENCY HISTORY (top 5 events) + CLIENT LIST summary.
/db_health/neo4j | Heap GC stats, page-cache hit/miss counters, top transactions by age (dbms.listTransactions()), longest queries.
/db_health/heartbeat | All registered controllers sorted by lag desc. Each row: identity, surface, machine_id, last_seen_at, lag_sec, project_chain.
Each endpoint is unauthenticated (same as /db_health), single-shot, and returns JSON, typically in under 200 ms. The per-leg timeout is 5 s.
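To make the row shapes above concrete, here is a minimal TypeScript sketch of how the dashboard side might type two of the payloads. The field names come from the table; the value types, the split of from→to into two fields, and the helper names (summarizeSignals, sortByLagDesc) are assumptions for illustration, not the backend's confirmed schema.

```typescript
// Hypothetical row types: field names follow the table in 3.1, value types are assumed.
type RecipientState = "active" | "unregistered" | "paused";

interface StuckSignalRow {
  signal_id: string;
  type: string;
  category: string;
  from: string; // assumed: the "from→to" pair split into two fields
  to: string;
  age_sec: number;
  recipient_state: RecipientState;
  payload_kind: string;
}

interface HeartbeatRow {
  identity: string;
  surface: string;
  machine_id: string;
  last_seen_at: string; // assumed to be an ISO-8601 timestamp
  lag_sec: number;
  project_chain: string[]; // assumed to be an ordered list
}

// Recomputes the card-style summary ("7 · oldest 140082s") from the detail rows.
function summarizeSignals(
  rows: readonly StuckSignalRow[],
): { count: number; oldest_age_sec: number } {
  const oldest = rows.reduce((max, r) => Math.max(max, r.age_sec), 0);
  return { count: rows.length, oldest_age_sec: oldest };
}

// Returns a new array sorted by lag, largest first, without mutating the input.
function sortByLagDesc(rows: readonly HeartbeatRow[]): HeartbeatRow[] {
  return [...rows].sort((a, b) => b.lag_sec - a.lag_sec);
}
```

The heartbeat endpoint is specified to return rows already sorted; a client-side sort like this would only matter if a drill re-sorts after filtering.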

3.2 Dashboard backend additions

Express proxy routes under /dashboard/api/db_health/<topic> that fetch the backend endpoints. One handler in dashboard/src/routes/api_db_health.ts registered next to existing api_* routes. Authenticated via existing session middleware; cached for 2 s server-side to absorb burst clicks.
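The 2 s server-side cache could be as small as the sketch below. This is an illustration under assumptions, not the handler in api_db_health.ts: the TtlCache name and its injected clock are hypothetical, and the Express wiring and backend fetch are left out so that only the expiry logic is shown.

```typescript
// Minimal TTL cache sketch. The clock is injected so expiry can be exercised in tests.
class TtlCache<V> {
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Returns the cached value, or undefined if it is missing or has expired.
  get(key: string): V | undefined {
    const hit = this.entries.get(key);
    if (hit === undefined) return undefined;
    if (this.now() >= hit.expiresAt) {
      this.entries.delete(key); // expired: drop it so the caller refetches
      return undefined;
    }
    return hit.value;
  }

  // Stores a value that stays valid for ttlMs from the current clock reading.
  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}
```

A proxy handler would call get(topic) first and only fetch the backend endpoint on a miss, then set(topic, json). Keying by topic means five entries at most, so no eviction policy is needed.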

3.3 Frontend additions

  • Modal shell: dashboard/web/src/components/SaturationDrillModal.tsx — overlay, ESC-to-close, click-outside-to-close, body locked. Header shows row label + current value + pressure pill; body is the row-specific drill content.
  • Drill content components, one per row id, all under dashboard/web/src/components/drills/:
    • SignalPipelineDrill.tsx — table of stuck signals, color-coded by age, “why stuck” column.
    • PostgresDrill.tsx — three sub-sections (active queries, hot tables, slow queries).
    • RedisDrill.tsx — INFO blocks + latest LATENCY events + client list summary.
    • Neo4jDrill.tsx — heap chart + GC stats + active tx + longest queries.
    • HttpQueueDrill.tsx — uvicorn worker breakdown (read from /db_health/postgres activity rows that look like API requests, until /db_health/http is added in v0.3).
    • ChannelBacklogDrill.tsx — placeholder explaining “instrumentation pending SPEC-045 §4”.
    • HeartbeatLagDrill.tsx — roster table sorted by lag desc.
  • Incidents overview: IncidentsOverviewDrill.tsx — bound to header click, lists every warn/hot row with explanation + recommended next step. Each item links into its row drill.
  • Plain-English text: centralized in dashboard/web/src/lib/saturationCopy.ts. Every row has a meaning, a whyItMatters, and a per-pressure suggestion. Localized once; used in modal + tooltip.
  • Click wiring: PressureRow gains onClick, SaturationCard header gains a click target, both push state into a Zustand drillTarget slice (null | { kind: 'row', id: string } | { kind: 'incidents' }). Modal subscribes to that slice and renders accordingly.
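The click wiring above can be sketched without any library. The DrillTarget union is taken from the spec; the hand-rolled store below is only a stand-in for the Zustand slice (it does not use Zustand's API), and the row id "postgres" used in the usage note is a hypothetical example.

```typescript
// The union comes from the spec; everything else is an illustrative stand-in.
type DrillTarget =
  | null
  | { kind: "row"; id: string }
  | { kind: "incidents" };

type Listener = (target: DrillTarget) => void;

// Tiny store standing in for the drillTarget slice and its setDrillTarget action.
function createDrillStore() {
  let drillTarget: DrillTarget = null;
  const listeners = new Set<Listener>();
  return {
    getDrillTarget: (): DrillTarget => drillTarget,
    setDrillTarget(t: DrillTarget): void {
      drillTarget = t;
      listeners.forEach((fn) => fn(t)); // notify the modal and any other subscriber
    },
    subscribe(fn: Listener): () => void {
      listeners.add(fn);
      return () => {
        listeners.delete(fn);
      };
    },
  };
}

// What the modal shell would decide to show for a given target.
function modalTitle(target: DrillTarget): string | null {
  if (target === null) return null; // modal closed
  switch (target.kind) {
    case "row":
      return `Drill: ${target.id}`;
    case "incidents":
      return "Current incidents";
  }
}
```

In this sketch a PressureRow click would call setDrillTarget({ kind: "row", id: "postgres" }), the header click would call setDrillTarget({ kind: "incidents" }), and ESC or click-outside would call setDrillTarget(null).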

3.4 No destructive actions in v0.2

All operator-action surfaces (force-drain a stuck signal, force-deregister a stale controller, terminate a long Neo4j tx, etc.) are deliberately deferred to v0.3 because they require:
  • Capability tokens (SPEC-049 §6 — operator permission gating).
  • Confirmation modal with consequences spelled out.
  • Audit-log writes per action.
v0.2 is read-only operator visibility. Knowing what’s wrong, in one click.

4. Concrete file plan

Backend (Python)

  • backend/app/services/db_health.py — five new async functions: list_stuck_signals(), pg_activity_detail(), redis_detail(), neo4j_detail(), heartbeat_detail(). Each returns a structured dict; failure modes use the same status / last_error shape as the v0.1 leg samples.
  • backend/app/routers/db_health.py — five new GET routes mirroring §3.1.

Dashboard backend (TypeScript)

  • dashboard/src/routes/api_db_health.ts — five proxy handlers, each fetches the backend endpoint with a 5 s timeout, in-memory 2 s cache.
  • dashboard/src/index.ts — register the routes.

Dashboard frontend (TypeScript / React)

  • dashboard/web/src/lib/api.ts — five fetch wrappers (fetchStuckSignals, fetchPostgresDetail, fetchRedisDetail, fetchNeo4jDetail, fetchHeartbeatDetail).
  • dashboard/web/src/store/dashboard.ts — drillTarget slice + setDrillTarget(t) action.
  • dashboard/web/src/lib/saturationCopy.ts — copy table.
  • dashboard/web/src/components/SaturationDrillModal.tsx — modal shell.
  • dashboard/web/src/components/drills/*.tsx — 8 drill components (7 row drills + IncidentsOverviewDrill).
  • dashboard/web/src/components/charts/PressureRow.tsx — onClick prop forwarded; cursor:pointer when handler provided.
  • dashboard/web/src/components/cards/SaturationCard.tsx — wire rollup header click + per-row click → setDrillTarget.
  • dashboard/web/src/App.tsx (or new Layout.tsx slot) — render <SaturationDrillModal /> once at root, subscribed to drillTarget.
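As a rough picture of how saturationCopy.ts and IncidentsOverviewDrill could fit together, the sketch below pairs a copy table with a function that builds the incident list. The meaning / whyItMatters / per-pressure suggestion fields follow §3.3; the "ok" pressure level, the row id heartbeat_lag, the RowState and Incident shapes, and all wording inside the table are placeholders assumed for illustration.

```typescript
// "warn" and "hot" appear in the spec; "ok" is an assumed name for the calm state.
type Pressure = "ok" | "warn" | "hot";

interface RowCopy {
  meaning: string;
  whyItMatters: string;
  suggestion: Record<Pressure, string>; // one suggestion per pressure level
}

// Hypothetical row id and placeholder wording; the real copy lives in saturationCopy.ts.
const saturationCopy: Record<string, RowCopy> = {
  heartbeat_lag: {
    meaning: "How long ago each controller last checked in.",
    whyItMatters: "A stale controller may no longer be doing its work.",
    suggestion: {
      ok: "No action needed.",
      warn: "Open the drill and look at the controllers with the highest lag.",
      hot: "Open the drill and check whether the stalest controllers are still running.",
    },
  },
};

// Assumed minimal view of a Saturation row as the card already knows it.
interface RowState {
  id: string;
  label: string;
  pressure: Pressure;
}

interface Incident {
  id: string;
  label: string;
  pressure: Pressure;
  meaning: string;
  nextStep: string;
}

// Keeps every warn/hot row and attaches its explanation and suggested next action.
function buildIncidents(
  rows: readonly RowState[],
  copy: Record<string, RowCopy>,
): Incident[] {
  return rows
    .filter((r) => r.pressure !== "ok")
    .map((r) => {
      const c: RowCopy | undefined = copy[r.id];
      return {
        id: r.id,
        label: r.label,
        pressure: r.pressure,
        meaning: c ? c.meaning : "No copy registered for this row.",
        nextStep: c ? c.suggestion[r.pressure] : "Open the row drill for detail.",
      };
    });
}
```

Because each incident keeps its row id, the overview can link into the matching row drill by setting the drill target to that id.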

5. Phasing

  1. 5.1 — Backend (1 day). Five detail endpoints + service functions.
  2. 5.2 — Dashboard proxy (½ day). Express routes + tsc clean.
  3. 5.3 — Modal shell + Signal pipeline drill (½ day). This is the one currently firing; ship it first so v0.2 demos meaningfully.
  4. 5.4 — Remaining 6 drills + incidents overview (1 day).
  5. 5.5 — Plain-English copy + visual polish (½ day).

6. Open questions

  1. Should the modal be a full-page route (/saturation/<row>) or a true overlay? Lean: overlay for v0.2 (fast), route for v0.3 if we want shareable URLs to live incidents.
  2. Should we expose a “snooze” mechanism per row so a known-stale signal pipeline doesn’t dominate the rollup? Defer to v0.3 once we have an opinion on whether to filter at the rules engine (saturation.ts) or at display time only.

7. Risks / notes

  • pg_stat_statements may be missing on personal-mode Postgres images. In that case the PostgresDrill renders “slow queries: extension not installed” in place of the slow-query list; not a blocker.
  • dbms.listTransactions() is a procedure available on Neo4j 4+; on older clusters the Neo4jDrill degrades gracefully.
  • The 2 s server-side cache means rapid open/close of the same drill returns identical JSON for that window. Acceptable; full refresh is one more open after the cache expires.
  • Adding drill modals doesn’t change the live SSE / push pipeline. No regression risk to the v0.1 cards.
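The pg_stat_statements fallback noted above might look like the following on the frontend. This is a sketch under assumptions: the spec does not say how the backend signals a missing extension, so a null slow-query section is assumed here, and the SlowQuery fields and the "none recorded" message are hypothetical.

```typescript
// Assumed fields for one slow-query entry.
interface SlowQuery {
  query: string;
  mean_ms: number;
}

// Assumption: the backend sends null when pg_stat_statements is not installed.
type SlowQueriesSection = SlowQuery[] | null;

// Produces the text lines for the slow-queries sub-section of PostgresDrill.
function slowQueriesLines(section: SlowQueriesSection): string[] {
  if (section === null) return ["slow queries: extension not installed"];
  if (section.length === 0) return ["slow queries: none recorded"];
  // The endpoint is specified to return the slowest 5; the slice is a defensive cap.
  return section.slice(0, 5).map((q) => `${q.mean_ms.toFixed(1)} ms  ${q.query}`);
}
```

Keeping the missing-extension case separate from the empty case lets the drill tell an operator whether there is nothing slow or simply nothing measured.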
Last modified on May 3, 2026