Status: accepted · ADR-27 · Filed 2026-04-27

Decision

Two architectural invariants apply to all current and future Prism services. Both ratified 2026-04-27 after a post-outage divergence between prism_status (which read Postgres directly) and prism_whois (which read Redis via SessionStore) exposed that the SM facade was a sham: HTTP routes existed under /api/v1/sm/* but internally delegated to direct Postgres queries.

Invariant 1: Session Manager owns ALL session and realtime/coordination state. Reads on the hot path (master election, who is online, identity → session resolution, signal routing, heartbeat tiebreak) come from the in-memory plane (today: Redis, fronted by SM). They NEVER come from Postgres on the realtime path. Postgres is reserved for long-term memory, recall, statistical analysis, reporting, and audit trails. Persistent writes can be near-realtime best-effort OR queued-and-drained by background workers, never on the blocking path. The principle generalizes: any future service with realtime/coordination concerns owns its hot-path state behind its own service interface, with persistent storage handled async.

Invariant 2: Storage backends are hidden behind their owning service. No service outside session_manager imports SessionStore, a Redis client, or any coordination-storage primitive directly. Every consumer calls either session_manager.api (the in-process Python facade) or /api/v1/sm/* (the HTTP facade, for cross-process callers). Both surfaces are thin shims over the same internal SM methods, which alone touch Redis. The principle generalizes: any service that owns Postgres-only state, vector-store state, graph state, etc. exposes APIs and never leaks its backend across the codebase.

Concrete code consequences (the implementation surface this decision authorizes):
  • backend/app/services/session_manager/api.py becomes the canonical Python facade, Redis-first with Postgres backstop ONLY on cold-start cache miss (NEVER as runtime fallback).
  • backend/app/services/session_manager/sessions.py:active_sessions HTTP handler reads Redis directly, not via controller_service.get_prism_status.
  • backend/app/services/controller_service.py:get_prism_status reads via session_manager.api, not via direct Postgres _active_master / _active_peers.
  • _active_master / _active_peers in controller_service are either eliminated or converted to internal Postgres-backstop helpers used exclusively by SM cold-start seeding logic.
  • backend/app/services/signal_service.py drops its direct ControllerRegistration tiebreak query and routes through session_manager.api.
  • backend/app/services/persona_service.py drops its direct ControllerRegistration heartbeat-aggregation query and routes through session_manager.api.
  • All from ..session_store import get_session_store imports outside session_manager/ are removed and replaced with session_manager.api calls.
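The facade shape these consequences describe can be sketched as follows. This is a minimal illustration, not the actual session_manager.api: the class name, method names, and the "master" key are hypothetical, a plain dict stands in for Redis, and a callable stands in for the Postgres backstop query. The point it shows is that the backstop runs once at cold start to seed the hot store and is never consulted again on a runtime miss.

```python
from typing import Callable, Dict, Optional


class SessionManagerFacade:
    """Hypothetical facade: callers never see the storage backend."""

    def __init__(self, hot_store: Dict[str, str],
                 cold_seed: Callable[[], Dict[str, str]]) -> None:
        # hot_store stands in for Redis; cold_seed stands in for a
        # Postgres backstop query. Both are assumptions for illustration.
        self._hot = hot_store
        self._cold_seed = cold_seed
        self._seeded = False

    def _seed_once(self) -> None:
        # The backstop is read exactly once, at cold start. Values already
        # present in the hot store win over the seeded ones.
        if not self._seeded:
            for key, value in self._cold_seed().items():
                self._hot.setdefault(key, value)
            self._seeded = True

    def get_master(self) -> Optional[str]:
        self._seed_once()
        # A miss after seeding returns None; it does not fall back to
        # the cold store.
        return self._hot.get("master")

    def set_master(self, session_id: str) -> None:
        self._seed_once()
        self._hot["master"] = session_id
```

A consumer such as a status handler would call get_master() on the facade and have no import path to the hot store itself, which is what keeps the two read paths from diverging.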
Deployment topology consequence (downstream of the architectural rule): Because storage is hidden behind SM, deployment topology is a config concern, not an architectural one. Therefore: one uniform deployment topology across local / lan / cloud — three containers, same composition everywhere:
  • prism-backend — Python (FastAPI HTTP + gRPC) with Redis embedded in the same container, supervised by supervisord or s6-overlay. Single image, single deployment unit.
  • prism-postgres — Postgres for long-term storage. Independent backup and upgrade lifecycle.
  • prism-neo4j — Neo4j for the graph plane (tri-graph). Independent backup and upgrade lifecycle.
Total: 3 containers, identical local/lan/cloud. Down from 5 today by (a) folding Redis into the backend container as an embedded process and (b) folding the gRPC service into the same backend container (it’s already the same image, just a different command: line). Container topology stops being a per-environment variable. The local / lan / cloud distinction defined in SPEC-019 / ADR-21 (HOST_ENV + PRISM_ENV + MODE_PROFILES) remains and continues to govern network targeting and authentication; this ADR specifically excludes deployment-composition differences from the per-environment surface.

Rationale

Correctness (what triggered this ADR). During the post-server1-restart cascade, Redis-side master election demoted Donna’s stale registration and promoted Candi (mini1, registered 70 seconds earlier). Redis carried the new state; Postgres controller_registrations.is_master did not. prism_status (Postgres path) reported Donna as master; prism_whois (Redis path) reported Candi as master. They disagreed because two services held competing copies of the same runtime fact. With Invariant 1, only one source of truth exists for runtime state. With Invariant 2, no consumer can read the wrong copy by accident.

Performance. Redis hits are sub-1ms on local Unix-socket connections; Postgres hits are 10–50ms even on the same host. Every verb, every signal-route resolution, every heartbeat tiebreak, every status query incurs that cost today. The compounding effect is invisible at session scale and crippling under load. Hot path → memory; persistent path → disk; no exceptions.

Substitutability. When SM is the only Redis-aware module, swapping Redis for KeyDB, Memorystore, an embedded in-process variant, or anything else is a one-file change. With direct SessionStore imports scattered across five services, every swap is a cross-service refactor with attendant bug risk.

Connection-lifecycle correctness. Multiple services each holding their own Redis client connection means multiple places to debug pool exhaustion, retries, timeouts, reconnection-after-outage. Centralized in SM, those concerns live in one place with one tested implementation. The post-outage CLOSE_WAIT socket pile-up we observed (4 stale MCP subprocesses each holding dead sockets to server1) is a related symptom: when connection management is decentralized, recovery is decentralized too, and reliability suffers.

Deployment uniformity. Frank’s explicit constraint: “no way I am delivering different solutions” across local/lan/cloud. Differential deployment topologies (in-process cache locally, external Redis in production) double the test surface and divergence risk over time. One topology, one image, one set of bugs.

Container-count payoff. Bundling Redis into the backend container plus folding the gRPC service into the same image takes the local stack from 5 containers (backend + backend-grpc + redis + postgres + neo4j) to 3. Same composition runs in production. Faster startup, smaller install footprint (relevant for the Docker-Desktop-and-Node-only host prereq Frank has committed to), fewer moving parts to break on Mac/Windows.
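The "hot path → memory; persistent path → disk" split, together with Invariant 1's queued-and-drained persistence, can be illustrated with a small write-behind sketch. Everything here is an assumption for illustration: WriteBehindStore, its methods, and the persist callable are hypothetical names, a dict stands in for the in-memory plane, and the callable stands in for a Postgres write. The hot-path put() only touches memory and an unbounded queue; a background thread drains the queue to the durable store.

```python
import queue
import threading
from typing import Callable, Dict, Optional, Tuple


class WriteBehindStore:
    """Hypothetical store: in-memory reads and writes, async persistence."""

    def __init__(self, persist: Callable[[str, str], None]) -> None:
        self._hot: Dict[str, str] = {}
        self._pending: "queue.Queue[Tuple[str, str]]" = queue.Queue()
        self._persist = persist  # stand-in for a durable (Postgres) write
        worker = threading.Thread(target=self._drain, daemon=True)
        worker.start()

    def put(self, key: str, value: str) -> None:
        self._hot[key] = value           # hot-path write stays in memory
        self._pending.put((key, value))  # unbounded queue, so no blocking

    def get(self, key: str) -> Optional[str]:
        return self._hot.get(key)        # reads never touch the durable store

    def _drain(self) -> None:
        # Background worker: persist queued writes one at a time.
        while True:
            key, value = self._pending.get()
            try:
                self._persist(key, value)
            except Exception:
                # Best-effort in this sketch; a real worker would log or retry.
                pass
            finally:
                self._pending.task_done()

    def flush(self) -> None:
        # Wait until every queued write has been handed to persist().
        self._pending.join()
```

In this sketch a failing or slow persist call delays only the background worker; get() keeps answering from memory, which is the property the invariant asks for. flush() exists mainly so tests and shutdown code can wait for the queue to empty.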

Alternatives Considered

Differential deployment topology: in-process cache locally, external Redis in production. Rejected by Frank explicitly: “no way I am delivering different solutions.” Doubles maintenance surface; introduces drift between modes; means local bugs reproduce poorly in production and vice versa. The deployment-topology choice should be the same in every environment.

Embedded Redis in same container as the middle option (between in-process-cache and external Redis). Rejected as the exclusive local option but adopted as the uniform option once we committed to one topology everywhere. Adds supervisord / s6-overlay operational surface, but that surface is paid once and is consistent, and the uniformity benefit dominates.

Keep _active_master / _active_peers as the canonical readers, fix Postgres to stay in sync with Redis. Rejected. This is the inverse direction: it preserves the Postgres-first read path and adds dual-write complexity. Every write site must now succeed against both stores, with rollback semantics on partial failure. Higher complexity, more failure modes, and still doesn’t fix the core abstraction leak: multiple services would still know Redis exists.

Keep direct SessionStore imports across services, just make sure they all stay in sync. Rejected. Distributed consistency through convention always loses to consistency through abstraction. Five services importing the same Redis client is five places to forget a TTL, mishandle a key prefix, or fall behind a schema migration. SM as the sole Redis owner makes the contract mechanically enforceable (importing SessionStore outside session_manager/ becomes a lint violation).

Hide Redis behind SM, but keep services hitting Postgres directly for “read-only audit-style queries” on runtime data. Rejected. The whole point is one source of truth on the hot path. “Read-only audit query against runtime state” is a contradiction: if the data is runtime-relevant the audit query will diverge from the runtime view, and we just spent a day debugging that exact pattern. Audit queries against historical session lifecycle (rolled out of Redis, durably stored in Postgres) are fine and remain Postgres-direct; queries about current state go through SM regardless of intent.
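The lint rule mentioned above (no SessionStore or Redis imports outside session_manager/) could be checked with a short script built on Python's standard ast module. This is a sketch under assumptions: find_violations, FORBIDDEN_MODULES, and OWNER_DIR are hypothetical names, and the set of forbidden module names is a guess at what the rule would cover, not the project's actual lint configuration.

```python
import ast
from pathlib import PurePosixPath
from typing import List

# Assumed rule set: module names only the owning package may import.
FORBIDDEN_MODULES = {"session_store", "redis"}
OWNER_DIR = "session_manager"


def find_violations(path: str, source: str) -> List[str]:
    """Return one message per import statement that breaks the rule."""
    if OWNER_DIR in PurePosixPath(path).parts:
        return []  # files inside the owning package are exempt
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom):
            # Covers "from ..session_store import x" and
            # "from . import session_store".
            parts = (node.module or "").split(".")
            parts += [alias.name for alias in node.names]
        elif isinstance(node, ast.Import):
            parts = [p for alias in node.names for p in alias.name.split(".")]
        else:
            continue
        hits = sorted(FORBIDDEN_MODULES.intersection(parts))
        if hits:
            violations.append(
                f"{path}:{node.lineno}: direct import of {hits[0]}"
            )
    return violations
```

Run over every .py file under backend/app/services/ in CI, a non-empty result would fail the build. The check matches on module names only, so a backend reached through an alias or a dynamic import would slip past it; it is a guard against accidental leaks, not a proof of isolation.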
Last modified on April 29, 2026