SPEC-049: Identity & Session Manager Service — Single-Writer for All Identity, Registration, and Routing State
Status: draft · Version 0.1 · Filed 2026-04-26
Origin
Operational directive (2026-04-26): “No call can be directly to Postgres for user registration, query or session management. Everything should be through a single endpoint. The session manager. There should be nothing local even when we are running local — the session manager should handle the redis and postgres writes.”

This spec collapses ~5 outstanding tickets into one architectural fix:

| Backlog item | Subsumed how |
|---|---|
| Same-identity duplicate registrations (Texi×2, Lafonda×2 today) | Single writer enforces preempt-or-reject atomically |
| `POST /signal` honesty (`delivered=false` even when WS push lands) | Manager owns routing → can return delivery confidence at POST time |
| `prism_persona_create` bootstrap-gate paradox | Persona CRUD is non-bootstrapped on the manager API |
| Stale-registration reaper | Internal to manager: Redis eviction event triggers Postgres release in the same code path |
| `prism_whois` case-insensitive join (already shipped `2971f01`) | Becomes redundant: manager canonicalizes case at write time |
| Partial: TODO #97 gRPC retirement | Cleaner once the manager owns coordination |

§1 — Audit: Current State
1.1 Postgres controller_registrations direct callers
- `services/controller_service.py`: primary (`register`, `release`, `get_status`, `get_prism_status`, `stamp_heartbeat`, `sweep_stale`)
- `services/persona_service.py`: `whois()` reads heartbeat for live state
- `services/signal_service.py`: `_resolve_identity_to_session()` reads `last_verb_at` for engagement
- `services/approval_service.py`: finds `claude_desktop` master for approvals
- `grpc_runtime/servicer.py`: finds session by session_id
- `observability/metrics.py`: observability counters
- `workers/controller_sweep_worker.py`: runs `sweep_stale()` every 60s (10-min stale threshold)
1.2 Redis session_store direct callers
- `services/controller_service.py`: register/release dual-write
- `services/persona_service.py`: `whois()` reads `get_active_sessions()`
- `services/signal_service.py`: identity-resolution reads + `publish_*_event`
- `routers/session_stream.py`: `pubsub = store.client.pubsub()` directly
- `routers/controller.py`: admin endpoints
- `main.py`: startup persona preload
1.3 Postgres personas direct callers
- `services/persona_service.py`: owns CRUD
- `routers/personas.py`: HTTP wire
- `main.py`: startup preload
1.4 The drift bug, traced
- Redis TTL: ~90s (heartbeat refresh interval × 3, per SPEC-032)
- Postgres `controller_sweep_worker`: 60s poll, releases rows >10min stale
- Result: an MCP that dies at T=0 disappears from Redis at T+90s, but stays in Postgres `controller_registrations` (`released_at=NULL`) until T+10min. `controller_status` enumerating peers from Postgres reports ghost rows during this window, which is exactly the Texi×2 / Lafonda×2 pattern we observe today.
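
The size of the ghost window follows from the three numbers above. A minimal sketch of that arithmetic, assuming the last heartbeat landed at T=0 (the moment the MCP died) and that staleness is measured from that heartbeat; the constant and function names are illustrative, not taken from the codebase:

```python
REDIS_TTL_S = 90               # heartbeat refresh interval x 3 (SPEC-032)
STALE_THRESHOLD_S = 10 * 60    # sweep releases rows older than this
SWEEP_POLL_S = 60              # controller_sweep_worker poll interval


def ghost_window_seconds(ttl_s=REDIS_TTL_S,
                         stale_s=STALE_THRESHOLD_S,
                         poll_s=SWEEP_POLL_S):
    """Return (best_case, worst_case) seconds during which a dead session
    is gone from Redis but still unreleased in Postgres.

    Assumption: the last heartbeat was written at T=0.
    """
    best = stale_s - ttl_s             # a sweep runs right at the threshold
    worst = stale_s + poll_s - ttl_s   # a sweep ran just before the threshold
    return best, worst


print(ghost_window_seconds())  # prints (510, 570)
```

Under these assumptions every dead session is a ghost for roughly eight and a half to nine and a half minutes.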
§2 — Architectural Rule
No code outside the Identity & Session Manager service may read from or write to `controller_registrations`, `personas`, or any Redis key matching the session-state namespace (`prism:session:*`, `prism:master:*`, `prism:sessions:*`, `prism:personas:*`, `prism:events:*`, `prism:engagement:*`).

Local mode is no exception. A Prism instance running on the same host as the MCP still routes through the manager’s HTTP API (loopback is fine; direct DB access is not). This is non-negotiable: local shortcut paths are the seed of every drift bug we’re closing.
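
The guarded surface can be written down as data, which makes it easy to check in tests or in a wrapper around non-manager Redis clients. A minimal sketch, assuming glob-style matching on key names; the helper names are hypothetical, while the patterns and table names are the ones listed in the rule:

```python
from fnmatch import fnmatchcase

# Redis key patterns reserved for the manager (copied from the rule above).
GUARDED_KEY_PATTERNS = (
    "prism:session:*",
    "prism:master:*",
    "prism:sessions:*",
    "prism:personas:*",
    "prism:events:*",
    "prism:engagement:*",
)

# Postgres tables reserved for the manager.
GUARDED_TABLES = frozenset({"controller_registrations", "personas"})


def is_guarded_key(key: str) -> bool:
    """True if a Redis key falls inside the session-state namespace."""
    return any(fnmatchcase(key, pattern) for pattern in GUARDED_KEY_PATTERNS)


def is_guarded_table(table: str) -> bool:
    """True if a Postgres table may only be touched by the manager."""
    return table in GUARDED_TABLES
```

For example, `is_guarded_key("prism:session:abc123")` is true, while a key outside the namespace such as `prism:cache:foo` is not guarded.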
§3 — Service Surface (HTTP API)
All endpoints live under `/api/v1/sm/` (Session Manager). Existing `/controller/*` and `/personas/*` routes proxy here during the migration window (Phase 1), then redirect (Phase 2), then return 410 (Phase 3).
3.1 Sessions
- `POST /sm/sessions/register`: register or preempt-and-replace. Body: `{pid, agent_identity, agent_surface, machine_id, process_pid, session_id, force?, operator_id?, operator_password?}`. Atomic: writes Postgres + Redis in one coordinated operation (§4).
- `DELETE /sm/sessions/{session_id}?reason=...`: release. Atomic.
- `POST /sm/sessions/{session_id}/heartbeat`: refresh both stores. Out-of-band per SPEC-045 D1 (telecom rule).
- `POST /sm/sessions/{session_id}/engagement`: stamp `last_verb_at`.
- `GET /sm/sessions/active?pid=...`: list active peers (Redis-first, Postgres-backstop on cache miss only).
- `GET /sm/sessions/by-identity/{identity}?pid=...`: resolve identity → best live session (SPEC-035 §3 ordering: master, then engagement, then heartbeat). Replaces `signal_service._resolve_identity_to_session()`.
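
The by-identity ordering (master, then engagement, then heartbeat) can be sketched as a single sort key. This is an illustration only: the `LiveSession` fields are hypothetical, and reading "engagement" as "most recent `last_verb_at`" is an assumption about what SPEC-035 §3 intends.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class LiveSession:
    session_id: str
    is_master: bool
    last_verb_at: float        # epoch seconds; 0.0 if never engaged
    last_heartbeat_at: float   # epoch seconds


def pick_best_session(sessions: Iterable[LiveSession]) -> Optional[LiveSession]:
    """Pick the preferred live session for an identity.

    Ordering: master first, then most recent engagement, then most
    recent heartbeat. Returns None when there are no live sessions.
    """
    return max(
        sessions,
        key=lambda s: (s.is_master, s.last_verb_at, s.last_heartbeat_at),
        default=None,
    )
```

A master session wins even if another session has a fresher heartbeat; among non-masters, the most recently engaged session wins.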
3.2 Personas
- `POST /sm/personas`: create. Non-bootstrap-gated (closes the `prism_persona_create` paradox).
- `GET /sm/personas?pid=...`: whois (resolved live state included).
- `PATCH /sm/personas/{name}`: update focus / description / archive.
3.3 Election
- `POST /sm/elections/{pid}/master/claim`: claim master lease.
- `POST /sm/elections/{pid}/master/preempt`: force-preempt with operator credentials (SPEC-038 §2.4).
- `GET /sm/elections/{pid}/master`: current master.
3.4 Streaming (SPEC-045 plane, hosted by manager)
- `WS /sm/stream/{session_id}`: replaces `/api/v1/session/ws`. Manager owns the pubsub fan-out and the new SPEC-037/045 §5.1 `delivered_at` marking.
3.5 Admin
- `POST /sm/admin/sweep`: manual trigger for the reaper.
- `GET /sm/admin/health`: Redis + Postgres connectivity.
3.6 Identity normalization rule (§3.0 invariant)
Every endpoint accepting an identity string canonicalizes case at the gateway: it looks up the persona row, replaces the supplied name with the persona’s canonical `identity` field, and records that canonical form in both stores. The case-insensitive whois fix (`2971f01`) becomes structurally unnecessary.
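
A minimal sketch of the canonicalization step, assuming the gateway already has the persona identities for the relevant pid in memory. The function name is hypothetical, and returning `None` for an unknown identity is a placeholder: the spec does not say how a name with no persona row is handled.

```python
from typing import Iterable, Optional


def canonicalize_identity(supplied: str,
                          persona_identities: Iterable[str]) -> Optional[str]:
    """Map a wire identity to the persona's canonical spelling.

    Comparison is case-insensitive; the returned value is the persona
    row's own `identity`, never the caller's spelling. Returns None
    when no persona matches (handling of that case is left open here).
    """
    wanted = supplied.casefold()
    for canonical in persona_identities:
        if canonical.casefold() == wanted:
            return canonical
    return None


print(canonicalize_identity("donna", ["Donna", "Texi"]))  # prints Donna
```

Because the canonical form is what gets written to both stores, later reads can compare identities with plain equality.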
§4 — Internal Write Coordination
The manager is the only process holding a Postgres connection to `controller_registrations` / `personas` AND Redis credentials for the session namespace. Inside the manager, every state-changing operation follows a single pattern:
- Begin Postgres tx
- Write Postgres
- Issue Redis command
- If Redis fails → Postgres rollback, return error
- Commit Postgres
- If Postgres commit fails after Redis succeeded → compensating Redis delete (best effort, logged)
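
The six steps above can be sketched against in-memory stand-ins. `FakePostgresTx` and `FakeRedis` are hypothetical test doubles, not the manager's real clients, and the `prism:session:<session_id>` key layout is an assumption; only the ordering of writes, rollback, commit, and compensation is taken from the list.

```python
class FakePostgresTx:
    """In-memory stand-in for one Postgres transaction (hypothetical)."""

    def __init__(self, table, fail_commit=False):
        self.table = table
        self.pending = {}
        self.fail_commit = fail_commit

    def write(self, key, row):
        self.pending[key] = row

    def commit(self):
        if self.fail_commit:
            self.pending.clear()
            raise RuntimeError("postgres commit failed")
        self.table.update(self.pending)
        self.pending.clear()

    def rollback(self):
        self.pending.clear()


class FakeRedis:
    """In-memory stand-in for the session-namespace Redis (hypothetical)."""

    def __init__(self, fail_set=False):
        self.data = {}
        self.fail_set = fail_set

    def set(self, key, value):
        if self.fail_set:
            raise ConnectionError("redis unavailable")
        self.data[key] = value

    def delete(self, key):
        self.data.pop(key, None)


def coordinated_write(tx, redis, session_id, row, log=print):
    """Apply one state change to both stores using the pattern above."""
    key = f"prism:session:{session_id}"   # assumed key layout
    tx.write(session_id, row)             # begin tx + write Postgres
    try:
        redis.set(key, row)               # issue Redis command
    except Exception:
        tx.rollback()                     # Redis failed: roll back Postgres
        raise
    try:
        tx.commit()                       # commit Postgres
    except Exception:
        try:
            redis.delete(key)             # compensating delete, best effort
        except Exception as exc:
            log(f"compensating delete failed for {key}: {exc}")
        raise
```

Note that the compensating delete is only as safe as the operation is idempotent: a caller that retries after a failed commit simply runs the whole sequence again.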
§5 — Reaper / Drift Prevention
The 60s-poll `controller_sweep_worker` is replaced by a Redis-keyspace-event-driven reaper:
- Manager configures Redis with `notify-keyspace-events Ex` (key expiration events).
- Internal subscriber listens on `__keyevent@*__:expired` for keys matching `prism:session:*`.
- On expiration → manager extracts session_id from the key → atomically writes Postgres `released_at=NOW()`, `release_reason='heartbeat_expired'`.
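
A sketch of the subscriber's filtering step, assuming messages arrive in the dict shape that redis-py yields for pattern subscriptions (`type`, `channel`, `data`) and that the key layout is `prism:session:<session_id>`. The function name, the `session_id` column, and the `released_at IS NULL` guard in the SQL are assumptions for illustration.

```python
from typing import Optional

SESSION_KEY_PREFIX = "prism:session:"   # assumed key layout

# Assumed statement; the IS NULL guard makes a repeated event a no-op.
RELEASE_SQL = (
    "UPDATE controller_registrations "
    "SET released_at = NOW(), release_reason = 'heartbeat_expired' "
    "WHERE session_id = %s AND released_at IS NULL"
)


def session_id_from_expired_event(message: dict) -> Optional[str]:
    """Return the session_id for an 'expired' key event, or None to ignore.

    Non-pmessage entries (such as subscribe confirmations) and keys
    outside the session namespace are skipped.
    """
    if message.get("type") != "pmessage":
        return None
    key = message.get("data")
    if isinstance(key, bytes):
        key = key.decode("utf-8", errors="replace")
    if not isinstance(key, str) or not key.startswith(SESSION_KEY_PREFIX):
        return None
    return key[len(SESSION_KEY_PREFIX):] or None
```

Keyspace notifications are plain pub/sub, so an event published while the subscriber is disconnected is not replayed. The manual `POST /sm/admin/sweep` trigger from §3.5 is a natural backstop for that case.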
§6 — Migration Plan
Phase 0 — Spec ratification (this document).
Phase 1 — Manager service implemented in-tree as `backend/app/services/session_manager/` (own router, own internal Redis + Postgres clients). New `/sm/*` endpoints live alongside existing `/controller/*` + `/personas/*`. Existing services delegate to manager internally: every direct DB call inside `controller_service`, `persona_service`, `signal_service.identity_resolve`, `approval_service`, `grpc_runtime` becomes a manager method call. No public API breakage yet.
Phase 2 — Existing routers (/controller/*, /personas/*) become HTTP redirects (301/308) to /sm/*. MCP client updated to call new paths. Old paths still respond for one release cycle.
Phase 3 — Old routers return 410 Gone. A CI grep test guards against direct DB callers: `grep -rn 'ControllerRegistration\|controller_registrations' --include='*.py' backend/ | grep -v session_manager/` must return empty.
Phase 4 — Manager extracted to its own deployable. Same Python codebase, separate container. Backend FastAPI talks to it via HTTP (loopback in single-host install, network in distributed install). Closes Frank’s portability requirement: manager runs on any machine.
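
The Phase 3 grep guard could also be written as a small Python check, which is easier to extend with further forbidden names. This is a sketch under assumptions: the function name is hypothetical, and unlike `grep -v session_manager/` (which filters on the text of the output line) it skips any file that sits under a directory named `session_manager`.

```python
import re
from pathlib import Path
from typing import List

# Same two names the Phase 3 grep looks for.
FORBIDDEN = re.compile(r"ControllerRegistration|controller_registrations")
ALLOWED_DIR = "session_manager"


def find_direct_db_callers(root: Path) -> List[str]:
    """List 'path:lineno: line' for every forbidden reference outside
    the session_manager package. An empty list means the guard passes."""
    hits = []
    for path in sorted(root.rglob("*.py")):
        rel = path.relative_to(root)
        if ALLOWED_DIR in rel.parts:
            continue  # the manager itself is allowed to touch the tables
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if FORBIDDEN.search(line):
                hits.append(f"{rel.as_posix()}:{lineno}: {line.strip()}")
    return hits
```

In CI this would be called with the `backend/` directory as `root`, failing the build when the returned list is non-empty.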
§7 — Verification
- CI grep guard: no file outside `backend/app/services/session_manager/` imports `ControllerRegistration`, `Persona`, or accesses `session_store.client` private surfaces. Migration adds the rule.
- Same-machine same-identity duplicate is structurally impossible: second `POST /sm/sessions/register` for an identity already registered on the same machine_id+process_pid succeeds silently (reconnect path); cross-process collision returns 409 unless `force=true`. Test against the current Texi×2 / Lafonda×2 state: after migration both collapse to one each.
- Reaper drift test: kill an MCP, confirm Postgres `released_at` is stamped within 5 seconds (vs current 10 minutes).
- POST /signal honesty: send signal to live recipient, response returns `delivered=true` when WS subscriber confirmed (manager knows live state at POST time).
- `prism_persona_create` works pre-bootstrap (paradox closed).
- Identity case canonicalization: registering “donna” creates the row as “Donna” if a `Donna` persona exists. Whois returns one row.
- Manager can be deployed as a separate container; backend FastAPI only needs the manager HTTP URL, not Postgres/Redis directly for session/identity concerns.
- Existing smokes (SPEC-030 phase1, SPEC-032 redis, SPEC-038 collision, SPEC-044 channels, SPEC-045 envelope + ws_client, SPEC-046 surfaces) all green against the new manager.
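
The duplicate-registration check in the list above reduces to a small decision function. The sketch below is hypothetical: the type and outcome names are invented for illustration, and validation of operator credentials on the `force` path is deliberately left out.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class RegisterOutcome(Enum):
    CREATED = "created"            # no live registration for this identity
    RECONNECTED = "reconnected"    # same machine_id + process_pid
    PREEMPTED = "preempted"        # cross-process collision, force=true
    CONFLICT_409 = "conflict_409"  # cross-process collision, no force


@dataclass(frozen=True)
class Registration:
    agent_identity: str
    machine_id: str
    process_pid: int


def decide_register(existing: Optional[Registration],
                    incoming: Registration,
                    force: bool = False) -> RegisterOutcome:
    """Decide how a register call for an identity should be handled.

    `existing` is the live registration for the same identity, if any.
    Operator-credential checks for force are out of scope of this sketch.
    """
    if existing is None:
        return RegisterOutcome.CREATED
    same_process = (
        (existing.machine_id, existing.process_pid)
        == (incoming.machine_id, incoming.process_pid)
    )
    if same_process:
        return RegisterOutcome.RECONNECTED
    return RegisterOutcome.PREEMPTED if force else RegisterOutcome.CONFLICT_409
```

Every branch leaves at most one live registration per identity, which is the property the Texi×2 / Lafonda×2 test is meant to confirm.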
§8 — Decisions
- D1 (in-tree vs extracted): Phase 1-3 in-tree as a self-contained service module; Phase 4 extracts to its own container. Two-stage delivery — get the architectural rule enforced first, do the packaging cleanup once the rule holds.
- D2 (atomic-write semantics): best-effort coordinated with compensating action + idempotent retry. True 2PC is not worth the complexity for this workload.
- D3 (signal_service ownership): `signal_queue` table stays with signal_service. Identity-resolution moves to manager. The cut is at “who is online?” (manager) vs “what messages are queued?” (signal).
- D4 (existing reaper retired): `controller_sweep_worker` removed in Phase 1; replaced by Redis-keyspace-event subscriber inside manager.
- D5 (case canonicalization): enforced at the manager gateway, not via SQL `lower()` in queries. Wire identity always matches a persona row’s canonical case.
§9 — Open Questions
- Q1 (Phase 4 portability transport): HTTP only, or HTTP + WS for the streaming plane already living on the manager? Lean: HTTP for control + same WS endpoint the manager already hosts (no extra layering).
- Q2 (auth between backend and manager when separated): shared service token vs same Bearer the MCP uses? Lean: separate service-to-service token, rotated independently, scoped to the `/sm/*` namespace.
- Q3 (operator credentials storage): stays on `tenants` table per SPEC-038, manager reads via Postgres? Or operator-cred APIs themselves are also manager-owned? Lean: operator creds are identity infrastructure → manager owns them too.
- Q4 (gRPC service relation to manager): SPEC-030 gRPC CoordinationStream currently uses `controller_registrations`. Should it go through manager API in Phase 1, or wait for full retirement per TODO #97? Lean: route through manager in Phase 1, which keeps the rule clean even though the gRPC path is being retired anyway.
§10 — What This Spec Replaces
The following specs / PRs / TODOs are subsumed or made redundant:
- The “stale-registration reaper” priority item: implemented as part of §5.
- The “same-identity duplicate registrations” priority item: structurally impossible after §4.
- The “`prism_persona_create` bootstrap paradox” item: closed by §3.2.
- The “POST /signal honesty” item: manager knows live state at POST time per §3.1.
- TODO #97 gRPC retirement: partially eased; manager API replaces the LAN role gRPC was filling.
- The case-insensitive whois fix shipped today (`2971f01`): still correct, becomes redundant after §3.6.

