Status: draft · Version 0.1 · Filed 2026-04-26

SPEC-049: Identity & Session Manager Service — Single-Writer for All Identity, Registration, and Routing State

Origin

Operational directive (2026-04-26): “No call can be directly to Postgres for user registration, query or session management. Everything should be through a single endpoint. The session manager. There should be nothing local even when we are running local — the session manager should handle the redis and postgres writes.” This spec collapses ~5 outstanding tickets into one architectural fix:
Backlog item | Subsumed how
Same-identity duplicate registrations (Texi×2, Lafonda×2 today) | Single writer enforces preempt-or-reject atomically
POST /signal honesty (delivered=false even when WS push lands) | Manager owns routing → can return delivery confidence at POST time
prism_persona_create bootstrap-gate paradox | Persona CRUD is non-bootstrapped on the manager API
Stale-registration reaper | Internal to manager: a Redis eviction event triggers the Postgres release in the same code path
prism_whois case-insensitive join (already shipped 2971f01) | Becomes redundant: manager canonicalizes case at write time
Partial: TODO #97 gRPC retirement | Cleaner once the manager owns coordination

§1 — Audit: Current State

1.1 Postgres controller_registrations direct callers

  • services/controller_service.py — primary (register, release, get_status, get_prism_status, stamp_heartbeat, sweep_stale)
  • services/persona_service.py: whois() reads heartbeat for live state
  • services/signal_service.py: _resolve_identity_to_session() reads last_verb_at for engagement
  • services/approval_service.py — finds claude_desktop master for approvals
  • grpc_runtime/servicer.py — finds session by session_id
  • observability/metrics.py — observability counters
  • workers/controller_sweep_worker.py — runs sweep_stale() every 60s (10-min stale threshold)

1.2 Redis session_store direct callers

  • services/controller_service.py — register/release dual-write
  • services/persona_service.py: whois() reads get_active_sessions()
  • services/signal_service.py — identity-resolution reads + publish_*_event
  • routers/session_stream.py: calls pubsub = store.client.pubsub() directly
  • routers/controller.py — admin endpoints
  • main.py — startup persona preload

1.3 Postgres personas direct callers

  • services/persona_service.py — owns CRUD
  • routers/personas.py — HTTP wire
  • main.py — startup preload

1.4 The drift bug, traced

  • Redis TTL: ~90s (heartbeat refresh interval × 3, per SPEC-032)
  • Postgres controller_sweep_worker: 60s poll, releases rows >10min stale
  • Result: an MCP that dies at T=0 disappears from Redis at T+90s, but stays in Postgres controller_registrations (released_at=NULL) until T+10min. controller_status enumerating peers from Postgres reports ghost rows during this window — exactly the Texi×2 / Lafonda×2 pattern we observe today.
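The size of the ghost-row window follows directly from the three figures above. The following is a minimal sketch of that arithmetic, assuming both the Redis TTL and the Postgres staleness threshold are measured from the last successful heartbeat; the function name ghost_window is hypothetical and not part of the codebase.

```python
REDIS_TTL_S = 90          # heartbeat refresh interval x 3, per SPEC-032
STALE_THRESHOLD_S = 600   # sweep releases rows more than 10 minutes stale
SWEEP_POLL_S = 60         # controller_sweep_worker poll interval


def ghost_window(ttl_s: int, stale_s: int, poll_s: int) -> tuple[int, int]:
    """Return (best, worst) seconds a dead session stays visible in Postgres
    after it has already expired from Redis.

    Assumption: both clocks start at the last heartbeat.
    """
    best = stale_s - ttl_s             # a sweep runs the moment the row goes stale
    worst = stale_s + poll_s - ttl_s   # the row goes stale just after a sweep ran
    return best, worst


best, worst = ghost_window(REDIS_TTL_S, STALE_THRESHOLD_S, SWEEP_POLL_S)
print(f"ghost row visible for {best}s to {worst}s")
```

With the figures quoted above this gives 510 s to 570 s, i.e. roughly 8.5 to 9.5 minutes during which Postgres and Redis disagree about the same session.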

§2 — Architectural Rule

No code outside the Identity & Session Manager service may read from or write to controller_registrations, personas, or any Redis key matching the session-state namespace (prism:session:*, prism:master:*, prism:sessions:*, prism:personas:*, prism:events:*, prism:engagement:*).
Local mode is no exception. A Prism instance running on the same host as the MCP still routes through the manager’s HTTP API (loopback is fine; direct DB access is not). This is non-negotiable — local shortcut paths are the seed of every drift bug we’re closing.
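To make the reserved namespace concrete, the sketch below checks whether a Redis key falls under one of the patterns listed in the rule. It is illustrative only: the helper name is_manager_owned is hypothetical, and the check uses Python's standard fnmatch module rather than anything in the existing codebase.

```python
from fnmatch import fnmatchcase

# Patterns copied from the architectural rule above.
RESERVED_PATTERNS = (
    "prism:session:*",
    "prism:master:*",
    "prism:sessions:*",
    "prism:personas:*",
    "prism:events:*",
    "prism:engagement:*",
)


def is_manager_owned(key: str) -> bool:
    """True if the key belongs to the session-state namespace that only
    the Identity & Session Manager may read or write."""
    return any(fnmatchcase(key, pattern) for pattern in RESERVED_PATTERNS)
```

A guard like this could sit in a Redis client wrapper used by non-manager code, so that an accidental direct access fails loudly in tests instead of silently reintroducing drift.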

§3 — Service Surface (HTTP API)

All endpoints live under /api/v1/sm/ (Session Manager); the paths in the subsections below are written relative to /api/v1. Existing /controller/* and /personas/* routes proxy here during the migration window (Phase 1), then redirect (Phase 2), then return 410 Gone (Phase 3).

3.1 Sessions

  • POST /sm/sessions/register — register or preempt-and-replace. Body: {pid, agent_identity, agent_surface, machine_id, process_pid, session_id, force?, operator_id?, operator_password?}. Atomic: writes Postgres + Redis in one transaction (§4).
  • DELETE /sm/sessions/{session_id}?reason=... — release. Atomic.
  • POST /sm/sessions/{session_id}/heartbeat — refresh both stores. Out-of-band per SPEC-045 D1 (telecom rule).
  • POST /sm/sessions/{session_id}/engagement — stamp last_verb_at.
  • GET /sm/sessions/active?pid=... — list active peers (Redis-first, Postgres-backstop on cache miss only).
  • GET /sm/sessions/by-identity/{identity}?pid=... — resolve identity → best live session (SPEC-035 §3 ordering: master, then engagement, then heartbeat). Replaces signal_service._resolve_identity_to_session().
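The by-identity ordering (master, then engagement, then heartbeat) can be sketched as a single sort key. This is one reading of the ordering as summarized here, not a reproduction of SPEC-035 §3; the LiveSession type, its field names other than last_verb_at, and the use of Unix-second floats for timestamps are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class LiveSession:
    session_id: str
    is_master: bool
    last_verb_at: Optional[float]       # engagement stamp, Unix seconds (assumed)
    last_heartbeat_at: Optional[float]  # heartbeat stamp, Unix seconds (assumed)


def best_session(candidates: Sequence[LiveSession]) -> Optional[LiveSession]:
    """Pick the best live session for an identity.

    Priority: master first, then most recent engagement, then most recent
    heartbeat. Missing timestamps sort as oldest.
    """
    if not candidates:
        return None
    return max(
        candidates,
        key=lambda s: (
            s.is_master,
            s.last_verb_at or 0.0,
            s.last_heartbeat_at or 0.0,
        ),
    )
```

Keeping the ordering in one tuple-valued key means every caller of the endpoint sees the same tie-breaking, which is the point of moving resolution out of signal_service.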

3.2 Personas

  • POST /sm/personas — create. Non-bootstrap-gated (closes the prism_persona_create paradox).
  • GET /sm/personas?pid=... — whois (resolved live state included).
  • PATCH /sm/personas/{name} — update focus / description / archive.

3.3 Election

  • POST /sm/elections/{pid}/master/claim — claim master lease.
  • POST /sm/elections/{pid}/master/preempt — force-preempt with operator credentials (SPEC-038 §2.4).
  • GET /sm/elections/{pid}/master — current master.

3.4 Streaming (SPEC-045 plane, hosted by manager)

  • WS /sm/stream/{session_id} — replaces /api/v1/session/ws. Manager owns the pubsub fan-out and the new SPEC-037/045 §5.1 delivered_at marking.

3.5 Admin

  • POST /sm/admin/sweep — manual trigger for the reaper.
  • GET /sm/admin/health — Redis + Postgres connectivity.

3.6 Identity normalization rule (§3.0 invariant)

Every endpoint accepting an identity string canonicalizes case at the gateway: looks up the persona row, replaces the supplied name with the persona’s canonical identity field, and records that canonical form in both stores. The case-insensitive whois fix (2971f01) becomes structurally unnecessary.
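A minimal sketch of the normalization step, assuming the persona lookup has already produced the list of canonical identity strings for the project. The function name canonicalize_identity is hypothetical, and what the gateway should do when no persona matches is not specified here, so the sketch simply returns None and leaves that decision to the caller.

```python
from typing import Iterable, Optional


def canonicalize_identity(
    supplied: str, persona_identities: Iterable[str]
) -> Optional[str]:
    """Return the persona's canonical spelling for a supplied identity,
    matching case-insensitively, or None if no persona matches."""
    wanted = supplied.casefold()
    for canonical in persona_identities:
        if canonical.casefold() == wanted:
            return canonical
    return None
```

For example, canonicalize_identity("donna", ["Donna", "Texi"]) yields "Donna", which is the form that would then be written to both stores.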

§4 — Internal Write Coordination

The manager is the only process holding a Postgres connection to controller_registrations / personas AND Redis credentials for the session namespace. Inside the manager, every state-changing operation follows a single pattern:
async with manager.write_transaction() as tx:
    pg_change = await tx.postgres(<sql>)
    redis_change = await tx.redis(<command>)
    # Both succeed or both roll back. tx.commit() or tx.rollback().
Implementation note: this is not a true XA transaction (Redis doesn’t participate in 2PC). It’s a best-effort coordinated write with compensating action on failure:
  1. Begin Postgres tx
  2. Write Postgres
  3. Issue Redis command
  4. If Redis fails → Postgres rollback, return error
  5. Commit Postgres
  6. If Postgres commit fails after Redis succeeded → compensating Redis delete (best effort, logged)
The 5→6 race window is small but real. Operations are designed to be idempotent on retry so a transient Postgres-commit failure doesn’t leave permanent inconsistency.
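The six steps can be sketched end to end against in-memory stand-ins. Everything named here (FakePostgresTx, FakeRedis, coordinated_register, the prism:session:<session_id> key layout, and returning False instead of raising) is an assumption for illustration; the real manager.write_transaction() may be structured differently.

```python
import asyncio
import logging

log = logging.getLogger("session_manager.sketch")


class FakePostgresTx:
    """In-memory stand-in for one Postgres transaction (illustrative only)."""

    def __init__(self, table: dict, fail_on_commit: bool = False):
        self.table, self.pending, self.fail_on_commit = table, {}, fail_on_commit

    async def upsert(self, key: str, row: dict) -> None:
        self.pending[key] = row  # buffered until commit

    async def commit(self) -> None:
        if self.fail_on_commit:
            raise RuntimeError("postgres commit failed")
        self.table.update(self.pending)

    async def rollback(self) -> None:
        self.pending.clear()


class FakeRedis:
    """In-memory stand-in for the Redis session namespace (illustrative only)."""

    def __init__(self, fail_on_set: bool = False):
        self.data, self.fail_on_set = {}, fail_on_set

    async def set(self, key: str, value: str) -> None:
        if self.fail_on_set:
            raise ConnectionError("redis unavailable")
        self.data[key] = value

    async def delete(self, key: str) -> None:
        self.data.pop(key, None)


async def coordinated_register(pg_tx, redis, session_id: str, row: dict) -> bool:
    key = f"prism:session:{session_id}"  # assumed key layout
    await pg_tx.upsert(session_id, row)                  # steps 1-2
    try:
        await redis.set(key, row["agent_identity"])      # step 3
    except Exception:
        await pg_tx.rollback()                           # step 4
        return False
    try:
        await pg_tx.commit()                             # step 5
    except Exception:
        await pg_tx.rollback()
        try:
            await redis.delete(key)                      # step 6, best effort
        except Exception:
            log.exception("compensating delete failed for %s", key)
        return False
    return True
```

Calling asyncio.run(coordinated_register(FakePostgresTx({}), FakeRedis(), "s1", {"agent_identity": "Donna"})) returns True with both stores populated; with fail_on_set or fail_on_commit set, it returns False and both stores are left empty, which is the outcome the idempotent-retry design relies on.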

§5 — Reaper / Drift Prevention

The 60s-poll controller_sweep_worker is replaced by a Redis-keyspace-event-driven reaper:
  1. Manager configures Redis with notify-keyspace-events Ex (key expiration events).
  2. Internal subscriber listens on __keyevent@*__:expired for keys matching prism:session:*.
  3. On expiration → manager extracts session_id from the key → atomically writes Postgres released_at=NOW(), release_reason='heartbeat_expired'.
Result: the Postgres-vs-Redis drift window collapses from up to 10 minutes to single-digit seconds.
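The subscriber's core logic is small: turn an expiry notification into a session_id and a release statement. The sketch below assumes the message shape that redis-py's pub/sub returns for pattern subscriptions (a dict with type, channel, and data, where values may be bytes) and assumes keys are laid out as prism:session:<session_id>. The function name, the %s placeholder style, and the released_at IS NULL guard are illustrative assumptions, not the manager's actual code.

```python
from typing import Optional

SESSION_PREFIX = "prism:session:"  # assumed key layout

# Column names come from this spec; the IS NULL guard is an assumption that
# makes a replayed or duplicate event a harmless no-op.
RELEASE_SQL = (
    "UPDATE controller_registrations "
    "SET released_at = NOW(), release_reason = 'heartbeat_expired' "
    "WHERE session_id = %s AND released_at IS NULL"
)


def _text(value) -> str:
    """Decode bytes to str; pass other values through str()."""
    return value.decode("utf-8") if isinstance(value, bytes) else str(value)


def session_id_from_expiry(message: dict) -> Optional[str]:
    """Extract the session_id from a keyevent 'expired' notification,
    or return None if the message is not a session-key expiry."""
    if message.get("type") != "pmessage":
        return None  # subscription confirmations and other traffic
    if not _text(message.get("channel")).endswith(":expired"):
        return None
    key = _text(message.get("data"))
    if not key.startswith(SESSION_PREFIX):
        return None  # some other namespace expired
    return key[len(SESSION_PREFIX):] or None
```

Two properties of Redis keyspace notifications are worth keeping in mind when sizing the drift window: they are delivered over Pub/Sub, so an event published while the subscriber is disconnected is not replayed, and the expired event is emitted when Redis actually deletes the key rather than at the exact instant the TTL reaches zero. The manual sweep endpoint in §3.5 is one possible backstop for both cases.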

§6 — Migration Plan

  • Phase 0: Spec ratification (this document).
  • Phase 1: Manager service implemented in-tree as backend/app/services/session_manager/ (own router, own internal Redis + Postgres clients). New /sm/* endpoints live alongside existing /controller/* + /personas/*. Existing services delegate to manager internally: every direct DB call inside controller_service, persona_service, signal_service.identity_resolve, approval_service, grpc_runtime becomes a manager method call. No public API breakage yet.
  • Phase 2: Existing routers (/controller/*, /personas/*) become HTTP redirects (301/308) to /sm/*. MCP client updated to call new paths. Old paths still respond for one release cycle.
  • Phase 3: Old routers return 410 Gone. Direct DB callers grep-test in CI: grep -rn 'ControllerRegistration\|controller_registrations' --include='*.py' backend/ | grep -v session_manager/ returns empty.
  • Phase 4: Manager extracted to its own deployable. Same Python codebase, separate container. Backend FastAPI talks to it via HTTP (loopback in single-host install, network in distributed install). Closes Frank’s portability requirement: manager runs on any machine.
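The Phase 3 grep test could equally be expressed as a small Python check, which is easier to extend with further forbidden names later. This is a sketch of an equivalent guard, not the CI job itself; find_violations and load_sources are hypothetical names, and the exclusion is done on the session_manager path component rather than on grep output.

```python
import re
from pathlib import Path

# Same two identifiers the grep in Phase 3 looks for.
FORBIDDEN = re.compile(r"ControllerRegistration|controller_registrations")
ALLOWED_DIR = "session_manager"


def find_violations(sources: dict[str, str]) -> list[tuple[str, int]]:
    """Return (path, line number) for every forbidden reference found
    outside the session_manager package."""
    hits = []
    for path, text in sorted(sources.items()):
        if ALLOWED_DIR in Path(path).parts:
            continue  # the manager itself is allowed to touch these tables
        for lineno, line in enumerate(text.splitlines(), start=1):
            if FORBIDDEN.search(line):
                hits.append((path, lineno))
    return hits


def load_sources(root: str) -> dict[str, str]:
    """Read every *.py file under root into a {path: text} mapping."""
    return {
        str(p): p.read_text(encoding="utf-8", errors="replace")
        for p in Path(root).rglob("*.py")
    }
```

In CI this would be run as find_violations(load_sources("backend")) and the job would fail if the returned list is non-empty.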

§7 — Verification

  1. CI grep guard: no file outside backend/app/services/session_manager/ imports ControllerRegistration, Persona, or accesses session_store.client private surfaces. Migration adds the rule.
  2. Same-machine same-identity duplicate is structurally impossible: second POST /sm/sessions/register for an identity already registered on the same machine_id+process_pid succeeds silently (reconnect path); cross-process collision returns 409 unless force=true. Test against the current Texi×2 / Lafonda×2 state — after migration both collapse to one each.
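  3. Reaper drift test: kill an MCP, confirm Postgres released_at is stamped within 5 seconds of the session's Redis key expiring (the key itself outlives the process by up to the ~90s TTL), versus up to 10 minutes with the current sweep worker.
  4. POST /signal honesty: send a signal to a live recipient; the response returns delivered=true when the WS subscriber is confirmed (manager knows live state at POST time).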
  5. prism_persona_create works pre-bootstrap (paradox closed).
  6. Identity case canonicalization: registering “donna” creates the row as “Donna” if a Donna persona exists. Whois returns one row.
  7. Manager can be deployed as a separate container; backend FastAPI only needs the manager HTTP URL, not Postgres/Redis directly for session/identity concerns.
  8. Existing smokes (SPEC-030 phase1, SPEC-032 redis, SPEC-038 collision, SPEC-044 channels, SPEC-045 envelope + ws_client, SPEC-046 surfaces) all green against the new manager.
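The register behaviour described in item 2 reduces to a three-way decision that is easy to unit-test in isolation. The sketch below is illustrative: the Registration type and decide_register are hypothetical names, only the 409 status comes from this spec (the 200 and 201 codes are assumptions), and the operator-credential check that a forced preempt would require is deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Registration:
    agent_identity: str
    machine_id: str
    process_pid: int


def decide_register(
    existing: Optional[Registration],
    incoming: Registration,
    force: bool = False,
) -> tuple[int, str]:
    """Return (HTTP status, outcome) for a register request, given the
    currently active registration for the same identity, if any."""
    if existing is None:
        return 201, "registered"      # status code is an assumption
    same_process = (existing.machine_id, existing.process_pid) == (
        incoming.machine_id,
        incoming.process_pid,
    )
    if same_process:
        return 200, "reconnected"     # silent reconnect path
    if force:
        return 200, "preempted"       # credential check omitted in this sketch
    return 409, "conflict"            # cross-process collision
```

Because exactly one active registration per identity can come out of this function, the Texi×2 / Lafonda×2 state cannot be produced by any sequence of register calls that goes through it.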

§8 — Decisions

  • D1 (in-tree vs extracted): Phase 1-3 in-tree as a self-contained service module; Phase 4 extracts to its own container. Two-stage delivery — get the architectural rule enforced first, do the packaging cleanup once the rule holds.
  • D2 (atomic-write semantics): best-effort coordinated with compensating action + idempotent retry. True 2PC is not worth the complexity for this workload.
  • D3 (signal_service ownership): signal_queue table stays with signal_service. Identity-resolution moves to manager. The cut is at “who is online?” (manager) vs “what messages are queued?” (signal).
  • D4 (existing reaper retired): controller_sweep_worker removed in Phase 1; replaced by Redis-keyspace-event subscriber inside manager.
  • D5 (case canonicalization): enforced at the manager gateway, not via SQL lower() in queries. Wire identity always matches a persona row’s canonical case.

§9 — Open Questions

  • Q1 (Phase 4 portability transport): HTTP only, or HTTP + WS for the streaming plane already living on the manager? Lean: HTTP for control + same WS endpoint the manager already hosts (no extra layering).
  • Q2 (auth between backend and manager when separated): shared service token vs same Bearer the MCP uses? Lean: separate service-to-service token, rotated independently, scoped to the /sm/* namespace.
  • Q3 (operator credentials storage): stays on tenants table per SPEC-038, manager reads via Postgres? Or operator-cred APIs themselves are also manager-owned? Lean: operator creds are identity infrastructure → manager owns them too.
  • Q4 (gRPC service relation to manager): SPEC-030 gRPC CoordinationStream currently uses controller_registrations. Should it go through manager API in Phase 1, or wait for full retirement per TODO #97? Lean: route through manager in Phase 1 — keeps the rule clean even though the gRPC path is being retired anyway.

§10 — What This Spec Replaces

The following specs / PRs / TODOs are subsumed or made redundant:
  • The “stale-registration reaper” priority item — implemented as part of §5.
  • The “same-identity duplicate registrations” priority item — structurally impossible after §4.
  • The “prism_persona_create bootstrap paradox” item — closed by §3.2.
  • The “POST /signal honesty” item — manager knows live state at POST time per §3.1.
  • TODO #97 gRPC retirement — partially eased; manager API replaces the LAN role gRPC was filling.
  • The case-insensitive whois fix shipped today (2971f01) — still correct, becomes redundant after §3.6.
Last modified on May 18, 2026