Status: accepted · Version 1.2 · Filed 2026-04-23
SPEC-032 v1.2 — Redis Session Plane
Status
accepted — v1.0 reviewed by Lola (Desktop) + Frank; v1.1 incorporates Donna’s implementation-review critique; v1.2 adopts Option B (MCP → backend HTTP /controller/heartbeat → backend issues Redis EXPIRE) after Frank’s steering during implementation review. Redis stays fully inside the backend’s network boundary; MCP clients never connect to Redis directly.
Changelog
| Version | Date | Author | Summary |
|---|---|---|---|
| v0.1 | 2026-04-23 16:19Z | Donna (Claude Code) | Initial proposal, 7 open questions |
| v1.0 | 2026-04-23 18:29Z | Lola (Desktop) + Frank | All questions resolved. §3 anchor framing, §5.3 heartbeat (threading.Thread), §9 hard prerequisite. Install addendum merged. |
| v1.1 | 2026-04-23 PM | Donna (Claude Code) — implementation review | §5.2.1 Preemption CAS (Lua). §5.3 heartbeat: dedicated asyncio loop on a daemon thread (replaces threading.Thread). §5.3.1 Re-election via keyspace notifications. §8.6 notify-keyspace-events Ex config. §13 alternatives-considered rationale. §14 #13 restated as application-level SLI. |
| v1.2 | 2026-04-23 PM | Donna (Claude Code) + Frank | Option B adopted. §5.3 rewritten: MCP heartbeat posts to backend /controller/heartbeat via HTTP; backend owns all Redis access. Redis credentials stay in backend-only env. Cloud-mode security preserved (managed Redis never exposed to client machines). §11 file diff: mcp/requirements.txt no longer needs redis. §13 adds Q11. |
1. Summary
Move controller_registrations, master election, session heartbeats, and server-push event routing from Postgres to Redis. Postgres retains durable domain state: leases, capability_tokens, approval_requests, session_deltas, and all project artifacts.
Redis is not an optimization or a cache — it is the broadcast domain for Prism’s NetBIOS-style coordination protocol. NetBIOS has the network; Prism has Redis. This is the correct architecture for the election model SPEC-030 §4.1 invokes, eliminates a class of bugs hit in production, and makes single-operator deployments quieter at rest.
Redis is backend-only. MCP clients, CLI tools, and external consumers never open a Redis connection; they reach coordination semantics through backend HTTP + gRPC endpoints. This keeps Redis credentials in one place (backend env), scales to cloud-managed Redis without exposing credentials to every user’s machine, and lets the backend authenticate + rate-limit + audit per-user requests.
The dividing line between Redis and Postgres is ephemeral vs durable, not hot vs cold.
2. Motivation — a real bug, caught live
During the SPEC-030 completion arc on 2026-04-23, Donna’s own master registration was silently released between prism_start (12:26Z) and a follow-up verb call 26 min later (12:52Z). Another MCP process on the same machine claimed master at 12:47Z without any signal back. Root cause analysis identified four layered defects sharing one root cause:
Defect 1: sweep_stale does not distinguish masters from peers. controller_service.py::sweep_stale selects all unreleased registrations where last_heartbeat < cutoff. No is_master = false filter. Masters are released identically to peers.
Defect 2: Heartbeat stamping is HTTP-only. stamp_heartbeat fires exclusively through the HTTP auth middleware when X-Prism-Session-Id is on the request. Local-tool stretches (Edit/Read/Bash/Task) bypass the backend entirely; the row ages silently. The 10-minute sweep threshold is easily exceeded within a single coherent work phase.
Defect 3: gRPC Heartbeat handler does not refresh DB. The Heartbeat handler in grpc_runtime/servicer.py constructs a HeartbeatAck and nothing else. It does NOT call stamp_heartbeat. Even with a fully deployed gRPC stream sending 30s keepalives, the Postgres last_heartbeat column would still go stale. Two heartbeat mechanisms, only one connected to liveness — a split brain.
Defect 4: Silent release path. When sweep_stale flips a row, there is no side effect: no pg_notify, no nudge, no rules_reminder. The client has no way to know it lost master. By contrast, the preemption path DOES emit pg_notify — the asymmetry is architectural.
One root cause: We chose durable storage for state that is inherently ephemeral (liveness bounded by connection/TTL), then had to re-invent TTL semantics on top of Postgres with worker loops and heartbeat plumbing. Each emulation layer has gaps. Redis provides these semantics natively.
3. The NetBIOS analogy, taken seriously
SPEC-030 §4.1 invokes NetBIOS master browser election. NetBIOS is persistence-free by design — nodes announce on a broadcast domain, elections happen in the event stream, masters “die” when their announcements stop.
Frank’s observation during the bug investigation: “why isn’t this a simple memory table? why did we think this needs to be persisted, do you do it to simulate the netbios broadcast and discovery function?” Correct diagnosis. If we invoke NetBIOS semantics, we must follow through on its persistence model: none.
Redis maps 1:1 onto the NetBIOS mental model: atomic CAS (SETNX), native TTL (EXPIRE), pub/sub (PUBLISH/SUBSCRIBE), fast set operations (SADD/SMEMBERS for peer listing). All the Postgres plumbing we built becomes unnecessary.
Critical difference from NetBIOS: NetBIOS operates on a well-known network and uses broadcast for discovery. Prism doesn’t have that luxury — it needs an anchor at a known and consistent location that all agents are aware of. Redis IS that anchor. This is why Redis is a hard prerequisite, not an optional optimization (see §9).
4. Architectural commitments (load-bearing)
These survive the refactor:
- MCP at the boundary: no change. MCP clients talk to the backend via HTTP + gRPC; they never open Redis connections.
- gRPC bidirectional streams inside the control plane: no change; the event-push channel under them switches from asyncpg LISTEN to Redis SUBSCRIBE (backend-side).
- Backend is the only router: no change, and reinforced. All Redis access flows through the backend. External clients authenticate via API keys against the backend; the backend translates to Redis operations. Review flag (Lola): re-evaluate when SPEC-028 (TS MCP) lands. This commitment survives the TS migration more cleanly now that clients never need Redis credentials.
- Capability tokens + leases are DB rows: no change; Postgres retains them.
- Master election is project-scoped: no change.
- CD always wins election when present: no change; implemented via atomic Lua CAS (see §5.2.1).
- Single-master invariant: no change; enforced by atomic SETNX on a single key plus Lua compare-and-swap for preemption.
- Redis is backend-only (new, v1.2): MCP clients, CLI tools, and any future external consumer reach coordination state through backend HTTP / gRPC endpoints. Redis credentials live in one env (backend); no client ever sees PRISM_REDIS_URL.
5. What moves to Redis
5.1 Key schema
All keys carry full tenant_id:project_id scope, even in local mode where tenant is implicit. This prevents mode-branch surprises on upgrade paths (local → LAN) and maintains structural consistency across all deployment modes.
# Per-session hash. TTL = PRISM_SESSION_TTL_SECONDS (default 90).
prism:session:{session_id} → HASH {
    tenant_id, project_id, agent_identity, agent_surface,
    machine_id, is_master (0|1), registered_at (ISO8601)
}

# Master lock per project. Value is the session_id holding master.
# Same TTL as the session hash. Claimed via SETNX; preempted via Lua CAS (§5.2.1).
prism:master:{tenant_id}:{project_id} → STRING session_id

# Per-project active-session set. SADD on register, SREM on release/expire.
prism:project:{tenant_id}:{project_id}:sessions → SET of session_ids

# Pub/sub channels.
prism:events:project:{tenant_id}:{project_id}    # project-scoped broadcast
prism:events:session:{session_id}                # session-targeted push
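The schema above can be captured as a handful of key-building helpers so that no call site assembles key strings by hand. This is a minimal sketch, not the session_store package itself: the function names are hypothetical, and rejecting ":" inside an ID is an assumption added here to keep keys unambiguous.

```python
def _part(value: str) -> str:
    """Validate one key segment. Rejecting ':' is an assumption of this sketch."""
    if not value or ":" in value:
        raise ValueError(f"invalid key segment: {value!r}")
    return value


def session_key(session_id: str) -> str:
    # Per-session hash.
    return f"prism:session:{_part(session_id)}"


def master_key(tenant_id: str, project_id: str) -> str:
    # Master lock per project; the value stored is the holder's session_id.
    return f"prism:master:{_part(tenant_id)}:{_part(project_id)}"


def project_sessions_key(tenant_id: str, project_id: str) -> str:
    # Per-project active-session set.
    return f"prism:project:{_part(tenant_id)}:{_part(project_id)}:sessions"


def project_channel(tenant_id: str, project_id: str) -> str:
    # Project-scoped broadcast channel.
    return f"prism:events:project:{_part(tenant_id)}:{_part(project_id)}"


def session_channel(session_id: str) -> str:
    # Session-targeted push channel.
    return f"prism:events:session:{_part(session_id)}"
```

Centralizing the format strings also gives the tenant_id:project_id scoping rule a single place to be enforced.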
5.2 Semantics — register / elect (prism_start)
- Generate session_id (UUID).
- HSET prism:session:{session_id} + EXPIRE to TTL (90s).
- Try SETNX prism:master:{tenant}:{project} session_id + EXPIRE to TTL.
  - Success → caller is master; HSET is_master 1.
  - Failure → master exists. If caller is claude_desktop, preempt via the Lua CAS in §5.2.1, NOT a plain SET ... XX.
- SADD prism:project:{tenant}:{project}:sessions session_id.
- Return controller_status assembled from Redis reads.
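The register / elect steps above can be sketched against an in-memory stand-in so the control flow is visible without a Redis server. Everything here is illustrative: InMemoryStore, register, and the returned dict shape are hypothetical names. The claim is modelled as a single SET ... NX EX rather than SETNX followed by EXPIRE; that is an assumption of this sketch, chosen so the master key can never exist without a TTL if the process dies between the two commands.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class InMemoryStore:
    """Hypothetical stand-in for Redis. TTLs are accepted but not enforced."""
    strings: Dict[str, str] = field(default_factory=dict)
    hashes: Dict[str, Dict[str, str]] = field(default_factory=dict)
    sets: Dict[str, Set[str]] = field(default_factory=dict)

    def set_nx_ex(self, key: str, value: str, ttl: int) -> bool:
        # Models `SET key value NX EX ttl`: succeeds only if the key is absent.
        if key in self.strings:
            return False
        self.strings[key] = value
        return True


def register(store: InMemoryStore, tenant_id: str, project_id: str,
             agent_surface: str, ttl: int = 90,
             session_id: Optional[str] = None) -> dict:
    session_id = session_id or str(uuid.uuid4())
    # Try to claim the master slot; exactly one caller can win.
    is_master = store.set_nx_ex(
        f"prism:master:{tenant_id}:{project_id}", session_id, ttl)
    # Record the session hash and add the session to the project set.
    store.hashes[f"prism:session:{session_id}"] = {
        "tenant_id": tenant_id,
        "project_id": project_id,
        "agent_surface": agent_surface,
        "is_master": "1" if is_master else "0",
    }
    store.sets.setdefault(
        f"prism:project:{tenant_id}:{project_id}:sessions", set()
    ).add(session_id)
    # A losing claude_desktop caller goes on to the Lua CAS path (§5.2.1).
    needs_preempt = (not is_master) and agent_surface == "claude_desktop"
    return {"session_id": session_id, "is_master": is_master,
            "needs_preempt": needs_preempt}
```

The first caller for a project becomes master; later callers join as peers unless they are claude_desktop, in which case they proceed to preemption.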
5.2.1 Preemption CAS — atomic via Lua
Plain SET ... XX EX TTL does NOT serialize concurrent preemptions. Two Claude Desktop instances starting within the same TTL window both read “master exists, preempt,” both issue SET ... XX, and the last writer wins with no notification to the intermediate loser. Postgres ix_controller_single_master serialized this for us; Redis needs an explicit atomic CAS.
The preemption operation is:
-- KEYS[1] = prism:master:{tenant}:{project}
-- ARGV[1] = expected incumbent session_id (the master we read moments ago)
-- ARGV[2] = new master session_id (the preempting claude_desktop)
-- ARGV[3] = TTL seconds
-- Returns: 1 on success (we atomically replaced the expected incumbent),
--          0 if the incumbent had already changed (retry or join as peer).
local current = redis.call("GET", KEYS[1])
if current == ARGV[1] then
    redis.call("SET", KEYS[1], ARGV[2], "EX", ARGV[3])
    return 1
end
return 0
Flow in prism_start when caller is CD and a non-CD master exists:
- GET prism:master:... → incumbent_session_id.
- Run the Lua script with incumbent_session_id as the expected value.
- Script returns 1 → caller is master. PUBLISH prism:events:session:{incumbent_session_id} with MasterPreempted payload.
- Script returns 0 → another preemption beat us, OR the incumbent expired. Re-read the current master; retry if appropriate, or join as peer. Bounded retry (max 3) to avoid livelock.
This mirrors Postgres’s serialized preemption semantics at the Redis level. The Lua script is atomic on the Redis side (Redis executes scripts to completion without interleaving other commands on the same keyspace).
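The bounded-retry flow around the CAS can be sketched as a small loop. This is an illustration under stated assumptions, not the backend's implementation: preempt_with_retry and MasterSlot are hypothetical names, read_master stands in for the GET, and cas stands in for invoking the Lua script. Returning False when the slot is empty (so the caller falls back to the ordinary SETNX claim) is one reading of "retry if appropriate".

```python
from typing import Callable, Optional

MAX_PREEMPT_ATTEMPTS = 3  # bounded retry, as in the flow above


def preempt_with_retry(
    read_master: Callable[[], Optional[str]],
    cas: Callable[[str, str], bool],
    new_session_id: str,
    max_attempts: int = MAX_PREEMPT_ATTEMPTS,
) -> bool:
    """Return True if new_session_id became master, False to join as peer."""
    for _ in range(max_attempts):
        incumbent = read_master()
        if incumbent is None:
            # Incumbent expired: nothing to compare against, so the CAS
            # path does not apply. The caller can retry the SETNX claim.
            return False
        if cas(incumbent, new_session_id):
            return True
        # CAS lost: the master changed between read and swap. Re-read.
    return False


class MasterSlot:
    """In-memory model of the master key with compare-and-swap semantics."""

    def __init__(self, holder: Optional[str]) -> None:
        self.holder = holder

    def read(self) -> Optional[str]:
        return self.holder

    def cas(self, expected: str, new: str) -> bool:
        if self.holder == expected:
            self.holder = new
            return True
        return False
```

Because the loop re-reads the incumbent on every attempt, a caller that loses one race compares against the current holder rather than a stale one.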
5.3 Semantics — heartbeat (Option B: via backend, never direct)
Wire path. MCP clients POST to the backend’s /api/v1/controller/heartbeat endpoint every 30s. The backend (inside the Redis network boundary) issues EXPIRE on the session + master keys. MCP clients never open a Redis connection.
MCP client (any machine) backend (inside Redis network)
│ │
│ POST /api/v1/controller/heartbeat │
│ Headers: Authorization: Bearer <key> │
│ X-Prism-Session-Id: <sid> │
├────────────────────────────────────────▶│
│ │
│ │ PIPELINE:
│ │ EXPIRE prism:session:{sid} TTL
│ │ EXPIRE prism:master:{t}:{p} TTL
│ │ (if session holds master)
│ │
│ 200 {ok: true, │
│ ttl_remaining: 90} │
│◀────────────────────────────────────────┤
Why Option B (backend-mediated) over direct MCP→Redis:
- Cloud security. Managed Redis (Upstash/ElastiCache) never exposes credentials to user machines. Only the backend holds the URL. A leaked user API key risks one user’s session; a leaked Redis URL risks the whole coordination plane.
- LAN security. Redis publishes no host port on server1 (§8.3), so no ufw rule has to be opened for it. Direct MCP→Redis access would have required widening v1.1’s admin-IP-only rule for Redis port 46379 to the whole LAN subnet.
- Single place for Redis knowledge. Backend owns the client, connection pool, Lua registry, retry policy, observability. If we swap Redis for another store later, only the backend changes.
- Per-user authentication + audit. Backend authenticates each heartbeat via API key; can rate-limit and log per-user activity. Direct Redis has no per-user story.
- Client dependency footprint. MCP stays lightweight: no redis library, no PRISM_REDIS_URL env.
Concurrency model. The MCP-side heartbeat runs on a dedicated asyncio event loop inside a daemon thread, isolated from the MCP server’s main event loop. This preserves the isolation rationale from v1.1 (main loop may block during verb handling, starving a main-loop asyncio.Task). The thread does HTTP, not Redis:
# Pseudocode: heartbeat runs on its own event loop in a daemon thread.
# Talks HTTP to the backend, NOT Redis directly.
import asyncio
import logging
import threading
from dataclasses import dataclass

import httpx

log = logging.getLogger(__name__)


async def _heartbeat_loop(
    http_client: httpx.AsyncClient,
    api_url: str,
    api_key: str,
    session_id: str,
    stop: asyncio.Event,
) -> None:
    """Runs on a dedicated asyncio loop. Isolated from the MCP main loop."""
    headers = {
        "Authorization": f"Bearer {api_key}",
        "X-Prism-Session-Id": session_id,
    }
    while not stop.is_set():
        try:
            r = await http_client.post(
                f"{api_url}/api/v1/controller/heartbeat",
                headers=headers,
                timeout=5.0,
            )
            if r.status_code >= 400:
                log.warning("heartbeat rejected (status=%s)", r.status_code)
        except httpx.HTTPError as exc:
            log.warning("heartbeat HTTP error (will retry): %s", exc)
        try:
            # Sleep up to 30s, waking immediately if stop is set.
            await asyncio.wait_for(stop.wait(), timeout=30.0)
            return
        except asyncio.TimeoutError:
            pass


@dataclass
class _HeartbeatHandle:
    # Minimal shape assumed here; the spec names this type but does not define it.
    stop_event: asyncio.Event
    thread: threading.Thread
    loop: asyncio.AbstractEventLoop

    def stop(self) -> None:
        # asyncio.Event is not thread-safe: set it on the heartbeat loop's thread.
        self.loop.call_soon_threadsafe(self.stop_event.set)


def spawn_heartbeat(api_url, api_key, session_id):
    """Returns a handle (stop event, thread, loop). Called from prism_start."""
    loop = asyncio.new_event_loop()
    # Python 3.10+: the Event binds to a loop lazily on first use, so it is
    # safe to create it here and await it on the heartbeat loop.
    stop = asyncio.Event()

    def run():
        asyncio.set_event_loop(loop)
        client = httpx.AsyncClient()
        try:
            loop.run_until_complete(
                _heartbeat_loop(client, api_url, api_key, session_id, stop)
            )
        finally:
            loop.run_until_complete(client.aclose())
            loop.close()

    thread = threading.Thread(target=run, daemon=True, name=f"hb-{session_id[:8]}")
    thread.start()
    return _HeartbeatHandle(stop_event=stop, thread=thread, loop=loop)
Backend /controller/heartbeat endpoint shape:
# backend/app/routers/controller.py
from fastapi import Depends, Header, HTTPException

# router, AuthContext, require_auth, get_session_store, and HeartbeatResponse
# are project-local names defined elsewhere in the backend.


@router.post("/heartbeat", response_model=HeartbeatResponse)
async def heartbeat(
    x_prism_session_id: str = Header(...),
    ctx: AuthContext = Depends(require_auth),
    session_store=Depends(get_session_store),  # Redis-backed
) -> HeartbeatResponse:
    """Refresh the caller's session (and master, if held) TTL in Redis."""
    refreshed = await session_store.refresh_ttl(
        tenant_id=ctx.tenant_id,
        session_id=x_prism_session_id,
    )
    if not refreshed:
        # Session key expired between last heartbeat and this one; caller
        # must re-register via prism_start.
        raise HTTPException(410, "session expired, re-register via prism_start")
    return HeartbeatResponse(ok=True, ttl_remaining=refreshed.ttl)
Heartbeat unification: The gRPC Heartbeat ClientEvent (for masters with open CoordinationStreams) routes through the same service method. The X-Prism-Session-Id HTTP header side effect in authforge is deleted in Phase B — heartbeat is explicit via this endpoint, not implicit on every verb.
Death semantics: Client disappears → thread dies → HTTP stops → backend sees no refresh → keys expire naturally after 90s → peers notified via keyspace notification (§5.3.1) → re-election. Clean, structural, no worker involvement.
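The timing behind these death semantics is simple arithmetic, sketched below. The constants come from this spec (90s TTL, 30s heartbeat); the function names are hypothetical. One detail worth keeping in mind: the deadline is measured from the last successful refresh, not from the moment the client died.

```python
from datetime import datetime, timedelta, timezone

SESSION_TTL_SECONDS = 90         # PRISM_SESSION_TTL_SECONDS default (§5.1)
HEARTBEAT_INTERVAL_SECONDS = 30  # MCP heartbeat cadence (§5.3)


def expiry_deadline(last_refresh: datetime,
                    ttl_seconds: int = SESSION_TTL_SECONDS) -> datetime:
    """Time at which the session and master keys lapse if no refresh arrives."""
    return last_refresh + timedelta(seconds=ttl_seconds)


def intervals_within_ttl(ttl_seconds: int = SESSION_TTL_SECONDS,
                         interval_seconds: int = HEARTBEAT_INTERVAL_SECONDS) -> int:
    """How many whole heartbeat intervals fit inside one TTL window."""
    return ttl_seconds // interval_seconds
```

With the defaults, three heartbeat intervals fit in one TTL window, which is the 3× relationship the TTL default relies on.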
5.3.1 Re-election trigger via keyspace notifications
When the master key expires naturally (dead master, no release), peers need a signal to race for re-election. Spec’d mechanism:
- Redis is configured with notify-keyspace-events Ex (see §8.6). E enables the keyevent channel and x enables expired events, so expirations are published on the __keyevent@{db}__:expired channel.
- The backend-grpc ControllerEventListener subscribes to this channel at startup, filters for prism:master:* key patterns.
- On a master-key expiration event, the backend publishes a master_released event on prism:events:project:{tenant}:{project} with the expired session_id. The expired event itself carries only the key name, not the key’s value, so the session_id has to come from state the backend tracks separately.
- Active peers on that project receive the event via their own subscription (or at their next prism_start via controller_status showing an empty master). Whichever peer acts first wins SETNX on the now-empty master key. No polling, no timer.
Why not have peers poll? Polling wastes RTT and scales badly with peer count. Keyspace notifications are native to Redis and cost zero when no events fire.
Why not have the dying process publish? A cleanly-released master (via prism_wrap) publishes session_ended explicitly (§5.4). But unclean deaths — process crash, network partition, OS kill — have no publisher. Keyspace notifications cover the unclean path; explicit publish covers the clean path. Both trigger the same subscriber flow.
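The listener's filtering step can be sketched as a pure function that turns an expired key name into the project-channel event to publish. The function name and the JSON payload shape are assumptions for illustration; the spec fixes only the channel name and the master_released event type. The payload here omits session_id because a Redis expired event delivers the key name only.

```python
import json
from typing import Optional, Tuple

MASTER_PREFIX = "prism:master:"


def master_released_event(expired_key: str) -> Optional[Tuple[str, str]]:
    """Map an expired key to (channel, json_payload), or None if not a master key."""
    if not expired_key.startswith(MASTER_PREFIX):
        return None  # session hashes and unrelated keys are ignored
    parts = expired_key[len(MASTER_PREFIX):].split(":")
    if len(parts) != 2 or not all(parts):
        return None  # not of the form prism:master:{tenant}:{project}
    tenant_id, project_id = parts
    channel = f"prism:events:project:{tenant_id}:{project_id}"
    payload = json.dumps(
        {"type": "master_released",
         "tenant_id": tenant_id,
         "project_id": project_id},
        sort_keys=True,
    )
    return channel, payload
```

Keeping this step free of I/O makes the key-pattern filter easy to unit-test apart from the subscription machinery.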
5.4 Semantics — release (prism_wrap)
Client calls POST /api/v1/controller/release (or, for masters on a gRPC stream, closes the stream). Backend runs:
PIPELINE:
    DEL prism:session:{session_id}
    DEL prism:master:{tenant}:{project}    # only if holding
    SREM prism:project:{tenant}:{project}:sessions session_id
    PUBLISH prism:events:project:{tenant}:{project}    # session_ended event
The MCP-side heartbeat thread’s stop_event is set before the release POST is sent.
5.5 Semantics — checkpoint (prism_checkpoint, SPEC-031)
prism_checkpoint issues POST /api/v1/controller/heartbeat (same as a regular heartbeat) as part of its flow — refreshing the TTL without DEL. Backend recognizes the checkpoint marker in the request body and skips any SPEC-029 nudge-resolution side effects that prism_wrap would normally trigger. “Save my game” without “leave the table.”
5.6 Server-push events
Producers replace pg_notify('controller_events', ...) with:
- PUBLISH prism:events:session:{target} {json_payload} (session-targeted, backend-side)
- PUBLISH prism:events:project:{tenant}:{project} (broadcast, backend-side)
The gRPC servicer’s CoordinationStream subscribes to:
- prism:events:session:{caller_session_id} (targeted channel)
- prism:events:project:{tenant}:{project} (broadcast channel)
Redis pub/sub is natively multi-subscriber — this eliminates the multi-backend-grpc-container routing caveat from SPEC-030 Phase 3. Any backend-grpc container subscribing to the right channel receives the event.
MCP event consumption stays through gRPC CoordinationStream. MCP clients don’t SUBSCRIBE to Redis; they receive pushed events via their existing gRPC stream to backend-grpc, which in turn subscribes to Redis on their behalf. Same pattern as heartbeat: Redis stays backend-side.
Ordering guarantees: At-most-once delivery, no ordering guarantees across subscribers. This matches the existing LISTEN/NOTIFY semantics. No regression.
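The delivery semantics can be illustrated with a tiny in-memory model of pub/sub fan-out. InMemoryPubSub is a hypothetical teaching aid, not part of the session_store package. It mirrors two properties of Redis pub/sub: every current subscriber on a channel receives the message, and a subscriber that attaches after the publish receives nothing.

```python
from collections import defaultdict
from typing import DefaultDict, List


class InMemoryPubSub:
    """Fire-and-forget fan-out: no buffering for absent subscribers."""

    def __init__(self) -> None:
        self._inboxes: DefaultDict[str, List[List[str]]] = defaultdict(list)

    def subscribe(self, channel: str) -> List[str]:
        # Each subscriber gets its own inbox list.
        inbox: List[str] = []
        self._inboxes[channel].append(inbox)
        return inbox

    def publish(self, channel: str, message: str) -> int:
        # Deliver to whoever is subscribed right now; return the receiver
        # count, as Redis PUBLISH does.
        receivers = self._inboxes.get(channel, [])
        for inbox in receivers:
            inbox.append(message)
        return len(receivers)
```

In this model two backend-grpc containers subscribed to the same project channel both see an event, while a container that subscribes afterwards has to recover state some other way, for example by reading controller_status.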
5.7 prism_status
Read-only. Backend does HGETALL on master’s session hash + iterates SMEMBERS on the project’s session set and returns the routing table over HTTP. No worker coordination, no stale-row filtering. Sub-millisecond on any realistic project scale.
6. What stays in Postgres
These have audit, durability, or integrity requirements that Redis isn’t the right primitive for:
| Table | Why Postgres |
|---|---|
| leases | Partial UNIQUE enforces single-holder; audit matters; integration with intent queue (SPEC-026); survive backend restart |
| capability_tokens | Revocation must outlive process life; audit per token; scope JSONB with SQL queryability |
| approval_requests | Audit is THE reason these rows exist; terminal states must persist |
| nudges (SPEC-029) | Durable obligations, not session state |
| session_deltas | Semantic recall indexing; permanent record |
| projects, specs, adrs, plans, … | Domain data |
Redis is for “alive if reachable within TTL” state only.
7. What this eliminates
- controller_registrations table → becomes optional append-only audit log, then retired (Phase D).
- ControllerSweepWorker → delete.
- controller_service.stamp_heartbeat → delete.
- grpc_runtime/listener.py asyncpg LISTEN → replaced by Redis SUBSCRIBE.
- Partial UNIQUE index ix_controller_single_master → moot.
- The silent-release bug → fixed structurally via TTL.
- The gRPC-vs-HTTP heartbeat split brain → one path, one primitive.
- The X-Prism-Session-Id heartbeat side effect in authforge → removed. Heartbeat is explicit via /controller/heartbeat, not a side effect of other verbs.
8. What this adds
8.1 Runtime dependencies
- Redis runtime dependency: new container in all compose stacks.
- redis>=5.0 in backend/requirements.txt (async Python client). Backend only; MCP does NOT take this dep under Option B.
- New package backend/app/session_store/: Redis client, pipeline builders, pub/sub machinery, Lua script registration.
- New module mcp/heartbeat.py: dedicated asyncio loop on daemon thread, HTTP client talking to backend /controller/heartbeat (no direct Redis access).
- New backend endpoint POST /api/v1/controller/heartbeat: thin wrapper that validates session + issues EXPIRE.
- PRISM_REDIS_URL env var: backend only. MCP reads only PRISM_API_URL + PRISM_API_KEY (unchanged from today).
8.2 Local mode (personal install)
Containerized alongside Postgres + Neo4j. No native install decision forced on the user. File name docker-compose.personal.yml preserved for now; rename to docker-compose.local.yml aligns with SPEC-019 v1.1 and is a separate migration arc.
docker-compose.personal.yml additions:
redis:
  image: redis:7-alpine
  container_name: prism-personal-redis
  restart: unless-stopped
  command: >
    redis-server
    --save ""
    --appendonly no
    --notify-keyspace-events Ex
    --requirepass "prism_personal"
  # No host port published — Redis reachable only from the compose network.
  # Backend connects via internal DNS `redis:6379`.
  healthcheck:
    test: ["CMD", "redis-cli", "-a", "prism_personal", "ping"]
    interval: 5s
    timeout: 5s
    retries: 10
Backend env: PRISM_REDIS_URL: redis://:prism_personal@redis:6379/0
Backend depends_on gains redis: service_healthy.
Password: Fixed dev password prism_personal, matching the Postgres/Neo4j pattern. Since the port isn’t published to the host, even loopback access requires being inside the compose network — password is defense in depth, consistent with other services.
Why no published port? MCP clients don’t need direct Redis access under Option B; only the backend does. Omitting the ports: block makes Redis invisible outside the compose network — tightest possible surface.
8.3 LAN mode (server install)
Containerized, no published port (backend-only access). Same posture as personal mode.
docker-compose.server.yml additions:
redis:
  image: redis:7-alpine
  container_name: prism-server-redis
  restart: unless-stopped
  command: >
    redis-server
    --save ""
    --appendonly no
    --notify-keyspace-events Ex
    --requirepass "${PRISM_REDIS_PASSWORD:-prism_server}"
  # No host port published under Option B. Backend access via compose network.
  healthcheck:
    test: ["CMD", "redis-cli", "-a", "${PRISM_REDIS_PASSWORD:-prism_server}", "ping"]
    interval: 5s
    timeout: 5s
    retries: 10
Backend env: PRISM_REDIS_URL: redis://:${PRISM_REDIS_PASSWORD:-prism_server}@redis:6379/0
bin/prism-server-install.sh diffs:
- New config constant: PRISM_REDIS_PASSWORD (generated via openssl rand -base64 24). Base64 output can contain + and /, which must be percent-encoded when the password is embedded in PRISM_REDIS_URL.
- No ufw rule needed for Redis, since the port isn’t published to the host. Simpler than v1.1’s admin-IP rule.
- /etc/prism-server.conf template grows PRISM_REDIS_PASSWORD line.
- Post-install smoke: docker compose exec redis redis-cli -a "$PRISM_REDIS_PASSWORD" ping
8.4 Cloud mode (hosted)
Backend runs on a cloud host; Redis is a managed service (Upstash, AWS ElastiCache, Redis Cloud, etc.) reachable over the cloud provider’s private network (VPC peering) or over TLS on the public internet.
- Operator provisions managed Redis, gets URL + credentials.
- PRISM_REDIS_URL set in the backend’s deployment env. Never in client / MCP env.
- MCP clients authenticate to the backend via API key (unchanged from today). Backend uses its Redis credentials to do the actual coordination work.
- A leaked API key compromises one user’s session plane (revocable). A leaked backend Redis URL would compromise the whole cluster, which is why only the backend ever sees it.
- install.py --backend=cloud hard-fails if the backend deployment doesn’t have PRISM_REDIS_URL set.
- TLS via rediss:// is transparent through redis-py.
Managed-service keyspace notifications: Most managed Redis providers either enable keyspace notifications by default or allow the operator to configure them. Install docs call this out as a provider-specific checklist item (AWS ElastiCache: notify-keyspace-events parameter in a custom parameter group; Upstash: enabled by default in paid tiers; Redis Cloud: dashboard toggle).
8.5 SPEC-019 env resolution for PRISM_REDIS_URL
Backend-only resolution, following the existing pattern for PRISM_ALLOWED_ORIGINS and PRISM_WEB_URL:
@property
def effective_redis_url(self) -> str:
    if self.prism_redis_url:
        return self.prism_redis_url
    mode = self.prism_mode.lower()
    if mode in ("local", "personal", "development"):
        return "redis://:prism_personal@redis:6379/0"
    if mode == "lan":
        pwd = os.environ.get("PRISM_REDIS_PASSWORD", "prism_server")
        return f"redis://:{pwd}@redis:6379/0"
    if mode in ("cloud", "hosted"):
        raise RedisConfigError(
            "cloud mode requires PRISM_REDIS_URL — point at managed Redis"
        )
    return "redis://127.0.0.1:6379/0"
8.6 Redis configuration
- Persistence: AOF off, RDB off. State is ephemeral by definition: restart = re-register.
- Keyspace notifications: notify-keyspace-events Ex (keyevent channel, expired events). Required for §5.3.1 re-election. Set via compose command: arg in local/lan; manual config for cloud (provider-specific).
- Eviction policy: maxmemory-policy noeviction. Session keys are TTL-managed; eviction would corrupt the coordination plane.
- Memory budget: ~100KB per active project, which is negligible.
- Volume mount: Optional, for operator debugging only.
9. Redis availability posture — hard prerequisite
Redis is not optional. Redis IS the broadcast domain. Without it, there is no coordination plane — same as Postgres being down means no domain data.
When Redis is unavailable:
- prism_start errors on coordination features. It does NOT silently degrade to an in-process singleton.
- POST /controller/heartbeat returns 503; MCP heartbeat thread logs a warning and retries.
- HTTP + gRPC domain verbs that hit Postgres continue working.
- prism_status returns an error for controller status, not a degraded response.
Startup dependency chain: redis: service_healthy in compose depends_on ensures backend does not start until Redis is reachable. Install scripts validate Redis connectivity before declaring success.
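On the client side, the two documented heartbeat failure responses call for different reactions: 503 means Redis is unavailable and the thread should keep retrying, while 410 (§5.3) means the session key has expired and the client must re-register via prism_start. A minimal sketch of that decision follows; HeartbeatAction and action_for_status are hypothetical names, and treating every other non-2xx status as retryable is an assumption of this sketch.

```python
from enum import Enum


class HeartbeatAction(Enum):
    CONTINUE = "continue"      # TTL refreshed; wait for the next beat
    RETRY = "retry"            # transient failure; log and try again next beat
    REREGISTER = "reregister"  # session is gone; call prism_start again


def action_for_status(status_code: int) -> HeartbeatAction:
    if 200 <= status_code < 300:
        return HeartbeatAction.CONTINUE
    if status_code == 410:
        # Session expired between heartbeats (§5.3 endpoint behaviour).
        return HeartbeatAction.REREGISTER
    # 503 (Redis unavailable) and, by assumption, any other error status.
    return HeartbeatAction.RETRY
```

Separating the decision from the HTTP call keeps the heartbeat loop small and makes the 410 path explicit rather than just another logged warning.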
10. Phased migration
Phase A — Dual-write, Redis-authoritative read
- Add Redis to all compose stacks + install scripts.
- Add session_store package (backend-side Redis client + Lua registry + keyspace listener).
- Add POST /api/v1/controller/heartbeat endpoint on backend.
- Add mcp/heartbeat.py (dedicated-asyncio-loop-on-thread pattern, HTTP client).
- Register preemption Lua script at backend startup; cache SHA.
- controller_service.register writes BOTH Redis AND Postgres.
- Reads come from Redis.
- ControllerSweepWorker continues as safety net (loosened to a 30 min staleness threshold).
- MCP heartbeat spawned on prism_start.
- Redis keyspace notifications enabled in compose configs.
- Exit criterion: application-level SLI; see §14 criterion #13.
Phase B — Retire Postgres path
- Remove ControllerSweepWorker.
- Remove stamp_heartbeat.
- Postgres row becomes append-only audit.
- Remove X-Prism-Session-Id heartbeat side effect from authforge.
Phase C — Pub/sub migration
- Dual-path: producers PUBLISH to Redis AND pg_notify.
- Consumers (backend-grpc) subscribe to Redis; LISTEN path logs-only.
- Keyspace-notification-driven re-election wired into backend-grpc ControllerEventListener.
- After one week green, remove LISTEN infrastructure.
Phase D — Optional audit-table retirement
- Drop controller_registrations via forward migration if unused.
11. File-by-file diff summary
| File | Change type | What changes |
|---|---|---|
| docker-compose.personal.yml | Additive | New redis service (no published port, notify-keyspace-events Ex); backend env + depends_on |
| docker-compose.server.yml | Additive | New redis service w/ password + keyspace notifications; no host port |
| docker-compose.prod.yml | Additive | Backend env: PRISM_REDIS_URL |
| bin/prism-server-install.sh | Additive | PRISM_REDIS_PASSWORD generation, conf template, smoke (no ufw rule under Option B) |
| install/install.py | Additive | Cloud-mode validation that backend deploy has PRISM_REDIS_URL |
| install/detect.py | Additive | Redis health probe (via docker compose exec, not direct TCP) |
| install/smoke.py | Additive | redis-cli PING smoke via compose exec |
| backend/app/config.py | Additive | prism_redis_url setting, SPEC-019 mode-aware default |
| backend/docker-entrypoint.sh | Additive | Wait-for-redis preflight |
| backend/app/session_store/ | New | Redis client, pipelines, Lua script registry, pub/sub, keyspace-notification listener |
| backend/app/session_store/scripts/preempt_cas.lua | New | §5.2.1 Lua CAS script |
| backend/app/routers/controller.py | Modified | Add POST /heartbeat endpoint |
| backend/app/services/controller_service.py | Modified | register dual-writes Redis + Postgres; refresh_ttl new method |
| mcp/heartbeat.py | New | Dedicated asyncio loop on daemon thread, HTTP client to backend /controller/heartbeat |
| mcp/server.py | Modified | Spawn/stop heartbeat handle on prism_start / prism_wrap |
| backend/tests/test_config_redis_resolution.py | New | Mode-aware URL resolution tests |
| backend/tests/test_preempt_cas.py | New | Concurrent-CD race tests against ephemeral Redis |
| backend/tests/test_heartbeat_endpoint.py | New | HTTP endpoint + TTL refresh verification |
| mcp/smoke_spec032_redis.py | New | End-to-end session-plane smoke |
Note: mcp/requirements.txt is NOT modified. MCP does not take a Redis dependency under Option B — it reuses its existing httpx client.
12. Relationship to other specs
- Supersedes SPEC-030 §5 + §11. Does NOT supersede §8, §9, §13.
- Complements SPEC-031 (checkpoint = heartbeat-without-release via the same /controller/heartbeat endpoint).
- Informs SPEC-028 (TS MCP gets a tiny HTTP-ping heartbeat model — no Redis client library needed in TypeScript).
- Uses SPEC-019 for env resolution.
- Corrects wrap-rate inflation from sweep-released sessions.
13. Resolved design questions
| Q | Question | Resolution | Rationale / alternatives considered |
|---|---|---|---|
| Q1 | Redis persistence | AOF off, RDB off | Ephemeral by definition — restart = re-register. Volume mount stays optional for operator debug. |
| Q2 | TTL default | 90 seconds | 30s heartbeat × 3 missed beats before expiry. Shorter (60s) gives faster dead-master detection but less tolerance to network blips; longer (120s+) masks real outages. 90s is the 3×-interval sweet spot. |
| Q3 | Race ordering on concurrent master claims | SETNX winner for empty slot; Lua CAS for preemption (§5.2.1) | Corrected in v1.1. v1.0 “SETNX winner, defer tie-breaking” didn’t address the concurrent-CD preemption race — Lua CAS handles it atomically. Alternative considered: Redis WATCH/MULTI; rejected because Lua is simpler and more portable. |
| Q4 | Personal-install Redis footprint | Redis everywhere | Alternative considered: mode-branch (pg advisory locks for local, Redis for LAN/cloud). Rejected because session-layer code paths would fork two ways with no test parity, and Frank’s direction was Redis-as-core. Cost: ~35MB RAM + forced Docker dependency on macOS. Accepted as cost of architectural consistency. |
| Q5 | SPEC-029 nudges table | Stays Postgres | Nudges are durable obligations, not session state. |
| Q6 | Event ordering | At-most-once, no cross-subscriber ordering | Matches existing LISTEN/NOTIFY semantics. No regression. Stricter ordering would require Redis Streams + consumer groups — unnecessary complexity for current event shapes. |
| Q7 | Producer cutover during migration | Dual-path Phase A/B, cut in Phase C | Alternative considered: atomic cutover with flag. Rejected for eroding rollback safety during the transition week. |
| Q8 | Local-mode Redis password | Fixed prism_personal | Pattern consistency with Postgres+Neo4j (both use fixed dev passwords). Port isn’t published — password is defense in depth. |
| Q9 | Heartbeat concurrency model | Dedicated asyncio loop on daemon thread | Alternative considered: asyncio.Task on main MCP loop. Rejected because MCP verb handlers may block the event loop and silent starvation is the exact failure mode we’re eliminating. Alternative considered: sync threading.Thread with blocking client (v1.0). Rejected because it mixes sync/async concurrency models. |
| Q10 | Re-election trigger mechanism | Redis keyspace notifications on __keyevent@0__:expired | Alternative considered: peer polling via SETNX. Rejected for wasted RTT. Alternative considered: clean-release PUBLISH only. Rejected because unclean deaths would orphan the master slot. Keyspace notifications + clean-release PUBLISH together cover both paths. |
| Q11 (v1.2) | Does MCP client connect to Redis directly? | No — MCP posts to backend /controller/heartbeat; backend issues EXPIRE. | Added in v1.2 per Frank’s steering. Alternative considered: direct MCP→Redis connection. Rejected for three reasons: (1) Cloud-mode: managed Redis credentials would have to live on every user’s laptop, making one leaked user a cluster-wide compromise. (2) LAN-mode: would require broadening ufw from admin-IP-only to LAN-subnet. (3) Credentials + Redis logic in one place (backend) vs. forked across clients. Backend-mediated adds ~5-20ms per 30s heartbeat — negligible. |
14. Acceptance criteria
1. PRISM_REDIS_URL resolves per SPEC-019 mode profiles on the backend only.
2. All compose stacks gain Redis service (no host port published).
3. backend/app/session_store/ exists with typed client + pipeline helpers + Lua script registry.
4. prism_start writes Redis (+ Postgres in Phase A).
5. prism_start returns identical controller_status shape.
6. MCP heartbeat spawns on prism_start via mcp/heartbeat.py; POSTs to backend /controller/heartbeat every 30s on its own asyncio loop in a daemon thread.
7. Backend /controller/heartbeat endpoint refreshes session + master TTLs via Redis EXPIRE.
8. TTL expiry → next prism_start wins master, with no worker involved.
9. Redis DOWN → prism_start errors; /controller/heartbeat returns 503.
10. Multi-container backend-grpc: PUBLISH lands on all subscribers.
11. Smoke test covers full session-plane round trip: register, heartbeat via HTTP, natural expiry, preempt (via Lua), release, pub/sub delivery to gRPC stream.
12. Silent-release bug is fixed structurally (no sweep worker required).
13. Phase A exit SLI: over a continuous 7-day window, ≥99% of prism_start calls complete without a Redis-coordination error, AND zero code paths read from controller_registrations for live decisions (grep-verified). Application-level SLI, measurable via request logs.
14. prism_checkpoint issues heartbeat-refresh without DEL.
15. Concurrent-CD preemption test (§5.2.1 Lua CAS) passes.
16. Keyspace-notification-driven re-election test passes.
17. MCP does NOT take redis as a dependency. mcp/requirements.txt unchanged from pre-SPEC-032 state.
18. mcp/heartbeat.py tested in isolation: verify HTTP ping cadence, retry on failure, clean shutdown on stop_event.
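The Phase A exit SLI is a two-part check that can be evaluated mechanically from request-log counts. The sketch below assumes the counts have already been aggregated over the 7-day window; the function and parameter names are hypothetical. It uses integer arithmetic so the 99% boundary is exact.

```python
def phase_a_exit_met(
    total_starts: int,
    coordination_errors: int,
    live_reads_from_controller_registrations: int,
    threshold_percent: int = 99,
) -> bool:
    """True when both halves of the Phase A exit SLI hold for the window."""
    if total_starts <= 0:
        return False  # no traffic in the window proves nothing
    successes = total_starts - coordination_errors
    # successes / total_starts >= threshold_percent / 100, without floats.
    rate_ok = successes * 100 >= threshold_percent * total_starts
    # Second half: no code path reads the Postgres table for live decisions.
    return rate_ok and live_reads_from_controller_registrations == 0
```

For example, 1000 prism_start calls with 10 Redis-coordination errors sits exactly on the 99% boundary and passes; an 11th error, or any live read from controller_registrations, fails the check.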
15. Authorship + review trail
- Original author: Donna (Claude Code, session a54a1f65) 2026-04-23 16:19Z.
- v1.0 reviewer: Lola (Claude Desktop, session 97e32dbf) 2026-04-23 18:29Z.
- v1.1 implementation reviewer: Donna (Claude Code, session a54a1f65) 2026-04-23 PM. Scope: concurrent-CD race (§5.2.1), heartbeat concurrency model (§5.3), re-election trigger (§5.3.1), measurable acceptance criterion (§14 #13), rationale column in §13.
- v1.2 architectural refinement: Donna (Claude Code) + Frank 2026-04-23 PM. Scope: Option B (backend-mediated heartbeat) adopted after Frank’s steering during implementation review. Redis stays backend-only — cloud credentials never exposed to clients, LAN ufw stays tight, single source of Redis knowledge.
- Steering: Frank — questioned Postgres choice, directed Redis-as-anchor posture, confirmed Option B at implementation review.
- Trigger: silent-master-release bug during SPEC-030 wrap-discussion prep.