Status: complete · Version 1 · Filed 2026-04-23

Goal

Ship the Postgres-backed controller registration table, election logic embedded in prism_start / prism_wrap, and the new prism_status verb. No gRPC, no streams, no leases, no capability tokens; those are Phases 2–6. Immediately useful to Frank on landing: prism_status shows live agent topology per project (who’s master, who’s peer, who’s stale). Lola + Donna dogfood the election from day one.

Scope — in

  • Migration 01X_spec030_phase1_controller_registrations.py: controller_registrations table + three indexes + the agent_surface CHECK constraint.
  • SQLAlchemy model + Pydantic schemas for registration.
  • Backend service layer: register_controller, release_controller, list_registrations, update_heartbeat.
  • Backend router: POST /api/v1/controller/register, POST /api/v1/controller/{id}/release, GET /api/v1/controller/{pid}/status.
  • MCP enhancements to existing verbs:
    • prism_start — minting flow gains registration + election + context-switch detection.
    • prism_wrap — existing wrap flow gains deregistration.
  • MCP new verb: prism_status(pid) — read-only routing table view.
  • Backend middleware hook: update last_heartbeat for the caller’s active registration on every authenticated verb call (drives peer staleness; see Decisions §D2).
  • Background sweep: release registrations whose last_heartbeat is >10 min stale (configurable via PRISM_CONTROLLER_PEER_STALE_MINUTES). Runs on backend-http startup + every 60s.
  • Smoke test mcp/smoke_spec030_phase1.py — seven scenarios per §6.

Scope — out

  • gRPC service (backend-grpc entrypoint) — Phase 3.
  • Proto codegen — Phase 2.
  • Leases + capability tokens — Phase 4.
  • Approval flow — Phase 5.
  • Observability (OpenTelemetry, metrics) — Phase 6.
  • SPEC-029 controller nudge kinds — Phase 5 (arrive with approvals).
  • BIOS/CLAUDE.md updates for agent contract around controller — defer to Phase 3 (nothing to surface until streams exist).

Decisions I’m making (judgment calls flagged for Lola review in PR)

D1. CD preemption trigger

Implement: CD preempts when Claude Desktop calls prism_start(pid) on the contested PID. Not on CD process existence, not on CD activity in other projects. Rationale: covers Frank’s “don’t yank master when CD is open but idle” concern. Matches NetBIOS analogy (a machine becomes master by participating, not by existing). Spec text is ambiguous; will flag in PR and ask Lola to confirm the interpretation for SPEC-030 v0.2.

D2. Peer heartbeat write path

Implement: backend updates last_heartbeat on every authenticated verb call carrying the caller’s session_id. Peer registrations auto-release after >10 min of no verb traffic. Rationale: Postgres-native (no new mechanism, no MCP timer, no polling endpoint), exploits existing verb traffic, low write cost (one indexed UPDATE per verb). Staleness threshold configurable via env var. Masters use gRPC keepalive in Phase 3+; Phase 1 masters heartbeat via the same verb-call path (they still make verb calls, just also hold a stream later).
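The heartbeat rule in D2 can be sketched in plain Python. This is only an illustration under stated assumptions: the in-memory `registrations` dict stands in for the indexed UPDATE against Postgres, and the helper names `stale_threshold`, `is_stale`, and `touch_heartbeat` are hypothetical. Only the env var name and the 10-minute default come from the plan.

```python
import os
from datetime import datetime, timedelta, timezone

DEFAULT_STALE_MINUTES = 10  # default threshold stated in D2


def stale_threshold() -> timedelta:
    """Read the staleness window from the env var, falling back to 10 minutes."""
    raw = os.environ.get("PRISM_CONTROLLER_PEER_STALE_MINUTES", "")
    try:
        minutes = int(raw)
    except ValueError:
        minutes = DEFAULT_STALE_MINUTES
    return timedelta(minutes=minutes)


def is_stale(last_heartbeat: datetime, now: datetime, threshold: timedelta) -> bool:
    """A registration is stale once its last heartbeat is older than the threshold."""
    return now - last_heartbeat > threshold


def touch_heartbeat(registrations: dict, session_id: str, now: datetime) -> bool:
    """Hypothetical stand-in for the per-verb UPDATE keyed on session_id.

    Returns False when the caller has no active (unreleased) registration.
    """
    row = registrations.get(session_id)
    if row is None or row["released_at"] is not None:
        return False
    row["last_heartbeat"] = now
    return True
```

In this sketch the middleware would call `touch_heartbeat` once per authenticated verb, so a peer only goes stale after a full threshold window with no verb traffic.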

D3. Multi-CD tiebreaker

Implement: if multiple claude_desktop registrations contend, earliest registered_at wins. Subsequent CD arrivals register as peers. Rationale: deterministic + matches the “first-come” fallback for non-CD. Avoids CD-vs-CD preemption loops if Frank runs CD on two machines.
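Taken together, D1 and D3 describe a small decision function. The sketch below is one possible reading, not the shipped election code: the `Registration` dataclass and the `elect` function are hypothetical names, and the function assumes it receives only the active registrations for a single PID.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime
from typing import Optional

CD = "claude_desktop"


@dataclass
class Registration:
    session_id: str
    agent_surface: str
    registered_at: datetime
    is_master: bool = False


def elect(active: list[Registration], candidate: Registration) -> tuple[str, Optional[Registration]]:
    """Return (role, demoted_master) for a caller running prism_start on one PID."""
    master = next((r for r in active if r.is_master), None)
    if master is None:
        return "master", None      # no active master: first caller wins
    if candidate.agent_surface == CD and master.agent_surface != CD:
        return "master", master    # D1: CD preempts a non-CD master
    return "peer", None            # D3: the sitting (earliest) master keeps the role
```

Because a later CD arrival never displaces an earlier CD master, two Claude Desktop instances on the same PID cannot preempt each other in a loop.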

D4. agent_surface enum lock

Implement: agent_surface VARCHAR(32) NOT NULL CHECK (agent_surface IN ('claude_desktop', 'claude_code', 'codex', 'cursor', 'other')). Rationale: typo defense + trust-boundary hardening. CHECK constraint is trivial in initial migration, expensive to add retroactively. Extend the set via future migration when new surfaces ship.
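To illustrate what the CHECK constraint buys, here is a minimal sketch that uses the standard-library sqlite3 module as a stand-in for Postgres. The table is cut down to hypothetical `id` and `session_id` columns plus `agent_surface`; the real migration would go through Alembic, and `try_register` is an illustrative helper, not part of the plan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE controller_registrations (
        id INTEGER PRIMARY KEY,
        session_id TEXT NOT NULL,
        agent_surface VARCHAR(32) NOT NULL
            CHECK (agent_surface IN
                   ('claude_desktop', 'claude_code', 'codex', 'cursor', 'other'))
    )
    """
)


def try_register(session_id: str, agent_surface: str) -> bool:
    """Insert a row; return False when the CHECK constraint rejects the surface."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO controller_registrations (session_id, agent_surface) "
                "VALUES (?, ?)",
                (session_id, agent_surface),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

A typo such as `claude-code` is rejected at the database boundary rather than silently creating a registration that no election rule recognizes.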

D5. Context switch transaction boundary

Implement: two sequential transactions, no compensation.
  • TXN1: soft-close old PID (write wrap_session nudge per SPEC-029, set released_at on old registration).
  • TXN2: election + insert on new PID.
Rationale: across-PID atomicity is unnecessary; TXN1 is correct standalone even if TXN2 fails (user retries prism_start on new PID). Simpler and avoids cross-scope transaction anti-patterns.
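The two-transaction shape can be sketched as follows, again with sqlite3 standing in for Postgres. The schema, the function names, and the simplified first-come election in TXN2 are all assumptions made for illustration. The one deliberate choice is that the nudge is written only when a registration was actually released, so a retry after a TXN2 failure does not write a second nudge.

```python
import sqlite3

# Hypothetical minimal schema for the sketch.
SCHEMA = """
CREATE TABLE controller_registrations (
    id INTEGER PRIMARY KEY,
    session_id TEXT NOT NULL,
    pid TEXT NOT NULL,
    is_master INTEGER NOT NULL DEFAULT 0,
    released_at TEXT
);
CREATE TABLE nudges (id INTEGER PRIMARY KEY, pid TEXT NOT NULL, kind TEXT NOT NULL);
"""


def soft_close_old_pid(conn, session_id, old_pid, now):
    """TXN1: release the old registration and write the wrap_session nudge."""
    with conn:  # one transaction: commit on success, rollback on error
        cur = conn.execute(
            "UPDATE controller_registrations SET released_at = ? "
            "WHERE session_id = ? AND pid = ? AND released_at IS NULL",
            (now, session_id, old_pid),
        )
        if cur.rowcount:  # only nudge when something was actually released
            conn.execute(
                "INSERT INTO nudges (pid, kind) VALUES (?, 'wrap_session')",
                (old_pid,),
            )
    return cur.rowcount > 0


def register_on_new_pid(conn, session_id, new_pid):
    """TXN2: simplified first-come election plus insert on the new PID."""
    with conn:
        has_master = conn.execute(
            "SELECT 1 FROM controller_registrations "
            "WHERE pid = ? AND is_master = 1 AND released_at IS NULL",
            (new_pid,),
        ).fetchone()
        conn.execute(
            "INSERT INTO controller_registrations (session_id, pid, is_master) "
            "VALUES (?, ?, ?)",
            (session_id, new_pid, 0 if has_master else 1),
        )
    return "peer" if has_master else "master"


def context_switch(conn, session_id, old_pid, new_pid, now):
    """Run TXN1 then TXN2; TXN1 stays committed even if TXN2 raises."""
    soft_close_old_pid(conn, session_id, old_pid, now)
    return register_on_new_pid(conn, session_id, new_pid)
```

If `register_on_new_pid` raises, the old PID is already cleanly closed and the caller can simply run prism_start on the new PID again.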

D6. Preemption is Postgres-authoritative; gRPC notification is a courtesy

Phase 1 has no gRPC streams, so preemption in Phase 1 just updates is_master=false on the old row and inserts the new master with is_master=true — both in one transaction, partial UNIQUE enforces single-master invariant. The old master discovers its demotion on its next prism_status call or next verb response (which can carry a you_were_preempted=true flag). Phase 3 adds the MasterPreempted gRPC push event for immediate notification; Phase 1 is polling/lazy.
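A rough sketch of the single-transaction preemption, assuming a partial unique index over active master rows. SQLite is used here only because it ships with Python and also supports partial indexes. The index name, column set, and `preempt` helper are hypothetical. Note the statement order: the old master is demoted before the new row is inserted, so the index is never violated mid-transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE controller_registrations (
        id INTEGER PRIMARY KEY,
        pid TEXT NOT NULL,
        session_id TEXT NOT NULL,
        is_master INTEGER NOT NULL DEFAULT 0,
        released_at TEXT
    );
    -- At most one active master per PID.
    CREATE UNIQUE INDEX uq_active_master
        ON controller_registrations (pid)
        WHERE is_master = 1 AND released_at IS NULL;
    """
)


def preempt(pid: str, new_session_id: str) -> None:
    """Demote the sitting master (if any) and insert the new one atomically."""
    with conn:  # commits on success, rolls back on any exception
        conn.execute(
            "UPDATE controller_registrations SET is_master = 0 "
            "WHERE pid = ? AND is_master = 1 AND released_at IS NULL",
            (pid,),
        )
        conn.execute(
            "INSERT INTO controller_registrations (pid, session_id, is_master) "
            "VALUES (?, ?, 1)",
            (pid, new_session_id),
        )
```

The demoted row stays active as a peer. In Phase 1 its owner would only learn of the change on its next prism_status call or verb response.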

Files touched

Backend (Python/FastAPI)

  • backend/alembic/versions/01X_spec030_phase1_controller_registrations.py — new migration.
  • backend/app/models/controller_registration.py — new SQLAlchemy model.
  • backend/app/schemas/controller_registration.py — new Pydantic schemas (Create, Out, StatusResponse).
  • backend/app/services/controller_service.py — new; register / release / list / elect / heartbeat / stale-sweep.
  • backend/app/routers/controller.py — new; 3 endpoints.
  • backend/app/main.py — register new router; wire stale-sweep background task on startup.
  • backend/app/auth/authforge.py — extend to call controller_service.update_heartbeat(session_id) in the authenticated-verb middleware path.
  • backend/app/workers/controller_sweep_worker.py — new; 60s loop that releases stale peer registrations.

MCP (Python, pre-SPEC-028)

  • mcp/server.py — enhance prism_start to call /controller/register after session mint; detect context switch; enhance prism_wrap to call /controller/{id}/release; add new verb prism_status.
  • mcp/client.py — add register_controller, release_controller, get_controller_status HTTP client methods.
(Per SPEC-028, the TS MCP migration is in flight; Phase 1 lands on the current Python MCP with boundaries clean enough that the TS port can mirror it trivially.)

Smoke test

  • mcp/smoke_spec030_phase1.py — seven scenarios:
    1. Fresh start with no existing master → caller becomes master.
    2. Second prism_start (non-CD) on same PID → peer registration.
    3. CD prism_start with non-CD master active → CD preempts.
    4. Master calls prism_wrap → deregistered, next start re-elects.
    5. prism_start on different PID → old registration released + wrap_session nudge written + new PID election runs.
    6. prism_status returns correct master/peer/caller shape.
    7. Stale sweep: manually backdate last_heartbeat on a peer row, invoke sweep, verify released_at set with reason stale_heartbeat.
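Scenario 7 above could be exercised roughly as follows. This is a sketch against an in-memory SQLite table, not the actual smoke test: the `released_reason` column name is an assumption (the plan only names the reason value `stale_heartbeat`), and epoch seconds stand in for Postgres timestamps.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Hypothetical minimal schema; epoch seconds stand in for Postgres timestamps.
SCHEMA = """
CREATE TABLE controller_registrations (
    session_id TEXT PRIMARY KEY,
    last_heartbeat REAL NOT NULL,
    released_at REAL,
    released_reason TEXT
)
"""


def sweep_stale(conn, now, stale_minutes=10):
    """Release active rows whose heartbeat is older than the cutoff; return the count."""
    cutoff = (now - timedelta(minutes=stale_minutes)).timestamp()
    with conn:
        cur = conn.execute(
            "UPDATE controller_registrations "
            "SET released_at = ?, released_reason = 'stale_heartbeat' "
            "WHERE released_at IS NULL AND last_heartbeat < ?",
            (now.timestamp(), cutoff),
        )
    return cur.rowcount


def scenario_7():
    """Backdate one peer, run the sweep, and report what was released."""
    now = datetime(2026, 4, 23, 12, 0, tzinfo=timezone.utc)
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    with conn:
        conn.executemany(
            "INSERT INTO controller_registrations (session_id, last_heartbeat) "
            "VALUES (?, ?)",
            [("peer-fresh", now.timestamp()), ("peer-stale", now.timestamp())],
        )
        # Manually backdate one peer past the 10-minute threshold.
        conn.execute(
            "UPDATE controller_registrations SET last_heartbeat = ? "
            "WHERE session_id = 'peer-stale'",
            ((now - timedelta(minutes=11)).timestamp(),),
        )
    released = sweep_stale(conn, now)
    rows = conn.execute(
        "SELECT session_id, released_reason FROM controller_registrations "
        "ORDER BY session_id"
    ).fetchall()
    return released, rows
```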

PR sequencing

Single PR. Estimated ~200 lines of new Python + 150 lines of tests + migration. Too small to split meaningfully; one PR is easier to review. PR title: feat(spec-030): Phase 1 — controller registrations + election + prism_status

Deployment (once PR lands)

  1. Mac (local dev loop): migration auto-applies on next start-backend.sh; smoke test locally.
  2. server1: rsync backend/ + mcp/ via existing deploy path (donna@server1 SSH, per memory reference_server1_ssh.md).
  3. docker compose -f docker-compose.server.yml build backend, then up -d backend.
  4. Wait healthy (~16s per observed pattern).
  5. Smoke test from Mac against server1.
  6. Lola + Donna both call prism_start on PID-PGR01; verify prism_status shows both with correct surfaces and CD-wins election.

Acceptance criteria (Phase 1 subset of SPEC-030 §24)

  1. controller_registrations table exists with columns + three indexes + the agent_surface CHECK constraint.
  2. prism_start registers the caller and runs the election per decisions §D1, §D3.
  3. CD preempts non-CD masters per §D1.
  4. prism_status returns the routing table.
  5. prism_wrap sets released_at on the caller’s registration.
  6. Context switch (prism_start(different_pid)) releases old registration + writes wrap_session nudge per §D5.
  7. Stale peer sweep releases registrations with last_heartbeat >10 min stale.
  8. Smoke test 7/7 scenarios pass locally and against server1.

Blockers / risks

  • Nothing blocking.
  • One dependency: SPEC-029’s nudge table must be live for §D5’s wrap_session nudge to write. SPEC-029 v0.2 is filed but not implemented; if the nudges table doesn’t exist yet, the context-switch path fails with a clear DB error. Mitigation: Phase 1 either implements SPEC-029’s minimal wrap_session nudge slice first, or gracefully no-ops the nudge write if the table is absent (logging a warning). Lean: no-op with warning; full SPEC-029 ships on its own timeline.
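The "no-op with warning" lean might look like the sketch below. It uses sqlite3 so it runs standalone. Against Postgres the code would instead catch the driver's undefined-table error, and the `write_wrap_session_nudge` helper, logger name, and nudges columns are all hypothetical.

```python
import logging
import sqlite3

logger = logging.getLogger("controller_service")


def write_wrap_session_nudge(conn: sqlite3.Connection, pid: str) -> bool:
    """Write the nudge if the nudges table exists; otherwise warn and skip.

    Returns True when a nudge row was written, False when the write was skipped.
    """
    try:
        with conn:
            conn.execute(
                "INSERT INTO nudges (pid, kind) VALUES (?, 'wrap_session')", (pid,)
            )
        return True
    except sqlite3.OperationalError as exc:
        # Only swallow the missing-table case; re-raise anything else.
        if "no such table" not in str(exc):
            raise
        logger.warning("nudges table absent; skipping wrap_session nudge for %s", pid)
        return False
```

Narrowing the except branch to the missing-table case keeps genuine database failures visible instead of hiding them behind the fallback.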

Follow-ups (outside Phase 1, tracked)

  • Pre-flight transaction audit (~2h grep-read of mutation verbs) — blocking for Phase 4, will run in background during Phase 2/3.
  • Items 3 (agent_surface trust disclosure), 4 (approval timeouts), 6 (NOTIFY 8000-byte cap), 7 (channel split) — fold into SPEC-030 v0.2 when Lola next revises; all are Phase 3+ surface.

References

  • SPEC-030 v0.1 (this spec), spec_id 12757c47-e895-4ba6-89be-f9151f9d48ef.
  • SPEC-029 v0.2 — nudges table + wrap_session kind used by context switch.
  • SPEC-024 — projection-retirement pattern (released_at + partial UNIQUE) applied to registrations and leases.
  • SPEC-023 — session_id mint lifecycle, reused for registration scoping.
  • SPEC-021 — prism_start bootstrap contract; registration is a new bootstrap side-effect.
Status: active
Last modified on April 27, 2026