Status: accepted v0.3 — body sourced from TriGraph spec entity. Docs-mirror file added by docs lane to surface ratified specs that lacked an on-disk mirror.

SPEC-082: Operator-Driven Master Handoff Verb

SPEC-082 v0.3 — Operator-Driven Master Handoff Verb

v0.3: applied Texi’s v0.2 nit — stale_master added to both verbs’ error lists for CAS-lost-race consistency with acceptance test #9.

Motivation

Today, master election is implicit: the first registered identity wins, and there is no operator override. Operators cannot redirect master to a chosen identity. The only path is for the current master to wrap and hope the operator’s preferred identity wins the next-registered race, which fails if the wrapped master autostarts (observed 2026-05-04 evening, postmortem f7ff9a8e). Frank’s recurring need is to name a specific identity as master mid-session for handoff (governance master ↔ engineering master).

Scope — two verbs, split by intent

prism_master_handoff(pid, to_identity, to_session_id?)

Cooperative transfer. Caller MUST be current master. Atomic server flow:
  1. Resolve target row (see §Target resolution).
  2. Update Redis master slot per SPEC-032 (single-writer through SM facade/API, not direct SessionStore imports outside session_manager) — Redis is authoritative for routing.
  3. Set is_master=false on caller’s controller row.
  4. Set is_master=true on resolved target row.
  5. Emit MasterPreempted to previous master with reason=handoff (reuses existing direct-to-previous-master contract).
  6. Return {ok, previous_master, new_master}.
Errors: not_master, target_not_registered, target_stale, target_ambiguous, stale_master (CAS-lost race — caller’s master view is stale, another concurrent claim/handoff already changed master state).
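The handoff flow above can be sketched in memory as follows. This is a minimal illustration, not the server implementation: `InMemoryPlanes`, `compare_and_set_master`, and the `observed_master` parameter are hypothetical names, target resolution is reduced to a lookup by session id, and the Redis slot and controller rows are modelled as plain Python objects.

```python
from dataclasses import dataclass, field


class HandoffError(Exception):
    """Carries one of the error codes listed for prism_master_handoff."""

    def __init__(self, code: str):
        super().__init__(code)
        self.code = code


@dataclass
class InMemoryPlanes:
    """Hypothetical stand-in for the SM facade (master slot) plus controller rows."""

    master_slot: str                            # session_id currently holding master
    rows: dict = field(default_factory=dict)    # session_id -> {"identity", "is_master"}
    signals: list = field(default_factory=list)

    def compare_and_set_master(self, expected: str, new: str) -> bool:
        # Models the CAS on the master slot; a real facade would make this atomic.
        if self.master_slot != expected:
            return False
        self.master_slot = new
        return True


def master_handoff(planes, caller_sid, target_sid, observed_master=None):
    # observed_master models the caller's snapshot of the slot (assumption of this
    # sketch); passing an outdated value simulates a lost CAS race.
    snapshot = planes.master_slot if observed_master is None else observed_master
    if snapshot != caller_sid:
        raise HandoffError("not_master")
    if target_sid not in planes.rows:                    # step 1, simplified
        raise HandoffError("target_not_registered")
    if not planes.compare_and_set_master(snapshot, target_sid):   # step 2
        raise HandoffError("stale_master")
    planes.rows[caller_sid]["is_master"] = False         # step 3
    planes.rows[target_sid]["is_master"] = True          # step 4
    planes.signals.append(                               # step 5
        {"type": "MasterPreempted", "to": caller_sid, "reason": "handoff"}
    )
    return {"ok": True, "previous_master": caller_sid, "new_master": target_sid}  # step 6
```

In this sketch, not_master is decided from the caller's own snapshot, while stale_master only appears when the slot changed between the snapshot and the write.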

prism_master_claim(pid, to_identity, operator_id, operator_password, to_session_id?)

Operator-authorized preempt — caller and target are independent. Caller proves authority (operator creds per SPEC-038 §3.2); target identifies who gets master. Atomic server flow:
  1. Validate operator credentials.
  2. Resolve target row (see §Target resolution).
  3. Update Redis master slot per SPEC-032 (through SM facade).
  4. Mark all controller rows in pid is_master=false.
  5. Set is_master=true on resolved target row.
  6. Emit MasterPreempted to previous master with reason=preempt, by_operator=operator_id.
  7. Return {ok, previous_master, new_master, preempted}.
Errors: invalid_operator_credentials, target_not_registered, target_stale, target_ambiguous, stale_master (CAS-lost race: a concurrent claim/handoff resolved first).
Note (Texi-finding-1 fix from v0.1): target is explicit in both verbs. v0.1 had prism_master_claim promoting the caller, which contradicted the motivation. Caller authority and target identity are now separated, matching the use case.
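A minimal sketch of the claim flow, mainly to show that the operator credential check and the promoted target are independent inputs. Several things here are assumptions: the `operators` dict with plaintext passwords exists only for the sketch (real validation is defined by SPEC-038 §3.2), target resolution is reduced to a lookup by session id, and the `preempted` field is treated as a boolean meaning "a different session lost master", which the spec text does not define.

```python
import hmac


class ClaimError(Exception):
    """Carries one of the error codes listed for prism_master_claim."""

    def __init__(self, code: str):
        super().__init__(code)
        self.code = code


def master_claim(state, operators, operator_id, operator_password, target_sid):
    # Step 1: validate operator credentials (constant-time compare; sketch only).
    expected = operators.get(operator_id)
    if expected is None or not hmac.compare_digest(
        expected.encode(), operator_password.encode()
    ):
        raise ClaimError("invalid_operator_credentials")

    # Step 2: resolve target, simplified to an exact session lookup.
    if target_sid not in state["rows"]:
        raise ClaimError("target_not_registered")

    previous = state["master_slot"]
    state["master_slot"] = target_sid                  # step 3: routing plane first
    for row in state["rows"].values():                 # step 4: clear every row in pid
        row["is_master"] = False
    state["rows"][target_sid]["is_master"] = True      # step 5
    state["signals"].append(                           # step 6
        {
            "type": "MasterPreempted",
            "to": previous,
            "reason": "preempt",
            "by_operator": operator_id,
        }
    )
    return {                                           # step 7
        "ok": True,
        "previous_master": previous,
        "new_master": target_sid,
        "preempted": previous != target_sid,           # assumed meaning, see lead-in
    }
```

Note that the caller's own session never appears in the signature: only the operator's authority and the named target matter.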

Target resolution (Texi-finding-3 fix from v0.1)

Deterministic algorithm:
  1. If to_session_id provided: select that exact controller row (must match pid+identity, must be active). Authoritative override.
  2. Else, candidate set = active controller rows in pid where identity=to_identity. A candidate is fresh when its last_heartbeat is within the freshness threshold (default 30s, configurable via SPEC-032 plane); otherwise it is stale.
  3. Empty candidate set → target_not_registered.
  4. Candidate set contains only stale rows (rows exist for identity but all are heartbeat-aged-out) → target_stale.
  5. Exactly one fresh candidate → select it.
  6. Multiple fresh candidates → target_ambiguous error listing candidate session_ids. Operator disambiguates via to_session_id.
The companion bug fix (controller-row UPSERT, see Dependencies) makes case 6 vanishingly rare in practice — but the verb must still handle it deterministically while peer agents have stale rows pre-fix.
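The resolution algorithm can be sketched as a pure function. The sketch reads case 3 as "no active rows for the identity" and case 4 as "rows exist but none are fresh", which matches acceptance tests #3 and #4. `ControllerRow`, `ResolutionError`, and the `now` parameter are hypothetical names, and returning target_not_registered when an explicit to_session_id does not match is an assumption, since the spec does not name the error for that case.

```python
from dataclasses import dataclass

FRESHNESS_SECONDS = 30.0  # default threshold from the spec


class ResolutionError(Exception):
    def __init__(self, code, candidates=()):
        super().__init__(code)
        self.code = code
        self.candidates = list(candidates)  # session_ids, filled for target_ambiguous


@dataclass(frozen=True)
class ControllerRow:
    pid: str
    identity: str
    session_id: str
    last_heartbeat: float  # seconds, on the same clock as `now`
    active: bool = True


def resolve_target(rows, pid, to_identity, now, to_session_id=None,
                   freshness=FRESHNESS_SECONDS):
    # Case 1: explicit session id is an authoritative override.
    if to_session_id is not None:
        for row in rows:
            if (row.session_id == to_session_id and row.pid == pid
                    and row.identity == to_identity and row.active):
                return row
        raise ResolutionError("target_not_registered")  # assumed error code

    # Case 2: active rows for this identity in this pid.
    registered = [r for r in rows
                  if r.active and r.pid == pid and r.identity == to_identity]
    if not registered:                                   # case 3
        raise ResolutionError("target_not_registered")

    fresh = [r for r in registered if now - r.last_heartbeat <= freshness]
    if not fresh:                                        # case 4
        raise ResolutionError("target_stale")
    if len(fresh) > 1:                                   # case 6
        raise ResolutionError("target_ambiguous", [r.session_id for r in fresh])
    return fresh[0]                                      # case 5
```

Taking `now` as a parameter rather than reading the clock keeps the function deterministic, which makes the acceptance cases easy to reproduce.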

Dual-plane state (Texi-finding-4 fix from v0.1, SPEC-032 alignment)

Master state lives in two planes this phase:
  • Redis session plane — authoritative for routing/realtime, SM-owned per SPEC-032. All updates go through the Session Manager facade/API; no direct SessionStore imports outside session_manager.
  • Postgres controller_status rows — audit + cold-restore.
Both verbs MUST update Redis FIRST (single-writer through SM), then Postgres. If Redis succeeds but Postgres fails: log + retry the Postgres write; do NOT roll back Redis (routing already changed). prism_status reads from Redis, so post-verb status reflects the new master immediately. Without this rule, prism_status and routing would disagree until the next reconciliation pass.
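The Redis-first ordering can be illustrated with a small helper. This is a sketch under assumptions: `write_redis` and `write_postgres` are hypothetical callables standing in for the SM facade and the controller_status update, and the bounded retry count is a choice of this example. The spec only says to log and retry, and to leave Redis untouched.

```python
import logging

logger = logging.getLogger("spec082.sketch")


def apply_master_change(write_redis, write_postgres, new_master, max_attempts=3):
    # Redis goes first. If this raises, nothing has changed and the verb fails.
    write_redis(new_master)

    # Postgres second. Failures are logged and retried; Redis is never rolled
    # back because routing has already switched to the new master.
    for attempt in range(1, max_attempts + 1):
        try:
            write_postgres(new_master)
            return {"redis": True, "postgres": True, "attempts": attempt}
        except Exception as exc:
            logger.warning(
                "postgres write failed (attempt %d/%d): %s",
                attempt, max_attempts, exc,
            )

    # Retries exhausted: the planes disagree until a later write or
    # reconciliation pass catches up (handling here is an assumption).
    return {"redis": True, "postgres": False, "attempts": max_attempts}
```

Returning the per-plane outcome instead of raising keeps the verb's success tied to the authoritative plane, which is the behaviour the rule above asks for.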

Signal contract (Texi-finding-2 fix from v0.1)

Reuse existing MasterPreempted enum value. Do NOT introduce MasterChanged. Payload extension on MasterPreempted:
  • previous_master_identity, previous_master_session_id
  • new_master_identity, new_master_session_id
  • reason: "preempt" | "handoff"
  • by_operator: operator_id (only on preempt path; absent on handoff)
Delivery: direct to previous master per existing contract. Broadcast notification to peers (so they refresh routing tables) deferred to SPEC-073 alignment — for v0.3, peers learn via next prism_status poll or PeerJoined/PeerLeft delta.
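One way to pin down the extended payload is a small typed record that enforces the by_operator rule. The class name `MasterPreemptedPayload` and the `to_dict` helper are hypothetical; only the field names and the two reason values come from the contract above.

```python
from dataclasses import dataclass, asdict
from typing import Literal, Optional


@dataclass(frozen=True)
class MasterPreemptedPayload:
    previous_master_identity: str
    previous_master_session_id: str
    new_master_identity: str
    new_master_session_id: str
    reason: Literal["preempt", "handoff"]
    by_operator: Optional[str] = None  # operator_id, preempt path only

    def __post_init__(self):
        if self.reason not in ("preempt", "handoff"):
            raise ValueError(f"unknown reason: {self.reason!r}")
        if self.reason == "preempt" and self.by_operator is None:
            raise ValueError("by_operator is required on the preempt path")
        if self.reason == "handoff" and self.by_operator is not None:
            raise ValueError("by_operator must be absent on the handoff path")

    def to_dict(self) -> dict:
        # by_operator is absent (not null) on the handoff path.
        data = asdict(self)
        if data["by_operator"] is None:
            del data["by_operator"]
        return data
```

Dropping the key entirely on handoff, rather than sending a null, follows the "absent on handoff" wording.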

Acceptance tests

  1. Cooperative handoff success — current master calls handoff, target promoted, MasterPreempted reason=handoff emitted.
  2. not_master — non-master attempts handoff, rejected.
  3. target_not_registered — handoff to identity with zero active rows.
  4. target_stale — handoff to identity whose only rows have stale heartbeat.
  5. target_ambiguous — handoff to identity with multiple fresh rows, no to_session_id provided; error lists candidates.
  6. to_session_id disambiguation — handoff with explicit session_id picks that row even when others are fresh.
  7. invalid_operator_credentials — claim with wrong password rejected.
  8. Operator claim to named target — claim with valid creds promotes named target (not caller).
  9. Concurrent claims/handoffs atomicity — two simultaneous calls; exactly one wins, other gets stale_master error.
  10. MasterPreempted payload shape — reason field correct on both paths; by_operator present on preempt only.
  11. prism_status reflects new master immediately after verb returns (Redis-first invariant).
  12. Cross-machine handoff — handoff to identity registered on different machine succeeds.
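Acceptance test #9 can be prototyped with threads before it is wired to a real session plane. The sketch below assumes a lock-protected compare-and-set; `MasterSlot` and `race` are hypothetical names, and the lock stands in for whatever atomic primitive the SM facade uses.

```python
import threading


class MasterSlot:
    """Lock-protected slot standing in for the master slot (sketch only)."""

    def __init__(self, holder):
        self._holder = holder
        self._lock = threading.Lock()

    def read(self):
        with self._lock:
            return self._holder

    def compare_and_set(self, expected, new):
        with self._lock:
            if self._holder != expected:
                return False
            self._holder = new
            return True


def race(slot, targets):
    """Run one claim per target concurrently, all from the same snapshot."""
    observed = slot.read()
    barrier = threading.Barrier(len(targets))
    results = {}

    def attempt(target):
        barrier.wait()  # line the callers up so they contend on the CAS
        won = slot.compare_and_set(observed, target)
        results[target] = "ok" if won else "stale_master"

    threads = [threading.Thread(target=attempt, args=(t,)) for t in targets]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Whichever thread reaches the slot first wins; the other sees a changed holder and reports stale_master, regardless of scheduling order.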

Dependencies

  • SPEC-030 (broader election + leases): SPEC-082 ships before SPEC-030 and integrates with leases later.
  • SPEC-032 (Redis session plane): master state goes here first; all updates through SM facade.
  • SPEC-038 §3.2 (operator credentials for prism_master_claim).
  • Companion bug fix (postmortem f7ff9a8e action item): controller-row UPSERT on (project_id, identity, surface, machine_id) + prism_session_deregister(session_id). Reduces target_ambiguous from common to rare.
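To illustrate why the companion fix makes target_ambiguous rare, here is a sketch of the UPSERT keyed on (project_id, identity, surface, machine_id), using a dict in place of the controller table. The function names and the "cli" surface value are hypothetical. Only the key columns and the existence of a deregister-by-session call come from the dependency above.

```python
def upsert_controller_row(table, project_id, identity, surface, machine_id, session_id):
    # Re-registering with the same key replaces the old row instead of adding one.
    key = (project_id, identity, surface, machine_id)
    table[key] = {"session_id": session_id}
    return key


def session_deregister(table, session_id):
    # Remove the row owned by this session, if any.
    for key, row in list(table.items()):
        if row["session_id"] == session_id:
            del table[key]
            return True
    return False


def sessions_for(table, project_id, identity):
    # Session ids currently registered for one identity in one project.
    return [
        row["session_id"]
        for (pid, ident, _surface, _machine), row in table.items()
        if pid == project_id and ident == identity
    ]
```

A restart on the same machine and surface overwrites the previous row, so the identity keeps a single candidate. Two rows for one identity remain possible only across different surfaces or machines, which is where to_session_id still applies.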

Out of scope (v0.3)

  • Lease renewal / TTL auto-failover (SPEC-030).
  • Automatic role-aware election (“engineering takes master when PR open”).
  • Multi-tier operator authority (single operator credential gate this phase).
  • Peer-broadcast MasterChanged semantics (deferred to SPEC-073 alignment).

Open questions resolved from v0.1

  • Cross-machine handoff: yes, supported (target resolved by identity+session, machine implicit).
  • Signal trace continuity: previous_master_session_id added to MasterPreempted payload.
  • Distinct verbs vs force=true: keep distinct — claim takes named target so it’s clearly distinct from force=true identity preempt.

Review history

  • v0.1 → Texi review: 4 findings (claim signature, MasterChanged removal, target ambiguity contract, dual-plane state) + acceptance tests ask.
  • v0.2 → Texi review: approved_with_minor_nit (stale_master vocabulary alignment + implementation-phrasing reminder).
  • v0.3 → applied stale_master + SM-facade phrasing.
  • Extends: SPEC-030 §master-election (operator-override path)
  • Aligns: SPEC-032 (Redis session plane authoritative for routing)
  • Related: SPEC-038 §3.2 (operator credentials), postmortem f7ff9a8e (controller-row leak — companion fix)
Last modified on May 9, 2026