Status: draft · Version 0.2 · Filed 2026-04-25
SPEC-035 v0.2 — Agent engagement signal for signal routing and session supersede. STATUS: Partially shipped. §1, §2, §3 live on commit 11537f7. §4 (prism_start supersede) DISABLED in code with rationale comment; needs redesign.

What shipped (commit 11537f7, 2026-04-25)

§1 Migration 021: adds the controller_registrations.last_verb_at column (TIMESTAMPTZ NULL), backfills it from last_heartbeat, and creates the partial index ix_controller_engagement ON (agent_identity, last_verb_at) WHERE released_at IS NULL.

§2 authforge.require_auth stamps last_verb_at on authenticated verb calls, excluding /controller/heartbeat (that path only bumps last_heartbeat per SPEC-030). Added the _is_heartbeat_path helper. The controller_service.stamp_engagement function mirrors the stamp_heartbeat pattern (decoupled session, silent no-op when there is no active registration).

§3 signal_service._resolve_identity_to_session now accepts an optional db_session. When multiple sessions share an identity and none is master, it queries Postgres with ORDER BY last_verb_at DESC NULLS LAST, last_heartbeat DESC. Master still wins. Pre-upgrade NULL rows fall to the end of the ordering.

§5 release_reason taxonomy: the ‘superseded_by_new_start’ value is reserved in code comments but not yet written (since §4 is disabled).

Schema: last_verb_at is exposed in the ControllerRegistrationOut response.

Verified on server1: migration applied cleanly (alembic_version=021), backfill populated all existing rows, post-verb engagement stamp observed (my session's last_verb_at advanced 60+s past the zombie's last_verb_at during a prism_whois call), and signal routing uses the new ordering.
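The §3 preference order can be illustrated in memory. This is a minimal sketch, not the shipped query: the Registration dataclass, its is_master field, and the pick_session name are hypothetical stand-ins, and the real code runs the ORDER BY in Postgres rather than sorting in Python.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Registration:
    # Hypothetical stand-in for an unreleased controller_registrations row.
    session_id: str
    is_master: bool
    last_verb_at: Optional[datetime]  # None models a NULL last_verb_at
    last_heartbeat: datetime


def pick_session(candidates: list[Registration]) -> Optional[Registration]:
    """Pick the routing target among registrations sharing one identity."""
    if not candidates:
        return None
    # Master still wins, regardless of engagement.
    for reg in candidates:
        if reg.is_master:
            return reg

    # Emulates ORDER BY last_verb_at DESC NULLS LAST, last_heartbeat DESC:
    # rows with a verb stamp outrank NULL rows, then the newest stamp wins,
    # then the newest heartbeat breaks ties.
    def sort_key(reg: Registration) -> tuple[bool, float, float]:
        has_verb = reg.last_verb_at is not None
        verb_ts = reg.last_verb_at.timestamp() if has_verb else 0.0
        return (has_verb, verb_ts, reg.last_heartbeat.timestamp())

    return max(candidates, key=sort_key)


if __name__ == "__main__":
    now = datetime(2026, 4, 25, 12, 0, tzinfo=timezone.utc)
    zombie = Registration("zombie", False, now - timedelta(seconds=120), now)
    engaged = Registration("engaged", False, now - timedelta(seconds=5),
                           now - timedelta(seconds=20))
    print(pick_session([zombie, engaged]).session_id)
```

Note that the zombie in this example has the fresher heartbeat and still loses, because its heartbeat thread keeps last_heartbeat current while nothing advances its last_verb_at.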

What got disabled during implementation (§4)

Attempted supersede: on every register() call, UPDATE the other registrations for the same (identity, machine), i.e. rows where session_id != req.session_id, setting released_at=NOW() and release_reason=‘superseded_by_new_start’. Result: infinite re-register loop. Sequence:
  1. User calls prism_start → new session W created, prior session X superseded.
  2. X’s MCP subprocess still has a live heartbeat thread (30s timer spawned at original prism_start, never torn down).
  3. X’s next heartbeat POST hits backend, gets 410 (released session).
  4. Client-side auto-recovery fires: POSTs /controller/register with a fresh session_id Y for the same subprocess.
  5. Y’s register() runs supersede on SAME (identity, machine) → supersedes W.
  6. W’s heartbeat thread hits 410, auto-recovers as Z, supersedes Y.
  7. Cascade continues indefinitely (~10–30s per iteration per live subprocess on the machine).
Observed on server1: 11 superseded Donna registrations in ~2min before revert. Root cause: supersede can’t distinguish “abandoned subprocess’s heartbeat thread” from “different live subprocess on same machine.” Both look like ‘an older registration for the same (identity, machine).’ Killing either produces a 410 → auto-recovery → fresh registration → re-supersede loop.
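The cascade can be reproduced with a toy model. The sketch below is an illustration under simplifying assumptions, not the backend code: the Backend class and run function are hypothetical, all subprocesses share one (identity, machine) pair, and each tick stands for one heartbeat interval in which every live subprocess posts once.

```python
from itertools import count


class Backend:
    """Toy model of the disabled §4 behaviour for one (identity, machine) pair."""

    def __init__(self) -> None:
        self._ids = count(1)
        self.active: set[int] = set()
        self.superseded = 0

    def register(self) -> int:
        # Supersede every other unreleased registration, as the attempted UPDATE did.
        self.superseded += len(self.active)
        self.active.clear()
        session_id = next(self._ids)
        self.active.add(session_id)
        return session_id

    def heartbeat(self, session_id: int) -> int:
        # 200 for a live session, 410 once the session has been released.
        return 200 if session_id in self.active else 410


def run(live_subprocesses: int, heartbeat_ticks: int) -> int:
    """Return how many registrations were superseded after the given ticks."""
    backend = Backend()
    sessions = [backend.register() for _ in range(live_subprocesses)]
    for _ in range(heartbeat_ticks):
        for i, session_id in enumerate(sessions):
            if backend.heartbeat(session_id) == 410:
                # Client-side auto-recovery: re-register with a fresh session_id.
                sessions[i] = backend.register()
    return backend.superseded


if __name__ == "__main__":
    print(run(live_subprocesses=1, heartbeat_ticks=10))
    print(run(live_subprocesses=2, heartbeat_ticks=3))
```

In this model a single subprocess never supersedes anything, while two or more keep releasing each other on every tick, which is the same shape as the loop described above.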

What’s required to ship §4 safely

Need one of:
  (a) Client-side heartbeat-thread cleanup: when a subprocess receives 410 with reason=‘superseded_by_new_start’ (vs. stale_heartbeat), stop auto-recovery; let the subprocess’s heartbeat thread die quietly.
  (b) Distinguishing header on auto-recovery POSTs (e.g. X-Prism-Reregister: auto) so the backend skips supersede for auto-recovery requests only.
  (c) Stable subprocess identifier in the registration payload so supersede can scope to same-subprocess (replace my own old session) vs cross-subprocess (preserve).
Recommend (a) + (b) together: the backend distinguishes the auto-recovery case so it doesn’t supersede on that path, and the client distinguishes the ‘you were superseded’ case so it doesn’t fight back.
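The two decisions in (a) + (b) reduce to a pair of small predicates. This is a hedged sketch only: should_supersede and should_auto_recover are hypothetical names, the header is the example from (b), and how the 410 response actually carries its reason is assumed rather than specified here.

```python
from typing import Mapping, Optional

REREGISTER_HEADER = "X-Prism-Reregister"  # example header name from option (b)
SUPERSEDED = "superseded_by_new_start"
STALE = "stale_heartbeat"


def should_supersede(headers: Mapping[str, str]) -> bool:
    """Backend side (b): skip supersede when the POST is an auto-recovery.

    A plain Mapping lookup is case-sensitive; a real framework's header
    object would normally handle case-insensitive names for us.
    """
    return headers.get(REREGISTER_HEADER, "").lower() != "auto"


def should_auto_recover(status: int, reason: Optional[str]) -> bool:
    """Client side (a): re-register after a 410 unless we were superseded."""
    if status != 410:
        return False
    # A superseded subprocess lets its heartbeat thread die quietly.
    return reason != SUPERSEDED


if __name__ == "__main__":
    print(should_supersede({REREGISTER_HEADER: "auto"}))
    print(should_auto_recover(410, STALE))
    print(should_auto_recover(410, SUPERSEDED))
```

Either predicate alone breaks one link of the loop; using both means an older client that still auto-recovers cannot supersede anyone, and a newer client does not try.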

Impact of shipping without §4

§3 engagement-preference routing already solves the signal-misroute problem without killing zombies. Zombies stay in the table but always lose routing vs. the engaged session. Whois remains visually cluttered (multiple Donnas). Stale_heartbeat sweep still eventually releases abandoned subprocesses (10 min after their heartbeat stops). Net: the ‘false queue’ correctness bug is addressed. Zombie cleanup ergonomics regress to pre-SPEC-035 (rely on 10-min sweep). Acceptable.

Non-goals, Backwards compatibility, Observability sections unchanged from v0.1.


Open questions update (from v0.1)

  1. RESOLVED: path exclusion works via _is_heartbeat_path checking request.url.path.
  2. DEFERRED: engagement stale threshold for downstream consumers.
  3. PROMOTED TO BLOCKING for §4 ship: cross-subprocess vs same-subprocess distinction. Was flagged as an edge case in v0.1; turned out to be the central blocker.

References

  • Commit 11537f7 (shipped §1/§2/§3)
  • Original spec delta d29568a2 (v0.1)
  • Signal a8e3648f (Donna → Lola, heads-up FYI)
  • 2026-04-24 session transcript (Donna c5e0ea78 et al.)
Last modified on April 27, 2026