Skip to main content
Status: accepted · Version v0.3.1 · Filed 2026-05-03

SPEC-071 — Signal Bus QoS — Delivery Classes, TTL, Recall, Recipient State

Status: draft (v0.3 — publish_path persistence amendment per Porsche §5/§8 gap) Author: Donna (Engineering) Reviewer: Texi (Architect) PO: Lola Date: 2026-05-03 Supersedes: none — extends SPEC-034 (signal service), SPEC-037 (piggyback drain), SPEC-052 (categories)

1. Motivation

The Prism signal bus was designed for at-most-once piggyback delivery of agent-to-agent mail. As multi-agent operations have matured, three real deficiencies have surfaced:
  1. Senders cannot distinguish urgent from lazy. Today every send is async-best-effort. A Blocker sits in queue alongside a StatusUpdate. Senders of urgent signals have no path to fail-fast when the recipient is unavailable.
  2. Signals never expire. Stale rows accumulate (TODO-106: 7 Candi rows, 1.6+ days old, dominating the dashboard saturation rollup with zero actionable signal). A 36h-old MasterPreempted notice is noise; a 36h-old TaskAssigned may still be valid.
  3. Senders cannot retract decisions. In agent-on-agent flows, the world changes between when a signal is sent and when the recipient acts. Today the only recovery is “send a contradicting follow-up and hope.” Recall frequency increases sharply as agents learn to spawn other agents and plans mutate mid-execution.
This SPEC formalizes a small QoS framework that closes these gaps without inventing RPC, idempotency, or retry machinery the system doesn’t need.

2. Goals

  • Senders can declare delivery class per signal (sync / async).
  • Sync signals fail-fast with a structured recipient_state when the target is unavailable.
  • Async signals carry a per-type TTL default, server-enforced via a sweeper.
  • Senders can recall undelivered signals with email semantics.
  • Send-response carries enough recipient-state information for senders to make routing decisions without polling.
  • Existing observability gains the fidelity to distinguish stuck-because-undeliverable from stuck-because-mesh-issue.

3. Non-goals

  • No RPC semantics. Sync = fail-fast on send-time availability check.
  • No idempotency keys, no auto-retry. System never replays. Senders are agents — they have judgment.
  • No post-drain recall notice. Email semantics: once delivered, recall fails; sender writes a follow-up.
  • No bulk recall in v1. Correlation_id column lands now (free); the bulk-recall verb is v1.x.
  • No edit/amend in v1. Sender’s path is recall + send-fresh.
  • No plan-scoped recall. Workflow-layer concern, out of scope for the bus.

4. Definitions

4.1 Recipient state — three accepted-into-bus states

recipient_state is the send-time resolution result for sends that the bus accepted (i.e. the identity is known). It is not a QoS axis; it is the input that conditions how delivery_class behaves. Three values:
StateMeaningDetection
not_available_offlineIdentity registered but no active sessionagent_sessions row missing or all released_at IS NOT NULL
not_available_staleActive session but heartbeat older than 60s threshold (PR #56)last_heartbeat < NOW() - INTERVAL '60s'
availableActive session with fresh heartbeatelse
Unknown-recipient is NOT a recipient_state. Sending to an identity that has never registered an Agent row in this project is hard-rejected before recipient_state is computed, returning a 4xx SignalError with error_code='unknown_recipient' (already implemented PR #72 §2). The send is never persisted to signal_queue. This is the same contract for both sync and async classes.

4.2 Delivery class

ClassSend-time behavior on each recipient_state
syncavailable → publish + return delivered: true. not_available_offline / not_available_stale → fail-fast, return structured failure with recipient_state.
asyncavailable → publish + return delivered: true. not_available_offline / not_available_stale → queue with TTL, return queued: true and expires_at.
In both classes, the unknown-recipient rejection happens upstream and never reaches this branch.

4.3 TTL and terminal states

Every signal carries expires_at = created_at + ttl. Default TTL is per-signal-type (table §6). A signal_queue row has three independent terminal-state columns, each meaning a different outcome:
ColumnMeaningStamped by
delivered_atRecipient actually consumed the signal (WS frame pushed OR piggyback-drained OR startup-drained)mark_delivered_via_ws / drain_for_caller / drain_pending
expired_atTTL fired before deliverysweeper
recalled_atSender recalled before deliveryprism_signal_recall
These are mutually exclusive — at most one is non-NULL on any row. Drain predicates filter delivered_at IS NULL AND expired_at IS NULL AND recalled_at IS NULL. Throughput dashboards counting deliveries read delivered_at IS NOT NULL and stay clean of expirations or recalls. Each terminal column has its own audit trail.

4.4 Recall (email semantics)

prism_signal_recall(signal_id) →
  recalled         — atomic UPDATE WHERE delivered_at IS NULL AND expired_at IS NULL AND recalled_at IS NULL AND from_identity = caller (recipient never sees it)
  already_delivered — delivered_at IS NOT NULL — recipient consumed it; sender owns the cleanup
  already_expired   — expired_at IS NOT NULL — TTL fired before recall arrived; sender knows the window closed
  not_found         — bad signal_id, or someone else's signal (security boundary)
Recall is sender-only. Backend enforces from_identity match. Operator-driven force-expire (TODO-106 stale-recipient cleanup) is a separate verb (prism_signal_force_expire) — not in this SPEC.

4.5 Publish path (v0.3 amendment)

publish_path is the bus’s observable record of what it actually did at send time. Persisted alongside delivery_class and the terminal columns; populated at insert from the publish-result branch the send took. Three values:
ValueMeaning
pushed_to_wsFrame published to recipient’s agent channel; their WS subscriber should receive it. (Recipient was available.)
buffered_for_piggybackRecipient’s surface is piggyback-only (Desktop, etc.); shim WS got the frame and buffered it for the recipient’s next verb call.
queued_offlineRecipient was not_available_*; row sits in signal_queue until next bootstrap drains it (or TTL fires).
These are observable bus states, not predictions. Persistence is for two reasons:
  1. Per-row dashboard queries. Porsche’s TTL list panel (§11) wants publish_path per pending row so operators see at a glance which delivery path each waiting signal is on. Computing it at query time from current state would be lossy — recipient state may have flipped since send.
  2. Audit trail. Together with delivery_class + the three terminal columns, publish_path lets observability reconstruct the full lifecycle of any signal post-hoc: how was it sent, what publish path did the bus take, did it land via WS forwarder or piggyback drain or expire/recall.

5. Send response shape

prism_signal extended return for accepted sends:
{
  "signal_id": "...",
  "delivered": true | false,
  "queued": true | false,
  "recipient_state": "available | not_available_offline | not_available_stale",
  "delivery_class": "sync | async",
  "expires_at": "2026-05-04T12:00:00Z" | null,
  "resolved_to_session": "..." | null,
  "publish_path": "pushed_to_ws | buffered_for_piggyback | queued_offline"
}
The publish_path field is sourced from the same column persisted in signal_queue (§8). Senders that care can route on it; observability reads it from the row directly. For unknown-recipient rejection, no response shape is returned — the verb errors with HTTP 4xx and error_code='unknown_recipient'.

6. Per-type defaults

Initial recommendations (override per-send allowed):
signal_typedelivery_classTTLpriority (existing category)Notes
Blockersync4hBLOCKERTTL only enforced when sender downgrades to async (sync never queues)
Questionsync1hASKAnswer must be relevant to asker’s still-current problem
ReviewRequestedasync24hASKPR review can wait overnight, not a week
TaskAssignedasync7dTASKTasks can sit; people take days off
TaskCompletedasync24hINFOConfirmation, not action — short useful window
StatusUpdateasync24hINFOSnapshot — recipient catches up next bootstrap or it’s stale
Acknowledgmentasync1hINFOSender already moved on within an hour
MasterPreempted (system)async2mINFOElection state moves fast; old preemption notices are noise
PeerJoined / PeerLeft (system)async5mINFORoster snapshot — useless once roster has churned
Defaults map per-type at insert. Explicit per-send override via delivery_class/ttl_seconds params in prism_signal.

7. Drain ordering

drain_for_caller and drain_pending order results by (priority DESC, created_at ASC). Priority is mapped from the existing category enum: BLOCKER (3) > ASK (2) > TASK (1) > INFO (0). Within a priority tier, oldest-first FIFO. Derived from category at query time.

8. Schema changes

ALTER TABLE signal_queue ADD COLUMN delivery_class VARCHAR(8) NOT NULL DEFAULT 'async';
ALTER TABLE signal_queue ADD COLUMN expires_at TIMESTAMPTZ;
ALTER TABLE signal_queue ADD COLUMN expired_at TIMESTAMPTZ;        -- terminal: sweeper-marked
ALTER TABLE signal_queue ADD COLUMN recalled_at TIMESTAMPTZ;       -- terminal: sender-recalled
ALTER TABLE signal_queue ADD COLUMN correlation_id VARCHAR(36);    -- for future bulk recall
ALTER TABLE signal_queue ADD COLUMN publish_path VARCHAR(24);      -- v0.3: bus's record of send-time publish branch

CREATE INDEX ix_signal_queue_expires_at ON signal_queue (expires_at)
  WHERE expires_at IS NOT NULL
    AND delivered_at IS NULL
    AND expired_at IS NULL
    AND recalled_at IS NULL;

CREATE INDEX ix_signal_queue_correlation_id ON signal_queue (correlation_id)
  WHERE correlation_id IS NOT NULL;
expires_at is populated at insert from _DEFAULT_TTL_BY_TYPE[signal_type] (or per-send override). publish_path is populated at insert from the publish-result branch the send took (§4.5). Three terminal columns (delivered_at, expired_at, recalled_at) are mutually exclusive by sweeper / drain / recall logic.

9. Sweeper

Periodic background task (interval: 60s):
UPDATE signal_queue
SET expired_at = NOW()
WHERE expires_at < NOW()
  AND delivered_at IS NULL
  AND expired_at IS NULL
  AND recalled_at IS NULL
RETURNING id, signal_type, to_identity, age(NOW(), created_at);
No delivery_method mutation, no delivered_at write — expired rows have a clean separate terminal stamp. Sweeper logs aggregate counters: signal_expired_total{signal_type} for Porsche’s dashboard.

10. Backwards compatibility

  • delivery_class DEFAULT 'async' makes pre-existing senders behave identically.
  • expires_at IS NULL rows (pre-migration backfill) never expire — sweeper ignores them. No retroactive backfill of historical rows.
  • publish_path IS NULL for pre-migration rows — dashboard panels that filter on publish_path treat NULL as “unknown / pre-v0.3 row” and surface in a separate bucket.
  • New recipient_state, publish_path, expires_at fields in send response are additive; existing consumers reading delivered/queued keep working.
  • prism_signal_recall is a new verb; not invoking it changes nothing.
  • The unknown-recipient 4xx rejection (PR #72) is unchanged; this SPEC formalizes the boundary, doesn’t move it.

11. Observability (Porsche lane)

The signal pipeline dashboard tile gains fidelity to match the new state space. Specifically requested as a large high-fidelity card. Concrete asks:
  • Five state breakdowns in real-time:
    • available_now (delivered_at IS NOT NULL, channels_push or broadcast)
    • pending_pickup (queued, not_available_*, no terminal columns set)
    • expired (expired_at IS NOT NULL)
    • recalled (recalled_at IS NOT NULL)
    • undeliverable (count of unknown-recipient SignalErrors per type per sender; observable via metrics, not via signal_queue rows since these are never persisted)
  • Threshold differentiation: the saturation HOT indicator must distinguish pending_pickup (transient) from undeliverable (structural). Required split per Porsche FLAG-1 from the 2026-05-03 status round.
  • TTL clock per pending row — visible expires_at countdown so operator can see what’s about to age out. Each pending row carries a publish_path label (pushed_to_ws | buffered_for_piggyback | queued_offline) so operators see which delivery path was attempted.
  • Recall events as a discrete metric — count per minute, broken out by recalled vs already_delivered vs already_expired outcomes. High already_delivered rates indicate the bus is too slow for sender’s mental model.
  • Per-signal_type histograms of time-to-drain (channels_push vs piggyback latency profiles).
  • SPEC-072 daemon panel cluster (folded in via Porsche acked task ea3d9f9c):
    • Daemon row counts (live / shutting_down / orphaned-stale-by-liveness-probe) keyed off agent_sessions.kind='daemon'
    • Per-persona daemon-attached vs daemon-less shim sessions (operator sees at a glance whether each surface has its always-on listener up)
    • Optional prism_daemon_spawn_failed{reason} counter (today the shim only logs to stderr; backend-side counter is a follow-up)
The card is “large” because it’s the central health view for the signal mesh. Rendering: timeseries (uPlot canvas per project_dashboard_chart_flicker_fix) + numeric tiles + small TTL countdown list + daemon panel cluster.

12. Implementation plan

Phase A — Schema + service layer (Donna)

  • A1. Migration: 6 columns (delivery_class, expires_at, expired_at, recalled_at, correlation_id, publish_path) + partial indexes
  • A2. _DEFAULT_TTL_BY_TYPE and _DEFAULT_DELIVERY_CLASS_BY_TYPE lookup tables
  • A3. send_signal populates expires_at, delivery_class, AND publish_path on insert
  • A4. Sync class fail-fast path
  • A5. Extended return shape with recipient_state, expires_at, publish_path
  • A6. Sweeper job stamping expired_at on expired rows

Phase B — Recall verb (Donna)

  • B1. prism_signal_recall(signal_id) verb + service implementation
  • B2. Schema enforcement: sender-only via from_identity match
  • B3. Atomic UPDATE with all-three-terminal-NULL guard, returning the four discriminated outcomes

Phase C — Drain ordering + observability hooks (Donna + Porsche)

  • C1. drain_for_caller ORDER BY priority DESC, created_at ASC, predicate filters all three terminal columns
  • C2. Metrics: signal_expired_total, signal_recalled_total{outcome}, recipient_state distribution, unknown_recipient_rejection_total
  • C3. Porsche large high-fidelity card per §11 (composes SPEC-071 state space + SPEC-072 daemon panel cluster)

Phase D — Per-type override surface (Donna, optional v1.x)

  • D1. prism_signal accepts delivery_class and ttl_seconds overrides
Each phase ships independently. Phase A is the foundation; Phase B is small and self-contained; Phase C composes; Phase D is optional.

13. Resolved during review

Texi review 2026-05-03 02:39Z answered all v0.1 open questions:
  • Daemon binary packaging: ships at mcp-node/dist/daemon/server.js (sibling of mcp-node/dist/server.js) eventually; v1 keeps the daemon at repo-root daemon/ per the SPEC-072 implementation actually shipped (PR #86).
  • Spawn-failure policy: non-fatal, log to stderr, emit metric. No system signal escalation.
  • gRPC heartbeat upgrade: aspirational, not a v1 gate. v1 ships with HTTP heartbeat.

14. References

  • SPEC-034 — agent-to-agent signal delivery (foundation)
  • SPEC-037 — backend-side piggyback drain (interaction with delivered_at semantics)
  • SPEC-052 — signal categories (BLOCKER/ASK/TASK/INFO — repurposed as priority here)
  • SPEC-067 — multi-store writes (sweeper participates in this contract)
  • SPEC-070 — per-persona daemon (orthogonal; daemon’s WS subscriber sees the same delivered_at writes)
  • SPEC-072 — per-session daemon lifecycle (composed in §11 dashboard card)
  • PR #72 — delivered=true semantics fix + unknown-recipient hard rejection (foundational for §4.1)
  • PR #56 — routing-registry liveness probe (provides the 60s heartbeat threshold for not_available_stale)
  • PR #86 — SPEC-072 implementation (the daemon panel cluster references this)
  • TODO-106 — operator-driven stale-row cleanup (separate verb, not in this SPEC)
  • feedback_eliminate_failures_improve_perf — eliminate failure modes, don’t work around them
  • feedback_completion_means_deployed — every phase ships PR → merge → deploy → smoke
  • Postmortem f15cd0cd — 95-min deploy gap that surfaced the return-shape ambiguity

15. Revision log

  • v0.1 (2026-05-03 02:19Z) — initial draft circulated for review
  • v0.2 (2026-05-03 02:25Z) — Texi review revisions: terminal-state column split, unknown-recipient as 4xx upstream, publish_path replaces recipient_will_see_at, sweeper cadence 60s, Blocker TTL note, no historical backfill
  • v0.2.1 (2026-05-03 02:26Z) — fixed §12 A1 column count nit; status flipped to accepted per Texi v0.2 approval
  • v0.3 (2026-05-03 03:03Z) — Porsche §5/§8 publish_path-persistence gap fix:
    • Added §4.5 explicit definition of publish_path as a persisted column (not just response-shape ephemeral)
    • §8 schema migration adds publish_path VARCHAR(24) (6 columns total now, not 5)
    • §11 §TTL clock bullet notes the publish_path label per pending row
    • §11 absorbs the SPEC-072 D2 daemon panel cluster Porsche acked
    • §12 Phase A1 column count updated (6); A3 notes publish_path populated at insert
    • §10 backwards-compat note: pre-v0.3 rows have publish_path IS NULL, dashboard buckets them as “unknown”
Last modified on May 18, 2026