Status: accepted · Version v0.3.1 · Filed 2026-05-03
SPEC-071 — Signal Bus QoS — Delivery Classes, TTL, Recall, Recipient State
Status: draft (v0.3 — publish_path persistence amendment per Porsche §5/§8 gap)
Author: Donna (Engineering)
Reviewer: Texi (Architect)
PO: Lola
Date: 2026-05-03
Supersedes: none — extends SPEC-034 (signal service), SPEC-037 (piggyback drain), SPEC-052 (categories)
1. Motivation
The Prism signal bus was designed for at-most-once piggyback delivery of agent-to-agent mail. As multi-agent operations have matured, three real deficiencies have surfaced:
-
Senders cannot distinguish urgent from lazy. Today every send is async-best-effort. A
Blocker sits in queue alongside a StatusUpdate. Senders of urgent signals have no path to fail-fast when the recipient is unavailable.
-
Signals never expire. Stale rows accumulate (TODO-106: 7 Candi rows, 1.6+ days old, dominating the dashboard saturation rollup with zero actionable signal). A 36h-old
MasterPreempted notice is noise; a 36h-old TaskAssigned may still be valid.
-
Senders cannot retract decisions. In agent-on-agent flows, the world changes between when a signal is sent and when the recipient acts. Today the only recovery is “send a contradicting follow-up and hope.” Recall frequency increases sharply as agents learn to spawn other agents and plans mutate mid-execution.
This SPEC formalizes a small QoS framework that closes these gaps without inventing RPC, idempotency, or retry machinery the system doesn’t need.
2. Goals
- Senders can declare delivery class per signal (sync / async).
- Sync signals fail-fast with a structured
recipient_state when the target is unavailable.
- Async signals carry a per-type TTL default, server-enforced via a sweeper.
- Senders can recall undelivered signals with email semantics.
- Send-response carries enough recipient-state information for senders to make routing decisions without polling.
- Existing observability gains the fidelity to distinguish stuck-because-undeliverable from stuck-because-mesh-issue.
3. Non-goals
- No RPC semantics. Sync = fail-fast on send-time availability check.
- No idempotency keys, no auto-retry. System never replays. Senders are agents — they have judgment.
- No post-drain recall notice. Email semantics: once delivered, recall fails; sender writes a follow-up.
- No bulk recall in v1. Correlation_id column lands now (free); the bulk-recall verb is v1.x.
- No edit/amend in v1. Sender’s path is recall + send-fresh.
- No plan-scoped recall. Workflow-layer concern, out of scope for the bus.
4. Definitions
4.1 Recipient state — three accepted-into-bus states
recipient_state is the send-time resolution result for sends that the bus accepted (i.e. the identity is known). It is not a QoS axis; it is the input that conditions how delivery_class behaves. Three values:
| State | Meaning | Detection |
|---|
not_available_offline | Identity registered but no active session | agent_sessions row missing or all released_at IS NOT NULL |
not_available_stale | Active session but heartbeat older than 60s threshold (PR #56) | last_heartbeat < NOW() - INTERVAL '60s' |
available | Active session with fresh heartbeat | else |
Unknown-recipient is NOT a recipient_state. Sending to an identity that has never registered an Agent row in this project is hard-rejected before recipient_state is computed, returning a 4xx SignalError with error_code='unknown_recipient' (already implemented PR #72 §2). The send is never persisted to signal_queue. This is the same contract for both sync and async classes.
4.2 Delivery class
| Class | Send-time behavior on each recipient_state |
|---|
sync | available → publish + return delivered: true. not_available_offline / not_available_stale → fail-fast, return structured failure with recipient_state. |
async | available → publish + return delivered: true. not_available_offline / not_available_stale → queue with TTL, return queued: true and expires_at. |
In both classes, the unknown-recipient rejection happens upstream and never reaches this branch.
4.3 TTL and terminal states
Every signal carries expires_at = created_at + ttl. Default TTL is per-signal-type (table §6).
A signal_queue row has three independent terminal-state columns, each meaning a different outcome:
| Column | Meaning | Stamped by |
|---|
delivered_at | Recipient actually consumed the signal (WS frame pushed OR piggyback-drained OR startup-drained) | mark_delivered_via_ws / drain_for_caller / drain_pending |
expired_at | TTL fired before delivery | sweeper |
recalled_at | Sender recalled before delivery | prism_signal_recall |
These are mutually exclusive — at most one is non-NULL on any row. Drain predicates filter delivered_at IS NULL AND expired_at IS NULL AND recalled_at IS NULL. Throughput dashboards counting deliveries read delivered_at IS NOT NULL and stay clean of expirations or recalls. Each terminal column has its own audit trail.
4.4 Recall (email semantics)
prism_signal_recall(signal_id) →
recalled — atomic UPDATE WHERE delivered_at IS NULL AND expired_at IS NULL AND recalled_at IS NULL AND from_identity = caller (recipient never sees it)
already_delivered — delivered_at IS NOT NULL — recipient consumed it; sender owns the cleanup
already_expired — expired_at IS NOT NULL — TTL fired before recall arrived; sender knows the window closed
not_found — bad signal_id, or someone else's signal (security boundary)
Recall is sender-only. Backend enforces from_identity match. Operator-driven force-expire (TODO-106 stale-recipient cleanup) is a separate verb (prism_signal_force_expire) — not in this SPEC.
4.5 Publish path (v0.3 amendment)
publish_path is the bus’s observable record of what it actually did at send time. Persisted alongside delivery_class and the terminal columns; populated at insert from the publish-result branch the send took.
Three values:
| Value | Meaning |
|---|
pushed_to_ws | Frame published to recipient’s agent channel; their WS subscriber should receive it. (Recipient was available.) |
buffered_for_piggyback | Recipient’s surface is piggyback-only (Desktop, etc.); shim WS got the frame and buffered it for the recipient’s next verb call. |
queued_offline | Recipient was not_available_*; row sits in signal_queue until next bootstrap drains it (or TTL fires). |
These are observable bus states, not predictions. Persistence is for two reasons:
- Per-row dashboard queries. Porsche’s TTL list panel (§11) wants
publish_path per pending row so operators see at a glance which delivery path each waiting signal is on. Computing it at query time from current state would be lossy — recipient state may have flipped since send.
- Audit trail. Together with
delivery_class + the three terminal columns, publish_path lets observability reconstruct the full lifecycle of any signal post-hoc: how was it sent, what publish path did the bus take, did it land via WS forwarder or piggyback drain or expire/recall.
5. Send response shape
prism_signal extended return for accepted sends:
{
"signal_id": "...",
"delivered": true | false,
"queued": true | false,
"recipient_state": "available | not_available_offline | not_available_stale",
"delivery_class": "sync | async",
"expires_at": "2026-05-04T12:00:00Z" | null,
"resolved_to_session": "..." | null,
"publish_path": "pushed_to_ws | buffered_for_piggyback | queued_offline"
}
The publish_path field is sourced from the same column persisted in signal_queue (§8). Senders that care can route on it; observability reads it from the row directly.
For unknown-recipient rejection, no response shape is returned — the verb errors with HTTP 4xx and error_code='unknown_recipient'.
6. Per-type defaults
Initial recommendations (override per-send allowed):
| signal_type | delivery_class | TTL | priority (existing category) | Notes |
|---|
Blocker | sync | 4h | BLOCKER | TTL only enforced when sender downgrades to async (sync never queues) |
Question | sync | 1h | ASK | Answer must be relevant to asker’s still-current problem |
ReviewRequested | async | 24h | ASK | PR review can wait overnight, not a week |
TaskAssigned | async | 7d | TASK | Tasks can sit; people take days off |
TaskCompleted | async | 24h | INFO | Confirmation, not action — short useful window |
StatusUpdate | async | 24h | INFO | Snapshot — recipient catches up next bootstrap or it’s stale |
Acknowledgment | async | 1h | INFO | Sender already moved on within an hour |
MasterPreempted (system) | async | 2m | INFO | Election state moves fast; old preemption notices are noise |
PeerJoined / PeerLeft (system) | async | 5m | INFO | Roster snapshot — useless once roster has churned |
Defaults map per-type at insert. Explicit per-send override via delivery_class/ttl_seconds params in prism_signal.
7. Drain ordering
drain_for_caller and drain_pending order results by (priority DESC, created_at ASC). Priority is mapped from the existing category enum: BLOCKER (3) > ASK (2) > TASK (1) > INFO (0). Within a priority tier, oldest-first FIFO. Derived from category at query time.
8. Schema changes
ALTER TABLE signal_queue ADD COLUMN delivery_class VARCHAR(8) NOT NULL DEFAULT 'async';
ALTER TABLE signal_queue ADD COLUMN expires_at TIMESTAMPTZ;
ALTER TABLE signal_queue ADD COLUMN expired_at TIMESTAMPTZ; -- terminal: sweeper-marked
ALTER TABLE signal_queue ADD COLUMN recalled_at TIMESTAMPTZ; -- terminal: sender-recalled
ALTER TABLE signal_queue ADD COLUMN correlation_id VARCHAR(36); -- for future bulk recall
ALTER TABLE signal_queue ADD COLUMN publish_path VARCHAR(24); -- v0.3: bus's record of send-time publish branch
CREATE INDEX ix_signal_queue_expires_at ON signal_queue (expires_at)
WHERE expires_at IS NOT NULL
AND delivered_at IS NULL
AND expired_at IS NULL
AND recalled_at IS NULL;
CREATE INDEX ix_signal_queue_correlation_id ON signal_queue (correlation_id)
WHERE correlation_id IS NOT NULL;
expires_at is populated at insert from _DEFAULT_TTL_BY_TYPE[signal_type] (or per-send override). publish_path is populated at insert from the publish-result branch the send took (§4.5). Three terminal columns (delivered_at, expired_at, recalled_at) are mutually exclusive by sweeper / drain / recall logic.
9. Sweeper
Periodic background task (interval: 60s):
UPDATE signal_queue
SET expired_at = NOW()
WHERE expires_at < NOW()
AND delivered_at IS NULL
AND expired_at IS NULL
AND recalled_at IS NULL
RETURNING id, signal_type, to_identity, age(NOW(), created_at);
No delivery_method mutation, no delivered_at write — expired rows have a clean separate terminal stamp. Sweeper logs aggregate counters: signal_expired_total{signal_type} for Porsche’s dashboard.
10. Backwards compatibility
delivery_class DEFAULT 'async' makes pre-existing senders behave identically.
expires_at IS NULL rows (pre-migration backfill) never expire — sweeper ignores them. No retroactive backfill of historical rows.
publish_path IS NULL for pre-migration rows — dashboard panels that filter on publish_path treat NULL as “unknown / pre-v0.3 row” and surface in a separate bucket.
- New
recipient_state, publish_path, expires_at fields in send response are additive; existing consumers reading delivered/queued keep working.
prism_signal_recall is a new verb; not invoking it changes nothing.
- The unknown-recipient 4xx rejection (PR #72) is unchanged; this SPEC formalizes the boundary, doesn’t move it.
11. Observability (Porsche lane)
The signal pipeline dashboard tile gains fidelity to match the new state space. Specifically requested as a large high-fidelity card. Concrete asks:
- Five state breakdowns in real-time:
available_now (delivered_at IS NOT NULL, channels_push or broadcast)
pending_pickup (queued, not_available_*, no terminal columns set)
expired (expired_at IS NOT NULL)
recalled (recalled_at IS NOT NULL)
undeliverable (count of unknown-recipient SignalErrors per type per sender; observable via metrics, not via signal_queue rows since these are never persisted)
- Threshold differentiation: the saturation HOT indicator must distinguish
pending_pickup (transient) from undeliverable (structural). Required split per Porsche FLAG-1 from the 2026-05-03 status round.
- TTL clock per pending row — visible expires_at countdown so operator can see what’s about to age out. Each pending row carries a
publish_path label (pushed_to_ws | buffered_for_piggyback | queued_offline) so operators see which delivery path was attempted.
- Recall events as a discrete metric — count per minute, broken out by
recalled vs already_delivered vs already_expired outcomes. High already_delivered rates indicate the bus is too slow for sender’s mental model.
- Per-signal_type histograms of time-to-drain (channels_push vs piggyback latency profiles).
- SPEC-072 daemon panel cluster (folded in via Porsche acked task
ea3d9f9c):
- Daemon row counts (live / shutting_down / orphaned-stale-by-liveness-probe) keyed off
agent_sessions.kind='daemon'
- Per-persona daemon-attached vs daemon-less shim sessions (operator sees at a glance whether each surface has its always-on listener up)
- Optional
prism_daemon_spawn_failed{reason} counter (today the shim only logs to stderr; backend-side counter is a follow-up)
The card is “large” because it’s the central health view for the signal mesh. Rendering: timeseries (uPlot canvas per project_dashboard_chart_flicker_fix) + numeric tiles + small TTL countdown list + daemon panel cluster.
12. Implementation plan
Phase A — Schema + service layer (Donna)
- A1. Migration: 6 columns (
delivery_class, expires_at, expired_at, recalled_at, correlation_id, publish_path) + partial indexes
- A2.
_DEFAULT_TTL_BY_TYPE and _DEFAULT_DELIVERY_CLASS_BY_TYPE lookup tables
- A3.
send_signal populates expires_at, delivery_class, AND publish_path on insert
- A4. Sync class fail-fast path
- A5. Extended return shape with
recipient_state, expires_at, publish_path
- A6. Sweeper job stamping
expired_at on expired rows
Phase B — Recall verb (Donna)
- B1.
prism_signal_recall(signal_id) verb + service implementation
- B2. Schema enforcement: sender-only via
from_identity match
- B3. Atomic UPDATE with all-three-terminal-NULL guard, returning the four discriminated outcomes
Phase C — Drain ordering + observability hooks (Donna + Porsche)
- C1.
drain_for_caller ORDER BY priority DESC, created_at ASC, predicate filters all three terminal columns
- C2. Metrics:
signal_expired_total, signal_recalled_total{outcome}, recipient_state distribution, unknown_recipient_rejection_total
- C3. Porsche large high-fidelity card per §11 (composes SPEC-071 state space + SPEC-072 daemon panel cluster)
Phase D — Per-type override surface (Donna, optional v1.x)
- D1.
prism_signal accepts delivery_class and ttl_seconds overrides
Each phase ships independently. Phase A is the foundation; Phase B is small and self-contained; Phase C composes; Phase D is optional.
13. Resolved during review
Texi review 2026-05-03 02:39Z answered all v0.1 open questions:
- Daemon binary packaging: ships at
mcp-node/dist/daemon/server.js (sibling of mcp-node/dist/server.js) eventually; v1 keeps the daemon at repo-root daemon/ per the SPEC-072 implementation actually shipped (PR #86).
- Spawn-failure policy: non-fatal, log to stderr, emit metric. No system signal escalation.
- gRPC heartbeat upgrade: aspirational, not a v1 gate. v1 ships with HTTP heartbeat.
14. References
- SPEC-034 — agent-to-agent signal delivery (foundation)
- SPEC-037 — backend-side piggyback drain (interaction with delivered_at semantics)
- SPEC-052 — signal categories (BLOCKER/ASK/TASK/INFO — repurposed as priority here)
- SPEC-067 — multi-store writes (sweeper participates in this contract)
- SPEC-070 — per-persona daemon (orthogonal; daemon’s WS subscriber sees the same delivered_at writes)
- SPEC-072 — per-session daemon lifecycle (composed in §11 dashboard card)
- PR #72 —
delivered=true semantics fix + unknown-recipient hard rejection (foundational for §4.1)
- PR #56 — routing-registry liveness probe (provides the 60s heartbeat threshold for
not_available_stale)
- PR #86 — SPEC-072 implementation (the daemon panel cluster references this)
- TODO-106 — operator-driven stale-row cleanup (separate verb, not in this SPEC)
feedback_eliminate_failures_improve_perf — eliminate failure modes, don’t work around them
feedback_completion_means_deployed — every phase ships PR → merge → deploy → smoke
- Postmortem
f15cd0cd — 95-min deploy gap that surfaced the return-shape ambiguity
15. Revision log
- v0.1 (2026-05-03 02:19Z) — initial draft circulated for review
- v0.2 (2026-05-03 02:25Z) — Texi review revisions: terminal-state column split, unknown-recipient as 4xx upstream, publish_path replaces recipient_will_see_at, sweeper cadence 60s, Blocker TTL note, no historical backfill
- v0.2.1 (2026-05-03 02:26Z) — fixed §12 A1 column count nit; status flipped to accepted per Texi v0.2 approval
- v0.3 (2026-05-03 03:03Z) — Porsche §5/§8 publish_path-persistence gap fix:
- Added §4.5 explicit definition of
publish_path as a persisted column (not just response-shape ephemeral)
- §8 schema migration adds
publish_path VARCHAR(24) (6 columns total now, not 5)
- §11 §TTL clock bullet notes the
publish_path label per pending row
- §11 absorbs the SPEC-072 D2 daemon panel cluster Porsche acked
- §12 Phase A1 column count updated (6); A3 notes publish_path populated at insert
- §10 backwards-compat note: pre-v0.3 rows have
publish_path IS NULL, dashboard buckets them as “unknown”