Signal Mesh
The Signal Mesh is how agents on a Prism project talk to each other without humans relaying chat between editor windows. Lola asks Donna for a review; Donna sends back the gaps; Texi gets a task assignment from Desiree and sees it appear in her Codex thread — none of it routes through Frank’s keyboard.
The mesh has two flows, deliberately separated: a hot path that takes a signal from sender to receiver in process-local memory, and an audit fan-out that durably records every accepted signal asynchronously behind that hot path. The split — locked in SPEC-101 v0.4 — is what lets the mesh hit p99 < 5ms end-to-end while still guaranteeing every signal lands in the long-term audit store. Neither Redis nor Postgres sits on the delivery path; both sit downstream of it.
The in-memory switch is called Marconi. The sentence to remember is the operator’s: “sender shim → Marconi → receiver shim. That is it. Marconi then pushes signal to Redis, then Redis pushes to PG. Redis is by no means in the signal flow.”
The two flows
Solid lines are the hot path. Dashed lines are the audit fan-out — every flow off the hot path is asynchronous and never blocks 200 OK.
A send is one HTTP POST. FastAPI hands the envelope to Marconi. Marconi looks up the recipient in its in-memory routing table, pushes the envelope directly to the recipient’s WebSocket handle, appends the envelope to a per-tenant in-memory audit queue, and returns 200 to the sender. Behind that, a background Redis Stream writer drains the audit queue into marconi:signals:{tenant}. Behind that, a PG archiver consumer-group reads one entry at a time with XREADGROUP COUNT 1 BLOCK 0 and upserts into the canonical signal_queue / signal_trace_events / signal_obligations tables. Once a signal hits the cache, it lands in PG near-immediately — the operator contract is “once it hits cache it immediately goes to PG.”
The hot path uses zero synchronous calls to Redis, Postgres, Neo4j, or the OTEL collector. Marconi’s in-memory state IS the live source of truth.
The hot path, in detail
Everything that affects the sender’s latency happens between steps 1 and 7. No store call, no pubsub round-trip, no Redis ingest in that window. The audit-queue append in step 5 is an in-process list append — sub-microsecond.
If the recipient is offline (no entry in Marconi’s routing table) or has a routing entry but no live WS handle attached, the signal still appends to the audit queue with outcome=queued_offline, and the receiver’s next prism_start drains it via pending_signals[]. The hot path stays the same; only outcome changes.
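The hot path described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Marconi’s actual code: MarconiSketch, Envelope, and the routes mapping are hypothetical names, and a WebSocket handle is modeled as a plain callable. Only the outcome values pushed and queued_offline come from this page.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Dict, Optional


@dataclass
class Envelope:
    sender: str
    to: str
    signal_type: str
    outcome: str = ""  # filled in by the switch


@dataclass
class MarconiSketch:
    """Hypothetical stand-in for the in-memory switch (not the real class)."""

    # identity -> push callable for a live WS handle; None models a routing
    # entry that exists but has no live handle attached.
    routes: Dict[str, Optional[Callable[[Envelope], None]]] = field(default_factory=dict)
    audit_queue: Deque[Envelope] = field(default_factory=deque)

    def send(self, env: Envelope) -> dict:
        push = self.routes.get(env.to)
        if push is None:
            env.outcome = "queued_offline"  # no entry, or no live handle
        else:
            push(env)  # direct push to the recipient's handle
            env.outcome = "pushed"
        self.audit_queue.append(env)  # in-process append, no store call
        return {"status": 200, "outcome": env.outcome}


inbox: list = []
marconi = MarconiSketch()
marconi.routes["Donna"] = inbox.append
online = marconi.send(Envelope("Lola", "Donna", "ReviewRequested"))
offline = marconi.send(Envelope("Lola", "Texi", "TaskAssigned"))
```

Both sends take the same code path and both land in the audit queue; the only difference is the outcome recorded on the envelope.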
The audit fan-out
Each stage of the fan-out is a FastAPI in-process background task with its own failure mode. Marconi continues serving the hot path regardless of any stage’s health. When the Redis writer is unreachable, the audit queue grows; when the queue fills, the oldest entries are overwritten and the counter marconi_audit_queue_overwrite_total increments, which is the operator-visible loss event under v0.4’s loss budget. When the PG archiver is unreachable, the Redis Stream grows up to its MAXLEN ~7d trim window before the archiver gap becomes audit loss.
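The overwrite behavior can be illustrated with a bounded deque. This is a sketch, assuming a fixed-capacity queue that drops its oldest entry when full; BoundedAuditQueue and its methods are hypothetical names, and the overwrite_total attribute only stands in for the marconi_audit_queue_overwrite_total counter.

```python
from collections import deque


class BoundedAuditQueue:
    """Hypothetical fixed-capacity audit queue that counts overwrites."""

    def __init__(self, capacity: int) -> None:
        self._items = deque(maxlen=capacity)
        self.overwrite_total = 0  # stand-in for marconi_audit_queue_overwrite_total

    def append(self, envelope: dict) -> None:
        if len(self._items) == self._items.maxlen:
            self.overwrite_total += 1  # the oldest entry is about to be dropped
        self._items.append(envelope)

    def drain(self, limit: int) -> list:
        """Pop up to `limit` entries, oldest first, as a writer task might."""
        batch = []
        while self._items and len(batch) < limit:
            batch.append(self._items.popleft())
        return batch


queue = BoundedAuditQueue(capacity=2)
for signal_id in (1, 2, 3):
    queue.append({"id": signal_id})  # the third append overwrites id 1
remaining = queue.drain(limit=10)
```

The hot path never blocks on a full queue in this model; the cost of a long outage shows up only in the counter.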
Stages are gated by fine-grain feature flags so each tier can be flipped, monitored, and rolled back independently:
| Flag | What it gates |
|---|---|
| MARCONI_ROUTING_TABLE_READS / MARCONI_REGISTRATION_TABLE_READS | Process-local routing/registration cache reads vs SessionStore fallback |
| MARCONI_AUDIT_QUEUE_WRITE | Marconi appends accepted signals to the in-memory audit queue |
| MARCONI_REDIS_STREAM_WRITER | Background task drains the audit queue into the Marconi Cache stream |
| MARCONI_PG_ARCHIVER_SHADOW | PG archiver runs in shadow mode, writing to *_shadow tables for byte-comparison |
| MARCONI_PG_ARCHIVER_PRIMARY | PG archiver is the canonical writer to signal_queue / signal_trace_events / signal_obligations |
| MARCONI_HOT_PATH_SEND | prism_signal uses Marconi’s direct WS push (replaces Redis pubsub forwarder) |
| MARCONI_OBLIGATIONS_MEMORY_PRIMARY | Obligation index runs from Marconi’s in-memory map |
The Stage 5 cutover landed in PR #283; the PRIMARY archiver flipped on in PR #299. As of 2026-05-11, MARCONI_HOT_PATH_SEND=true and MARCONI_PG_ARCHIVER_PRIMARY=true are live on server1, and the legacy synchronous PG INSERT in send_signal is deleted.
Identity-targeted addressing
Senders name peers by identity, never by session id. to="Donna" is the contract; Marconi’s routing table resolves to whichever session Donna currently holds. When Donna’s session changes — she restarts her editor, switches machines, gets preempted — nothing in the sender’s code changes. The mesh re-routes by itself.
Broadcast works the same way. to="*" publishes to the project channel and every active session on that project sees it. PeerJoined and PeerLeft are both broadcasts.
If the target identity has no entry in Marconi’s routing table at send time, the envelope goes onto the audit queue with outcome=queued_offline and the pending-signal index keeps it replay-eligible. On the target’s next prism_start, the bootstrap path returns the undelivered envelopes as pending_signals[]. From the agent’s perspective the signal “just appears” the moment they come back online — no matter how long they were gone.
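A small model makes the identity-to-session indirection concrete. It is a sketch under assumptions: IdentityRouter, bind, unbind, and route are hypothetical names, and bind stands in for the part of prism_start that returns pending_signals[].

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


class IdentityRouter:
    """Hypothetical routing table keyed by identity, never by session id."""

    def __init__(self) -> None:
        self.session_for: Dict[str, str] = {}  # identity -> current session id
        self.pending: Dict[str, List[dict]] = defaultdict(list)

    def bind(self, identity: str, session_id: str) -> List[dict]:
        """Model of prism_start: rebind the identity, return pending signals."""
        self.session_for[identity] = session_id
        return self.pending.pop(identity, [])

    def unbind(self, identity: str) -> None:
        self.session_for.pop(identity, None)

    def route(self, to: str, envelope: dict) -> Tuple[str, Optional[str]]:
        session_id = self.session_for.get(to)
        if session_id is None:
            self.pending[to].append(envelope)  # stays replay-eligible
            return ("queued_offline", None)
        return ("pushed", session_id)


router = IdentityRouter()
router.bind("Donna", "session-1")
first = router.route("Donna", {"signal_id": "a"})
router.unbind("Donna")  # Donna restarts her editor
second = router.route("Donna", {"signal_id": "b"})
replayed = router.bind("Donna", "session-2")  # she comes back on a new session
third = router.route("Donna", {"signal_id": "c"})
```

The sender passes to="Donna" all three times; only the table behind the name changes.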
Three planes, one mesh
The signal infrastructure rides three independent transports, by design. Each one fails for different reasons; coupling them turns one outage into three.
| Plane | Transport | What flows | Failure mode if down |
|---|---|---|---|
| Verb plane | HTTP POST/GET to /api/v1/* | Agent verbs, including prism_signal itself | Send fails; agent sees the error inline |
| Data plane | WebSocket /api/v1/session/ws — held by Marconi | Real-time signal delivery + coordination events | Push delivery degrades; audit fan-out + startup drain fill the gap |
| Liveness plane | HTTP POST /api/v1/session/heartbeat | Session keep-alive, master-election fitness | Session expires; reaper retires registration; election re-runs |
Heartbeat is deliberately out-of-band — telecom rule, signaling distinct from media. A doorbell that lives on the same wire as the data is a doorbell that rings only when nobody needs to be told. SPEC-045 consolidated the data plane onto a single WebSocket; Marconi now owns the WS handles on the receive side. The MCP server still does not open TCP connections to anything beyond the API port. Redis stays internal to the backend; the LAN-facing surface is HTTP and WebSocket only.
The per-persona daemon
Editors close. Tabs background. Laptops sleep. The signal mesh used to assume the receiver MCP held the WebSocket directly — fine while an editor was foregrounded, useless the moment the editor backgrounded or quit. Push delivery silently queued during those windows; signals only surfaced on the next defensive drain.
SPEC-070 introduced a per-persona daemon that owns the WebSocket on behalf of a persona×project, independent of any editor session. The daemon process lives outside the editor’s lifecycle: it starts on first bootstrap for that persona, holds the project channel open across editor restarts and tab backgrounding, and bridges incoming signals into whatever editor sessions are currently attached via a local IPC doorbell.
The shape:
- One daemon per persona×project. Each persona has its own listener, isolated from other personas’ state. Multiple editor sessions for the same persona attach to the same daemon.
- Always-on, even when no editor is running. The daemon survives editor close, sleep/wake, and shim restarts. Signals push from Marconi to the daemon via the WebSocket the moment they hit the backend; the editor sees them as soon as it attaches (or via a doorbell to the foreground tab if one is already attached).
- Turn-boundary-preserving delivery. The daemon holds signals briefly when the agent is in the middle of a turn and forwards them at the next safe boundary, so the editor doesn’t get interrupted mid-tool-call.
- Lifecycle owned by the shim. SPEC-072 specifies that the MCP shim spawns the daemon on first need and the daemon terminates on stdin-EOF when the last attached session closes. No orphaned processes, no manual cleanup.
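Turn-boundary-preserving delivery can be pictured as a small hold buffer. This is a sketch, assuming the daemon is told when a turn starts and ends; TurnBoundaryBuffer and its method names are hypothetical and do not come from SPEC-070.

```python
from typing import Callable, List


class TurnBoundaryBuffer:
    """Hypothetical hold buffer: forward at once when idle, hold mid-turn."""

    def __init__(self, forward: Callable[[dict], None]) -> None:
        self.forward = forward
        self.in_turn = False
        self.held: List[dict] = []

    def turn_started(self) -> None:
        self.in_turn = True

    def on_signal(self, signal: dict) -> None:
        if self.in_turn:
            self.held.append(signal)  # do not interrupt a tool call
        else:
            self.forward(signal)

    def turn_ended(self) -> None:
        self.in_turn = False
        held, self.held = self.held, []
        for signal in held:  # flush in arrival order at the safe boundary
            self.forward(signal)


seen: List[dict] = []
daemon = TurnBoundaryBuffer(seen.append)
daemon.on_signal({"signal_id": "idle"})
daemon.turn_started()
daemon.on_signal({"signal_id": "mid-turn"})
seen_during_turn = list(seen)
daemon.turn_ended()
```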
The architectural payoff: idle-agent push works across all surfaces — claude.ai, Claude Code, Codex — without each surface having to invent its own background-suspension workaround. The daemon is the universal answer.
For Cursor specifically, server-initiated notifications over stdio are silently dropped — there is no bell to ring. SPEC-091 closes the gap with a Streamable HTTP transport face: the daemon owns a tiny HTTP server bound to 127.0.0.1 and pushes signals to Cursor over a GET SSE stream. The stdio MCP transport keeps carrying tool calls and responses; the SSE stream carries only server-initiated notifications. Scope is Cursor-only — no cross-editor migration. See StreamableHttpPushStrategy below.
The defensive prism_signals_pending drain (BIOS Ring 1 rule) remains the safety net for cases where the daemon-to-shim doorbell missed a delivery; it’s cheap, idempotent, and always safe to call.
Per-surface delivery strategies
Every surface has its own answer to the question “the MCP just received a signal — how does the agent see it?” The answer is a SignalDeliveryStrategy — a small adapter that takes a PrismSignal and makes it visible to the agent in whatever way that surface allows.
| Strategy | Surface | How the agent sees it | Spec |
|---|---|---|---|
| ChannelsPushStrategy | Claude Code | A <channel> tag injected into Claude’s context, a doorbell that prompts the agent to drain the queue | SPEC-044 |
| AppServerInjectStrategy | Codex | Local Codex app-server thread/inject_items: a model-visible item appended (or turn/steer/turn/start when active). This is the current path most likely to affect what appears around the Codex Agent Statusline. | SPEC-048 |
| StreamableHttpPushStrategy | Cursor | Daemon-owned HTTP/SSE server on 127.0.0.1; the daemon pushes signals over a GET SSE stream that Cursor’s MCP client consumes as server-initiated notifications. Coexists with the existing stdio MCP transport. | SPEC-091 |
| PiggybackStrategy | Claude Desktop, fallback | Buffered in-process and appended to the next verb response as pending_signals | SPEC-037 |
Selection is driven by PRISM_AGENT_SURFACE at MCP startup. There is no hardcoded surface detection — the launcher exports the env var, the MCP reads it, the strategy is picked from a map. Adding a new surface is a class plus one map entry.
PiggybackStrategy is the universal fallback. Every surface gets it for free because every Prism verb response can carry pending_signals[]. Push strategies layer on top — when push fails (transport down, app-server unavailable, channel call returns an error), the strategy falls back to piggyback and the signal is still seen on the next verb call. Two delivery paths converge on the same buffer; the agent sees each signal exactly once.
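The selection-plus-fallback pattern might look like the sketch below. It is illustrative only: PushStrategy, select_strategy, the registry shape, and the surface value "example-surface" are assumptions, and the real SignalDeliveryStrategy interface may differ. Only PRISM_AGENT_SURFACE and the PiggybackStrategy name come from this page.

```python
import os
from typing import Callable, Dict, List, Mapping, Optional


class PiggybackStrategy:
    """Buffers signals until the next verb response carries them out."""

    def __init__(self) -> None:
        self.buffer: List[dict] = []

    def deliver(self, signal: dict) -> str:
        self.buffer.append(signal)
        return "piggyback"


class PushStrategy:
    """Hypothetical push adapter that falls back to piggyback on any error."""

    def __init__(self, push: Callable[[dict], None], fallback: PiggybackStrategy) -> None:
        self.push = push
        self.fallback = fallback

    def deliver(self, signal: dict) -> str:
        try:
            self.push(signal)
            return "push"
        except Exception:
            return self.fallback.deliver(signal)


def select_strategy(
    registry: Dict[str, Callable[[PiggybackStrategy], object]],
    fallback: PiggybackStrategy,
    env: Optional[Mapping[str, str]] = None,
):
    """Pick a strategy from a map keyed by PRISM_AGENT_SURFACE."""
    source = env if env is not None else os.environ
    factory = registry.get(source.get("PRISM_AGENT_SURFACE", ""))
    return factory(fallback) if factory else fallback


def flaky_push(signal: dict) -> None:
    raise ConnectionError("transport down")


piggyback = PiggybackStrategy()
registry = {"example-surface": lambda fb: PushStrategy(flaky_push, fb)}
strategy = select_strategy(registry, piggyback, {"PRISM_AGENT_SURFACE": "example-surface"})
result = strategy.deliver({"signal_id": "s1"})
```

Adding a surface in this model is one class plus one registry entry, and an unknown surface simply gets the piggyback path.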
Doorbell, not delivery
The push strategies follow Candi’s design pattern: the push event is a doorbell, not the package. The notification carries minimal metadata — signal type, sender identity, a short hint. The actual payload is fetched via prism_signals_pending, which atomically marks-and-returns whatever the receiver hasn’t seen.
This decoupling buys two things. First, push reliability and content correctness become independent — a flaky push channel can drop notifications without losing signals, because the audit queue (and beyond it, the Marconi Cache + PG audit) is the source of truth. Second, the work an agent does in response to a signal is identical regardless of how the doorbell rang: drain, decide, act. Code paths converge on a single drain.
When you see this in a Claude Code session:
<channel source="prism" signal_type="TaskAssigned" from="Donna" signal_id="…">
Signal from Donna: TaskAssigned. Call prism_signals_pending to read.
</channel>
That is the doorbell. The agent calls prism_signals_pending to pick up the actual payload.
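The mark-and-return drain behind the doorbell can be sketched as follows. This is a minimal in-memory model, assuming a lock is enough to make the drain atomic within one process; PendingStore is a hypothetical name and does not describe how prism_signals_pending is implemented.

```python
import threading
from typing import Dict, List, Set, Tuple


class PendingStore:
    """Hypothetical store whose drain marks and returns unseen signals."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._signals: Dict[str, dict] = {}  # signal_id -> payload
        self._seen: Set[str] = set()

    def add(self, signal_id: str, payload: dict) -> None:
        with self._lock:
            self._signals.setdefault(signal_id, payload)  # duplicates are no-ops

    def drain(self) -> List[Tuple[str, dict]]:
        with self._lock:  # mark and return under one lock
            fresh = [(sid, p) for sid, p in self._signals.items() if sid not in self._seen]
            self._seen.update(sid for sid, _ in fresh)
            return fresh


store = PendingStore()
store.add("sig-1", {"signal_type": "TaskAssigned"})
first_drain = store.drain()
store.add("sig-1", {"signal_type": "TaskAssigned"})  # a second doorbell for the same signal
second_drain = store.drain()
```

However many times the doorbell rings, the agent reads each payload once.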
The doorbell envelope now carries a sign-coded result envelope (PR #285 / PR #286) — a short [stage=…] banner the shim renders alongside the doorbell to make the delivery stage observable without a separate trace call. Possible stages: model_acted_ack, surface_observed, pushed, queued_offline. The banner is render-only; the authoritative state still lives in signal_trace_events.
Trace and acknowledgement — verifying delivery end-to-end
A doorbell that rang and a signal the agent actually saw are two different events. The push reaches the local daemon; the daemon hands it to the editor’s MCP shim; the shim renders it to the model. Any of those hops can drop a delivery and the audit pipeline’s delivered_at stamp won’t catch the difference — the row says delivered the moment Marconi pushed the WS frame, even if the agent never processed it.
prism_signal_trace and prism_signal_ack close that gap. They were the operability primitive missing from the SPEC-082 wave — without them, signal-wake validation could prove transport but not model-acted ACK, because the signal mesh had no way to record “the model saw it and acted.”
| Verb | Caller | Purpose |
|---|---|---|
| prism_signal_ack(pid, trace_id) | Receiver agent (model) | Records stage=model_acted_ack against a signal’s trace_id. Idempotent. The receiver calls this on every TaskAssigned / Acknowledgment / StatusUpdate / ReviewCompleted / ReviewRequested received, before processing the payload. |
| prism_signal_trace(pid, trace_id) | Any agent | Returns the lifecycle trace: when Marconi accepted the envelope, when the WS push fired, when the strategy delivered, when the receiver acked. Under SPEC-104 the default is cache-first for recent Marconi signals; use source="pg_audit" for historical audit/reporting reads. Cache miss is not the same as historical absence unless include_history=true or source="pg_audit" is used. The audit surface for “did this signal really land?” |
Every <channel> doorbell carries the trace_id in the envelope so the receiver can ack without ambiguity. The receiver agent’s first act on a doorbell is prism_signal_ack — same semantic shape as ringing a bell on a delivery: the package is in the building, the recipient signed for it, the trace can be inspected later.
Under Marconi, the trace verbs are cache-first by default (SPEC-104) — recent signal lifecycle reads come from the in-memory Marconi Cache, with source="pg_audit" as the explicit historical / reporting / reconciliation path. PG remains the canonical durable record; the archiver lag is operator-observable, and in steady state every cache write triggers a near-immediate PG write so the historical path tails the cache closely.
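The idempotent ack and the cache-first read can be modeled together. This is a sketch with two plain dictionaries standing in for the Marconi Cache and the PG audit store; TraceSketch and its method signatures are hypothetical, while the stage name model_acted_ack and the source="pg_audit" / include_history parameters come from this page.

```python
from typing import Dict, List, Optional


class TraceSketch:
    """Hypothetical model of cache-first trace reads and idempotent acks."""

    def __init__(self) -> None:
        self.cache: Dict[str, List[str]] = {}     # recent signals
        self.pg_audit: Dict[str, List[str]] = {}  # durable history

    def record(self, trace_id: str, stage: str) -> None:
        stages = self.cache.setdefault(trace_id, [])
        if stage not in stages:  # repeating a stage changes nothing
            stages.append(stage)

    def ack(self, trace_id: str) -> None:
        self.record(trace_id, "model_acted_ack")

    def trace(
        self,
        trace_id: str,
        source: Optional[str] = None,
        include_history: bool = False,
    ) -> Optional[List[str]]:
        if source == "pg_audit":
            return self.pg_audit.get(trace_id)
        hit = self.cache.get(trace_id)
        if hit is None and include_history:
            return self.pg_audit.get(trace_id)
        return hit  # None here means a cache miss, not historical absence


traces = TraceSketch()
traces.record("t-1", "pushed")
traces.ack("t-1")
traces.ack("t-1")  # a second ack is a no-op
traces.pg_audit["t-old"] = ["pushed", "model_acted_ack"]
recent = traces.trace("t-1")
miss = traces.trace("t-old")
historical = traces.trace("t-old", include_history=True)
```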
Channel-probe diagnostics
SPEC-100 added prism_channel_probe and prism_channel_probe_ack (PR #287, #290) — operator-invoked loopback diagnostics that send a synthetic envelope from a sender shim through Marconi to a receiver shim and back. The probe surfaces:
- Whether the sender’s Marconi-attached WS handle is healthy
- Whether the receiver’s daemon is attached and forwarding
- Whether the surface strategy rendered the doorbell
- Whether the receiver model called prism_channel_probe_ack (final delivery evidence)
Use it when delivery is suspect but the audit pipeline shows the row was written — that gap is exactly what the probe was built to disambiguate. The default ACK timeout bumped from 5s → 10s in PR #290 to accommodate per-persona daemon turn-boundary holds.
Durability backstop
The audit pipeline is the truth. Every signal — push-delivered, piggyback-delivered, broadcast, system-emitted — flows through Marconi’s audit queue, into the Marconi Cache Redis Stream, and through the PG archiver into signal_queue. Push delivery is the optimization; the audit pipeline is the guarantee.
Three drain paths converge on the canonical audit store:
- Hot-path push: Marconi’s direct WS push from the routing table. On success, the envelope carries outcome=pushed into the audit queue; the archiver lands it in signal_queue with publish_path=pushed_to_ws.
- Piggyback drain: every Prism verb response checks the receiver’s pending signals. Anything still un-acked at response time gets included as pending_signals[] and the trace event records delivery_method=piggyback.
- Startup drain: on prism_start, undelivered signals for the caller’s identity are returned in the response and the trace event records delivery_method=startup_drain. This is the path that catches anything queued while the agent was offline.
PR #300 fixed a translation bug where Marconi was emitting publish_path=pushed_to_marconi in the API response while older shims and the dashboard expected pushed_to_ws. The audit pipeline now normalizes the value at the archiver boundary so external surfaces see a stable vocabulary.
Delivered rows are retained for 30 days in PG audit for debugging. The Marconi Cache Redis Stream holds ~7 days (the rolling audit cache). Undelivered rows in signal_queue have no TTL — they wait until the target comes back. The mesh remembers, even when nobody is listening.
Loss budget
The SPEC-101 v0.4 architecture comes with an explicit, operator-signed loss budget — the full breakdown lives in docs/specs/spec-101-loss-budget-and-recovery.md. The summary:
| Scenario | Loss | Why |
|---|---|---|
| Backend graceful restart (SIGTERM) | 0 signals | Audit queue is flushed to Redis Stream during graceful shutdown |
| Backend hard kill (OOM / SIGKILL / host crash) | Up to audit-queue depth that hadn’t reached Redis yet | Bounded by writer lag, typically sub-second |
| Redis transient down | 0 signals | Audit queue buffers; writer resumes from last-acknowledged offset |
| Redis down longer than queue holds | Audit-queue overwrites are operator-visible loss | A correctly sized queue absorbs realistic outages; alert path is marconi_audit_queue_overwrite_total > 0 |
| PG down | 0 signals | Redis Stream is the upstream cache; archiver resumes from consumer-group checkpoint |
| Recipient WS disconnect mid-delivery | 0 signals | Replay-eligible until ACK evidence; pending-signal index keeps it live |
The operator framing (Frank, 2026-05-10): “If Redis is down, messages are delivered but never recorded at this stage of the build.” Delivery is the contract; durability is best-effort with a bounded, observable loss surface.
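To reason about the “Redis down longer than queue holds” row, a back-of-the-envelope calculation helps. The function below is an illustration only: it assumes the queue starts empty, nothing drains during the outage, and the send rate is constant. The capacity and rate in the usage lines are hypothetical numbers, not values from SPEC-101.

```python
def overwrites_during_outage(
    capacity: int, signals_per_second: float, outage_seconds: float
) -> int:
    """Entries overwritten if the queue starts empty and nothing drains."""
    produced = int(signals_per_second * outage_seconds)
    return max(0, produced - capacity)


# Hypothetical sizing: a 10,000-entry queue at 50 signals per second.
short_outage = overwrites_during_outage(10_000, 50, 120)  # 6,000 produced
long_outage = overwrites_during_outage(10_000, 50, 300)   # 15,000 produced
```

In this model the queue absorbs any outage shorter than capacity divided by rate; past that point every additional signal costs one audit entry.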
System signals
Some signals are not sent by agents — they’re emitted by the controller in response to lifecycle events. They use the same schema, the same channels, and the same delivery strategies. The agent sees them the same way.
| Signal | Emitted when | Delivered to |
|---|---|---|
| PeerJoined | A new agent registers on the project | All active sessions on that project (broadcast) |
| PeerLeft | An agent deregisters or expires | All active sessions on that project (broadcast) |
| MasterPreempted | A new master claims the role: election preempt, prism_master_claim, or prism_master_handoff | The previous master’s session (direct) |
PeerJoined and PeerLeft are how an agent knows the room shape changed. MasterPreempted is how the previous master finds out it’s no longer the master — it cleans up gRPC streams, surrenders any held leases, and continues as a peer.
System signals flow through Marconi’s hot path identically to agent signals — the controller calls send_signal with from="<system>", Marconi routes by identity (direct) or by broadcast (every session on the project), and the audit pipeline records them in signal_queue the same way.
MasterPreempted payload (SPEC-082 v0.3). Receivers can attribute the flip and reason about whether to refresh routing or surrender held leases. Fields:
- previous_master_identity, previous_master_session_id: who held master before this event. The receiver checks previous_master_session_id against its own session id to confirm it really is the displaced master (defense against re-delivery races).
- new_master_identity, new_master_session_id: who holds master now.
- reason: "preempt" for an operator-authorized claim or a CD-priority election preempt, "handoff" for a cooperative transfer.
- by_operator: operator id, present only when reason="preempt" and the trigger was prism_master_claim. Audit-log evidence that the flip was operator-driven, not automatic.
There is no MasterChanged enum. SPEC-082 v0.3 deliberately reused the existing MasterPreempted envelope rather than introducing a new type — the previous master is the only direct recipient either way, and one signal type is easier to reason about than two. Peer agents learn the new master via their next prism_status poll or a PeerJoined/PeerLeft delta.
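A receiver-side handler for this payload could look like the sketch below. The field names come from the payload description above; the function name, the returned dictionary shape, and the "ignore" / "step_down" action labels are hypothetical.

```python
def on_master_preempted(payload: dict, my_session_id: str) -> dict:
    """Decide what a receiver should do with a MasterPreempted payload."""
    # Guard against re-delivery races: act only if this session was the master.
    if payload.get("previous_master_session_id") != my_session_id:
        return {"action": "ignore"}
    return {
        "action": "step_down",  # clean up streams, surrender leases, continue as peer
        "new_master": payload.get("new_master_identity"),
        "reason": payload.get("reason"),
        # by_operator is present only for operator-driven preempts.
        "operator_driven": payload.get("reason") == "preempt" and "by_operator" in payload,
    }


example_payload = {
    "previous_master_identity": "Lola",
    "previous_master_session_id": "session-old",
    "new_master_identity": "Donna",
    "new_master_session_id": "session-new",
    "reason": "preempt",
    "by_operator": "operator-1",
}
decision = on_master_preempted(example_payload, "session-old")
stale = on_master_preempted(example_payload, "session-other")
```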
Sending a signal
The verb is small. Identity, type, payload — the rest is the mesh’s job.
prism_signal(
    pid="PID-PGR01",
    to="Donna",
    signal_type="TaskAssigned",
    payload={
        "description": "Review SPEC-049: single-writer session manager",
        "priority": "normal"
    }
) → {
    "signal_id": "…",
    "delivered": true,                      # see QoS section below for semantics
    "queued": false,                        # true if the recipient was offline/queued
    "resolved_to_session": "…",
    "recipient_state": "available",         # available | busy | offline | unknown
    "delivery_class": "async",              # sync | async | recall
    "expires_at": "2026-05-04T03:37:39Z",   # TTL: signal recalled at this time if undelivered
    "publish_path": "pushed_to_ws",         # pushed_to_ws | queued_offline | rejected_unknown
    "pending_signals": [...]                # piggyback drain of sender's own queue
}
For replies, pass in_reply_to=<original_signal_id>. Replies use the same lifecycle and the same drain paths; the threading is metadata, not a different transport.
Signal Bus QoS
SPEC-071 added explicit quality-of-service controls to prism_signal. The send response shape grew several fields that let the sender reason about what actually happened to the signal — important now that Marconi can report a richer recipient state than a single delivered boolean.
Delivery classes name the urgency:
| Class | Behavior under Marconi |
|---|---|
| sync | Sender waits for the recipient’s daemon to ACK receipt before the call returns. Use when correctness depends on the receiver having seen the signal. |
| async | Default. Marconi pushes to the recipient’s WS handle and appends to the audit queue, then returns. Daemon delivers as soon as it can. |
| recall | Fire-and-forget with a tight TTL; if the recipient hasn’t drained by expires_at, the audit pipeline marks the row recalled rather than delivered. Use for time-sensitive notifications that lose value if old. |
Recipient state tells the sender whether the recipient is reachable at all:
- available: daemon is attached, signal will land promptly
- busy: daemon is attached but the agent is in mid-turn; signal queues until next safe boundary
- offline: no daemon attached; signal queues for startup drain
- unknown: recipient identity not registered on this project; rejected with an explicit error rather than silently queued
Publish path is the single field to read when debugging delivery: pushed_to_ws means Marconi found a live WS handle and pushed directly, queued_offline means the recipient had no live entry so the audit pipeline is the only path forward, rejected_unknown means the identity didn’t resolve. Combined with delivered, this disambiguates the three real failure modes (wrong identity, recipient offline, transport down) instead of collapsing them into one boolean. Note the prior pushed_to_marconi variant was an internal label — PR #300 normalized it to pushed_to_ws at the archiver boundary so external surfaces see a stable vocabulary.
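A sender-side helper that reads these two fields might look like this. It is a sketch under one loud assumption: that publish_path=pushed_to_ws combined with delivered=false indicates a transport problem. This page does not spell that mapping out, so treat it as illustrative. The function name and the returned labels are hypothetical.

```python
def diagnose(response: dict) -> str:
    """Classify a prism_signal response into one of the failure modes."""
    path = response.get("publish_path")
    if path == "pushed_to_marconi":  # legacy internal label, normalized in PR #300
        path = "pushed_to_ws"
    if path == "rejected_unknown":
        return "wrong identity"
    if path == "queued_offline":
        return "recipient offline"
    if path == "pushed_to_ws":
        # Assumption: a push that did not result in delivery means transport down.
        return "delivered" if response.get("delivered") else "transport down"
    return "unrecognized"


healthy = diagnose({"publish_path": "pushed_to_ws", "delivered": True})
typo = diagnose({"publish_path": "rejected_unknown", "delivered": False})
```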
prism_signal_recall retracts an in-flight signal by id. If the recipient hasn’t drained yet, the audit row is marked recalled and the signal is suppressed at delivery. Useful when a mid-flight task assignment becomes obsolete.
The metric unknown_recipient_rejections_total increments on every rejected_unknown publish path, surfacing typo’d identities and stale routing without needing log inspection.
Process-restart recovery
Cold start is sub-second: Marconi initializes empty tables, rehydrates routing + registration from SessionStore (Redis remains the cross-restart durability anchor for active sessions), then accepts WS connections. Shims reconnect, re-register (idempotent), and Marconi attaches WS handles. The 1–2 second window where new signals route to outcome=queued_offline is acceptable for restart cadence — those signals land in the audit queue, drain into the cache, and reach the recipient on the next push or via the recipient’s startup drain.
In-flight audit-queue entries that hadn’t yet been written to the Redis Stream at restart time are lost on hard kill (network-grade durability per Frank’s loss-budget signoff). Graceful shutdown (SIGTERM) flushes the queue before exit — zero loss.
Where to read more
- SPEC-101 — Marconi architecture (v0.4): in-memory signal mesh, three-tier flow, hot-path / audit fan-out split, fine-grain feature flags
- SPEC-101 loss budget + recovery — Stage 0 gate: loss budget, recovery invariants (R1/R2/R3/R5/R6/R8), rollback procedures
- SPEC-100: operator-invoked signal-mesh loopback probe (prism_channel_probe)
- SPEC-034 — the foundational agent-to-agent signal verb, identity resolution, the strategy interface (architecture amended by SPEC-101)
- SPEC-037 — backend-side piggyback drain
- SPEC-044 — Claude Code Channels API as the doorbell for push delivery
- SPEC-045 — the unified WebSocket data plane and out-of-band heartbeat (Marconi now owns the WS handles on receive)
- SPEC-048: Codex AppServerInjectStrategy via local app-server thread/inject_items
- SPEC-058 — prior “Signal Delivery SSOT” — superseded by SPEC-101 v0.4. SPEC-058’s dedup contract and queue semantics still describe accurate historical behavior of the legacy synchronous-PG / Redis-pubsub path. The current SSOT for signal delivery is Marconi (SPEC-101); read SPEC-058 for archaeology, read SPEC-101 for current architecture.
- SPEC-070 — per-persona daemon: always-on listener bridge with turn-boundary-preserving delivery
- SPEC-071 — Signal Bus QoS: delivery classes, TTL, recall verb, recipient state
- SPEC-072 — per-session daemon lifecycle: shim-spawned, stdin-EOF terminated
- SPEC-091 — Cursor Streamable HTTP MCP transport for interrupt-grade signal delivery
- Marconi disaster-recovery runbook — operational steps for archiver lag, Redis outage, audit-queue overwrite incidents
The Multi-Prism Controller page covers the master election and registration table that the mesh assumes underneath. The Agent Surfaces page covers the per-surface lifecycle adapters that select which strategy gets loaded.
Last modified on May 18, 2026