Status: draft · Version v0.1 · Filed 2026-04-30

spec_id: SPEC-058 · version: v0.1 · status: draft · authored_by: Donna · date: 2026-04-30

SPEC-058 — Signal Delivery Single-Source-of-Truth

Status

Draft — for Frank’s review BEFORE any code changes. Author Donna.

Problem

Today’s fan-out test #3 (Donna → Porsche/Lafonda/Desiree/Texi, post-restart) exposed a duplicate-delivery defect: every Acknowledgment appeared twice in prism_signals_pending results — once from the in-process strategy buffer, once from the backend HTTP drain. Texi independently observed observed_pending_duplicates: 2 from the Codex side. Frank’s reaction is the right one: this is the architecture telling us a structural invariant has been lost. We should not patch around it.

Why this is structural, not cosmetic

The SPEC-054 Phase 3 Python→Node port (PR #21, merged 2026-04-29) translated the Python mcp/ tree to TypeScript at mcp-node/. Today’s signal-pipeline session shipped five fixes for SPEC-054 port misses — wires that existed in the Python source but weren’t reproduced in the TypeScript port:
| # | Miss | Fix commit |
|---|------|------------|
| 1 | project_id nesting + dropped guard → WS 4002 storm | (earlier) |
| 2 | SPEC-052 signalCache not ported → bell silent | PR #25 |
| 3 | SPEC-044 capabilities.experimental['claude/channel'] missing | 42c5301 |
| 4 | SPEC-034 §7.2 piggyback-merge wire missing on every-verb | 9147f79 |
| 5 | SPEC-048 codex thread/inject_items vs turn/start | 703675c |
| 5b | SPEC-044 §3.3 coalescing reset on drain dropped | af69848 |
The duplicate Frank observed today is port miss #6 — the dedup at the merge boundary (SPEC-037 §3). All six belong to the same class: load-bearing behavior in Python comments / cross-cutting decorators that the port flattened into per-file translations.

Pre-port working contract (the thing we lost)

The Python mcp/server.py (deleted in PR #21) had ONE cross-cutting decorator wrapping every non-lifecycle verb response:
# Two sources, merged and de-duplicated by ``signal_id``:
#   1. In-process strategy buffer (drain_buffered)
#   2. Backend HTTP drain (poll_pending_signals — atomic mark+return)
#
# SPEC-037 §3 — de-dupe by signal_id. In-cluster agents may see the
# same signal from both paths; LAN clients only see the backend path.
# Either way, never deliver the same signal twice.
collected = list(result.get("pending_signals") or [])
collected.extend(strategy.drain_buffered())
collected.extend(await client.poll_pending_signals(...))
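# set.add() returns None, so track membership explicitly.
seen = set()
deduped = []
for s in collected:
    if s["signal_id"] not in seen:
        seen.add(s["signal_id"])
        deduped.append(s)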
result["pending_signals"] = deduped
Three properties this guaranteed:
  1. Universal coverage. Every non-lifecycle verb response carried any pending signals — agents saw new arrivals in the next conversational turn without explicit polling.
  2. Two-source robustness. Push (LAN-cluster Redis subscriber) and pull (HTTP backend drain) both fed into the same merge point — either failing alone never lost a signal.
  3. Single-delivery invariant. Deduplication by signal_id ensured one logical signal was reported exactly once, no matter how many transport paths happened to carry it.

Current Node implementation (the gap)

The port split the original decorator into two unrelated code paths and dropped the dedup in both:
| Original SPEC-037 §3 contract | Current Node code | Status |
|---|---|---|
| Every non-lifecycle verb merges backend drain + strategy + dedups by signal_id | mcp-node/src/server.ts:118-143: drains ONLY strategy buffer; never calls backend drain; no dedup (one source) | Backend drain dropped from every-verb path |
| prism_signals_pending returns the same merge | mcp-node/src/verbs/coordination.ts:67: const pending = [...local, ...remote] | Dedup dropped; concatenates raw |
The structural error: the original was a cross-cutting concern that lived in one decorator. The port translated it as if it were two unrelated functions, and divergence followed.
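The gap between the raw concat and the SPEC-037 §3 contract can be seen in a small, self-contained sketch. The `Signal` shape and both function names here are illustrative assumptions, not the actual mcp-node code.

```typescript
// Hypothetical minimal signal shape; the real PrismSignal carries more fields.
interface Signal {
  signal_id: string;
  body: string;
}

// What the ported handler does today: raw concatenation, duplicates survive.
function concatRaw(fromBuffer: Signal[], fromBackend: Signal[]): Signal[] {
  return [...fromBuffer, ...fromBackend];
}

// What SPEC-037 §3 requires: the first occurrence of each signal_id wins.
function dedupBySignalId(fromBuffer: Signal[], fromBackend: Signal[]): Signal[] {
  const seen = new Set<string>();
  const out: Signal[] = [];
  for (const sig of [...fromBuffer, ...fromBackend]) {
    if (seen.has(sig.signal_id)) continue;
    seen.add(sig.signal_id);
    out.push(sig);
  }
  return out;
}

// The same acknowledgment arriving over both transport paths.
const ack: Signal = { signal_id: "sig-1", body: "ack" };
const bufferSide: Signal[] = [ack];
const backendSide: Signal[] = [ack, { signal_id: "sig-2", body: "other" }];

const rawCount = concatRaw(bufferSide, backendSide).length; // sig-1 appears twice
const dedupCount = dedupBySignalId(bufferSide, backendSide).length; // sig-1 appears once
```

With these inputs the raw concat yields three entries and the deduplicated merge yields two, which is the duplicate-ack symptom in miniature.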

Backend-side amplifier

backend/app/services/signal_service.py:mark_delivered_via_ws is supposed to stamp delivered_at + delivery_method='channels_push' when the WebSocket frame clears. Live signal_queue inspection shows that only 1 of the 23 targeted signals from the last 24h carries delivery_method='channels_push'; the other 22 are 'piggyback'. So either:
  • The function is racing with drain_for_caller (which stamps 'piggyback' on every prism_signals_pending call), or
  • The function is silently no-op’ing (no log lines in either backend or session-manager containers indicate either success or failure).
Either way, even if dedup is restored client-side, the backend keeps returning the same row from /signal/poll until piggyback drain claims it. The dedup at the merge point is what makes that race tolerable; without dedup, the race is a duplicate.

Goals

  1. Single-delivery invariant. A signal_id reaches the agent’s pending_signals[] field at most once, regardless of how many transport paths carried it.
  2. Universal coverage. Every non-lifecycle verb response carries any pending signals — same as Python had.
  3. Two-source robustness. Push (WS) and pull (HTTP drain) both feed the merge point; either path failing alone never loses a signal.
  4. Honest delivery accounting. signal_queue.delivered_at and delivery_method reflect what actually happened, not whichever drain raced first.
  5. Port-miss prevention. A single test catches the entire family of SPEC-054-class regressions on any future cross-language port.

Non-goals

  • Changing the wire envelope (SPEC-045 §4.2). Same shape on the WebSocket.
  • Changing the categories taxonomy (SPEC-052 §3). INFO/TASK/ASK/BLOCKER unchanged.
  • Changing the doorbell semantics (SPEC-044 §3.3). One coalesced notification per drain cycle, unchanged.
  • Reworking SPEC-056 routing or schema. The agent_id channel + four-level hierarchy are unaffected.
  • Adding any new transport. WS push and HTTP poll are the two paths; this spec just insists on a clean merge over them.

Architecture

One merge function, called from two sites

Restore the Python decorator’s contract as a single TypeScript helper:
// mcp-node/src/signalMerge.ts (NEW)
//
// SPEC-058 — single merge point for the two pending-signal sources.
// Used by:
//   1. server.ts every-verb piggyback hook (non-lifecycle verbs)
//   2. coordination.ts:prism_signals_pending verb handler
//
// Contract (SPEC-037 §3):
//   - Drains both the local strategy buffer AND the backend
//     /signal/poll endpoint.
//   - Deduplicates the union by signal_id.
//   - Records each unique signal into the per-identity cache (idempotent).
//   - Returns the deduped list — the only place pending_signals[] is built.
//
export async function mergeAndDedupPending(
  client: PrismClient,
  pid: string,
  identity: string,
): Promise<PrismSignal[]> {
  const local = (getActiveStrategy()?.drainBuffered() ?? []) as PrismSignal[];
  let remote: PrismSignal[] = [];
  try {
    remote = (await client.pollPendingSignals(pid, identity)) as PrismSignal[];
  } catch (exc) {
    log.warn("backend drain failed (non-fatal)", { exc });
  }

  const seen = new Set<string>();
  const merged: PrismSignal[] = [];
  for (const sig of [...local, ...remote]) {
    const sid = sig?.signal_id;
    if (!sid || seen.has(sid)) continue;
    seen.add(sid);
    merged.push(sig);
    await signalCache.record(sig, identity); // idempotent by sid
  }
  return merged;
}
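One way to exercise this contract in a unit test is to inject the two sources. The following is a synchronous simplification under stated assumptions: `Sources`, `mergeForTest`, and the callback names are hypothetical stand-ins for the strategy buffer, the backend poll, and the per-identity cache, not the real mcp-node APIs.

```typescript
// Hypothetical stand-ins for the real PrismSignal and the two sources.
interface Signal {
  signal_id?: string;
  body: string;
}

interface Sources {
  drainLocal: () => Signal[]; // stands in for the strategy buffer drain
  pollRemote: () => Signal[]; // stands in for the backend /signal/poll call
  record: (sig: Signal) => void; // stands in for the per-identity cache write
}

// Synchronous simplification of the merge contract described above.
function mergeForTest(src: Sources): Signal[] {
  const fromBuffer = src.drainLocal();
  let fromBackend: Signal[] = [];
  try {
    fromBackend = src.pollRemote();
  } catch {
    // Backend drain failure is non-fatal: local signals are still returned.
  }
  const seen = new Set<string>();
  const merged: Signal[] = [];
  for (const sig of [...fromBuffer, ...fromBackend]) {
    const sid = sig.signal_id;
    if (!sid || seen.has(sid)) continue; // skip id-less entries and repeats
    seen.add(sid);
    merged.push(sig);
    src.record(sig);
  }
  return merged;
}

// Case 1: the same signal_id arrives from both sources, plus one id-less entry.
const recorded: string[] = [];
const both = mergeForTest({
  drainLocal: () => [{ signal_id: "X", body: "ack" }],
  pollRemote: () => [{ signal_id: "X", body: "ack" }, { body: "no id" }],
  record: (sig) => recorded.push(sig.signal_id ?? ""),
});

// Case 2: the backend drain throws; the local signal must survive.
const backendDown = mergeForTest({
  drainLocal: () => [{ signal_id: "Y", body: "ack" }],
  pollRemote: () => {
    throw new Error("backend unreachable");
  },
  record: () => {},
});
```

Case 1 covers the single-delivery invariant (one entry, one cache write); case 2 covers two-source robustness when the pull path fails.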

Call sites

  • server.ts every-verb hook: replace the strategy-only drain at lines 118-143 with mergeAndDedupPending(client, pid, identity). Apply on every non-lifecycle verb response (existing NO_PIGGYBACK_VERBS exclusion list unchanged). The pid argument is read from the verb’s bootstrap state (whichever PID this verb is operating on); identity is agentIdentity().
  • coordination.ts prism_signals_pending handler: replace the local+remote concat at line 67 with the same mergeAndDedupPending(client, pid, identity). Result is the deduped pending_signals[] field.
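The every-verb hook could be shaped roughly as below. This is a sketch under assumptions: `attachPending`, `VerbResult`, and the verb names in the exclusion set are invented for illustration (the real NO_PIGGYBACK_VERBS contents are not listed in this spec), and `merge` stands in for mergeAndDedupPending with client, pid, and identity already bound.

```typescript
// Hypothetical shapes; the real verb result type lives in server.ts.
interface Signal {
  signal_id: string;
}
interface VerbResult {
  [key: string]: unknown;
  pending_signals?: Signal[];
}

// Assumed contents, for illustration only.
const NO_PIGGYBACK_VERBS = new Set<string>(["example_lifecycle_verb"]);

// Attach merged pending signals to a verb result unless the verb is excluded.
function attachPending(
  verb: string,
  result: VerbResult,
  merge: () => Signal[],
): VerbResult {
  if (NO_PIGGYBACK_VERBS.has(verb)) return result; // lifecycle verbs never drain
  return { ...result, pending_signals: merge() };
}

// Count merge invocations to confirm excluded verbs do not drain.
let mergeCalls = 0;
const merge = (): Signal[] => {
  mergeCalls += 1;
  return [{ signal_id: "X" }];
};

const lifecycle = attachPending("example_lifecycle_verb", { ok: true }, merge);
const ordinary = attachPending("example_ordinary_verb", { ok: true }, merge);
```

Keeping the exclusion check ahead of the merge call matters: draining on an excluded verb would consume signals that the response never carries.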

Backend stamping race resolution

mark_delivered_via_ws and drain_for_caller race on the same row. Today, drain_for_caller wins ~96% of the time (22 of the 23 targeted signals in the last 24h were stamped piggyback). Three points, of which only the second requires a code change:
  1. Priority order via UPDATE conditional: mark_delivered_via_ws already has WHERE delivered_at IS NULL; drain_for_caller does too. Order is whoever commits first. Acceptable.
  2. Add log instrumentation: mark_delivered_via_ws should log at INFO (not DEBUG) on every fire — both successful (rowcount=1) and silent-loss (rowcount=0, meaning piggyback won the race). Today’s container logs show neither, which is the diagnostic gap that hid this for 24h. INFO logs would have surfaced “channels_push attempted, raced lost — piggyback claimed” frequency immediately.
  3. Retain the priority preference in display. If both writers ever attempted to stamp the same row (the WHERE clause makes a double stamp impossible), the row keeps the stamp of whichever wrote first. No additional code needed.
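The race and the proposed instrumentation can be modelled in memory. This is only a sketch: the backend is Python, the function names below are hypothetical TypeScript stand-ins, and the log wording is an assumption rather than the actual log format.

```typescript
// In-memory model of one signal_queue row; column names follow the spec text.
interface QueueRow {
  signal_id: string;
  delivered_at: number | null;
  delivery_method: "channels_push" | "piggyback" | null;
}

const logLines: string[] = [];

// Models `UPDATE ... WHERE delivered_at IS NULL`: returns the affected rowcount.
function stampIfUndelivered(
  target: QueueRow,
  method: "channels_push" | "piggyback",
  now: number,
): number {
  if (target.delivered_at !== null) return 0; // the other writer committed first
  target.delivered_at = now;
  target.delivery_method = method;
  return 1;
}

// Hypothetical shape of the INFO instrumentation on the WS path.
function markDeliveredViaWs(target: QueueRow, now: number): number {
  const rowcount = stampIfUndelivered(target, "channels_push", now);
  logLines.push(
    rowcount === 1
      ? `INFO channels_push stamped signal_id=${target.signal_id}`
      : `INFO channels_push lost race signal_id=${target.signal_id} claimed_by=${target.delivery_method}`,
  );
  return rowcount;
}

// Piggyback commits first, then the WS stamp arrives late.
const queueRow: QueueRow = { signal_id: "X", delivered_at: null, delivery_method: null };
const piggybackCount = stampIfUndelivered(queueRow, "piggyback", 1000);
const wsCount = markDeliveredViaWs(queueRow, 1001);
```

Logging the rowcount=0 branch is what makes the lost race visible; without it, a late WS stamp leaves no trace at all.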

Files changed

| File | Change | Why |
|---|---|---|
| mcp-node/src/signalMerge.ts | New | The single merge helper |
| mcp-node/src/server.ts | Modified | Replace strategy-only drain at L118-143 with mergeAndDedupPending |
| mcp-node/src/verbs/coordination.ts | Modified | Replace [...local, ...remote] at L67 with mergeAndDedupPending |
| backend/app/services/signal_service.py | Modified | Promote mark_delivered_via_ws log lines from DEBUG to INFO; log both rowcount=1 and rowcount=0 cases |
| backend/app/routers/session_stream.py | Modified | Same INFO-level logging on the call site |
| mcp-node/tests/spec058_dedup.test.ts | New | Unit test: feed the same signal_id into both sources, expect dedup |
| backend/tests/test_spec058_delivery_method_distribution.py | New | Smoke: send 10 signals to a connected agent, assert ≥80% land as channels_push (proves WS-side stamping is racing fairly) |
No backend schema changes. No migrations.
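The arithmetic behind the distribution smoke test is simple enough to sketch. The helper name `shareOf` is hypothetical, and the real test is planned in Python against the backend; the sample below only reproduces the 22 piggyback / 1 channels_push distribution reported in this spec.

```typescript
type DeliveryMethod = "channels_push" | "piggyback";

// Fraction of rows stamped with the given method; 0 for an empty sample.
function shareOf(rows: DeliveryMethod[], method: DeliveryMethod): number {
  if (rows.length === 0) return 0;
  return rows.filter((m) => m === method).length / rows.length;
}

// The distribution reported in this spec: 22 piggyback, 1 channels_push.
const observed: DeliveryMethod[] = [
  ...new Array<DeliveryMethod>(22).fill("piggyback"),
  "channels_push",
];

const pushShare = shareOf(observed, "channels_push"); // 1 of 23
const piggybackShare = shareOf(observed, "piggyback"); // 22 of 23
const meetsThreshold = pushShare >= 0.8; // the smoke test's acceptance bar
```

On today’s data the piggyback share rounds to 96% and the 80% bar for channels_push is not met, which is the condition the smoke test is meant to detect.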

Test plan — port-miss prevention

A single end-to-end test catches the ENTIRE SPEC-054-class regression family. Add to CI on every PR that touches mcp-node/src/ OR backend/app/services/signal_service.py:
TEST: signal_delivery_single_source_of_truth (E2E, real backend, real WS, two real MCP processes)

GIVEN  Agent A and Agent B, both bootstrapped on a real backend
       (one Postgres, one Redis, both MCP processes are real Node shims).

WHEN   A calls prism_signal(to=B, ...) returning signal_id=X
       — wait for the WS push to land at B (≤2s).
       — B calls prism_signals_pending.

THEN
  1. B's pending_signals[] contains EXACTLY ONE entry with signal_id=X.
     (Catches port miss #6 — dedup at merge.)
  2. signal_queue row for X has delivered_at != NULL.
     (Catches both #1 — WS auth — and the backend stamping defects.)
  3. B's per-identity ring file contains ONE entry for signal_id=X.
     (Catches #2 — signalCache port miss.)
  4. B observed a `<channel>` doorbell within 2s of A's send.
     (Catches #3 — capability flag — and #4 — channel push wiring.)
  5. Send a second signal Y immediately after X. B observes a SECOND
     doorbell (not coalesced) AFTER B has called prism_signals_pending.
     (Catches #5b — coalescing reset on drain.)
  6. Repeat with Codex agent for B — expect the same single-delivery
     and same `delivery_path: turn_start` (or piggyback fallback if
     app-server is off in the test env).
     (Catches #5 — codex strategy.)

  Total: ONE TEST covers all six port misses + the dedup contract.
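Assertion 5 depends on the coalescing flag being reset on drain. A minimal model of that behaviour, assuming the boolean-flag design attributed to SPEC-044 §3.3 (the class and method names are hypothetical), might look like this:

```typescript
// Minimal model: one doorbell per drain cycle, re-armed when the agent drains.
class Doorbell {
  private rung = false; // true once a notification was sent in this cycle
  public notifications = 0;

  // Called when a signal arrives over the push path.
  onSignal(): void {
    if (this.rung) return; // coalesce: already notified in this cycle
    this.rung = true;
    this.notifications += 1;
  }

  // Called when the agent drains pending signals; re-arms the doorbell.
  onDrain(): void {
    this.rung = false;
  }
}

const bell = new Doorbell();
bell.onSignal(); // X arrives: first doorbell
bell.onSignal(); // another arrival before the drain: coalesced
const beforeDrain = bell.notifications;
bell.onDrain(); // agent calls prism_signals_pending
bell.onSignal(); // Y arrives: second doorbell
const afterDrain = bell.notifications;
```

If `onDrain` failed to clear the flag (port miss #5b), the second doorbell would never fire and `afterDrain` would stay at 1.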

Acceptance criteria

  1. mcp-node/src/signalMerge.ts exists and is the only place pending_signals[] is constructed.
  2. server.ts and coordination.ts both call mergeAndDedupPending — no inline concat or strategy-only drain remains.
  3. The fan-out test from this morning (Donna → 4 agents, 4 acks back) returns each ack EXACTLY ONCE in prism_signals_pending.
  4. After 10 signals between two connected agents, ≥80% of signal_queue rows show delivery_method='channels_push' (validates WS stamping is firing more often than piggyback).
  5. The end-to-end test in §test plan passes on a fresh backend with two real Node MCP processes.
  6. Each member of the cross-language port-miss family (items 1-5b in the table above) has a dedicated assertion in the E2E test.

Phased rollout

Single phase. The mcp-node side is two function changes in two files plus one new helper file. The backend side is INFO-level logging upgrades (no behavior change). Everything ships in one PR. After merge, Frank quits (Cmd+Q) and reopens all 5 agent tabs to pick up the new mcp-node dist/.

Out of scope

  • Federation, cross-tenant signals (SPEC-056 covers).
  • Replacing the WebSocket transport with anything else.
  • Coalescing tuning. The current SPEC-044 §3.3 boolean flag is correct.
  • Cleanup of the legacy session channel — that drops naturally when SPEC-056 cutover completes.

References

  • Specs: SPEC-034 (signal delivery), SPEC-037 (backend piggyback + dedup contract), SPEC-044 (channel push), SPEC-045 (WS data plane), SPEC-048 (codex), SPEC-052 (signal cache), SPEC-054 (Node MCP shim — the port).
  • Memories (port miss family): project_spec_054_port_miss_project_id, project_spec_054_port_miss_coalescing_reset, feedback_document_port_misses.
  • ADRs: ADR #34 (agent_id channel), ADR-25 (lifecycle ≠ messaging multiplex).
  • Live evidence (this session): signal_queue distribution showing 22 piggyback / 1 channels_push for last 24h; fan-out test #3 returning duplicate acks via prism_signals_pending.

Authorship

Donna (Claude Code, mini3, session 3f36b796). 2026-04-30. Authored after Frank flagged that the duplicate is a hack symptom and demanded a structural fix with a SPEC for review BEFORE any code touches the tree.
Last modified on May 18, 2026