
Signal Visibility Remediation Plan

Date: 2026-05-03
Owner: Texi
Project: Prism / PID-PGR01
Status: recommendation

Objective

Close the remaining signal-visibility gap with a plan that is complete enough to execute through SPEC, design, build, test, and deploy. This plan is based on the research note:
  • docs/research/codex-piggyback-surfacing-2026-05-03.md

Problem Split

This is not one bug. Two separate problems are involved, and each needs its own fix.

Problem A: cross-surface race in same-turn signal visibility

Observed live on 2026-05-03:
  • self-targeted prism_signal returned publish_path=buffered_for_piggyback
  • the same response did not include pending_signals
  • the signal later appeared via explicit prism_signals_pending
Most likely path:
  1. backend persists signal
  2. backend publishes WS envelope
  3. backend marks delivered_at immediately after send_json
  4. mcp-node follow-on piggyback poll sees no undelivered row
  5. local strategy buffer has not necessarily populated yet
Net effect:
  • same-turn piggyback merge can miss a just-delivered signal
  • the system claims piggyback semantics without actually surfacing the payload
Scope note:
  • this is now treated as universal, not Codex-specific
  • Donna observed the same signature on claude_code receivers where signals were marked pushed_to_ws yet sat silent until explicit drain
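
The suspected sequence can be reproduced with a small in-memory model. This is a sketch under stated assumptions, not Prism code: FakeBackend, publish, poll_undelivered, in_flight, and piggyback_merge are hypothetical names, and the WS socket is modeled as a plain list that the bridge has not read yet.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Row:
    signal_id: str
    delivered_at: Optional[float] = None  # None means "still undelivered"


@dataclass
class FakeBackend:
    """Hypothetical stand-in for the backend's persist/publish/poll steps."""
    rows: dict = field(default_factory=dict)
    in_flight: list = field(default_factory=list)  # WS frames sent but not yet read

    def publish(self, signal_id: str, now: float) -> None:
        row = Row(signal_id)
        self.rows[signal_id] = row        # step 1: persist the signal
        self.in_flight.append(signal_id)  # step 2: send the WS envelope
        row.delivered_at = now            # step 3: stamp right after the send

    def poll_undelivered(self) -> list:
        return [r.signal_id for r in self.rows.values() if r.delivered_at is None]


def piggyback_merge(backend: FakeBackend, local_buffer: list) -> list:
    """Same-turn merge as described above: backend poll plus local strategy buffer."""
    return sorted(set(backend.poll_undelivered()) | set(local_buffer))


backend = FakeBackend()
local_buffer: list = []  # step 5: the bridge has not consumed the frame yet
backend.publish("sig-1", now=100.0)

# Step 4: the follow-on poll finds no undelivered row, and the buffer is empty.
same_turn = piggyback_merge(backend, local_buffer)

# Later the bridge drains the socket into the local strategy buffer.
local_buffer.extend(backend.in_flight)
backend.in_flight.clear()
next_turn = piggyback_merge(backend, local_buffer)
```

In this model same_turn comes back empty even though the signal was persisted and sent, and the signal only becomes visible once the local buffer catches up. That matches the observed signature, but it does not prove this is the only path that produces it.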

Problem B: stale Codex delivery classification

Backend code still treats codex as piggyback-only:
  • backend/app/services/signal_service.py:_publish_path_for_surface
But current mcp-node Codex behavior is capability-based:
  • mcp-node/src/surfaces/codex.ts
  • mcp-node/src/strategies/app_server_inject.ts
When app-server injection is configured, Codex is not behaviorally equivalent to a piggyback-only surface. Current publish_path labeling is therefore stale and can mislead both operators and downstream logic.

Completion Definition

This work is complete only when all five are present:
  1. SPEC
  2. Design
  3. Build
  4. Test
  5. Deploy
Anything short of that is partial.

SPEC

Recommended next artifact:
  • file a follow-on spec for “cross-surface same-turn signal-visibility race closure + Codex-specific capability registration”
Suggested spec scope:
  1. Close the cross-surface race between WS push stamping and same-turn fallback visibility.
  2. Replace static Codex surface inference with explicit session capability registration.
  3. Define honest delivery semantics for:
    • attempted push
    • accepted by client bridge
    • surfaced to current turn
    • available via fallback drain
  4. Preserve backward compatibility for surfaces that remain piggyback-only.
Suggested owner split:
  • architecture/spec: Texi
  • implementation: Donna

Design

D1. Close the race universally

The race fix should apply across surfaces, not just to Codex. Recommended implementation:
  • backend Redis just-pushed cache keyed by receiving session and signal id
  • short TTL (~5s), e.g. a Redis hash just_pushed:{session_id} whose fields map <signal_id> to the full serialized PendingSignal payload
  • /signal/poll returns an additive field such as just_pushed_signals: []
  • mcp-node consumes when present and ignores when absent
Why this over a shim-local cache:
  • one source of truth
  • works for wildcard fan-out to N receivers
  • clean additive rollout path
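
A minimal sketch of the cache behavior, using an in-memory stand-in instead of Redis so it runs anywhere. JustPushedCache, record_push, just_pushed, poll_response, and the "signals" key are hypothetical names, and per-entry expiry is an assumption of this sketch. In Redis, a plain EXPIRE applies to the whole hash key rather than to individual fields, so the real design would still need to decide how per-signal expiry maps onto that.

```python
import time
from typing import Callable, Dict, List, Tuple


class JustPushedCache:
    """In-memory stand-in for the proposed Redis just-pushed cache (hypothetical)."""

    def __init__(self, ttl_seconds: float = 5.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        # session_id -> {signal_id: (expires_at, payload)}
        self._entries: Dict[str, Dict[str, Tuple[float, dict]]] = {}

    def record_push(self, session_id: str, signal_id: str, payload: dict) -> None:
        """Remember a signal that was just pushed over WS to one receiver."""
        bucket = self._entries.setdefault(session_id, {})
        bucket[signal_id] = (self._clock() + self._ttl, payload)

    def just_pushed(self, session_id: str) -> List[dict]:
        """Return unexpired payloads for a session and drop the expired ones."""
        now = self._clock()
        bucket = self._entries.get(session_id, {})
        live = {sid: entry for sid, entry in bucket.items() if entry[0] > now}
        if live:
            self._entries[session_id] = live
        else:
            self._entries.pop(session_id, None)
        return [payload for _, payload in live.values()]


def poll_response(undelivered: List[dict], cache: JustPushedCache,
                  session_id: str) -> dict:
    """Build a poll body with the additive just_pushed_signals field."""
    return {
        "signals": undelivered,  # assumed name for the existing field
        "just_pushed_signals": cache.just_pushed(session_id),
    }


# Usage with a fake clock so expiry is deterministic.
now = [0.0]
cache = JustPushedCache(ttl_seconds=5.0, clock=lambda: now[0])
cache.record_push("sess-a", "sig-1", {"signal_id": "sig-1", "body": "hello"})
fresh = poll_response([], cache, "sess-a")
now[0] = 6.0
stale = poll_response([], cache, "sess-a")
```

Keying by receiving session is what lets a wildcard fan-out record one entry per receiver without the receivers seeing each other's entries.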

D2. Register capabilities, not guesses

Replace static agent_surface -> publish_path inference with explicit session capabilities recorded at bootstrap or registration time. Minimum fields:
  • push_capable: bool
  • push_mode: claude_channel | codex_app_server | none
  • optional push_visibility: current_turn | side_channel | fallback_only
Why:
  • codex is now a family of behaviors, not one behavior
  • current backend labeling assumes too much from the surface string
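
One possible shape for the registered capabilities, as a sketch only. The field names and enum values come from the list above. SessionCapabilities, is_piggyback_only, and the consistency check in __post_init__ are assumptions of this example, not existing backend code.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class PushMode(str, Enum):
    CLAUDE_CHANNEL = "claude_channel"
    CODEX_APP_SERVER = "codex_app_server"
    NONE = "none"


class PushVisibility(str, Enum):
    CURRENT_TURN = "current_turn"
    SIDE_CHANNEL = "side_channel"
    FALLBACK_ONLY = "fallback_only"


@dataclass(frozen=True)
class SessionCapabilities:
    push_capable: bool
    push_mode: PushMode
    push_visibility: Optional[PushVisibility] = None

    def __post_init__(self) -> None:
        # Assumed rule: reject registrations where the two fields contradict each other.
        if self.push_capable == (self.push_mode is PushMode.NONE):
            raise ValueError(
                "push_capable must be true exactly when push_mode is not 'none'"
            )


def is_piggyback_only(caps: SessionCapabilities) -> bool:
    """Classify from registered capabilities instead of the surface string."""
    return not caps.push_capable


# Two sessions that share the surface string "codex" but behave differently.
codex_with_app_server = SessionCapabilities(True, PushMode.CODEX_APP_SERVER)
codex_plain = SessionCapabilities(False, PushMode.NONE)
```

The two example sessions would be indistinguishable under a surface-string lookup, which is the failure the registration approach is meant to remove.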

D3. Separate routing truth from UX truth

Current publish_path conflates transport intent and user-visible surfacing. Recommended split:
  • route_path: how the backend attempted delivery
  • surface_path: how the receiving surface actually exposed it
Example values:
  • route_path = ws_push | queue_only
  • surface_path = claude_channel | codex_app_server | piggyback | startup_drain
Why:
  • a WS frame sent to a Codex bridge is not the same thing as a signal surfaced in the current model turn
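
The split can be pictured as a record with two independently set fields. This is an illustrative sketch: DeliveryRecord, mark_surfaced, and surfaced are hypothetical names, and only the literal values are taken from the example values above.

```python
from dataclasses import dataclass
from typing import Literal, Optional

RoutePath = Literal["ws_push", "queue_only"]
SurfacePath = Literal["claude_channel", "codex_app_server", "piggyback", "startup_drain"]


@dataclass
class DeliveryRecord:
    signal_id: str
    route_path: RoutePath                       # set by the backend at send time
    surface_path: Optional[SurfacePath] = None  # set only once the surface reports exposure

    def mark_surfaced(self, surface_path: SurfacePath) -> None:
        """Record how the receiving surface actually exposed the signal."""
        self.surface_path = surface_path

    @property
    def surfaced(self) -> bool:
        return self.surface_path is not None


record = DeliveryRecord("sig-1", route_path="ws_push")
surfaced_after_send = record.surfaced   # routing alone does not imply surfacing
record.mark_surfaced("piggyback")
surfaced_after_report = record.surfaced
```

Keeping surface_path empty until the surface reports back means a pushed WS frame cannot be mistaken for a signal the model has seen.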

D4. Close the same-turn race

One of these designs should become canonical:
Option 1: client ack before delivery stamp
  • backend does not mark delivered_at immediately after send_json
  • receiving bridge acks only after local strategy accepts the envelope
  • strongest semantic integrity
  • highest implementation cost
Option 2: short-lived just-pushed cache
  • keep current WS publish flow
  • backend tracks a short-lived Redis cache of signals pushed in the current turn
  • post-verb piggyback merge consults:
    • backend drain
    • local strategy buffer
    • just-pushed cache
  • lowest disruption to current architecture
Option 3: special-case prism_signal
  • prism_signal response may include the just-sent self-targeted or same-session envelope directly
  • narrower fix
  • does not solve the general “next arbitrary verb” race cleanly
Recommendation:
  • implement Option 2 first
  • evaluate Option 1 later if stronger semantics are needed everywhere
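
Under Option 2, the post-verb merge reads three sources and must still emit each signal once. A minimal sketch of that merge follows, assuming every signal is a dict carrying a signal_id key. merge_pending is a hypothetical helper, and the real mcp-node code is TypeScript, so this only illustrates the logic.

```python
from typing import Dict, Iterable, List


def merge_pending(*sources: Iterable[dict]) -> List[dict]:
    """Merge signal lists, keeping the first copy of each signal_id in source order."""
    seen: Dict[str, dict] = {}
    for source in sources:
        for signal in source:
            seen.setdefault(signal["signal_id"], signal)
    return list(seen.values())


backend_drain = [{"signal_id": "sig-1", "body": "a"}]
local_buffer = [{"signal_id": "sig-1", "body": "a"}, {"signal_id": "sig-2", "body": "b"}]
just_pushed = [{"signal_id": "sig-3", "body": "c"}, {"signal_id": "sig-2", "body": "b"}]

merged = merge_pending(backend_drain, local_buffer, just_pushed)
merged_ids = [s["signal_id"] for s in merged]
```

Because the first copy wins, the order of the arguments decides which source's payload is kept when the same signal shows up in more than one place.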

D5. Codex app-server is primary UX, piggyback is fallback durability

For Codex sessions with app-server injection enabled:
  • primary user-facing path should be app-server injection
  • piggyback should remain durability fallback and recovery path
Do not design future Codex logic around piggyback as the main UX path.

Build

B1. Backend

Touch points:
  • backend/app/services/signal_service.py
  • backend/app/routers/session_stream.py
Build tasks:
  1. Introduce a short-lived Redis just-pushed visibility mechanism.
  2. Expose an additive field on /signal/poll, e.g. just_pushed_signals: [], so rollout stays backward-compatible.
  3. Replace _publish_path_for_surface(...) with capability-driven resolution.
  4. Stop reporting Codex as piggyback-only when session metadata says otherwise.
  5. Ensure explicit drains remain idempotent and dedup-safe.
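
Build task 5 can be illustrated with a per-session ledger of signal ids that were already handed out. This is one possible approach, sketched under assumptions: DrainLedger and drain are hypothetical names, state is held in memory here, and the real backend may track this through delivered_at or another persisted marker.

```python
from typing import Dict, List, Set


class DrainLedger:
    """Hypothetical per-session record of signal ids already handed to the client."""

    def __init__(self) -> None:
        self._handed_out: Dict[str, Set[str]] = {}

    def drain(self, session_id: str, candidates: List[dict]) -> List[dict]:
        """Return only candidates this session has not received before."""
        seen = self._handed_out.setdefault(session_id, set())
        fresh = []
        for signal in candidates:
            if signal["signal_id"] not in seen:
                seen.add(signal["signal_id"])
                fresh.append(signal)
        return fresh


ledger = DrainLedger()
pending = [{"signal_id": "sig-1"}, {"signal_id": "sig-2"}]

first_drain = ledger.drain("sess-a", pending)    # piggyback surfaces both signals
second_drain = ledger.drain("sess-a", pending)   # explicit drain repeats nothing
other_session = ledger.drain("sess-b", pending)  # another receiver is unaffected
```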

B2. mcp-node

Touch points:
  • mcp-node/src/server.ts
  • mcp-node/src/signalMerge.ts
  • mcp-node/src/bootstrap/stream.ts
  • mcp-node/src/surfaces/codex.ts
  • mcp-node/src/strategies/app_server_inject.ts
Build tasks:
  1. Merge from the new race-closure source in addition to backend poll and local strategy buffer.
  2. Keep pending_signals attachment universal for non-lifecycle verbs.
  3. Preserve dedup by signal_id.
  4. Surface capability metadata during bootstrap/registration if backend adopts explicit capability fields.

B3. No Codex-client fork assumed

Do not start by assuming an OpenAI-side client patch is required. Current evidence says:
  • Codex can preserve MCP result payloads
  • at least one live failure occurred before a piggyback field even reached the MCP response
Only revisit Codex-client behavior after Prism-side fixes land.

Test

T1. Unit / integration

  1. prism_signal self-targeted, same session, delivery_class=async:
    • response must include pending_signals when classified as piggyback fallback
    • no duplicate on subsequent explicit drain
  2. prism_signal self-targeted, same session, delivery_class=sync:
    • same expectations as async
    • verify timing does not regress for the active-conversation path
  3. pushed-via-WS then immediate follow-on verb:
    • no lost signal between WS stamp and piggyback merge
  4. Codex capability registration:
    • app-server enabled session does not report piggyback-only semantics
  5. dedup:
    • same signal seen from local buffer plus backend poll merges once

T2. Surface tests

  1. Claude Code improved:
    • channel path still works
    • silent doorbell/drop cases from this session no longer reproduce
  2. Codex app-server enabled:
    • signal appears through app-server path
    • fallback drain still works
  3. Codex app-server disabled:
    • signal remains piggyback-capable
    • same-turn response includes pending_signals

T3. Live operator test

Re-run the Donna/Texi round-trip protocol subset:
  1. self-targeted probe
  2. Donna -> Texi direct signal
  3. wildcard broadcast
  4. 5x burst
Success criteria:
  • no missing signals
  • no duplicate signals
  • honest publish_path or successor fields
  • Codex no longer depends on explicit prism_signals_pending for basic same-turn visibility when fallback semantics are claimed

Deploy

Dp1. Sequence

  1. file/approve spec
  2. implement backend + mcp-node
  3. run local and LAN smokes
  4. deploy backend first
  5. deploy mcp-node / launcher consumers second
  6. re-run live Codex and Claude cross-surface signal tests
Rollout rule:
  • backend field additions must be backward-compatible
  • old mcp-node ignores new poll fields
  • new mcp-node consumes them when present
  • no deploy step should create a broken mixed-version window
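
The rollout rule can be checked against all four backend/client version pairings. The sketch below assumes the existing poll body carries its signals under a key named "signals", which is a placeholder for whatever the real field is called. old_client_read and new_client_read are hypothetical readers written in Python for illustration.

```python
from typing import Dict, List


def old_client_read(poll_body: dict) -> List[dict]:
    """Pre-change consumer: reads only the field it knows and ignores the rest."""
    return list(poll_body.get("signals", []))


def new_client_read(poll_body: dict) -> List[dict]:
    """Post-change consumer: also reads just_pushed_signals when it is present."""
    combined = list(poll_body.get("signals", []))
    combined += list(poll_body.get("just_pushed_signals", []))
    unique: Dict[str, dict] = {}
    for signal in combined:
        unique.setdefault(signal["signal_id"], signal)
    return list(unique.values())


old_backend_body = {"signals": [{"signal_id": "sig-1"}]}
new_backend_body = {
    "signals": [{"signal_id": "sig-1"}],
    "just_pushed_signals": [{"signal_id": "sig-2"}],
}

# Every pairing must parse; only new client + new backend sees the extra signal.
matrix = {
    ("old_client", "old_backend"): old_client_read(old_backend_body),
    ("old_client", "new_backend"): old_client_read(new_backend_body),
    ("new_client", "old_backend"): new_client_read(old_backend_body),
    ("new_client", "new_backend"): new_client_read(new_backend_body),
}
```

Deploying the backend first means the only mixed window is old client with new backend, where the extra field is ignored.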

Dp2. Rollout caution

This change touches delivery semantics and operator observability. Deploy with:
  • strong logs around route/surface classification
  • temporary metrics for race-hit detection
  • explicit note in changelog that Codex publish-path semantics changed
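
One way to make race hits countable is sketched below. The definition used here is an assumption of the example: a race hit is a signal that only the just-pushed cache made visible in a given merge. The logger name, the metric name signal_race_hit_total, and record_race_hits are hypothetical.

```python
import logging
from collections import Counter
from typing import Iterable, Set

log = logging.getLogger("prism.signal_visibility")  # hypothetical logger name
metrics: Counter = Counter()


def record_race_hits(backend_ids: Iterable[str], local_ids: Iterable[str],
                     just_pushed_ids: Iterable[str]) -> Set[str]:
    """Count signals that only the just-pushed cache made visible in this merge."""
    rescued = set(just_pushed_ids) - set(backend_ids) - set(local_ids)
    for signal_id in sorted(rescued):
        metrics["signal_race_hit_total"] += 1
        log.info("race hit: signal_id=%s rescued by just-pushed cache", signal_id)
    return rescued


# "b" was also in the local buffer, so only "c" counts as a race hit.
rescued = record_race_hits(backend_ids=["a"], local_ids=["b"], just_pushed_ids=["b", "c"])
```

A counter that stays at zero after rollout would suggest the race is rarer than assumed, while a steadily rising one would show how often the cache is carrying same-turn visibility.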

Final Recommendation

Proceed as a Prism-side remediation, not a Codex-blame exercise. Priority order:
  1. fix the same-turn piggyback race
  2. replace stale Codex surface guessing with capability registration
  3. treat Codex app-server injection as the primary UX path
  4. only then reassess whether any remaining Codex prompt-surfacing gap still needs a surface-specific fix
If only one change can ship first, ship the race closure. It is the shortest path from today’s contradictory behavior to honest and testable delivery semantics.
Last modified on June 7, 2026