1. Problem & framing
This plan ratifies work already shipped (§4) and adds the genuinely new model-acted ACK extension (§5). It is not a gate on Texi’s first-pass implementation, which landed during this morning’s session. Step 1 (§9) is therefore “Texi PRs the existing first-pass,” not “Texi codes the first-pass.”
The remaining gap, called out in Retro #11:
Backend delivered_at and publish_path prove transport progress, not model-visible action.
Texi’s runtime_diagnostics snapshot closes the observability half (stream / strategy / surface wake / tickler events visible per-process). What’s still missing:
- Durability: events live in an in-process 80-entry ring; lost on shim restart, not queryable across surfaces.
- Model-acted signal: no explicit record that the agent’s model context actually consumed the doorbell beyond drain.
- Cross-process query verb: prism_runtime_diagnostics is per-runtime; no joined view per signal_id.
§5 closes those three. Everything else is documentation and validation matrix.
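To make the durability gap concrete, here is a minimal Python sketch of a bounded in-process ring. It is not the shipped TypeScript code; `DiagnosticsRing` and `RING_CAPACITY` are hypothetical names, and only the 80-entry capacity comes from the text above.

```python
from collections import deque

RING_CAPACITY = 80  # the 80-entry ring size named in this plan


class DiagnosticsRing:
    """Illustrative in-process ring: the oldest events fall off once it is full."""

    def __init__(self, capacity=RING_CAPACITY):
        self._events = deque(maxlen=capacity)

    def record(self, event):
        # deque(maxlen=...) silently discards the oldest entry when full.
        self._events.append(event)

    def snapshot(self):
        return list(self._events)


ring = DiagnosticsRing()
for seq in range(100):
    ring.record({"seq": seq})

# Only the newest 80 events survive; a fresh process starts from nothing.
survivors = ring.snapshot()
after_restart = DiagnosticsRing().snapshot()
```

A restart is modelled here as constructing a new object, which is why nothing recorded earlier can be queried afterwards.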
2. Scope
In-scope (this plan):
- End-to-end observability of stages 1–9 (signal_created → backend publish → backend WS frame → MCP stream receipt → strategy delivery → surface wake → model turn → drain → reply).
- Closing the transport-vs-model gap with a model-acted ACK protocol.
- A paced validation matrix between Donna and Texi before fleet rollout.
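The nine stages listed above can be written down as an ordered enum, which makes "which stages were observed" a mechanical check. This is a sketch only: the member names are labels chosen here for the stage list, not identifiers confirmed in the codebase.

```python
from enum import IntEnum


class Stage(IntEnum):
    # Numbering follows the stage order given in the scope list.
    SIGNAL_CREATED = 1
    BACKEND_PUBLISH = 2
    BACKEND_WS_FRAME = 3
    MCP_STREAM_RECEIPT = 4
    STRATEGY_DELIVERY = 5
    SURFACE_WAKE = 6
    MODEL_TURN = 7
    DRAIN = 8
    REPLY = 9


def missing_stages(observed):
    """Return the stages, in pipeline order, that have no recorded event."""
    seen = set(observed)
    return [stage for stage in Stage if stage not in seen]


def last_contiguous_stage(observed):
    """Return the furthest stage reached without a gap, or None if stage 1 is absent."""
    seen = set(observed)
    last = None
    for stage in Stage:
        if stage not in seen:
            break
        last = stage
    return last
```

For a probe where stages 1–6 were observed, `missing_stages` returns stages 7–9 and `last_contiguous_stage` returns `Stage.SURFACE_WAKE`.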
Out-of-scope (deferred):
- Always-on per-persona daemon wake path (SPEC-070 — separate effort, lane: Texi).
- Surface-adapter rendering bugs (e.g. PeerJoined Unknown; postmortem 74b62026, Donna 5-line patch lane).
- Doorbell durability across the prism_wrap → prism_start cycle (Porsche open question).
2.1 Texi Close-Out Status
Frank approved closing the residual end-to-end before returning to Plan #10
review. The original plan is now ratification of shipped first-pass work plus
the remaining trace/ACK implementation.
Implemented locally:
- Durable trace_id on signal_queue, included in signal wire payloads and pending-signal drain payloads.
- Durable signal_trace_events table with ordered per-trace stage events.
- Backend trace endpoints for event recording, model ACK, and trace query.
- MCP verbs prism_signal_trace and prism_signal_ack.
- MCP stream/runtime propagation of trace_id through frame receipt, adapter-delivery diagnostics, and visible wake prompts.
- Focused smoke coverage for Codex publish path, trace_id wire payload, and controller-registration lock helper.
Validated locally:
npm --prefix mcp-node run build
PYTHONPATH=backend backend/.venv/bin/python -m pytest backend/tests/test_signal_publish_path.py backend/tests/test_spec038_imports.py
PYTHONPATH=backend backend/.venv/bin/python -m py_compile ...
Deployment/formal validation pending:
- Alembic migration 035_signal_wake_trace.py must be applied to the live backend database.
- Backend and MCP runtimes must be restarted so the new endpoints and verbs are available to live agents.
- Formal V1-V5 must run twice after deployment. Local implementation is ready, but this session could not deploy to server1.home.lan because SSH rejected the available credentials and local Postgres on localhost:5433 was not running.
3. Lanes
| Lane | Owner | Surface |
|---|---|---|
| Backend signal pipeline + new verb | Donna | backend/app/services/signal_service.py, new prism_signal_trace verb |
| MCP runtime + surface adapters | Texi | mcp-node/src/runtime_diagnostics.ts, bootstrap/idle_tickler.ts, strategies/* |
Per memory feedback_engineering_authority.md: each owner edits within their lane; the other reviews + smokes only.
4. First Pass — Already Shipped (Texi, uncommitted local)
Land status: working tree, not yet PR’d (verified via git diff --stat HEAD -- mcp-node/).
Components (~290 new lines + ~185 insertions across 9 files):
- mcp-node/src/runtime_diagnostics.ts (217 lines): in-process state for stream open/close/error, signal frame receipt, strategy delivery result, surface wake result, tool calls, drains, signal sends.
- mcp-node/src/bootstrap/idle_tickler.ts (73 lines): 4-minute stale threshold, registered+not-wrapped prerequisite, active-turn suppression.
- prism_runtime_diagnostics MCP verb (verbs/coordination.ts): query interface for the above.
- Strategy instrumentation (channels_push.ts, app_server_inject.ts): per-stage event emission.
- Codex app-server turn/start + turn/steer adapter; Claude Code maintenance channel tick.
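The idle tickler's three conditions (stale threshold, registered and not wrapped, no active turn) reduce to one predicate. The sketch below is Python rather than the shipped TypeScript, and `RuntimeState`, `should_tick`, the field names, and the `>=` comparison at the threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass

STALE_THRESHOLD_S = 4 * 60  # 4-minute stale threshold from the component list


@dataclass
class RuntimeState:
    registered: bool         # persona has registered with the backend
    wrapped: bool            # session has already been wrapped
    turn_active: bool        # a model turn is currently running
    last_activity_ts: float  # seconds, on the same clock as `now`


def should_tick(state: RuntimeState, now: float) -> bool:
    """Decide whether a maintenance tick is due for this runtime."""
    # Prerequisite: registered and not wrapped.
    if not state.registered or state.wrapped:
        return False
    # Active-turn suppression: never tick into a running turn.
    if state.turn_active:
        return False
    # Stale check: tick only once the idle window reaches the threshold.
    return (now - state.last_activity_ts) >= STALE_THRESHOLD_S
```

Passing `now` in explicitly keeps the predicate deterministic, so it can be unit-tested without a real clock.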
Validation status: Texi ran a wake-diagnostic probe (signal bc553d7f) against Donna at 2026-05-04 ~16:49Z. Donna ACK’d via signal 34fc1ebb (publish_path: buffered_for_piggyback, woken via channel push, stages 1–6 observed OK).
5. Joint Additions — Donna Lane
5.1 trace_id in signal frame
- Mint a UUID trace_id at signal_created (backend, signal_service.py).
- Propagate through every stage: persisted on the row, included in every WS frame, in every MCP stream event, in every surface adapter event, in the model-acted ACK envelope.
- One trace_id ties every stage event to a single signal across surfaces and processes.
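A minimal sketch of the mint-once, copy-everywhere rule follows. The function names and dict shapes are hypothetical; the actual row schema, frame format, and `signal_service.py` internals are not shown in this plan.

```python
import uuid


def create_signal(sender, recipient, body):
    """Mint trace_id exactly once, at signal_created, and keep it on the row."""
    return {
        "signal_id": uuid.uuid4().hex,  # illustrative id format, an assumption
        "trace_id": str(uuid.uuid4()),
        "sender": sender,
        "recipient": recipient,
        "body": body,
    }


def to_ws_frame(row):
    """Build a WS frame; trace_id is copied from the row, never re-minted."""
    return {
        "type": "signal",
        "signal_id": row["signal_id"],
        "trace_id": row["trace_id"],
        "body": row["body"],
    }


def stage_event(trace_id, stage, source, outcome="ok"):
    """Build a stage event that any process can emit for the same trace."""
    return {"trace_id": trace_id, "stage": stage, "source": source, "outcome": outcome}


row = create_signal("texi", "donna", "wake probe")
frame = to_ws_frame(row)
event = stage_event(frame["trace_id"], "mcp_stream_receipt", "mcp-node")
```

Because downstream stages only ever read `trace_id` from what they receive, the same value joins backend, MCP runtime, and surface events.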
5.2 prism_signal_trace verb
Read-only query: prism_signal_trace(trace_id) → ordered timeline of every recorded stage event for that signal across backend, MCP runtime, surface adapter, and model ACK.
Returns: [{stage, ts, source, outcome, payload_meta}]. Cheap, idempotent, non-mutating. Replaces ad-hoc log-grepping during paced probes.
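The return shape can be sketched as a pure function over recorded events. Here an in-memory list stands in for the signal_trace_events table, and `signal_trace` is a hypothetical helper, not the verb's real implementation.

```python
TRACE_FIELDS = ("stage", "ts", "source", "outcome", "payload_meta")


def signal_trace(events, trace_id):
    """Return the ordered timeline for one trace without mutating `events`."""
    matching = [e for e in events if e["trace_id"] == trace_id]
    # sorted() is stable, so events with equal timestamps keep insertion order.
    ordered = sorted(matching, key=lambda e: e["ts"])
    # Project each event onto the documented return fields only.
    return [{field: e.get(field) for field in TRACE_FIELDS} for e in ordered]


recorded = [
    {"trace_id": "t1", "stage": "surface_wake", "ts": 3.0, "source": "adapter", "outcome": "ok"},
    {"trace_id": "t2", "stage": "signal_created", "ts": 0.5, "source": "backend", "outcome": "ok"},
    {"trace_id": "t1", "stage": "signal_created", "ts": 1.0, "source": "backend", "outcome": "ok"},
]
timeline = signal_trace(recorded, "t1")
```

Calling it twice with the same inputs yields the same result, which matches the "idempotent, non-mutating" requirement.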
5.3 Model-acted ACK protocol
A new lightweight ACK separate from prism_signal reply:
- Surface adapter, on doorbell delivery to the model context, records delivered_to_surface_at.
- The next model turn that observes the doorbell SHOULD emit prism_signal_ack(trace_id) as its first verb call. Records model_acted_at.
- Gap (model_acted_at − delivered_to_surface_at) is the transport-to-model latency, the metric the retro called out as currently unmeasurable.
- Failure mode: doorbell with no model ACK within window N → flagged in prism_runtime_diagnostics and surfaces in next bootstrap rules_reminders.
This ACK is a diagnostic primitive, not a replacement for content replies. Both can be sent in the same turn.
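The latency metric and the missed-ACK failure mode can be sketched in a few lines. The function names are hypothetical, timestamps are assumed to be seconds on a shared clock, and the window N is left as a parameter because this plan does not fix its value.

```python
from typing import Optional


def transport_to_model_latency(
    delivered_to_surface_at: float, model_acted_at: Optional[float]
) -> Optional[float]:
    """Return model_acted_at - delivered_to_surface_at, or None if no ACK yet."""
    if model_acted_at is None:
        return None
    return model_acted_at - delivered_to_surface_at


def ack_missed(
    delivered_to_surface_at: float,
    model_acted_at: Optional[float],
    now: float,
    window_s: float,
) -> bool:
    """True when a doorbell was delivered but no ACK arrived within the window."""
    if model_acted_at is not None:
        return False
    return (now - delivered_to_surface_at) > window_s
```

For example, a doorbell delivered at t=100.0 and ACKed at t=101.5 has a latency of 1.5 s, inside the V1 budget of 2 s.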
6. Validation Matrix — Joint
Pre-fleet smoke. Run after both lanes land + dist reload. Texi-driver, Donna-responder; then swap.
| # | Scenario | Expected stages | Pass criterion |
|---|---|---|---|
| V1 | Active-foreground baseline | 1–9 all observed | model_acted_at < 2s after delivered_to_surface_at |
| V2 | Backgrounded Claude tab | 1–6 observed; 7+ delayed | trace shows wake gap, ACK fires within 5s of refocus |
| V3 | Codex post-shim-respawn | 1–9 with adapter restart event | trace_id continuity across restart |
| V4 | Burst (10 signals/sec) | All trace_ids resolved | no stage skipped, no duplicate ACK |
| V5 | Cross-machine LAN (mini3 → server1 → mini3) | 1–9 with WS hop | latency budget < 500ms p95 |
Fleet rollout gated on all 5 passing twice consecutively (one Donna→Texi, one Texi→Donna).
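Two checks in the matrix are arithmetic and can be sketched directly: the V5 p95 latency budget and the "twice consecutively, one per direction" gate. The nearest-rank percentile method, the run record shape, and the direction labels are assumptions for illustration.

```python
SCENARIOS = ("V1", "V2", "V3", "V4", "V5")
DIRECTIONS = {"donna->texi", "texi->donna"}  # hypothetical labels


def p95(samples_ms):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    if not samples_ms:
        raise ValueError("p95 needs at least one sample")
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def rollout_gate(runs):
    """True when the last two runs pass all five scenarios, one per direction.

    Each run is assumed to look like:
    {"direction": "donna->texi", "results": {"V1": True, ...}}
    """
    if len(runs) < 2:
        return False
    last_two = runs[-2:]
    all_pass = all(run["results"].get(s, False) for run in last_two for s in SCENARIOS)
    both_directions = {run["direction"] for run in last_two} == DIRECTIONS
    return all_pass and both_directions
```

V5 would then pass when `p95(latencies_ms) < 500`, and a single failed scenario in either of the last two runs keeps the gate closed.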
7. Surface-Specific Notes
Per first-pass retro:
- Codex: app-server/turn/start and app-server/turn/steer are the wake primitives. Idle tickler is sufficient for ≤4-min idle windows.
- Claude Code: channel notification is primary. The only fallback for a REPL/terminal poke is the stricter-flag REPL nudge, which adds risk of input-stuffing collisions and is kept as Phase 2 if V2 fails.
Maintenance ticks must NOT consume the same coalescing slot as real signal doorbells (locked invariant from retro).
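One way to hold this invariant is to give each wake kind its own pending slot, so a maintenance tick can never overwrite or absorb a doorbell. The class below is a hypothetical sketch, not the shipped coalescing logic.

```python
class WakeCoalescer:
    """Keeps doorbells and maintenance ticks in separate pending slots."""

    def __init__(self):
        self._doorbell_trace_ids = []       # every pending doorbell is kept
        self._maintenance_pending = False   # repeated ticks collapse into one

    def offer_doorbell(self, trace_id):
        self._doorbell_trace_ids.append(trace_id)

    def offer_maintenance(self):
        # Touches only the maintenance slot; the doorbell slot is unaffected.
        self._maintenance_pending = True

    def take(self):
        """Return (doorbell_trace_ids, maintenance_pending) and reset both slots."""
        taken = (self._doorbell_trace_ids, self._maintenance_pending)
        self._doorbell_trace_ids = []
        self._maintenance_pending = False
        return taken


coalescer = WakeCoalescer()
coalescer.offer_maintenance()
coalescer.offer_doorbell("trace-1")
coalescer.offer_maintenance()
pending = coalescer.take()
```

In this run the doorbell's trace_id survives two surrounding maintenance ticks, and the two ticks collapse into a single pending flag.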
8. Open Questions for Frank
- Approve scope? Does the trace_id + prism_signal_trace + model-acted ACK trio match what you want, or do you want something narrower (just the verb) or broader (also doorbell durability across wrap/start)?
- Validation cadence: run V1–V5 sequentially in one session, or spread over multiple sessions to capture realistic background/idle conditions?
- Where do trace events persist? Postgres only (durable, queryable, Plan-#10-aligned with Postgres-as-long-term), or also a Redis ring buffer for fast in-process query? Recommend PG-only for v1; Redis if perf matters later (per feedback_optimize_later.md).
- rules_reminders surfacing: should a missed ACK trigger a rules_reminders entry on next bootstrap, or is that too noisy? Recommend yes, with a per-trace cooldown.
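If the per-trace cooldown recommendation is accepted, it could look roughly like the sketch below. `MissedAckReminders` is a hypothetical name, the cooldown length is left to the caller because no value has been decided, and state is kept in memory purely for illustration.

```python
class MissedAckReminders:
    """Suppresses repeat reminders for the same trace inside a cooldown window."""

    def __init__(self, cooldown_s):
        self.cooldown_s = cooldown_s
        self._last_reminded = {}  # trace_id -> timestamp of the last reminder

    def should_remind(self, trace_id, now):
        """Return True and start a new cooldown if a reminder is allowed."""
        last = self._last_reminded.get(trace_id)
        if last is not None and (now - last) < self.cooldown_s:
            return False
        self._last_reminded[trace_id] = now
        return True
```

Keying the cooldown by trace_id means one noisy trace cannot suppress reminders for a different missed ACK.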
9. Sequencing
Step 0: Frank approves this plan markdown ← gate
Step 1: Texi commits first-pass + opens PR; Donna reviews + smokes (no edits)
Step 2: Donna implements §5.1 trace_id-in-frame (backend); Texi reviews
Step 3: Donna implements §5.2 prism_signal_trace verb; Texi reviews
Step 4: Joint implementation §5.3 model-acted ACK (Donna backend, Texi adapter); cross-review
Step 5: Run V1–V5 validation matrix twice; record traces in retro
Step 6: Fleet rollout (default-on for runtime_diagnostics; ACK protocol recommended-not-required initially)
Step 7: Postmortem + retro after 1 week of fleet data
Each step ships independently. No big-bang merge.
10. Memory & Postmortem Hooks
feedback_eliminate_failures_improve_perf.md — every new verb gets structured-failure returns + duration_s tracking.
feedback_postmortem_on_every_error.md — any V1–V5 failure files a postmortem inline.
feedback_completion_means_deployed.md — Steps 1–6 are not “done” until merged + deployed + smoke green.
project_signal_isolation_multitenant.md — prism_signal_trace must enforce the same membership-only authorization as prism_signals_pending (no cross-tenant trace exposure).
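Two of these hooks have a direct code shape: structured-failure returns with duration_s, and a membership check before any trace data is returned. The sketch below assumes membership can be modelled as a set of allowed callers per trace; the real rule used by prism_signals_pending is not described here, and all names are hypothetical.

```python
import time


def run_verb(name, fn, *args):
    """Run a verb body and always return a structured result with duration_s."""
    started = time.monotonic()
    try:
        payload = {"ok": True, "verb": name, "result": fn(*args)}
    except PermissionError as exc:
        payload = {"ok": False, "verb": name,
                   "error": {"code": "forbidden", "message": str(exc)}}
    except Exception as exc:  # any other failure is still returned, not raised
        payload = {"ok": False, "verb": name,
                   "error": {"code": "internal", "message": str(exc)}}
    payload["duration_s"] = time.monotonic() - started
    return payload


def trace_for_member(caller, trace_members, events, trace_id):
    """Return trace events only if the caller is a member for this trace."""
    if caller not in trace_members.get(trace_id, set()):
        raise PermissionError("caller is not a member for this trace")
    return [e for e in events if e["trace_id"] == trace_id]


members = {"t1": {"donna", "texi"}}
events = [{"trace_id": "t1", "stage": "signal_created"}]
allowed = run_verb("prism_signal_trace", trace_for_member, "donna", members, events, "t1")
denied = run_verb("prism_signal_trace", trace_for_member, "mallory", members, events, "t1")
```

An unknown trace_id and a non-member caller both produce the same "forbidden" result, so the response does not reveal whether a trace exists.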
Last modified on June 7, 2026