Marconi disaster-recovery runbook
Operator-facing companion to SPEC-101 Stage 0 v0.2 §2 (recovery invariants). Each section pairs an invariant with the on-call action sequence.
v0.2 scope. The architecture is three tiers: RAM hot path → Redis Stream (rolling 7d cache, durable boundary) → PG (audit). R4 (spool disk full) and R7 (spool corruption) from earlier drafts are gone: there is no spool. Surviving invariants: R1, R2, R3, R5, R6, R8.
For metric definitions see SPEC-101 v0.2 §9. For loss-budget reasoning see SPEC-101 Stage 0 §1.
Triage shortcut
| Symptom (dashboard / alert) | Likely scenario | Section |
|---|---|---|
| Backend pod restart, signals briefly route to no_subscriber | Graceful restart | R1 |
| Backend pod OOM/SIGKILL, post-restart signals briefly route to no_subscriber | Hard kill | R2 |
| Recipient agent shows missed signals on resume despite mesh healthy | WS disconnect mid-delivery | R3 |
| marconi_redis_writer_lag_seconds rising; marconi_ring_depth rising | Redis unreachable | R5 |
| marconi_pg_archiver_lag_seconds rising | PG unreachable | R6 |
| marconi_pg_archiver_lag_seconds > 7d (paging) | Redis Stream trimmed past archiver | R8 |
| marconi_ring_overwrite_total > 0 | Redis-down outage exceeded ring capacity; signals lost | Loss event |
R1 — Graceful restart
Trigger. SIGTERM, container restart, deploy.
Expected behavior.
- 1-2s window where new signals route to no_subscriber, queued_offline while shims reconnect.
- Routing/registration tables rehydrate from the Redis active-registration sorted set.
- Redis Stream + PG archiver resume from their checkpoints.
- In-flight ring entries are flushed to Redis Stream during graceful shutdown’s last-resort sync barrier.
- Zero accepted signals lost.
On-call action.
- Confirm marconi_routing_table_size{tenant} returns to pre-restart value within 5s of pod ready.
- Confirm marconi_redis_writer_lag_seconds returns to ~0 within 30s.
- Confirm marconi_pg_archiver_lag_seconds returns to ~0 within 30s.
- Spot-check a pending-signal drain on a representative agent; signals queued during the warmup window must drain.
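The three metric-based checks above can be sketched as a pass/fail evaluation over samples pulled from the dashboard. This is an illustration only: the `Sample` type, the `LAG_EPSILON_S` tolerance standing in for "~0", and the reading of "returns to pre-restart value" as "at least the pre-restart value" are assumptions, not part of Marconi or SPEC-101.

```python
from dataclasses import dataclass

# Hypothetical tolerance for "~0" lag; the runbook does not define a number.
LAG_EPSILON_S = 1.0


@dataclass(frozen=True)
class Sample:
    """One metric reading, taken `t` seconds after the pod became ready."""
    t: float
    value: float


def recovered_within(samples, deadline_s, ok):
    """True if any sample at or before `deadline_s` satisfies `ok`."""
    return any(s.t <= deadline_s and ok(s.value) for s in samples)


def r1_checks(pre_restart_table_size, table_size, redis_lag, pg_lag):
    """Evaluate the three metric-based R1 checks; returns name -> passed."""
    return {
        # Assumption: "returns to" is read as "at least the old size".
        "routing_table_size": recovered_within(
            table_size, 5.0, lambda v: v >= pre_restart_table_size),
        "redis_writer_lag": recovered_within(
            redis_lag, 30.0, lambda v: v <= LAG_EPSILON_S),
        "pg_archiver_lag": recovered_within(
            pg_lag, 30.0, lambda v: v <= LAG_EPSILON_S),
    }
```

A failing entry in the returned mapping tells the operator which check to look at first; the spot-check of the pending-signal drain remains a manual step.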
R2 — Hard kill
Trigger. SIGKILL, OOM, host power loss, kernel panic.
Expected behavior.
- Same recovery path as R1.
- In-flight ring entries that hadn’t been written to Redis Stream at the time of kill are lost.
- Loss bounded by ring writer lag at kill time (typically sub-second under normal load).
On-call action.
- Run R1 checks 1-4 above.
- Compare marconi_signals_received_total{tenant} against pre-kill value plus expected gap; the gap should be roughly equal to the writer lag at kill time.
- If the gap is materially larger than expected writer lag, the ring writer was wedged before kill; investigate as a separate ring-writer-stall incident.
- File postmortem if loss exceeded budget.
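Since writer lag is measured in seconds and the gap in signals, comparing them needs an ingest rate. The sketch below shows one way to do that arithmetic. The function names, the per-second ingest rate input, and the `tolerance` multiplier standing in for "materially larger" are all assumptions for illustration; the runbook does not define them.

```python
def expected_loss(ingest_rate_per_s: float, writer_lag_s: float) -> float:
    """Estimate of signals still in the ring (not yet in Redis) at kill time."""
    return ingest_rate_per_s * writer_lag_s


def writer_was_wedged(observed_gap: float, ingest_rate_per_s: float,
                      writer_lag_s: float, tolerance: float = 3.0) -> bool:
    """True if the observed gap is far larger than the writer lag explains.

    `tolerance` is a hypothetical multiplier for "materially larger".
    """
    bound = tolerance * expected_loss(ingest_rate_per_s, writer_lag_s)
    return observed_gap > bound
```

For example, at 200 signals/s and 0.5s of writer lag the estimate is about 100 signals, so an observed gap of 5000 would point to a stalled ring writer rather than ordinary in-flight loss.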
R3 — Recipient WS disconnect mid-delivery
Trigger. Shim WS drops between Marconi’s push and the recipient’s ack of the frame.
Expected behavior.
- Signal is in the Redis Stream with outcome=pushed, delivery_state=awaiting_ack. outcome=pushed is a delivery attempt, not final delivery. Final delivery requires an application-level ACK (shim ACK frame, prism_signal_ack with ack_kind ∈ {model_acted, surface_observed}, or pending-signal drain on a subsequent session).
- Pending-signal index keeps the entry replay/drain-eligible until ACK evidence lands.
- Recipient reconnects → fresh push or prism_signals_pending drain promotes delivery_state to acked; entry released.
- Zero signals lost.
On-call action.
- If a recipient reports missed signals, confirm prism_signals_pending returns the missing envelope (it should, while delivery_state=awaiting_ack).
- Inspect marconi_signals_delivered_total{outcome=pushed} minus marconi_signals_acked_total for the tenant; the gap is the in-flight-without-ack window. Sustained growth points to recipient-side WS instability or a missing ACK path in a shim.
- If prism_signals_pending returns empty but the Redis Stream for that tenant + window contains the entry, escalate: either the pending-signal index is broken (Marconi §3.5 contract violation) or the entry was incorrectly promoted to acked without ACK evidence (delivery-evidence model violation).
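The gap calculation and the escalation rule above can be sketched as follows. The helper names, the "three strictly increasing readings" definition of sustained growth, and the third triage branch (entry in neither place, which this section does not cover) are assumptions made for illustration.

```python
def in_flight_without_ack(pushed_total: int, acked_total: int) -> int:
    """Pushed-but-not-yet-acked signals for one tenant (never negative)."""
    return max(pushed_total - acked_total, 0)


def sustained_growth(gaps, min_points: int = 3) -> bool:
    """True if the last `min_points` gap readings are strictly increasing."""
    tail = list(gaps)[-min_points:]
    return len(tail) == min_points and all(
        a < b for a, b in zip(tail, tail[1:]))


def r3_triage(in_pending: bool, in_stream: bool) -> str:
    """Map the two lookups from the action list onto a next step."""
    if in_pending:
        return "expected: entry is awaiting_ack and drain-eligible"
    if in_stream:
        return "escalate: index broken or entry promoted without ACK evidence"
    # Not described in this runbook section; flagged for manual review.
    return "not covered by R3: entry absent from index and stream"
```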
R5 — Redis down
Trigger. Redis container down, network partition, AOF/RDB load. Alert: marconi_redis_writer_lag_seconds sustained rise; marconi_ring_depth rising.
Expected behavior.
- Ring buffers; Redis writer fails on every batch with backoff. marconi_redis_writer_errors_total{reason} increments.
- Signals deliver normally over the hot path (recipient WS push is unaffected).
- Zero signals lost as long as the ring absorbs the outage.
On-call action.
- Recover Redis (page Redis on-call if not self-healing).
- Confirm Redis writer resumes from last-acknowledged ring offset; marconi_redis_writer_lag_seconds falls toward 0 within minutes.
- Watch marconi_ring_depth; it must drain proportionally as the writer catches up.
- If outage approaches ring capacity (marconi_ring_depth → max_entries OR marconi_redis_writer_lag_seconds → max_age_seconds), prepare for the loss event:
  - Increase ring max_entries or max_age_seconds if memory budget allows.
  - If ring overflows, see Loss event.
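To judge how close an outage is to ring capacity, it can help to turn the two limits into a single headroom figure and a rough time-to-full. The sketch below assumes the operator can read the current ring depth, writer lag, and ingest rate from the dashboard; the function names are hypothetical, and the time estimate assumes the writer drains nothing while Redis is down.

```python
def ring_headroom(depth: int, max_entries: int,
                  writer_lag_s: float, max_age_seconds: float) -> float:
    """Remaining capacity (0.0 to 1.0) on the tighter of the two limits."""
    by_count = 1.0 - depth / max_entries
    by_age = 1.0 - writer_lag_s / max_age_seconds
    return max(min(by_count, by_age), 0.0)


def seconds_until_full(depth: int, max_entries: int,
                       ingest_rate_per_s: float) -> float:
    """Time until the ring fills, assuming no entries are drained."""
    if ingest_rate_per_s <= 0:
        return float("inf")
    return max(max_entries - depth, 0) / ingest_rate_per_s
```

With a depth of 7500 out of 10000 entries and an ingest rate of 50 signals/s, the ring would fill in roughly 50 seconds, which is the window available for raising max_entries before overwrites begin.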
R6 — PG down
Trigger. Postgres container down, schema-migration window, connection saturation. Alert: marconi_pg_archiver_lag_seconds sustained rise.
Expected behavior.
- Archiver reads from Redis Stream succeed; PG batch UPSERTs fail; archiver retries with backoff.
- Zero signals lost (Redis Stream is the source of truth at this tier).
On-call action.
- Recover PG (page PG on-call).
- Confirm marconi_pg_archiver_lag_seconds falls toward 0 as the archiver drains.
- Spot-check idempotency: re-running the archiver against the same Redis Stream window must not produce duplicate rows (signal_id UPSERT contract).
- If the outage runs longer than the Redis Stream MAXLEN ~ 7d, escalate to R8.
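The idempotency spot-check can be illustrated with a self-contained example. It uses an in-memory SQLite database as a stand-in for PG (both accept the `ON CONFLICT ... DO UPDATE` form used here); the `signal_audit` table, its columns, and `archive_batch` are hypothetical names, not the real archiver schema.

```python
import sqlite3


def archive_batch(conn, batch):
    """Upsert a batch keyed on signal_id; replaying it adds no rows."""
    conn.executemany(
        "INSERT INTO signal_audit (signal_id, tenant, outcome) "
        "VALUES (?, ?, ?) "
        "ON CONFLICT(signal_id) DO UPDATE SET "
        "tenant = excluded.tenant, outcome = excluded.outcome",
        batch,
    )
    conn.commit()


def replay_row_count():
    """Archive the same window twice and return the resulting row count."""
    conn = sqlite3.connect(":memory:")  # stand-in for the PG audit store
    conn.execute(
        "CREATE TABLE signal_audit ("
        "signal_id TEXT PRIMARY KEY, tenant TEXT, outcome TEXT)")
    window = [("sig-1", "t1", "pushed"), ("sig-2", "t1", "pushed")]
    archive_batch(conn, window)
    archive_batch(conn, window)  # simulated re-run after PG recovery
    return conn.execute("SELECT COUNT(*) FROM signal_audit").fetchone()[0]
```

If the row count after the second pass differs from the number of distinct signal_id values in the window, the UPSERT contract is not holding.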
R8 — Redis Stream trimmed past archiver checkpoint
Trigger. marconi_pg_archiver_lag_seconds > 7d (Redis MAXLEN trim horizon). Paging incident.
Expected behavior.
- Archiver reads return “stream trimmed.”
- Trimmed window is unrecoverable to PG audit; signals already delivered.
- Archiver resumes from the new stream head.
On-call action.
- Document the trimmed window: bracket from the archiver’s last successful checkpoint to the current Redis stream head.
- Investigate why the archiver fell so far behind; the likely cause is a PG outage (R6) compounded with insufficient archiver throughput.
- Long-term: increase MAXLEN retention OR provision additional archiver workers.
- File postmortem; the audit gap is the loss budget for this scenario.
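Bracketing the trimmed window is straightforward when the stream uses Redis auto-generated entry IDs, which have the form `<milliseconds>-<sequence>`. The sketch below converts a checkpoint ID and a head ID into a UTC time range for the postmortem. It assumes auto-generated IDs (explicitly assigned IDs need not be timestamps), and the function names are hypothetical.

```python
from datetime import datetime, timezone


def stream_id_time(stream_id: str) -> datetime:
    """UTC time encoded in the millisecond part of a Redis stream ID."""
    ms = int(stream_id.split("-", 1)[0])
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


def trimmed_window(checkpoint_id: str, head_id: str) -> dict:
    """Bracket the audit gap between the checkpoint and the stream head."""
    start = stream_id_time(checkpoint_id)
    end = stream_id_time(head_id)
    if end < start:
        raise ValueError("stream head is older than the checkpoint")
    return {
        "from": start.isoformat(),
        "to": end.isoformat(),
        "seconds": (end - start).total_seconds(),
    }
```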
Loss event — paging incident
Trigger. marconi_ring_overwrite_total > 0 for any tenant. Always paging.
What this means. Redis was unreachable longer than the ring’s max_age_seconds (or the ring filled before the writer could drain it). Oldest ring entries were overwritten before Redis ingested them. Those signals are lost from the durable record. They may have already been delivered to recipients (the hot path doesn’t depend on Redis), but they will not appear in Redis Stream history or in PG audit.
On-call action.
- Snapshot the counter immediately: per-tenant counts.
- Capture timestamp range from marconi_redis_writer_lag_seconds + marconi_redis_writer_errors_total to bracket the outage window.
- Confirm Redis is recovering or recover it (R5).
- Cross-reference recipient delivery: signals that delivered via the hot path during the window are recoverable from recipient-side memory if recipients are still online — operators may choose to re-emit them via a follow-up workflow.
- File postmortem with: total signals lost, time window, root cause (Redis outage / ring undersized / writer wedged), and remediation.
- Long-term: if this fires more than once, the ring is undersized for the realistic Redis-outage window or alerting is misconfigured.
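The first two actions (snapshot the counter, bracket the outage window) can be sketched over exported metric samples. The input shapes, the lag threshold, and the function names below are assumptions for illustration; how samples are exported from the monitoring system is outside this runbook.

```python
def snapshot_overwrites(counter_by_tenant: dict) -> dict:
    """Keep only tenants whose overwrite counter is above zero."""
    return {t: n for t, n in counter_by_tenant.items() if n > 0}


def outage_window(lag_samples, threshold_s: float):
    """First and last timestamps where writer lag exceeded `threshold_s`.

    `lag_samples` is an iterable of (unix_ts, lag_seconds) pairs.
    Returns None if the lag never crossed the threshold.
    """
    over = [ts for ts, lag in lag_samples if lag > threshold_s]
    return (min(over), max(over)) if over else None
```

The per-tenant snapshot and the bracketed window together supply the "total signals lost" and "time window" fields the postmortem asks for.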
Rollback procedures (per fine-grain feature flag)
For per-flag rollback recipes, see SPEC-101 Stage 0 §3. Each flag has a written forward + rollback contract; this runbook does not duplicate them.
Rule of thumb. MARCONI_HOT_PATH_SEND, MARCONI_REDIS_STREAM_WRITER, and MARCONI_PG_ARCHIVER_PRIMARY MUST flip in lockstep. Rolling back one without the others creates new failure modes outside the loss budget.
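A pre-flight guard for the lockstep rule might look like the sketch below. It assumes the three flags are available as boolean-like values in a mapping; how Marconi actually stores or reads its flags is not described here, so `in_lockstep` and the input shape are hypothetical.

```python
LOCKSTEP_FLAGS = (
    "MARCONI_HOT_PATH_SEND",
    "MARCONI_REDIS_STREAM_WRITER",
    "MARCONI_PG_ARCHIVER_PRIMARY",
)


def in_lockstep(flags: dict) -> bool:
    """True if all three flags are on together or off together.

    Raises KeyError if any of the three flags is missing from `flags`.
    """
    values = {bool(flags[name]) for name in LOCKSTEP_FLAGS}
    return len(values) == 1
```

Running a check like this against the intended post-change flag state, before applying a rollback, catches the case where only one of the three is being flipped.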
References
- SPEC-101 v0.2 — Marconi architecture (three-tier)
- SPEC-101 Stage 0 v0.2 — loss budget, recovery invariants, rollback
- ADR-56 — locks the MUST and the rename
- Test harness: backend/tests/test_marconi_recovery_invariants.py (R1, R2, R3, R5, R6, R8 stubs, filled per stage)

