Marconi disaster-recovery runbook

Operator-facing companion to SPEC-101 Stage 0 v0.2 §2 (recovery invariants). Each section pairs an invariant with the on-call action sequence.
v0.2 scope. The architecture is three tiers: RAM hot path → Redis Stream (rolling 7d cache, durable boundary) → PG (audit). R4 (spool disk full) and R7 (spool corruption) from earlier drafts are gone — there is no spool. Surviving invariants: R1, R2, R3, R5, R6, R8.
For metric definitions see SPEC-101 v0.2 §9. For loss-budget reasoning see SPEC-101 Stage 0 §1.

Triage shortcut

| Symptom (dashboard / alert) | Likely scenario | Section |
| --- | --- | --- |
| Backend pod restart, signals briefly route to no_subscriber | Graceful restart | R1 |
| Backend pod OOM/SIGKILL, post-restart signals briefly route to no_subscriber | Hard kill | R2 |
| Recipient agent shows missed signals on resume despite mesh healthy | WS disconnect mid-delivery | R3 |
| marconi_redis_writer_lag_seconds rising; marconi_ring_depth rising | Redis unreachable | R5 |
| marconi_pg_archiver_lag_seconds rising | PG unreachable | R6 |
| marconi_pg_archiver_lag_seconds > 7d (paging) | Redis Stream trimmed past archiver | R8 |
| marconi_ring_overwrite_total > 0 | Redis-down outage exceeded ring capacity; signals lost | Loss event |

R1 — Graceful restart

Trigger. SIGTERM, container restart, deploy.
Expected behavior.
  • A 1-2s window in which new signals route to no_subscriber or queued_offline while shims reconnect.
  • Routing/registration tables rehydrate from Redis active-registration sorted set.
  • Redis Stream + PG archiver resume from their checkpoints.
  • In-flight ring entries are flushed to Redis Stream during graceful shutdown’s last-resort sync barrier.
  • Zero accepted signals lost.
On-call action.
  1. Confirm marconi_routing_table_size{tenant} returns to pre-restart value within 5s of pod ready.
  2. Confirm marconi_redis_writer_lag_seconds returns to ~0 within 30s.
  3. Confirm marconi_pg_archiver_lag_seconds returns to ~0 within 30s.
  4. Spot-check a pending-signal drain on a representative agent; signals queued during the warmup window must drain.
No incident filed unless any check above fails.
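The pass/fail logic of checks 1-4 can be sketched as a small helper. This is a minimal illustration, not part of Marconi: the `R1Observation` container, the `r1_failures` function, and the one-second reading of "~0" are all assumptions made for the example. The operator is assumed to have read the metric values off the dashboard by hand.

```python
from dataclasses import dataclass

# Assumption: "~0" lag is read as "at most one second"; tune to your dashboards.
LAG_TOLERANCE_SECONDS = 1.0


@dataclass
class R1Observation:
    """Dashboard readings after a restart (hypothetical container, not a Marconi type)."""
    routing_size_before: int   # marconi_routing_table_size{tenant} before the restart
    routing_size_at_5s: int    # same gauge 5s after pod ready
    redis_lag_at_30s: float    # marconi_redis_writer_lag_seconds 30s after pod ready
    pg_lag_at_30s: float       # marconi_pg_archiver_lag_seconds 30s after pod ready
    pending_drained: bool      # outcome of the manual spot-check in step 4


def r1_failures(obs: R1Observation) -> list[str]:
    """Return the list of failed R1 checks; an empty list means no incident."""
    failures = []
    # Assumption: a table larger than before the restart is treated as healthy.
    if obs.routing_size_at_5s < obs.routing_size_before:
        failures.append("routing table not rehydrated within 5s")
    if obs.redis_lag_at_30s > LAG_TOLERANCE_SECONDS:
        failures.append("Redis writer lag not back to ~0 within 30s")
    if obs.pg_lag_at_30s > LAG_TOLERANCE_SECONDS:
        failures.append("PG archiver lag not back to ~0 within 30s")
    if not obs.pending_drained:
        failures.append("pending signals did not drain after warmup")
    return failures
```

An empty result corresponds to "no incident filed"; any entry in the list is a reason to open one.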

R2 — Hard kill

Trigger. SIGKILL, OOM, host power loss, kernel panic.
Expected behavior.
  • Same recovery path as R1.
  • In-flight ring entries that hadn’t been written to Redis Stream at the time of kill are lost.
  • Loss bounded by ring writer lag at kill time (typically sub-second under normal load).
On-call action.
  1. Run R1 checks 1-4 above.
  2. Compare marconi_signals_received_total{tenant} against the pre-kill value plus the expected gap. The gap is a signal count, so it should be roughly the signal rate multiplied by the writer lag at kill time.
  3. If the gap is materially larger than expected writer lag, the ring writer was wedged before kill — investigate as a separate ring-writer-stall incident.
  4. File postmortem if loss exceeded budget.
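The comparison in steps 2-3 is simple arithmetic: the expected loss is the signal rate times the writer lag at kill time, and a gap far above that suggests a wedged writer. A minimal sketch follows. The function names and the `slack_factor` default are assumptions chosen for illustration; "materially larger" is not quantified in the spec excerpted here.

```python
def expected_loss(signals_per_second: float, writer_lag_seconds: float) -> float:
    """Signals that were still only in the ring when the process was killed."""
    return signals_per_second * writer_lag_seconds


def writer_was_wedged(
    observed_gap: int,
    signals_per_second: float,
    writer_lag_seconds: float,
    slack_factor: float = 3.0,  # assumption: "materially larger" means more than 3x
) -> bool:
    """True when the observed gap is well beyond what writer lag alone explains."""
    # Floor the baseline at one signal so a near-zero lag does not flag tiny gaps.
    baseline = max(expected_loss(signals_per_second, writer_lag_seconds), 1.0)
    return observed_gap > slack_factor * baseline
```

For example, at 200 signals/s with 0.25s of writer lag the expected loss is 50 signals, so an observed gap of 40 is in line with normal lag while a gap of 5000 is not.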

R3 — Recipient WS disconnect mid-delivery

Trigger. Shim WS drops between Marconi’s push and the recipient’s ack of the frame.
Expected behavior.
  • Signal is in the Redis Stream with outcome=pushed, delivery_state=awaiting_ack.
  • outcome=pushed is a delivery attempt, not final delivery. Final delivery requires an application-level ACK (shim ACK frame, prism_signal_ack with ack_kind∈{model_acted, surface_observed}, or pending-signal drain on a subsequent session).
  • Pending-signal index keeps the entry replay/drain-eligible until ACK evidence lands.
  • Recipient reconnects → fresh push or prism_signals_pending drain promotes delivery_state to acked; entry released.
  • Zero signals lost.
On-call action.
  1. If a recipient reports missed signals, confirm prism_signals_pending returns the missing envelope (it should, while delivery_state=awaiting_ack).
  2. Inspect marconi_signals_delivered_total{outcome=pushed} minus marconi_signals_acked_total for the tenant — the gap is the in-flight-without-ack window. Sustained growth points to recipient-side WS instability or a missing ACK path in a shim.
  3. If prism_signals_pending returns empty but the Redis Stream for that tenant + window contains the entry, escalate — pending-signal index is broken (Marconi §3.5 contract violation) OR the entry was incorrectly promoted to acked without ACK evidence (delivery-evidence model violation).
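The decision in steps 1-3 reduces to two lookups and one subtraction. The sketch below is illustrative only: `in_flight_without_ack`, `r3_verdict`, and the verdict strings are hypothetical names, and the two booleans stand for the results of manually checking prism_signals_pending and the Redis Stream.

```python
def in_flight_without_ack(pushed_total: int, acked_total: int) -> int:
    """Gap between delivery attempts and ACK evidence for one tenant (step 2)."""
    return max(pushed_total - acked_total, 0)


def r3_verdict(in_pending: bool, in_stream: bool) -> str:
    """Classify a reported missed signal from the two manual lookups."""
    if in_pending:
        # Still awaiting_ack: a reconnect or a drain will deliver it.
        return "recoverable"
    if in_stream:
        # In the durable record but not drain-eligible: index broken, or the
        # entry was promoted to acked without ACK evidence (step 3).
        return "escalate"
    # Assumption: the runbook does not cover this case, so flag it for a human.
    return "not_in_stream"
```

A steadily growing `in_flight_without_ack` value across scrapes is the "sustained growth" signal described in step 2.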

R5 — Redis down

Trigger. Redis container down, network partition, AOF/RDB load.
Alert. marconi_redis_writer_lag_seconds sustained rise; marconi_ring_depth rising.
Expected behavior.
  • The ring buffers incoming signals; the Redis writer fails on every batch and retries with backoff.
  • marconi_redis_writer_errors_total{reason} increments.
  • Signals deliver normally over the hot path (recipient WS push is unaffected).
  • Zero signals lost as long as the ring absorbs the outage.
On-call action.
  1. Recover Redis (page Redis on-call if not self-healing).
  2. Confirm Redis writer resumes from last-acknowledged ring offset; marconi_redis_writer_lag_seconds falls toward 0 within minutes.
  3. Watch marconi_ring_depth — must drain proportionally as the writer catches up.
  4. If the outage approaches ring capacity (marconi_ring_depth approaching max_entries OR marconi_redis_writer_lag_seconds approaching max_age_seconds), prepare for the loss event:
    • Increase ring max_entries or max_age_seconds if memory budget allows.
    • If ring overflows, see Loss event.
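To judge how close the outage is to ring capacity (step 4), it helps to turn the two limits into a headroom fraction and a rough time-to-overflow. This is a back-of-the-envelope sketch; the function names are hypothetical, and it assumes the ring fills at the ingest rate while the Redis writer drains nothing.

```python
def ring_headroom(
    ring_depth: int,
    max_entries: int,
    writer_lag_seconds: float,
    max_age_seconds: float,
) -> float:
    """Fraction of ring capacity left, taking the tighter of the two limits."""
    used = max(ring_depth / max_entries, writer_lag_seconds / max_age_seconds)
    return max(1.0 - used, 0.0)


def seconds_until_overflow(
    ring_depth: int, max_entries: int, ingest_per_second: float
) -> float:
    """Rough time left before the entry limit is hit if Redis stays down."""
    if ingest_per_second <= 0:
        return float("inf")  # nothing arriving, so the ring cannot fill
    return max(max_entries - ring_depth, 0) / ingest_per_second
```

With a depth of 60000 against max_entries of 100000 and 200 signals/s arriving, the entry limit is about 200 seconds away, which is the window available for raising max_entries or recovering Redis.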

R6 — PG down

Trigger. Postgres container down, schema-migration window, connection saturation.
Alert. marconi_pg_archiver_lag_seconds sustained rise.
Expected behavior.
  • Archiver reads from Redis Stream succeed; PG batch UPSERTs fail; archiver retries with backoff.
  • Zero signals lost (Redis Stream is the source of truth at this tier).
On-call action.
  1. Recover PG (page PG on-call).
  2. Confirm marconi_pg_archiver_lag_seconds falls toward 0 as the archiver drains.
  3. Spot-check idempotency: re-running the archiver against the same Redis Stream window must not produce duplicate rows (signal_id UPSERT contract).
  4. If the outage runs longer than the Redis Stream MAXLEN ~ 7d, escalate to R8.
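The idempotency spot-check in step 3 relies on an UPSERT keyed on signal_id. The sketch below demonstrates the property with an in-memory SQLite database standing in for Postgres, so it can run anywhere; the `signal_audit` table, its columns, and `archive_batch` are hypothetical and do not reflect the real audit schema. SQLite and Postgres share the `ON CONFLICT ... DO UPDATE` form used here.

```python
import sqlite3


def archive_batch(conn: sqlite3.Connection, batch: list[tuple[str, str]]) -> None:
    """Write a batch of (signal_id, outcome) rows, overwriting on replay."""
    conn.executemany(
        "INSERT INTO signal_audit (signal_id, outcome) VALUES (?, ?) "
        "ON CONFLICT(signal_id) DO UPDATE SET outcome = excluded.outcome",
        batch,
    )
    conn.commit()


# SQLite stand-in for the PG audit table (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE signal_audit (signal_id TEXT PRIMARY KEY, outcome TEXT)")

window = [("sig-1", "pushed"), ("sig-2", "pushed")]
archive_batch(conn, window)
archive_batch(conn, window)  # replay the same Redis Stream window

row_count = conn.execute("SELECT COUNT(*) FROM signal_audit").fetchone()[0]
```

Replaying the window leaves `row_count` at 2, which is the behavior the spot-check should confirm against the real table.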

R8 — Redis Stream trimmed past archiver checkpoint

Trigger. marconi_pg_archiver_lag_seconds > 7d (Redis MAXLEN trim horizon). Paging incident.
Expected behavior.
  • Archiver reads return “stream trimmed.”
  • The trimmed window cannot be recovered into PG audit; the signals themselves were already delivered.
  • Archiver resumes from the new stream head.
On-call action.
  1. Document the trimmed window: bracket from the archiver’s last successful checkpoint to the current Redis stream head.
  2. Investigate why the archiver fell so far behind — likely PG outage (R6) compounded with insufficient archiver throughput.
  3. Long-term: increase MAXLEN retention OR provision additional archiver workers.
  4. File postmortem; the audit gap is the loss budget for this scenario.
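Step 1 asks for a bracket between the archiver checkpoint and the stream head. Assuming the stream uses Redis auto-generated entry IDs of the form `<milliseconds>-<sequence>`, the bracket and its duration can be computed from the two IDs alone. The helper names and the returned dictionary shape are hypothetical, and the result is an upper bound: it cannot tell how many entries inside the bracket had actually been archived.

```python
def parse_stream_id(stream_id: str) -> tuple[int, int]:
    """Split a Redis stream entry ID into (milliseconds, sequence)."""
    ms, seq = stream_id.split("-")
    return int(ms), int(seq)


def trimmed_window(checkpoint_id: str, stream_head_id: str):
    """Return the bracket to document, or None if the checkpoint is still in range."""
    checkpoint = parse_stream_id(checkpoint_id)
    head = parse_stream_id(stream_head_id)
    if head <= checkpoint:
        return None  # the checkpoint entry has not been trimmed away
    return {
        "from_id": checkpoint_id,
        "to_id": stream_head_id,
        # Wall-clock width of the bracket, taken from the ID timestamps.
        "span_seconds": (head[0] - checkpoint[0]) / 1000.0,
    }
```

The `from_id`, `to_id`, and `span_seconds` values are the facts the postmortem in step 4 needs to state the audit gap.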

Loss event — paging incident

Trigger. marconi_ring_overwrite_total > 0 for any tenant. Always paging.
What this means. Redis was unreachable longer than the ring’s max_age_seconds (or the ring filled before the writer could drain it). Oldest ring entries were overwritten before Redis ingested them. Those signals are lost from the durable record. They may have already been delivered to recipients (the hot path doesn’t depend on Redis), but they will not appear in Redis Stream history or in PG audit.
On-call action.
  1. Snapshot the counter immediately: per-tenant counts.
  2. Capture timestamp range from marconi_redis_writer_lag_seconds + marconi_redis_writer_errors_total to bracket the outage window.
  3. Confirm Redis is recovering or recover it (R5).
  4. Cross-reference recipient delivery: signals that delivered via the hot path during the window are recoverable from recipient-side memory if recipients are still online — operators may choose to re-emit them via a follow-up workflow.
  5. File postmortem with: total signals lost, time window, root cause (Redis outage / ring undersized / writer wedged), and remediation.
  6. Long-term: if this fires more than once, the ring is undersized for the realistic Redis-outage window or alerting is misconfigured.
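The mechanism behind marconi_ring_overwrite_total can be illustrated with a toy ring. This is a teaching sketch only: `OverwritingRing` is a hypothetical class, it models the entry limit but not max_age_seconds, and it says nothing about how Marconi's ring is actually implemented.

```python
from collections import deque


class OverwritingRing:
    """Toy bounded buffer that drops its oldest entry when full and counts the drops."""

    def __init__(self, max_entries: int) -> None:
        self.max_entries = max_entries
        self.overwrite_total = 0  # plays the role of marconi_ring_overwrite_total
        self._buf: deque[str] = deque()

    def append(self, signal_id: str) -> None:
        if len(self._buf) >= self.max_entries:
            self._buf.popleft()        # oldest entry is lost before Redis saw it
            self.overwrite_total += 1
        self._buf.append(signal_id)

    def drain(self) -> list[str]:
        """What the writer can still flush to Redis once it recovers."""
        drained = list(self._buf)
        self._buf.clear()
        return drained
```

Appending five signals to a ring of three with no drain in between leaves `overwrite_total` at 2, and only the three newest signals reach the durable record when the writer recovers.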

Rollback procedures (per fine-grain feature flag)

For per-flag rollback recipes, see SPEC-101 Stage 0 §3. Each flag has a written forward + rollback contract; this runbook does not duplicate them.
Rule of thumb. MARCONI_HOT_PATH_SEND, MARCONI_REDIS_STREAM_WRITER, and MARCONI_PG_ARCHIVER_PRIMARY MUST flip in lockstep. Rolling back one without the others creates new failure modes outside the loss budget.

References

  • SPEC-101 v0.2 — Marconi architecture (three-tier)
  • SPEC-101 Stage 0 v0.2 — loss budget, recovery invariants, rollback
  • ADR-56 — locks the MUST and the rename
  • Test harness: backend/tests/test_marconi_recovery_invariants.py (R1, R2, R3, R5, R6, R8 stubs, filled per stage)
Last modified on June 7, 2026