
Multi-store writes — the architectural rule

Pick one Source of Record per fact. Write to it transactionally. Project to other stores asynchronously. Never hold a connection in store A while waiting for store B.
This document codifies the discipline that emerged from the 2026-05-01 pool-exhaustion outage. Every cross-store write path in the backend should be auditable against this rule.

The four anti-patterns we’ve now hit

1. Holding a session across an audit write

Symptom: the request handler awaits a “decoupled” audit UPDATE that opens its own session, so each request holds two connections at once. Under contention and cancellation, both leak into the pool with open transactions.
Fix recipe: queue the audit write to Redis and drain it asynchronously via a worker that uses its own short-lived session.
SPEC-067 Phase 1.
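A minimal sketch of this recipe, under stated assumptions: an `asyncio.Queue` stands in for the Redis queue so the example runs on its own, and `handle_request`, `drain_audit_queue`, `write_audit`, and the `failures` list are hypothetical names, not the SPEC-067 implementation.

```python
import asyncio


async def handle_request(audit_queue: asyncio.Queue, key_id: str) -> dict:
    """Hot path: enqueue the audit event and return without opening a second session."""
    audit_queue.put_nowait({"key_id": key_id, "event": "last_used"})
    return {"status": "ok"}


async def drain_audit_queue(audit_queue: asyncio.Queue, write_audit, failures: list) -> None:
    """Worker: apply queued audit writes one at a time, off the request path."""
    while True:
        item = await audit_queue.get()
        try:
            # write_audit is expected to open and close its own short-lived session.
            await write_audit(item)
        except Exception as exc:
            # Hypothetical stand-in for recording a drain that could not be applied.
            failures.append((item, repr(exc)))
        finally:
            audit_queue.task_done()


def demo() -> tuple[list, list]:
    """Run two requests and one worker; return (written, failures)."""
    written: list = []
    failures: list = []

    async def write_audit(item: dict) -> None:
        if item["key_id"] == "bad":
            raise RuntimeError("audit write failed")
        written.append(item["key_id"])

    async def main() -> None:
        queue: asyncio.Queue = asyncio.Queue()
        worker = asyncio.create_task(drain_audit_queue(queue, write_audit, failures))
        await handle_request(queue, "k1")
        await handle_request(queue, "bad")
        await queue.join()  # wait until the worker has processed every item
        worker.cancel()
        await asyncio.gather(worker, return_exceptions=True)

    asyncio.run(main())
    return written, failures
```

The point of the shape is that the handler never awaits the audit write: a slow or failing audit store shows up in the worker, not as a second connection held per request.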

2. Holding a session across an external service call

Symptom: worker opens a Postgres session, then issues an RPC to a different system (Neo4j, third-party HTTP, slow internal service). The Postgres connection is held for the duration of the external call.
Fix recipe: decompose into three independent operations:

```python
# 1. Read what you need from store A.
async with SessionLocal() as session:
    inputs = await fetch(session)
# session closed; connection back in pool

# 2. Do the external work with NO local connection held.
result = await call_external_service(inputs)

# 3. Open a fresh, short session for the resulting write.
async with SessionLocal() as session:
    await write_status(session, inputs, result)
```

SPEC-067 Phase 2.1.

3. Bcrypt-validating (or any CPU-heavy work) on every request

Symptom: the auth path runs an expensive synchronous computation (typically ~100ms) inside the request handler. In an async event loop, this serializes all requests behind the CPU work.
Fix recipe: validate once, then cache the result keyed by a hash of the secret, with a short TTL (60-300s). Bcrypt failures are NOT cached, so brute-force attackers still pay the full bcrypt cost on every attempt.
SPEC-067 Phase 2.3.
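The caching rule can be sketched as follows. This is an illustration under assumptions, not the SPEC-067 code: `ValidatedSecretCache` is a hypothetical name, the cache is an in-process dict, and `verify` is any expensive check (for example a bcrypt comparison) supplied by the caller.

```python
import hashlib
import time
from typing import Callable


class ValidatedSecretCache:
    """Remember successful validations for a short TTL; never cache failures."""

    def __init__(
        self,
        verify: Callable[[str], bool],
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._verify = verify  # the expensive check, supplied by the caller
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at: dict[str, float] = {}

    def is_valid(self, secret: str) -> bool:
        # Key by a digest so the raw secret is never stored in the cache.
        key = hashlib.sha256(secret.encode("utf-8")).hexdigest()
        now = self._clock()
        expiry = self._expires_at.get(key)
        if expiry is not None and expiry > now:
            return True  # cache hit: skip the expensive check
        self._expires_at.pop(key, None)  # drop an expired entry, if any
        if self._verify(secret):
            self._expires_at[key] = now + self._ttl
            return True
        return False  # failure is not cached; the next attempt pays full cost
```

One consequence of this design is that a revoked secret can keep validating until its entry expires, which is why the TTL stays short.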

4. No server-side timeouts

Symptom: stuck queries hold locks indefinitely; abandoned transactions sit in “idle in transaction” for hours; lock waiters queue forever. A single bug can cascade into full pool exhaustion.
Fix recipe: set tight Postgres timeouts at the connection level. Lab values:

| Setting | Value | What firing means |
| --- | --- | --- |
| statement_timeout | 1s | Slow query: index, rewrite, or async-ify |
| idle_in_transaction_session_timeout | 3s | Session held across external work |
| lock_timeout | 500ms | Row contention: debounce or rethink |

Use per-route overrides via SET LOCAL for legitimate exceptions (vector search, hourly reconciler, bulk imports). SPEC-067 Phase 2.2.
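As a sketch of how these settings might be applied, the helpers below render the lab values as a libpq `options` string and build a validated `SET LOCAL` statement for a per-route override. The helper names and the allowlist are hypothetical. How the options reach the server depends on the driver in use, which this document does not specify.

```python
import re

# Lab values for the three server-side timeouts.
DEFAULT_TIMEOUTS = {
    "statement_timeout": "1s",
    "idle_in_transaction_session_timeout": "3s",
    "lock_timeout": "500ms",
}

# Accept only a plain integer followed by a Postgres time unit.
_DURATION = re.compile(r"^[0-9]+(ms|s|min|h)$")


def libpq_options(timeouts: dict[str, str] = DEFAULT_TIMEOUTS) -> str:
    """Render the settings as a libpq `options` string of `-c name=value` pairs."""
    return " ".join(f"-c {name}={value}" for name, value in timeouts.items())


def set_local_sql(name: str, value: str) -> str:
    """Build a SET LOCAL statement for a per-route override inside a transaction.

    SET does not take bind parameters, so both parts are validated before
    being interpolated into the statement text.
    """
    if name not in DEFAULT_TIMEOUTS:
        raise ValueError(f"unknown timeout setting: {name}")
    if not _DURATION.match(value):
        raise ValueError(f"invalid duration: {value!r}")
    return f"SET LOCAL {name} = '{value}'"
```

For example, a bulk-import route could execute `set_local_sql("statement_timeout", "30s")` as the first statement of its transaction. `SET LOCAL` lasts only until that transaction ends, so the connection returns to the pool with the tight defaults intact.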

The decision framework for new write paths

For every cross-store write, ask:
  1. Which store is the Source of Record (SOR)?
    • One answer per fact. The SOR is the one that, if it survives, the rest can be reconstructed or projected from.
  2. Which other stores are projections / optimizations?
    • Indexes, caches, hot-path readers, search indexes, graph projections. They do not own the fact; they reflect it.
  3. What’s the user-visible effect of a projection failing?
    • If it’s “data seems missing temporarily” — projection failure is tolerable; reconciliation worker catches up.
    • If it’s “wrong answer / lost data” — projection isn’t actually a projection; you have two SORs in tension. Pick one.
  4. What’s the recovery path for projection failure?
    • Outbox pattern: SOR write + outbox row in same transaction; worker drains outbox.
    • Event log replay: events are SOR; projections rebuild from log.
    • Periodic reconciliation: reconciler sweeps SOR vs. projection; diffs are repaired.
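The outbox option can be sketched in a few lines. This example uses stdlib `sqlite3` as a stand-in for Postgres so that it runs anywhere. The `memories` and `outbox` tables and all function names are hypothetical, not the SPEC-026 schema.

```python
import json
import sqlite3
from typing import Callable


def init_schema(conn: sqlite3.Connection) -> None:
    conn.executescript(
        """
        CREATE TABLE memories (id INTEGER PRIMARY KEY, body TEXT NOT NULL);
        CREATE TABLE outbox (
            id INTEGER PRIMARY KEY,
            payload TEXT NOT NULL,
            done INTEGER NOT NULL DEFAULT 0
        );
        """
    )


def write_memory(conn: sqlite3.Connection, body: str) -> int:
    """SOR write and outbox row in ONE transaction: both commit or neither does."""
    with conn:
        cur = conn.execute("INSERT INTO memories (body) VALUES (?)", (body,))
        memory_id = cur.lastrowid
        payload = json.dumps({"memory_id": memory_id, "body": body})
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (payload,))
    return memory_id


def drain_outbox(conn: sqlite3.Connection, project: Callable[[dict], None]) -> int:
    """Worker pass: project each pending row, marking it done only on success."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE done = 0 ORDER BY id"
    ).fetchall()
    delivered = 0
    for row_id, payload in rows:
        try:
            project(json.loads(payload))  # e.g. a write to the projection store
        except Exception:
            continue  # leave the row pending; a later pass retries it
        with conn:
            conn.execute("UPDATE outbox SET done = 1 WHERE id = ?", (row_id,))
        delivered += 1
    return delivered
```

Because a row is marked done only after `project` returns, a crash between the two steps replays that row on the next pass, so the projector should be idempotent.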

What we already do correctly

| Write path | SOR | Projections | Recovery |
| --- | --- | --- | --- |
| Memory ingestion | Postgres memory_embeddings + trigraph_write_intents (same tx) | Neo4j graph | Worker drains intents; failed retries; dead-letter on exhaustion (SPEC-026) |
| Signal send (Marconi, SPEC-101) | Marconi in-memory state for delivery; PG signal_queue for audit | Marconi Cache (Redis Stream marconi:signals:{tenant}, MAXLEN ~7d): the rolling cache + PG staging buffer | In-memory audit queue → Redis Stream writer → PG archiver consumer group (XREADGROUP COUNT 1 BLOCK 0, idempotent UPSERT by signal_id); on hard kill, audit-queue overwrite is the operator-visible loss event |
| Session register | Redis hash + master claim | Postgres agent_sessions (audit) | Audit drift acceptable; runtime queries hit Redis (ADR-027) |
| Audit timestamps (SPEC-067) | Postgres rows | Redis hash queue + drain worker | audit_failures table for unrecoverable drains |

Marconi as the canonical multi-store-write exemplar

Marconi (SPEC-101 v0.4, shipped 2026-05-11) is the cleanest application of the rule in the current backend, and worth studying as the reference pattern:
  • Hot path is in-memory only. The synchronous request-response cycle from prism_signal API entry to 200 OK does zero Redis, Postgres, Neo4j, or OTEL calls. Marconi’s in-memory routing table + WS handle storage IS the live SOR for delivery state.
  • Audit fan-out is explicitly off the hot path. Three FastAPI background tasks drain the in-memory audit queue: Redis Stream writer (writes to Marconi Cache), PG archiver (consumes via XREADGROUP COUNT 1 BLOCK 0, upserts canonical tables), OTEL emitter.
  • No pubsub anywhere in the surface. All cache invalidation is direct write-through hooks at the controller-service call sites. Pubsub coupling — the previous architecture’s failure mode — was removed in v0.3.
  • MCP boundary preserved. Even the audit workers run as FastAPI in-process background tasks, never as MCP shortcuts. The boundary MCP shim → FastAPI → store holds without exception.
  • Loss budget is explicit and observable. Every failure mode has a counter (marconi_audit_queue_overwrite_total, marconi_redis_writer_lag_seconds, marconi_pg_archiver_lag_seconds); when the loss budget is breached, the operator gets paged rather than discovering it post-hoc.
The Marconi pattern is a strict superset of the SPEC-067 audit-queue discipline: it never holds a connection across a slow operation, and it queues durability work off the hot path. It also keeps one SOR per fact: delivery state lives in memory (Marconi’s tables), while audit lives in PG (signal_queue / signal_trace_events / signal_obligations). Two SORs, two distinct write paths, joined by an explicit cache tier with a known loss budget.
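To make the idea of an explicit, observable loss budget concrete, here is one plausible reading of a bounded in-memory audit queue that counts overwrites. It is an assumption-laden sketch: `BoundedAuditQueue` and its methods are hypothetical, and SPEC-101 may implement the queue and the `marconi_audit_queue_overwrite_total` counter differently.

```python
from collections import deque


class BoundedAuditQueue:
    """Fixed-capacity queue that drops the oldest event when full and counts each drop."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._items: deque = deque(maxlen=capacity)
        # Hypothetical in-process analogue of an overwrite counter metric.
        self.overwrite_total = 0

    def push(self, event: object) -> None:
        """Never blocks the hot path; a full queue overwrites its oldest entry."""
        if len(self._items) == self._items.maxlen:
            self.overwrite_total += 1  # the oldest event is about to be dropped
        self._items.append(event)

    def drain(self, max_items: int) -> list:
        """Called by a background task; returns up to max_items events in FIFO order."""
        batch = []
        while self._items and len(batch) < max_items:
            batch.append(self._items.popleft())
        return batch
```

The trade-off is deliberate: the producer never waits on the audit path, and any loss shows up as a counter increment that can drive an alert instead of being silent.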

The orchestration discipline

The API handler is the explicit decision point for sync-vs-async per write. The handler decides which writes must complete before the response returns and which are queued.
Every multi-store write decomposes into:
  • One SOR write — synchronous, in the request transaction; if it fails, the request fails.
  • Zero or more projection writes — queued, async, retryable; if they fail, the SOR is intact and the projection worker recovers.
If a route violates this — by holding a Postgres session during a Neo4j call, or awaiting a Redis write inside the auth path — that’s an architectural review failure, not a “later” problem.

Cross-references

  • SPEC-067 — Phase 1 (audit queue) + Phase 2 (close-out)
  • SPEC-026 — Trigraph write intents (outbox pattern, exemplary)
  • SPEC-034 — Signal queue + drain (drain-on-startup pattern; foundational, architecture amended by SPEC-101)
  • SPEC-101 — Marconi: in-memory signal mesh with off-hot-path audit fan-out (architecture, spec)
  • ADR-027 — Runtime state through SM (Redis-as-SOR for runtime, Postgres-as-audit)
  • ADR-56 — Marconi rename + MUST (locks SPEC-101 v0.4)
  • RCA: ~/prism-rca-pool-exhaustion-2026-05-01.md
Last modified on June 7, 2026