Multi-store writes — the architectural rule

Pick one Source of Record per fact. Write to it transactionally. Project to other stores asynchronously. Never hold a connection in store A while waiting for store B.

This document codifies the discipline that emerged from the 2026-05-01 pool-exhaustion outage. Every cross-store write path in the backend should be auditable against this rule.

The four anti-patterns we’ve now hit

1. Holding a session across an audit write

Symptom: the request handler awaits a “decoupled” audit UPDATE that opens its own session while the handler’s session is still open, so each request holds two connections at once. Under contention and cancellation, both can leak back into the pool with open transactions. Fix recipe: queue the audit write to Redis and drain it asynchronously with a worker that uses its own short-lived session. SPEC-067 Phase 1.
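The shape of that fix can be sketched as follows. This is a minimal illustration, not the SPEC-067 implementation: an in-process `asyncio.Queue` stands in for the Redis queue (a real in-process queue would lose events on restart, which is why the recipe uses Redis), and `handle_request`, `drain_audit`, and the event fields are hypothetical names.

```python
import asyncio
from typing import Dict, List

AuditEvent = Dict[str, str]


async def handle_request(queue: asyncio.Queue, agent_id: str) -> str:
    # Any session the handler used is assumed to be closed before this line.
    # Enqueueing does not block and takes no database connection.
    queue.put_nowait({"agent_id": agent_id, "event": "last_seen"})
    return "ok"


async def drain_audit(queue: asyncio.Queue, applied: List[AuditEvent]) -> None:
    # Worker loop: one event at a time, each with its own short-lived session.
    while True:
        event = await queue.get()
        try:
            # A real worker would do something like:
            #   async with SessionLocal() as session:
            #       await write_audit(session, event)   # hypothetical helper
            applied.append(event)
        finally:
            queue.task_done()


async def demo() -> List[AuditEvent]:
    queue: asyncio.Queue = asyncio.Queue()
    applied: List[AuditEvent] = []
    worker = asyncio.create_task(drain_audit(queue, applied))
    await handle_request(queue, "a1")
    await handle_request(queue, "a2")
    await queue.join()          # wait until the worker has drained everything
    worker.cancel()
    await asyncio.gather(worker, return_exceptions=True)
    return applied
```

The point of the sketch is that the handler never awaits the audit write, so a slow or cancelled audit path cannot pin the request's connection.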

2. Holding a session across an external service call

Symptom: worker opens a Postgres session, then issues an RPC to a different system (Neo4j, third-party HTTP, slow internal service). The Postgres connection is held for the duration of the external call. Fix recipe: decompose into three independent operations:

# 1. Read what you need from store A.
async with SessionLocal() as session:
    inputs = await fetch(session)
# session closed; connection back in pool

# 2. Do the external work with NO local connection held.
result = await call_external_service(inputs)

# 3. Open a fresh, short session for the resulting write.
async with SessionLocal() as session:
    await write_status(session, inputs, result)

SPEC-067 Phase 2.1.

3. Bcrypt-validating (or any CPU-heavy work) on every request

Symptom: auth path runs an expensive synchronous computation (typically ~100ms) inside the request handler. In an async event loop, this serializes all requests behind the CPU work. Fix recipe: validate once, cache the result keyed by a hash of the secret, with a short TTL (60-300s). Bcrypt failures are NOT cached — brute-force attackers still pay the full bcrypt cost. SPEC-067 Phase 2.3.
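A minimal sketch of the validate-once cache, under stated assumptions: `VerifiedSecretCache` and its `verify` callback are hypothetical names, the callback stands in for the real bcrypt check, and the 120s default is just one value inside the 60-300s range mentioned above.

```python
import hashlib
import time
from typing import Callable, Dict, Optional, Tuple


class VerifiedSecretCache:
    """Caches successful verifications only, keyed by a hash of the secret."""

    def __init__(
        self,
        verify: Callable[[str], Optional[str]],  # slow check; returns principal or None
        ttl_seconds: float = 120.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._verify = verify
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}

    def authenticate(self, secret: str) -> Optional[str]:
        # The raw secret is never used as a key; only its SHA-256 digest is stored.
        key = hashlib.sha256(secret.encode("utf-8")).hexdigest()
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]                      # fast path: no CPU-heavy work
        principal = self._verify(secret)       # slow path (bcrypt in production)
        if principal is None:
            self._entries.pop(key, None)       # failures are never cached
            return None
        self._entries[key] = (now + self._ttl, principal)
        return principal
```

In an async handler the slow path would still need to run off the event loop (for example via `asyncio.to_thread`); the cache only reduces how often that path is taken.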

4. No server-side timeouts

Symptom: stuck queries hold locks indefinitely; abandoned transactions sit in `idle in transaction` for hours; lock waiters queue forever. A single bug can cascade into full pool exhaustion. Fix recipe: set tight Postgres timeouts at the connection level. Lab values:

Setting	Value	What firing means
`statement_timeout`	1s	Slow query — index, rewrite, or async-ify
`idle_in_transaction_session_timeout`	3s	Session held across external work
`lock_timeout`	500ms	Row contention — debounce or rethink

Legitimate exceptions (vector search, hourly reconciler, bulk imports) get per-route overrides via `SET LOCAL`, which applies only to the current transaction. SPEC-067 Phase 2.2.
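One way to carry these settings in code is sketched below. It assumes the asyncpg driver, whose `connect()` accepts a `server_settings` dict of string values; `LAB_TIMEOUTS` and `set_local_sql` are hypothetical names. Because `SET LOCAL` does not take bind parameters, the helper validates both the setting name and the value before formatting the statement.

```python
import re

# Connection-level defaults, mirroring the lab values in the table above.
# Assumed usage: asyncpg.connect(dsn, server_settings=LAB_TIMEOUTS)
LAB_TIMEOUTS = {
    "statement_timeout": "1s",
    "idle_in_transaction_session_timeout": "3s",
    "lock_timeout": "500ms",
}

# Accept only plain integer durations with a Postgres time unit.
_DURATION = re.compile(r"[0-9]+(ms|s|min)")


def set_local_sql(setting: str, value: str) -> str:
    """Build a per-transaction override statement for a known timeout."""
    if setting not in LAB_TIMEOUTS:
        raise ValueError(f"unknown timeout setting: {setting!r}")
    if _DURATION.fullmatch(value) is None:
        raise ValueError(f"invalid duration: {value!r}")
    return f"SET LOCAL {setting} = '{value}'"
```

A route with a legitimate long query would execute the returned statement as the first statement inside its transaction, for example `set_local_sql("statement_timeout", "30s")`; the override disappears at commit or rollback.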

The decision framework for new write paths

For every cross-store write, ask:

Which store is the Source of Record (SOR)?
- One answer per fact. The SOR is the one that, if it survives, the rest can be reconstructed or projected from.
Which other stores are projections / optimizations?
- Indexes, caches, hot-path readers, search indexes, graph projections. They do not own the fact; they reflect it.
What’s the user-visible effect of a projection failing?
- If it’s “data seems missing temporarily” — projection failure is tolerable; reconciliation worker catches up.
- If it’s “wrong answer / lost data” — projection isn’t actually a projection; you have two SORs in tension. Pick one.
What’s the recovery path for projection failure?
- Outbox pattern: SOR write + outbox row in same transaction; worker drains outbox.
- Event log replay: events are SOR; projections rebuild from log.
- Periodic reconciliation: reconciler sweeps SOR vs. projection; diffs are repaired.
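The outbox option can be sketched as follows. This is an illustration under assumptions, not the SPEC-026 code: `sqlite3` stands in for Postgres so the example is self-contained, and the `facts` / `outbox` tables and function names are hypothetical.

```python
import json
import sqlite3
from typing import Any, Callable, Dict


def init(conn: sqlite3.Connection) -> None:
    conn.executescript(
        """
        CREATE TABLE facts (id TEXT PRIMARY KEY, body TEXT NOT NULL);
        CREATE TABLE outbox (
            seq INTEGER PRIMARY KEY AUTOINCREMENT,
            payload TEXT NOT NULL,
            done INTEGER NOT NULL DEFAULT 0
        );
        """
    )


def write_fact(conn: sqlite3.Connection, fact_id: str, body: str) -> None:
    # SOR row and outbox row commit together, or neither does.
    with conn:
        conn.execute("INSERT INTO facts (id, body) VALUES (?, ?)", (fact_id, body))
        conn.execute(
            "INSERT INTO outbox (payload) VALUES (?)",
            (json.dumps({"id": fact_id, "body": body}),),
        )


def drain_outbox(
    conn: sqlite3.Connection, project: Callable[[Dict[str, Any]], None], batch: int = 100
) -> int:
    """Deliver pending rows in order; stop at the first failure and retry later."""
    rows = conn.execute(
        "SELECT seq, payload FROM outbox WHERE done = 0 ORDER BY seq LIMIT ?", (batch,)
    ).fetchall()
    delivered = 0
    for seq, payload in rows:
        try:
            project(json.loads(payload))   # must be idempotent: delivery is at-least-once
        except Exception:
            break                          # row stays pending for the next sweep
        with conn:
            conn.execute("UPDATE outbox SET done = 1 WHERE seq = ?", (seq,))
        delivered += 1
    return delivered
```

The projection call happens outside any SOR transaction, so a slow or failing projection store never holds the SOR connection open; it only delays the outbox row.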

What we already do correctly

Write path	SOR	Projections	Recovery
Memory ingestion	Postgres `memory_embeddings` + `trigraph_write_intents` (same tx)	Neo4j graph	Worker drains intents; failures are retried; dead-letter on exhaustion (SPEC-026)
Signal send (Marconi, SPEC-101)	Marconi in-memory state for delivery; PG `signal_queue` for audit	Marconi Cache (Redis Stream `marconi:signals:{tenant}`, MAXLEN ~7d) — the rolling cache + PG staging buffer	In-memory audit queue → Redis Stream writer → PG archiver consumer group (`XREADGROUP COUNT 1 BLOCK 0`, idempotent UPSERT by `signal_id`); on hard kill, audit-queue overwrite is the operator-visible loss event
Session register	Redis hash + master claim	Postgres `agent_sessions` (audit)	Audit drift acceptable; runtime queries hit Redis (ADR-027)
Audit timestamps (SPEC-067)	Postgres rows	Redis hash queue + drain worker	`audit_failures` table for unrecoverable drains

Marconi as the canonical multi-store-write exemplar

Marconi (SPEC-101 v0.4, shipped 2026-05-11) is the cleanest application of the rule in the current backend, and worth studying as the reference pattern:

Hot path is in-memory only. The synchronous request-response cycle from prism_signal API entry to 200 OK does zero Redis, Postgres, Neo4j, or OTEL calls. Marconi’s in-memory routing table + WS handle storage IS the live SOR for delivery state.
Audit fan-out is explicitly off the hot path. Three FastAPI background tasks drain the in-memory audit queue: Redis Stream writer (writes to Marconi Cache), PG archiver (consumes via XREADGROUP COUNT 1 BLOCK 0, upserts canonical tables), OTEL emitter.
No pubsub anywhere in the surface. All cache invalidation is direct write-through hooks at the controller-service call sites. Pubsub coupling — the previous architecture’s failure mode — was removed in v0.3.
MCP boundary preserved. Even the audit workers run as FastAPI in-process background tasks, never as MCP shortcuts. The boundary MCP shim → FastAPI → store holds without exception.
Loss budget is explicit and observable. Every failure mode has a counter (marconi_audit_queue_overwrite_total, marconi_redis_writer_lag_seconds, marconi_pg_archiver_lag_seconds); when the loss budget is breached, the operator gets paged rather than discovering it post-hoc.

The Marconi pattern is a strict superset of the SPEC-067 audit-queue discipline: it never holds a connection across a slow operation, and it queues durability work off the hot path. It also keeps one SOR per fact: delivery state lives in memory (Marconi’s tables), while the audit record lives in PG (signal_queue / signal_trace_events / signal_obligations). Two SORs for two distinct facts, two distinct write paths, joined by an explicit cache tier with a known loss budget.
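The "explicit, observable loss budget" idea can be illustrated with a bounded in-memory queue that counts what it overwrites. This is a sketch only, not Marconi's implementation: `BoundedAuditQueue` is a hypothetical name, and its `overwrite_total` attribute merely plays the role of the `marconi_audit_queue_overwrite_total` counter named above.

```python
from collections import deque
from typing import Any, Deque, List


class BoundedAuditQueue:
    """Fixed-capacity queue: the hot path never blocks, and loss is counted."""

    def __init__(self, capacity: int) -> None:
        self._items: Deque[Any] = deque(maxlen=capacity)
        self.overwrite_total = 0  # exported as a metric in a real service

    def put_nowait(self, event: Any) -> None:
        if len(self._items) == self._items.maxlen:
            # deque(maxlen=...) drops the oldest entry on append; record the loss.
            self.overwrite_total += 1
        self._items.append(event)

    def drain(self, max_items: int) -> List[Any]:
        """Called by a background task; returns up to max_items oldest events."""
        out: List[Any] = []
        while self._items and len(out) < max_items:
            out.append(self._items.popleft())
        return out
```

The design choice is that a stalled drain worker degrades audit completeness, visibly and measurably, rather than adding latency or blocking to the delivery path.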

The orchestration discipline

The API handler is the explicit decision point for sync-vs-async per write. The handler decides which writes must complete before the response returns and which are queued.

Every multi-store write decomposes into:

- One SOR write: synchronous, in the request transaction; if it fails, the request fails.
- Zero or more projection writes: queued, async, retryable; if they fail, the SOR is intact and the projection worker recovers.

If a route violates this — by holding a Postgres session during a Neo4j call, or awaiting a Redis write inside the auth path — that’s an architectural review failure, not a “later” problem.
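As a rough sketch of the handler as decision point, assuming hypothetical names throughout (`handle_write`, `sor_write`, `projection_jobs`, `enqueue`): the SOR write runs first and its failure fails the request, while projection work is only enqueued. Swallowing an enqueue failure is an assumption of this sketch; it relies on a reconciler or outbox existing to repair the gap.

```python
from typing import Any, Callable, Dict, List

Job = Dict[str, Any]


def handle_write(
    sor_write: Callable[[], Dict[str, Any]],
    projection_jobs: Callable[[Dict[str, Any]], List[Job]],
    enqueue: Callable[[Job], None],
) -> Dict[str, Any]:
    # 1. SOR write: synchronous. An exception propagates, the request fails,
    #    and nothing has been queued yet.
    record = sor_write()

    # 2. Projection writes: queued only, never awaited or executed inline.
    unqueued: List[Job] = []
    for job in projection_jobs(record):
        try:
            enqueue(job)
        except Exception:
            # The SOR is intact; leave the gap for reconciliation to repair.
            unqueued.append(job)
    return {"record": record, "unqueued": unqueued}
```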

Cross-references

SPEC-067 — Phase 1 (audit queue) + Phase 2 (close-out)
SPEC-026 — Trigraph write intents (outbox pattern, exemplary)
SPEC-034 — Signal queue + drain (drain-on-startup pattern; foundational, architecture amended by SPEC-101)
SPEC-101 — Marconi: in-memory signal mesh with off-hot-path audit fan-out (architecture, spec)
ADR-027 — Runtime state through SM (Redis-as-SOR for runtime, Postgres-as-audit)
ADR-56 — Marconi rename + MUST (locks SPEC-101 v0.4)
RCA: ~/prism-rca-pool-exhaustion-2026-05-01.md