SPEC-104 v0.2.1 — Marconi Cache-first trace/read path
0. Operator Decision
Frank’s correction is the governing rule for this SPEC:
Real-time signal operations read from Marconi Cache first. PG audit is explicit for historical reporting, retrospective analysis, compliance, and reconciliation. The API must not teach agents to default to PG for recent state.
SPEC-101 v0.4 locked the write-side topology:
Marconi switch (in-memory) -> Marconi Cache (Redis Stream, ~7d) -> PG audit
SPEC-104 locks the read-side topology for signal trace and recent signal lookup:
Marconi Cache first -> PG audit only when explicit or cache-aged-out
1. Problem
The current Marconi path can mint and return signal_id and trace_id from the in-memory switch before Tier 2 has durably accepted the envelope. prism_signal_trace also reads only PG audit tables, so a caller can hold a valid-looking trace_id while the real-time read surface reports trace_not_found or an incomplete timeline.
Observed implementation gaps:
- send_via_marconi returns IDs after enqueueing to the in-process audit queue, not after Redis Stream XADD acceptance.
- recipient_not_registered currently returns IDs without going through the audit enqueue path.
- get_signal_trace reads signal_queue and signal_trace_events in PG only.
- The PG archiver projects trace_id into signal_queue, but the Marconi envelope path does not seed companion signal_trace_events rows for signal_created / backend_published / backend_queued.
The product risk is behavioral: agents and dashboards will naturally call the default trace verb for recent signals. If that verb starts with PG, the system drifts back toward database polling and hides the architecture distinction between real-time cache and historical audit.
2. Scope
This is a new read-path SPEC, not a SPEC-101 amendment. SPEC-101 owns the write-side Marconi topology and fan-out. SPEC-104 owns how operators, agents, and dashboards resolve recent trace/read state.
In scope:
- Cache-first prism_signal_trace behavior for recent traces.
- Explicit PG-audit read mode for historical/reporting use.
- Response-surface contract for whether returned IDs are cache-accepted or provisional.
- Audit handling for rejected sends such as recipient_not_registered.
- Tier 3.6 trace-event archiver behavior for Marconi-generated trace stages.
- Metrics that make cache hit/miss, provisional responses, and audit gaps visible.
Out of scope:
- Moving Redis into the signal delivery hot path.
- Requiring PG persistence before prism_signal returns.
- Adding a new service, sidecar, daemon, or MCP direct store writer.
- Cross-instance Marconi routing.
- Changing non-signal durable verbs such as prism_journal, prism_decide, or prism_postmortem.
3. Read-Path Contract
3.1 Default source order
For real-time signal trace and recent lookup calls, the default source order is:
- Marconi Cache: Redis Stream marconi:signals:{tenant_id} and its trace index (§4).
- PG audit: only when the caller explicitly requests history or when cache lookup proves the trace is older than the cache retention window.
PG audit MUST NOT be the first default read for recent signal operations.
3.2 Source parameter
Trace/read verbs MUST expose an explicit source selector:
| Source | Meaning |
|---|---|
| cache | Read Marconi Cache only. Misses return cache miss metadata; they do not query PG. |
| pg_audit | Read PG audit only. Intended for historical reporting, retrospectives, compliance, and reconciliation. |
| auto | Default. Cache first; PG only when cache reports aged-out / retention miss or caller explicitly allows historical fallback. |
source=pg_audit strictly bypasses cache. This keeps historical/reporting jobs deterministic and keeps real-time calls honest.
For a recent cache miss, source=auto MUST NOT query PG unless include_history=true. A recent cache miss is diagnostic signal; silently falling through to PG would hide cache/audit gaps from the operator.
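The selector rules above can be sketched as two small predicates. This is an illustrative sketch, not the implementation: the function names and the cache_outcome values are assumptions introduced here, and the merged ("mixed") read, where a cache hit is combined with PG trace events, is deliberately not modelled.

```python
from typing import Literal, Optional

Source = Literal["cache", "pg_audit", "auto"]
# Hypothetical outcome labels for a completed cache lookup.
CacheOutcome = Literal["hit", "miss", "aged_out"]


def consults_cache(source: Source) -> bool:
    """source=pg_audit strictly bypasses the cache; cache and auto read it first."""
    return source != "pg_audit"


def consults_pg(
    source: Source,
    cache_outcome: Optional[CacheOutcome],
    include_history: bool = False,
) -> bool:
    """Decide whether PG audit may be queried for this call."""
    if source == "pg_audit":
        return True  # explicit historical read
    if source == "cache":
        return False  # cache-only: a miss returns miss metadata, never PG
    # source == "auto"
    if cache_outcome == "aged_out":
        return True  # retention miss: PG is the only remaining source
    if cache_outcome == "miss":
        return include_history  # a recent miss stays visible unless the caller opts in
    return False  # cache hit; merged reads are out of scope for this sketch
```

Keeping the decision in pure functions like these would make the §9.1 gate cases (cache never consults PG, recent miss without include_history does not query PG) cheap to test in isolation.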
3.3 Recency boundary
The recency boundary is time-based and uses the envelope’s created_at.
- If created_at is inside the configured Marconi Cache retention window, auto MUST attempt cache first.
- If created_at is outside the retention window, auto MAY go directly to PG audit.
- If created_at is unknown, auto MUST attempt cache first, then return clear miss metadata or fall back only when include_history=true.
Default cache retention follows SPEC-101 v0.4: approximately seven days.
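As a rough illustration of the recency boundary, the sketch below picks the first source for source=auto from created_at. The function name and the exact seven-day constant are assumptions (SPEC-101 only says approximately seven days), and the sketch takes the permitted option of going straight to PG audit for aged-out traces.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed value; the real window comes from the Marconi Cache configuration.
CACHE_RETENTION = timedelta(days=7)


def first_source_for_auto(
    created_at: Optional[datetime],
    now: datetime,
    retention: timedelta = CACHE_RETENTION,
) -> str:
    """Return the source that source=auto should try first."""
    if created_at is None:
        return "cache"  # unknown age: cache first
    if now - created_at <= retention:
        return "cache"  # inside the retention window: cache first
    return "pg_audit"  # outside the window: auto MAY skip the cache


# Example: a trace created one hour ago resolves from cache first.
now = datetime(2026, 6, 7, 12, 0, tzinfo=timezone.utc)
recent_choice = first_source_for_auto(now - timedelta(hours=1), now)
```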
Trace/read responses MUST include source metadata:
| Field | Meaning |
|---|---|
| source | cache, pg_audit, or mixed if the response merged cache envelope with PG trace events. |
| cache_hit | Boolean. True when the trace envelope was found in Marconi Cache. |
| pg_audit_used | Boolean. True only when PG audit was queried. |
| cache_stream_id | Redis Stream ID when known. |
| trace_state | cache_accepted, pg_archived, provisional, not_found, or aged_out. |
Existing response fields remain backward compatible.
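One way to keep the metadata additive is to merge it into the existing response without touching any field that is already present. The sketch below assumes a dict-shaped response; SourceMetadata and with_source_metadata are hypothetical names, and the legacy field names in the example are placeholders.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class SourceMetadata:
    source: str  # "cache", "pg_audit", or "mixed"
    cache_hit: bool
    pg_audit_used: bool
    trace_state: str  # e.g. "cache_accepted", "not_found", "aged_out"
    cache_stream_id: Optional[str] = None  # Redis Stream ID when known


def with_source_metadata(legacy: Dict[str, Any], meta: SourceMetadata) -> Dict[str, Any]:
    """Return a copy of the legacy response with the new fields added.

    Existing keys are never overwritten, so current consumers see the
    same values they saw before the metadata was introduced.
    """
    merged = dict(legacy)
    for key, value in asdict(meta).items():
        merged.setdefault(key, value)
    return merged


# Example: a cache hit on a trace whose events are not yet archived.
example = with_source_metadata(
    {"trace_id": "t-1", "events": []},
    SourceMetadata("cache", True, False, "cache_accepted", "1-0"),
)
```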
4. Cache Lookup Shape
Redis Streams are not indexed by trace_id. The implementation MUST add a small per-tenant trace index when an envelope is appended to Marconi Cache:
marconi:trace:{tenant_id}:{trace_id} -> {stream_key, stream_id, signal_id, created_at}
The key TTL MUST match the stream retention window. This makes cache-first trace lookup O(1) instead of scanning the stream.
On XADD success, the cache writer records both:
- the envelope in marconi:signals:{tenant_id};
- the trace index key above.
If the trace-index write fails after XADD succeeds, the writer MUST increment an error metric and retry or repair asynchronously. It MUST NOT delete the accepted stream entry.
v0.2 pins the implementation path to a bounded approach: do not build a complex retry queue for trace-index failures. If the stream XADD succeeds but the trace-index write fails, increment marconi_trace_index_errors_total, keep the accepted stream entry, and allow the cache lookup service to use a bounded recent-stream scan fallback for that metric-flagged minority. If scan fallback volume becomes material, a later SPEC may add a repair worker.
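The write sequence described in this section might look like the following sketch. It assumes a client exposing xadd and set methods in the style of a Redis client; InMemoryCache is a hypothetical stand-in used so the example runs without a server, and the JSON encoding of the index value is an assumption.

```python
import json
from typing import Dict

RETENTION_SECONDS = 7 * 24 * 60 * 60  # assumption: mirrors the ~7d stream retention


class InMemoryCache:
    """Hypothetical test double exposing only xadd and set."""

    def __init__(self, fail_index: bool = False) -> None:
        self.streams: Dict[str, list] = {}
        self.keys: Dict[str, str] = {}
        self.fail_index = fail_index

    def xadd(self, name: str, fields: Dict[str, str]) -> str:
        entries = self.streams.setdefault(name, [])
        stream_id = f"{len(entries) + 1}-0"
        entries.append((stream_id, dict(fields)))
        return stream_id

    def set(self, name: str, value: str, ex: int) -> bool:
        if self.fail_index:
            raise ConnectionError("simulated trace-index write failure")
        self.keys[name] = value  # the TTL is ignored by this stand-in
        return True


def append_with_trace_index(client, tenant_id: str, envelope: Dict[str, str],
                            metrics: Dict[str, int]) -> str:
    """Append the envelope, then record the per-tenant trace index entry."""
    stream_key = f"marconi:signals:{tenant_id}"
    stream_id = client.xadd(stream_key, envelope)  # an XADD failure propagates
    index_key = f"marconi:trace:{tenant_id}:{envelope['trace_id']}"
    index_value = json.dumps({
        "stream_key": stream_key,
        "stream_id": stream_id,
        "signal_id": envelope["signal_id"],
        "created_at": envelope["created_at"],
    })
    try:
        client.set(index_key, index_value, ex=RETENTION_SECONDS)
    except Exception:
        # Count the failure and keep the accepted stream entry; no retry queue.
        name = "marconi_trace_index_errors_total"
        metrics[name] = metrics.get(name, 0) + 1
    return stream_id


envelope = {"trace_id": "t-1", "signal_id": "s-1", "created_at": "2026-06-07T00:00:00Z"}
healthy, broken, metrics = InMemoryCache(), InMemoryCache(fail_index=True), {}
ok_id = append_with_trace_index(healthy, "acme", envelope, metrics)
bad_id = append_with_trace_index(broken, "acme", envelope, metrics)
```

In the failing case the stream entry survives and only the metric moves, which is the bounded behavior v0.2 asks for.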
5. Send Response Contract
5.1 Default target state
For normal accepted sends, prism_signal MUST return signal_id and trace_id only after Marconi Cache accepts the envelope, except for the explicit provisional path in §5.2. This makes returned IDs immediately resolvable by the cache-first trace path in the healthy case.
Returned fields:
| Field | Meaning |
|---|---|
| audit_state | cache_accepted, provisional, or not_recorded. |
| cache_stream_id | Redis Stream ID when cache accepted the envelope before response. |
| trace_state | Mirrors the trace lookup state expected immediately after return. |
5.2 Cache failure behavior
Do not block signal delivery indefinitely on cache health.
Cache acceptance has a default timeout of 250ms, configurable by PRISM_MARCONI_CACHE_ACCEPT_TIMEOUT_MS. Provisional response rate is the operator-facing dial that tells us whether the timeout is too aggressive or the cache writer is unhealthy.
If delivery succeeds but cache acceptance does not complete inside the bounded cache-accept timeout:
- return the delivery result with audit_state="provisional";
- include action_required / routing_advisory language suitable for operator display;
- enqueue or retain the envelope for async retry when possible;
- increment provisional-response metrics.
This preserves SPEC-101’s rule that Redis is not in the signal-delivery flow while making the response honest.
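A bounded cache-accept step could be sketched with asyncio.wait_for, as below. This is a minimal sketch under stated assumptions: respond_after_delivery, the retry_queue list, and the advisory wording are hypothetical, and delivery is assumed to have already succeeded before this function runs.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List

# Default from this section; PRISM_MARCONI_CACHE_ACCEPT_TIMEOUT_MS would override it.
CACHE_ACCEPT_TIMEOUT_S = 0.250

CacheAccept = Callable[[], Awaitable[str]]  # resolves to the Redis Stream ID


async def respond_after_delivery(
    delivery_result: Dict[str, Any],
    cache_accept: CacheAccept,
    retry_queue: List[CacheAccept],
    timeout_s: float = CACHE_ACCEPT_TIMEOUT_S,
) -> Dict[str, Any]:
    """Attach audit_state to a delivery result without waiting on the cache forever."""
    response = dict(delivery_result)
    try:
        stream_id = await asyncio.wait_for(cache_accept(), timeout=timeout_s)
    except (asyncio.TimeoutError, ConnectionError):
        # wait_for cancels the in-flight attempt, so keep the callable for a later retry.
        retry_queue.append(cache_accept)
        response.update(
            audit_state="provisional",
            trace_state="provisional",
            routing_advisory="Delivered; cache acceptance pending.",  # placeholder wording
        )
        return response
    response.update(
        audit_state="cache_accepted",
        trace_state="cache_accepted",
        cache_stream_id=stream_id,
    )
    return response


async def fast_accept() -> str:
    return "1-0"


async def slow_accept() -> str:
    await asyncio.sleep(1)  # simulates an unhealthy cache writer
    return "2-0"


retries: List[CacheAccept] = []
healthy = asyncio.run(respond_after_delivery({"delivered": True}, fast_accept, retries))
degraded = asyncio.run(
    respond_after_delivery({"delivered": True}, slow_accept, retries, timeout_s=0.01)
)
```

The delivery result is preserved in both branches; only the audit fields differ, so a provisional response never reads as a failed send.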
5.3 Rejected or invalid recipient sends
recipient_not_registered and similar rejected sends MUST prefer a rejected audit envelope with outcome="recipient_not_registered" and backend_rejected trace stage.
When Marconi Cache is available, the response MAY include signal_id and trace_id after the rejected envelope is cache-accepted.
When Marconi Cache is unavailable for a rejected send, the response MUST be a structured failure without minted durable-looking IDs. Do not return provisional IDs for rejected sends in v0.2; retry-retention semantics for rejected attempts are deferred until operators ask for them.
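The two rejected-send branches can be illustrated as a single response builder. The function name, the ok field, and the trace_stage key are hypothetical, and using audit_state="not_recorded" for the cache-unavailable branch is an assumption drawn from the audit_state values listed in §5.1.

```python
from typing import Any, Dict, Optional


def rejected_send_response(
    signal_id: str,
    trace_id: str,
    cache_stream_id: Optional[str],
) -> Dict[str, Any]:
    """Build the response for a recipient_not_registered send.

    cache_stream_id is None when the rejected envelope was not cache-accepted.
    """
    base: Dict[str, Any] = {"ok": False, "outcome": "recipient_not_registered"}
    if cache_stream_id is None:
        # Structured failure: no durable-looking IDs, no provisional state.
        return {**base, "audit_state": "not_recorded"}
    return {
        **base,
        "signal_id": signal_id,
        "trace_id": trace_id,
        "audit_state": "cache_accepted",
        "cache_stream_id": cache_stream_id,
        "trace_stage": "backend_rejected",
    }
```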
6. Trace Event Archiving (Tier 3.6)
The PG archiver MUST write trace-event rows for Marconi-generated stages when the stream entry carries enough information to do so.
Minimum seeded stages:
| Stage | When |
|---|---|
| signal_created | Envelope minted by Marconi switch. |
| backend_published | Delivered to live recipient queue / WebSocket path. |
| backend_queued | Queued for offline recipient or backpressure fallback. |
| backend_rejected | Rejected before delivery, including recipient_not_registered. |
The archiver writes these to signal_trace_events_shadow in shadow mode and signal_trace_events in primary mode using idempotent keys compatible with the existing uniqueness contract.
This is named Tier 3.6 because it extends SPEC-101 Tier 3 without changing the Tier 1/Tier 2 delivery contract.
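Idempotent materialization of seeded stages might be sketched as follows. The envelope's stages field and the (trace_id, stage) key are assumptions made for illustration; the real key must follow the existing uniqueness contract, which this SPEC does not restate. A dict stands in for the target table.

```python
from typing import Any, Dict, List, Tuple

SEEDED_STAGES = ("signal_created", "backend_published", "backend_queued", "backend_rejected")

EventKey = Tuple[str, str]  # hypothetical idempotency key: (trace_id, stage)


def trace_events_from_envelope(stream_id: str, envelope: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Build one trace-event row per seeded stage the envelope reports."""
    return [
        {"trace_id": envelope["trace_id"], "stage": stage, "stream_id": stream_id}
        for stage in envelope.get("stages", [])
        if stage in SEEDED_STAGES
    ]


def upsert_events(table: Dict[EventKey, Dict[str, Any]], rows: List[Dict[str, Any]]) -> int:
    """Insert rows that are not already present; return how many were new."""
    inserted = 0
    for row in rows:
        key = (row["trace_id"], row["stage"])
        if key not in table:
            table[key] = row
            inserted += 1
    return inserted


# Processing the same stream entry twice must not duplicate trace events.
envelope = {"trace_id": "t-1", "stages": ["signal_created", "backend_published"]}
table: Dict[EventKey, Dict[str, Any]] = {}
first_pass = upsert_events(table, trace_events_from_envelope("1-0", envelope))
second_pass = upsert_events(table, trace_events_from_envelope("1-0", envelope))
```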
7. Metrics
The implementation MUST add or expose:
| Metric | Meaning |
|---|---|
| marconi_trace_cache_lookup_total{tenant,outcome} | Cache-first trace lookups by outcome: hit, miss, aged_out, index_error. |
| marconi_trace_pg_lookup_total{tenant,reason} | PG audit lookups by reason: explicit, aged_out, fallback_allowed. |
| marconi_signal_response_audit_state_total{tenant,audit_state} | Send responses grouped by cache acceptance state. |
| marconi_trace_index_errors_total{tenant,reason} | Trace-index write/read failures. |
| marconi_trace_event_archiver_errors_total{tenant,reason} | Tier 3.6 trace-event archiver failures. |
Dashboard surfaces should treat audit_state="provisional" and cache index errors as operator-visible yellow states. Data loss remains governed by SPEC-101’s overwrite metrics.
8. Implementation Plan
Slice 1 — Cache-first trace read
Owner: Donna. Architecture gate: Texi. Merge lane: Samantha yellow.
Deliver:
- Add cache trace index on cache writer XADD.
- Add cache lookup service for trace_id.
- Change prism_signal_trace default to source=auto with cache-first semantics.
- Add source=cache and source=pg_audit.
- Keep backward-compatible response fields; add source metadata.
Acceptance:
- Recent trace lookup hits cache without PG.
- source=pg_audit bypasses cache.
- A cache miss still returns explicit source metadata.
- Tenant/project authorization remains identical to the existing trace verb.
- Before Slice 4 ships, cache-first trace responses MAY contain the envelope and source metadata with an empty or partial events array. The response MUST mark that state clearly, e.g. trace_state="cache_accepted" and events_source="pending_pg_trace_events" or equivalent, so callers do not confuse missing trace events with a missing trace.
- Compatibility gate §9.1 passes for trace/read source behavior.
Slice 2 — Send-response provisional/cache-accepted contract
Owner: Donna. Architecture gate: Texi + Candi if response-shape governance requires it. Merge lane: Samantha red.
Deliver:
- Add bounded cache-accept path or cache-accept future for Marconi sends.
- Return audit_state, trace_state, and cache_stream_id when available.
- Preserve direct delivery latency by timing out to provisional.
- Add async retry path or retention path for provisional envelopes.
Acceptance:
- Healthy cache returns audit_state="cache_accepted" and the returned trace resolves from cache immediately.
- Redis unavailable returns delivered/queued result with audit_state="provisional", not a false durable claim.
- No PG write is introduced on the hot path.
- Compatibility gate §9.2 passes for send response durability semantics.
Slice 3 — Rejected-send audit enqueue
Owner: Donna. Architecture gate: Texi. Merge lane: Samantha yellow.
Deliver:
- recipient_not_registered records a rejected envelope in Marconi Cache or returns without IDs.
- Preferred: cache rejected envelope with backend_rejected trace stage.
Acceptance:
- Misspelled recipient attempts are traceable.
- No branch returns signal_id / trace_id with audit_state omitted.
- Compatibility gate §9.3 passes for rejected-send behavior.
Slice 4 — Tier 3.6 trace-event archiver
Owner: Donna. Architecture gate: Texi. Merge lane: Samantha red.
Deliver:
- Extend PG archiver to materialize trace events from stream envelopes.
- Write shadow first, then primary under existing Stage 4 ladder controls.
- Add migration only if idempotency or provenance requires schema support.
Acceptance:
- Marconi-originated signals have signal_trace_events rows after archiver catch-up.
- Shadow and primary modes behave consistently.
- Duplicate stream processing is idempotent.
- Compatibility gate §9.4 passes for trace-event compatibility.
Slice 5 — Metrics and dashboard handoff
Owner: Donna for exporter, Porsche for dashboard consumption, Texi architecture gate.
Deliver:
- Metrics in §7.
- Dashboard distinguishes cache health from PG audit lag.
- Provisional send responses and cache lookup failures are visible.
Acceptance:
- Operator can see whether trace miss is cache miss, aged-out, PG fallback, or real absence.
- Dashboard does not imply PG is the real-time source of truth.
9. Documentation Plan
9.1 Compatibility Gate — Trace/read source behavior
Candi’s governance pre-check requires a lightweight formal compatibility gate before ratification and before each implementation slice merges. Additive response fields alone do not require a gate; the gate exists because default behavior changes from PG-only trace reads to cache-first trace reads.
Slice 1 MUST include tests proving:
- Existing prism_signal_trace response fields remain present and semantically stable for trace-found and trace-miss paths.
- Strict response models, MCP schemas, or generated clients are updated in the same slice that exposes new fields.
- Consumers tolerate unknown additive fields: source, cache_hit, pg_audit_used, cache_stream_id, trace_state, and any events_source marker.
- source=pg_audit preserves historical PG-audit behavior.
- source=cache never consults PG.
- source=auto is cache-first.
- Recent cache miss with include_history=false does not query PG and returns explicit source, cache_hit, pg_audit_used, and trace_state metadata.
- include_history=true behavior is covered by tests.
9.2 Compatibility Gate — Signal response durability semantics
Slice 2 MUST include tests proving:
- Existing prism_signal response fields remain present and semantically stable for success and queued/offline paths.
- Strict response models, MCP schemas, or generated clients are updated in the same slice that exposes new fields.
- Healthy cache returns audit_state="cache_accepted", cache_stream_id, and trace_state="cache_accepted".
- Cache timeout/provisional path returns audit_state="provisional" without claiming PG archival.
- Provisional response includes visible advisory metadata.
- No PG write is introduced on the hot path.
9.3 Compatibility Gate — Rejected sends
Slice 3 MUST include tests proving:
- Existing rejected/unknown-recipient response fields remain stable where they still apply.
- Rejected send with cache acceptance may return signal_id / trace_id and is traceable as backend_rejected.
- Rejected send without cache acceptance returns structured failure without signal_id / trace_id.
- No branch returns durable-looking IDs without an explicit audit state once Slice 2 lands.
9.4 Compatibility Gate — Trace events
Slice 4 MUST include tests proving:
- Cache-first trace responses that predate trace-event archiving remain understandable through events_source or equivalent metadata.
- After Tier 3.6 lands, Marconi-originated traces expose seeded events without changing the meaning of existing event fields.
- Shadow and primary trace-event writes are idempotent.
- Duplicate stream processing does not duplicate trace events.
9.5 Documentation compatibility note
Desiree docs sweep after Slices 1-3 land:
- docs/signal-mesh.mdx: prism_signal_trace is cache-first for recent traces.
- docs/agent-surfaces.mdx: explain audit_state / trace_state on signal responses.
- docs/marconi.mdx or equivalent: read-side topology paired with SPEC-101 write-side topology.
- docs/changelog.mdx: note response-shape additions and cache-first behavior.
Docs MUST phrase PG audit as historical/reporting/reconciliation, not as the default trace backend. They MUST name the compatibility break surface: default read source semantics changed, pg_audit is the explicit historical/reconciliation mode, and cache miss is not equivalent to historical not-found unless include_history=true or source=pg_audit is used.
10. Deployment and Validation
Rollout order:
- Fold v0.2 amendments and record the Candi compatibility gate for response-shape and default-behavior changes.
- Merge and deploy Slice 5 metrics scaffold and Slice 3 rejected-send audit enqueue if they remain additive.
- Merge Slice 1 cache-first trace read with the explicit empty-events caveat while Slice 4 is still pending.
- Validate live cache-first trace lookup using a new doorbell signal.
- Merge Slice 2 behind response-shape compatibility guards.
- Validate cache healthy path and Redis-unavailable provisional path.
- Merge Slice 4 in shadow mode.
- Compare shadow trace events, then promote primary when clean.
- Update docs and journal/checkpoint.
Live validation checklist:
- Send a doorbell to an online agent; response returns audit_state.
- Immediately call trace with default source; response source is cache.
- Call trace with source=pg_audit; response bypasses cache.
- Send to misspelled/unregistered identity; either no IDs are returned or rejected audit envelope is cache-visible.
- Stop or simulate Redis failure; delivery does not hang, response is provisional, metrics increment.
- Confirm PG archiver catches up within the SPEC-101 near-immediate expectation under healthy Redis/PG.
11. Open Questions
- Trace-index scan fallback bound. Implementation should choose a conservative scan bound for index-error fallback and expose it in tests/config. The architecture requirement is bounded fallback, not unbounded stream scan.
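To make the bounded-fallback requirement concrete, the sketch below scans at most a fixed number of the newest stream entries for a trace_id. The bound of 500 is a placeholder, not a recommendation, and the function operates on any newest-first iterable of (stream_id, fields) pairs rather than on a live Redis connection.

```python
from itertools import islice
from typing import Dict, Iterable, Optional, Tuple

SCAN_BOUND = 500  # hypothetical default; the SPEC leaves the actual bound open

StreamEntry = Tuple[str, Dict[str, str]]  # (stream_id, envelope fields)


def bounded_scan(
    entries_newest_first: Iterable[StreamEntry],
    trace_id: str,
    bound: int = SCAN_BOUND,
) -> Optional[StreamEntry]:
    """Look for trace_id in at most `bound` entries, newest first."""
    for stream_id, fields in islice(entries_newest_first, bound):
        if fields.get("trace_id") == trace_id:
            return stream_id, fields
    return None  # not found within the bound; report a miss, do not keep scanning


# Ten entries, newest first: t-10 is the most recent, t-1 the oldest.
entries = [(f"{n}-0", {"trace_id": f"t-{n}"}) for n in range(10, 0, -1)]
found = bounded_scan(entries, "t-9", bound=3)
too_old = bounded_scan(entries, "t-1", bound=3)
```

Exposing the bound as a parameter keeps it visible in tests and config, which is what the open question asks for.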
12. Ratification Criteria
SPEC-104 is ratified when:
- Texi and Donna agree the read path is cache-first by default.
- Frank confirms PG audit remains explicit for historical/reporting/reconciliation use.
- Candi compatibility gate is recorded in this SPEC and enforced per implementation slice.
- Samantha has a merge-lane test plan for yellow/red slices.
Last modified on June 7, 2026