
SPEC-104 v0.2.1 — Marconi Cache-first trace/read path

0. Operator Decision

Frank’s correction is the governing rule for this SPEC:
Real-time signal operations read from Marconi Cache first. PG audit is explicit for historical reporting, retrospective analysis, compliance, and reconciliation. The API must not teach agents to default to PG for recent state.
SPEC-101 v0.4 locked the write-side topology:
Marconi switch (in-memory) -> Marconi Cache (Redis Stream, ~7d) -> PG audit
SPEC-104 locks the read-side topology for signal trace and recent signal lookup:
Marconi Cache first -> PG audit only when explicit or cache-aged-out

1. Problem

The current Marconi path can mint and return signal_id and trace_id from the in-memory switch before Tier 2 has durably accepted the envelope. prism_signal_trace also reads only PG audit tables, so a caller can hold a valid-looking trace_id while the real-time read surface reports trace_not_found or an incomplete timeline.
Observed implementation gaps:
  • send_via_marconi returns IDs after enqueueing to the in-process audit queue, not after Redis Stream XADD acceptance.
  • recipient_not_registered currently returns IDs without going through the audit enqueue path.
  • get_signal_trace reads signal_queue and signal_trace_events in PG only.
  • The PG archiver projects trace_id into signal_queue, but the Marconi envelope path does not seed companion signal_trace_events rows for signal_created / backend_published / backend_queued.
The product risk is behavioral: agents and dashboards will naturally call the default trace verb for recent signals. If that verb starts with PG, the system drifts back toward database polling and hides the architecture distinction between real-time cache and historical audit.

2. Scope

This is a new read-path SPEC, not a SPEC-101 amendment. SPEC-101 owns the write-side Marconi topology and fan-out. SPEC-104 owns how operators, agents, and dashboards resolve recent trace/read state.
In scope:
  • Cache-first prism_signal_trace behavior for recent traces.
  • Explicit PG-audit read mode for historical/reporting use.
  • Response-surface contract for whether returned IDs are cache-accepted or provisional.
  • Audit handling for rejected sends such as recipient_not_registered.
  • Tier 3.6 trace-event archiver behavior for Marconi-generated trace stages.
  • Metrics that make cache hit/miss, provisional responses, and audit gaps visible.
Out of scope:
  • Moving Redis into the signal delivery hot path.
  • Requiring PG persistence before prism_signal returns.
  • Adding a new service, sidecar, daemon, or MCP direct store writer.
  • Cross-instance Marconi routing.
  • Changing non-signal durable verbs such as prism_journal, prism_decide, or prism_postmortem.

3. Read-Path Contract

3.1 Default source order

For real-time signal trace and recent lookup calls, the default source order is:
  1. Marconi Cache: Redis Stream marconi:signals:{tenant_id} and its trace index (§4).
  2. PG audit: only when the caller explicitly requests history or when cache lookup proves the trace is older than the cache retention window.
PG audit MUST NOT be the first default read for recent signal operations.

3.2 Source parameter

Trace/read verbs MUST expose an explicit source selector:
Source | Meaning
cache | Read Marconi Cache only. Misses return cache miss metadata; they do not query PG.
pg_audit | Read PG audit only. Intended for historical reporting, retrospectives, compliance, and reconciliation.
auto | Default. Cache first; PG only when cache reports aged-out / retention miss or caller explicitly allows historical fallback.
source=pg_audit strictly bypasses cache. This keeps historical/reporting jobs deterministic and keeps real-time calls honest. For a recent cache miss, source=auto MUST NOT query PG unless include_history=true. A recent cache miss is diagnostic signal; silently falling through to PG would hide cache/audit gaps from the operator.
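The selector rules above can be sketched as a small resolver for a trace whose age is recent or unknown. This is a minimal illustration, not the Prism implementation: resolve_trace, TraceResult, and the cache_get / pg_get callables are hypothetical names, and reporting a cache miss as trace_state="not_found" is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TraceResult:
    source: str          # "cache" or "pg_audit"
    cache_hit: bool
    pg_audit_used: bool
    trace_state: str
    envelope: Optional[dict] = None


def resolve_trace(
    trace_id: str,
    source: str,
    include_history: bool,
    cache_get: Callable[[str], Optional[dict]],  # hypothetical cache lookup
    pg_get: Callable[[str], Optional[dict]],     # hypothetical PG audit lookup
) -> TraceResult:
    """Apply the source selector for a recent (or unknown-age) trace."""
    if source not in ("cache", "pg_audit", "auto"):
        raise ValueError(f"unknown source: {source!r}")

    if source == "pg_audit":
        # Strict bypass: the cache is never consulted.
        row = pg_get(trace_id)
        state = "pg_archived" if row else "not_found"
        return TraceResult("pg_audit", False, True, state, row)

    hit = cache_get(trace_id)
    if hit is not None:
        return TraceResult("cache", True, False, "cache_accepted", hit)

    # Cache miss: source=cache never queries PG, and source=auto only
    # does so when the caller opted in with include_history=true.
    if source == "auto" and include_history:
        row = pg_get(trace_id)
        state = "pg_archived" if row else "not_found"
        return TraceResult("pg_audit", False, True, state, row)
    return TraceResult("cache", False, False, "not_found", None)
```

With this shape, a recent miss under source=auto and include_history=false comes back with pg_audit_used=False, so the gap stays visible to the operator instead of being masked by a PG read.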

3.3 Recency boundary

The recency boundary is time-based and uses the envelope’s created_at.
  • If created_at is inside the configured Marconi Cache retention window, auto MUST attempt cache first.
  • If created_at is outside the retention window, auto MAY go directly to PG audit.
  • If created_at is unknown, auto MUST attempt cache first, then return clear miss metadata or fall back only when include_history=true.
Default cache retention follows SPEC-101 v0.4: approximately seven days.
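The recency boundary can be sketched as a pure function that picks the first source for source=auto. The function name and the exact seven-day constant are assumptions for illustration; the SPEC only says "approximately seven days", and this sketch takes the MAY branch for aged-out traces.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumed value; the real window is whatever SPEC-101 retention is configured to.
CACHE_RETENTION = timedelta(days=7)


def first_source_for_auto(
    created_at: Optional[datetime],
    now: datetime,
    retention: timedelta = CACHE_RETENTION,
) -> str:
    """Return which backend source=auto should try first."""
    if created_at is None:
        # Unknown age: always attempt the cache first.
        return "cache"
    if now - created_at <= retention:
        return "cache"
    # Outside the window, auto MAY go straight to PG audit.
    return "pg_audit"
```

For example, with now = datetime(2026, 6, 7, tzinfo=timezone.utc), an envelope created one day earlier resolves to cache first, while one created thirty days earlier resolves to pg_audit.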

3.4 Response metadata

Trace/read responses MUST include source metadata:
Field | Meaning
source | cache, pg_audit, or mixed if the response merged cache envelope with PG trace events.
cache_hit | Boolean. True when the trace envelope was found in Marconi Cache.
pg_audit_used | Boolean. True only when PG audit was queried.
cache_stream_id | Redis Stream ID when known.
trace_state | cache_accepted, pg_archived, provisional, not_found, or aged_out.
Existing response fields remain backward compatible.

4. Cache Lookup Shape

Redis Streams are not indexed by trace_id. The implementation MUST add a small per-tenant trace index when an envelope is appended to Marconi Cache:
marconi:trace:{tenant_id}:{trace_id} -> {stream_key, stream_id, signal_id, created_at}
The key TTL MUST match the stream retention window. This makes cache-first trace lookup O(1) instead of scanning the stream. On XADD success, the cache writer records both:
  • the envelope in marconi:signals:{tenant_id};
  • the trace index key above.
If the trace-index write fails after XADD succeeds, the writer MUST increment an error metric and retry or repair asynchronously. It MUST NOT delete the accepted stream entry.
v0.2 pins the implementation path to a bounded approach: do not build a complex retry queue for trace-index failures. If the stream XADD succeeds but the trace-index write fails, increment marconi_trace_index_errors_total, keep the accepted stream entry, and allow the cache lookup service to use a bounded recent-stream scan fallback for that metric-flagged minority. If scan fallback volume becomes material, a later SPEC may add a repair worker.
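A minimal sketch of the writer behavior described above, under stated assumptions: the client is any object exposing xadd, hset, and expire in the shape redis-py uses, the index is stored as a hash, and InMemoryClient is a hypothetical stand-in so the example runs without Redis. The function name and the retention constant are illustrative.

```python
import json
from collections import Counter

RETENTION_SECONDS = 7 * 24 * 3600  # assumed; must match stream retention


class InMemoryClient:
    """Hypothetical stand-in for a Redis client, for illustration only."""

    def __init__(self, fail_index=False):
        self.streams, self.hashes, self.ttls = {}, {}, {}
        self.fail_index = fail_index

    def xadd(self, name, fields):
        entries = self.streams.setdefault(name, [])
        stream_id = f"{len(entries) + 1}-0"
        entries.append((stream_id, fields))
        return stream_id

    def hset(self, name, mapping):
        if self.fail_index:
            raise ConnectionError("index write failed")
        self.hashes.setdefault(name, {}).update(mapping)

    def expire(self, name, seconds):
        self.ttls[name] = seconds


def append_with_trace_index(client, metrics, tenant_id, envelope):
    """XADD the envelope, then write the trace index; never undo the XADD."""
    stream_key = f"marconi:signals:{tenant_id}"
    stream_id = client.xadd(stream_key, {"envelope": json.dumps(envelope)})
    index_key = f"marconi:trace:{tenant_id}:{envelope['trace_id']}"
    try:
        client.hset(index_key, mapping={
            "stream_key": stream_key,
            "stream_id": stream_id,
            "signal_id": envelope["signal_id"],
            "created_at": envelope["created_at"],
        })
        client.expire(index_key, RETENTION_SECONDS)
    except Exception:
        # Keep the accepted stream entry; flag the gap for scan fallback.
        metrics["marconi_trace_index_errors_total"] += 1
    return stream_id
```

Calling append_with_trace_index(InMemoryClient(fail_index=True), Counter(), "acme", envelope) still returns a stream ID and leaves the stream entry in place; only the error counter records that the index key is missing.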

5. Send Response Contract

5.1 Default target state

For normal accepted sends, prism_signal MUST return signal_id and trace_id only after Marconi Cache accepts the envelope, except for the explicit provisional path in §5.2. This makes returned IDs immediately resolvable by the cache-first trace path in the healthy case.
Returned fields:
Field | Meaning
audit_state | cache_accepted, provisional, or not_recorded.
cache_stream_id | Redis Stream ID when cache accepted the envelope before response.
trace_state | Mirrors the trace lookup state expected immediately after return.

5.2 Cache failure behavior

Do not block signal delivery indefinitely on cache health. Cache acceptance has a default timeout of 250ms, configurable by PRISM_MARCONI_CACHE_ACCEPT_TIMEOUT_MS. Provisional response rate is the operator-facing dial that tells us whether the timeout is too aggressive or the cache writer is unhealthy.
If delivery succeeds but cache acceptance does not complete inside the bounded cache-accept timeout:
  • return the delivery result with audit_state="provisional";
  • include action_required / routing_advisory language suitable for operator display;
  • enqueue or retain the envelope for async retry when possible;
  • increment provisional-response metrics.
This preserves SPEC-101’s rule that Redis is not in the signal-delivery flow while making the response honest.
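The bounded cache-accept wait can be sketched with asyncio.wait_for. This is an illustration under assumptions: respond_after_delivery and the cache_accept coroutine are hypothetical names, the advisory text is a placeholder, and the retry handoff for provisional envelopes is omitted (wait_for cancels the pending accept on timeout, so a real path would need to retain or re-enqueue the envelope separately).

```python
import asyncio
import os

# Default of 250 ms, overridable via the environment variable named in this SPEC.
CACHE_ACCEPT_TIMEOUT_S = (
    float(os.environ.get("PRISM_MARCONI_CACHE_ACCEPT_TIMEOUT_MS", "250")) / 1000.0
)


async def respond_after_delivery(delivery_result, cache_accept,
                                 timeout_s=CACHE_ACCEPT_TIMEOUT_S):
    """Build the send response once delivery has already succeeded.

    cache_accept is a hypothetical coroutine function that resolves to the
    Redis Stream ID when Marconi Cache accepts the envelope.
    """
    try:
        stream_id = await asyncio.wait_for(cache_accept(), timeout=timeout_s)
    except (asyncio.TimeoutError, ConnectionError):
        # Delivery stands; the audit claim is downgraded, not faked.
        return {
            **delivery_result,
            "audit_state": "provisional",
            "trace_state": "provisional",
            "routing_advisory": "cache acceptance pending; trace may not resolve yet",
        }
    return {
        **delivery_result,
        "audit_state": "cache_accepted",
        "trace_state": "cache_accepted",
        "cache_stream_id": stream_id,
    }
```

The delivery result is passed in already complete, which mirrors the rule that Redis stays out of the delivery flow: the timeout only changes what the response claims about audit state.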

5.3 Rejected or invalid recipient sends

recipient_not_registered and similar rejected sends MUST prefer a rejected audit envelope with outcome="recipient_not_registered" and backend_rejected trace stage. When Marconi Cache is available, the response MAY include signal_id and trace_id after the rejected envelope is cache-accepted. When Marconi Cache is unavailable for a rejected send, the response MUST be a structured failure without minted durable-looking IDs. Do not return provisional IDs for rejected sends in v0.2; retry-retention semantics for rejected attempts are deferred until operators ask for them.
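The two rejected-send branches can be sketched as follows. Everything here is hypothetical scaffolding: respond_to_rejected_send, the cache_append callable, the response field names other than those defined in this SPEC, and the choice of audit_state="not_recorded" for the cache-unavailable branch.

```python
import uuid


def respond_to_rejected_send(recipient, cache_append):
    """Handle recipient_not_registered.

    cache_append is a hypothetical callable that appends the envelope to
    Marconi Cache and returns a stream ID, or raises ConnectionError.
    """
    envelope = {
        "signal_id": str(uuid.uuid4()),
        "trace_id": str(uuid.uuid4()),
        "recipient": recipient,
        "outcome": "recipient_not_registered",
        "stage": "backend_rejected",
    }
    try:
        stream_id = cache_append(envelope)
    except ConnectionError:
        # Cache unavailable: structured failure, no durable-looking IDs.
        return {
            "ok": False,
            "error": "recipient_not_registered",
            "audit_state": "not_recorded",
        }
    # Cache accepted the rejected envelope, so the IDs are resolvable.
    return {
        "ok": False,
        "error": "recipient_not_registered",
        "signal_id": envelope["signal_id"],
        "trace_id": envelope["trace_id"],
        "audit_state": "cache_accepted",
        "cache_stream_id": stream_id,
    }
```

Note that the IDs are minted locally but only leave the function on the cache-accepted branch, which is what keeps a misspelled recipient traceable without ever handing out an unresolvable trace_id.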

6. Trace Event Archiving (Tier 3.6)

The PG archiver MUST write trace-event rows for Marconi-generated stages when the stream entry carries enough information to do so.
Minimum seeded stages:
Stage | When
signal_created | Envelope minted by Marconi switch.
backend_published | Delivered to live recipient queue / WebSocket path.
backend_queued | Queued for offline recipient or backpressure fallback.
backend_rejected | Rejected before delivery, including recipient_not_registered.
The archiver writes these to signal_trace_events_shadow in shadow mode and signal_trace_events in primary mode using idempotent keys compatible with the existing uniqueness contract. This is named Tier 3.6 because it extends SPEC-101 Tier 3 without changing the Tier 1/Tier 2 delivery contract.
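Idempotent seeding of these stages can be illustrated with a unique key and an ignore-on-conflict insert. This sketch uses an in-memory SQLite table as a stand-in for PG, and its schema is an assumption: the real uniqueness contract of signal_trace_events is not restated in this SPEC, and the "stages" list on the envelope is a hypothetical field.

```python
import sqlite3

SEEDED_STAGES = {
    "signal_created", "backend_published", "backend_queued", "backend_rejected",
}


def open_demo_db():
    """Create a stand-in shadow table with an assumed (trace_id, stage) key."""
    db = sqlite3.connect(":memory:")
    db.execute(
        """CREATE TABLE signal_trace_events_shadow (
               trace_id  TEXT NOT NULL,
               stage     TEXT NOT NULL,
               stream_id TEXT NOT NULL,
               UNIQUE (trace_id, stage)
           )"""
    )
    return db


def archive_trace_events(db, stream_id, envelope):
    """Write one row per seeded stage; replays of the same entry are no-ops."""
    for stage in envelope.get("stages", []):
        if stage not in SEEDED_STAGES:
            continue  # only the minimum seeded stages are materialized here
        db.execute(
            "INSERT OR IGNORE INTO signal_trace_events_shadow "
            "(trace_id, stage, stream_id) VALUES (?, ?, ?)",
            (envelope["trace_id"], stage, stream_id),
        )
    db.commit()
```

Processing the same stream entry twice leaves the row count unchanged, which is the property the duplicate-stream acceptance criteria ask for. In PG the equivalent would typically be an INSERT with an ON CONFLICT DO NOTHING clause against the existing unique key.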

7. Metrics

The implementation MUST add or expose:
Metric | Meaning
marconi_trace_cache_lookup_total{tenant,outcome} | Cache-first trace lookups by outcome: hit, miss, aged_out, index_error.
marconi_trace_pg_lookup_total{tenant,reason} | PG audit lookups by reason: explicit, aged_out, fallback_allowed.
marconi_signal_response_audit_state_total{tenant,audit_state} | Send responses grouped by cache acceptance state.
marconi_trace_index_errors_total{tenant,reason} | Trace-index write/read failures.
marconi_trace_event_archiver_errors_total{tenant,reason} | Tier 3.6 trace-event archiver failures.
Dashboard surfaces should treat audit_state="provisional" and cache index errors as operator-visible yellow states. Data loss remains governed by SPEC-101’s overwrite metrics.
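As a rough illustration of the label sets above, the two lookup counters can be modeled with a plain Counter keyed by metric name and labels. TraceMetrics is a hypothetical helper, not the exporter; a real deployment would use whatever metrics library Prism already exports through.

```python
from collections import Counter

# Label values taken from the metric table in this section.
CACHE_OUTCOMES = {"hit", "miss", "aged_out", "index_error"}
PG_REASONS = {"explicit", "aged_out", "fallback_allowed"}


class TraceMetrics:
    """Hypothetical in-process counters for trace lookups."""

    def __init__(self):
        self.counts = Counter()

    def cache_lookup(self, tenant, outcome):
        if outcome not in CACHE_OUTCOMES:
            raise ValueError(f"unknown cache outcome: {outcome!r}")
        self.counts[("marconi_trace_cache_lookup_total", tenant, outcome)] += 1

    def pg_lookup(self, tenant, reason):
        if reason not in PG_REASONS:
            raise ValueError(f"unknown pg reason: {reason!r}")
        self.counts[("marconi_trace_pg_lookup_total", tenant, reason)] += 1
```

Rejecting unknown label values at the call site keeps the outcome and reason sets closed, so a dashboard can rely on the exact values listed in the table.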

8. Implementation Plan

Slice 1 — Cache-first trace read

Owner: Donna. Architecture gate: Texi. Merge lane: Samantha yellow.
Deliver:
  • Add cache trace index on cache writer XADD.
  • Add cache lookup service for trace_id.
  • Change prism_signal_trace default to source=auto with cache-first semantics.
  • Add source=cache and source=pg_audit.
  • Keep backward-compatible response fields; add source metadata.
Acceptance:
  • Recent trace lookup hits cache without PG.
  • source=pg_audit bypasses cache.
  • Cache miss does not silently hide source metadata.
  • Tenant/project authorization remains identical to the existing trace verb.
  • Before Slice 4 ships, cache-first trace responses MAY contain the envelope and source metadata with an empty or partial events array. The response MUST mark that state clearly, e.g. trace_state="cache_accepted" and events_source="pending_pg_trace_events" or equivalent, so callers do not confuse missing trace events with a missing trace.
  • Compatibility gate §9.1 passes for trace/read source behavior.

Slice 2 — Send-response provisional/cache-accepted contract

Owner: Donna. Architecture gate: Texi + Candi if response-shape governance requires it. Merge lane: Samantha red.
Deliver:
  • Add bounded cache-accept path or cache-accept future for Marconi sends.
  • Return audit_state, trace_state, and cache_stream_id when available.
  • Preserve direct delivery latency by timing out to provisional.
  • Add async retry path or retention path for provisional envelopes.
Acceptance:
  • Healthy cache returns audit_state="cache_accepted" and the returned trace resolves from cache immediately.
  • Redis unavailable returns delivered/queued result with audit_state="provisional", not a false durable claim.
  • No PG write is introduced on the hot path.
  • Compatibility gate §9.2 passes for send response durability semantics.

Slice 3 — Rejected-send audit enqueue

Owner: Donna. Architecture gate: Texi. Merge lane: Samantha yellow.
Deliver:
  • recipient_not_registered records a rejected envelope in Marconi Cache or returns without IDs.
  • Preferred: cache rejected envelope with backend_rejected trace stage.
Acceptance:
  • Misspelled recipient attempts are traceable.
  • No branch returns signal_id / trace_id with audit_state omitted.
  • Compatibility gate §9.3 passes for rejected-send behavior.

Slice 4 — Tier 3.6 trace-event archiver

Owner: Donna. Architecture gate: Texi. Merge lane: Samantha red.
Deliver:
  • Extend PG archiver to materialize trace events from stream envelopes.
  • Write shadow first, then primary under existing Stage 4 ladder controls.
  • Add migration only if idempotency or provenance requires schema support.
Acceptance:
  • Marconi-originated signals have signal_trace_events rows after archiver catch-up.
  • Shadow and primary modes behave consistently.
  • Duplicate stream processing is idempotent.
  • Compatibility gate §9.4 passes for trace-event compatibility.

Slice 5 — Metrics and dashboard handoff

Owner: Donna for exporter, Porsche for dashboard consumption, Texi architecture gate.
Deliver:
  • Metrics in §7.
  • Dashboard distinguishes cache health from PG audit lag.
  • Provisional send responses and cache lookup failures are visible.
Acceptance:
  • Operator can see whether trace miss is cache miss, aged-out, PG fallback, or real absence.
  • Dashboard does not imply PG is the real-time source of truth.

9. Documentation Plan

9.1 Compatibility Gate — Trace/read source behavior

Candi’s governance pre-check requires a lightweight formal compatibility gate before ratification and before each implementation slice merges. Additive response fields alone do not require a gate; the gate exists because default behavior changes from PG-only trace reads to cache-first trace reads.
Slice 1 MUST include tests proving:
  • Existing prism_signal_trace response fields remain present and semantically stable for trace-found and trace-miss paths.
  • Strict response models, MCP schemas, or generated clients are updated in the same slice that exposes new fields.
  • Consumers tolerate unknown additive fields: source, cache_hit, pg_audit_used, cache_stream_id, trace_state, and any events_source marker.
  • source=pg_audit preserves historical PG-audit behavior.
  • source=cache never consults PG.
  • source=auto is cache-first.
  • Recent cache miss with include_history=false does not query PG and returns explicit source, cache_hit, pg_audit_used, and trace_state metadata.
  • include_history=true behavior is covered by tests.

9.2 Compatibility Gate — Signal response durability semantics

Slice 2 MUST include tests proving:
  • Existing prism_signal response fields remain present and semantically stable for success and queued/offline paths.
  • Strict response models, MCP schemas, or generated clients are updated in the same slice that exposes new fields.
  • Healthy cache returns audit_state="cache_accepted", cache_stream_id, and trace_state="cache_accepted".
  • Cache timeout/provisional path returns audit_state="provisional" without claiming PG archival.
  • Provisional response includes visible advisory metadata.
  • No PG write is introduced on the hot path.

9.3 Compatibility Gate — Rejected sends

Slice 3 MUST include tests proving:
  • Existing rejected/unknown-recipient response fields remain stable where they still apply.
  • Rejected send with cache acceptance may return signal_id / trace_id and is traceable as backend_rejected.
  • Rejected send without cache acceptance returns structured failure without signal_id / trace_id.
  • No branch returns durable-looking IDs without an explicit audit state once Slice 2 lands.

9.4 Compatibility Gate — Trace events

Slice 4 MUST include tests proving:
  • Cache-first trace responses that predate trace-event archiving remain understandable through events_source or equivalent metadata.
  • After Tier 3.6 lands, Marconi-originated traces expose seeded events without changing the meaning of existing event fields.
  • Shadow and primary trace-event writes are idempotent.
  • Duplicate stream processing does not duplicate trace events.

9.5 Documentation compatibility note

Desiree docs sweep after Slices 1-3 land:
  • docs/signal-mesh.mdx: prism_signal_trace is cache-first for recent traces.
  • docs/agent-surfaces.mdx: explain audit_state / trace_state on signal responses.
  • docs/marconi.mdx or equivalent: read-side topology paired with SPEC-101 write-side topology.
  • docs/changelog.mdx: note response-shape additions and cache-first behavior.
Docs MUST phrase PG audit as historical/reporting/reconciliation, not as the default trace backend. They MUST name the compatibility break surface: default read source semantics changed, pg_audit is the explicit historical/reconciliation mode, and cache miss is not equivalent to historical not-found unless include_history=true or source=pg_audit is used.

10. Deployment and Validation

Rollout order:
  1. Fold v0.2 amendments and record the Candi compatibility gate for response-shape and default-behavior changes.
  2. Merge and deploy Slice 5 metrics scaffold and Slice 3 rejected-send audit enqueue if they remain additive.
  3. Merge Slice 1 cache-first trace read with the explicit empty-events caveat while Slice 4 is still pending.
  4. Validate live cache-first trace lookup using a new doorbell signal.
  5. Merge Slice 2 behind response-shape compatibility guards.
  6. Validate cache healthy path and Redis-unavailable provisional path.
  7. Merge Slice 4 in shadow mode.
  8. Compare shadow trace events, then promote primary when clean.
  9. Update docs and journal/checkpoint.
Live validation checklist:
  • Send a doorbell to an online agent; response returns audit_state.
  • Immediately call trace with default source; response source is cache.
  • Call trace with source=pg_audit; response bypasses cache.
  • Send to misspelled/unregistered identity; either no IDs are returned or rejected audit envelope is cache-visible.
  • Stop or simulate Redis failure; delivery does not hang, response is provisional, metrics increment.
  • Confirm PG archiver catches up within the SPEC-101 near-immediate expectation under healthy Redis/PG.

11. Open Questions

  1. Trace-index scan fallback bound. Implementation should choose a conservative scan bound for index-error fallback and expose it in tests/config. The architecture requirement is bounded fallback, not unbounded stream scan.
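One possible shape for the bounded fallback, as a sketch only: scan stream entries newest-first and stop after a fixed number of entries. The function name, the default bound of 500, and the envelope-as-JSON field layout are assumptions; against Redis the input would typically come from a reverse range read with a COUNT limit.

```python
import json
from typing import Iterable, Optional, Tuple


def bounded_scan_for_trace(
    entries_newest_first: Iterable[Tuple[str, dict]],
    trace_id: str,
    max_entries: int = 500,  # assumed bound; expose via config and tests
) -> Optional[Tuple[str, dict]]:
    """Look for trace_id in at most max_entries recent stream entries."""
    for scanned, (stream_id, fields) in enumerate(entries_newest_first):
        if scanned >= max_entries:
            break  # bound reached: report a miss rather than keep scanning
        envelope = json.loads(fields["envelope"])
        if envelope.get("trace_id") == trace_id:
            return stream_id, envelope
    return None
```

A miss from this function means "not within the last max_entries entries", so callers should surface it as an index_error outcome rather than as proof the trace does not exist.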

12. Ratification Criteria

SPEC-104 is ratified when:
  • Texi and Donna agree the read path is cache-first by default.
  • Frank confirms PG audit remains explicit for historical/reporting/reconciliation use.
  • Candi compatibility gate is recorded in this SPEC and enforced per implementation slice.
  • Samantha has a merge-lane test plan for yellow/red slices.
Last modified on June 7, 2026