
Why Advanced Hybrid Search

Most AI memory systems use a single retrieval method, usually vector similarity search. That’s fast and cheap, but it misses things. A vector embedding of “database selection rationale” will find semantically similar text, but it won’t reliably find an ADR titled “ADR-007: Postgres over MySQL” because the terms don’t overlap enough for cosine similarity to rank it highly. Neither vector nor lexical search can answer “what was true about the database choice as of March 15?” because neither has a concept of time, and neither knows that a six-month-old memory is more likely to be obsolete than a six-day-old one.

Prism’s Hybrid RAG uses four retrieval legs precisely because each one catches what the others miss. This page explains how the system works on ingestion, how it works on retrieval, what the performance cost is, and why that cost pays for itself many times over.

The four retrieval legs

| Leg | Technology | What it catches | What it misses |
| --- | --- | --- | --- |
| Dense vector | pgvector (MiniLM 384-dim) | Paraphrases, synonyms, conceptual near-misses | Exact proper nouns, spec IDs, error codes |
| BM25 lexical | Postgres tsvector generated columns | Exact terms, technical jargon, identifiers | Semantic similarity when wording differs |
| Tri-graph traversal | Neo4j (SPEC-020 knowledge graph, typed semantic edges) | Identity, typed references, structural relationships | Free-text content not yet projected as entities |
| Temporal recency | exp(-age_days / half_life), half_life=180d | Time-relevance: recent memories outrank stale ones at the same semantic score | Domain-specific staleness (an old fact may still be load-bearing) |

No single leg is sufficient. A vector-only search returns semantically similar content but can’t distinguish current from historical. A lexical-only search finds exact terms but misses paraphrases. A graph-only search knows identity and structure but can’t search prose. A temporal-only signal would surface yesterday’s noise over last year’s load-bearing decision. The fusion of all four is what makes recall accurate enough to trust.

How ingestion works

When an agent files a typed artifact — a decision, a spec, a plan, a retrospective, anything — the backend automatically fans out to three searchable representations and a recorded creation timestamp the temporal leg reads at query time. The agent does nothing special. It calls one MCP verb. The backend does the rest.
Hybrid RAG ingestion: an agent calls one MCP verb. The backend persists the artifact row in Postgres, generates a dense vector embedding via MiniLM stored in pgvector, auto-indexes lexically via a tsvector generated column, and projects an Entity plus EntityState into the Neo4j tri-graph via the SPEC-025 live-write hook. One write produces three searchable indexes automatically.

Dense vector embedding

A MiniLM model (baked into the Docker image — 88 MB, zero first-request latency) generates a 384-dimensional vector from the artifact’s body text. Stored in pgvector alongside the artifact row. At query time, cosine similarity finds semantically related content even when the exact words don’t match. What this catches: an agent searching for “why did we pick that database” finds an ADR whose body says “after evaluating PostgreSQL, MySQL, and MongoDB, we selected Postgres for its pgvector extension” — even though the query shares almost no exact words with the result.
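
The cosine-similarity ranking described above can be sketched in a few lines. This is a minimal illustration, not Prism's code: the 4-dimensional vectors are made-up stand-ins for real 384-dimensional MiniLM embeddings, and in production the comparison would run inside pgvector rather than in Python.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat as unrelated
    return dot / (norm_a * norm_b)

# Hypothetical toy embeddings (values invented for illustration only).
query = [0.9, 0.1, 0.0, 0.2]      # "why did we pick that database"
adr_body = [0.8, 0.2, 0.1, 0.3]   # ADR discussing the Postgres choice
unrelated = [0.0, 0.1, 0.9, 0.0]  # some unrelated artifact

ranked = sorted(
    [("adr", cosine_similarity(query, adr_body)),
     ("unrelated", cosine_similarity(query, unrelated))],
    key=lambda pair: pair[1],
    reverse=True,
)
```

The ADR ranks first because its vector points in nearly the same direction as the query, regardless of whether the two texts share any exact words.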

BM25 lexical index

Postgres tsvector generated columns auto-index every artifact body on write. Zero additional code, zero extra writes — the database does it as a side-effect of the INSERT. At query time, term-frequency ranking finds content that contains the exact words the agent used. What this catches: an agent searching for “SPEC-020” or “MissingGreenlet” or “RRF_K=60” gets exact matches on those identifiers. Vector embeddings blur technical terms into semantic neighborhoods; BM25 finds the needle.
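
To show why term-frequency ranking finds exact identifiers, here is a small, generic Okapi BM25 scorer. It is a sketch under stated assumptions: the k1 and b values are the commonly used defaults, the tokenizer is a plain whitespace split, and nothing here claims to reproduce how Postgres ranks tsvector matches internally.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many documents contain each term.
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue  # a term absent from the document contributes nothing
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores

# Hypothetical artifact bodies, lowercased and split on whitespace.
docs = [
    "spec-020 defines the knowledge graph".split(),
    "the graph stores typed semantic edges".split(),
    "missinggreenlet raised during async session".split(),
]
scores = bm25_scores(["spec-020"], docs)
```

Only the first document scores above zero: an identifier either appears in the text or it does not, which is the property the lexical leg contributes to the fusion.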

Tri-graph projection

The SPEC-025 live-write hook creates an :Entity and :EntityState in Neo4j for every typed artifact at create time. The entity carries canonical identity (UUID, type, immutable props). The state carries domain properties and a valid_from timestamp. Typed references link entities to each other through a granular taxonomy that distinguishes how entities relate, not just that they relate: :IMPLEMENTS, :EXTENDS, :RELATES_TO, :SUPERSEDES, :DEPENDS_ON, :SPECIFIES, :FORMALIZES, :RESOLVES, :DOCUMENTS, plus the generic :REFERENCES fallback. The extractor reads section headings (## Implements, ## Extends, ## Supersedes, ## Related, ## See also, ## Complements, ## Informs) on every spec / ADR / plan / retro / journal write and writes the typed edge directly. Aliases map surface forms to canonical identity ("PrismGR" → "Prism"). What this catches: structural relationships, identity across renames, and, critically, temporal state. “What was true about SPEC-020 as of March 15?” is one graph traversal. No other retrieval leg can answer time-scoped questions structurally.
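
The "as of March 15" question reduces to picking the state whose valid_from is the latest one not after the requested date. The sketch below shows that lookup over an in-memory list. It is an illustration only: the state history, the status values, and the state_as_of helper are hypothetical, and the real lookup would be a Neo4j traversal rather than a Python list search.

```python
from bisect import bisect_right
from datetime import date

def state_as_of(states, when):
    """Return the state in force at `when`, or None if none existed yet.

    `states` must be sorted by valid_from in ascending order.
    """
    starts = [s["valid_from"] for s in states]
    idx = bisect_right(starts, when)  # number of states that began on or before `when`
    return states[idx - 1] if idx > 0 else None

# Hypothetical state history for a single entity.
history = [
    {"valid_from": date(2026, 1, 10), "status": "proposed"},
    {"valid_from": date(2026, 3, 1), "status": "active"},
    {"valid_from": date(2026, 4, 2), "status": "superseded"},
]

march_15 = state_as_of(history, date(2026, 3, 15))
```

Here march_15 is the "active" state, because that state began on March 1 and the next one did not begin until April 2.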

Temporal recency

Every artifact row carries a created_at timestamp. At query time the temporal leg computes score = exp(-age_days / half_life) per candidate (half_life=180d by default: a 6-month-old memory scores ≈ 0.37, a 1-year-old ≈ 0.13, today’s memories ≈ 1.0). Despite its name, half_life acts as an e-folding constant in this formula: the score falls to 1/e ≈ 0.37 at 180 days, not to 0.5. The leg re-weights existing fused candidates from vector + lexical + graph; it doesn’t add new candidates, so the result-set composition is unchanged but ordering favors recent-but-relevant memories. What this catches: historical-context queries where two semantically similar memories sit at almost the same vector / lexical / graph score but one is current and the other is six months stale. The temporal leg breaks the tie toward the recent one without requiring the agent to inspect timestamps manually.
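
The decay formula and its tie-breaking effect can be checked directly. In this sketch the formula and the 180-day default come from the text above, while the way the recency score is combined with the fused score (a simple multiplication) is an assumption made for illustration, as are the candidate names and scores.

```python
import math

def recency_score(age_days, half_life=180.0):
    """exp(-age_days / half_life), as given in the text."""
    return math.exp(-age_days / half_life)

def reweight(candidates, half_life=180.0):
    """Scale each fused score by recency and re-sort.

    Assumption: re-weighting is multiplicative. No candidate is added
    or dropped, so only the ordering can change.
    """
    rescored = [
        (doc_id, fused * recency_score(age_days, half_life))
        for doc_id, fused, age_days in candidates
    ]
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)

# Hypothetical candidates: (id, fused score, age in days).
candidates = [("old-adr", 0.50, 180), ("new-adr", 0.49, 6)]
ordered = reweight(candidates)

print(round(recency_score(180), 2))  # 0.37
print(round(recency_score(365), 2))  # 0.13
print(ordered[0][0])                 # new-adr
```

The six-day-old candidate overtakes the slightly higher-scoring six-month-old one, which is the tie-break the temporal leg is meant to provide.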

The ingestion performance tax

Ingestion adds approximately 50ms of overhead per artifact: embedding generation (~30ms on the baked-in MiniLM) plus Neo4j projection (~20ms). The BM25 index is free (Postgres tsvector generated column, computed on INSERT with zero additional I/O). The temporal leg is free at ingest (it reads created_at at query time, no separate index). For a system where artifacts are filed a few times per session, not thousands of times per second, this is invisible. A session that files 10 artifacts pays 500ms total, about half a second. The payoff: every future query against that artifact hits four independent retrieval surfaces instead of one. The 50ms tax per write produces years of faster, more accurate reads.

How retrieval works

When an agent calls semantic_recall, the vector / lexical / graph legs fire simultaneously, the temporal leg re-weights the fused candidate set, and the tri-graph annotates the output with temporal-state context before returning.
Hybrid RAG retrieval: the agent's query hits four parallel retrieval legs simultaneously — dense vector similarity via pgvector, BM25 lexical search via tsvector, tri-graph traversal in Neo4j, and temporal recency-decay (exp(-age_days / 180d)). Reciprocal Rank Fusion (RRF K=60) merges all four ranked lists; the temporal leg additionally re-weights the surviving candidates so recent-but-relevant memories rank ahead of stale ones at the same fused score. The tri-graph annotates results with temporal-state tags: current vs historical. The fused, annotated result returns approximately 200 tokens of grounded context.
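
The concurrent fan-out can be sketched with asyncio. This is a minimal illustration under assumptions: the three leg functions are hypothetical stand-ins that sleep briefly instead of querying pgvector, Postgres, or Neo4j, and the returned identifiers are invented. The point is only that the legs run concurrently, so total latency tracks the slowest leg rather than the sum of all legs.

```python
import asyncio

async def vector_leg(query):
    await asyncio.sleep(0.01)  # stand-in for a pgvector similarity query
    return ["adr-007", "spec-020"]

async def lexical_leg(query):
    await asyncio.sleep(0.01)  # stand-in for a tsvector full-text query
    return ["spec-020", "retro-3"]

async def graph_leg(query):
    await asyncio.sleep(0.01)  # stand-in for a Neo4j traversal
    return ["spec-020", "adr-007"]

async def run_legs(query):
    # gather() starts all three coroutines concurrently and returns
    # their results in the order the coroutines were passed in.
    return await asyncio.gather(
        vector_leg(query), lexical_leg(query), graph_leg(query)
    )

ranked_lists = asyncio.run(run_legs("database selection rationale"))
```

Each entry in ranked_lists is one leg's ranked candidate list, ready to be passed to a fusion step.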

Reciprocal Rank Fusion (RRF K=60)

Each retrieval leg returns a ranked list of candidates. RRF assigns each candidate a score based on its rank position in each list:
score = Σ 1/(K + rank)
A result that appears in the top-5 of multiple legs scores higher than a result that tops one leg but is absent from the others. The K=60 constant (a global methodology lesson from Phase 3) controls how much rank position matters versus mere presence. The temporal leg participates as a fourth leg in the fusion sum and additionally re-weights the surviving candidates by exp(-age_days / 180d) so recent-but-relevant memories rank ahead of stale ones at the same fused score. Why this matters: a result doesn’t need to be the best match in any single leg — it needs to be a good match across multiple legs. That cross-validation is what makes hybrid retrieval more accurate than any single method. A false positive in one leg is unlikely to also be a false positive in three others.
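
A short sketch of the fusion formula, score = Σ 1/(K + rank) with K=60. The formula and constant come from the text; the rrf_fuse helper and the single-letter document IDs are hypothetical, and ranks are assumed to start at 1.

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60):
    """Reciprocal Rank Fusion: sum 1/(k + rank) over every list a doc appears in."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

# Hypothetical ranked lists from three legs.
vector = ["a", "b", "c"]
lexical = ["b", "d", "a"]
graph = ["b", "a", "e"]

fused = rrf_fuse([vector, lexical, graph])
print([doc_id for doc_id, _ in fused[:2]])  # ['b', 'a']
```

Document "b" wins even though it is not first in the vector leg: it scores 1/62 + 1/61 + 1/61, while "c", "d", and "e" each appear in only one list and cannot score above 1/62. That is the cross-validation effect described above.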

Temporal annotation

After fusion, the tri-graph layer checks each result against its entity-state history:
  • If the result references a current state → tagged [current]
  • If the result references a superseded state → tagged [historical]
  • If the result has no tri-graph entity (pre-SPEC-025 legacy) → no tag, treated as unversioned
The agent receives this distinction as structured metadata, not as a heuristic it has to reason about. This is the structural separation that eliminates the “stale memory” problem other systems solve with periodic cleanup cycles. Prism doesn’t need to clean up stale entries because the tri-graph already knows which state is current.
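
The three tagging rules above can be expressed as a small function. This is a sketch only: the annotate helper, the dictionary shapes, and the "superseded" flag are hypothetical stand-ins for whatever the tri-graph actually stores.

```python
def annotate(results, entity_states):
    """Attach a temporal-state tag to each fused result.

    entity_states maps result id -> {"superseded": bool}. A missing
    entry models a legacy artifact with no tri-graph entity.
    """
    annotated = []
    for result in results:
        state = entity_states.get(result["id"])
        if state is None:
            tag = None            # unversioned: no tag
        elif state["superseded"]:
            tag = "historical"
        else:
            tag = "current"
        annotated.append({**result, "tag": tag})
    return annotated

# Hypothetical fused results and entity-state lookups.
results = [{"id": "adr-007"}, {"id": "adr-003"}, {"id": "legacy-note"}]
entity_states = {
    "adr-007": {"superseded": False},
    "adr-003": {"superseded": True},
}
tagged = annotate(results, entity_states)
```

The agent then reads the tag field as metadata instead of inferring freshness from the text of each result.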

Retrieval performance

| Environment | p50 latency | Notes |
| --- | --- | --- |
| Local Docker stack | ~80ms | All four legs on localhost |
| LAN server | 185–310ms | Includes network round-trip |

Both are fast enough to be invisible in an agent session where each turn takes seconds. The parallel query adds approximately 30ms over a single-leg vector-only search — the cost of the additional BM25 and tri-graph lookups running concurrently. The temporal leg adds essentially zero overhead: it computes exp(-age_days / half_life) per surviving candidate from the row’s created_at timestamp, no separate fetch.

The tokenomics payoff

This is where the performance tax on ingestion pays for itself many times over.

The comparison

| Approach | Tokens consumed | Accuracy | Temporal awareness |
| --- | --- | --- | --- |
| Full-history context load | 20,000+ tokens | Low: agent scans irrelevant material | None: agent reasons about freshness from text |
| Vector-only retrieval | ~400 tokens | Medium: semantic matches, no exact-term or temporal signal | None |
| Prism Hybrid RAG | ~400 tokens | High: three-leg cross-validated, temporally annotated | Structural: [current] vs [historical] tags |

The token cost is the same for vector-only and hybrid. The accuracy is dramatically higher for hybrid. The temporal annotation is exclusive to Prism’s tri-graph leg — no other retrieval architecture in the ecosystem provides it.

Compound savings

Multiply the per-query savings across a real workload:
  • A typical agent session makes 5–15 context-retrieval queries
  • A multi-agent team runs 4–8 sessions per day
  • A project lifetime spans weeks to months
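At 20,000 tokens per full-history load vs 400 tokens per hybrid retrieval, each query saves about 19,600 tokens. At 5–15 queries per session and 4–8 sessions per day, that is 20–120 queries per day, or roughly 0.4 to 2.4 million tokens saved per day if every one of those queries would otherwise have been a full-history load. At Anthropic’s current pricing, that’s the difference between a multi-agent workflow being a viable engineering tool and being a cost experiment.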
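
The multiplication can be worked through with the figures from this section. The sketch assumes every retrieval would otherwise have been a full-history load, which is the upper bound on savings; the constant and function names are illustrative.

```python
FULL_HISTORY_TOKENS = 20_000  # per full-history context load (from the table)
HYBRID_TOKENS = 400           # per hybrid retrieval (from the table)

def daily_savings(queries_per_session, sessions_per_day):
    """Tokens saved per day if each query replaces one full-history load."""
    saved_per_query = FULL_HISTORY_TOKENS - HYBRID_TOKENS
    return saved_per_query * queries_per_session * sessions_per_day

low = daily_savings(queries_per_session=5, sessions_per_day=4)
high = daily_savings(queries_per_session=15, sessions_per_day=8)

print(low)   # 392000
print(high)  # 2352000
```

Under these assumptions the daily range runs from about 0.4 million to about 2.4 million tokens, before multiplying by the number of working days in the project.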

The real cost of inaccurate retrieval

The tokenomics comparison above shows the per-query savings. But the deeper cost of bad retrieval isn’t the tokens wasted on the query itself; it’s the cascade of damage that follows when an agent acts on the wrong answer.

Incorrect code generation. An agent that retrieves a superseded architecture decision as its top result generates code against assumptions that are no longer true. The code compiles. It may even pass unit tests. But it’s structurally wrong, built on a foundation the team already moved away from. By the time a human catches it, the agent has generated dozens of files, and the correction isn’t a patch. It’s a delete-and-rebuild. Every line of incorrect code costs the original generation tokens plus the review tokens plus the deletion plus the regeneration against the correct context.

Incorrect spec and artifact generation. Worse than wrong code is a wrong spec, because a wrong spec compounds into more wrong code by other agents in future sessions. An agent that retrieves stale project state and drafts a spec from it produces an artifact that looks authoritative, gets filed into the project record, and misleads every subsequent session that retrieves it. The error propagates forward through the temporal record until someone notices the contradiction. The remediation isn’t just deleting the bad spec; it’s tracing every downstream artifact and session that may have consumed it.

Accidental deletion of prior work. An agent that can’t accurately recall what already exists will sometimes rebuild something that was already built, overwriting or conflicting with prior work in the process. This is especially common in multi-agent setups where two agents independently retrieve different (or incomplete) views of project state. One agent’s “new implementation” is another agent’s “destroyed three days of work.” With accurate hybrid retrieval, the agent knows what exists before it acts.

Introduction of bugs and unexpected behavior. When an agent’s context is stale or wrong, the bugs it introduces are the hardest to diagnose, because the agent’s reasoning was internally consistent, just grounded in outdated facts. The resulting behavior looks like a regression but isn’t traceable to a code change; it’s traceable to a context change that happened silently between sessions. These bugs eat hours of debugging time because the human is looking in the code for a cause that lives in the memory layer.

The multiplier: double the tokens, double the time. Every failure mode above has the same cost shape: the original work (wasted), the diagnosis (additional), the correction (additional), and the rebuild (additional). A conservative estimate is 2–3× the token cost and wall-clock time of doing it right the first time. In a multi-agent team running 4–8 sessions per day, even a 10% bad-retrieval rate produces a compounding drag that erases most of the productivity gain the agents were supposed to deliver.

This is the argument for paying the 50ms ingestion tax and the 80ms retrieval latency. The performance cost of hybrid search is measured in milliseconds. The cost of not having it is measured in hours, deleted work, propagated errors, and tokens burned on rework that should never have been necessary.

Less toil from context management

In systems without structured retrieval, the human operator becomes the context manager — re-explaining decisions, pasting prior artifacts into the chat, correcting stale references, verifying that the agent’s understanding of current state is actually current. That’s the invisible toil tax that doesn’t show up in any dashboard but eats hours per week. With Prism’s Hybrid RAG, the agent retrieves its own context, verifies its own temporal freshness via the tri-graph’s [current] / [historical] tags, and asks the human only when the retrieval surface genuinely has no answer. The human’s job shifts from re-supplying context to directing work — which is the job they were hired for.

What most people don’t know is happening

The agent experience is deceptively simple: call semantic_recall, get a grounded answer. What’s invisible is the machinery behind that simplicity:
  • Every artifact write fans out to three indexes automatically and the temporal leg reads its created_at at query time. The agent never thinks about indexing. It just files artifacts through normal MCP verbs and the backend handles the rest.
  • Every query runs four parallel scoring legs and fuses the results. The agent doesn’t choose which retrieval method to use. It asks one question and gets the cross-validated answer.
  • Temporal freshness is structural, not heuristic. The agent doesn’t need to reason about whether a result is current. The tri-graph tells it, as metadata, before reasoning begins.
  • The performance cost is front-loaded on writes, not reads. 50ms per write, 80ms per read. The write tax is paid once per artifact. The read savings compound across every future query by every agent on the project.
  • Accuracy compounds with corpus size. Unlike context-window approaches that degrade as history grows (more material to scan, more irrelevant tokens displacing reasoning), hybrid retrieval gets better with more data — more candidates for RRF to cross-validate, more entity-state history for temporal annotation, more aliases and references for the tri-graph to traverse.
This is why Prism’s retrieval architecture is not a feature checkbox (“supports hybrid search”). It is the structural foundation that makes scoped, temporal, cross-platform memory economically viable at the scale of real multi-agent teams.

Beyond the four legs: structured memory contracts

The four-leg fusion is the retrieval engine. The work shipped through Plan #10 added a contract layer above it — the part that decides what an agent is allowed to read or write, what counts as evidence, and how procedural knowledge becomes a first-class artifact instead of free-text scratch. Three SPECs land together to make this real, all default-off behind feature flags so they can roll out without disturbing the existing recall surface.

Memory domain contracts (SPEC-079)

Per-agent memory historically mixed long-term facts (architecture decisions, postmortems, ratified specs) with transient session scratch (debugging notes, work-in-progress sketches, intermediate handoffs). The agent treated both as equally authoritative — the structural failure mode that drove the “context bleed” complaint on every shipped memory system. SPEC-079 v0.2 codifies twelve memory domains with explicit read/write contracts. Each session delta and stored memory carries a domain tag that names which contract governs it (governance, methodology, runtime, retrospective, signal-trace, install, and so on). The CI loop shipped in PR #148 tags every write at the source and surfaces missing-domain advisories so the boundary is observable rather than implicit. The intent is that semantic recall reads from a typed surface, not a flat soup — “give me governance evidence for this decision” and “give me debugging context for this regression” become structurally distinct queries against the same Hybrid RAG fusion. The domains are also the substrate the Capability Library and Evidence Graph (described in Vision) read from. A capability’s evidence chain is a domain-typed traversal across the same artifact corpus the four-leg fusion already indexes — no separate store, just typed edges and read contracts on top.

Method fragments — procedural knowledge as typed artifacts (SPEC-078)

DFW carried procedural knowledge in CLAUDE.md and prose retros. Prism inherited the same shape until SPEC-078 landed. Method fragments are typed artifacts — small, focused, evidence-bound — that capture how to do a recurring thing the right way: a review pattern, a deploy ritual, a regression-test discipline. They live in the same store the spec / ADR / plan surface uses, with the same lifecycle (proposed → experimental → active → deprecated → superseded → retired) and the same evidence-binding requirements. Two seed fragments ship in v0.2:
  • method.completion.done-definition — the rule that “completion” means merged + deployed + tested, not merged alone. Proven in production by the Donna deploy lane discipline that closes every PR through the full ship sequence.
  • method.parallel.ownership-contract — the consensus-first parallelism contract that lets multiple agents work without colliding (explicit write ownership, single-driver-per-domain, signal-mediated handoffs).
Method fragments retrieve through prism_method_fragment_recall — a typed surface on top of the Hybrid RAG fusion that returns the fragment body, its evidence count, its supersession history, and its applicability to the current task context. The verb is one of three new verbs (alongside prism_signal_ack and prism_signal_trace) that landed with the SPEC-078 wave.
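
The lifecycle named above (proposed → experimental → active → deprecated → superseded → retired) can be modeled as an ordered list of states. The sketch below is illustrative only: the source lists the states but not the transition rules, so the forward-only rule and the can_transition helper are assumptions, not SPEC-078 behavior.

```python
# Lifecycle states in the order given in the text.
LIFECYCLE = [
    "proposed",
    "experimental",
    "active",
    "deprecated",
    "superseded",
    "retired",
]

def can_transition(current, target):
    """Assumed rule: a fragment may only move forward in the lifecycle.

    Raises ValueError if either state name is not a known lifecycle state.
    """
    return LIFECYCLE.index(target) > LIFECYCLE.index(current)

promote = can_transition("experimental", "active")
revive = can_transition("retired", "active")
```

Under the assumed rule, promoting an experimental fragment to active is allowed and reviving a retired one is not. A real implementation might permit skips or reversals that this sketch does not model.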

Governance lookup — graph-backed rule and capability recall (SPEC-080)

The third retrieval leg (tri-graph traversal) was always capable of structural lookup, but the queries an agent actually wanted to ask (“what governance applies to this risk tier on this surface?”, “which capabilities are in scope for this project?”, “is this rule still in force or has it been superseded?”) needed a typed verb on top. prism_governance_lookup is that verb. Given a pid and a query class (applicable_rules, surface_capabilities, authority_lookup, method_fragments, memory_domains, review_gates, validation_context, supersession_check, conflict_lookup), it returns the governance state with source, citation, freshness, and supersession reporting baked into the result. The result is advisory only: it never overrides Ring authority or live prism_start state, but it gives agents a structured window into the governance graph that didn’t exist before. The shipped Phase 2 governance_backfill CLI seeded 196 :Artifact / 48 :Decision / 12 :SourceDocument+:InstructionSurface entities for PID-PGR01 behind the default-off PRISM_GOVERNANCE_LOOKUP_ENABLED flag. Phase 3 v1.A and v1.B added strong-evidence enforcement at checkpoint and wrap so governance recall doesn’t silently degrade as the corpus grows.

Active recall, not passive

Together, these three SPECs move memory from passive retrieval (“agent asks, agent receives, agent decides”) toward active recall — the system noticing when a recall result is stale, when a method fragment applies to current work, when a governance rule has been superseded. The agent still asks. The substrate now has the typed contracts and the evidence graph to answer with confidence and a citation. The four-leg fusion is the engine. SPEC-078 + SPEC-079 + SPEC-080 are the contracts, the typed artifacts, and the citation discipline that turn the engine into a system an operator can trust.

Where to go next

  • Overview: Back to what Prism is and what it solves
  • Tri-Graph Architecture: The three-layer knowledge graph in depth
  • Vision: Where the market is going and how retrieval compounds
  • Installation: Two commands from a fresh clone to a working install
Last modified on May 8, 2026