Joye Personal Blog

Back

🔬 Research ✅ Ready · aillmragretrievalembeddingrerankerreference

RAG Retrieval Details and Pipeline Design

Notes on the key retrieval-side details of RAG — Embedding, Reranker, Chunking, Hybrid Search, Query Transformation, and more.

Core idea#

This card isn’t about any one model vendor — it’s about what the genuinely important details are at the retrieval layer of RAG.

If you want the shortest possible takeaway, start with this:

The thing that makes or breaks production-grade RAG usually isn’t “which LLM you pick,” but whether the retrieval pipeline is clean enough, accurate enough, and cheap enough.

Seen through this lens, Embedding, Reranker, Chunking, Hybrid Search, Query Transformation, and Post-Retrieval Processing aren’t scattered tricks — they’re different stages of the same retrieval pipeline.

Key points#

1. Embedding + Reranker is basically the default combo for production-grade RAG#

The most minimal version of RAG often consists of just:

  1. Chunk the documents
  2. Build an index with embeddings
  3. Embed the query too
  4. Run a similarity search to get the top-k
  5. Hand the results to the LLM

This pipeline works, but once the goal shifts to:

  • Higher accuracy
  • Fewer hallucinations
  • Cleaner context
  • Lower token cost

you’ll usually drift naturally toward two-stage retrieval:

Embedding recall -> top-k candidates -> Reranker rerank -> final contexts -> LLM

In other words:

  • Embedding handles high recall, low cost, and rapidly shrinking the search space
  • Reranker does the fine-grained ranking over the candidates, surfacing what’s actually relevant

2. Embedding is more of a recall layer than a final-ranking layer#

Embedding’s strengths are mainly:

  • Precomputable
  • Indexable at scale
  • Fast to retrieve
  • Suited to corpora in the millions of items, or even larger

But it has inherent limits too.

A typical embedding follows the bi-encoder approach:

  • The query is encoded on its own
  • The chunk is encoded on its own
  • Similarity is estimated from the vector distance

This is very efficient, but it also means it’s better at judging “do these roughly look alike semantically” and worse at making the finest-grained relevance calls.

So the problems it tends to run into are:

  • Chunks that are topically close but don’t actually answer the question still get recalled
  • It isn’t sensitive enough to negation, qualifiers, or fine-grained logic
  • A fair amount of noise sneaks into the top-k

Put differently, embedding is more like a high recall, medium precision first pass.

3. The Reranker’s value is turning “looks relevant” into “actually relevant”#

A reranker usually takes a cross-encoder approach, or an approximation like late-interaction, putting the query and the candidate chunk together and re-judging them.

The core value of this step is:

  • It can understand the correspondence between query and chunk at a finer grain
  • It can tell whether they’re merely topically close, or whether the conditions genuinely match too
  • It cuts the odds of feeding junk context to the LLM

A reranker is slower than embedding, but it usually processes only a few dozen candidates rather than scanning the whole corpus, so it’s still very cost-effective overall.

A lot of the time, adding a good reranker has a better cost-performance ratio than simply swapping in a larger generation model.

4. A handy mental model#

If you want the shortest way to remember their division of labor:

  • Embedding is RAG’s legs — it runs fast and casts a wide net
  • Reranker is RAG’s eyes — it sees accurately and picks precisely

This distinction is useful because it directly shapes your later model selection and cost allocation.

5. Chunking is still the foundation of retrieval quality#

A lot of people pour all their attention into the models, but in practice, when you actually build a system, chunking often sets the ceiling earlier.

Common problems include:

  • Chunks too small, so the meaning gets fractured
  • Chunks too large, so the vectors aren’t precise enough and you waste tokens
  • Unreasonable overlap, leading to duplication or gaps
  • A chunking scheme that doesn’t fit the document’s structure

Directions worth studying continuously include:

  • fixed-size chunking
  • semantic chunking
  • overlap strategies
  • parent-child / hierarchical chunking
  • Late Chunking

The most important lesson here isn’t “some method is always best,” but rather:

Chunking has to be considered together with the shape of the corpus, the type of queries, and how you rerank / generate downstream.

6. Hybrid Search is more of a production default than pure dense retrieval#

Relying on embedding alone for dense retrieval tends to suffer in cases like:

  • Product numbers
  • API names
  • Error strings
  • Code symbols
  • Legal clause numbers
  • Proper nouns

because this kind of content often needs stronger literal matching.

So a very common production setup is:

  • dense retrieval for semantic recall
  • sparse / keyword retrieval (e.g. BM25) for literal recall
  • then fusion and rerank on top

That’s the basic logic of Hybrid Search.

7. Query Transformation is a low-cost, high-return step#

The user’s raw query is often too short, too colloquial, too vague, or missing context.

So doing query transformation before the actual retrieval is often well worth it — for example:

  • query rewrite
  • query expansion
  • multi-query retrieval
  • HyDE

Its value lies in:

  • Improving recall
  • Reducing how much the quality of the user’s phrasing affects the system’s performance
  • Often not requiring you to rewrite the entire retrieval chain

8. Post-Retrieval Processing can directly improve cost and controllability#

What comes back from retrieval doesn’t have to be handed to the LLM verbatim.

You can add another cleanup layer in between:

  • metadata filtering
  • context compression
  • chunk summarization
  • duplicate removal
  • citation alignment

The benefits of this layer are very concrete:

  • Less junk context fed in
  • Fewer wasted tokens
  • Lower odds of the answer going off the rails
  • Better readability for citations and the chain of evidence

9. The more advanced directions are Agentic / Corrective / Graph RAG#

Once the foundational retrieval layer is reasonably solid, the next step is when these more complex architectures become worth a look:

  • Agentic RAG: the system itself decides whether to search a few more rounds, rephrase the query, or call tools
  • Corrective RAG: it first judges whether the current retrieval results are good enough, and corrects the flow if they aren’t
  • Graph RAG: organize knowledge into an entity-relationship graph, using the graph structure to aid retrieval and reasoning

These directions decide the ceiling for RAG’s leap from “looking things up” to “retrieval-driven reasoning” — but they’re not something to dive headfirst into before the foundational retrieval is solid.

Current understanding / conclusion#

From an engineering standpoint, the rough order in which RAG’s retrieval layer is most worth optimizing can be remembered as:

  1. First get chunking right
  2. Then get embedding recall + reranker rerank running smoothly
  3. Then add hybrid search and query transformation
  4. Then consider post-retrieval processing
  5. Finally, move on to more complex architectures like agentic / corrective / graph

In other words, what makes RAG genuinely stable usually isn’t some single magic trick, but:

A retrieval pipeline from query to context, with each layer making a few fewer mistakes.

To be added#

  1. The boundary and applicability conditions for semantic chunking vs. Late Chunking
  2. Concrete fusion strategies for Hybrid Search (e.g. RRF)
  3. The real benefit boundaries of HyDE / multi-query retrieval
  4. The difference between the cross-encoder and late-interaction approaches to reranking
  5. An empirical table of chunk size / overlap across different corpus types
  6. A general-purpose production RAG retrieval reference pipeline

Notes#

  • This card was split out of the earlier Jina Embeddings card, with the aim of separating provider-specific content from general RAG retrieval design.
  • The current content leans toward the retrieval side and doesn’t cover higher-level concerns like generation prompts, tool use, or agent orchestration.

🗂️ A research from the knowledge base.

← Back