Chunking determines which text exists in the index at all. Reranking only reorders candidates already retrieved. If required facts never appear in any chunk, reranking cannot recover them.
Decision rule
Measure recall on required facts per question set first. If recall is low, change chunk size, overlap, and metadata boundaries before adding reranker latency and cost.
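One way to make this rule concrete is a small recall check over a labeled question set. The sketch below is illustrative only: it assumes each question comes with a list of required fact strings, that a fact counts as covered when it appears verbatim (case-insensitive) in a retrieved chunk, and that `retrieve` is a hypothetical callable standing in for your own first-pass retriever.

```python
def fact_recall(required_facts, retrieved_chunks):
    """Fraction of required facts found in at least one retrieved chunk."""
    if not required_facts:
        return 1.0
    lowered = [chunk.lower() for chunk in retrieved_chunks]
    found = sum(
        1 for fact in required_facts
        if any(fact.lower() in chunk for chunk in lowered)
    )
    return found / len(required_facts)


def mean_fact_recall(eval_set, retrieve, k=10):
    """Average fact recall over a question set.

    eval_set: list of (question, required_facts) pairs.
    retrieve: hypothetical callable (question, k) -> list of chunk strings.
    """
    scores = [fact_recall(facts, retrieve(q, k)) for q, facts in eval_set]
    return sum(scores) / len(scores) if scores else 0.0
```

Substring matching is a deliberately crude stand-in; paraphrased facts would need a looser matcher, but the decision rule stays the same: if this number is low, fix chunking first.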
Architecture differences
• Chunking runs at ingest time and defines embedding inputs; reranking runs at query time on top-k candidates.
• Rerankers cannot surface text that was never chunked into the index.
Choose Chunking
How documents are split, overlapped, and tagged before embedding — defines the searchable units.
• Recall@k is low because passages are too large, too small, or split mid-thought.
• You are still designing ingestion and have no stable baseline metrics.
• Required facts are missing from every retrieved candidate in eval traces.
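The three knobs named above (split size, overlap, and tags) can be sketched with a fixed-window splitter. This is a minimal illustration, not a recommended configuration: it treats whitespace-separated words as tokens, whereas a real pipeline would likely count tokens with the embedding model's tokenizer, and the `size`, `overlap`, and `source` parameters are assumed names.

```python
def chunk_tokens(text, size=200, overlap=40, source="doc"):
    """Split whitespace tokens into overlapping windows tagged with metadata."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    tokens = text.split()
    step = size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + size]
        chunks.append({"source": source, "start": start, "text": " ".join(window)})
        if start + size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks
```

Changing `size` or `overlap` changes every chunk, which is why these parameters are worth settling before a baseline is recorded.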
Choose Reranking
A second stage that scores and reorders top-k hits from the first retrieval pass before generation.
• Recall is acceptable but precision at top ranks is noisy.
• You can afford extra latency per query for a cross-encoder or rerank model.
• Top-10 hits contain the answer but the wrong passage ranks first.
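The second-stage pattern can be sketched as a function that rescores first-pass candidates and returns them in a new order. The scorer here is a toy term-overlap function standing in for a cross-encoder or hosted rerank model; `overlap_score`, `rerank`, and `top_n` are hypothetical names for illustration.

```python
def overlap_score(query, passage):
    """Toy relevance score: fraction of query terms present in the passage."""
    q_terms = set(query.lower().split())
    p_terms = set(passage.lower().split())
    return len(q_terms & p_terms) / len(q_terms) if q_terms else 0.0


def rerank(query, candidates, score_fn=overlap_score, top_n=3):
    """Reorder first-pass candidates by score_fn, best first, and keep top_n."""
    ranked = sorted(candidates, key=lambda p: score_fn(query, p), reverse=True)
    return ranked[:top_n]
```

Note that the output is always a subset of `candidates`: the function can promote a passage that was already retrieved, but it has nothing to work with if the answer was never in the candidate list.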
Where people confuse them
• Adding a reranker when required facts are absent from every candidate chunk.
• Tuning overlap while tables are still split mid-row.
What experts agree on
Shared ground practitioners cite before choosing sides in this comparison.
• Both affect which text the generator sees: chunking upstream, reranking immediately before the prompt.
• Both should be logged in eval traces to debug failures.
• Retrieval quality dominates many production failures; fixing prompts alone rarely fixes wrong or missing chunks.
• Chunking, embedding model choice, and metadata boundaries materially affect what the model can see.
What experts disagree on
Open engineering debates — compare indexed explanations before you commit to an architecture.
Semantic chunking versus fixed token windows for technical docs.
Whether reranking belongs in-line or only in offline eval tooling.
Common mistakes
• Buying reranker SaaS before baseline recall metrics exist.
• Using fixed token windows on API reference docs with code blocks.
• Assuming rerankers can recover answers that are missing from all chunks.
• Assuming larger chunks always improve retrieval.
• Tuning rerankers while answers never appear in any chunk.
• Using huge chunks that bury the single sentence containing the fact.
Implementation tradeoffs
• Chunk changes require re-embed and re-index jobs; reranker deploys add latency and GPU cost per query.
• Chunk bugs are fixed in ETL; rerank bugs are fixed in model serving.
• Smaller chunks increase index size and query fan-out; rerankers scale with top-k size and cross-encoder throughput.
• Huge chunks dilute embeddings and inflate generation token cost downstream.
• Establish recall on required facts before reranker experiments; measure MRR shift only after recall plateaus.
• Log both retrieved and reranked spans in eval traces.
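To compare orderings before and after reranking, mean reciprocal rank (MRR) can be computed from the logged spans. The sketch below assumes each logged run is a pair of a ranked chunk list and an answer string, and that a chunk counts as a hit when it contains the answer verbatim (case-insensitive); both are simplifying assumptions.

```python
def reciprocal_rank(ranked_chunks, answer):
    """1/rank of the first chunk containing the answer, or 0.0 if none does."""
    needle = answer.lower()
    for rank, chunk in enumerate(ranked_chunks, start=1):
        if needle in chunk.lower():
            return 1.0 / rank
    return 0.0


def mrr(runs):
    """Mean reciprocal rank over (ranked_chunks, answer) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(chunks, ans) for chunks, ans in runs) / len(runs)
```

Running `mrr` once on the retrieved order and once on the reranked order for the same questions gives the shift; a run that scores 0.0 in both is a recall problem, not a ranking problem.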
Themes repeated across indexed engineering talks and practitioner writeups — not a survey, vote count, or attributed quote roundup.
Example use cases
• API docs with split endpoints → structure-aware chunking.
• Noisy top-10 support tickets → reranker before generation.
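For the first use case, structure-aware chunking can be as simple as splitting on section headings while leaving fenced code blocks intact. This sketch assumes the API docs are Markdown with one `## ` heading per endpoint; the function name and the `prefix` parameter are illustrative, and other doc formats would need a different boundary rule.

```python
FENCE = "`" * 3  # Markdown code-fence marker


def chunk_by_heading(markdown, prefix="## "):
    """Split a Markdown document into one chunk per heading section."""
    chunks, current, in_fence = [], [], False
    for line in markdown.splitlines():
        if line.startswith(FENCE):
            in_fence = not in_fence
        # Start a new chunk only at a heading that is outside a code fence.
        if line.startswith(prefix) and not in_fence and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [c for c in chunks if c]
```

Each endpoint's description and its code sample stay in the same chunk, so a query about that endpoint retrieves both together.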
Related engineering concepts
Chunking explained
Retrieval evaluation
RAG hallucination causes
Best expert explanation
Chosen for clarity and how directly it answers the question — not for views or hype.
"We're going to cover how to use Cohere's reranker to improve the final quality of your results"
Weights & Biases · Chunking and embedding tradeoffs · 0:23