Teams evaluate RAG in two layers: retrieval (did we fetch the right chunks?) and generation (did the answer stay faithful to those chunks?). Recall on required facts is a common retrieval metric practitioners highlight before tuning prompts.
"You might be missing data. You might be chunking them in the wrong way. You might be using an embedding model that isn't optimum. Maybe your retrieval strategy needs to change."
There are a few metrics, but the most important one for us is “Recall.” Basically, for a given question, there is at least one required fact. If the retrieval step of the application found at least one context for every required fact, we count that question as covered, and we measure that over a set of questions.
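One way to read the recall definition above is sketched here in Python. It assumes each labeled question maps every required fact to the IDs of chunks known to support it, and that the retriever returns chunk IDs. The names `LabeledQuestion`, `question_passes`, and `recall_over_questions` are hypothetical, and the aggregate (share of questions with every required fact covered) is one reasonable interpretation, not necessarily the exact formula the speakers use.

```python
from dataclasses import dataclass


@dataclass
class LabeledQuestion:
    """One evaluation item. All field names are hypothetical."""
    question: str
    # Maps each required fact to the IDs of chunks known to support it.
    required_facts: dict[str, set[str]]


def question_passes(item: LabeledQuestion, retrieved_ids: list[str]) -> bool:
    """True if every required fact has at least one supporting chunk retrieved."""
    retrieved = set(retrieved_ids)
    return all(retrieved & supporting for supporting in item.required_facts.values())


def recall_over_questions(
    items: list[LabeledQuestion],
    retrieved_by_question: dict[str, list[str]],
) -> float:
    """Share of questions whose required facts were all covered by retrieval."""
    if not items:
        raise ValueError("need at least one labeled question")
    passed = sum(
        question_passes(item, retrieved_by_question.get(item.question, []))
        for item in items
    )
    return passed / len(items)


# Toy data: q1 has both facts covered, q2 misses its only required fact.
items = [
    LabeledQuestion("q1", {"fact_a": {"c1", "c2"}, "fact_b": {"c3"}}),
    LabeledQuestion("q2", {"fact_c": {"c4"}}),
]
retrieved = {"q1": ["c2", "c3", "c9"], "q2": ["c7", "c8"]}
print(recall_over_questions(items, retrieved))  # 0.5
```

Scoring against labeled chunk IDs keeps the check deterministic. If the labels are free-text facts instead, the set intersection would need to be replaced by a string-matching or model-based judge, which this sketch does not attempt.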
•Picking an embedding model that handles the domain's vocabulary poorly, without running offline recall checks to catch the mismatch.
Implementation tradeoffs
•Reranking: Cross-encoder or LLM rerankers improve top-k quality at higher latency and inference cost.
•Regression testing: Fine-tune releases need behavior suites on fixed prompts; RAG releases need recall suites on labeled questions — teams often test only one.
•Evaluation: Offline labeled sets catch regressions early; online failure logs catch drift and long-tail queries that offline suites miss.
•Eval datasets: Synthetic QA scales cheaply but can miss domain phrasing; human rubrics are slower but catch faithfulness gaps automation misses.
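The recall suite mentioned in the regression-testing and evaluation points can be sketched as a simple release gate. This is a minimal illustration, assuming each run produces a per-question pass/fail result on the same labeled set. The function names and the `max_drop` tolerance are hypothetical, not taken from any particular tool.

```python
def pass_rate(results: dict[str, bool]) -> float:
    """Share of labeled questions that passed the retrieval check."""
    if not results:
        raise ValueError("no labeled questions to score")
    return sum(results.values()) / len(results)


def find_regressions(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Questions that passed under the baseline but fail (or are missing) in the candidate."""
    return sorted(
        q for q, passed in baseline.items()
        if passed and not candidate.get(q, False)
    )


def recall_gate(
    baseline: dict[str, bool],
    candidate: dict[str, bool],
    max_drop: float = 0.0,
) -> bool:
    """True if the candidate's pass rate is within max_drop of the baseline's."""
    return pass_rate(candidate) >= pass_rate(baseline) - max_drop


# Toy results: the aggregate is unchanged, but q2 regressed while q3 improved.
baseline = {"q1": True, "q2": True, "q3": False, "q4": True}
candidate = {"q1": True, "q2": False, "q3": True, "q4": True}

print(recall_gate(baseline, candidate))       # True
print(find_regressions(baseline, candidate))  # ['q2']
```

Reporting the per-question regressions alongside the aggregate gate is a deliberate choice in this sketch: an unchanged overall rate can hide a question that used to pass and no longer does.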
Themes repeated across indexed engineering talks and practitioner writeups — not a survey, vote count, or attributed quote roundup.