Operational case studies

Clearly labeled simulated, anonymized, and representative outcomes — not named customers unless contracted separately.

Hybrid search regression after fusion deploy

anonymized

B2B SaaS copilot (anonymized)

recall@10 −22% after alpha=1.0 deploy; incidents triaged via ad-hoc log grep with no retrieve span standard.

RRF fusion restored; investigation workflow + Slack export cut triage time; eval gate blocks regressions.

Timeline: Representative 5-day path: Day 1 rollback, Day 2 benchmark, Day 3 eval gate, Day 4–5 monitor.

Retrieval: recall@10 restored from 0.58 → 0.76 on golden set (representative benchmark).

Hallucination: Faithfulness gate reduced empty-context answers (representative −40% empty-context spans).

Trust 52% → 81% · fusion config in trace · sparse index freshness check · eval delta on deploy window

simulated

Simulated enterprise support RAG

Simulated: expected chunks rank #12–#18 below threshold after embedding deploy without reindex.

Simulated: reindex + threshold calibration; investigation captures config diff per deploy.

Timeline: Simulated 7-day: reindex (3d), golden eval (2d), threshold tune (2d).

Retrieval: Simulated recall@10: 0.41 → 0.74.

Hallucination: Simulated context_recall improvement on labeled incident queries.

Trust 48% → 79% · embedding version hash · chunk rank in retrieve span · score_threshold

representative

Representative multi-tenant production RAG

Representative pattern: metadata_filter v2 over-constrains namespace; faithfulness collapse post-deploy.

Filter rollback + investigation timeline; postmortem export for customer success review.

Timeline: Representative 48h incident window with monitoring status.

Retrieval: Representative context_recall recovery after filter rollback.

Hallucination: Representative faithfulness: 0.54 → 0.88 after rollback (pattern from incident library).

Trust 55% → 83% · empty_context flag · filter predicate diff · faithfulness eval on window