RAG incident root cause
- Symptom
- Hallucination rate 3.2× post-deploy; empty-context retrieve spans spike
- Root cause
- Metadata filter dropped boundary overlap chunks; overlap 128→32
- Remediation
- Rollback filter deploy, restore overlap=128, Phoenix faithfulness gate
- Prevention
- Canary eval gate on overlap + filter diff before prod rollout
Config evidence
- • chunk_overlap: 128→32
- • metadata filter v2
- • top_k: 20
Trace / metric evidence
- • faithfulness: 0.91 → 0.54
- • blast radius: high-traffic tenant
- • postmortem trace lineage: retrieve→generate
Why this answer won: Incident chain symptom→root cause→remediation with trace/metric hard signals; production_rag_failure_incidents contract.
Rejected: Excluded: generic “AI safety” clip with no config diff or incident timeline.