Operational incident library

Anonymized patterns for retrieval debugging — symptom chains, stack fingerprints, remediation, and trust/explainability envelopes. No customer PII.

Retrieval miss after embedding deploy

trust 78%

LangChain+pgvector

Symptom chain

  1. Expected chunk ranks below score threshold
  2. recall@10 drops post deploy
  3. retrieve span max_score below gate

Remediation

  • Reindex with new embedding model hash
  • Lower score threshold temporarily with eval gate
  • Diff golden-set chunk IDs vs retrieve ranking

Evidence types

retrieve_span, eval_metrics, embedding_version

Explainability

score_threshold in trace · chunk rank position · embedding model tag

Uncertainty: Confirm namespace filter unchanged

Hybrid search regression (alpha=1)

trust 82%

Qdrant+LangChain

Symptom chain

  1. recall@10 −18% after config deploy
  2. sparse leg cold-start in trace
  3. p95 latency +12ms

Remediation

  • Restore alpha / RRF fusion from last known good
  • Warm sparse index post deploy
  • Benchmark dense-only vs hybrid on golden set

Evidence types

fusion_config, benchmark_delta, sparse_index_freshness

Explainability

alpha in retrieval config · RRF vs weighted sum mode

Metadata filter failure → empty context

trust 74%

Pinecone+LangGraph

Symptom chain

  1. empty_context=true on retrieve spans
  2. faithfulness collapse post filter v2
  3. tenant filter over-constrains namespace

Remediation

  • Rollback metadata filter deploy
  • Audit filter JSON vs prior version
  • Add eval gate on context_recall pre-prod

Evidence types

metadata_filter_diff, faithfulness_eval, trace_empty_context

Explainability

filter predicate in trace · candidate count pre/post filter

Uncertainty: Validate multi-tenant filter keys

Reranker failure / latency regression

trust 71%

LangChain+OpenSearch

Symptom chain

  1. top_k candidates drop after rerank timeout
  2. cross-encoder batch queue depth spike
  3. boundary chunks demoted below threshold

Remediation

  • Increase rerank timeout / shrink batch size
  • Cache rerank scores for hot queries
  • A/B reranker off on latency SLO breach

Evidence types

rerank_span, latency_histogram, top_k_before_after

Explainability

reranker model id · timeout events in trace

Eval regression on deploy window

trust 85%

LlamaIndex+Weaviate

Symptom chain

  1. Ragas context_precision below gate
  2. golden set harness version drift
  3. faithfulness 0.91 → 0.54

Remediation

  • Pin eval harness + dataset version
  • Block deploy on faithfulness delta > ε
  • Segment failures by query cluster

Evidence types

eval_harness_version, ragas_metrics, deploy_correlation

Explainability

eval gate config · metric deltas with timestamps

Hallucination spike after chunking change

trust 69%

LangGraph+pgvector

Symptom chain

  1. hallucination rate 3.2× baseline
  2. chunk_overlap=32 drops boundary context
  3. answer faithfulness below SLO

Remediation

  • Restore chunk_overlap ≥ 128 for policy docs
  • Re-embed affected doc partition
  • Run faithfulness gate on incident window queries

Evidence types

chunking_config, faithfulness_eval, generation_trace

Explainability

chunk_overlap in config · context length in generation span

Uncertainty: Confirm reranker unchanged