Designing evaluation sets when labels are noisy

Label noise is the quiet killer of retrieval metrics. When subject-matter experts disagree on whether a document answers a prompt, a naive majority vote collapses that uncertainty into a single gold label and hides it from everyone downstream. We start by logging disagreement codes (partial answer, outdated policy, ambiguous intent) before any aggregate score is computed.
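A minimal sketch of what logging disagreement before aggregating could look like. The names here (Judgment, summarize, the code strings) are hypothetical and only mirror the three example codes above; the point is that the majority label is stored next to the agreement rate and the codes, not in place of them.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

# Hypothetical code names, taken from the three examples in the text.
CODES = {"partial_answer", "outdated_policy", "ambiguous_intent"}


@dataclass(frozen=True)
class Judgment:
    reviewer: str
    relevant: bool
    code: Optional[str] = None  # reason for doubt, if the reviewer logged one

    def __post_init__(self):
        if self.code is not None and self.code not in CODES:
            raise ValueError(f"unknown disagreement code: {self.code}")


def summarize(judgments):
    """Return the majority label together with the disagreement behind it."""
    if not judgments:
        raise ValueError("need at least one judgment")
    votes = Counter(j.relevant for j in judgments).most_common()
    # A tie gets no majority label at all, rather than an arbitrary winner.
    tied = len(votes) > 1 and votes[0][1] == votes[1][1]
    codes = Counter(j.code for j in judgments if j.code is not None)
    return {
        "majority": None if tied else votes[0][0],
        "agreement": votes[0][1] / len(judgments),
        "codes": dict(codes),
    }
```

Keeping the agreement rate and codes on every item means a later slice can filter to high-agreement items or report the two groups separately.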

In the Model Evaluation Clinic cohort we give each team a spreadsheet that forces reviewers to cite the span of text supporting each label. That discipline slows labeling throughput at first, but it prevents false confidence when you later slice by product line. The second week focuses on building replay sets that freeze prompt versions alongside label versions, so that a change in score between two runs reflects the model and not a quietly edited prompt or label.
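One way to picture a replay set, as a sketch under assumptions: the ReplaySet class, its fields, and the fingerprint scheme below are illustrative, not the format used in the cohort. The idea is that prompts and labels are versioned together and two evaluation runs are compared only when they used the same frozen set.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ReplaySet:
    prompt_version: str
    label_version: str
    # Each item is (prompt_id, doc_id, label); all fields here are hypothetical.
    items: Tuple[Tuple[str, str, int], ...]

    def fingerprint(self) -> str:
        """Hash the versions and items so any edit changes the fingerprint."""
        payload = json.dumps(
            {
                "prompt_version": self.prompt_version,
                "label_version": self.label_version,
                "items": sorted(self.items),
            },
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def assert_comparable(a: ReplaySet, b: ReplaySet) -> None:
    """Refuse to compare scores that were computed on different frozen sets."""
    if a.fingerprint() != b.fingerprint():
        raise ValueError(
            "replay sets differ: "
            f"prompts {a.prompt_version} vs {b.prompt_version}, "
            f"labels {a.label_version} vs {b.label_version}"
        )
```

Storing the fingerprint next to each reported score makes it cheap to check, months later, that two numbers in a deck were measured against the same prompts and labels.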

The last piece is communication: metrics decks that hide reviewer disagreement often get challenged in release reviews. We publish a short appendix that states inter-rater agreement and lists the top three contested intents. That appendix travels with the model card so downstream teams know where human judgment remains load-bearing. Finally, we recommend a quarterly re-calibration session, not because scores drift magically, but because language and policies evolve faster than embeddings.
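As a rough illustration of the two numbers in that appendix, the sketch below computes Cohen's kappa for two reviewers and counts disagreements per intent. Cohen's kappa is our assumption here; the text above does not commit to a particular agreement statistic, and with more than two reviewers a different statistic would be needed. The function names are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two reviewers on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if each reviewer labeled at random with their own rates.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used a single identical label throughout
    return (observed - expected) / (1 - expected)


def top_contested(intents, labels_a, labels_b, k=3):
    """Return the k intents with the most reviewer disagreements, as (intent, count)."""
    disagreements = Counter(
        intent
        for intent, x, y in zip(intents, labels_a, labels_b)
        if x != y
    )
    return disagreements.most_common(k)
```

This ranks intents by raw disagreement count, which favors high-volume intents; dividing by the number of items per intent would surface small but heavily contested ones instead.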