Why This Matters Now

Most teams treat LLM evals and A/B tests as a fork in the road: pick one. That's a mistake. At Spotify, only about 12% of A/B tests end in a shipped positive result. Around 64% produce valid learning — a regression caught, an idea ruled out, a hypothesis refined. The win rate understates the value of experimentation.

LLM evals bring a new capability: they can assess relevance, coherence, tone, and intent alignment at scale, faster and cheaper than human annotation. But they measure quality of output, not user behavior. The right relationship is a funnel, not a fork. Evals belong before your experiment, not instead of it.

This insight, drawn from Spotify's engineering research and the work of Schultzberg and Ottens (2024), reframes how we think about evaluation infrastructure. Let's break down why evals and experiments serve different purposes, how to calibrate them, and what happens when you close the loop.

The Evaluation Funnel: Verification vs. Validation

Schultzberg and Ottens draw a critical distinction:

  • Verification — Does the output conform to quality standards? (Evals)
  • Validation — Do real users respond as predicted? (Experiments)

Evals discard non-promising candidates before they consume experiment bandwidth. They raise the hit rate of the experiments that follow. But they can't tell you whether users who received the improved version actually had better outcomes — whether the fix prevented the slow erosion of trust that eventually leads to churn. That question requires an experiment.
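As a rough illustration of the verification step, the sketch below filters candidates by an eval score before any of them take an experiment slot. The `Candidate` class, the score scale, the 0.7 threshold, and the two-slot limit are all assumptions made for this example; they are not taken from Spotify's pipeline.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    eval_score: float  # hypothetical LLM-judge score in [0, 1]


def select_for_experiment(candidates, min_score=0.7, max_slots=2):
    """Keep candidates that clear the eval bar, best first, up to max_slots."""
    passing = [c for c in candidates if c.eval_score >= min_score]
    passing.sort(key=lambda c: c.eval_score, reverse=True)
    return passing[:max_slots]


# Illustrative scores only; real scores would come from your own judge.
candidates = [
    Candidate("prompt_v1", 0.62),
    Candidate("prompt_v2", 0.81),
    Candidate("prompt_v3", 0.74),
    Candidate("prompt_v4", 0.90),
]
shortlist = select_for_experiment(candidates)
print([c.name for c in shortlist])  # ['prompt_v4', 'prompt_v2']
```

The shortlist only says which variants deserve an experiment. Whether users respond as predicted is still a question for the A/B test.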

What Evals Give You

  • Speed: Run on test sets or A/B variants in minutes.
  • Granularity: Assess dimensions you couldn't scale before (relevance, tone, intent).
  • Hypothesis generation: An LLM judge that flags trust-breaking content can surface patterns your team didn't know to look for. After the fix ships, the same judge verifies it worked.

What Evals Don't Give You

  • Business impact: Did the improved version actually drive retention or revenue?
  • Secondary metric detection: At Spotify, teams roll back about 42% of launched experiments to prevent regression in secondary metrics — session length dropping, crash rates climbing, retention eroding. No eval flagged those.
  • Long-term behavior: Long-running tasks and long-term user behavior are inherently hard to capture with evals.
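To make the secondary-metric point concrete, here is a minimal sketch of a guardrail check that runs on experiment results rather than on eval scores. The metric names, tolerances, and observed changes are hypothetical, and the comparison uses point estimates only; a production check would more likely rely on confidence intervals or non-inferiority tests.

```python
def guardrail_violations(deltas, guardrails):
    """Return the guardrail metrics whose harmful change exceeds its tolerance.

    deltas: metric name -> relative change, treatment vs. control (0.02 = +2%).
    guardrails: metric name -> (higher_is_better, tolerance).
    """
    violations = []
    for metric, (higher_is_better, tolerance) in guardrails.items():
        change = deltas[metric]
        # Express the change as "harm": a drop hurts when higher is better,
        # a rise hurts when lower is better.
        harm = -change if higher_is_better else change
        if harm > tolerance:
            violations.append(metric)
    return violations


# Hypothetical guardrails and observed changes, for illustration only.
guardrails = {
    "session_length": (True, 0.01),
    "crash_rate": (False, 0.05),
    "retention_d7": (True, 0.005),
}
deltas = {"session_length": -0.03, "crash_rate": 0.02, "retention_d7": 0.001}

print(guardrail_violations(deltas, guardrails))  # ['session_length']
```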

The Calibration Loop

Evals are proxies. They substitute a score for the outcome you actually care about, and that substitution is valid only as long as the score tracks the real outcome, as with any proxy metric.

Now LLM judges add a second calibration layer on top of traditional quantitative metrics (ranking scores, precision, recall). Both layers need validation against online outcomes. When the judge says Variant A is better, does it actually deliver a better user experience, or is the judge rewarding surface patterns that don't drive outcomes?
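One way to check whether a judge tracks online outcomes is a rank correlation between offline score differences and online lifts across past experiments. The sketch below computes Spearman's rho in plain Python. The choice of Spearman is an assumption of this example rather than a method named in the source, and the numbers are invented.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(xs, ys):
    """Pearson correlation; assumes neither input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Invented data: one entry per past experiment (treatment minus control).
judge_delta = [0.10, 0.05, -0.02, 0.20, 0.01]  # offline judge score difference
online_lift = [0.8, 0.3, -0.1, 0.5, -0.4]      # online primary-metric lift, in %

rho = spearman(judge_delta, online_lift)
print(round(rho, 2))  # 0.8
```

A rho near 1 suggests the judge orders variants the way users do. A value near 0 suggests it is rewarding surface patterns. With only a handful of experiments the estimate is noisy, so treat it as a prompt for investigation rather than a verdict.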

For example, when Anthropic released the Opus 4.5 model, Qodo's coding evals showed no improvement, yet the model had improved substantially on longer tasks, a gain that a controlled experiment would have surfaced. Miscalibration runs both ways.

Continuously adjusting the evals to improve their mapping to online outcomes makes them better verification tools. We do not rule out that, as AI develops, evals could map to online outcomes well enough to start acting as validation, but only if the offline/online calibration loop is in place.

Source: This analysis is based on Spotify's engineering blog post on LLM evals and experiments. Read the original article for deeper context.

Practical Advice: Close the Loop

  1. Run evals early and often to find the best treatments before they hit the experiment pipeline.
  2. Let the experiment validate that real users and systems respond as predicted. Monitor the metrics you didn't optimize for (guardrails).
  3. Run your LLM evals on the A/B test data itself. Did the version the judge preferred actually perform better with users? This extends the traditional evaluation funnel.
  4. When the gap between eval scores and experiment outcomes is large, treat that as diagnostic gold. Each cycle helps calibrate the next.
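Steps 3 and 4 can be sketched as a small calibration report: for each finished experiment, compare the direction the judge preferred with the direction users preferred, and list the disagreements for follow-up. The `Comparison` record, its field names, and the sample experiments are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Comparison:
    experiment: str
    judge_delta: float  # judge score, treatment minus control
    online_lift: float  # primary metric, treatment minus control
    significant: bool   # whether the online result was statistically significant


def calibration_report(comparisons):
    """Return (sign agreement rate, mismatched experiments).

    Only experiments with a significant online result are counted, because an
    inconclusive experiment gives the judge nothing to agree or disagree with.
    """
    decided = [c for c in comparisons if c.significant]
    mismatches = [
        c.experiment
        for c in decided
        if (c.judge_delta > 0) != (c.online_lift > 0)
    ]
    agreement = 1 - len(mismatches) / len(decided) if decided else None
    return agreement, mismatches


# Invented history, for illustration only.
history = [
    Comparison("summary_prompt", 0.12, 0.6, True),
    Comparison("ranking_tweak", 0.08, -0.4, True),
    Comparison("tone_rewrite", -0.05, -0.3, True),
    Comparison("cta_copy", 0.03, 0.1, False),
]

agreement, mismatches = calibration_report(history)
print(mismatches)  # ['ranking_tweak']
```

The mismatch list is the part worth reading closely: each entry is a case where the judge rubric and real user behavior pulled in different directions.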

Limitations & Caveats

  • Evals can drift over time as the model or data distribution changes. Recalibrate periodically.
  • Not every change needs the same evidence: use quick directional tests for iteration and data gathering, and rigorous tests for ship decisions.
  • Without offline-to-online calibration, evals are opinions, not evidence.

Next Steps

  • Start with a simple LLM judge for one dimension (e.g., relevance).
  • Pair it with a small A/B test on a low-risk feature.
  • Compare judge scores with experiment outcomes. Look for mismatches.
  • Iterate: adjust the judge prompt or scoring rubric based on the calibration signal.

Conclusion

Spotify already has a strong evaluation culture in the shape of experimentation. LLM evals extend that culture upstream, with a clear role in the funnel: find the best treatments before the experiment, and calibrate the judges after it.

As Ankargren (2025) argues, success comes from doing the basics well at scale. The value compounds when the system is simple enough to use, and rigorous enough to trust. Don't fork your evaluation pipeline — funnel it.
