Does LLM-as-a-judge work?

Yes, for subjective grading where there is no exact answer to match against. LLM-as-judge scales evaluation that would otherwise need a human on every output, and for many tasks it correlates well enough with human judgment to be useful. But it only works once you validate it: measure how well its scores agree with a sample of human-labeled data before you trust it, and re-check that agreement whenever you change the judge model or rubric. An unvalidated judge is an assumption, not a measurement.

What are the pitfalls of LLM-as-a-judge?

The main pitfalls are position bias (favoring whichever answer appears first in a pairwise comparison), verbosity bias (scoring longer, more confident-sounding answers higher even when they are not better), inconsistency (the judge is itself probabilistic, so the same pair can score differently across runs), and gameability (outputs can be optimized to please the judge rather than the user, and a model judging its own output can favor its own style). Treat the judge as a component that must itself be evaluated against human labels.

What is the difference between offline and online evals?

Offline evaluation runs your system against a fixed golden set, usually in CI, before you ship — repeatable, comparable across changes, and cheap, but limited to the cases you included. Online evaluation measures behavior against live traffic and real outcomes after release, catching distribution shift and inputs your golden set never anticipated. Offline tells you a change is safe to ship; online tells you it actually worked. You need both, and online failures are the best source of new offline cases.

For engineers shipping LLM features

How to evaluate an LLM agent: evals, golden sets, and LLM-as-judge

By Wibo · Amsterdam · Published 26 Jun 2026 · ~7 min read

Short answer

You can't unit-test an LLM to correctness, because the same input can take a different path on the next run. Evals are the test suite for probabilistic systems: a scored, repeatable check of whether the system reached an acceptable outcome across runs, not whether it returned one exact string once.

The core toolkit is small: a curated golden set of input-to-expected-outcome examples, an offline run of that set in CI plus online checks against live traffic, and the right scorer for each task — assertion, golden set, LLM-as-judge, or human review. Wire it into CI as a regression gate and a quality drop fails the build.

Why evals, not unit tests

A unit test asserts that a deterministic function returns one exact value. An LLM is not deterministic in practice: sampling temperature and provider-side model updates mean the same prompt can produce different, and differently worded, outputs across runs. Pinning to a single expected string makes a test that either flakes or asserts nothing useful.

Evals replace that assertion with a different question: did the system reach an acceptable outcome, across multiple runs, at an acceptable rate? That reframes testing as a measurement problem. You score outputs against a target, aggregate over a set, and track the score over time. It is closer to an SLO with an error budget than to a pass/fail unit test, and it is the discipline that lets you change a prompt or swap a model without flying blind.
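A minimal sketch of that reframing, assuming a hypothetical `system` callable that stands in for the model call and a hypothetical `is_acceptable` scorer. The fake system, the example prompt, and the 0.9 target are illustrative choices, not values from any real deployment.

```python
import random
from typing import Callable


def pass_rate(
    system: Callable[[str], str],
    is_acceptable: Callable[[str], bool],
    prompt: str,
    runs: int = 20,
) -> float:
    """Run the same prompt several times and return the fraction judged acceptable."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    passed = sum(1 for _ in range(runs) if is_acceptable(system(prompt)))
    return passed / runs


def make_fake_system(seed: int) -> Callable[[str], str]:
    """Hypothetical stand-in for a non-deterministic model call."""
    rng = random.Random(seed)
    replies = ["Paris", "The capital of France is Paris.", "Lyon"]

    def system(prompt: str) -> str:
        return rng.choice(replies)

    return system


def mentions_paris(output: str) -> bool:
    """Outcome check: accepts any wording that names the right city."""
    return "paris" in output.lower()


rate = pass_rate(make_fake_system(seed=0), mentions_paris,
                 "What is the capital of France?", runs=50)
meets_target = rate >= 0.9  # hypothetical acceptable rate
```

The scorer accepts both "Paris" and the full-sentence answer, which is the point: the check is on the outcome and the rate, not on one exact string.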

Golden sets

A golden set (also called a reference or eval set) is a curated collection of input-to-expected-outcome examples. Each row pairs an input the system will actually see with the outcome you consider correct or acceptable — sometimes an exact answer, more often a rubric or a set of properties the answer must satisfy.

The most valuable golden sets are not invented up front; they are grown from real failures. Every production bug, every escaped edge case, every "the model did something weird here" becomes a new row. Over time the set encodes the actual shape of your traffic and your hardest cases, so a passing eval run means something concrete. Keep the set version-controlled, keep it representative rather than merely large, and treat adding a case after an incident as part of the fix.
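One possible shape for a golden-set row, sketched under assumptions: the `GoldenCase` fields, the `origin` tag, and the refund example are all hypothetical and only illustrate "a set of properties the answer must satisfy" plus a record of where the case came from.

```python
from dataclasses import dataclass
from typing import Callable

Check = Callable[[str], bool]


@dataclass(frozen=True)
class GoldenCase:
    case_id: str
    input_text: str
    checks: dict[str, Check]       # property name -> predicate on the output
    origin: str = "hand-written"   # e.g. a note that the case came from an incident


def failed_checks(case: GoldenCase, output: str) -> list[str]:
    """Return the names of the properties the output does not satisfy."""
    return [name for name, check in case.checks.items() if not check(output)]


# Hypothetical row grown from a production failure.
GOLDEN_SET = [
    GoldenCase(
        case_id="refund-001",
        input_text="Can I get a refund after 40 days?",
        checks={
            "mentions_30_day_limit": lambda out: "30" in out,
            "does_not_promise_refund": lambda out: "you will be refunded" not in out.lower(),
        },
        origin="incident follow-up (hypothetical)",
    ),
]
```

Storing rows like this in a version-controlled file keeps each added case reviewable alongside the fix that prompted it.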

Offline vs online

Offline evaluation runs your system against a fixed golden set, usually in CI, before anything ships. It is repeatable, comparable across changes, and cheap to run on every commit. Its limit is that it only measures the cases you thought to include.

Online evaluation measures behavior against live traffic and real outcomes after release — sampling production runs, scoring them, and watching real signals such as task completion, user corrections, or downstream success. It catches the distribution shift and the inputs your golden set never anticipated. Offline tells you a change is safe to ship; online tells you it actually worked. You need both, and online failures are the best source of new offline cases.
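Sampling production runs can be sketched as a deterministic, hash-based sampler. This is one common approach rather than a prescribed one, and the `request_id` format and the 10% rate are assumptions.

```python
import hashlib


def should_sample(request_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of requests for online scoring.

    Hashing the request ID (instead of calling a random generator) means the
    same request is always in or out of the sample, which makes re-scoring
    and debugging reproducible.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < rate


# Hypothetical request IDs; in production these would come from your logs.
request_ids = [f"req-{i}" for i in range(10_000)]
sampled = [rid for rid in request_ids if should_sample(rid, rate=0.10)]
```

Sampled runs that score badly are candidates for new golden-set rows.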

LLM-as-judge (and its pitfalls)

For subjective qualities — is this summary faithful, is this tone right, did the answer address the question — there is often no exact string to match against. LLM-as-judge uses a model to score the output against a rubric. It scales grading that would otherwise need a human on every row, and for many tasks it correlates well enough with human judgment to be useful.

It is also a component with real failure modes, and you should treat the judge as something you must itself evaluate:

Position bias: in pairwise comparisons the judge can favor whichever answer it sees first, regardless of quality.
Verbosity bias: longer, more confident-sounding answers tend to score higher even when they are not better.
Inconsistency: the judge is itself probabilistic, so the same pair can score differently across runs.
Gameability: outputs can be optimized to please the judge rather than the user, and a model judging its own output can favor its own style.
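Position bias in particular can be probed by asking the judge twice with the answers swapped and only accepting a verdict that survives the swap. The sketch below assumes a hypothetical `judge` callable that returns "first" or "second"; the two toy judges are stand-ins, not real model calls.

```python
from typing import Callable, Optional

# Hypothetical judge signature: (question, first_answer, second_answer) -> "first" | "second"
PairwiseJudge = Callable[[str, str, str], str]


def order_robust_winner(
    judge: PairwiseJudge, question: str, answer_1: str, answer_2: str
) -> Optional[int]:
    """Return 1 or 2 if the judge picks the same answer in both orders, else None."""
    verdict_original = judge(question, answer_1, answer_2)
    verdict_swapped = judge(question, answer_2, answer_1)
    winner_original = 1 if verdict_original == "first" else 2
    winner_swapped = 2 if verdict_swapped == "first" else 1
    return winner_original if winner_original == winner_swapped else None


def always_first_judge(question: str, first: str, second: str) -> str:
    """Toy judge with pure position bias."""
    return "first"


def longer_is_better_judge(question: str, first: str, second: str) -> str:
    """Toy judge with pure verbosity bias (but no position bias)."""
    return "first" if len(first) >= len(second) else "second"
```

A `None` result flags a pair where the verdict depended on order. Note that the second toy judge passes this check while still being biased toward length, so a swap test covers position bias only.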

The mitigation is to validate the judge against a sample of human-labeled data before you trust it, measure how well its scores agree with those labels, and re-check that agreement when you change the judge model or its rubric. A judge you have not validated is an unscored assumption, not a measurement.
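Agreement with human labels can be measured in several ways. The sketch below uses raw agreement and Cohen's kappa, which corrects for agreement expected by chance; choosing kappa is an assumption here, not something the text prescribes, and the labels are made up for illustration.

```python
from collections import Counter


def raw_agreement(human: list[str], judge: list[str]) -> float:
    """Fraction of items where the judge's label equals the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Chance-corrected agreement between two label lists."""
    observed = raw_agreement(human, judge)
    n = len(human)
    human_freq, judge_freq = Counter(human), Counter(judge)
    # Probability that both raters pick the same label by chance.
    expected = sum(human_freq[label] * judge_freq[label] for label in human_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Illustrative labels only.
human_labels = ["pass", "pass", "fail", "fail"]
judge_labels = ["pass", "pass", "fail", "pass"]
agreement = raw_agreement(human_labels, judge_labels)
kappa = cohens_kappa(human_labels, judge_labels)
```

Re-running this comparison on the same human-labeled sample after a judge-model or rubric change gives a like-for-like number to compare.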

Evals as regression gates

An eval suite earns its keep when it runs automatically. Wire the offline golden-set run into CI so a change that regresses quality below a threshold fails the build, the same way a failing unit test or a dropped coverage number blocks a merge. That turns "we think this prompt change is better" into a measured claim, and it stops silent quality regressions from shipping when someone edits a prompt, upgrades a model, or refactors the harness.
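A regression gate can be as small as a function that turns per-case scores into a process exit code. This is a sketch under assumptions: scores are taken to be floats in [0, 1], and the 0.8 threshold is a placeholder you would set from your own baseline.

```python
def gate_exit_code(scores: list[float], threshold: float) -> int:
    """Return 0 (pass) if the mean score meets the threshold, else 1 (fail).

    An empty score list fails the gate: a run that scored nothing should
    not be allowed to pass silently.
    """
    if not scores:
        print("eval gate: no scores produced, failing")
        return 1
    mean_score = sum(scores) / len(scores)
    passed = mean_score >= threshold
    status = "PASS" if passed else "FAIL"
    print(f"eval gate: mean={mean_score:.3f} threshold={threshold:.3f} {status}")
    return 0 if passed else 1


# Hypothetical per-case scores from an offline golden-set run.
exit_code = gate_exit_code([1.0, 1.0, 0.5], threshold=0.8)
```

In a CI job, passing the returned value to `sys.exit` makes a non-zero result fail the build the same way a failing test would.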

One discipline matters more than the rest: pick the metric that reflects task success or outcome correctness, not just string similarity. Two answers can be worded completely differently and both be right; an exact-match score would fail a correct answer and a similarity score can pass a fluent wrong one. Score what you actually care about — did it do the job — and let the gate enforce that.
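The gap between these metrics shows up even on a toy case. The sketch below compares exact match, a character-level similarity ratio from Python's `difflib`, and an outcome check; the number-extraction heuristic is a deliberately simple assumption for illustration, not a general-purpose answer parser.

```python
import re
from difflib import SequenceMatcher
from typing import Optional

REFERENCE = "The answer is 30."
EXPECTED_VALUE = 30.0


def exact_match(output: str) -> bool:
    return output == REFERENCE


def similarity(output: str) -> float:
    """Character-level similarity to the reference string, in [0, 1]."""
    return SequenceMatcher(None, output, REFERENCE).ratio()


def last_number(output: str) -> Optional[float]:
    """Toy heuristic: treat the last number in the output as the answer."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", output)
    return float(matches[-1]) if matches else None


def outcome_correct(output: str) -> bool:
    return last_number(output) == EXPECTED_VALUE


terse_but_right = "30"
fluent_but_wrong = "The answer is 35."
```

Here `terse_but_right` fails exact match and scores low on similarity yet passes the outcome check, while `fluent_but_wrong` scores high on similarity and fails the outcome check.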

Choosing an eval approach

Eval approach	What it measures	Use when
Assertion / code checks	Hard, objective properties: valid JSON, required fields present, value in range, no banned content	The output has a checkable structure or invariant; cheapest and most reliable, so reach for it first
Golden set	Outcome correctness against curated input-to-expected examples, aggregated across the set	You have known-good answers or rubrics and want a repeatable score you can gate CI on
LLM-as-judge	Subjective quality — faithfulness, tone, relevance — scored by a model against a rubric	There is no exact answer to match and you need to scale grading; only after validating the judge against human labels
Human review	Ground truth and nuanced judgment; the reference everything else is calibrated against	Stakes are high, the task is ambiguous, or you are validating a judge or building the golden set
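The first row of the table, assertion checks, might look like the following for a system that is expected to return a JSON object. The required field names, the confidence range, and the banned term are hypothetical; substitute your own schema.

```python
import json

REQUIRED_FIELDS = {"label", "confidence"}   # hypothetical schema
BANNED_TERMS = ("password",)                # hypothetical banned content


def assertion_failures(raw_output: str) -> list[str]:
    """Return a list of failed hard checks; an empty list means all passed."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(data, dict):
        return ["top level is not a JSON object"]

    failures = []
    missing = REQUIRED_FIELDS - set(data)
    if missing:
        failures.append(f"missing fields: {sorted(missing)}")
    if "confidence" in data:
        value = data["confidence"]
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not 0.0 <= value <= 1.0:
            failures.append("confidence not a number in [0, 1]")
    if any(term in raw_output.lower() for term in BANNED_TERMS):
        failures.append("contains banned content")
    return failures
```

Checks like these need no model call and give the same answer every time, which is why they are worth running before any more expensive scorer.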

Evals before scale

Build the eval harness before you scale usage, not after. Without it you have no way to tell whether a prompt change, a model upgrade, or a new tool made things better or worse, and you are shipping on vibes. The team that can answer "did quality go up or down, and by how much" is the team that can iterate safely; the one that can't will eventually regress in production and not know why. Eval design is also a strong hiring signal that a person has actually shipped with LLMs.

Sources & provenance

Course material: AI Architect Academy Track B (eval design) — golden sets, offline vs online, judges, and CI gates.
Anthropic — platform docs on evaluation (designing and running evals).
The limitations of LLM-as-judge (position bias, verbosity bias, inconsistency, gameability, and the need to validate against human labels) are widely documented in the evaluation literature.

This is a conceptual overview; no specific benchmarks or figures are claimed, and API shapes change — verify against current provider docs before implementing. Corrections: hello@aiarch.dev.
