For engineers shipping LLM features
How to evaluate an LLM agent: evals, golden sets, and LLM-as-judge
You can't unit-test an LLM to correctness, because the same input can take a different path on the next run. Evals are the test suite for probabilistic systems: a scored, repeatable check of whether the system reached an acceptable outcome across runs, not whether it returned one exact string once.
The core toolkit is small: a curated golden set of input-to-expected-outcome examples, an offline run of that set in CI plus online checks against live traffic, and the right scorer for each task — assertion, golden set, LLM-as-judge, or human review. Wire it into CI as a regression gate and a quality drop fails the build.
Why evals, not unit tests
A unit test asserts that a deterministic function returns one exact value. An LLM is not deterministic: sampling at a nonzero temperature and model updates mean the same prompt can produce outputs that differ in content as well as wording across runs. Pinning a test to a single expected string gives you a test that either flakes or asserts nothing useful.
Evals replace that assertion with a different question: did the system reach an acceptable outcome, across multiple runs, at an acceptable rate? That reframes testing as a measurement problem. You score outputs against a target, aggregate over a set, and track the score over time. It is closer to an SLO with an error budget than to a pass/fail unit test, and it is the discipline that lets you change a prompt or swap a model without flying blind.
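As a minimal sketch of that measurement framing, the snippet below runs one prompt several times and reports the fraction of acceptable outputs. `fake_system` is a hypothetical stand-in for a real model call, and the acceptance check and the run count are illustrative assumptions rather than recommendations.

```python
import random
from typing import Callable


def pass_rate(system: Callable[[str], str],
              is_acceptable: Callable[[str], bool],
              prompt: str,
              runs: int = 20) -> float:
    """Run the same prompt `runs` times and return the fraction judged acceptable."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    passed = sum(1 for _ in range(runs) if is_acceptable(system(prompt)))
    return passed / runs


# Hypothetical stand-in for a model call: the wording varies between runs,
# and one of the candidate answers is simply wrong.
_rng = random.Random(0)
_CANDIDATES = ["Paris", "The capital is Paris.", "paris, France", "Lyon"]


def fake_system(prompt: str) -> str:
    return _rng.choice(_CANDIDATES)


def mentions_paris(output: str) -> bool:
    # Scores the outcome (is the right city named), not the exact string.
    return "paris" in output.lower()


rate = pass_rate(fake_system, mentions_paris, "What is the capital of France?")
print(f"pass rate over 20 runs: {rate:.0%}")
```

The number to track is `rate` compared against a target over time, in the same way you would watch an SLO, rather than any single output.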
Golden sets
A golden set (also called a reference or eval set) is a curated collection of input-to-expected-outcome examples. Each row pairs an input the system will actually see with the outcome you consider correct or acceptable — sometimes an exact answer, more often a rubric or a set of properties the answer must satisfy.
The most valuable golden sets are not invented up front; they are grown from real failures. Every production bug, every escaped edge case, every "the model did something weird here" becomes a new row. Over time the set encodes the actual shape of your traffic and your hardest cases, so a passing eval run means something concrete. Keep the set version-controlled, keep it representative rather than merely large, and treat adding a case after an incident as part of the fix.
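One way to picture a golden set row is as a small record that pairs an input with the properties an acceptable answer must satisfy. The sketch below is an assumption about shape, not a standard format: the field names, the two example rows, and `fake_system` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class GoldenCase:
    case_id: str
    prompt: str
    must_contain: tuple[str, ...] = ()      # properties the answer must have
    must_not_contain: tuple[str, ...] = ()  # properties the answer must not have
    origin: str = "handwritten"             # e.g. the incident that produced this row


def satisfies(case: GoldenCase, output: str) -> bool:
    """True when the output has every required phrase and no banned phrase."""
    text = output.lower()
    has_required = all(s.lower() in text for s in case.must_contain)
    has_banned = any(s.lower() in text for s in case.must_not_contain)
    return has_required and not has_banned


def run_golden_set(cases: Iterable[GoldenCase],
                   system: Callable[[str], str]) -> dict[str, bool]:
    """Map each case id to whether the system's output satisfied that case."""
    return {c.case_id: satisfies(c, system(c.prompt)) for c in cases}


GOLDEN_SET = [
    GoldenCase("refund-window", "How long do I have to return an item?",
               must_contain=("30 days",)),
    GoldenCase("no-legal-advice", "Can I sue the courier?",
               must_not_contain=("you should sue",),
               origin="added after a hypothetical incident"),
]


def fake_system(prompt: str) -> str:
    # Hypothetical system under test; the second branch is a deliberate failure.
    if "return" in prompt:
        return "You can return items within 30 days of delivery."
    return "I can't give legal advice, but you should sue if you feel wronged."


results = run_golden_set(GOLDEN_SET, fake_system)
score = sum(results.values()) / len(results)
print(results, f"score={score:.2f}")
```

In practice the rows would more likely live in a version-controlled data file than in code, so that adding a case after an incident shows up as a reviewable diff.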
Offline vs online
Offline evaluation runs your system against a fixed golden set, usually in CI, before anything ships. It is repeatable, comparable across changes, and cheap to run on every commit. Its limit is that it only measures the cases you thought to include.
Online evaluation measures behavior against live traffic and real outcomes after release: sampling production runs, scoring them, and watching real signals such as task completion, user corrections, or downstream success. It catches distribution shift and the inputs your golden set never anticipated. Offline tells you a change has not broken the cases you already know about; online tells you it actually worked on real traffic. You need both, and online failures are the best source of new offline cases.
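The online half of that loop can be sketched as deterministic sampling plus triage: pick a stable fraction of production traces, score them, and turn the failures into candidate golden-set rows. The trace fields (`trace_id`, `input`, `output`, `user_corrected`) are hypothetical, and using a user correction as the failure signal is an assumption for illustration.

```python
import hashlib
from typing import Callable


def in_sample(trace_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of traces by hashing their id."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # integer in [0, 2**64)
    return bucket < rate * 2**64


def triage(traces: list[dict],
           passed: Callable[[dict], bool],
           rate: float) -> list[dict]:
    """Score sampled traces and return failures as candidate golden-set rows."""
    candidates = []
    for trace in traces:
        if in_sample(trace["trace_id"], rate) and not passed(trace):
            candidates.append({
                "prompt": trace["input"],
                "observed": trace["output"],
                "origin": trace["trace_id"],
            })
    return candidates


# Hypothetical production traces; a user correction is treated as a failure.
traces = [
    {"trace_id": "t-1", "input": "Cancel my order", "output": "Done.",
     "user_corrected": False},
    {"trace_id": "t-2", "input": "Cancel my *second* order",
     "output": "Cancelled order #1.", "user_corrected": True},
]

new_cases = triage(traces, lambda t: not t["user_corrected"], rate=1.0)
print(new_cases)
```

Hashing the trace id rather than calling a random number generator keeps the sample stable, so the same trace is either always in or always out when the job is rerun.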
LLM-as-judge (and its pitfalls)
For subjective qualities — is this summary faithful, is this tone right, did the answer address the question — there is often no exact string to match against. LLM-as-judge uses a model to score the output against a rubric. It scales grading that would otherwise need a human on every row, and for many tasks it correlates well enough with human judgment to be useful.
It is also a component with real failure modes, and you should treat the judge as something you must itself evaluate:
- Position bias: in pairwise comparisons the judge can favor whichever answer it sees first, regardless of quality.
- Verbosity bias: longer, more confident-sounding answers tend to score higher even when they are not better.
- Inconsistency: the judge is itself probabilistic, so the same output or pair can receive different scores across runs.
- Gameability: outputs can be optimized to please the judge rather than the user, and a model that judges its own outputs can favor its own style.
The mitigation is to validate the judge against a sample of human-labeled data before you trust it, measure how well its scores agree with those labels, and re-check that agreement when you change the judge model or its rubric. A judge you have not validated is an unscored assumption, not a measurement.
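Two of these checks are easy to sketch. The first measures agreement between judge labels and human labels with Cohen's kappa, which corrects raw agreement for chance; choosing kappa is an assumption here, since the text above only asks that you measure agreement. The second asks a pairwise judge twice with the order swapped and accepts only a consistent verdict, which is one common guard against position bias. The label lists and both toy judges are hypothetical.

```python
from collections import Counter
from typing import Callable, Sequence


def cohen_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # and this sketch reports perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def debiased_preference(judge: Callable[[str, str], str], a: str, b: str) -> str:
    """Ask twice with the order swapped; `judge` returns "first" or "second"."""
    forward = judge(a, b)
    backward = judge(b, a)
    if forward == "first" and backward == "second":
        return "a"
    if forward == "second" and backward == "first":
        return "b"
    return "tie"  # the verdict flipped with position, so do not trust it


# Hypothetical labels for the same eight outputs.
human = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
judge_labels = ["pass", "pass", "fail", "fail", "fail", "fail", "pass", "pass"]
kappa = cohen_kappa(human, judge_labels)
print(f"kappa = {kappa:.2f}")  # prints "kappa = 0.75"


def always_first(x: str, y: str) -> str:
    # Toy judge with pure position bias.
    return "first"


print(debiased_preference(always_first, "answer A", "answer B"))  # prints "tie"
```

What counts as enough agreement depends on the task and the stakes, so the cutoff is yours to set and to re-check whenever the judge model or rubric changes.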
Evals as regression gates
An eval suite earns its keep when it runs automatically. Wire the offline golden-set run into CI so a change that regresses quality below a threshold fails the build, the same way a failing unit test or a dropped coverage number blocks a merge. That turns "we think this prompt change is better" into a measured claim, and it stops silent quality regressions from shipping when someone edits a prompt, upgrades a model, or refactors the harness.
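A gate can be as small as a function that compares the current golden-set score to a floor and, optionally, to the last accepted baseline. This is a sketch under assumptions: the threshold, the allowed drop, and the example scores are placeholders, and how the score reaches the script depends on your harness.

```python
from typing import Optional


def gate_failures(score: float,
                  threshold: float,
                  baseline: Optional[float] = None,
                  max_drop: float = 0.02) -> list[str]:
    """Return the reasons the build should fail; an empty list means pass."""
    reasons = []
    if score < threshold:
        reasons.append(f"score {score:.3f} is below threshold {threshold:.3f}")
    if baseline is not None and baseline - score > max_drop:
        reasons.append(
            f"score dropped {baseline - score:.3f} from baseline {baseline:.3f} "
            f"(allowed drop: {max_drop:.3f})"
        )
    return reasons


# Hypothetical numbers: the run clears the floor but regresses from baseline.
failures = gate_failures(score=0.91, threshold=0.90, baseline=0.97)
exit_code = 1 if failures else 0
for reason in failures:
    print("EVAL GATE:", reason)
print("exit code:", exit_code)
```

A CI entry point would pass `exit_code` to `sys.exit`, so that a nonzero code blocks the merge the same way a failing unit test does. Checking against a baseline as well as a floor catches a slow slide that never quite crosses the threshold.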
One discipline matters more than the rest: pick the metric that reflects task success or outcome correctness, not just string similarity. Two answers can be worded completely differently and both be right; an exact-match score would fail a correct answer and a similarity score can pass a fluent wrong one. Score what you actually care about — did it do the job — and let the gate enforce that.
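To make the difference concrete, the sketch below scores three hypothetical answers to a refund question two ways: by exact match against a reference string, and by whether the correct amount appears in the answer. The outcome check is deliberately crude (it accepts any number in the text that equals the expected amount) and stands in for whatever task-specific check fits your system.

```python
import re


def exact_match(output: str, reference: str) -> bool:
    return output.strip() == reference.strip()


def amount_correct(output: str, expected: float, tol: float = 0.005) -> bool:
    """True if any number in the output equals the expected amount."""
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", output)]
    return any(abs(n - expected) <= tol for n in numbers)


REFERENCE = "The refund is $42.50."
EXPECTED_AMOUNT = 42.50

# Hypothetical outputs: two correct but differently worded, one fluent but wrong.
outputs = [
    "Your refund of $42.50 will arrive in 3 days.",
    "Refund total: 42.5 USD",
    "We will refund $45.20 shortly.",
]

for text in outputs:
    print(f"{text!r}: exact={exact_match(text, REFERENCE)}, "
          f"outcome={amount_correct(text, EXPECTED_AMOUNT)}")
```

Exact match rejects all three answers, including the two correct ones, while the outcome check accepts the correct answers and rejects the wrong amount.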
Choosing an eval approach
| Eval approach | What it measures | Use when |
|---|---|---|
| Assertion / code checks | Hard, objective properties: valid JSON, required fields present, value in range, no banned content | The output has a checkable structure or invariant; cheapest and most reliable, so reach for it first |
| Golden set | Outcome correctness against curated input-to-expected examples, aggregated across the set | You have known-good answers or rubrics and want a repeatable score you can gate CI on |
| LLM-as-judge | Subjective quality — faithfulness, tone, relevance — scored by a model against a rubric | There is no exact answer to match and you need to scale grading; only after validating the judge against human labels |
| Human review | Ground truth and nuanced judgment; the reference everything else is calibrated against | Stakes are high, the task is ambiguous, or you are validating a judge or building the golden set |
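The first row of the table is the cheapest to implement. As a rough sketch, the checks below cover the four example properties using only the standard library; the required fields, the `confidence` range, and the banned phrase are hypothetical and would come from your own output contract.

```python
import json

REQUIRED_FIELDS = ("answer", "confidence")      # hypothetical schema
BANNED_PHRASES = ("as an ai language model",)   # hypothetical banned content


def violations(raw: str) -> list[str]:
    """Return every failed check for one raw model output; empty means pass."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]

    found = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in data]

    if "confidence" in data:
        value = data["confidence"]
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not 0.0 <= value <= 1.0:
            found.append("confidence must be a number in [0, 1]")

    answer = str(data.get("answer", "")).lower()
    found.extend(f"banned phrase: {p}" for p in BANNED_PHRASES if p in answer)
    return found


print(violations('{"answer": "Paris", "confidence": 0.9}'))  # prints []
print(violations('{"answer": "Paris", "confidence": 1.7}'))
print(violations("Sure! Here is the JSON you asked for."))
```

Returning a list of violations rather than a single boolean makes a failed run easier to debug, because each row reports which invariant broke.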
- Course material: AI Architect Academy Track B (eval design) — golden sets, offline vs online, judges, and CI gates.
- Anthropic — platform docs on evaluation (designing and running evals).
- The limitations of LLM-as-judge (position bias, verbosity bias, inconsistency, gameability, and the need to validate against human labels) are widely documented in the evaluation literature.
This is a conceptual overview; no specific benchmarks or figures are claimed, and API shapes change — verify against current provider docs before implementing. Corrections: hello@aiarch.dev.