Eval harnesses in CI: what to measure for custom AI systems
Custom AI without continuous evaluation degrades silently
Every custom AI deployment we audit has the same shape of problem when it ages: the team built a sharp v1 with hand-curated eval cases, shipped it, and then drifted. Vendor model upgrades changed behavior. Prompt edits accumulated without re-running evals. The knowledge base grew and the retrieval distribution shifted. Six months in, the system performs differently than it did at launch — and nobody can quantify how differently because the eval harness was a launch artifact, not a continuous practice.
Continuous evaluation is the difference between a custom AI system that compounds improvements over time and one that compounds regressions. The harness has to run on every change, the suites have to cover what matters, and the team has to actually block on failures. All three are cultural commitments as much as engineering ones.
Quality evals are domain-specific by construction
Quality evals measure whether the model performs the task as the domain experts define correct. Generic benchmarks (MMLU, HumanEval, the latest GPT-version-X demo) are useful for vendor selection and useless for production. The eval set has to come from the actual task: real inputs from real users, scored against the actual outcomes the domain experts want.
On a legal-review system: 200 real contracts paired with the senior associate's red-line decisions. On a clinical-summarization system: 150 real patient encounters paired with the physician's documentation. The eval is what the model needs to do, not what a leaderboard rewards. We refuse to ship custom AI without a domain-curated eval set; the absence of one is the absence of a definition of correct.
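As a sketch of what that looks like in a harness (the schema and names here are illustrative, not a prescribed format), a domain-curated eval case is just the real input, the expert's reference outcome, and the tags the later suites slice on:

```python
from dataclasses import dataclass, field

@dataclass
class EvalCase:
    """One domain-curated eval example: a real production input paired with the
    expert-approved outcome, plus tags the regression suite slices on."""
    case_id: str
    input_text: str   # real production input (contract, encounter, ticket)
    expected: str     # the domain expert's reference outcome
    tags: dict = field(default_factory=dict)  # e.g. {"category": "indemnification", "priority": "critical"}

# A quality suite is a list of these, curated with the domain experts.
quality_suite = [
    EvalCase(
        case_id="legal-0042",
        input_text="<contract text>",
        expected="<senior associate's red-line decision>",
        tags={"category": "indemnification", "priority": "critical"},
    ),
]
```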
- Eval suites in CI: 5 (quality, safety, regression, latency, cost)
- Eval set size: 300–800 examples per system, domain-curated
- CI run time: < 15 min for the full suite, gating release
- Regression threshold: ~2% on a critical category blocks the build
Safety evals measure failures that customers should never see
Quality evals measure success at the task. Safety evals measure absence of harm: the model declined the jailbreak, refused the off-policy request, didn't leak training data, didn't produce the toxic output, didn't hallucinate the legal citation. Safety has to be evaluated separately because a model can be high-quality on the task and still produce occasional unsafe outputs that quality evals don't catch.
Safety eval sets are partly generic (jailbreak corpora, prompt-injection patterns) and partly domain-specific (failure modes from your actual deployment, red-team probes against your specific model). We run safety evals at higher thresholds than quality evals — a 1% regression on safety blocks the build automatically, where the same regression on quality might be reviewable.
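A minimal sketch of that asymmetry as a CI gate; the 1% threshold is the one described above, while the function name and signature are assumptions for illustration:

```python
def gate_on_safety(baseline_pass_rate: float, candidate_pass_rate: float,
                   max_regression: float = 0.01) -> None:
    """Block the build on any safety regression beyond the threshold.
    Safety gates are stricter than quality gates: a 1% drop fails the build
    outright rather than going to review."""
    regression = baseline_pass_rate - candidate_pass_rate
    if regression > max_regression:
        raise SystemExit(
            f"Safety regression {regression:.1%} exceeds {max_regression:.0%}; blocking release"
        )

# Example: baseline 99.2% safe responses, candidate 97.8% -> build blocked.
gate_on_safety(baseline_pass_rate=0.992, candidate_pass_rate=0.978)
```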
Regression evals catch the case where the new build is better in aggregate but worse on the work that matters
Quality evals report aggregate scores. Regression evals look at the delta from the baseline: which examples did the new build get worse on, even if overall accuracy went up. A 2-point average improvement that came with a 12-point drop on a specific high-value category is a regression, not a win. The regression suite catches this.
We tag every eval example with categories — task type, customer segment, difficulty tier, business priority — and dashboard the per-category deltas on every release. The eval surfaces 'better overall but worse on enterprise refund requests' as the actual outcome, not as a buried datapoint. The conversation with the team becomes 'do we want this tradeoff,' not 'did the model get better.'
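A sketch of the per-category delta computation, assuming scores and category tags keyed by example ID; the names and the ~2% gate wiring are illustrative:

```python
from collections import defaultdict

def per_category_deltas(baseline, candidate, categories):
    """Per-category score deltas between a baseline and a candidate build.
    All three arguments map case_id -> value: baseline score, candidate score,
    and the category tag each example carries (task type, segment, priority)."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])          # category -> [base_sum, cand_sum, n]
    for case_id, cat in categories.items():
        sums[cat][0] += baseline[case_id]
        sums[cat][1] += candidate[case_id]
        sums[cat][2] += 1
    # Positive delta = improvement, negative = regression on that category.
    return {cat: (cand - base) / n for cat, (base, cand, n) in sums.items()}

def blocking_regressions(deltas, critical_categories, threshold=0.02):
    """Critical categories that dropped past the ~2% gate, even if the aggregate rose."""
    return sorted(cat for cat, d in deltas.items()
                  if cat in critical_categories and d < -threshold)
```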
Latency evals are SLA evidence, not vibes
Production AI systems have latency SLAs — first token under 600ms for chat, sub-800ms turn-take for voice, end-to-end under 5 seconds for batch. Latency evals measure whether the new build meets these SLAs at the percentiles that matter (p50, p95, p99) on representative input distributions. A new build that's 20% smarter and 40% slower is often the wrong tradeoff, and the latency eval is what surfaces it.
We run latency evals on warm and cold paths, on short and long contexts, on the input distribution drawn from production samples. The result is a latency report that quantifies the change, not a vibe-based 'feels about the same.' Builds that miss latency SLA at any percentile block the release pipeline.
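A sketch of the percentile check against the SLA using only the standard library; the budget numbers in the example are placeholders, not any specific deployment's SLA:

```python
import statistics

def latency_report(latencies_ms, sla_ms):
    """Check measured latencies against the SLA at the percentiles that matter.
    `latencies_ms` is a list of per-call measurements on the production-derived
    input distribution; `sla_ms` maps percentile -> budget, e.g. {50: 400, 95: 600, 99: 900}."""
    # statistics.quantiles with n=100 returns the 1st..99th percentile cut points.
    cuts = statistics.quantiles(latencies_ms, n=100)
    report = {}
    for pct, budget in sla_ms.items():
        measured = cuts[pct - 1]
        report[f"p{pct}"] = {"measured_ms": round(measured, 1),
                             "budget_ms": budget,
                             "pass": measured <= budget}
    return report

def gate_on_latency(report):
    """A build that misses the SLA at any percentile blocks the release."""
    if not all(row["pass"] for row in report.values()):
        raise SystemExit(f"Latency SLA missed: {report}")
```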
Cost evals catch the silent budget drift
Custom AI systems can blow their cost budget with a single prompt change that adds context, a model upgrade to a more expensive tier, or a retrieval change that returns more chunks. Cost evals measure per-call cost on the eval distribution and gate the release on cost thresholds. Cost regressions of more than 15% need explicit approval from the team that owns the budget.
We have seen multiple production incidents where a quality improvement quietly tripled the cost per call and the overrun was discovered weeks later in a finance review. The eval harness includes cost because cost is a first-class engineering metric, not a finance afterthought.
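A sketch of the cost gate with the ~15% threshold described above; the function name, signature, and the dollar figures in the example are made up for illustration:

```python
def gate_on_cost(baseline_cost_per_call, candidate_cost_per_call,
                 max_increase=0.15, approved=False):
    """Gate the release on per-call cost measured over the eval distribution.
    Increases beyond ~15% need explicit sign-off from the budget owner."""
    increase = (candidate_cost_per_call - baseline_cost_per_call) / baseline_cost_per_call
    if increase > max_increase and not approved:
        raise SystemExit(
            f"Cost per call up {increase:.0%} (${baseline_cost_per_call:.4f} -> "
            f"${candidate_cost_per_call:.4f}); needs budget-owner approval"
        )

# Example: a prompt change that adds retrieved context triples the cost -> blocked.
gate_on_cost(baseline_cost_per_call=0.0042, candidate_cost_per_call=0.0127)
```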
LLM-as-judge with calibration is how the harness scales
Hand-grading 600 eval examples per release is impossible. LLM-as-judge — a scoring model that evaluates whether the candidate response satisfies the eval criteria — scales to thousands of examples per release. The risk is that the judge has its own biases and drifts. Calibration against a human-graded subset every release is the discipline that keeps the judge honest.
If the judge agrees with humans on at least 92% of the calibration set, the judge's scores on the full eval are trusted. If agreement drops, the judge is retrained or replaced before any decisions are made. Treating the judge as ground truth without calibration is the most common eval-harness failure we see in audits.
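A sketch of the calibration gate, assuming pass/fail labels keyed by example ID; only the 92% threshold comes from the practice described above, the rest is illustrative:

```python
def judge_agreement(human_labels, judge_labels):
    """Fraction of the human-graded calibration subset on which the LLM judge
    agrees with the human grader. Both arguments map case_id -> pass/fail."""
    shared = human_labels.keys() & judge_labels.keys()
    agree = sum(human_labels[c] == judge_labels[c] for c in shared)
    return agree / len(shared)

def gate_on_judge(human_labels, judge_labels, min_agreement=0.92):
    """If judge/human agreement drops below 92%, the judge's scores on the full
    eval set are not trusted and the release decision waits on recalibration."""
    agreement = judge_agreement(human_labels, judge_labels)
    if agreement < min_agreement:
        raise SystemExit(
            f"Judge agreement {agreement:.1%} below {min_agreement:.0%}; "
            "recalibrate or replace the judge before gating on its scores"
        )
```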
The eval set evolves; the harness has to support that
Eval sets that don't grow stop measuring real risk. New failure modes from production, new edge cases from customer feedback, new categories the business added — all need to flow into the eval set continuously. We add 5–20 new examples per week to most production eval sets, sourced from real production logs flagged by the team or by anomaly detection on quality scores.
Versioning the eval set is mandatory. When the eval set changes, scores across versions are not directly comparable, which the dashboard has to make obvious. The harness reports 'eval set v3.4.2, +18 examples since v3.4.1' so the team knows what changed and how to interpret regressions.
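One way to stamp the eval set version so the dashboard can say "+18 examples since v3.4.1"; a sketch that assumes eval cases carry stable IDs, with illustrative manifest fields:

```python
import hashlib
import json

def eval_set_manifest(version, cases, previous_manifest=None):
    """Version stamp for the eval set so scores across versions aren't silently
    compared. Records the version, size, a content hash, and the delta since
    the previous version for the dashboard."""
    content_hash = hashlib.sha256(
        json.dumps(sorted(c["case_id"] for c in cases)).encode()
    ).hexdigest()[:12]
    manifest = {"version": version, "size": len(cases), "content_hash": content_hash}
    if previous_manifest:
        manifest["delta_from"] = previous_manifest["version"]
        manifest["examples_added"] = len(cases) - previous_manifest["size"]
    return manifest
```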
We caught a model regression in a Friday afternoon CI run that would have shipped Monday and tanked refund-policy accuracy by 14 points. The aggregate score barely moved; the regression eval flagged it because we tagged refund-policy as a critical category. Without the eval, we would have learned about the problem from a customer escalation ten days later.
— Engineering manager, custom AI deployment
Frequently asked
What suites should an AI eval harness run?
Five at minimum: quality (does the model perform the task on a domain-curated set), safety (does the model avoid harmful behavior on jailbreak and red-team probes), regression (per-category deltas from baseline, not just aggregate), latency (p50/p95/p99 against SLA), and cost (per-call spend against budget). All five run on every PR; failures gate the release. Without all five, AI degrades silently in dimensions the team didn't measure.
Why are domain-curated eval sets necessary?
Generic benchmarks measure general capability, not your specific task. The eval set has to come from real inputs paired with the actual outcomes domain experts want. On a legal-review system: real contracts paired with senior associate red-lines. On a clinical-summarization system: real encounters paired with physician documentation. We refuse to ship custom AI without a domain-curated eval set; its absence is the absence of a definition of correct.
How are regression evals different from quality evals?
Quality evals report aggregate scores. Regression evals report per-category deltas from the baseline, tagged by task type, customer segment, difficulty tier, and business priority. A 2-point aggregate improvement that came with a 12-point drop on a high-value category is a regression, not a win — and the regression eval surfaces this where the quality eval would not. The conversation becomes 'do we accept this tradeoff,' not 'did the model get better.'
Can LLM-as-judge be relied on for grading?
Only with calibration discipline. The judge is calibrated against a human-graded subset every release. If agreement drops below 92%, the judge is retrained or replaced before any decisions are made. Treating the judge as ground truth without calibration is the most common eval-harness failure we see in audits. Used correctly, the judge scales to thousands of examples per release; used carelessly, it produces confident garbage.
How long should the CI eval suite take?
Under 15 minutes for the full suite. Beyond that, the team starts skipping the gate or running it asynchronously, and either path ends with the suite not being run at all. Parallel execution across suites and aggressive caching for unchanged code paths get most production systems comfortably under the 15-minute budget.
Should cost really block a release?
Yes. Cost regressions of more than ~15% block the release until the team that owns the budget explicitly approves. We have seen production incidents where a quality improvement quietly tripled per-call cost and the overrun surprised finance weeks later. Cost is a first-class engineering metric. The eval harness gates on it because the alternative is finding out at month-end in a deck.