
Why Unit Tests Aren't Enough for LLM Features

All tests pass. The deploy goes green. But your LLM feature degrades silently in production — and your test suite never noticed. Here's the fundamental reason why, and what actually works instead.

Picture this: you've built a feature that uses an LLM to classify customer support tickets. You wrote unit tests. You wrote integration tests. They all pass on every CI run. You deploy with confidence.

Three weeks later, a customer flags that the routing has been wrong for days. You check your test suite — it's green. You check the model configuration — nothing changed on your end. But something changed. And your entire testing infrastructure missed it completely.

This isn't a gap in your test coverage. It's a fundamental mismatch between how software testing works and how LLMs behave.

What Unit Tests Are Built For

Unit tests work because the systems they test are deterministic. Given input X, a pure function always returns output Y. The test captures that contract. If someone breaks it, the test fails. The feedback loop is instant, local, and reliable.

This model depends on one critical assumption: the code doesn't change unless you change it. Functions don't drift. Libraries don't silently update behavior between CI runs. The math stays the same.

LLMs break every part of this assumption.

Four Reasons Unit Tests Can't Catch LLM Regression

1. Non-determinism is the baseline, not the exception.

Call the same LLM with the same prompt twice and you'll often get two different outputs. This is by design: temperature, sampling, and model stochasticity are features, not bugs. But it makes assertions fragile. You can't write `expect(output).toBe("Billing")` and have it mean anything, because the model might return "billing", "Billing issue", or a slightly different phrasing on the next run.

Teams work around this by asserting on structure (`typeof output === 'string'`) or mocking the LLM call entirely. Both approaches miss the point. Structural tests verify your parsing code, not model quality. Mocks verify that your code calls the API; they say nothing about what the API returns.

The mock problem: When you mock an LLM call in tests, you're testing that your code handles a specific, pre-written response correctly. You're not testing the model at all. The mock stays frozen while the actual model drifts — and your tests keep passing the whole time.
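To make that concrete, here's a minimal sketch of a mocked Jest test in TypeScript (the `classifyTicket` module and `llmClient` wrapper are hypothetical names for illustration). It passes on every CI run regardless of what the live model does, because the live model is never called:

```ts
// classifyTicket.test.ts -- module names are hypothetical
import { classifyTicket } from "./classifyTicket";
import { callLLM } from "./llmClient";

// Freeze the LLM response. The real model is never invoked.
jest.mock("./llmClient", () => ({
  callLLM: jest.fn().mockResolvedValue('{"category": "Billing"}'),
}));

test("routes billing tickets to the billing queue", async () => {
  const result = await classifyTicket("I was charged twice this month");

  // These assertions exercise YOUR code: JSON parsing, field mapping.
  expect(result.category).toBe("Billing");

  // This one only proves the API was called, not that it returns anything useful.
  expect(callLLM).toHaveBeenCalledTimes(1);
});
```

A useful test of your plumbing. A test of the model's behavior? Not even close.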

2. The model is a black box that changes underneath you.

OpenAI, Anthropic, and Google push model updates continuously. Safety fine-tunes, capability improvements, and cost optimizations all change behavior without changing the version string. Unless you pin a dated snapshot (and even those eventually get retired), the `gpt-4o` you call today is not the same model as the `gpt-4o` of six months ago. Your test suite runs against whichever version is live at CI time. Once deployed, your feature runs against whatever version the provider decides to serve.

Your tests passed against last week's model. This week's model is different. You never ran the tests against this week's model. The gap is invisible.

3. Prompt sensitivity makes small changes catastrophic.

LLMs are extraordinarily sensitive to prompt wording. Adding a period. Changing "classify" to "categorize." Tweaking the system message by one sentence. These changes can shift accuracy by 5–15 percentage points — sometimes more. Your unit tests run against a fixed prompt, so they don't catch what happens when prompts evolve in production, when context windows get filled differently, or when the model's response to your exact phrasing shifts over time.

4. Distribution shift happens in production, not in your test fixtures.

Your test suite has 20 labeled examples. Your production system processes thousands of inputs per day with a distribution that evolves — new product categories, new user phrasings, seasonal language patterns. A model that handles your test fixtures correctly might handle 15% of real production inputs poorly, and you'd never see it in the test results.

The coverage gap: Integration test suites for LLM features typically cover 20–100 hand-picked examples. Production traffic covers millions of input variations. The examples you test are not representative of the distribution that breaks things.

What Unit Tests Can (and Can't) Cover

| What You're Testing | Unit Tests | Continuous Evaluation |
|---|---|---|
| Your parsing code handles the response | ✓ Yes | ✓ Yes |
| The API call is constructed correctly | ✓ Yes | ✓ Yes |
| Model output quality on your eval set | ✗ No (mocked) | ✓ Yes |
| Behavior after provider model updates | ✗ No | ✓ Yes |
| Accuracy drift over weeks | ✗ No | ✓ Yes |
| Format compliance rate in production | ✗ No | ✓ Yes |
| Regression from prompt changes | ✗ No | ✓ Yes |
| Cross-model performance comparison | ✗ No | ✓ Yes |

Unit tests aren't useless for LLM features — they're just covering the wrong half of what can break. Your parsing logic, API client, and error handling should absolutely be unit tested. But the model's behavior? That requires a different approach.

What Continuous Evaluation Actually Catches

Continuous evaluation treats your LLM feature like a production service with measurable outputs — because that's what it is. Instead of a test suite that runs once and freezes, you run evaluations on a schedule: daily, or after every deploy.

Behavioral drift. When a provider update changes how your model handles a class of inputs, continuous evaluation catches it within 24 hours. You see the accuracy chart drop. You have a timestamp. You can correlate it with provider changelogs. Without continuous evaluation, you'd find out from a user report three weeks later.

Quality degradation over time. Some regressions aren't sudden — they're gradual. Format compliance slips from 99% to 96% to 93% over six weeks. No single day is alarming. The trend is. Continuous evaluation gives you the time-series data to see it coming.

Cross-model comparison before you switch. When you're considering upgrading to a newer model, you don't run a vibe check — you run your evaluation set against both models and compare accuracy, latency, format compliance, and cost. Data beats intuition every time.

Prompt change impact. Before you ship a prompt revision, run it against your evaluation set. If accuracy drops 8%, you know before it hits production. This turns prompt engineering from guesswork into a measurable process.
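The last two checks reduce to the same harness: score a baseline configuration and a candidate (a newer model, or a revised prompt) on the same evaluation set, and block the change if the candidate regresses. Here's a minimal sketch using the OpenAI Node SDK; the `Config` shape, the exact-match scoring, and the 5-point threshold are illustrative assumptions, not a prescribed API:

```ts
// evalGate.ts -- a sketch; Config, the scoring rule, and the threshold
// are illustrative choices. Requires OPENAI_API_KEY in the environment.
import OpenAI from "openai";

const client = new OpenAI();

export interface EvalCase { input: string; expected: string; }
export interface Config { model: string; systemPrompt: string; }

export async function accuracy(config: Config, evalSet: EvalCase[]): Promise<number> {
  let correct = 0;
  for (const c of evalSet) {
    const res = await client.chat.completions.create({
      model: config.model,
      messages: [
        { role: "system", content: config.systemPrompt },
        { role: "user", content: c.input },
      ],
    });
    const output = res.choices[0]?.message?.content ?? "";
    // Normalize so "billing" and "Billing" both count as correct.
    if (output.trim().toLowerCase() === c.expected.toLowerCase()) correct++;
  }
  return correct / evalSet.length;
}

// Gate a model upgrade or a prompt revision before it ships.
export async function gate(baseline: Config, candidate: Config, evalSet: EvalCase[]) {
  const base = await accuracy(baseline, evalSet);
  const cand = await accuracy(candidate, evalSet);
  console.log(`baseline ${(base * 100).toFixed(1)}% -> candidate ${(cand * 100).toFixed(1)}%`);
  if (cand < base - 0.05) {
    throw new Error("Candidate drops accuracy by more than 5 points; do not ship.");
  }
}
```

Run this in CI on every prompt or model change, and that 8% drop never reaches production.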

The operating model shift: Traditional software testing assumes your code is the variable and the dependencies are stable. LLM evaluation assumes the model is the variable and your test set is the stable ground truth. Both approaches are right — for their respective domains.

How to Set Up an Eval Pipeline

The minimum viable eval pipeline has three components:

A representative evaluation set. 50–200 real inputs from production with labeled ground-truth outputs. Not synthetic examples — actual inputs your system has processed, labeled by a human or by a higher-quality model. This is your ground truth. It needs to be maintained as your product evolves.
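In practice this can be as simple as a typed array checked into the repo (the field names here are an assumption; use whatever fits your feature):

```ts
// evalSet.ts -- hypothetical field names; the inputs are real production
// tickets, the labels are human-verified ground truth.
export interface EvalCase {
  input: string;     // verbatim production input
  expected: string;  // ground-truth label
  addedAt: string;   // when it entered the set, so stale cases can be retired
}

export const evalSet: EvalCase[] = [
  { input: "I was charged twice for my March invoice", expected: "Billing", addedAt: "2024-01-12" },
  { input: "The export button does nothing on Safari", expected: "Bug Report", addedAt: "2024-01-19" },
  // ...grow to 50-200 cases sampled from real traffic, refreshed as the product evolves
];
```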

Automated daily runs. A scheduled job that runs your evaluation set against your production model configuration and records the results: accuracy, format compliance, latency, token cost. Every run. Every day. Results stored in a queryable form so you can see trends, not just snapshots.

Regression alerts. Thresholds that trigger notifications when metrics degrade. A 5% accuracy drop. Format compliance falling below 95%. Average output length increasing by 40%. You define what "regression" means for your feature — the system tells you when it happens, before your users do.
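Wired together, and assuming the `accuracy()` helper and `evalSet` from the sketches above, the daily job plus alerting can fit in one small script; the JSONL metrics log, the thresholds, and the `notify()` placeholder are all illustrative:

```ts
// dailyEval.ts -- a sketch; schedule it with cron, a CI workflow, or any
// job runner. Thresholds and storage here are illustrative assumptions.
import { appendFileSync } from "node:fs";
import { accuracy, Config } from "./evalGate"; // from the gate sketch above
import { evalSet } from "./evalSet";           // from the eval-set sketch above

const PROD: Config = { model: "gpt-4o", systemPrompt: "Classify the support ticket..." };

// Define what "regression" means for this feature.
const MIN_ACCURACY = 0.90;

async function notify(message: string) {
  console.error(`[EVAL ALERT] ${message}`); // swap in Slack, PagerDuty, email...
}

async function main() {
  const acc = await accuracy(PROD, evalSet);

  // Append to a queryable log so you get a time series, not a snapshot.
  appendFileSync(
    "eval-history.jsonl",
    JSON.stringify({ ts: new Date().toISOString(), model: PROD.model, accuracy: acc }) + "\n",
  );

  if (acc < MIN_ACCURACY) {
    await notify(`Accuracy ${(acc * 100).toFixed(1)}% fell below the ${MIN_ACCURACY * 100}% floor.`);
  }
}

main();
```

A flat JSONL file is the simplest queryable store; move to a real database once you want dashboards and trend charts.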

Building this yourself is straightforward in concept: a cron job, a database, some charting. The hard part is the operational overhead: keeping the evaluation set fresh, running the infrastructure reliably, and tuning alert logic so it doesn't fire false positives constantly. Most teams ship something workable, then watch it go stale over the following quarter because it isn't a revenue-generating feature.

That's what Benchwright handles — continuous evaluation as infrastructure. Automated runs, regression detection, cross-model comparison, delivered as a service so the maintenance overhead isn't your problem.

The Takeaway

Keep your unit tests. They're verifying real things — your parsing code, your API client, your error handling. But don't mistake a green test suite for confidence in your LLM feature's production behavior. Those tests were written against a frozen mock of a model that has since changed.

The layer that's missing is continuous evaluation: real model calls, against a real evaluation set, on a real schedule, with real alerts when behavior changes. That's the layer that tells you what your test suite can't.

If you're shipping LLM features and relying on CI to catch regressions, you're not monitoring a production system — you're hoping nothing changed since the last deploy.

Add the layer your tests can't cover

Benchwright runs continuous evaluations against your LLM features — catching behavioral drift, format regression, and quality degradation before your users do.

Start Evaluating Free → No credit card required. No infrastructure to manage.