When teams evaluate LLM providers, they almost always do it wrong. They run a prompt, compare the outputs, pick the one that sounds best, and move on. Three months later they're dealing with inconsistent behavior, unexpected cost spikes, or mysterious accuracy drops they can't explain.
The problem isn't the evaluation — it's that they're measuring the wrong things. Output quality in a controlled test is not the same as output quality in production. What matters is what happens over time, at scale, under variance. Here's what to actually measure.
The 5 Metrics That Matter
| Metric | What It Tells You | Target Range |
|---|---|---|
| Accuracy Consistency | Does the model perform the same on identical inputs over time? | CV < 5% across daily runs |
| Latency p95 | What's your 95th percentile response time? | < 2s for most tasks |
| Cost per Eval | What's your evaluation cost per test run? | Track trend, not absolute |
| Regression Frequency | How often does behavior change unexpectedly? | ≤ 1 unexplained event per month |
| Format Compliance Rate | Does output match your expected structure? | > 98% for structured tasks |
1. Accuracy Consistency
Accuracy on day one means nothing if it drifts on day 30. Accuracy consistency is the coefficient of variation in your evaluation scores across repeated runs over weeks. A model that scores 91% Monday and 88% Friday is less consistent than one that holds 89–90% every day.
This is different from raw accuracy. A model could be consistently mediocre — always 82% — and that's stable. But if it's 95% one week and 80% the next, you can't trust it in production even if the average looks fine.
To measure this: run your evaluation set at the same time every day for at least two weeks. Plot the daily accuracy scores. If the variance is high with no external cause (no model update, no prompt change), that's a consistency problem — not a bad model, just an unstable one for your use case.
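If you want to codify that check, here is a minimal sketch of the calculation, assuming you've already collected daily accuracy scores as plain floats (the numbers below are illustrative):

```python
import statistics

def coefficient_of_variation(daily_scores: list[float]) -> float:
    """CV = standard deviation / mean, expressed as a percentage."""
    mean = statistics.mean(daily_scores)
    stdev = statistics.stdev(daily_scores)
    return (stdev / mean) * 100

# Two weeks of daily accuracy from the same eval set (illustrative numbers)
scores = [0.91, 0.89, 0.90, 0.88, 0.91, 0.90, 0.89,
          0.90, 0.88, 0.91, 0.89, 0.90, 0.90, 0.89]

cv = coefficient_of_variation(scores)
print(f"Accuracy CV over {len(scores)} runs: {cv:.1f}%")  # target: under 5%
```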
How to use it: Run accuracy consistency alongside any model upgrade evaluation. Even if a new model scores higher on average, flag it if consistency degrades — variance is invisible until it hits a critical moment in production.
2. Latency p95
Average latency lies. A model that averages 800ms but spikes to 4 seconds during peak load is worse than one that averages 1.2s but stays within 1.5s. p95 latency — the response time at the 95th percentile — tells you what your users actually experience.
Why p95 and not p99? p99 is so dominated by cold starts and rare events that it doesn't reflect user experience. p95 is where you start seeing the tail that impacts real users, not infrastructure anomalies.
Measure this in production, not just in your evaluation environment. Your eval harness probably isn't sending concurrent requests. Production will — and that's when latency compounds.
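A rough sketch of the reporting side, assuming you already log per-request latencies in production; the sample values are made up to show how far the mean and p95 can diverge:

```python
import math

def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the value at ceil(pct% * n), 1-indexed."""
    ranked = sorted(samples)
    rank = math.ceil(pct / 100 * len(ranked))
    return ranked[rank - 1]

# Per-request latencies in milliseconds, pulled from logs (illustrative numbers)
latencies_ms = [640, 720, 810, 790, 950, 3900, 680, 1100, 870, 760]

mean_ms = sum(latencies_ms) / len(latencies_ms)
p95_ms = percentile(latencies_ms, 95)
print(f"mean: {mean_ms:.0f}ms, p95: {p95_ms:.0f}ms")  # the gap is what users feel
```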
Watch for patterns: does latency creep up over the month? Does it spike during certain time windows? Provider infrastructure changes over time, and p95 trends are the canary.
3. Cost per Evaluation Run
Token cost is easy to track. Cost per eval run is what it actually costs you to run your full evaluation suite — all prompts, all inputs, all output processing. This compounds quickly.
If you're running 200 evaluation inputs daily at 500 tokens in and 150 out at $3/1M tokens, that's about $0.39/day. That sounds trivial. But run that across 5 different model configurations you're comparing, and you're at $2/day — $730/year before you ship a single feature. Some teams are running eval costs in the thousands monthly without realizing it.
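That arithmetic is worth keeping in a small helper so the cost stays visible; the flat $3-per-million rate below mirrors the example above and is not any provider's actual pricing:

```python
def eval_run_cost(num_inputs: int, tokens_in: int, tokens_out: int,
                  price_in_per_m: float, price_out_per_m: float) -> float:
    """Dollar cost for one full run of the evaluation suite."""
    total_in = num_inputs * tokens_in
    total_out = num_inputs * tokens_out
    return (total_in * price_in_per_m + total_out * price_out_per_m) / 1_000_000

# 200 inputs, 500 tokens in / 150 out, $3 per 1M tokens (flat, for illustration)
per_run = eval_run_cost(200, 500, 150, 3.0, 3.0)
per_year = per_run * 5 * 365  # 5 model configurations, run daily
print(f"${per_run:.2f} per run, ${per_year:.0f}/year across 5 configs")
```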
Track this metric not to minimize it but to make it visible. Once you see the real cost, you can make informed tradeoffs: do you need 200 inputs or is 50 statistically equivalent for your use case? Can you run the full suite weekly instead of daily?
Rule of thumb: If your evaluation cost per month exceeds your expected savings from switching models (e.g., cheaper per token), re-examine your eval strategy. Evaluations should inform decisions, not become a budget line item.
4. Regression Frequency
This is the hardest metric to measure but the most important. Regression frequency is how often the model changes behavior in ways that affect your production output — without notice from the provider.
Providers don't announce every fine-tune. Safety updates, cost optimizations, capability shifts — these happen continuously and silently. Regression frequency tracks how many times your evaluation metrics moved outside normal variance in a given period. If you see a 3%+ accuracy drop with no code or prompt change on your end, that counts as a regression event.
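A hedged sketch of that detection step, assuming you keep a rolling window of recent scores as the baseline; the thresholds are the rule of thumb above, not a universal standard:

```python
import statistics

def is_regression(baseline_scores: list[float], new_score: float,
                  min_drop: float = 0.03) -> bool:
    """Flag a run that falls below the recent baseline by 3+ points
    (scores as fractions) or by more than normal variance."""
    mean = statistics.mean(baseline_scores)
    stdev = statistics.stdev(baseline_scores)
    drop = mean - new_score
    return drop >= min_drop or new_score < mean - 2 * stdev

# Last two weeks of daily accuracy, then today's run (illustrative numbers)
baseline = [0.90, 0.89, 0.91, 0.90, 0.89, 0.90, 0.91, 0.90, 0.89, 0.90]
today = 0.86

if is_regression(baseline, today):
    print("Regression event: investigate before users notice.")
```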
You can't prevent regressions if you're using a provider's rolling release. What you can do is detect them faster than your users do. That's why continuous evaluation matters — you want to be the one who catches the drop, not the support ticket.
Target: zero unexplained regressions per month. If you get more than one, it's either a bad model fit for your use case or a sign that your evaluation set doesn't cover your production distribution well enough.
5. Format Compliance Rate
If your LLM output is consumed by code — not just humans — then format compliance rate matters as much as output quality. A classification model that's 94% accurate but only returns valid JSON 87% of the time is, at best, an 87% accurate model in your pipeline.
Format compliance means: does the output match your expected structure? For JSON extraction, does it parse cleanly? For bullet-point summaries, does it return a list or prose? For tool calls, does it include all required fields?
This metric is especially important for structured output tasks. If you're using JSON mode, tool calling, or any system where downstream code depends on consistent parsing, track what percentage of outputs your parser accepts without fallback. A drop from 99% to 94% means 5% of your production requests are hitting fallback behavior — and you might not even know it.
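A minimal compliance check for a JSON-extraction task might look like this; the required fields are hypothetical placeholders for whatever schema your parser actually expects:

```python
import json

REQUIRED_FIELDS = {"label", "confidence"}  # hypothetical schema for illustration

def is_compliant(raw_output: str) -> bool:
    """True if the output parses as JSON and carries every required field."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and REQUIRED_FIELDS <= parsed.keys()

def compliance_rate(outputs: list[str]) -> float:
    return sum(is_compliant(o) for o in outputs) / len(outputs)

outputs = ['{"label": "spam", "confidence": 0.97}',
           '{"label": "ham"}',                       # missing required field
           'Sure! Here is the JSON you asked for:']  # didn't parse at all
print(f"format compliance: {compliance_rate(outputs):.0%}")  # target: > 98%
```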
The compliance gap: Most teams discover format compliance failures through downstream errors — a parse exception, a missing field in a database insert, a malformed webhook. By the time you see the error, the output is lost. Automated format checking catches every failure, not just the ones that crash.
Putting It Together
These five metrics aren't independent. Accuracy consistency and regression frequency are related — a model with high regression frequency will have low accuracy consistency. Format compliance rate and latency often trade off — enforcing strict output schemas can slow down inference. Cost per eval and latency connect through token count and batching.
The framework isn't about finding a perfect model. It's about finding a model that's predictably good for your specific use case. A model that's 88% accurate every day is more useful than one that's 95% one week and 71% the next.
The practical workflow: establish baseline metrics with your current configuration, then re-run the same evaluation against any proposed model change before switching. That way you're comparing models on your evaluation criteria, not on the provider's marketing benchmarks.
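In practice that comparison can be a short script; the metric values and the pass/review labels below are illustrative, not thresholds from any provider:

```python
# Baseline vs. candidate on the five metrics (illustrative numbers).
baseline = {"accuracy_cv_pct": 3.1, "latency_p95_s": 1.4,
            "cost_per_run_usd": 0.39, "regressions_per_month": 0,
            "format_compliance_pct": 99.2}
candidate = {"accuracy_cv_pct": 6.8, "latency_p95_s": 1.1,
             "cost_per_run_usd": 0.21, "regressions_per_month": 1,
             "format_compliance_pct": 98.7}

lower_is_better = {"accuracy_cv_pct", "latency_p95_s",
                   "cost_per_run_usd", "regressions_per_month"}

for metric, base_val in baseline.items():
    cand_val = candidate[metric]
    improved = (cand_val < base_val) == (metric in lower_is_better)
    flag = "ok" if improved or cand_val == base_val else "REVIEW"
    print(f"{metric:>24}: {base_val:>6} -> {cand_val:<6} {flag}")
```

Anything flagged for review isn't automatically disqualifying; it just forces the tradeoff to be explicit before you switch.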
Most teams don't do this because it takes time to build a representative evaluation set and the infrastructure to run it reliably. That's the operational gap Benchwright fills — automated evaluation runs, regression detection, and provider comparison across your evaluation criteria on a continuous schedule.
Evaluation isn't a one-time decision. It's a continuous process. The teams that get the most out of LLM providers are the ones measuring them like production systems — with metrics, alerts, and baselines — not like demos.