If you haven't re-evaluated your model selection in the past six months, you are almost certainly overpaying. The LLM pricing landscape has moved more in Q1–Q2 2026 than in most full calendar years before it. Multiple flagship models dropped 50–80% in price. New model generations entered with competitive pricing from day one. And a few quiet deprecations pushed some teams onto more expensive tiers without noticing.
This is a full-provider pricing audit as of May 2026 — what changed, by how much, and what it means for production workloads. All pricing reflects published API rates. Use the Benchwright /compare tool to model your specific call volume and token mix.
The Full Pricing Table — Q2 2026
Every major provider, current rates, with change indicators versus late 2025 prices.
| Provider | Model | Input ($/1M) | Output ($/1M) | vs Late 2025 |
|---|---|---|---|---|
| OpenAI | GPT-4o | $2.50 | $10.00 | −50% input |
| OpenAI | GPT-4o mini | $0.15 | $0.60 | Stable |
| OpenAI | GPT-4.1 | $2.00 | $8.00 | NEW |
| OpenAI | GPT-4.1 Nano | $0.10 | $0.40 | NEW |
| OpenAI | o3 | $2.00 | $8.00 | −80% |
| OpenAI | o4-mini | $1.10 | $4.40 | NEW |
| OpenAI | GPT-5 | $1.25 | $10.00 | NEW |
| Anthropic | Claude Haiku 3 (retired) | $0.25 | $1.25 | EOL Apr 19 |
| Anthropic | Claude Haiku 3.5 | $0.80 | $4.00 | Stable |
| Anthropic | Claude Haiku 4.5 | $1.00 | $5.00 | NEW |
| Anthropic | Claude Sonnet 4.6 | $3.00 | $15.00 | NEW |
| Anthropic | Claude Opus 4.6 | $5.00 | $25.00 | NEW |
| Google | Gemini 2.0 Flash | $0.10 | $0.40 | EOL Jun 1 |
| Google | Gemini 2.5 Flash-Lite | $0.10 | $0.40 | NEW |
| Google | Gemini 2.5 Flash | $0.30 | $2.50 | NEW |
| Google | Gemini 2.5 Pro (≤200K) | $1.25 | $10.00 | −25% vs 1.5 Pro |
| Mistral | Mistral Small 3.1 | $0.10 | $0.30 | −75% |
| Mistral | Mistral Large 3 | $2.00 | $6.00 | −50% |
| xAI | Grok 4.3 | $1.25 | $2.50 | −83% output |
| DeepSeek | DeepSeek V3.2 | $0.28 | $0.42 | Stable |
| DeepSeek | DeepSeek R1 | $0.55 | $2.19 | Stable |
| Meta | Llama 4 Maverick (Together) | $0.15 | $0.60 | NEW |
| Cohere | Command A | $2.50 | $10.00 | NEW flagship |
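The table is the reference point, but what matters is your blended cost. Here is a minimal sketch of the arithmetic the /compare tool automates, with rates transcribed from the table above (the dictionary keys are illustrative labels, not provider API model strings):

```python
# $/1M tokens (input, output), transcribed from the Q2 2026 table above.
# Keys are labels for this sketch, not provider API model strings.
RATES = {
    "gpt-4.1-nano":     (0.10, 0.40),
    "gemini-2.5-flash": (0.30, 2.50),
    "claude-haiku-4.5": (1.00, 5.00),
    "gpt-5":            (1.25, 10.00),
}

def monthly_cost(model, calls_per_day, input_tokens, output_tokens, days=30):
    """Monthly spend at published rates (no cache or batch discounts)."""
    rate_in, rate_out = RATES[model]
    millions_in = calls_per_day * input_tokens * days / 1e6
    millions_out = calls_per_day * output_tokens * days / 1e6
    return millions_in * rate_in + millions_out * rate_out

# Example profile: 10,000 calls/day, 1,500 input + 400 output tokens per call
for model in RATES:
    print(f"{model:18s} ${monthly_cost(model, 10_000, 1_500, 400):>9,.2f}/month")
```

Swap in your own call volume and token mix before drawing conclusions: the ranking often changes once output-heavy workloads meet models with cheap input but comparatively expensive output.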
Who Got Cheaper (and by How Much)
OpenAI — Aggressive Repricing Across the Board
OpenAI has made the most dramatic pricing moves of any major provider in 2026. GPT-4o input dropped from $5/M to $2.50/M in a cut that happened quietly in mid-2025 and held into Q2 2026. The bigger story is o3: at launch it was priced at $10 input / $40 output per million tokens. It now sits at $2/$8 — an 80% reduction in under a year.
The GPT-4.1 family is the other structural change. GPT-4.1 Nano at $0.10/$0.40 matches Gemini 2.5 Flash-Lite on price while keeping teams inside OpenAI's ecosystem. GPT-5 launched at $1.25 input / $10.00 output — cheaper input than GPT-4o was a year ago, with better capability.
The o3 repricing signal: When a reasoning model drops 80% in price in one year, it's not a product decision — it's a statement about where compute costs are heading. Reasoning at scale is becoming economically viable for production workloads that would have been cost-prohibitive in 2024.
xAI Grok — Biggest Single-Cut Story of Q2
Grok 4.3 launched around April 30, 2026 at $1.25/$2.50 — replacing Grok 3 at $3/$15. That's an 83% reduction in output cost for the flagship model. The output price of $2.50/M puts it well below GPT-4o and Claude Sonnet on the same dimension, while the 1M context window is a meaningful differentiator for long-document workloads.
xAI still has a thin track record on production reliability compared to OpenAI and Anthropic. But at these prices, it warrants a place in your evaluation set.
Mistral — Steady Downward Drift
Mistral Large went from ~$4/$12 (Large 2) to ~$2/$6 (Large 3) — roughly a 50% reduction. Mistral Small 3.1 at $0.10/$0.30 is now one of the cheapest options from a European provider, useful for teams with data residency constraints or a need for provider diversification.
Google — New Generation, Better Value
Gemini 2.5 Pro at $1.25/$10 replaces the now-deprecated Gemini 1.5 Pro: input holds at $1.25 while output rises from $5, and the threshold before long-context rates apply moves up to 200K tokens. Gemini 2.5 Flash at $0.30/$2.50 is the interesting one: it has 1M context, solid multimodal capabilities, and a price point that makes it viable as a default for many production workloads that previously defaulted to GPT-4o mini.
Who Got More Expensive (and Why)
Not all movement was down. Two situations quietly raised costs for teams that weren't paying attention.
Anthropic's Budget Tier Repriced Upward
Claude Haiku 3 — priced at $0.25 input / $1.25 output — retired on April 19, 2026. Teams that didn't migrate were bumped to Claude Haiku 3.5 at $0.80/$4.00 or Claude Haiku 4.5 at $1.00/$5.00.
That's a 3–4× cost increase on output tokens for anyone who didn't notice the deprecation. At 10,000 calls/day with 400 completion tokens per call (120M output tokens/month; see the quick check after this list):
- Claude Haiku 3: ~$150/month in output costs
- Claude Haiku 3.5: ~$480/month in output costs
- Claude Haiku 4.5: ~$600/month in output costs
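A quick check of those figures, and the annualized impact — a sketch assuming the same 10,000-call, 400-output-token profile:

```python
output_millions_per_month = 10_000 * 400 * 30 / 1e6  # 120M output tokens/month

haiku_3 = output_millions_per_month * 1.25    # $150/month at $1.25/1M output
haiku_4_5 = output_millions_per_month * 5.00  # $600/month at $5.00/1M output

print(f"{haiku_4_5 / haiku_3:.0f}x increase; "
      f"${(haiku_4_5 - haiku_3) * 12:,.0f}/year in added output spend")
# -> 4x increase; $5,400/year in added output spend
```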
If your budget was built around Haiku 3 and you weren't monitoring costs, this was a silent 3× increase that hit on a specific date. This is exactly the kind of change Benchwright's continuous monitoring flags — not a regression in output quality, but a pricing event that changes your cost structure overnight.
Action required if you're on Haiku 3: The model retired April 19. If you haven't migrated, you're either hitting errors or being routed to a replacement. Check your API costs from the past 30 days against the prior 30 days — the jump will be visible.
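If your billing console exports daily spend, that comparison is a few lines. A sketch assuming a hypothetical costs.csv export with date,usd columns (provider export formats vary, so adjust the parsing to match yours):

```python
import csv
from datetime import timedelta, date

# Hypothetical export: one row per day, columns "date,usd".
with open("costs.csv") as f:
    daily = {date.fromisoformat(row["date"]): float(row["usd"])
             for row in csv.DictReader(f)}

latest = max(daily)
cut = latest - timedelta(days=30)
last_30 = sum(usd for d, usd in daily.items() if d > cut)
prior_30 = sum(usd for d, usd in daily.items()
               if cut - timedelta(days=30) < d <= cut)

print(f"last 30d ${last_30:,.2f} vs prior 30d ${prior_30:,.2f} "
      f"({100 * (last_30 / prior_30 - 1):+.0f}%)")
```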
The Hidden Cost of Not Re-Evaluating
Claude 3 Opus ($15/$75) is still technically accessible but has been functionally superseded by Claude Opus 4.6 ($5/$25). Teams still running Opus 3 are paying 3× the output cost for an older model. That's not a price increase from Anthropic — it's a failure to migrate that creates the same effect.
Same pattern with GPT-4 Turbo ($10/$30) vs GPT-4o ($2.50/$10): a 75% input saving, and two-thirds off output, is sitting there for teams that haven't updated their model string.
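In many codebases the migration really is one line. A sketch using the OpenAI Python SDK — the SDK shape is current as of this writing, but confirm the model string against OpenAI's live model list before deploying:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    # model="gpt-4-turbo",  # legacy: $10/$30 per 1M tokens
    model="gpt-4o",         # same request shape at $2.50/$10 per 1M tokens
    messages=[{"role": "user", "content": "Summarize this support ticket: ..."}],
)
print(response.choices[0].message.content)
```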
What This Means for Production Workloads
The Budget Tier Is Now Genuinely Capable
In 2024, "cheap" meant compromising significantly on quality. In Q2 2026, GPT-4.1 Nano at $0.10/$0.40, Gemini 2.5 Flash-Lite at $0.10/$0.40, and Mistral Small 3.1 at $0.10/$0.30 are all significantly more capable than what was considered "flagship" 18 months ago.
For classification, extraction, summarization, and light reasoning tasks, defaulting to a $0.10/M input model and validating the quality tradeoff is the right starting point — not the fallback.
Reasoning Models Are Becoming Viable at Scale
o3 at $2/$8 and o4-mini at $1.10/$4.40 are priced in the same range as non-reasoning frontier models from a year ago. For workloads that benefit from chain-of-thought — complex code generation, multi-step data extraction, decision support — the price delta versus a standard model no longer represents a major budget line item.
Provider Diversification Has Real Risk-Adjusted Value
The Haiku 3 retirement is a reminder: when you build a production workload on a single provider's specific model, that provider controls your cost structure. DeepSeek at $0.28/$0.42 and Mistral Small at $0.10/$0.30 are real alternatives for teams with high-volume, quality-tolerant workloads. The diversification is not just about price — it's about not having your budget repriced by a deprecation decision you didn't see coming.
Caching and Batching Discounts Are Now Universal
Every major provider now offers batch API discounts (typically 50%) and prompt cache discounts (typically 50–90% on cache hits). For production workloads with repeated system prompts, few-shot examples, or shared context — and that's most of them — effective rates are half to one-tenth of the published prices. If you're not using caching, your real cost is roughly double what it should be.
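To see what "half to one-tenth" means for your mix, the blended-rate arithmetic is simple. A sketch with the discount levels as parameters, since exact cache and batch terms vary by provider:

```python
def effective_input_rate(base, cache_hit_rate, cache_discount=0.90, batched=False):
    """Blended $/1M input rate given a prompt-cache hit rate.

    Assumes a flat discount on cache hits and an optional 50% batch
    discount on everything -- check your provider's actual terms.
    """
    rate = base * (1 - cache_hit_rate) + base * (1 - cache_discount) * cache_hit_rate
    return rate * (0.5 if batched else 1.0)

# GPT-4o input at $2.50/M with 80% of prompt tokens served from cache:
print(effective_input_rate(2.50, 0.80))                # -> 0.70
print(effective_input_rate(2.50, 0.80, batched=True))  # -> 0.35
```

At an 80% cache hit rate plus batching, GPT-4o's $2.50 input rate lands at $0.35 — roughly one-seventh of list price, before you've changed models at all.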
The Headline Number
GPT-4 launched in March 2023 at $30 input / $60 output per million tokens. GPT-5 is available today at $1.25 input / $10 output. That's a 96% reduction in input cost in just over three years.
More practically: GPT-4o class intelligence — the quality benchmark for production AI in 2024 — is now available from multiple providers at $1–3/M input. The question is no longer "can we afford to use a capable model?" It's "which capable model fits our workload, and are we measuring it continuously enough to catch the moment that answer changes?"
Prices will keep moving. The model you benchmarked last quarter is not the best option today, and the pricing you budgeted last quarter is not the right number to plan against. The only reliable approach is to keep measuring — which is what the Benchwright /compare tool is built for.
The Three Decisions to Make Now
1. Check whether any model you're running has been deprecated or repriced. Haiku 3 retired April 19. Gemini 2.0 Flash retires June 1. GPT-4 Turbo and Claude 3 Opus are legacy cost centers. If you haven't explicitly confirmed your current model strings against provider documentation in the past 60 days, do it today.
2. Add Gemini 2.5 Flash and GPT-4.1 Nano to your next evaluation run. These two represent the best value points in the Q2 2026 market for high-volume workloads. Most teams haven't evaluated them yet; the ones that have tend to come away surprised by the quality-to-cost ratio.
3. Enable prompt caching if you haven't already. If your workload has any repeated context — system prompts, instructions, few-shot examples — you're likely paying 2× what you should be. The implementation is usually a single flag or a minor API change.
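As one concrete illustration, Anthropic's prompt caching works by tagging the stable prefix of a request. A minimal sketch with the anthropic Python SDK — the model string is taken from the table above and may not match the provider's exact identifier, so check the live model list:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# The stable prefix: system instructions plus few-shot examples.
SYSTEM_PROMPT = "...your long, repeated instructions and examples..."

response = client.messages.create(
    model="claude-haiku-4.5",  # illustrative string from the table above
    max_tokens=512,
    system=[{
        "type": "text",
        "text": SYSTEM_PROMPT,
        # The one-flag change: mark the stable prefix as cacheable.
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{"role": "user", "content": "Classify this ticket: ..."}],
)
print(response.content[0].text)
```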
Also on Benchwright:
- How LLM Model Updates Silently Break Production Features — the silent regression problem
- Why Unit Tests Aren't Enough for LLM Features — what manual testing misses
- 5 Signs Your LLM Feature Is Silently Degrading