DROPTHE_
Money

Inference Costs Fell 90% in 18 Months

The cost of running AI models in production dropped faster than anyone projected. The startups that bet on expensive inference are now repricing everything.

DropThe Data Desk · 6 min
Image: financial data dashboard with declining cost charts (Unsplash)

In March 2023, GPT-4 launched at $60 per million output tokens. In April 2026, GPT-4o mini costs $0.60 per million output tokens. That is a 100x price reduction in three years. The steepest portion of that decline happened in the last 18 months.

This is not a gradual optimization story. It is a structural repricing of what it costs to run intelligence as a service.

The raw numbers

OpenAI's pricing trajectory tells the clearest version. GPT-4 at launch: $30 per million input tokens, $60 per million output. GPT-4 Turbo in late 2023 cut that to $10/$30. GPT-4o in mid-2024 brought it to $5/$15. GPT-4o mini landed at $0.15/$0.60. None of these steps was an incremental trim: each generation cut prices by at least 2x, and the jump to mini was a 25x drop on output tokens.
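The step-downs in that trajectory are easy to check directly from the listed output-token prices. A quick sketch:

```python
# Successive output-token prices per million, as listed above.
prices = [
    ("GPT-4 (Mar 2023)", 60.00),
    ("GPT-4 Turbo (late 2023)", 30.00),
    ("GPT-4o (mid-2024)", 15.00),
    ("GPT-4o mini", 0.60),
]

# Each generation's price divided by its successor's gives the step-down factor.
for (name_a, price_a), (name_b, price_b) in zip(prices, prices[1:]):
    print(f"{name_a} -> {name_b}: {price_a / price_b:.0f}x cheaper")

overall = prices[0][1] / prices[-1][1]
print(f"Overall reduction: {overall:.0f}x")
```

The final step dominates: two 2x cuts followed by a 25x cut compound to the 100x headline figure.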

Anthropic followed a parallel track. Claude 3 Opus launched at $15/$75 per million tokens in early 2024. Claude 3.5 Sonnet offered near-Opus quality at $3/$15. Claude 3.5 Haiku dropped to $0.80/$4. The pattern: each generation delivers the previous flagship's quality at a fraction of the previous budget tier's price.

Google undercut both with Gemini 1.5 Flash at $0.075/$0.30 per million tokens, and Gemini 2.0 Flash pushed even lower. For high-volume use cases with caching, effective per-token costs are now measured in fractions of a cent.
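To see how caching pushes effective costs below even the list price, here is a blended-rate sketch. The 75% cached-token discount and 90% hit rate are illustrative assumptions, not any provider's actual discount schedule:

```python
# Blended effective input cost with prompt caching.
# Base rate: Gemini 1.5 Flash input, $0.075 per million tokens (as listed).
base_rate = 0.075 / 1_000_000       # $ per input token
cached_rate = base_rate * 0.25      # ASSUMED: cached tokens billed at 25% of base
hit_rate = 0.90                     # ASSUMED: 90% of input tokens served from cache

# Weighted average of cached and uncached token rates.
blended = hit_rate * cached_rate + (1 - hit_rate) * base_rate
print(f"Effective cost: ${blended * 1_000_000:.4f} per million input tokens")
```

Under those assumptions, a million input tokens costs about two and a half cents, roughly a third of the already-low list price.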

The solar panel parallel

This curve looks familiar. Solar panel costs fell 99% between 1976 and 2023, following a predictable learning curve called Swanson's Law. Each doubling of cumulative production capacity reduced costs by roughly 20%.
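As a sanity check on those figures, a 99% decline at a 20% learning rate implies roughly 20 doublings of cumulative capacity:

```python
import math

# Swanson's Law: each doubling of cumulative production cuts cost ~20%,
# so cost after n doublings is cost_0 * (1 - 0.20)^n.
learning_rate = 0.20
remaining_fraction = 1 - 0.99   # cost fraction left after a 99% decline

# Solve (1 - learning_rate)^n = remaining_fraction for n.
doublings = math.log(remaining_fraction) / math.log(1 - learning_rate)
print(f"{doublings:.1f} doublings")   # ~20.6
```

Twenty-plus doublings over 47 years is consistent with solar's long, steady ramp; inference is compressing a comparable decline into a few years.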

Inference pricing is following a steeper version of the same dynamic. The drivers are different (distillation, quantization, speculative decoding, better hardware utilization) but the shape is identical. Costs fall. Then they fall faster than projected. Then entire business models built on the old cost structure become unviable.

The solar analogy breaks in one important way: nobody shipped a "worse sun" to get cheaper energy. In AI, cheaper inference often means a smaller, more efficient model. Quality tiers still exist. Running Haiku is not the same as running Opus. The $0.60 token is not doing the same work as the $75 token.

Who this hurts

Between 2023 and early 2025, dozens of startups raised money on a specific thesis: inference is expensive, and we have figured out how to make it cheaper. Inference optimization companies. Prompt compression startups. Routing layers that sent easy queries to cheap models. Caching services that reduced redundant API calls.

That category is now compressed. When the frontier labs themselves ship models that are 100x cheaper than their predecessors, the optimization layer's value proposition shrinks. You cannot build a durable business saving customers money on a cost that is already approaching zero for commodity workloads.

The startups that built pricing models assuming $30-60 per million tokens as the baseline now face customers who can get comparable quality for under $1. Their gross margins assumed expensive inference as a permanent condition. It was not.

Who this helps

Every company that delayed AI integration because the per-query economics did not work at their scale. Customer service automation that was unaffordable at $60 per million tokens becomes obvious at $0.60. Document processing, code review, content moderation, search re-ranking: all of these use cases cross the viability threshold as costs drop by orders of magnitude.
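The viability shift is easiest to see per transaction. A rough sketch, assuming ~500 output tokens per customer-service reply (an illustrative figure):

```python
# Cost per automated reply at launch-era vs. current output-token prices.
tokens_per_reply = 500                # ASSUMED: typical reply length
old_price = 60.00 / 1_000_000         # $ per output token, GPT-4 at launch
new_price = 0.60 / 1_000_000          # $ per output token, GPT-4o mini

old_cost = tokens_per_reply * old_price
new_cost = tokens_per_reply * new_price
print(f"${old_cost:.4f} per reply -> ${new_cost:.6f} per reply")

# At a million replies a month, that is the difference between
# a $30,000 line item and a $300 rounding error.
print(f"${old_cost * 1_000_000:,.0f}/mo -> ${new_cost * 1_000_000:,.0f}/mo")
```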

The beneficiaries are not the inference layer companies. They are the application companies that can now embed AI into workflows where the cost per transaction is low enough to be invisible. A SaaS product that adds an AI feature costing $0.001 per user action does not need to charge extra for it. It just becomes part of the product.
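That $0.001-per-action figure is not a tight budget at current prices. Inverting the arithmetic shows how many output tokens it buys:

```python
# Output tokens purchasable per user action at GPT-4o mini's listed rate.
budget_per_action = 0.001             # $ the feature can spend per action
price_per_token = 0.60 / 1_000_000    # $ per output token

tokens = budget_per_action / price_per_token
print(f"{tokens:,.0f} output tokens per action")   # ~1,667
```

Nearly 1,700 output tokens per action covers a substantial generated response, which is why the cost can disappear into the product instead of the price sheet.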

The quality caveat

Cheap inference on a weak model is not cheap intelligence. It is cheap text generation. The 100x cost reduction compares GPT-4 at launch to GPT-4o mini today. These are not the same model. Mini is good. It handles most routine tasks well. But for complex reasoning, multi-step planning, and tasks that require the frontier, you still pay frontier prices.

Claude Opus 4 costs $15/$75. GPT-4o costs $2.50/$10. The top of the market has not collapsed. What collapsed is the floor. The minimum cost to run a competent model in production went from tens of dollars per million tokens to under a dollar. That changes who can afford to build with AI. It does not change what the best AI costs.

What comes next

If the trajectory holds, and there is no obvious reason it will not, commodity inference will approach zero marginal cost within two years. Not literally free, but cheap enough that it stops being a line item anyone tracks. Like bandwidth. Like storage. Like compute for basic web serving.
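One way to make "if the trajectory holds" concrete: fit the observed decline to an exponential and extrapolate. This assumes the ~100x drop over roughly three years continues at the same pace, which is an extrapolation, not a forecast:

```python
import math

# Observed: ~100x price decline over ~36 months.
months_observed = 36
decline_factor = 100

# Implied halving time under a constant exponential decline.
halving_time = months_observed / math.log2(decline_factor)
print(f"Prices halving every {halving_time:.1f} months")   # ~5.4

# Extrapolate GPT-4o mini's $0.60/M output rate 24 months forward.
projected = 0.60 * (1 / decline_factor) ** (24 / months_observed)
print(f"Projected: ~${projected:.3f} per million output tokens")
```

A halving time under six months is what makes inference cost a poor foundation for a pricing model: any margin built on today's rate is gone within a funding cycle.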

The remaining pricing power will concentrate at the frontier. The latest, most capable models will still command premium prices because they do things cheaper models cannot. The gap between "good enough" and "best available" will widen in absolute capability terms even as the price of "good enough" approaches zero.

For the AI industry, this means the money shifts. Less value in running models. More value in what you do with the output. The picks-and-shovels era of AI infrastructure is giving way to an applications era. The companies that win from here are not the ones with the cheapest GPUs. They are the ones with the best products built on top of inference that now costs almost nothing.

Sources

  1. Anthropic API Pricing (Anthropic, accessed 2026-05-01)
  2. OpenAI API Pricing (OpenAI, accessed 2026-05-01)
  3. Google Gemini API Pricing (Google, accessed 2026-05-01)
pricing · inference · startups · economics
