The subsidized access phase is closing
The token free lunch is ending.
Flat-rate access to frontier AI coding tools was always a subsidy play. Coding agents made the cost too visible to keep hiding.
On April 27, 2026, GitHub announced that Copilot is switching to usage-based billing on June 1, 2026. Their own blog post says the current model is "no longer sustainable." That is a company saying out loud that it has been absorbing inference cost, and it is done absorbing it the old way. This is not a minor pricing update. It is the end of a subsidy phase.
The free lunch was never the model. It was the subsidy.
Flat-rate AI coding subscriptions worked the same way early ride-share pricing worked: charge less than cost to build the habit, then reprice once the habit is locked in. The difference with AI is that the marginal cost is directly metered. Every token has a price. Every model call has a price. Every retry has a price. The only question was how long vendors would absorb it.
For most of the major coding tools, the answer is: until now.
This is not an anti-AI argument. Coding agents are genuinely useful, and the tools have improved fast. The point is narrower: the era of frontier model access at a flat price that decouples subscription revenue from actual consumption is ending, and the teams still assuming it will stay that way are about to be surprised by what the accounting looks like underneath.
Coding agents changed the unit of consumption.
An old-style code completion was predictable: short prompt, short output, small token footprint. Flat-rate billing worked because the distribution of requests was manageable. The median user was not expensive. Even heavy users of tab completion were not spending orders of magnitude more than light users.
A coding agent breaks that model. It does not complete lines. It executes plans across a repo. One user prompt might expand into something that looks more like this:
```js
const agentRun = {
  input: ["repo context", "tool output", "test failures", "previous attempts"],
  output: ["patches", "summaries", "commands", "follow-up plans"],
  hiddenCostDrivers: ["retries", "reasoning", "long context", "model choice"],
};
```

That loop can consume more tokens than a hundred short completions. Retries alone can double the context. Add reasoning, tool calls, test output, and a large codebase, and "one message" becomes a very poor description of what just ran through the model.
GitHub said this explicitly in their announcement: Copilot "evolved from a simple in-editor code completion tool to an agentic platform capable of handling long, multi-step coding tasks across entire repositories." That evolution is also a cost evolution, and the flat-rate model was not built for it.
Copilot made the accounting visible.
GitHub's announcement is worth reading carefully because it says several things plainly that other vendors have only implied. Starting June 1, 2026, premium request units are replaced by GitHub AI Credits. One AI Credit equals $0.01. Usage is calculated from input tokens, output tokens, and cached tokens, using listed API rates. Code completion stays included in paid plans, but agentic work and model-choice premiums now come out of your credit pool.
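Mechanically, that accounting reduces to one formula. A minimal sketch of the arithmetic; the per-model rates come from GitHub's published table, so they are a parameter here rather than assumed values:

```ts
// Credits for a single model call, with rates in credits per million tokens.
// One AI Credit = $0.01, so dollars = credits / 100.
function creditsForCall(
  tokens: { input: number; cached: number; output: number },
  rates: { input: number; cached: number; output: number },
): number {
  const perMillion = (n: number) => n / 1_000_000;
  return (
    perMillion(tokens.input) * rates.input +
    perMillion(tokens.cached) * rates.cached +
    perMillion(tokens.output) * rates.output
  );
}
```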
The monthly credit allocations by plan:
- Free: 50 AI Credits per month.
- Pro: 1,000 AI Credits per month.
- Pro+: 3,900 AI Credits per month.
- Business: 1,900 AI Credits per user per month (pooled across the org).
- Enterprise: 3,900 AI Credits per user per month (pooled across the org).
The pooling for Business and Enterprise is a genuine benefit: light users offset heavy ones, and you do not need to perfectly predict individual consumption. But there is no automatic fallback to a cheaper model when credits run out. Usage either continues as overage, if your org allows it, or stops. Heavy agent sessions on the most expensive models can hit the ceiling quickly, especially if a few engineers are running long multi-file debugging tasks.
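Some rough capacity math makes that ceiling concrete. A sketch, where the per-user daily consumption figures are assumptions for illustration, not measured values:

```ts
// Business plan: 1,900 pooled AI Credits per user per month (from the list above).
const CREDITS_PER_SEAT = 1_900;

function daysUntilPoolEmpty(
  seats: number,
  heavyUsers: number,
  heavyCreditsPerDay: number, // assumption: long agent sessions on frontier models
  lightCreditsPerDay: number, // assumption: completions and occasional agent use
): number {
  const pool = seats * CREDITS_PER_SEAT;
  const dailyBurn =
    heavyUsers * heavyCreditsPerDay +
    (seats - heavyUsers) * lightCreditsPerDay;
  return pool / dailyBurn;
}

// 25 seats pool 47,500 credits. Three engineers at ~900 credits/day plus
// 22 light users at ~40/day burn 3,580/day: the pool lasts about 13 days.
console.log(daysUntilPoolEmpty(25, 3, 900, 40));
```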
The model pricing table is where the real news is. GitHub's own examples, old premium request rate against new rate:
- Claude Sonnet 4.6: 1x to 9x
- Claude Opus 4.6: 3x to 27x
- GPT-5.4 mini: 0.33x to 6x
- GPT-5.4: 1x to 6x
- GPT-5.3-Codex: 1x to 6x
Those multipliers are not arbitrary. They reflect what those models cost when the accounting gets closer to API rates. Copilot was absorbing the gap between what it charged and what inference cost. The new system stops doing that. The upstream economics were always there; users just did not see them in this form.
Codex is already moving toward token math.
OpenAI moved Codex pricing to API token-based rates for some plan types on April 2, 2026. Credits remain the purchased unit, but what you spend is driven by actual token consumption, replacing what OpenAI calls "average per-message estimates" with a direct mapping between what the model processed and what you owe.
The practical effect is real for anyone doing serious agent work. Larger codebases, longer conversations, extended reasoning, and multi-step tool use cost significantly more per session than a clean short prompt. That was always true at the infrastructure level. Now the statement reflects it.
The current Codex rate card for the plan types already on token pricing (credits per million tokens):
- GPT-5.5: 125 input, 12.50 cached input, 750 output.
- GPT-5.4: 62.50 input, 6.25 cached input, 375 output.
- GPT-5.4 mini: 18.75 input, 1.875 cached input, 113 output.
- GPT-5.3-Codex: 43.75 input, 4.375 cached input, 350 output.
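Those rates make session cost directly computable. A minimal sketch using the rate card above; the session token counts are invented for illustration:

```ts
// Codex rate card from the list above, in credits per million tokens.
const RATES = {
  "gpt-5.5":       { input: 125,   cachedInput: 12.5,  output: 750 },
  "gpt-5.4":       { input: 62.5,  cachedInput: 6.25,  output: 375 },
  "gpt-5.4-mini":  { input: 18.75, cachedInput: 1.875, output: 113 },
  "gpt-5.3-codex": { input: 43.75, cachedInput: 4.375, output: 350 },
} as const;

function sessionCredits(
  model: keyof typeof RATES,
  tokens: { input: number; cachedInput: number; output: number },
): number {
  const r = RATES[model];
  const m = (n: number) => n / 1_000_000;
  return (
    m(tokens.input) * r.input +
    m(tokens.cachedInput) * r.cachedInput +
    m(tokens.output) * r.output
  );
}

// A hypothetical long agent session: 2M fresh input, 6M cached, 300k output.
const session = { input: 2_000_000, cachedInput: 6_000_000, output: 300_000 };
console.log(sessionCredits("gpt-5.4", session)); // 125 + 37.5 + 112.5 = 275 credits
console.log(sessionCredits("gpt-5.5", session)); // 250 + 75  + 225  = 550 credits
```

Note that GPT-5.5 prices out at exactly twice GPT-5.4 in every category, so the same session costs exactly double. That is the gap the efficiency claim below has to close.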
GPT-5.5 is more expensive per token than GPT-5.4, and OpenAI says it uses significantly fewer tokens to achieve comparable results. Whether that efficiency actually wins on a per-task basis depends on the task. If GPT-5.5 finishes in one pass where GPT-5.4 requires three retries, the cost math shifts in its favor. If the tasks are already short and well-specified, you pay the unit price premium without a corresponding reduction in volume.
One concrete response to token-based pricing is prompt caching discipline. The OpenAI model documentation is explicit: stable context belongs at the start of the prompt, dynamic context near the end, so the cache prefix matches across calls. For long-running agents, compaction matters too. Carrying the full conversation history through every turn is expensive; a well-designed agent summarizes state instead of restating it. These are architecture decisions, not just prompting hygiene.
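A minimal sketch of that ordering, assuming a chat-style message array; the field names and helper are illustrative, not any particular SDK's API:

```ts
type Msg = { role: "system" | "user"; content: string };

// The cacheable prefix has to be byte-identical across calls, so everything
// stable goes first and everything that changes per turn goes last.
function buildPrompt(
  systemRules: string,    // stable for the whole session: agent instructions
  repoDigest: string,     // stable within the session: summarized repo context
  dynamicContext: string, // changes every turn: tool output, test failures
  userTurn: string,       // the current request
): Msg[] {
  return [
    { role: "system", content: systemRules },  // cache prefix begins here
    { role: "user", content: repoDigest },     // still inside the cached prefix
    { role: "user", content: dynamicContext }, // cache divergence starts here, by design
    { role: "user", content: userTurn },
  ];
}
```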
Cursor and Windsurf got there earlier.
Both tools sell subscriptions, and both face the same infrastructure economics underneath. If you look at the last year of changes, the pattern was already visible before GitHub made the Copilot accounting explicit.
Cursor's June 2025 pricing change introduced Ultra at $200/month and moved Pro away from fixed request limits toward compute limits. Cursor said Pro would include at least $20 of model inference at API prices, while also introducing unlimited Auto model access. That wording matters. Unlimited Auto is not the same thing as unlimited manual access to every frontier model. It is a routing strategy wrapped in a subscription.
The July clarification made the mechanism plainer: Cursor acknowledged that the communication had been confusing, offered refunds for unexpected charges, and explained why request counts were breaking down. Hard requests can cost an order of magnitude more than simple ones. A syntax question and a multi-file agent task are not the same unit, even if both show up as "a request" in the UI. By August, Cursor was saying the quiet part directly: it was adapting pricing to an agentic world.
Windsurf's March 2026 quota change is the same story with different product language. Windsurf replaced prompt credits with daily and weekly usage allowances based on model token usage. It says short requests with a few files in context use fewer tokens than long requests across larger codebases, and that the new system better reflects underlying costs. If a paid user exceeds the included allowance, extra usage is billed at API list prices for the model used.
Developer communities noticed the shape of this before the polished pricing pages made it neat. Cursor forum and Reddit threads spent 2025 trying to decode what "unlimited" meant once Auto, frontier models, usage pools, and overages all lived in the same product. Windsurf users did the same thing as the product moved from separate flow-action credits to simpler prompt packaging, then from credits to daily and weekly quotas. The details varied, but the complaint was consistent: people could feel the subsidy being converted into accounting.
The conclusion is not that these products are bad or unsustainable. It is that even subscription-framed tools now have economics underneath that reward users who route intelligently and penalize users who treat every task as equally expensive.
Claude is not exempt.
Anthropic offers Claude across several tiers: Free, Pro, Max, Team, Enterprise, and direct API access. Each has usage limits. Heavy sessions on the most capable models cost more than short sessions on smaller ones. None of that is surprising.
The surprising part was how cheap Claude Code, Codex, and similar agent tools felt while they were packaged as generous subscriptions. That cheapness trained the habit. It also hid the actual shape of the workload: long context, tool output, retries, and frontier models doing expensive reasoning for hours.
The specific evidence worth noting is GitHub's own multiplier table for Claude inside Copilot. Claude Sonnet 4.6 moving from 1x to 9x, and Claude Opus 4.6 moving from 3x to 27x, reflects what Anthropic's models actually cost at API rates. Copilot was absorbing that gap. Starting June 1, it is not.
The reaction in developer communities was immediate. Threads framing the change as a "9x price increase for Claude models" spread quickly. That framing is not entirely accurate: the per-token price did not change, the subsidy did. But the practical impact on a developer running Claude Opus through Copilot for long agentic sessions is the same either way. The credit pool drains faster. The bill looks different.
The open-source caveat.
Open-weight models are a real factor here, and dismissing them would be a mistake. Models like DeepSeek, Llama, Qwen, and newer entrants can handle a large share of routine coding work competently; Windsurf already surfaces options like Kimi K2.5 as cheaper alternatives for users watching their allowance. As they improve, their ability to route heavy traffic away from frontier labs grows.
That weakens the pricing power of frontier labs. It does not repeal the hardware bill. Someone still pays for GPUs, memory bandwidth, power, and the operations team keeping inference stable, whether that cost lands at a frontier provider, an open-source hosting provider, or an engineering team running local inference. The hardware does not care about the model license.
The likely near-term answer is routing: different models for different task types, with explicit decisions about when the expensive model is actually worth it.
```yaml
routine_edits:
  model: smaller
  reason: predictable, low ambiguity
hard_debugging:
  model: frontier
  reason: multi-file reasoning and failure analysis
generated_tests:
  model: midrange
  reason: high volume, easy to verify
```

SemiAnalysis has done careful work on the economics of coding assistants that is worth reading for the infrastructure-level picture. The short version: routing, user behavior, model quality, and vendor subsidy are all variables in the unit economics, and the one that was never real was the subsidy.
What businesses should do now.
The practical response is not to panic or switch tools. It is to start treating model access as a resource with a real cost, and to structure work accordingly.
Map your actual usage against what it would cost at metered rates. Most teams do not know this number. Business and Enterprise Copilot plans pool credits across users, which helps, but a few engineers running heavy agentic sessions can drain a shared pool faster than most teams expect. Know who the heavy users are and what they are doing.
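Finding the heavy users is mostly a matter of aggregating the usage export you already have. A sketch; the record shape is an assumption about what your vendor's reporting exposes:

```ts
type UsageRecord = { user: string; credits: number };

// Rank users by total consumption so the pool's drain has names attached.
function heaviestUsers(records: UsageRecord[], topN = 5): [string, number][] {
  const totals = new Map<string, number>();
  for (const r of records) {
    totals.set(r.user, (totals.get(r.user) ?? 0) + r.credits);
  }
  return [...totals.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, topN);
}
```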
Separate frontier-model tasks from routine tasks deliberately. Long-context planning, hard debugging across multiple files, ambiguous product decisions: these deserve the expensive model. First-pass test generation, docstring updates, boilerplate, simple renaming: they do not. This is a workflow decision as much as a cost decision. Smaller models on well-specified tasks often produce results that are as good as or better than frontier models on the same task with a lazy prompt.
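In code, that separation can start as a blunt lookup table keyed on task type, echoing the routing sketch earlier. The task taxonomy and tier names here are placeholders, not anyone's product categories:

```ts
type TaskKind = "routine_edit" | "docstrings" | "generated_tests" | "hard_debugging";
type Tier = "small" | "midrange" | "frontier";

// Default routing policy; escalate a task when the cheap tier's output fails review.
const ROUTE: Record<TaskKind, Tier> = {
  routine_edit: "small",       // predictable, low ambiguity
  docstrings: "small",         // boilerplate, easy to review
  generated_tests: "midrange", // high volume, easy to verify
  hard_debugging: "frontier",  // multi-file reasoning and failure analysis
};

const pickTier = (task: TaskKind): Tier => ROUTE[task];
```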
Build prompt caching discipline into any long-running agent setup. Stable context goes early and stays stable. Dynamic context, like test output and tool results, goes late. This is not something you layer in after the fact: it shapes how you structure agent memory, context windows, and session management from the start.
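A sketch of what compaction looks like structurally; the threshold, the token estimator, and the summarize step are all assumptions you would supply:

```ts
type Turn = { role: "user" | "assistant"; content: string };

const COMPACT_ABOVE_TOKENS = 50_000; // illustrative threshold; tune per model

// Replace old history with a rolling summary instead of resending it verbatim.
async function compactHistory(
  history: Turn[],
  estimateTokens: (turns: Turn[]) => number,     // e.g. a tokenizer-based estimate
  summarize: (turns: Turn[]) => Promise<string>, // e.g. a cheap-model summarization call
): Promise<Turn[]> {
  if (estimateTokens(history) < COMPACT_ABOVE_TOKENS) return history;
  const recent = history.slice(-4); // keep the last few turns verbatim
  const summary = await summarize(history.slice(0, -4));
  return [{ role: "user", content: `Session state so far: ${summary}` }, ...recent];
}
```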
Watch the overage signals before you hit them. Both Copilot and Codex let you buy additional credits when limits are reached. That is useful to know ahead of time, not mid-session. Enterprise Copilot pools credits across the org, so a single heavy user does not hard-stop the team. But that same pooling means usage is harder to attribute without reporting discipline. Team-level billing analytics and usage attribution to cost centers need to be set up now, not after the first surprise invoice.
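The simplest early-warning signal is a month-end projection from the burn rate so far. A sketch; the 90% threshold is a policy choice, not a vendor feature:

```ts
// Project month-end consumption from usage so far and flag before overage hits.
function overageCheck(
  creditsUsed: number,
  dayOfMonth: number,
  daysInMonth: number,
  monthlyPool: number,
): { projected: number; warn: boolean } {
  const projected = (creditsUsed / dayOfMonth) * daysInMonth;
  return { projected, warn: projected > monthlyPool * 0.9 };
}

// 30,000 credits gone by day 12 of a 30-day month projects to 75,000 against
// a 47,500-credit pool (25 Business seats): reroute work or buy credits early.
console.log(overageCheck(30_000, 12, 30, 47_500));
```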
My bet.
Subscriptions survive as the packaging. The economics underneath become metered, whether or not the user sees that directly. Every major coding AI product is moving in this direction at different speeds. The direction is not reversible: the marginal cost of frontier model inference is too high for flat-rate access to survive at scale without either rationing model quality or repricing what customers pay.
Open-source models keep improving and will push more routine work out of frontier pricing territory. That is genuinely good for developers and for cost management. It does not end the pricing conversation; it shifts where the routing cutoff lands.
The teams that do well through this are not the ones assuming every prompt is free. They are the ones who learn which work deserves the expensive model, which context should never be resent in the first place, and which tasks can be routed to something cheaper without changing the outcome. That is both a product discipline and an engineering discipline, and it is more valuable now than it was six months ago.
The invoice does not care how efficient the token curve looked in the demo.