Margin of Safety #56: The Structural Problem with AI Pricing

Jimmy Park, Kathryn Shih

May 20, 2026

AI pricing models don’t eliminate cost variability. They just reshuffle who gets stuck with it.

The last few weeks have brought repeated news of AI sticker shock. Uber CTO Praveen Neppalli Naga confirmed to The Information that his company had already exhausted its entire 2026 AI budget, four months into the year. Uber has seen individual developer costs between $500 and $2,000 per month, with Naga himself burning $1,200 in a single two-hour demo session. Other news stories suggest that Microsoft is rolling back Claude access due to a similar set of price challenges. This isn’t the first sticker shock cycle: when Cursor shifted from flat-rate request limits to usage-based credit pools in June 2025, customers received bills 20x their subscription expectations. Cursor issued a public apology and refunds.

We’re seeing discussion around failures of communication (vendors should better explain pricing) and pricing (is it unsustainably high?). But we think there’s also a structural challenge, namely that token-based billing shifts cost variability from vendor to customer. For investors or buyers evaluating AI-powered services companies, particularly in cybersecurity, this shift has serious strategic implications.

Subscription pricing works as long as average consumption within a user cohort stays close enough to the price point. When usage distribution becomes too wide — when the same $20 seat can cover 15 minutes of autocomplete or a 10-hour agentic refactor of a production monorepo — the model can fall apart due to adverse selection[1]. As average usage goes up, prices must follow. But as prices go up, the customers most likely to stay on a subscription are the ones who use it most. Now you have a vicious cycle of price increases. We believe this is what forced first Cursor and now Anthropic’s hand.

Token-based pricing fixes this, but it does so by putting all the cost variability onto the customer. In a sense, it’s the cloud model continued, and the cloud model spawned cost estimators so complex as to need their own help documentation[2]. But the difference between traditional cloud and LLM economics is that usage prediction is even harder for LLMs. Cloud already has high variance: without knowing exactly how much demand a solution will see, you can’t predict how much spend will occur on any horizontally scaling components. But it often *is* possible to at least estimate costs as a function of usage.

The challenge with agents and LLMs is the weak association between user-level usage (a task or a subtask) and a cost unit (a token). This is the double-edged sword of fixed-price billing: there’s innate risk of price/cost mismatch, and the risk must fall on someone. At the end of the day, task-based pricing (a fixed price per completed task) shifts some risk onto the provider in the same way that a fixed-price subscription does. The difference from a subscription is that the purchaser assumes the risk of volume variability, and the provider retains the risk of complexity variability. To the extent that the underlying consumption has an intolerable degree of complexity variability within a billing unit (in this case, a task), the provider will need to either eat the difference, increase price (which may or may not be viable), or change the pricing model.
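To make the split between volume risk and complexity risk concrete, here is a minimal sketch of per-task billing. The price, the token cost, and the task mixes are made-up numbers chosen for easy arithmetic, and `per_task_outcome` is a hypothetical helper, not any vendor's billing logic.

```python
def per_task_outcome(price_per_task_cents, ktokens_per_task, cost_per_ktoken_cents):
    """Return (buyer_bill, provider_margin) in cents for one billing period.

    ktokens_per_task lists the thousands of tokens each task consumed.
    """
    buyer_bill = price_per_task_cents * len(ktokens_per_task)
    provider_cost = cost_per_ktoken_cents * sum(ktokens_per_task)
    return buyer_bill, buyer_bill - provider_cost


PRICE = 200  # cents per task (hypothetical)
COST = 2     # cents per 1,000 tokens (hypothetical)

easy_mix = [20, 30, 25, 25]   # four routine tasks
hard_mix = [20, 30, 25, 600]  # same volume, one task blows up

print(per_task_outcome(PRICE, easy_mix, COST))      # (800, 600)
print(per_task_outcome(PRICE, hard_mix, COST))      # (800, -550)
print(per_task_outcome(PRICE, easy_mix * 2, COST))  # (1600, 1200)
```

With the task count held fixed, the buyer's bill does not move, while a single complex task flips the provider's margin from positive to negative. Doubling the task count doubles the buyer's bill instead. Under these assumptions, the buyer carries volume variability and the provider carries complexity variability.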

To understand why price increases might not be feasible, we think it’s illustrative to look at Claude Code. Rather than moving users to consumption-based billing, Anthropic could have simply increased the price, at least in theory. We suspect it couldn’t do so in practice because of the extreme underlying usage variability: Claude Code’s users are a highly heterogeneous mix ranging from casual hobbyists (low usage, and potentially even profitable) up to full-time, 996[3] engineers (extreme usage, and basically certain to be unprofitable). The subscription works if the average usage vs. price comes out somewhere reasonable. But with such a heterogeneous mix, it’s easy to raise the price, only to have that drive out the light users who are already paying a lot for what they get. Then the average usage shifts by as much as the price, and Anthropic is no better off in terms of margin and worse off in terms of adoption. Faced with these sorts of dynamics, the only thing you can do is offer a subscription to the casual users (backed by very strong usage caps) and push the industrial-strength users onto consumption-based pricing.
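The repricing dynamic can be sketched as a toy simulation. Everything here is an assumption for illustration: the per-user costs are invented, and we assume each user's willingness to pay is a fixed multiple of what they cost to serve (heavier users get more value). It is not a model of any real vendor's user base.

```python
def simulate_spiral(monthly_costs, value_multiple, margin, max_rounds=10):
    """Reprice a flat subscription each round; users who value it below price churn.

    monthly_costs: vendor's cost to serve each user (hypothetical figures).
    value_multiple: assumed willingness to pay as a multiple of cost to serve.
    margin: target markup over the average cost of the remaining users.
    Returns a list of (remaining_users, price) per round.
    """
    users = list(monthly_costs)
    history = []
    for _ in range(max_rounds):
        if not users:
            break
        price = sum(users) / len(users) * (1 + margin)
        history.append((len(users), round(price, 2)))
        stayers = [c for c in users if c * value_multiple >= price]
        if len(stayers) == len(users):
            break  # nobody churned, so the price is stable
        users = stayers
    return history


costs = [5, 10, 20, 40, 80, 400, 1200]  # dollars per user per month, made up
print(simulate_spiral(costs, value_multiple=1.5, margin=0.2))
# [(7, 300.86), (2, 960.0), (1, 1440.0)]
```

Each price increase removes the lightest users, which raises the average cost of those who remain and forces the next increase. In this toy cohort the subscription only stabilizes once a single heavy user is left.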

Continuing with the code use case, it’s not clear that task-based pricing would be a viable third option. Some code tasks are demonstrably simple and can happen with minimal context. But many tasks have historically presented a challenge for any kind of difficulty or complexity estimate: look at all the adages around engineering teams being unable to estimate delivery dates, or the reports of Claude misfiring[4] in complexity-increasing ways. For higher-complexity tasks, how do you even begin bucketing them into an outcome-based pricing model? You either need a lot of analysis to begin guessing the price, or you need to use a quick heuristic that likely clusters dissimilar tasks into the same bucket. As soon as you do the latter, and try to assign a price to Claude taking on that task, you’re basically back to the original problem: when you offer a fixed-price service that can deliver a huge range of customer value (and cost to you), a high price tag will leave you with only the hardest use cases while a low one may fail to cover your costs. Once again, it can be impossible to stabilize the fixed-price model due to underlying heterogeneity.

Cybersecurity is full of these sorts of use cases. Even for a standard phishing alert, next steps can vary significantly – did the user engage? What did they click? Was there lateral movement? How much has to be cleaned up? Similarly, penetration testing and red teaming can have huge complexity swings based on the exact environment or codebase being examined. While you can technically break these macro goals – deal with the alert, test the system – up into smaller, bite-sized subtasks, the problem remains that someone, either the vendor or the buyer, has to bear the financial risk of a highly variable (and potentially uncertain) number of subtasks and their resultant compute costs.

So what does this mean for task-based pricing? We think the vision is valuable, but the unfortunate reality is that many use cases where AI creates the most value are ones where stable outcome-based pricing is hardest to sustain. Investors should not treat “we charge per task” as a pricing model that resolves the token billing problem. The key question is whether the vendor has bounded the cost distribution for the specific task class they’re pricing. If they haven’t, the fixed-price model will either produce poor margins or collapse under adverse selection[1] as the easy tasks are subsumed into other processes and the remaining workload skews toward the expensive tail. If they have, there’s the potential that the bounds erode user value.

For a practitioner, we think this means you should be paying attention to the complexity mix of the tasks you’re automating with AI. The more signs of significant heterogeneity you discover, the more suspicious you should be of a vendor’s medium-term ability to sustainably offer any kind of fixed-price scheme. We are particularly suspicious of tasks with strong potential to evolve over time (for example, policy enforcement in the face of ever-increasing policy complexity) and with poorly defined success criteria or high variability in existing human effort requirements.
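One rough way to look for heterogeneity, assuming you have per-task cost or effort figures from your own execution logs, is to compare a high percentile of the distribution to its median. The sketch below uses a nearest-rank percentile; the sample data and any threshold you apply to the ratio are illustrative assumptions, not an established benchmark.

```python
import math
import statistics


def tail_ratio(costs, pct=95):
    """Nearest-rank percentile of per-task cost divided by the median cost."""
    ordered = sorted(costs)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based nearest rank
    return ordered[rank - 1] / statistics.median(ordered)


# Hypothetical per-task costs in dollars for two task classes.
narrow = [0.8, 0.9, 0.9, 1.0, 1.0, 1.0, 1.0, 1.1, 1.1, 1.2]
wide = [1, 1, 1, 1, 1, 1, 1, 1, 2, 40]

print(tail_ratio(narrow))  # 1.2
print(tail_ratio(wide))    # 40.0
```

A ratio near 1 suggests a task class where a single fixed price can plausibly hold. A ratio like the second one suggests that a small share of tasks drives most of the cost, which is the situation where fixed-price schemes tend to come under pressure.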

Once you’ve identified heterogeneous task classes, the practical goal is maintaining pricing power and vendor flexibility before you need them.

Human fallback capacity matters here for two reasons: it handles actual outages, and it gives you a credible exit option when a vendor reprices. In practice, the fastest response to a pricing change will likely be shifting task mix, which is operationally an outage for that subset of work. Without fallback, you have neither lever.

Owning execution data is the other prerequisite for flexibility. It enables the quality evaluation you need to assess substitutes or build model routing for lower-complexity tasks. Without it, you have to either incur quality risk or start data collection from scratch.

For larger clients, extended pricing commitments are worth negotiating, but the window is early. Once you’ve signed a multi-year contract, the vendor has little motivation to reopen terms in a way that worsens its side of the deal.

Unfortunately, we don’t think any of these strategies eliminates the underlying uncertainty. The pricing models that govern this market are still unsettled, and it is likely that some will collapse in ways that are hard to predict from the outside. The goal isn’t certainty — it’s reducing the scope of the fire drill when a pricing shift lands. Teams that have addressed task definition, process continuity, data ownership, and contract terms are in a materially better position to absorb that disruption than those who haven’t.

If you’re building in this space, we’d like to hear from you.

Feel free to reach out to jpark@forgepointcap.com and kshih@forgepointcap.com.

This blog is also published on Margin of Safety, Jimmy and Kathryn’s Substack, as they research the practical sides of security + AI so you don’t have to.

[1] The adverse selection dynamic has a counterintuitive implication for vendor strategy: a pricing model that looks durable during a period of rapid capability improvement may become unstable as the task distribution shifts. When models get better, they increasingly accept previously impossible hard tasks. As those hard tasks come online, they can increase average costs – in this world, capability improvements expose pricing problems rather than generating greater margin.

[2] And in the case of Windows SQL on EC2, a tutorial with 6 sections and over 20 individual steps on how to estimate the license settings required for a cost estimate: https://docs.aws.amazon.com/pricing-calculator/latest/userguide/estimate-workload-tutorial.html

[3] https://www.wired.com/story/silicon-valley-china-996-work-schedule/

[4] Sometimes spectacularly so: https://www.theguardian.com/technology/2026/apr/29/claude-ai-deletes-firm-database