Why do most enterprise AI pilots fail to scale?

The evidence says the cause is rarely the model. Two structural causes dominate: budget follows visibility rather than return (around 50–70% of GenAI budget flows to front-office sales and marketing, where KPIs are visible, while better-ROI back-office work is underfunded), and programmes measure throughput rather than outcome. A single use case instrumented to cost-per-outcome, with a named business owner whose P&L moves with it, beats a portfolio of pilots measured on engagement.

Why are enterprise AI bills rising even though token prices are falling?

Per-token prices for constant quality fell roughly 280-fold between late 2022 and late 2024, but total bills are rising because consumption is growing faster than price is dropping — the Jevons-paradox dynamic. Agentic, multi-call workflows amplify token consumption by 5 to 30 times for a single user-visible task, which is why only about 15% of enterprises forecast their AI cost to within ±10%.

How should a buyer handle AI provider and price risk?

Assume today's token price is promotional and write that assumption into the multi-year business case. Avoid single-provider lock-in on any material workload, keep an open-weight or smaller-model fallback qualified for high-volume low-complexity tasks, and put price-change and exit terms in the contract. You do not need a view on whether the providers will be profitable — you need your business case to survive the day the price moves.

What questions should I ask an AI consultant or vendor?

Three cut through a pitch faster than any RFP: show me the cost-per-outcome you measured on your last engagement; show me how you instrumented the cost side — token, model, agent-step; and show me the business owner whose number moved. If a firm answers in pilots, demos and satisfaction scores, what you are buying is enablement, not measured value.


The AI measurement crisis: what enterprise AI actually costs, and how to know if it pays

Q: How do you measure the ROI of an enterprise AI programme?

Start by asking: at what unit? If the organisation cannot state a cost-per-outcome for its flagship AI workload — cost per resolved ticket, per generated document, per deflected case — then the ROI does not exist yet as a number, and any figure presented is satisfaction wearing a finance costume. The work is to instrument one workload up a Crawl/Walk/Run ladder, from cost-per-token to cost-per-call to cost-per-outcome, and report that.

The loudest critics are right about one uncomfortable thing: most organisations cannot say what their AI costs, or whether it pays. This is a strategy problem, not an investment thesis. Here is what the primary evidence actually shows on the ROI gap, the cost opacity and the provider risk — and the instrumentation that separates the roughly one-in-twenty programmes that scale from everyone else.

A practitioner deep-dive · Consulting Huber · 3 June 2026

Bernhard Huber

Interim Executive & Innovation Leader

The provocation, taken seriously

The technology writer Ed Zitron has spent two years arguing, loudly, that the AI industry runs on numbers nobody can pin down — that the true cost of inference is obscured, that the revenue is thin against the spend, and that "AI doesn't have a return on investment". It is a polemic, and parts of it are contestable. But underneath the rhetoric sits a claim that is harder to wave away, and that this piece sets out to test against primary sources rather than vibes: most enterprises genuinely cannot measure what their AI costs, and cannot demonstrate what it returns.

That is not a stock-market question. It is a strategy question. A board does not need to know whether OpenAI is a good investment to need to know whether its own AI programme is producing value — and right now, on the published evidence, most boards cannot answer the second question with a number. The striking part is that the people building the tooling to fix this agree. The FinOps Foundation — the Linux Foundation body that effectively defines cloud-cost discipline — states plainly that "measuring and quantifying the business value of AI initiatives has been called out as a major challenge" by the practitioners managing AI spend, and that the methods to do it are still emerging rather than settled.¹¹

So the critique lands. The interesting question is what a serious operator does about it. This piece walks the four places the measurement breaks — the ROI evidence, the cost side, the provider economics, and why pilots stall — and then sets out the instrumentation that the organisations who do measure are actually using. Every chart below is drawn from a primary survey or a framework document, and where a source is weak or contested, it is flagged in the text, not buried.

A note on the evidence. The numbers in this piece come from executive surveys, field experiments and framework bodies, not from a single audited dataset, because none exists. Sample sizes, denominators and the difference between "satisfaction" and "measured return" matter enormously here, and the captions say so. The most-quoted figure in the whole debate, MIT's "95% of pilots fail", is also the most methodologically contested; it is presented below as the report's own claim, with the critique attached.

Part I · The return

Satisfaction is high. Measured return is not.

The cleanest finding across the 2025 surveys is not that AI fails. It is that adopters are happy with it and still cannot show the money. Bain & Company's Q3 2025 executive survey found that among the 59% of companies meaningfully adopting generative AI, the technology met or exceeded expectations in roughly 80% of cases across the functions where it was deployed. In the same survey, only about 23% of all respondents said generative AI had actually delivered more revenue or lower costs.¹ That gap — between "it works" and "we can attribute value to it" — is the measurement crisis in a single chart.

The satisfaction–attribution gap

Enterprise generative AI, Bain executive survey, Q3 2025

Met or exceeded expectations (among meaningful adopters): ~80%

Delivered more revenue or lower cost (all respondents): ~23%

Note the two bars use different denominators: the first is among meaningful adopters, the second among all respondents — so they are not a clean before/after. The point is the shape, not the subtraction. Sample is small (n=197) and self-reported; read it as an executive survey, not a population statistic. Source: Bain & Company, "AI moves from pilots to production" (2025).

If that were one survey, it would be an anecdote. It is not. S&P Global Market Intelligence's Voice of the Enterprise survey of just over 1,000 IT and line-of-business professionals (n≈1,006) across North America and Europe found that the share of organisations abandoning the majority of their generative-AI initiatives before production more than doubled year over year, from 17% to 42%, and that, on average, 46% of projects were scrapped somewhere between proof of concept and broad adoption.²

Abandonment more than doubled in a year

Share of organisations abandoning the majority of GenAI initiatives before production

2024: 17%

2025: 42%

Scrapped between PoC and adoption (average across projects): 46%

Source: S&P Global Market Intelligence, Voice of the Enterprise: AI & Machine Learning (~1,006 respondents, North America + Europe), "Generative AI shows rapid growth but yields mixed results". Figures independently reported by CIO Dive.

The same longitudinal survey found something more telling than any single abandonment number: the proportion of organisations reporting a positive impact from generative AI fell across every enterprise objective it measured, year over year. Not a reallocation, not a plateau — a decline on all three fronts at once.

Positive impact fell on every objective measured

Share of organisations reporting positive GenAI impact, 2024 → 2025

Revenue growth: 81% (2024) → 76% (2025)

Cost management: 79% (2024) → 74% (2025)

Risk management: 74% (2024) → 70% (2025)

A decline-everywhere pattern from a structured year-over-year survey is an unflattering finding, not a marketing one — which is part of why it is credible. Source: S&P Global Market Intelligence, same survey as above.

The 95% number, and why to handle it with gloves

No statistic in this debate travels further than MIT's. The Media Lab's NANDA initiative report, The GenAI Divide: State of AI in Business 2025 — built on 150 leader interviews, a 350-person employee survey and a scan of 300 public deployments — reports that about 5% of enterprise AI pilots achieve rapid revenue acceleration while roughly 95% deliver little or no measurable impact on profit and loss.³ It also reports a steep adoption funnel for task-specific, embedded tools, against a much gentler path for generic chatbots like ChatGPT and Copilot.

The pilot-to-production funnel

Task-specific, embedded enterprise tools vs. generic chatbots reaching production

Task-specific tools, evaluated: ~60%

Task-specific tools, piloted: ~20%

Task-specific tools, reached production: ~5%

For comparison, generic chatbots, pilot → implementation: ~83%

Handle with care. Wharton's Kevin Werbach and other researchers have said the headline 95% figure is under-documented — that they cannot trace how it was derived — and the 60/20/5 funnel is reported without clear denominators. The publisher also promotes commercial agentic-AI protocols, a potential conflict of interest. Treat these as what the report claims, corroborated in direction by the Bain and S&P findings above, not as settled fact. Source: MIT NANDA, The GenAI Divide (2025), via Fortune and the report PDF.

The reason to keep all three sources in view at once is that they fail differently. Bain is a small executive survey. S&P is a larger longitudinal one. MIT is a contested headline. They do not agree on a number — they agree on a shape: adoption is broad, satisfaction is real, and attributable financial return is rare and getting harder to claim. That shape is robust even when each individual figure is soft.

Part II · The cost

Why "cheaper tokens" has produced bigger bills

The return side is hard to measure. The cost side is, if anything, worse — because the headline trend points the wrong way from the bill. Per-token prices have collapsed. Stanford HAI's AI Index documents a roughly 280-fold drop in the cost of running a GPT-3.5-equivalent quality query between November 2022 and October 2024 — from about $20 to about $0.07 per million tokens.⁸ Even on a conservative same-model basis, practitioners put the fall at roughly an order of magnitude over two years. And yet enterprise AI bills are rising, not falling, because consumption is growing faster than price is dropping — the classic Jevons-paradox dynamic, where efficiency expands usage faster than it cuts unit cost.⁹

↓ ~280×: per-token price for constant quality, Nov 2022 → Oct 2024 (Stanford HAI)

↑ >100×: token consumption growth over roughly two years

↑ Net: total enterprise AI bill, as consumption outruns the price cut

4–5 months: time in which Uber and ServiceNow reportedly exhausted full-year 2026 AI budgets

Anchor the magnitude on Stanford HAI; the ">100×" consumption figure and the budget-exhaustion anecdotes come from VentureBeat's reporting and are directional. Sources: Stanford HAI AI Index 2025; VentureBeat, "Cheaper tokens, bigger bills".
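The interaction between a falling price and rising consumption is simple arithmetic, and a minimal sketch makes it explicit. It assumes a bill is nothing more than unit price times token volume, and it plugs in the three multiples quoted above: the ~280× constant-quality price drop, the ~10× same-model price drop, and the >100× consumption growth (treated here as exactly 100×, which is an assumption). The function and constant names are illustrative and do not come from any of the sources.

```python
def bill_multiplier(price_drop: float, consumption_growth: float) -> float:
    """Change in total bill when the unit price falls by `price_drop`x and
    token volume grows by `consumption_growth`x, assuming bill = price * volume."""
    if price_drop <= 0 or consumption_growth <= 0:
        raise ValueError("both factors must be positive")
    return consumption_growth / price_drop


# Multiples quoted in the text; 100x is used as a floor for ">100x".
CONSUMPTION_GROWTH = 100.0
SAME_MODEL_PRICE_DROP = 10.0       # "roughly an order of magnitude"
CONSTANT_QUALITY_PRICE_DROP = 280.0

same_model = bill_multiplier(SAME_MODEL_PRICE_DROP, CONSUMPTION_GROWTH)
constant_quality = bill_multiplier(CONSTANT_QUALITY_PRICE_DROP, CONSUMPTION_GROWTH)

print(f"same-model basis:       bill x{same_model:.1f}")
print(f"constant-quality basis: bill x{constant_quality:.2f}")
```

Under this simple model the bill grows about tenfold on the same-model basis, while on the constant-quality basis it would shrink. Which price decline a buyer actually captures is therefore the number to check against real invoices, not the headline.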

Falling prices would still let a buyer forecast, if the unit were stable. It is not. The reason the true cost of an AI workload is so hard to know is that it depends on too many interacting variables to reason about intuitively: which model actually serves a given request, where the workload executes, how the prompt and context are structured, how much retrieval is stuffed into the window, and — above all — how many times an agentic workflow loops. Industry analysis from CloudZero and IDC describes agentic multi-call patterns amplifying token consumption by 5 to 30 times for a single user-visible task. Managing this is, in the words of one practitioner, "an engineering problem that requires continuous tuning" — which reframes prompt engineering as a cost-governance discipline, not a prompt-craft one.⁹
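To see why the 5 to 30 times amplification matters for forecasting, the sketch below prices one user-visible task at the low and high end of that range. The token count per call and the price per million tokens are hypothetical placeholders, not figures from CloudZero, IDC or any provider rate card.

```python
def cost_per_task(tokens_single_call: int,
                  price_per_million_usd: float,
                  amplification: float = 1.0) -> float:
    """Token cost of one user-visible task, where `amplification` is the
    multiple of tokens consumed relative to a single model call."""
    if amplification < 1.0:
        raise ValueError("amplification cannot be below a single call")
    return tokens_single_call * amplification * price_per_million_usd / 1_000_000


BASE_TOKENS = 4_000     # hypothetical tokens for one plain call
PRICE_PER_MILLION = 3.0 # hypothetical blended USD price per million tokens

single_call = cost_per_task(BASE_TOKENS, PRICE_PER_MILLION)
agentic_low = cost_per_task(BASE_TOKENS, PRICE_PER_MILLION, amplification=5)
agentic_high = cost_per_task(BASE_TOKENS, PRICE_PER_MILLION, amplification=30)

print(f"single call : ${single_call:.3f}")
print(f"agentic, 5x : ${agentic_low:.3f}")
print(f"agentic, 30x: ${agentic_high:.3f}")
```

The point of the exercise is the spread rather than the absolute figures: with the same prompt and the same price, the cost of "one task" differs sixfold between the two ends of the quoted range, before any change in user volume.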

The consequence shows up directly in budgeting accuracy. The FinOps Foundation's State of FinOps 2026 data indicates that only about 15% of enterprises forecast their AI costs to within ±10%, while roughly one in four miss their forecast by more than 50%.¹⁰ A line item you miss by half is not a line item you can build a business case on.

Most enterprises cannot forecast their AI bill

Accuracy of enterprise AI cost forecasts

Forecast within ±10% ("on target"): ~15%

Miss forecast by >50% (materially wrong): ~25%

Token pricing, per-agent-step billing and retrieval costs create a volatility that legacy annual budgeting was never built to handle. Source: FinOps Foundation, State of FinOps 2026.
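An organisation can place itself on this chart with a few lines of arithmetic. The sketch below classifies a period's AI spend against the two thresholds quoted above. It assumes the miss is measured relative to the forecast, which the survey summary here does not specify, and the monthly figures and the "off target" middle label are invented for illustration.

```python
def forecast_error(forecast: float, actual: float) -> float:
    """Signed forecast miss as a fraction of the forecast (assumed convention)."""
    if forecast <= 0:
        raise ValueError("forecast must be positive")
    return (actual - forecast) / forecast


def classify(forecast: float, actual: float) -> str:
    """Bucket a period using the +/-10% and >50% thresholds from the text."""
    miss = abs(forecast_error(forecast, actual))
    if miss <= 0.10:
        return "on target"
    if miss > 0.50:
        return "materially wrong"
    return "off target"


# Hypothetical (forecast, actual) monthly AI spend in USD.
months = {
    "Jan": (100_000, 108_000),
    "Feb": (100_000, 130_000),
    "Mar": (100_000, 160_000),
}
labels = {month: classify(f, a) for month, (f, a) in months.items()}
print(labels)
```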

This is the part of Zitron's critique that holds up best. Not "AI is worthless" — the productivity evidence below contradicts that — but "the true unit cost is structurally hard to know." It is. And an organisation that cannot state its cost-per-unit-of-work cannot compute a return, no matter how good the work is.

Part III · The provider

Provider economics — as a sourcing risk, not an investment view

It is not the buyer's job to value the model providers. But it is the buyer's job to understand that the price they pay today rests on an economic structure that is still finding its level — because that structure determines pricing stability and counterparty risk, which are budgeting inputs. Three facts, all from reporting of the providers' own figures, are enough to frame the exposure.

First, OpenAI's spend plans have moved by enough to matter. In February 2026, CNBC reported that the company had reset its compute-spend target downward — from the roughly $1.4 trillion in infrastructure commitments that CEO Sam Altman had touted, to around $600 billion by 2030 — explicitly to tie spend more directly to expected revenue growth.⁵ Second, its 2025 results, as relayed, show real burn: about $13.1 billion in revenue against roughly $8 billion of cash burned.⁵

$13.1B: OpenAI 2025 revenue (above its $10B target)

~$8B: 2025 cash burn (below its $9B target)

$1.4T → ~$600B: 2030 compute-spend target, reset downward

>$280B: projected 2030 revenue (consumer + enterprise)

These are unaudited figures relayed via reporting of a private company's internal projections — the strongest channel available, corroborated across CNBC, Reuters and Bloomberg, but inherently not independently verifiable. Read them as "reportedly targeting," not as accounts. Source: CNBC, "OpenAI resets spend expectations" (Feb 2026).

Third, the strain is now visible in the credit ratings of the companies financing the build-out. In mid-2025, Moody's revised Oracle's outlook to negative from stable — while affirming its Baa2 rating, the lower end of investment grade — citing counterparty-concentration risk tied to a roughly $300 billion, 4.5-gigawatt compute contract with OpenAI, which Moody's characterised as one of the largest project financings in the world.⁶ This was an outlook revision, not a downgrade — but for an enterprise buyer it is a concrete, named signal.

Depending on external LLMs at scale is a strategic exposure in its own right

Underneath the price and counterparty numbers sits a larger point that deserves to be named plainly. Routing a core, high-volume business process through an external model API concentrates an operational dependency outside the organisation's control. At pilot scale that is a sensible trade — capability and speed in exchange for a small, contained spend. At production scale, when thousands of daily decisions, documents or customer interactions flow through a single third-party endpoint, the same arrangement becomes a question of resilience rather than convenience. A provider that is still burning cash, resetting its own spend roadmap and financing its build-out through concentrated counterparties is not yet a stable utility; it is a fast-moving supplier of an input the business has quietly made load-bearing. A pricing change, a rate-limit, a deprecated model version or an outage then lands not as an IT inconvenience but as an interruption to a core process.

The conclusion is not to avoid external models — they are too capable, and building frontier capability in-house is rarely the right call. It is to treat a model provider the way a serious operator treats any critical single-source supplier the moment a process scales past experimentation, and to be able to answer one honest question: what happens to this process if the price doubles, the model is retired, or the endpoint is unavailable next quarter? If there is no answer, the dependency is a strategic risk wearing the costume of a convenient API. The concrete hedges that follow from that question are the ones a buyer should write down — below.

What this means for a buyer — not an investor. If frontier-model API prices are being held below cost to win the market, then today's per-token price is a promotional price, and a prudent multi-year business case should budget for the possibility that it rises. Three practical hedges follow directly: (1) avoid single-provider lock-in for any workload large enough to matter; (2) write price-change and exit assumptions into the business case, not just the current rate card; (3) keep a smaller or open-weight model qualified as a fallback for high-volume, low-complexity tasks. None of this requires a view on whether the providers will be profitable. It only requires treating the price as variable.
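Treating the price as variable can be written directly into the business case as a stress test. The sketch below is one possible shape for it, assuming a workload whose value and non-token cost per outcome are known. Every number and every name in it is hypothetical, and it ignores volume effects, discounts and fallback-model routing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    value_per_outcome_usd: float       # hypothetical business value per outcome
    other_cost_per_outcome_usd: float  # hypothetical non-token cost (people, infra)
    tokens_per_outcome: int
    price_per_million_usd: float       # today's rate-card price

    def token_cost(self, price_multiplier: float = 1.0) -> float:
        return (self.tokens_per_outcome * self.price_per_million_usd
                * price_multiplier / 1_000_000)

    def margin(self, price_multiplier: float = 1.0) -> float:
        """Value minus all costs per outcome at a given price multiple."""
        return (self.value_per_outcome_usd - self.other_cost_per_outcome_usd
                - self.token_cost(price_multiplier))

    def break_even_multiplier(self) -> float:
        """How many times the token price can rise before margin hits zero."""
        base = self.token_cost()
        if base <= 0:
            raise ValueError("workload has no token cost to stress")
        headroom = self.value_per_outcome_usd - self.other_cost_per_outcome_usd
        return headroom / base


case = Workload(value_per_outcome_usd=2.00, other_cost_per_outcome_usd=0.80,
                tokens_per_outcome=120_000, price_per_million_usd=3.00)

for multiple in (1.0, 2.0, 4.0):
    print(f"price x{multiple:.0f}: margin ${case.margin(multiple):.2f} per outcome")
print(f"break-even at price x{case.break_even_multiplier():.2f}")
```

A case that only survives at today's rate card is the one to flag; a case with a break-even multiple well above any plausible price move needs no view on provider profitability.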

Part IV · The gap

Why pilots stall — and what the evidence says actually works

If satisfaction is high and attribution is rare, the obvious question is what separates the programmes that convert. The evidence points away from the model and toward two structural causes: what gets funded, and what gets measured.

The budget goes where it is easy to see, not where it pays

The MIT NANDA report's most actionable finding — more defensible than its headline failure rate — is that generative-AI budgets are systematically misallocated. Around half of GenAI budgets (the report's abstract says ~50%; its survey detail runs as high as ~70%) flow to front-office sales and marketing functions, while back-office automation that often yields better ROI is underfunded. The reason is itself a measurement problem: sales and marketing outcomes map cleanly onto board-level KPIs and investor updates, whereas the efficiencies in legal, procurement and finance are real but harder to surface in an executive conversation.³

Budget follows visibility, not return

Allocation of enterprise GenAI budget by function

Front office (sales & marketing; visible KPIs): ~50–70%

Back office (legal, procurement, finance; better ROI, often underfunded): remainder

Spend was elicited via a hypothetical "$100 allocation" exercise, so read the split as directional. The strategic point stands: the function that is easiest to measure attracts the budget, even when it is not where the return is. Source: MIT NANDA, The GenAI Divide (2025).

The productivity is real — but unevenly distributed

It would be wrong to leave the impression that AI does not work. A large, pre-registered field experiment across Microsoft, Accenture and an anonymous Fortune 100 manufacturer (n=4,867 developers, published in Management Science) found that GitHub Copilot raised the number of completed tasks by about 26%.⁷ Two caveats matter for any ROI built on that figure. First, the study measured task throughput, not code quality or financial return — the researchers did not have access to the produced code. Second, and more useful for strategy: the gains were highly uneven by experience.

The same tool, very different gains

Output increase from AI coding assistance, by developer experience

Junior / less-experienced developers: +27–39%

Senior developers: +8–13%

Because the gain depends so heavily on who is using the tool, the same deployment can return very different value across two teams — which is exactly why a single blended "AI productivity" number is misleading at the portfolio level. Measured as output, not quality or ROI. Source: field RCT, MIT/Princeton/Wharton/Microsoft, Management Science (2025).

Put the two findings together and the strategic implication is sharp. The value is real, but it is contingent — on the function, on the workforce composition, on whether the workflow was redesigned around the tool. A programme that does not measure at that granularity will see the average and miss the distribution, fund the visible use case over the valuable one, and report "it met expectations" while the P&L does not move. That is not a model failure. It is an instrumentation failure.
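How far the average can mislead is easy to illustrate with a weighted mean. The sketch below takes the midpoints of the two ranges quoted above as point estimates, which is a simplifying assumption and not a figure from the study, and applies them to two hypothetical teams with different junior/senior mixes.

```python
# Midpoints of the reported ranges, used as point estimates (assumption).
JUNIOR_GAIN = (0.27 + 0.39) / 2
SENIOR_GAIN = (0.08 + 0.13) / 2


def blended_gain(junior_share: float) -> float:
    """Expected output gain for a team, weighted by its share of juniors."""
    if not 0.0 <= junior_share <= 1.0:
        raise ValueError("junior_share must be between 0 and 1")
    return junior_share * JUNIOR_GAIN + (1.0 - junior_share) * SENIOR_GAIN


team_a = blended_gain(0.8)  # hypothetical team, 80% junior
team_b = blended_gain(0.2)  # hypothetical team, 20% junior

print(f"team A (80% junior): +{team_a:.1%}")
print(f"team B (20% junior): +{team_b:.1%}")
```

On these assumptions the same tool yields roughly twice the throughput gain in one team as in the other, and both numbers still describe output, not quality or financial return.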

Part V · The fix

The measurement playbook: from cost-per-token to cost-per-outcome

The good news is that the discipline to fix this is not theoretical. The FinOps Foundation — the body that standardised cloud-cost management — has extended its framework to AI, and its core construct, Unit Economics, is the most concrete primary answer available. Unit Economics is defined as "metrics that provide an understanding of how an organisation's technology use and technology management practices impact the value of the organisation's products, services, or activities," and it sits squarely under the framework's Quantify Business Value domain. The Foundation states the principle bluntly: "without a way to relate costs to benefits received, it is difficult to understand whether spending is appropriate."⁴

The practical move is a ladder. AI cost measurement is meant to start at the cost-per-token level and climb toward outcome-oriented metrics — cost per assist, cost per agent action, cost per case deflected — with the granular tracking (down to token, GPU and per-prediction level) feeding the rungs above it.⁴

Crawl: Cost per token / GPU-hour. Fine-grained tracking. Necessary, but answers "what did we spend?", not "was it worth it?"

Walk: Cost per call / feature / AI prediction. Attributing spend to a specific model, task or workload, the first view a product owner can act on.

Run: Cost per outcome. Cost per assist, per agent action, per case deflected, per ticket resolved: the rung where cost finally meets value, and ROI becomes computable.

The FinOps "Crawl / Walk / Run" maturity model applied to AI. Most organisations are stuck on the bottom rung — which is why they can report spend but not return. Source: FinOps Foundation, Unit Economics capability.
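The three rungs can be computed from the same usage log, which is what makes the ladder practical. The sketch below is a minimal illustration, assuming each model call is logged with its tokens, its cost and the business outcome it served (here a support ticket). The record layout, field names and sample figures are hypothetical and are not part of the FinOps framework.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class CallRecord:
    workload: str
    tokens: int
    cost_usd: float
    outcome_id: Optional[str]  # e.g. ticket id; None if the call served no outcome


def ladder(records: Iterable[CallRecord], resolved: Set[str]) -> dict:
    """Crawl / Walk / Run metrics for one workload.

    Design choice (an assumption): all spend, including calls on unresolved
    or unattributed work, is charged to the outcomes actually achieved.
    """
    records = list(records)
    total_cost = sum(r.cost_usd for r in records)
    total_tokens = sum(r.tokens for r in records)
    if not records or total_tokens == 0:
        raise ValueError("no usage recorded")
    return {
        "crawl_cost_per_million_tokens": total_cost / total_tokens * 1_000_000,
        "walk_cost_per_call": total_cost / len(records),
        "run_cost_per_outcome": total_cost / len(resolved) if resolved else None,
    }


log = [
    CallRecord("support", 2_000, 0.006, "T1"),
    CallRecord("support", 6_000, 0.018, "T1"),
    CallRecord("support", 4_000, 0.012, "T2"),
    CallRecord("support", 8_000, 0.024, None),
]
metrics = ladder(log, resolved={"T1"})  # only ticket T1 was actually resolved
print(metrics)
```

In this toy log the cost per call looks small, but the cost per resolved ticket is four times higher because unresolved and unattributed calls still have to be paid for. That gap is what the bottom two rungs cannot show.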

The metric ladder needs an owner, or it stalls in finance. The Foundation's recommended governance vehicle is a cross-functional AI Investment Council — and the value of the recommendation is in the specific membership, because it is the cross-functional composition that lets cost meet outcome in one room. The council, FinOps notes, drives the unit-economics conversation higher in the organisation by defining the specific outcomes and KPIs that AI projects are required to address.¹¹

AI Investment Council · cross-functional membership

Business & product owners: own the outcome / KPI

AI / technology lead: feasibility, model choice

Enterprise architecture & platform: where it runs

Infrastructure leaders: capacity, GPU economics

IT security & risk: governance, AI Act exposure

Finance & FinOps: unit economics, forecasting

Procurement / contracts: provider & price risk

Defined outcomes & KPIs each AI project must address: cost-per-outcome targets · attribution · go / no-go on the same evidence base

FinOps frames the council as "one of the most effective ways" to drive the unit-economics discussion — a central recommended mechanism, not the only one. The membership is the point: every function that touches AI cost or value is in the room when the KPI is set. Source: FinOps Foundation, Managing AI Value working group.

Two honest qualifications. The first is that the Foundation itself does not claim to have finished the job: it concedes there is no settled, standardised methodology for quantifying AI business value yet — the approaches are still emerging. That is precisely why the critique at the top of this piece lands; the discipline building the fix is candid that the fix is incomplete. The second is that the framework's language is descriptive, not a mandate — it observes that mature practices "expand toward" outcome metrics, it does not order anyone to. The strategic reading is the same either way: the destination is cost-per-outcome, almost nobody is there yet, and the organisations that get there earliest will be the ones that can prove value while their competitors are still reporting satisfaction.

How to read this if you are the buyer

Strip the surveys away and the operator's job comes down to four situations. The framing below cuts through the noise faster than any maturity scorecard.

Situation 1 — the board asks "what is our AI ROI?". The honest first answer is a counter-question: at what unit? If the organisation cannot state a cost-per-outcome for its flagship AI workload — cost per resolved ticket, per generated document, per deflected case — then the ROI does not exist yet as a number, and any figure presented is satisfaction wearing a finance costume. The work is not to produce a better slide; it is to instrument one workload to the cost-per-outcome rung and report that.

Situation 2 — the CEO with stalled pilots. The evidence says the cause is rarely the model. Check two things first: where the budget went (front office by visibility, or where the return is?) and what is being measured (throughput, or outcome?). A single use case instrumented to cost-per-outcome, with a named business owner whose P&L moves with it, beats a portfolio of pilots measured on "engagement." Three months of that beats twelve months of pilots.

Situation 3 — the cost line is volatile and nobody can forecast it. This is the ±10% problem, and it is an engineering-and-governance problem, not a procurement one. The fixes are concrete: instrument token, model and agent-step consumption per workload; treat prompt and context design as cost governance; cap agentic loop depth; and qualify a cheaper fallback model for high-volume, low-complexity tasks. Forecastability is a capability you build, not a rate you negotiate.

Situation 4 — provider and price risk. Assume today's token price is promotional and write that assumption into the multi-year case. Avoid single-provider lock-in on any material workload, keep an open-weight or smaller-model fallback qualified, and put price-change and exit terms in the contract. You do not need a view on whether the providers will be profitable. You need your business case to survive the day the price moves.

And three questions that cut through a vendor pitch faster than any RFP: "Show me the cost-per-outcome you measured on your last engagement." "Show me how you instrumented the cost side — token, model, agent-step." "Show me the business owner whose number moved." If a firm answers in pilots, demos and satisfaction scores, what you are buying is enablement, not measured value.
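For Situation 3, the two most mechanical fixes, metering consumption per workload and model and capping agentic loop depth, fit in a small helper. The sketch below is one way to do it under stated assumptions: the class, the exception and the workload and model labels are all hypothetical, and a real deployment would hook this into whatever agent framework and billing export it already uses.

```python
from collections import defaultdict


class LoopBudgetExceeded(RuntimeError):
    """Raised when a task tries to take more agent steps than allowed."""


class UsageMeter:
    def __init__(self, max_steps_per_task: int):
        if max_steps_per_task < 1:
            raise ValueError("max_steps_per_task must be at least 1")
        self.max_steps = max_steps_per_task
        self.tokens = defaultdict(int)  # (workload, model) -> tokens consumed
        self.steps = defaultdict(int)   # task_id -> agent steps taken

    def record_step(self, task_id: str, workload: str, model: str, tokens: int) -> None:
        """Count one agent step, refusing it if the task is over its cap."""
        if self.steps[task_id] >= self.max_steps:
            raise LoopBudgetExceeded(f"{task_id} exceeded {self.max_steps} steps")
        self.steps[task_id] += 1
        self.tokens[(workload, model)] += tokens


meter = UsageMeter(max_steps_per_task=3)
for step_tokens in (1_200, 900, 1_500):
    meter.record_step("task-1", "support", "large-model", step_tokens)

try:
    meter.record_step("task-1", "support", "large-model", 800)
    capped = False
except LoopBudgetExceeded:
    capped = True  # the fourth step is refused and its tokens are never spent

print(dict(meter.tokens), dict(meter.steps), capped)
```

The cap turns an open-ended loop into a bounded worst case per task, which is the property a forecast needs; the per-workload, per-model totals are the raw material for the cost-per-outcome rung.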

Where Consulting Huber fits

Consulting Huber is a practitioner firm. We do not sell an AI platform, and we have no incentive to inflate a token bill or a pilot count. We work with CEOs, boards, transformation officers and PE operators on the unglamorous half of the problem the surveys keep pointing at: making AI cost and value measurable, so the decision to scale or stop rests on a number rather than a mood.

In practice that means instrumenting one strategically important workload to the cost-per-outcome rung in the first weeks; standing up the cross-functional cadence — business owner, engineering, finance, risk — that the FinOps framework calls a council and we simply call the room where the KPI gets set; building forecastability into the cost line instead of negotiating it; and writing provider and price risk into the business case. The full shape of that delivery discipline sits in our companion deep-dive on the delivery layer underneath AI and in the AI Value Creation Playbook; the regulatory side sits in our EU AI Act compliance playbook.

If you are sitting in one of the four buyer situations above and want a direct conversation about how to make your AI spend measurable, the fastest way to start is our two-week, fixed-fee Delivery & AI-Readiness Diagnostic — an honest read on whether your AI and digital spend pays, delivered as an IC-ready memo. Or use the calendar link below.

Sources consulted

Enterprise ROI evidence

[1] Bain & Company, "AI moves from pilots to production", Q3 2025 executive survey (n=197) — satisfaction ~80% among meaningful adopters, ~23% reporting revenue or cost impact. Corroborated by Bloomberg, "AI Delivers Less Cost Reduction Than Firms Predicted" (June 2026). Cite as an executive survey, not a population statistic.

[2] S&P Global Market Intelligence, Voice of the Enterprise: AI & Machine Learning (~1,006 respondents, North America + Europe) — abandonment 17%→42% YoY; 46% of projects scrapped between PoC and adoption; positive-impact decline across revenue (81→76), cost (79→74) and risk (74→70). Figures independently reported by CIO Dive.

[3] MIT Media Lab NANDA initiative, The GenAI Divide: State of AI in Business 2025 (150 leader interviews, 350-person employee survey, 300 public deployments) — ~95% of pilots with no measurable P&L impact; 60/20/5 funnel; ~50–70% of budget to sales & marketing. Via Fortune and the report PDF. Methodologically contested: Wharton's Kevin Werbach and others question the derivation of the 95% figure and the missing funnel denominators; the publisher promotes commercial agentic protocols. Presented throughout as what the report reports, with the critique attached.

[7] Cui, Demirer, Jaffe, Musolff, Peng & Salz, field RCT across Microsoft, Accenture and a Fortune 100 manufacturer (n=4,867; pre-registered AEARCTR-0014530), published in Management Science (2025). GitHub Copilot raised completed tasks ~26% (SE ~10.3%); junior developers +27–39%, seniors +8–13%. Measures throughput, not code quality or financial return.

Token & inference cost opacity

[8] Stanford HAI, AI Index 2025 — ~280-fold drop in per-token cost for GPT-3.5-equivalent quality ($20 → $0.07 per million tokens, Nov 2022 – Oct 2024). The primary anchor for the cost-decline magnitude.

[9] VentureBeat, "Cheaper tokens, bigger bills: the new math of AI infrastructure" — consumption up >100× while price fell ~10× (same-model floor); cost is "an engineering problem that requires continuous tuning"; Uber and ServiceNow reportedly exhausted full-year 2026 AI budgets in 4–5 months. Agentic 5–30× token amplification corroborated by CloudZero and IDC. Secondary source; treat the consumption multiple as directional.

[10] FinOps Foundation, State of FinOps 2026 — ~15% of enterprises forecast AI cost within ±10%; ~1 in 4 miss by >50%. Token pricing, agent-step billing and retrieval costs create volatility legacy budgeting cannot handle. Fast-moving; verify before republication.

Provider economics (as buyer risk)

[5] CNBC, "OpenAI resets spend expectations, targets around $600 billion by 2030" (Feb 2026) — spend reset from a touted $1.4T to ~$600B by 2030; 2025 revenue $13.1B against ~$8B burn; projected 2030 revenue >$280B. Corroborated by Reuters and Bloomberg. Unaudited figures relayed from a private company's internal projections — "reportedly targeting," not accounts.

[6] Moody's Ratings — Oracle outlook revised to negative from stable (Baa2 affirmed), citing counterparty-concentration risk from a ~$300B / 4.5 GW OpenAI compute contract; characterised as one of the world's largest project financings. Via Yahoo Finance; clarified by The Register as an outlook revision (mid-2025), not a downgrade. For citation, prefer Moody's own rating action at ratings.moodys.com.

The measurement playbook

[4] FinOps Foundation, Unit Economics capability — the definitional framing of unit economics under "Quantify Business Value," and the Crawl/Walk/Run progression from cost-per-token toward cost per assist / agent action / case deflected. The Linux Foundation project is the standards authority for cloud and AI cost management.

[11] FinOps Foundation, Managing AI Value working group — the cross-functional AI Investment Council and its membership; tracking to token, GPU and per-prediction level; and the explicit concession that quantifying AI business value is "a major challenge" with no settled methodology yet.

The provocation

[0] Ed Zitron, "AI Doesn't Have a Return on Investment" and related essays — cited as the framing polemic this piece tests, not as an evidentiary source. The argument that true AI cost and ROI are obscured is taken seriously above and checked against primary data; the more sweeping conclusions are not adopted.

What the evidence does not yet settle

Four questions remained open after this research, and any honest reader should hold them: (1) the net blended unit cost of a representative agentic workload after retries, context bloat and multi-step amplification — no source quantified how much of the "cheaper tokens" saving survives at the workload level; (2) how far below cost, if at all, current frontier-model API prices sit — burn figures show losses but do not isolate per-token inference economics; (3) the specific, repeatable instrumentation that distinguishes the ~5% who scale, with before/after outcome data, beyond the frameworks above; (4) how GPU depreciation and useful-life assumptions affect the durability of today's pricing. These are the questions to ask any vendor or internal team claiming certainty.

Book a 30-min call Or send a brief on your situation