The AI measurement crisis: what enterprise AI actually costs, and how to know if it pays
The loudest critics are right about one uncomfortable thing: most organisations cannot say what their AI costs, or whether it pays. This is a strategy problem, not an investment thesis. Here is what the primary evidence actually shows on the ROI gap, the cost opacity and the provider risk — and the instrumentation that separates the roughly one-in-twenty programmes that scale from everyone else.
The provocation, taken seriously
The technology writer Ed Zitron has spent two years arguing, loudly, that the AI industry runs on numbers nobody can pin down — that the true cost of inference is obscured, that the revenue is thin against the spend, and that "AI doesn't have a return on investment". It is a polemic, and parts of it are contestable. But underneath the rhetoric sits a claim that is harder to wave away, and that this piece sets out to test against primary sources rather than vibes: most enterprises genuinely cannot measure what their AI costs, and cannot demonstrate what it returns.
That is not a stock-market question. It is a strategy question. A board does not need to know whether OpenAI is a good investment to need to know whether its own AI programme is producing value, and right now, on the published evidence, most boards cannot answer the second question with a number. The striking part is that the people building the tooling to fix this agree. The FinOps Foundation, the Linux Foundation body that effectively defines cloud-cost discipline, states plainly that "measuring and quantifying the business value of AI initiatives has been called out as a major challenge" by the practitioners managing AI spend, and that the methods to do it are still emerging rather than settled [11].
So the critique lands. The interesting question is what a serious operator does about it. This piece walks through the four places the measurement breaks (the ROI evidence, the cost side, the provider economics, and why pilots stall) and then sets out the instrumentation that the organisations who do measure are actually using. Every chart below is drawn from a primary survey or a framework document, and where a source is weak or contested, it is flagged in the text, not buried.
Part I · The return
Satisfaction is high. Measured return is not.
The cleanest finding across the 2025 surveys is not that AI fails. It is that adopters are happy with it and still cannot show the money. Bain & Company's Q3 2025 executive survey found that among the 59% of companies meaningfully adopting generative AI, the technology met or exceeded expectations in roughly 80% of cases across the functions where it was deployed. In the same survey, only about 23% of all respondents said generative AI had actually delivered more revenue or lower costs [1]. That gap, between "it works" and "we can attribute value to it", is the measurement crisis in a single chart.
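[Figure: The satisfaction–attribution gap. Enterprise generative AI, Bain executive survey, Q3 2025: expectations met or exceeded in ~80% of cases among meaningful adopters, against ~23% of all respondents reporting revenue or cost impact.]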
If that were one survey, it would be an anecdote. It is not. S&P Global Market Intelligence's Voice of the Enterprise survey of roughly 1,006 IT and line-of-business professionals across North America and Europe found that the share of organisations abandoning the majority of their generative-AI initiatives before production more than doubled year over year, from 17% to 42%, and that, on average, 46% of projects were scrapped somewhere between proof of concept and broad adoption [2].
[Figure: Abandonment more than doubled in a year. Share of organisations abandoning the majority of GenAI initiatives before production: 17% → 42%, year over year.]
The same longitudinal survey found something more telling than any single abandonment number: the proportion of organisations reporting a positive impact from generative AI fell across every enterprise objective it measured, year over year. Not a reallocation, not a plateau — a decline on all three fronts at once.
[Figure: Positive impact fell on every objective measured. Share of organisations reporting positive GenAI impact, 2024 → 2025: revenue 81% → 76%, cost 79% → 74%, risk 74% → 70%.]
The 95% number, and why to handle it with gloves
No statistic in this debate travels further than MIT's. The Media Lab's NANDA initiative report, The GenAI Divide: State of AI in Business 2025, built on 150 leader interviews, a 350-person employee survey and a scan of 300 public deployments, reports that about 5% of enterprise AI pilots achieve rapid revenue acceleration while roughly 95% deliver little or no measurable impact on profit and loss [3]. It also reports a steep adoption funnel for task-specific, embedded tools, against a much gentler path for generic chatbots like ChatGPT and Copilot. The gloves are warranted because the headline is methodologically contested: Wharton's Kevin Werbach and others question how the 95% figure was derived and point to missing denominators in the funnel, and the report's publisher promotes commercial agentic protocols.
[Figure: The pilot-to-production funnel. Task-specific, embedded enterprise tools reaching production, with generic chatbots shown for comparison.]
The reason to keep all three sources in view at once is that they fail differently. Bain is a small executive survey. S&P is a larger longitudinal one. MIT is a contested headline. They do not agree on a number — they agree on a shape: adoption is broad, satisfaction is real, and attributable financial return is rare and getting harder to claim. That shape is robust even when each individual figure is soft.
Part II · The cost
Why "cheaper tokens" has produced bigger bills
The return side is hard to measure. The cost side is, if anything, worse, because the headline trend points the wrong way from the bill. Per-token prices have collapsed. Stanford HAI's AI Index documents a roughly 280-fold drop in the cost of querying a model at GPT-3.5-equivalent quality between November 2022 and October 2024, from about $20 to about $0.07 per million tokens [8]. Even on a conservative same-model basis, practitioners put the fall at roughly an order of magnitude over two years. And yet enterprise AI bills are rising, not falling, because consumption is growing faster than price is dropping: the classic Jevons-paradox dynamic, where efficiency expands usage faster than it cuts unit cost [9].
Note on sourcing: the magnitude of the price decline is anchored on Stanford HAI. The ">100×" consumption figure and the budget-exhaustion anecdotes come from VentureBeat's reporting and should be read as directional. Sources: Stanford HAI AI Index 2025; VentureBeat, "Cheaper tokens, bigger bills".
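The arithmetic behind "cheaper tokens, bigger bills" is short enough to write down. The sketch below uses only figures cited in this piece: the Stanford HAI endpoints, and the directional same-model price fall (~10×) and consumption growth (>100×) from VentureBeat's reporting. The function names are illustrative, not taken from any source.

```python
def fold_change(before: float, after: float) -> float:
    """Return how many times smaller `after` is than `before`."""
    return before / after


def bill_multiplier(price_drop_fold: float, consumption_growth_fold: float) -> float:
    """Return the change in total spend when unit price falls and volume rises."""
    return consumption_growth_fold / price_drop_fold


# Stanford HAI anchor cited above: about $20 -> about $0.07 per million tokens.
# The endpoints are rounded, so the result lands near, not exactly on, ~280x.
hai_fold = fold_change(20.00, 0.07)

# Directional figures: ~10x same-model price fall, ~100x consumption growth.
# The reporting says ">100x", so 100 is a floor, not a measurement.
same_model_bill = bill_multiplier(price_drop_fold=10, consumption_growth_fold=100)

print(f"Price drop from rounded endpoints: ~{hai_fold:.0f}x")
print(f"Bill multiplier at 10x cheaper, 100x more usage: ~{same_model_bill:.0f}x")
```

On those directional inputs the bill grows by about an order of magnitude even though each token is an order of magnitude cheaper, which is the Jevons dynamic in one division.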
Falling prices would still let a buyer forecast, if the unit were stable. It is not. The true cost of an AI workload is hard to know because it depends on too many interacting variables to reason about intuitively: which model actually serves a given request, where the workload executes, how the prompt and context are structured, how much retrieval is stuffed into the window, and, above all, how many times an agentic workflow loops. Industry analysis from CloudZero and IDC describes agentic multi-call patterns amplifying token consumption by 5 to 30 times for a single user-visible task. Managing this is, in the words of one practitioner, "an engineering problem that requires continuous tuning", which reframes prompt engineering as a cost-governance discipline, not a prompt-craft one [9].
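A toy model makes the amplification concrete. The sketch below assumes each agent step re-sends the accumulated context, so input tokens grow with every loop. The token counts, step count and prices are hypothetical values chosen for illustration; none of them is a provider's published rate or a figure from the sources.

```python
from dataclasses import dataclass


@dataclass
class Step:
    input_tokens: int
    output_tokens: int


def agent_steps(base_input: int, output_per_step: int, n_steps: int) -> list:
    """Build the steps of a loop in which each output is fed back as context."""
    steps = []
    context = base_input
    for _ in range(n_steps):
        steps.append(Step(context, output_per_step))
        context += output_per_step  # the next call re-sends everything so far
    return steps


def total_tokens(steps) -> int:
    return sum(s.input_tokens + s.output_tokens for s in steps)


def cost(steps, in_price_per_m: float, out_price_per_m: float) -> float:
    """Cost in dollars, given per-million-token prices for input and output."""
    return sum(
        s.input_tokens / 1e6 * in_price_per_m + s.output_tokens / 1e6 * out_price_per_m
        for s in steps
    )


# Hypothetical task: one direct call versus a six-step agentic loop.
single = [Step(2000, 500)]
agent = agent_steps(base_input=2000, output_per_step=500, n_steps=6)

amplification = total_tokens(agent) / total_tokens(single)
# Hypothetical prices: $1 per million input tokens, $3 per million output tokens.
single_cost = cost(single, 1.0, 3.0)
agent_cost = cost(agent, 1.0, 3.0)

print(f"Token amplification: {amplification:.1f}x")
print(f"Cost per task: ${single_cost:.4f} direct vs ${agent_cost:.4f} agentic")
```

Under these assumptions six loops consume nine times the tokens of a single call, inside the 5 to 30 times range the industry analysis describes, and the multiple rises with every additional step.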
The consequence shows up directly in budgeting accuracy. The FinOps Foundation's State of FinOps 2026 data indicates that only about 15% of enterprises forecast their AI costs to within ±10%, while roughly one in four miss their forecast by more than 50% [10]. A line item you miss by half is not a line item you can build a business case on.
[Figure: Most enterprises cannot forecast their AI bill. Accuracy of enterprise AI cost forecasts: ~15% land within ±10%; roughly one in four miss by more than 50%.]
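An organisation can place itself in those forecast bands with a few lines of arithmetic. The sketch below is a minimal illustration: the band thresholds (±10% and more than 50%) follow the figures quoted above, while the workload names and dollar amounts are hypothetical.

```python
def forecast_error(forecast: float, actual: float) -> float:
    """Signed error as a fraction of the forecast (positive means overspend)."""
    return (actual - forecast) / forecast


def accuracy_band(forecast: float, actual: float) -> str:
    """Classify a forecast into the bands used in the survey figures above."""
    err = abs(forecast_error(forecast, actual))
    if err <= 0.10:
        return "within 10%"
    if err > 0.50:
        return "miss by more than 50%"
    return "miss by 10-50%"


# Hypothetical annual figures per workload: (forecast, actual), in dollars.
workloads = {
    "support-assistant": (100_000, 108_000),
    "contract-review": (100_000, 130_000),
    "agentic-research": (100_000, 160_000),
}

bands = {name: accuracy_band(f, a) for name, (f, a) in workloads.items()}
for name, band in bands.items():
    print(f"{name}: {band}")
```

Running this per workload rather than on the aggregate bill matters, because a stable total can hide one workload that is badly off forecast.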
This is the part of Zitron's critique that holds up best. Not "AI is worthless" — the productivity evidence below contradicts that — but "the true unit cost is structurally hard to know." It is. And an organisation that cannot state its cost-per-unit-of-work cannot compute a return, no matter how good the work is.
Part III · The provider
Provider economics — as a sourcing risk, not an investment view
It is not the buyer's job to value the model providers. But it is the buyer's job to understand that the price they pay today rests on an economic structure that is still finding its level — because that structure determines pricing stability and counterparty risk, which are budgeting inputs. Three facts, all from reporting of the providers' own figures, are enough to frame the exposure.
First, OpenAI's spend plans have moved by enough to matter. In February 2026, CNBC reported that the company had reset its compute-spend target downward, from the roughly $1.4 trillion in infrastructure commitments that CEO Sam Altman had touted to around $600 billion by 2030, explicitly to tie spend more directly to expected revenue growth [5]. Second, its 2025 results, as relayed, show real burn: about $13.1 billion in revenue against roughly $8 billion of cash burned [5].
These are unaudited figures relayed via reporting of a private company's internal projections — the strongest channel available, corroborated across CNBC, Reuters and Bloomberg, but inherently not independently verifiable. Read them as "reportedly targeting," not as accounts. Source: CNBC, "OpenAI resets spend expectations" (Feb 2026).
Third, the strain is now visible in the credit ratings of the companies financing the build-out. In mid-2025, Moody's revised Oracle's outlook to negative from stable while affirming its Baa2 rating, the lower end of investment grade. It cited counterparty-concentration risk tied to a roughly $300 billion, 4.5-gigawatt compute contract with OpenAI, which Moody's characterised as one of the largest project financings in the world [6]. This was an outlook revision, not a downgrade, but for an enterprise buyer it is a concrete, named signal.
Depending on external LLMs at scale is a strategic exposure in its own right
Underneath the price and counterparty numbers sits a larger point that deserves to be named plainly. Routing a core, high-volume business process through an external model API concentrates an operational dependency outside the organisation's control. At pilot scale that is a sensible trade — capability and speed in exchange for a small, contained spend. At production scale, when thousands of daily decisions, documents or customer interactions flow through a single third-party endpoint, the same arrangement becomes a question of resilience rather than convenience. A provider that is still burning cash, resetting its own spend roadmap and financing its build-out through concentrated counterparties is not yet a stable utility; it is a fast-moving supplier of an input the business has quietly made load-bearing. A pricing change, a rate-limit, a deprecated model version or an outage then lands not as an IT inconvenience but as an interruption to a core process.
The conclusion is not to avoid external models — they are too capable, and building frontier capability in-house is rarely the right call. It is to treat a model provider the way a serious operator treats any critical single-source supplier the moment a process scales past experimentation: with a qualified fallback, a contract that addresses price and continuity, a budget that assumes the rate will move, and an honest answer to one question — what happens to this process if the price doubles, the model is retired, or the endpoint is unavailable next quarter? If there is no answer, the dependency is a strategic risk wearing the costume of a convenient API.
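In engineering terms, a "qualified fallback" can be as simple as an ordered list of providers tried in turn. The sketch below is a minimal illustration of that pattern and nothing more: the provider functions are stand-in stubs, not real vendor SDK calls, and a production version would also need timeouts, output-quality checks and logging.

```python
from typing import Callable, Sequence, Tuple


class AllProvidersFailed(Exception):
    """Raised when every configured provider fails."""


def call_with_fallback(
    providers: Sequence[Tuple[str, Callable[[str], str]]], prompt: str
) -> Tuple[str, str]:
    """Try each provider in order; return (provider_name, response)."""
    errors = []
    for name, call in providers:
        try:
            return name, call(prompt)
        except Exception as exc:  # outage, rate limit, retired model version
            errors.append(f"{name}: {exc}")
    raise AllProvidersFailed("; ".join(errors))


# Hypothetical stubs standing in for real model endpoints.
def primary(prompt: str) -> str:
    raise TimeoutError("endpoint unavailable")


def fallback(prompt: str) -> str:
    return f"[fallback] summary of: {prompt}"


used, answer = call_with_fallback(
    [("primary", primary), ("fallback", fallback)], "invoice 1042"
)
print(used, "->", answer)
```

The code is the easy part. The qualification work is establishing, before the outage, that the fallback's output is good enough for the process that depends on it.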
Part IV · The gap
Why pilots stall — and what the evidence says actually works
If satisfaction is high and attribution is rare, the obvious question is what separates the programmes that convert. The evidence points away from the model and toward two structural causes: what gets funded, and what gets measured.
The budget goes where it is easy to see, not where it pays
The MIT NANDA report's most actionable finding, and one more defensible than its headline failure rate, is that generative-AI budgets are systematically misallocated. Around half of GenAI budgets (the report's abstract says ~50%; its survey detail runs as high as ~70%) flow to front-office sales and marketing functions, while back-office automation that often yields better ROI is underfunded. The reason is itself a measurement problem: sales and marketing outcomes map cleanly onto board-level KPIs and investor updates, whereas the efficiencies in legal, procurement and finance are real but harder to surface in an executive conversation [3].
[Figure: Budget follows visibility, not return. Allocation of enterprise GenAI budget by function: roughly 50–70% to sales and marketing.]
The productivity is real — but unevenly distributed
It would be wrong to leave the impression that AI does not work. A large, pre-registered field experiment across Microsoft, Accenture and an anonymous Fortune 100 manufacturer (n=4,867 developers, published in Management Science) found that GitHub Copilot raised the number of completed tasks by about 26% [7]. Two caveats matter for any ROI built on that figure. First, the study measured task throughput, not code quality or financial return; the researchers did not have access to the produced code. Second, and more useful for strategy, the gains were highly uneven by experience: junior developers gained roughly 27–39%, seniors roughly 8–13%.
[Figure: The same tool, very different gains. Output increase from AI coding assistance, by developer experience: junior developers +27–39%, senior developers +8–13%.]
Put the two findings together and the strategic implication is sharp. The value is real, but it is contingent — on the function, on the workforce composition, on whether the workflow was redesigned around the tool. A programme that does not measure at that granularity will see the average and miss the distribution, fund the visible use case over the valuable one, and report "it met expectations" while the P&L does not move. That is not a model failure. It is an instrumentation failure.
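To see how an average hides the distribution, take the junior and senior ranges reported in source [7] (+27–39% and +8–13%) and blend them by team composition. The sketch below uses the midpoint of each range, which is a simplification of mine rather than a figure from the study, and it assumes equal baseline output per developer. The team mixes are hypothetical.

```python
# Midpoints of the ranges in source [7]; a simplifying assumption, not study data.
JUNIOR_GAIN = (0.27 + 0.39) / 2
SENIOR_GAIN = (0.08 + 0.13) / 2


def blended_gain(junior_share: float) -> float:
    """Expected throughput gain for a team with the given share of juniors."""
    if not 0.0 <= junior_share <= 1.0:
        raise ValueError("junior_share must be between 0 and 1")
    return junior_share * JUNIOR_GAIN + (1.0 - junior_share) * SENIOR_GAIN


# Hypothetical teams: same tool, different workforce composition.
senior_heavy = blended_gain(0.2)
junior_heavy = blended_gain(0.8)

print(f"Senior-heavy team (20% junior): {senior_heavy:.1%}")
print(f"Junior-heavy team (80% junior): {junior_heavy:.1%}")
```

On these assumptions the same licence yields roughly twice the throughput gain in one team as in the other, so a business case built on the headline average would overstate one and understate the other.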
Part V · The fix
The measurement playbook: from cost-per-token to cost-per-outcome
The good news is that the discipline to fix this is not theoretical. The FinOps Foundation, the body that standardised cloud-cost management, has extended its framework to AI, and its core construct, Unit Economics, is the most concrete primary answer available. Unit Economics is defined as "metrics that provide an understanding of how an organisation's technology use and technology management practices impact the value of the organisation's products, services, or activities," and it sits squarely under the framework's Quantify Business Value domain. The Foundation states the principle bluntly: "without a way to relate costs to benefits received, it is difficult to understand whether spending is appropriate." [4]
The practical move is a ladder. AI cost measurement is meant to start at the cost-per-token level and climb toward outcome-oriented metrics (cost per assist, cost per agent action, cost per case deflected), with the granular tracking (down to token, GPU and per-prediction level) feeding the rungs above it [4].
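The ladder can be sketched as three divisions over the same month of data. This is an illustration under stated assumptions, not the FinOps Foundation's specification: the field names, the split between token spend and other spend, and every number below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WorkloadMonth:
    tokens: int            # total tokens consumed by the workload
    token_spend: float     # model/API spend in dollars
    other_spend: float     # retrieval, GPU, tooling and similar costs
    assists: int           # times the AI produced something a person used
    cases_deflected: int   # cases resolved without human handling


def metric_ladder(w: WorkloadMonth) -> dict:
    """Climb from cost per token to cost per outcome for one workload."""
    total = w.token_spend + w.other_spend
    return {
        "cost_per_million_tokens": w.token_spend / (w.tokens / 1e6),
        "cost_per_assist": total / w.assists,
        "cost_per_case_deflected": total / w.cases_deflected,
    }


# Hypothetical support workload for one month.
month = WorkloadMonth(
    tokens=400_000_000,
    token_spend=1_200.0,
    other_spend=800.0,
    assists=50_000,
    cases_deflected=8_000,
)
ladder = metric_ladder(month)
for rung, value in ladder.items():
    print(f"{rung}: ${value:.2f}")
```

The bottom rung uses token spend only, while the upper rungs carry the full cost of the workload. That is why a falling per-token price does not guarantee a falling cost per case deflected.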
The metric ladder needs an owner, or it stalls in finance. The Foundation's recommended governance vehicle is a cross-functional AI Investment Council, and the value of the recommendation is in the specific membership, because it is the cross-functional composition that lets cost meet outcome in one room. The council, the Foundation notes, drives the unit-economics conversation higher in the organisation by defining the specific outcomes and KPIs that AI projects are required to address [11].
Two honest qualifications. The first is that the Foundation itself does not claim to have finished the job: it concedes there is no settled, standardised methodology for quantifying AI business value yet — the approaches are still emerging. That is precisely why the critique at the top of this piece lands; the discipline building the fix is candid that the fix is incomplete. The second is that the framework's language is descriptive, not a mandate — it observes that mature practices "expand toward" outcome metrics, it does not order anyone to. The strategic reading is the same either way: the destination is cost-per-outcome, almost nobody is there yet, and the organisations that get there earliest will be the ones that can prove value while their competitors are still reporting satisfaction.
How to read this if you are the buyer
Strip the surveys away and the operator's job comes down to four situations. The framing below cuts through the noise faster than any maturity scorecard.
Situation 1 — the board asks "what is our AI ROI?". The honest first answer is a counter-question: at what unit? If the organisation cannot state a cost-per-outcome for its flagship AI workload — cost per resolved ticket, per generated document, per deflected case — then the ROI does not exist yet as a number, and any figure presented is satisfaction wearing a finance costume. The work is not to produce a better slide; it is to instrument one workload to the cost-per-outcome rung and report that.
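The point that the ROI "does not exist yet as a number" can be made literal. In the sketch below, ROI is computed per unit of outcome and the function returns nothing when outcomes have not been counted. The value per outcome (for example, the avoided cost of a human-handled ticket) and all figures are hypothetical inputs the organisation would have to supply.

```python
from typing import Optional


def cost_per_outcome(total_spend: float, outcomes: Optional[int]) -> Optional[float]:
    """Return spend per outcome, or None if outcomes were never counted."""
    if not outcomes:
        return None
    return total_spend / outcomes


def unit_roi(
    total_spend: float, outcomes: Optional[int], value_per_outcome: Optional[float]
) -> Optional[float]:
    """Return (value - cost) / cost per outcome, or None if it cannot be computed."""
    cpo = cost_per_outcome(total_spend, outcomes)
    if cpo is None or cpo == 0 or value_per_outcome is None:
        return None  # no instrumented unit means no ROI number to report
    return (value_per_outcome - cpo) / cpo


# Hypothetical instrumented workload: $2,000 spend, 8,000 resolved tickets,
# each worth $1.50 in avoided handling cost.
instrumented = unit_roi(2_000.0, 8_000, 1.50)

# The same spend with no outcome count: satisfaction, but no number.
uninstrumented = unit_roi(2_000.0, None, 1.50)

print(f"Instrumented ROI: {instrumented:.0%}")
print(f"Uninstrumented ROI: {uninstrumented}")
```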
Situation 2 — the CEO with stalled pilots. The evidence says the cause is rarely the model. Check two things first: where the budget went (front office by visibility, or where the return is?) and what is being measured (throughput, or outcome?). A single use case instrumented to cost-per-outcome, with a named business owner whose P&L moves with it, beats a portfolio of pilots measured on "engagement." Three months of that beats twelve months of pilots.
Situation 3 — the cost line is volatile and nobody can forecast it. This is the ±10% problem, and it is an engineering-and-governance problem, not a procurement one. The fixes are concrete: instrument token, model and agent-step consumption per workload; treat prompt and context design as cost governance; cap agentic loop depth; and qualify a cheaper fallback model for high-volume, low-complexity tasks. Forecastability is a capability you build, not a rate you negotiate.
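Capping agentic loop depth is the most mechanical of those fixes. The sketch below shows one way to do it, with a budget object checked before every step. The step function is a stub standing in for a real model call, and the limits are hypothetical. Note that the token cap here is soft: it is checked before each step, so the final step can overshoot it.

```python
from typing import Callable, List, Tuple


class LoopBudget:
    """Tracks steps and tokens for one agentic task against fixed caps."""

    def __init__(self, max_steps: int, max_tokens: int) -> None:
        self.max_steps = max_steps
        self.max_tokens = max_tokens
        self.steps = 0
        self.tokens = 0

    def can_continue(self) -> bool:
        return self.steps < self.max_steps and self.tokens < self.max_tokens

    def record(self, tokens_used: int) -> None:
        self.steps += 1
        self.tokens += tokens_used


def run_agent(
    step_fn: Callable[[int], Tuple[str, int, bool]], budget: LoopBudget
) -> Tuple[List[str], str]:
    """Run step_fn until it reports done or the budget is exhausted."""
    outputs: List[str] = []
    while budget.can_continue():
        output, tokens_used, done = step_fn(len(outputs))
        budget.record(tokens_used)
        outputs.append(output)
        if done:
            return outputs, "completed"
    return outputs, "stopped: budget exhausted"


# Stub step: never finishes and burns 1,500 tokens per call (hypothetical).
def runaway_step(i: int) -> Tuple[str, int, bool]:
    return f"step {i}", 1_500, False


budget = LoopBudget(max_steps=5, max_tokens=4_000)
outputs, status = run_agent(runaway_step, budget)
print(status, "after", budget.steps, "steps and", budget.tokens, "tokens")
```

Recording the stop reason alongside the token count gives finance and engineering the same per-workload record, which is what makes the cost line forecastable.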
Situation 4 — provider and price risk. Assume today's token price is promotional and write that assumption into the multi-year case. Avoid single-provider lock-in on any material workload, keep an open-weight or smaller-model fallback qualified, and put price-change and exit terms in the contract. You do not need a view on whether the providers will be profitable. You need your business case to survive the day the price moves.
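Writing the price assumption into the business case amounts to a sensitivity table and one break-even number. The sketch below is a minimal version, assuming token spend scales linearly with price while other costs stay fixed. Every figure is hypothetical.

```python
def annual_net_value(
    annual_value: float,
    tokens_millions: float,
    price_per_million: float,
    fixed_cost: float,
    price_multiplier: float = 1.0,
) -> float:
    """Net annual value of a workload if the token price moves by a multiplier."""
    token_spend = tokens_millions * price_per_million * price_multiplier
    return annual_value - fixed_cost - token_spend


def breakeven_multiplier(
    annual_value: float,
    tokens_millions: float,
    price_per_million: float,
    fixed_cost: float,
) -> float:
    """Price multiplier at which the workload's net value reaches zero."""
    return (annual_value - fixed_cost) / (tokens_millions * price_per_million)


# Hypothetical case: $500k annual value, 20 billion tokens at $5 per million,
# and $250k of fixed costs (people, integration, tooling).
case = dict(
    annual_value=500_000.0,
    tokens_millions=20_000.0,
    price_per_million=5.0,
    fixed_cost=250_000.0,
)

scenarios = {m: annual_net_value(**case, price_multiplier=m) for m in (1.0, 2.0, 3.0)}
breakeven = breakeven_multiplier(**case)

for m, net in scenarios.items():
    print(f"Price x{m:.0f}: net value ${net:,.0f}")
print(f"Case breaks even at {breakeven:.1f}x today's price")
```

In this example the case survives a doubling but not a tripling. The break-even multiple is the number to put in front of the board, alongside the contract terms that limit how fast the price can move.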
And three questions that cut through a vendor pitch faster than any RFP: "Show me the cost-per-outcome you measured on your last engagement." "Show me how you instrumented the cost side — token, model, agent-step." "Show me the business owner whose number moved." If a firm answers in pilots, demos and satisfaction scores, what you are buying is enablement, not measured value.
Where Consulting Huber fits
Consulting Huber is a practitioner firm. We do not sell an AI platform, and we have no incentive to inflate a token bill or a pilot count. We work with CEOs, boards, transformation officers and PE operators on the unglamorous half of the problem the surveys keep pointing at: making AI cost and value measurable, so the decision to scale or stop rests on a number rather than a mood.
In practice that means instrumenting one strategically important workload to the cost-per-outcome rung in the first weeks; standing up the cross-functional cadence — business owner, engineering, finance, risk — that the FinOps framework calls a council and we simply call the room where the KPI gets set; building forecastability into the cost line instead of negotiating it; and writing provider and price risk into the business case. The full shape of that delivery discipline sits in our companion deep-dive on the delivery layer underneath AI and in the AI Value Creation Playbook; the regulatory side sits in our EU AI Act compliance playbook.
If you are sitting in one of the four buyer situations above and want a direct conversation about how to make your AI spend measurable, the calendar link below is the fastest way to start.
Sources consulted
Enterprise ROI evidence
[1] Bain & Company, "AI moves from pilots to production", Q3 2025 executive survey (n=197) — satisfaction ~80% among meaningful adopters, ~23% reporting revenue or cost impact. Corroborated by Bloomberg, "AI Delivers Less Cost Reduction Than Firms Predicted" (June 2026). Cite as an executive survey, not a population statistic.
[2] S&P Global Market Intelligence, Voice of the Enterprise: AI & Machine Learning (~1,006 respondents, North America + Europe) — abandonment 17%→42% YoY; 46% of projects scrapped between PoC and adoption; positive-impact decline across revenue (81→76), cost (79→74) and risk (74→70). Figures independently reported by CIO Dive.
[3] MIT Media Lab NANDA initiative, The GenAI Divide: State of AI in Business 2025 (150 leader interviews, 350-person employee survey, 300 public deployments) — ~95% of pilots with no measurable P&L impact; 60/20/5 funnel; ~50–70% of budget to sales & marketing. Via Fortune and the report PDF. Methodologically contested: Wharton's Kevin Werbach and others question the derivation of the 95% figure and the missing funnel denominators; the publisher promotes commercial agentic protocols. Presented throughout as what the report reports, with the critique attached.
[7] Cui, Demirer, Jaffe, Musolff, Peng & Salz, field RCT across Microsoft, Accenture and a Fortune 100 manufacturer (n=4,867; pre-registered AEARCTR-0014530), published in Management Science (2025): GitHub Copilot raised completed tasks ~26% (SE ~10.3%); junior developers +27–39%, seniors +8–13%. Measures throughput, not code quality or financial return.
Token & inference cost opacity
[8] Stanford HAI, AI Index 2025 — ~280-fold drop in per-token cost for GPT-3.5-equivalent quality ($20 → $0.07 per million tokens, Nov 2022 – Oct 2024). The primary anchor for the cost-decline magnitude.
[9] VentureBeat, "Cheaper tokens, bigger bills: the new math of AI infrastructure" — consumption up >100× while price fell ~10× (same-model floor); cost is "an engineering problem that requires continuous tuning"; Uber and ServiceNow reportedly exhausted full-year 2026 AI budgets in 4–5 months. Agentic 5–30× token amplification corroborated by CloudZero and IDC. Secondary source; treat the consumption multiple as directional.
[10] FinOps Foundation, State of FinOps 2026 — ~15% of enterprises forecast AI cost within ±10%; ~1 in 4 miss by >50%. Token pricing, agent-step billing and retrieval costs create volatility legacy budgeting cannot handle. Fast-moving; verify before republication.
Provider economics (as buyer risk)
[5] CNBC, "OpenAI resets spend expectations, targets around $600 billion by 2030" (Feb 2026) — spend reset from a touted $1.4T to ~$600B by 2030; 2025 revenue $13.1B against ~$8B burn; projected 2030 revenue >$280B. Corroborated by Reuters and Bloomberg. Unaudited figures relayed from a private company's internal projections — "reportedly targeting," not accounts.
[6] Moody's Ratings — Oracle outlook revised to negative from stable (Baa2 affirmed), citing counterparty-concentration risk from a ~$300B / 4.5 GW OpenAI compute contract; characterised as one of the world's largest project financings. Via Yahoo Finance; clarified by The Register as an outlook revision (mid-2025), not a downgrade. For citation, prefer Moody's own rating action at ratings.moodys.com.
The measurement playbook
[4] FinOps Foundation, Unit Economics capability — the definitional framing of unit economics under "Quantify Business Value," and the Crawl/Walk/Run progression from cost-per-token toward cost per assist / agent action / case deflected. The Linux Foundation project is the standards authority for cloud and AI cost management.
[11] FinOps Foundation, Managing AI Value working group — the cross-functional AI Investment Council and its membership; tracking to token, GPU and per-prediction level; and the explicit concession that quantifying AI business value is "a major challenge" with no settled methodology yet.
The provocation
[0] Ed Zitron, "AI Doesn't Have a Return on Investment" and related essays — cited as the framing polemic this piece tests, not as an evidentiary source. The argument that true AI cost and ROI are obscured is taken seriously above and checked against primary data; the more sweeping conclusions are not adopted.
What the evidence does not yet settle
Four questions remained open after this research, and any honest reader should hold them: (1) the net blended unit cost of a representative agentic workload after retries, context bloat and multi-step amplification — no source quantified how much of the "cheaper tokens" saving survives at the workload level; (2) how far below cost, if at all, current frontier-model API prices sit — burn figures show losses but do not isolate per-token inference economics; (3) the specific, repeatable instrumentation that distinguishes the ~5% who scale, with before/after outcome data, beyond the frameworks above; (4) how GPU depreciation and useful-life assumptions affect the durability of today's pricing. These are the questions to ask any vendor or internal team claiming certainty.