The Honest Math of AI Productivity

At a recent work event, I watched two speakers from an AI consultancy repeatedly tell a room of executives to strive for 5 to 10x productivity gains. They followed it with a quick passing remark: if people are not willing or able to fully embrace becoming “AI-native” and reach those new heights of productivity, perhaps they don’t belong in your workforce anymore, much like a secretary who can’t work with Excel.

I breathe AI for a living. I build agent frameworks, I use Claude Code and Claude Cowork every day, I set up custom workspaces and custom plugins, I connect services. I know my stuff inside and out, and my own gains from working with AI are large. I’m about as pro-AI as it gets. But the 5-10x pitch is a giant jar of snake oil. The people selling it are either completely ignorant of what the evidence actually says (which I doubt, given that their presentation had all the other familiar statistics everyone keeps parroting) or counting on the room not to check.

So let’s check.

The Real Numbers

When you go looking for rigorous studies (randomized trials, peer-reviewed field experiments, government data), the gains are real, and they sit in single- to double-digit percentages, not multiples.

Customer support is the cleanest case. A Stanford and MIT field study of over 5,000 agents found AI raised productivity about 15% on average: roughly double that for the least-experienced agents, and close to nothing for the most experienced. Professional writing: in an MIT randomized trial published in Science, AI assistance cut task time about 40% and raised quality 18%. Management consulting: a pre-registered trial with 758 BCG consultants found real gains, but only inside AI’s “jagged frontier” of suitable tasks, where work was 25% faster. On a task chosen to sit just outside that frontier, the AI users were 19 percentage points more likely to get it wrong.

Notice the ceiling. The single best department-level result in the serious literature is roughly +40% on a narrow writing task, in a lab. Not 400%. Not 1000%.

Zoom out and it shrinks. The most careful meta-analysis to date pools the productivity effect at a moderate 0.33 standard deviations, and finds it collapses by setting: decent in the lab, much weaker in real enterprises, near zero in open-source work. At the level of the whole economy, MIT’s Daron Acemoglu estimates AI will add about 0.7% to total factor productivity over ten years. The most optimistic credible figure, from the St. Louis Fed, is about 1.1%, and that one is built on workers self-reporting how much time they think they saved.

As of mid-2026, there is no trustworthy study (no randomized trial, no peer-reviewed field experiment, no audited result) showing a sustained 5-10x gain in any department. Every time you see that number, trace it back. It resolves to a vendor benchmark, a demo, or a single cherry-picked task.

Why the Demo Looks Like 10x

The famous “55% faster” coding number is worth understanding, because it shows how the trick works. In GitHub’s own study, developers given Copilot finished a task 55.8% faster, which is where the headline figure comes from. The task was writing an HTTP server in JavaScript from scratch: one self-contained greenfield problem of the kind that has a thousand known solutions in the training data.

Now put the same tools in a real, mature codebase. METR ran a randomized trial with experienced developers working on their own large open-source projects, and the result is the one every executive should hear: they were 19% slower with AI. They had forecast a 24% speedup. Even after finishing slower, they still believed AI had sped them up by 20%. A 39-point gap between what they felt and what actually happened.

That gap is the whole con. Greenfield work is mostly boilerplate (authentication, payments, CRUD, the same API scaffolding everyone writes), and AI is genuinely fast at it. A five-year-old product is the hard case: a new feature touches thousands of lines and depends on context no document captures, and models reliably lose the thread when the relevant details sit deep in a long context. So when someone generalizes the greenfield demo into “5x across your business,” they are pricing the hard 90% of the work as if it were the easy 10%.
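A back-of-the-envelope calculation in the spirit of Amdahl's law shows why the greenfield number does not generalize. The sketch below is illustrative only. The 10%/90% split is the rhetorical one from the paragraph above, the 5x local speedup is an assumption, and `overall_speedup` is a hypothetical helper, not something taken from any of the studies cited here.

```python
def overall_speedup(fraction_accelerated: float, local_speedup: float) -> float:
    """Overall speedup when only part of the work gets faster.

    fraction_accelerated: share of total time spent on work the tool speeds up.
    local_speedup: how many times faster that share becomes.
    The remaining share is assumed to run at its original speed.
    """
    if not 0.0 <= fraction_accelerated <= 1.0:
        raise ValueError("fraction_accelerated must be between 0 and 1")
    if local_speedup <= 0:
        raise ValueError("local_speedup must be positive")
    remaining = 1.0 - fraction_accelerated
    return 1.0 / (remaining + fraction_accelerated / local_speedup)


# Assumed inputs: 10% of the work is boilerplate, and AI makes it 5x faster.
print(round(overall_speedup(0.10, 5.0), 2))  # 1.09

# Even a near-infinitely fast tool on that 10% cannot beat 1 / 0.9.
print(round(overall_speedup(0.10, 1e9), 2))  # 1.11
```

Under these assumptions, a 5x gain on the easy tenth of the work comes out to roughly a 9% gain overall. The real split in any given team will differ, which is exactly why it has to be measured rather than assumed.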

Even the greenfield case carries a tax. AI errors don’t average out, they compound: in one study, nearly 20% of generated code samples imported a package that doesn’t exist, and the same fabricated names came back on repeated runs. A separate Stanford trial found developers with an AI assistant wrote less secure code while feeling more confident it was secure. The faster you generate, the more unreviewed code you ship, and the more of your time moves into review and cleanup.

This is why the agency pitch is so dangerous. Those speakers were selling to shops that build fresh custom projects for clients, so every project looks like greenfield and every demo looks like 5x. The speed gets captured at the start. The maintenance bill gets handed to the client, who inherits a fast-built codebase nobody fully understands.

Measure Both Sides

The deeper problem is that almost nobody in that room could tell you whether they got 5x, 1.5x, or nothing, because they were only ever watching one side of the trade.

The honest metric is a comparison: your measured performance gain against your measured cost increase, both tracked over time. AI is worth it when the value it adds clearly beats what it adds to the bill, and you can’t know that unless you measure both halves. If you get 25% more value but spend 50% more to get it, that is a bad deal, and you won’t notice unless you tracked both.
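The 25%-versus-50% example can be written out as a tiny calculation. This is a minimal sketch that assumes value and cost are each normalized to 1.0 before adoption. Whether a deal pays off in absolute terms also depends on how large the value base is relative to the cost base, which this normalized view deliberately ignores. `value_per_cost_ratio` is a hypothetical name used for illustration.

```python
def value_per_cost_ratio(value_gain_pct: float, cost_increase_pct: float) -> float:
    """Value delivered per unit of spend, relative to a pre-AI baseline of 1.0.

    A result below 1.0 means each unit of spend now buys less value than before.
    """
    new_value = 1.0 + value_gain_pct / 100.0
    new_cost = 1.0 + cost_increase_pct / 100.0
    if new_cost <= 0:
        raise ValueError("cost increase must leave a positive cost")
    return new_value / new_cost


# The example from the text: 25% more value for 50% more spend.
ratio = value_per_cost_ratio(25.0, 50.0)
print(f"{ratio:.2f}")  # 0.83
```

A ratio of about 0.83 means each unit of spend buys roughly 17% less value than it did before. Neither input reveals that on its own, so both have to be tracked.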

What gets measured instead is tokens. The community started calling it tokenmaxxing: spending more and more on AI with no measurable increase in value, and treating tokens spent as the measure of “productivity.” A silly concept, if you ask me. It happens the moment “we used X billion tokens this quarter” lands on a slide as evidence of progress. Tokens are the cost side, read as if they were the value side. As of mid-2026, per-seat AI spend can reach a serious fraction of a salary, in some cases even a multiple, so that cost side is no longer a rounding error you get to ignore.

Measuring the value side is the part the keynote skipped, and it’s hard. The research is blunt that there’s no single number for it: the SPACE framework from the people who wrote the book on engineering productivity says so directly, and the METR perception gap says why you can’t trust the gut check, since people feel considerably faster than they measurably are. So you do the slow part on both sides. Pick a few value-linked outcomes (cycle time, defect rate, time to ship), record where you stand before you adopt anything, and track the spend right next to them.
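As a rough illustration of what tracking both sides could look like, the sketch below records a baseline snapshot and a later one, then reports the change in each outcome next to the change in spend. The field names, the choice of metrics, and all the numbers are made up for the example. A real setup would pull them from whatever issue tracker and billing data a team already has.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """One measurement period. All field names are illustrative."""
    cycle_time_days: float  # lower is better
    defect_rate: float      # defects per release, lower is better
    monthly_spend: float    # total tooling spend for the team


def pct_change(before: float, after: float) -> float:
    """Signed percent change from the baseline value."""
    if before == 0:
        raise ValueError("baseline must be non-zero")
    return (after - before) / before * 100.0


def compare(baseline: Snapshot, current: Snapshot) -> dict[str, float]:
    """Put the value-side changes and the cost-side change in one report."""
    return {
        "cycle_time_pct": pct_change(baseline.cycle_time_days, current.cycle_time_days),
        "defect_rate_pct": pct_change(baseline.defect_rate, current.defect_rate),
        "spend_pct": pct_change(baseline.monthly_spend, current.monthly_spend),
    }


# Made-up numbers, purely for illustration.
baseline = Snapshot(cycle_time_days=10.0, defect_rate=4.0, monthly_spend=1000.0)
current = Snapshot(cycle_time_days=8.0, defect_rate=5.0, monthly_spend=1500.0)

for name, change in compare(baseline, current).items():
    print(f"{name}: {change:+.1f}%")
```

In this invented case, cycle time improved while defects and spend both rose. That is the kind of mixed picture a single headline number hides.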

Skip that, and the cost side bites in a way most companies don’t see coming. The pitch is that AI is faster and cheaper, so you can run leaner. The cheaper half is the shaky one: at some companies, per-seat spend already exceeds the cost of the roles it replaced or quietly stopped backfilling. A team that cut people to fund AI and now depends on it has swapped a fixed, predictable salary line for a variable bill that climbs with usage. If you were never tracking the cost side against the value side, you find out only once the efficiency you were promised has turned into fragility.

The Cost Lands on People and on the Company

Chasing an impossible number does real damage, and not only to the people chasing it.

Start with the people. The early evidence is that AI tends to make knowledge workers work more, not less. A peer-reviewed study of young professionals captured the mechanism in their own words: finishing faster just earns you more assignments, while quality still has to hold. The time AI saves gets reabsorbed as a higher baseline of expected output, and the early excitement turns into something harder to sustain.

Now the company. Decades of management research predict what an impossible target does to an organization, on its own, before anyone burns out. Once a number becomes the target, people optimize the number instead of the goal. The classic review Goals Gone Wild catalogs the rest: tunnel vision that starves everything unmeasured, distorted risk-taking, a measurable rise in unethical behavior, and the best people leaving first because they’re the ones who can tell the target is fake. What’s left is theater. Dashboards performed for management while real value stays flat.

Which brings me back to the line that bothered me. When a pitch pairs an inflated target with a loyalty test, “and if they can’t get on board, maybe they don’t belong here,” that’s the tell. It converts a question you could test (“does this tool actually deliver that on our work?”) into a question of character, so the number can never be challenged, because challenging it now reads as disloyalty. The Excel comparison does the rest of the work, borrowing the certainty of a tool that delivered to cover for one that hasn’t, at that scale, on the evidence we have.

What the Honest Version Looks Like

None of this means AI doesn’t help. My own gains are large, and I’m not interested in pretending otherwise. But they’re earned, not handed over at install. They come from the unglamorous work the keynote never mentions: structuring a codebase so an agent can navigate it, building the context and guardrails it needs, and tuning a setup per project, because no single configuration works everywhere. The tool multiplies the effort you put into making it useful.

This is where the smarter pitch lives. It goes: “Sure, the gains take the right setup, and the right setup is what we’re selling, so buy ours and the 5-10x appears.” It’s a slippery claim, because it turns every disappointment into proof you needed more of the product. You only got 30%? You didn’t have the right setup yet.

Hold it to the same standard as everything else here: measure it. A setup that actually returns 5-10x is the easiest thing in the world to prove. Baseline your outcomes, run the setup on your real work for a quarter, and put the performance gain next to the cost. Anyone who has that result will want the test, because the numbers close the sale for them. Anyone who keeps the number in demos and testimonials, and treats your request to measure it as a lack of faith, is telling you what the measurement would show.

And the serious studies already ran with good setups. The METR developers were experienced people on frontier tools and still came out slower on mature code, and the consulting and support trials used real deployments and still landed in the tens of percent on the tasks that suited AI. A better setup moves you toward the top of that range and widens which tasks fall inside the frontier. It doesn’t lift the ceiling to 10x, and no setup makes a senior engineer on a five-year-old codebase ten times faster. The frontier is real, with or without a consultant selling you the way around it.

So the honest pitch is smaller and far more useful than 5-10x. AI is a real, uneven productivity gain, largest for newer people on well-scoped work, that you have to measure at the level of value rather than tokens, and earn through setup you do before any speedup shows up. That’s a number you can actually hit, and defend.

The other number, the one you’re told to strive for or else, isn’t a target. It’s the product they’re selling.
