Here’s a number that should bother you: developers using AI coding tools believe they’re 20% faster. When researchers actually measured it in a controlled trial, the same developers were 19% slower.

That’s not a rounding error. That’s a 39-percentage-point gap between perception and reality.

The study came from METR, a nonprofit AI research organization, and was published in July 2025. They ran a randomized controlled trial with experienced open-source developers working on real codebases. Each task was randomly assigned to either allow or disallow AI tools. The tasks where AI was allowed took longer to complete, even though the developers were convinced they had been faster.

This doesn’t mean AI coding tools are useless. It means we’re measuring the wrong things.

What the data actually says

The METR study is the most rigorous measurement we have. But it’s not the only data point.

AI adoption is massive. 84% of developers use AI coding tools. 41% of all committed code is now AI-assisted. This isn’t a trend you can ignore.

Output volume is up. Teams with high AI adoption merge 98% more PRs. They complete 21% more tasks. By any activity metric, they look more productive.

Quality metrics tell a different story. AI-authored code has 1.7x more major issues than human-written code (CodeRabbit, 2025). 45% of AI-generated code introduces security vulnerabilities (Veracode). Code churn increased 39% in projects with heavy AI tool usage (GitClear).

Incidents are climbing. PRs per developer went up 20%. Incidents per PR went up 23.5%. More code shipped, more things broke.
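
Those two figures compound. Here is a quick sketch of the arithmetic, assuming both percentages describe the same teams over the same period (the underlying reports may not line up that neatly):

```python
def incidents_per_developer_change(pr_growth: float, incident_rate_growth: float) -> float:
    """Relative change in incidents per developer.

    incidents/dev = (PRs/dev) * (incidents/PR), so the two growth factors multiply.
    """
    return (1 + pr_growth) * (1 + incident_rate_growth) - 1


# 20% more PRs per developer, 23.5% more incidents per PR (figures from the text)
change = incidents_per_developer_change(0.20, 0.235)
print(f"Incidents per developer: {change:+.1%}")  # +48.2%
```

Under that assumption, each developer is now associated with roughly 48% more incidents, not 23.5%.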

Trust is dropping. Developer trust in AI code accuracy fell from 40% to 29% year-over-year (Stack Overflow). The people writing AI-assisted code are themselves becoming less confident in it.

The paradox explained

The numbers aren’t contradictory. They describe different things.

AI tools make writing code faster. Auto-complete, boilerplate generation, test scaffolding. The typing part genuinely speeds up. A developer who used to spend 30 minutes writing a function can get a working draft in 5.

But writing code is maybe 20% of software development. The rest is understanding the problem, designing the approach, debugging, reviewing, testing, integrating, and maintaining. AI doesn’t help much with those. In some cases it makes them harder, because now you’re debugging code you didn’t write and may not fully understand.

The METR result makes sense when you see it this way. AI saved time on the easy parts and added time on the hard parts. The net effect was negative because the hard parts dominate.
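
A back-of-the-envelope model makes the trade-off concrete. This is a sketch, not the METR methodology: it takes the rough 20% writing share and the 30-minutes-to-5 speedup from above, and treats the slowdown on everything else as a hypothetical input you would need to measure yourself.

```python
def net_time_ratio(writing_share: float, writing_speedup: float, other_slowdown: float) -> float:
    """Total time with AI divided by total time without AI.

    writing_share:   fraction of effort spent writing code (0..1)
    writing_speedup: how many times faster writing becomes (6.0 = six times faster)
    other_slowdown:  fractional increase in time spent on everything else (0.10 = 10%)
    """
    writing_time = writing_share / writing_speedup
    other_time = (1 - writing_share) * (1 + other_slowdown)
    return writing_time + other_time


def break_even_slowdown(writing_share: float, writing_speedup: float) -> float:
    """Slowdown on non-writing work that exactly cancels the writing speedup."""
    time_saved = writing_share - writing_share / writing_speedup
    return time_saved / (1 - writing_share)


WRITING_SHARE = 0.20      # "maybe 20%" of development, per the text
WRITING_SPEEDUP = 30 / 5  # a 30-minute function drafted in 5 minutes

# Hypothetical slowdowns on understanding, debugging, review, and so on
for slowdown in (0.0, 0.10, 0.30):
    ratio = net_time_ratio(WRITING_SHARE, WRITING_SPEEDUP, slowdown)
    print(f"other work {slowdown:.0%} slower -> total time x{ratio:.2f}")

print(f"break-even slowdown: {break_even_slowdown(WRITING_SHARE, WRITING_SPEEDUP):.1%}")
```

With these inputs, the break-even lands at roughly 21%. If debugging, review, and integration get more than about a fifth slower, the sixfold typing speedup is fully erased.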

What this means for engineering leaders

If you’re a CTO or VP Engineering evaluating your AI tool investment, the headline numbers from your vendor dashboard are misleading. “Copilot suggestions accepted” or “lines of code generated” tell you how much AI is being used, not whether it’s making your team more effective.

The questions you should be asking:

Is code quality holding up? Track health scores over time. If your team is shipping twice as fast but your architecture score is declining and security findings are increasing, you’re trading speed for debt.

Are incidents increasing? More PRs with more bugs means you’re paying for the speed gains with reliability. Check whether your change failure rate has moved since AI adoption.
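
Change failure rate is simple to compute once you have deployment and incident counts. A minimal sketch, with a hypothetical `Period` record and made-up numbers standing in for your own data:

```python
from dataclasses import dataclass


@dataclass
class Period:
    deployments: int
    failed_deployments: int  # deployments that led to an incident or rollback


def change_failure_rate(period: Period) -> float:
    """Share of deployments in the period that failed."""
    if period.deployments == 0:
        raise ValueError("no deployments in period")
    return period.failed_deployments / period.deployments


# Illustrative numbers only
before_ai = Period(deployments=200, failed_deployments=20)
after_ai = Period(deployments=380, failed_deployments=49)

delta = change_failure_rate(after_ai) - change_failure_rate(before_ai)
print(f"before: {change_failure_rate(before_ai):.1%}")
print(f"after:  {change_failure_rate(after_ai):.1%}")
print(f"change: {delta * 100:+.1f} percentage points")
```

Comparing rates rather than raw incident counts matters here, because raw counts will rise with deployment volume even if nothing got worse.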

Is review time growing? If developers spend more time reviewing AI-generated code than they save by generating it, the tool is a net negative. The 98% increase in merged PRs cited above comes with a 91% increase in PR review time.

Is the right code being written? AI is good at generating plausible code. It’s not good at asking “should this code exist?” Feature bloat, unnecessary abstractions, and over-engineering are all common in AI-generated codebases.

How to measure what matters

Stop measuring AI tool adoption and start measuring engineering outcomes. Four measurements are a good place to start.

Before/after health scores. Run a code analysis before AI adoption and quarterly after. Track architecture, code quality, security, dependencies, and test coverage. If health declines while velocity increases, you’re accumulating hidden debt.
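
The "health down, velocity up" pattern can be flagged mechanically. A sketch, assuming a single composite health score on a 0-100 scale and merged PRs per week as the velocity measure (both are placeholder choices, not a standard):

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    label: str
    health: float    # composite health score, 0-100 (hypothetical scale)
    velocity: float  # e.g. merged PRs per week


def hidden_debt_periods(baseline: Snapshot, later: list[Snapshot]) -> list[str]:
    """Labels of periods where velocity rose above baseline while health fell below it."""
    return [
        s.label
        for s in later
        if s.velocity > baseline.velocity and s.health < baseline.health
    ]


# Illustrative data only
baseline = Snapshot("pre-AI", health=78, velocity=40)
quarters = [
    Snapshot("Q1", health=76, velocity=55),
    Snapshot("Q2", health=79, velocity=60),
    Snapshot("Q3", health=71, velocity=72),
]

print(hidden_debt_periods(baseline, quarters))  # ['Q1', 'Q3']
```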

Complexity-adjusted velocity. Raw PR count is meaningless. A developer who ships 2 complex, well-tested features has more impact than one who ships 20 trivial auto-generated PRs. Measure what was shipped, not how many times someone typed “commit.”
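
One way to approximate this is to weight each PR by complexity and test coverage. The weights below are arbitrary placeholders to show the mechanics. A real team would need to calibrate them, and decide how complexity gets assigned in the first place.

```python
# Hypothetical weights; calibrate against your own team's judgment
COMPLEXITY_WEIGHT = {"trivial": 0.5, "moderate": 3.0, "complex": 8.0}
TESTED_MULTIPLIER = 1.5  # bonus for PRs that ship with tests


def adjusted_velocity(prs: list[tuple[str, bool]]) -> float:
    """Sum of per-PR weights. Each PR is (complexity label, has_tests)."""
    total = 0.0
    for complexity, has_tests in prs:
        weight = COMPLEXITY_WEIGHT[complexity]
        total += weight * (TESTED_MULTIPLIER if has_tests else 1.0)
    return total


dev_a = [("complex", True)] * 2    # 2 complex, well-tested features
dev_b = [("trivial", False)] * 20  # 20 trivial auto-generated PRs

print(adjusted_velocity(dev_a))  # 24.0
print(adjusted_velocity(dev_b))  # 10.0
```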

Time to resolve findings. When issues are identified, how quickly does the team fix them? This measures whether the team can maintain quality, not just produce volume.
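
A sketch of this metric, assuming each finding is recorded as an (opened, resolved) date pair and that still-open findings are excluded from the median. You may prefer to count open findings by their current age instead.

```python
from datetime import date
from statistics import median


def median_days_to_resolve(findings):
    """Median days from opened to resolved; None if nothing has been resolved yet.

    findings: iterable of (opened, resolved) pairs, where resolved is None while open.
    """
    durations = [
        (resolved - opened).days
        for opened, resolved in findings
        if resolved is not None
    ]
    return median(durations) if durations else None


# Illustrative data only
findings = [
    (date(2025, 1, 6), date(2025, 1, 9)),    # 3 days
    (date(2025, 1, 10), date(2025, 1, 24)),  # 14 days
    (date(2025, 2, 3), date(2025, 2, 10)),   # 7 days
    (date(2025, 2, 20), None),               # still open, excluded
]

print(median_days_to_resolve(findings))  # 7
```

The median is used rather than the mean so that one long-running finding does not dominate the number.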

New developer onboarding time. If AI-generated code is harder for humans to understand (because no human designed it), onboarding time will increase. This is an early signal that the codebase is becoming opaque.

The sustainable threshold

Early data suggests a sustainable threshold for AI-assisted code: roughly 25-40% of total output. Above that, quality metrics tend to degrade. Below that, you’re not getting enough benefit to justify the tooling cost.

This isn’t a hard rule. It depends on the type of work, the seniority of the team, and how well-tested the AI output is. But it’s a useful starting point for teams that are going all-in on AI without measuring the consequences.

Measure the real impact

AI coding tools are here to stay. The question isn’t whether to use them. It’s whether they’re actually making your engineering team more effective or just making them feel faster.

The only way to know is to measure outcomes, not activity. Health scores, not PR counts. Quality trends, not lines generated. Team effectiveness, not tool adoption.

Want to measure the real impact of AI on your engineering team? StackGrit tracks project health over time so you can see whether AI adoption is improving or degrading your codebase. First report is free.
