AI code review at scale: keeping the bar high when volume goes up

When teams adopt AI coding tools, the first thing that changes is not the code. It is the volume of code arriving at review.

A reviewer who used to see four pull requests a day now sees ten, and each one looks finished. The diff is clean, the tests are green, the description is well written. Everything signals "ready to merge." And that is exactly where quality quietly erodes.

We work with engineering teams through this transition, and the pattern is consistent. AI does not lower the quality of any single change very much. It raises the rate at which changes arrive, and it makes each change look more trustworthy than the scrutiny behind it justifies. Review becomes the one control standing between faster output and a slower, more fragile codebase.

So the question we care about is not "should AI write code." It is "how does review hold the line when there is more of it, and it all looks good?"

Why AI changes review specifically

Three things shift at once.

Volume goes up. More changes per reviewer per day, with the same number of reviewers.
Surface polish goes up. AI output reads as authoritative. Naming is consistent, comments are present, the structure looks deliberate.
Author context goes down. When a human writes code, they hold a mental model of why each line exists. When an AI model writes it and a human accepts it, that mental model is thinner. The author can explain what the code does, but not always why it is the right thing to do.

That last one is the dangerous part. Review has always relied on the author understanding their own change. AI weakens that assumption without announcing it.

The failure mode we see most

The most common failure is not bad code merging. It is review becoming a throughput exercise.

When ten polished PRs are waiting, a reviewer under time pressure does the rational thing: they check that it works, skim the rest, and approve. The review happened on paper. It did not happen in substance. Nobody chose to lower the bar; the incentives just made the shallow path the easy one.

The opposite failure is just as costly: review becomes a bottleneck. Reviewers try to scrutinise every line of every AI change as if the author had understood none of it, and the queue backs up until the speed AI was supposed to deliver disappears into waiting time.

Both failures come from the same root cause: the team kept its old review process and just pointed it at more code.

Make the author do more, not the reviewer

The highest-leverage change is upstream of review. The author of an AI-assisted change should arrive at review having already done the work the model cannot be trusted to do.

Concretely, we ask authors to:

read and understand every line they are submitting, and be able to explain why it is there, not just what it does
write the PR description themselves, stating the intent and the risk, rather than pasting a model-generated summary
flag the parts they are least sure about, so the reviewer's attention goes where it is needed
run and read the tests, including whether the tests actually exercise the new behaviour or just pass

This is not bureaucracy. It restores the assumption review depends on: that a human owns this change and understands it. A reviewer can move quickly through a change whose author clearly understood it. They cannot safely move quickly through one where nobody did.
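As an illustration only, the description part of that list could be checked mechanically. The sketch below assumes a hypothetical PR template with "Intent:", "Risk:" and "Least sure:" lines; those labels are not from any particular tool. It can only confirm that the sections are present and non-empty, not that the author wrote or understood them.

```python
# Hypothetical template labels; adjust to whatever your PR template uses.
REQUIRED_SECTIONS = ("intent", "risk", "least sure")


def missing_sections(description):
    """Return the required labels that are absent or have no text after the colon."""
    found = {}
    for line in description.splitlines():
        label, sep, text = line.partition(":")
        if sep:  # only lines of the form "Label: text" count as sections
            found[label.strip().lower()] = text.strip()
    return [name for name in REQUIRED_SECTIONS if not found.get(name)]


if __name__ == "__main__":
    description = "Intent: cap retries at 3\nRisk:\n"
    print(missing_sections(description))
```

A check like this is a prompt for the author, not a substitute for the reviewer asking why the change is right.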

Direct review attention with risk, not line count

At higher volume, treating every change the same is what breaks. The fix is to route review effort by risk.

Change type	What it touches	Review depth
Low risk	Tests, docs, isolated UI copy, internal tooling	Light, fast, trust the author
Standard	Feature code with clear blast radius	Normal review, focus on intent and edge cases
High risk	Auth, payments, data handling, migrations, public APIs	Deep review, a named second reviewer, no time pressure

The point is to spend your scarce deep-review capacity where a mistake is expensive, and stop spending it evenly. AI raises volume across all three rows, but the high-risk row is the one where shallow review actually hurts you.
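A minimal sketch of how the routing in the table could be automated from the list of changed files. The path patterns are assumptions chosen for illustration (they do not cover every high-risk category, such as data handling), and the function name is hypothetical. The tiers and review depths are the ones from the table.

```python
from fnmatch import fnmatchcase

# Illustrative glob patterns only; a real repository needs its own list.
HIGH_RISK = ("*/auth/*", "*/payments/*", "*/migrations/*", "api/public/*")
LOW_RISK = ("tests/*", "docs/*", "tools/*")

REVIEW_DEPTH = {
    "low": "Light, fast, trust the author",
    "standard": "Normal review, focus on intent and edge cases",
    "high": "Deep review, a named second reviewer, no time pressure",
}


def matches_any(path, patterns):
    return any(fnmatchcase(path, pattern) for pattern in patterns)


def risk_tier(changed_paths):
    """Classify a change as 'high', 'standard', or 'low' from its file paths."""
    # One high-risk file is enough to make the whole change high risk.
    if any(matches_any(path, HIGH_RISK) for path in changed_paths):
        return "high"
    # Low risk only when every changed file is in a low-risk area.
    if changed_paths and all(matches_any(path, LOW_RISK) for path in changed_paths):
        return "low"
    return "standard"


if __name__ == "__main__":
    tier = risk_tier(["src/auth/login.py", "docs/guide.md"])
    print(tier, "->", REVIEW_DEPTH[tier])
```

The asymmetry is deliberate: a change is escalated if any file is high risk, but relaxed only if all files are low risk, so a mixed change never gets the light path.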

What the reviewer should still check on AI changes

For anything above low risk, we want the reviewer's checklist to name the things AI makes easy to miss:

Does this solve the right problem, or a plausible nearby one the model drifted toward?
Are there edge cases the happy-path code and happy-path tests both skip?
Did the change pull in a dependency, pattern, or abstraction the codebase did not need?
Is there duplicated logic the model regenerated instead of reusing what already exists?
Do the tests assert real behaviour, or do they assert that the code does what it does?

That last category, tests that pass without proving anything, is something we see far more often in AI-assisted work. A model is very good at producing a test that turns green. It is less reliable at producing one that would fail if the behaviour were wrong.
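To make the distinction concrete, here is a small sketch with a hypothetical `apply_discount` function and a deliberately wrong variant of it. The first check compares the code with itself, so it passes for both. The second uses expected values worked out by hand, so it fails for the wrong variant.

```python
def apply_discount(price, percent):
    """Hypothetical function under test: reduce price by a percentage."""
    return round(price * (1 - percent / 100), 2)


def broken_discount(price, percent):
    """Deliberately wrong variant: returns the discount amount, not the new price."""
    return round(price * percent / 100, 2)


def tautological_test(fn):
    # Asserts that the code does what it does: true for any implementation.
    assert fn(200.0, 10) == fn(200.0, 10)


def behavioural_test(fn):
    # Expected values are computed independently of the code under test.
    assert fn(200.0, 10) == 180.0
    assert fn(200.0, 0) == 200.0
    assert fn(200.0, 100) == 0.0


def passes(test, fn):
    """Run a test against an implementation and report whether it passed."""
    try:
        test(fn)
        return True
    except AssertionError:
        return False


if __name__ == "__main__":
    for test in (tautological_test, behavioural_test):
        print(test.__name__, "catches the bug:", not passes(test, broken_discount))
```

A useful review question follows from this: if the implementation were subtly wrong, which assertion in the test would go red?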

Measure the review system, not just the code

You cannot manage this from gut feel once volume is up. A few signals tell you whether review is holding.

Signal	What it tells you
Review time per change, by risk tier	Whether deep review is actually happening where it should
Change failure rate (defects, rollbacks)	Whether shallow review is leaking problems to production
Time-in-review queue	Whether review has become the bottleneck instead
Share of PRs with author-written intent	Whether the upstream discipline is real or skipped

If failure rate and review queue both rise, the system is overloaded. If failure rate rises while review time falls, you are rubber-stamping. Failure rate read alongside review time and queue length tells you more than any single signal alone.
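The two readings above can be written down as a simple comparison between two measurement periods. This is a sketch under assumptions: the field names, units, and the use of a plain "higher or lower than last period" comparison are illustrative, and a real version would need thresholds to ignore noise.

```python
from dataclasses import dataclass


@dataclass
class ReviewSnapshot:
    failure_rate: float    # share of changes that caused a defect or rollback
    review_minutes: float  # typical review time per change
    queue_hours: float     # typical time a PR waits in the review queue


def diagnose(previous, current):
    """Return the warning patterns visible between two snapshots."""
    failure_up = current.failure_rate > previous.failure_rate
    review_down = current.review_minutes < previous.review_minutes
    queue_up = current.queue_hours > previous.queue_hours

    findings = []
    if failure_up and review_down:
        findings.append("rubber-stamping")
    if failure_up and queue_up:
        findings.append("overloaded")
    return findings


if __name__ == "__main__":
    last_quarter = ReviewSnapshot(failure_rate=0.05, review_minutes=30, queue_hours=4)
    this_quarter = ReviewSnapshot(failure_rate=0.08, review_minutes=12, queue_hours=4)
    print(diagnose(last_quarter, this_quarter))
```

Both patterns can appear at once, which is why the function returns a list. To follow the table, the snapshots would be computed per risk tier rather than across all changes.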

This connects to a broader point we make about adoption: speed is only a real gain if quality holds. A team that ships twice as fast and reverts twice as often has not gained anything except stress.

Our view

AI-assisted coding does not remove the need for review. It moves review to the centre of how quality is decided, and it does so while making review harder by hiding weak changes behind a polished surface.

The teams that handle this well change three things, not one. They push understanding back onto the author, so review starts from a change someone actually owns. They route review depth by risk, so scarce attention lands where mistakes are expensive. And they measure the review system itself, so they can tell the difference between "fast and fine" and "fast and fragile" before production tells them.

None of that requires a new tool. It requires deciding, on purpose, that the bar does not move just because the volume did.

Sources

Google Engineering Practices, Code Review Developer Guide, accessed 2026-06-10
DORA, Accelerate State of DevOps, on change failure rate and delivery performance, accessed 2026-06-10
OWASP, OWASP Top 10 for Large Language Model Applications, accessed 2026-06-10

Frequently asked questions

Why does AI-assisted development make code review harder, not easier?

AI raises the rate at which changes arrive and makes each one look more trustworthy than the scrutiny behind it justifies. Diffs are clean, tests are green, descriptions are well written: all the surface signals say "ready to merge." The real risk is that author context goes down: the person submitting the change can explain what the code does but not always why it is the right thing to do, which is exactly what review has always depended on.

How should review effort be allocated when pull request volume increases?

Route review depth by risk, not by line count. Low-risk changes (tests, docs, isolated UI copy, internal tooling) can move quickly with light review. High-risk changes touching auth, payments, data handling, migrations, or public APIs require deep review, a named second reviewer, and no time pressure. The goal is to concentrate scarce deep-review capacity where a mistake is actually expensive.

Does passing tests mean AI-generated code is safe to merge?

No. Green tests prove the tests passed, not that they would have failed if the behaviour were wrong. AI is very good at producing a test that turns green without asserting anything real, so reviewers should explicitly check whether tests assert real behaviour or merely confirm that the code does what it does. Treating green tests as sufficient is one of the most common failure modes in AI-assisted work.

What metrics tell you whether your review process is holding up under higher volume?

Four signals matter: review time per change broken down by risk tier, change failure rate (defects and rollbacks), time-in-review queue, and the share of PRs with an author-written intent statement. If failure rate rises while review time falls, you are rubber-stamping. If both failure rate and queue length rise together, the system is overloaded. The two signals together tell you more than either alone.