AI code review, without the hype.
What AI code review actually does, how multi-agent review works, how to evaluate tools on your own PRs, how to roll it out safely, and the honest limits you should know before trusting it on production code.
01. What AI code review actually is.
AI code review is the use of large language models — typically several routed together as agents — to read a pull request and post reviewer-style feedback automatically. The good version does what a senior reviewer would do on a slow day: summarise the change, flag defects and risky patterns, call out missing tests, and raise architecture concerns. The bad version is a linter with a chatbot voice.
The distinction is about grounding and orchestration. A model that only reads the diff will hallucinate. A model that reads the diff plus retrieved repo context — via symbol graphs, embeddings, and convention files — produces findings grounded in how your code actually works. A single model will miss whole classes of issues hidden in its training blind spots; multiple models routed as specialist sub-agents catch more.
AI code review is not a replacement for human review. It is a force multiplier: it clears the mechanical pass so humans spend their time on judgement calls — design trade-offs, product fit, accountability. Teams that frame it as a replacement tend to regret it within a quarter.
02. How multi-agent review works.
A multi-agent review pipeline typically has six stages:
- Webhook intake. PR is opened or synchronised; the tool receives the GitHub webhook and queues a job.
- Scout pass. A small, fast model scans the diff and decides what matters: which files to expand, which symbols to retrieve, which specialists to run.
- Hybrid retrieval. The scout triggers retrieval across the repo — semantic embeddings for similar code, symbol graph for callers and callees, convention files for repo-specific rules.
- Lead review. A frontier model reads the diff plus retrieved context and produces structured findings tagged with confidence and severity.
- Specialist sub-agents. Security, tests, architecture, and performance specialists each add a focused pass with different system prompts and sometimes different model families.
- Synthesis + post. A synthesiser deduplicates overlapping findings, ranks by severity and confidence, and posts a single GitHub review with inline comments and a summary.
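The six stages above can be sketched as a minimal pipeline. This is an illustration only: every name here (`scout_scan`, `Finding`, the stub stages) is hypothetical, standing in for real LLM and retrieval calls, not any tool's actual API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Finding:
    message: str
    severity: int      # higher = worse
    confidence: float  # 0.0 .. 1.0
    source: str        # which agent produced it

# --- stubbed stages (a real tool calls models and retrieval here) ---

def scout_scan(changed_files):
    # Stage 2: a fast model decides what matters; here, skip test files.
    return [f for f in changed_files if not f.startswith("test_")]

def retrieve_context(files):
    # Stage 3: stand-in for embeddings + symbol-graph + convention lookup.
    return {f: f"callers and conventions for {f}" for f in files}

def lead_review(files, context):
    # Stage 4: frontier model reads diff + context, emits tagged findings.
    return [Finding(f"possible defect in {f}", 2, 0.8, "lead") for f in files]

def security_pass(files, context):
    # Stage 5: one specialist shown; tests/architecture/perf work the same way.
    return [Finding(f"auth check missing in {f}", 3, 0.6, "security")
            for f in files if "auth" in f]

def synthesize(findings):
    # Stage 6: deduplicate by message, rank by severity then confidence.
    unique = {f.message: f for f in findings}
    return sorted(unique.values(), key=lambda f: (-f.severity, -f.confidence))

def review_pr(changed_files):
    files = scout_scan(changed_files)
    context = retrieve_context(files)
    findings = lead_review(files, context)
    findings += security_pass(files, context)
    return synthesize(findings)

review = review_pr(["auth_routes.py", "test_auth.py"])
```

Note the shape, not the stubs: the scout narrows scope before any expensive model runs, and synthesis is what turns four agents' output into one readable GitHub review instead of four overlapping ones.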
Good tools expose this pipeline. You can see which model drove each finding, which retrieval hits it used, and how long each stage took. If your tool hides the stack, you cannot diagnose false positives — you can only accept them.
03. How to evaluate tools on your own PRs.
Do not trust vendor-curated demos. The PRs they show you are chosen for impact. Evaluate on your PRs, ideally ones that caused historical pain.
- Actionable-findings rate. Fraction of findings a reviewer would have wanted anyway. Target: 60% or better. Below 40%, the tool adds noise.
- False-positive rate. Fraction of findings that are wrong or irrelevant. Target: under 15%. Above 25%, reviewers stop reading.
- Review latency. Time from PR open to first AI comment. Target: under 3 minutes for a typical diff. Above 10 minutes, review habits drift.
- Monthly cost at your size. Extrapolate per-seat and credit plans to your real team. Include overage and growth.
- Fix-agent quality. If the tool offers fixes, check that the patches compile, pass tests, and do not widen the blast radius.
- Model transparency. Does the tool tell you which model drove each finding? If not, you cannot debug false positives.
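The first two thresholds are trivial to compute once findings are hand-tagged. A sketch, assuming each finding has been tagged `tp` (true positive), `fp` (false positive), or `nit` (useful nitpick):

```python
def evaluation_metrics(tags):
    """tags: one label per AI finding across the shadowed PR set,
    each 'tp', 'fp', or 'nit'."""
    total = len(tags)
    actionable = (tags.count("tp") + tags.count("nit")) / total  # target: 0.60+
    false_pos = tags.count("fp") / total                         # target: < 0.15
    return actionable, false_pos

# Example: 20 findings from a shadowed PR set.
tags = ["tp"] * 11 + ["nit"] * 4 + ["fp"] * 5
actionable, false_pos = evaluation_metrics(tags)
# actionable = 0.75 (clears the 60% bar); false_pos = 0.25 (too noisy)
```

The example deliberately shows a tool that passes one bar and fails the other: that combination means tune the severity floor and exclusions before rollout, not abandon the tool.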
04. Pricing models compared.
| Model | How it bills | Where it wins | Where it hurts |
|---|---|---|---|
| Per developer seat | Flat fee × headcount | Small teams, predictable cost per head | Expensive at 10+ devs; you pay for quiet months |
| Per organisation seat | Flat org fee + per-seat | Enterprise procurement simplicity | Opaque scaling; negotiation required |
| Per token (usage-based) | Billed by LLM tokens consumed | Cheap for quiet teams, experimentation | Unpredictable bills; hostile to forecasting |
| Credit pool (team-shared) | Flat monthly credit allowance shared across the team | Cheap for bursty teams; one bill; transparent per-PR cost | Overage math if you run hot; requires credit transparency |
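A back-of-envelope comparison of the two most common models in the table. The $20/dev seat fee and $35/mo pool are illustrative assumptions within the ranges quoted elsewhere in this article, not any vendor's actual price list:

```python
def per_seat_cost(devs, fee_per_dev=20.0):
    # Flat fee × headcount; $20/dev/mo is a mid-range assumption.
    return devs * fee_per_dev

def credit_pool_cost(base=35.0, overage_credits=0, overage_rate=0.10):
    # Flat shared allowance, plus overage if the team runs hot.
    return base + overage_credits * overage_rate

for devs in (3, 10, 20):
    print(f"{devs:>2} devs: per-seat ${per_seat_cost(devs):.0f}/mo "
          f"vs credit pool ${credit_pool_cost():.0f}/mo")
```

Under these assumptions, per-seat overtakes a flat pool somewhere around two developers; run the same arithmetic with your own headcount, real fees, and honest overage estimates before deciding.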
Critique uses credit pools (Standard $12/mo, Pro $35/mo, Ultra $129/mo) because they keep pricing honest with work done while giving teams a single predictable bill. Students and OSS maintainers run on a dedicated $5/mo plan with unlimited repository indexing.
05. Rollout playbook — 6 steps.
- Step 01. Shadow it on one repo for two weeks. Install the tool on a single, representative repo — ideally one with historical PR pain. Let it comment without merging; make no policy changes yet. Collect 10–20 PR reviews to study.
- Step 02. Score findings against ground truth. For each shadowed PR, tag every AI finding as true-positive, false-positive, or nit. Target at least 60% actionable (true positives plus useful nits). Below 40% means tuning the model or retrieval before rolling out.
- Step 03. Tune policy: severity floor and exclusions. Raise the severity floor if reviewers are drowning. Exclude generated files, vendored dependencies, and fixtures. Add repo-specific instructions for team conventions. Re-run the shadow set.
- Step 04. Promote to required on a friendly repo. Pick one team that volunteers. Make the AI review a required check. Pair it with weekly retrospectives for the first month. Track review latency, rework rate, and reviewer-hours.
- Step 05. Roll out by repo risk tier. Tier repos by blast radius (experiments → internal tools → core product). Promote tier by tier, leaving at least one week between tiers to catch regressions.
- Step 06. Measure monthly for 90 days. Track time-to-first-review, reviewer-hours per PR, escaped-bug rate, and reviewer NPS. If any metric regresses for two consecutive months, roll back a tier and re-tune.
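Step 06's rollback trigger ("regresses for two consecutive months") is easy to make mechanical rather than a judgement call. A sketch, where `metric_history` is a hypothetical list of monthly values for one metric, oldest first:

```python
def should_roll_back(metric_history, worse_is_higher=True):
    """Return True if the metric got worse two months in a row.
    worse_is_higher=True suits costs like reviewer-hours per PR;
    set it False for metrics where a drop is bad, like reviewer NPS."""
    streak = 0
    for prev, cur in zip(metric_history, metric_history[1:]):
        regressed = cur > prev if worse_is_higher else cur < prev
        streak = streak + 1 if regressed else 0
        if streak >= 2:
            return True
    return False

# Reviewer-hours per PR over four months: two consecutive rises → roll back.
assert should_roll_back([1.2, 1.1, 1.3, 1.5]) is True
# A one-month blip that recovers does not trigger a rollback.
assert should_roll_back([1.2, 1.3, 1.1, 1.2]) is False
```

Encoding the trigger this way keeps the 90-day review honest: the tier rollback happens because the numbers said so, not because the loudest reviewer had a bad week.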
06. Honest limits.
AI reviewers still hallucinate APIs. Grounded retrieval cuts this a lot, but not to zero. Always verify fix suggestions compile and pass tests.
Rare-stack coverage is weaker. Elixir, Cairo, Solidity, Terraform/HCL, and legacy COBOL projects get less training weight. Hybrid retrieval narrows the gap; it does not close it.
Architecture judgement is weaker than a senior human. AI review catches local defects well; system-design trade-offs and org-specific product intuition are still human territory.
Signal-to-noise decays if you skip tuning. Teams that never raise severity floors get drowned in nits and stop reading the bot. Treat tuning as a quarterly habit.
Vendor trust is not optional. Read the DPA. Confirm no training on your code. Prefer tools with enterprise tenancy if you are on regulated data.
07. Where this is heading.
By the end of 2026, credible AI code review becomes table stakes for anything past Series A. The interesting question stops being whether to adopt and becomes how much review authority you give the agent: comment-only? Suggest-and-approve? Auto-merge for low-risk paths? Different teams will pick different points on that spectrum; tools that let you configure authority per repo and per file-glob win.
Multi-model routing, specialist sub-agents, and fix agents with closed-loop validation are converging into a single product shape. The differentiators going forward are cost transparency, audit surface, and the taste of whoever is tuning the prompts and policies.
Frequently asked.
01. What is AI code review?
AI code review is the practice of using large language models — often multiple models orchestrated as agents — to automatically read pull requests and post reviewer-style feedback: defect findings, security concerns, architecture notes, and test gaps. Modern tools like Critique run a scout, a lead reviewer, and specialist sub-agents in parallel and post a single synthesised review on the PR. AI code review does not replace human reviewers; it removes the mechanical pass so humans focus on judgement calls.
02. How does AI code review work under the hood?
A typical pipeline: (1) webhook fires when a PR is opened, (2) a scout fetches the diff plus relevant repo context via hybrid retrieval (embeddings + symbol lookup), (3) one or more lead models read the PR and produce findings, (4) specialist sub-agents add security, tests, and architecture passes, (5) a synthesiser deduplicates and ranks findings, (6) the result posts as a single GitHub review. Good tools show you exactly which model drove each finding.
03. Is AI code review safe for private repos?
Reputable tools scope access through the GitHub App permission model, process code in transit only, and publish a data processing addendum. Critique never trains on customer code, offers enterprise tenancy with SSO and audit logs, and honours request-level retention settings. Always read the DPA before enabling any tool on proprietary or regulated code.
04. Does AI code review catch real bugs or just nitpicks?
Quality depends on (a) retrieval grounding on the actual repo rather than pretraining alone, (b) how many independent models read the diff, and (c) whether specialist passes exist. Single-model single-pass tools are noisier. Multi-model agentic tools like Critique routinely catch null-safety, auth boundary, race-condition, and test-gap issues in real PRs — not just style. Measure it on your own PRs before rolling out.
05. How should I price AI code review for my team?
Per-seat pricing is simple but scales badly above ~5 developers ($15–30/dev/mo × headcount, often $300–600/mo at 20 devs). Credit-pool pricing bills by work actually done and shares a budget across the team — Critique starts at $12/mo shared, Pro $35/mo shared, Ultra $129/mo with frontier model access. For bursty PR volume, credits usually win; for steady high-volume teams, run the math both ways.
06. Does multi-model review catch more than single-model?
Yes, measurably. Different model families have different blind spots. Claude tends to catch reasoning errors; GPT catches API misuse; Gemini handles long-context diffs; Kimi and GLM are strong on throughput and cost. Routing a scout + lead + specialist sub-agents across families — as Critique does — produces broader coverage than any one model repeated three times.
07. What should I evaluate tools on before buying?
Run identical PRs (10–20 recent ones) through two or three tools in shadow mode. Score each on actionable-findings rate, false-positive rate, review latency, monthly cost at your team size, and fix-agent quality. Do not rely on vendor demos — they are cherry-picked. Use PRs that have historically caused production incidents as your test set.
See what multi-agent review looks like on your own repo.
Create a Critique account, try Critique Chat on a public repo you like, install the GitHub App on one private repo, and read the rollout playbook above. No sales calls required.