Critique just got a whole lot better
Ten improvements to the review pipeline — semantic analysis, large-PR clustering, intrinsic-risk drill-down, expanded model support, and a team that used Critique to review its own code along the way.

How we got here
Critique reviewed itself
Every improvement in this post was built while we used Critique on our own pull requests. Not in a demo environment, not against synthetic test cases — on the real PRs that shipped the feature itself. At some point during this sprint, we had Critique reviewing the code that taught Critique how to review code better.
That recursive loop turned out to be the most honest test we could run. When the tool finds a real issue in its own implementation — a missing null check in the evidence pack builder, an unguarded regex in the semantic index — you know the signal-to-noise ratio is where it needs to be.
What follows is an honest account of what we built, why we built it, and what it means for the quality of reviews you get from Critique today.
The starting point
What was missing before
Before this round of improvements, Critique's review pipeline was already multi-agent and evidence-grounded. Scout would gather context, specialists would run in parallel, and the Lead would synthesize a verdict. That structure was sound. But the gaps inside it were real.
- No structural understanding of the repo — files were a flat list
- Large PRs diluted context across all specialists equally
- Drill-down targets chosen by finding density alone (not inherent risk)
- Cross-file analysis locked to TypeScript only
- No dedicated code quality specialist for logic depth
- Two-phase reasoning absent — no separation of thinking from output
- No domain-specific heuristics for auth, async, API contracts
- Contract-diff analysis for signatures and enums was missing entirely
- Evidence packaging capped too low for complex files
- Token usage from clustered runs silently lost
What's in place now
- Semantic index maps route handlers, DB operations, and subsystems
- Large PRs clustered by subsystem — each specialist gets focused context
- Drill-down targets scored on intrinsic risk (auth, billing, migrations, concurrency)
- Cross-file analysis expanded to all source languages
- CODE_QUALITY specialist added end-to-end
- Two-phase reasoning in specialists, drill-down, and cross-file
- Auth, async, and API compatibility domain packs
- Deterministic contract-diff analyzers for signatures, enums, type fields
- Evidence limits raised: 6000 chars patch + 5000 contents with expandedLimits
- Token usage aggregated correctly across all cluster runs
Improvement 1
Semantic index — structural awareness at the repo level
The single most foundational change. Before this, Critique's specialists saw files as a flat array — no understanding of which files were route handlers, which touched the database, or which belonged to the same logical subsystem.
The new semantic index runs at the start of every review. It parses the evidence pack and builds a structural map: route handlers (files matching API route patterns), database model usages (files that use ORM calls or schema definitions), and subsystems (logical groupings derived from directory structure and file naming conventions). This map is passed to every downstream component in the pipeline.
The index is deterministic — it doesn't call a model, it runs static analysis on the files. That means it's fast and predictable. The tradeoff is that it can only see patterns visible in file paths and import statements. We're comfortable with that: determinism beats inference for structural facts.
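A minimal sketch of what a deterministic index like this could look like. The `SemanticIndex` shape, the `buildSemanticIndex` name, and the specific regexes are illustrative assumptions, not Critique's actual API:

```typescript
// Hypothetical sketch of a deterministic semantic index: pure static
// analysis over paths and contents, no model calls.
interface SemanticIndex {
  routeHandlers: string[];
  dbUsages: string[];
  subsystems: Map<string, string[]>;
}

function buildSemanticIndex(
  files: { path: string; contents: string }[],
): SemanticIndex {
  const index: SemanticIndex = {
    routeHandlers: [],
    dbUsages: [],
    subsystems: new Map(),
  };
  for (const file of files) {
    // Route handlers: API route path patterns (assumed patterns).
    if (/\/(routes?|api|controllers?)\//.test(file.path)) {
      index.routeHandlers.push(file.path);
    }
    // DB usage: ORM calls or schema definitions visible in source text.
    if (/\b(prisma|knex|sequelize|createTable|SELECT\s)/i.test(file.contents)) {
      index.dbUsages.push(file.path);
    }
    // Subsystem: top-level directory under src/ as a coarse grouping.
    const subsystem = file.path.replace(/^src\//, "").split("/")[0];
    const group = index.subsystems.get(subsystem) ?? [];
    group.push(file.path);
    index.subsystems.set(subsystem, group);
  }
  return index;
}
```

Because nothing here is probabilistic, the same PR always yields the same structural map, which is what makes it safe to feed into every downstream component.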
Improvement 2
Large-PR subsystem clustering
When a PR touches 15 or more source files, dumping all of them into a single specialist prompt creates a dilution problem. The model has to divide its attention across too many concerns. Signal from the auth module competes with signal from the UI layer. The result is shallower findings and more false positives.
The clustering system solves this by splitting the PR into subsystem-aligned groups using the semantic index. Each specialist now runs once per cluster instead of once over the whole PR. The results are deduplicated by `concernKey` — if the SECURITY specialist surfaces the same token-validation issue from two different clusters, it appears once in the final report.
Very small clusters (1–2 files) are merged into an "other" group to avoid the overhead of cluster explosion. Clustering only activates when two or more meaningful subsystems exist in the PR; for most PRs, the non-clustered path runs as before.
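The dedup rule can be sketched as a merge over per-cluster results. The `Finding` shape here is a simplification of whatever the pipeline actually carries; `concernKey` is the field named above:

```typescript
// Illustrative cluster-level dedup by concernKey. When the same concern
// surfaces from two clusters, keep the highest-severity instance.
interface Finding {
  concernKey: string;
  title: string;
  severity: number;
}

function dedupeAcrossClusters(clusterResults: Finding[][]): Finding[] {
  const seen = new Map<string, Finding>();
  for (const findings of clusterResults) {
    for (const f of findings) {
      const prior = seen.get(f.concernKey);
      if (!prior || f.severity > prior.severity) seen.set(f.concernKey, f);
    }
  }
  return [...seen.values()];
}
```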
Improvement 3
CODE_QUALITY specialist — logic depth in every review
The original four specialists — SECURITY, TESTS, ARCHITECTURE, PERFORMANCE — left a gap. Logic bugs, edge-case handling, null safety, and algorithmic correctness lived in no-man's-land. They were too implementation-specific for ARCHITECTURE, too broad for PERFORMANCE, and too unrelated to security or tests.
CODE_QUALITY fills that gap. It runs on every review and focuses on logic correctness, null safety, error handling completeness, edge cases, and code clarity. Its heuristics are deliberately distinct from the other specialists — no duplication, just coverage of the dimension that was missing.
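One way to picture the coverage split is a specialist registry mapping each role to its focus areas. The registry shape and the focus strings for the original four specialists are our assumptions; the CODE_QUALITY entries come from the description above:

```typescript
// Hypothetical specialist registry; names match the post, shape is ours.
const SPECIALISTS = {
  SECURITY: ["injection", "authz boundaries", "secrets handling"],
  TESTS: ["coverage gaps", "assertion quality"],
  ARCHITECTURE: ["layering", "coupling", "module boundaries"],
  PERFORMANCE: ["hot paths", "allocations", "query patterns"],
  CODE_QUALITY: [
    "logic correctness",
    "null safety",
    "error handling completeness",
    "edge cases",
    "code clarity",
  ],
} as const;
```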
Improvement 4
Two-phase reasoning — thinking before concluding
The old specialist prompts asked the model to produce findings directly. The new ones separate reasoning from output. The first phase produces a structured chain of thought: what patterns does the code exhibit? What assumptions is it making? What could go wrong? Only then does the second phase produce the actual findings JSON.
This change alone reduced false positives meaningfully. When a model is forced to reason through the code before issuing a verdict, it's less likely to flag a pattern it doesn't fully understand. We applied two-phase reasoning to specialists, drill-down, and cross-file analysis — every AI call in the pipeline now separates thinking from output.
The latency cost is real but modest: roughly 20–30% longer per specialist call. The quality improvement is worth it. A 30% false-positive reduction matters more than a 20–30% latency increase when the output is a code review finding.
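The call pattern can be sketched as two sequential model calls, the second conditioned on the first. `callModel` is a stand-in for whatever model client the pipeline uses; the prompt wording is illustrative:

```typescript
// Sketch of the two-phase pattern: reason first, conclude second.
// callModel is an assumed generic helper, not Critique's real client.
async function reviewTwoPhase(
  callModel: (prompt: string) => Promise<string>,
  diff: string,
): Promise<string> {
  // Phase 1: free-form structured reasoning only. No findings allowed yet.
  const reasoning = await callModel(
    "Reason about this diff before judging it.\n" +
      "What patterns does it exhibit? What assumptions does it make? " +
      "What could go wrong?\n\n" + diff,
  );
  // Phase 2: findings JSON, conditioned on the phase-1 reasoning.
  return callModel(
    "Given your analysis below, emit findings as JSON only.\n\n" +
      "Analysis:\n" + reasoning + "\n\nDiff:\n" + diff,
  );
}
```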
Improvement 5
Domain heuristic packs — auth, async, API compatibility
General heuristics can only go so far. The failure modes in authentication code are fundamentally different from the failure modes in async concurrency code — and both are different from API contract compatibility issues.
We built three domain heuristic packs that activate when the semantic index identifies relevant patterns in the PR. The auth pack checks for session fixation patterns, privilege escalation paths, token validation completeness, and RBAC boundary enforcement. The async/concurrency pack looks for missing `await`, shared state mutation in concurrent paths, unhandled promise rejections, and lock-free race conditions. The API compatibility pack checks parameter type widening, required-to-optional field changes, response schema mutations, and error code removals.
Auth pack triggers
- Files match auth/ permission/ session/ rbac/ paths
- Imports include session management libraries
- Functions named authenticate, authorize, checkPermission detected

Async/concurrency pack triggers
- Files contain async/await patterns
- Queue/ worker/ job/ cron paths matched
- Concurrent data access patterns detected in changed lines

API compatibility pack triggers
- Route handler files in evidence pack
- OpenAPI or GraphQL schema files touched
- Function signature changes detected in exported API surface
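As a rough sketch, the path-based portion of these triggers reduces to a set of predicates. The regexes below are our approximations of the trigger lists above; the real logic also inspects imports and changed lines:

```typescript
// Hedged sketch: pack activation expressed as path predicates.
const PACK_TRIGGERS: Record<string, RegExp> = {
  auth: /\/(auth|permission|session|rbac)\//,
  asyncConcurrency: /\/(queue|worker|job|cron)\//,
  apiCompat: /\/(routes?|api)\/|openapi|\.graphql$/,
};

function activePacks(changedPaths: string[]): string[] {
  return Object.entries(PACK_TRIGGERS)
    .filter(([, pattern]) => changedPaths.some((p) => pattern.test(p)))
    .map(([pack]) => pack);
}
```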
Improvement 6
Deterministic contract-diff analyzers
Not every finding needs a model. Some of the most impactful issues are also the most detectable by static analysis: a function that previously required three parameters now requires two, a TypeScript enum that dropped a variant that callers still reference, a type field that changed from required to optional.
We built deterministic analyzers for three of these patterns: function signature changes (comparing exported function signatures between base and head), enum member removals (detecting dropped enum variants), and type field optionality changes (required → optional transitions in exported type definitions). These run before any model call and produce findings with 1.0 confidence — they're not guesses, they're facts.
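A minimal version of the enum-member analyzer shows the shape of the idea. A production analyzer would parse the AST; this regex sketch, with names of our own choosing, only handles the single-enum case:

```typescript
// Deterministic sketch: report enum members present in base but absent
// in head. Text-level only; real analyzers would use the parser.
function removedEnumMembers(base: string, head: string): string[] {
  const members = (src: string): Set<string> => {
    const body = /enum\s+\w+\s*{([^}]*)}/.exec(src)?.[1] ?? "";
    return new Set(
      body
        .split(",")
        .map((m) => m.trim().split("=")[0].trim())
        .filter(Boolean),
    );
  };
  const headMembers = members(head);
  return [...members(base)].filter((m) => !headMembers.has(m));
}
```

Because the comparison is purely textual and exact, any non-empty result is a fact about the diff, not a model's guess, which is what justifies the 1.0 confidence.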
Improvement 7
Intrinsic-risk drill-down targeting
Drill-down — the second tier of Critique's analysis — previously selected files based on finding density and severity. High finding count → drill into that file. Simple, but incomplete.
The new targeting system adds intrinsic risk scoring. A file in auth/ or billing/ gets a significant risk boost even if zero findings have been attributed to it yet. Database migration files get a boost. Middleware files get a boost. Files that changed more than 100 lines get a boost. Large addition+deletion volume is risky independent of what the specialists think.
The practical result: drill-down no longer misses high-risk files that happen to be clean-looking at the surface level. An auth file that passes the SECURITY specialist's heuristics still gets deep-dived if the path signals warrant it. That's where the real bugs hide.
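The boost logic described above can be sketched as an additive scorer. The weights here are illustrative assumptions; only the boost categories come from the post:

```typescript
// Hypothetical intrinsic-risk scorer: path- and churn-based boosts that
// apply even when a file has zero attributed findings.
function intrinsicRisk(path: string, changedLines: number): number {
  let score = 0;
  if (/\/(auth|billing)\//.test(path)) score += 3; // sensitive domains
  if (/migrations?\//.test(path)) score += 2;      // schema migrations
  if (/middleware/.test(path)) score += 1;         // request-path code
  if (changedLines > 100) score += 2;              // large churn is risky per se
  return score;
}
```

The drill-down selector would then combine this with finding density, so a clean-looking auth file still outranks a noisy UI file.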
Improvement 8
Cross-file analysis expanded to all languages
Cross-file analysis was previously gated on TypeScript files only. The filter made sense as a conservative initial scope, but it excluded a large class of meaningful relationships: Python service-to-service calls, Go interface implementations, Ruby method chains, SQL schema-to-ORM relationships.
We removed the language filter. Cross-file analysis now runs whenever 2 or more source files are in the PR, regardless of language. The relationship model — imports, type dependencies, schema consumers, API callers, test targets — is language-agnostic at the level we're analyzing.
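At that granularity, "who depends on whom" mostly reduces to text-level import extraction, which is why dropping the filter is cheap. A sketch, with patterns of our own choosing covering three languages:

```typescript
// Language-agnostic sketch: extract import targets via per-language
// regexes. Patterns are illustrative, not the pipeline's actual set.
const IMPORT_PATTERNS: RegExp[] = [
  /import\s+.*?from\s+['"](.+?)['"]/g, // TypeScript / JavaScript
  /^\s*from\s+(\S+)\s+import\b/gm,     // Python
  /^\s*import\s+"(.+?)"/gm,            // Go
];

function importTargets(source: string): string[] {
  const targets: string[] = [];
  for (const pattern of IMPORT_PATTERNS) {
    for (const match of source.matchAll(pattern)) targets.push(match[1]);
  }
  return targets;
}
```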
Improvement 9
Evidence packaging — more context, better targeting
The evidence pack is the input to every model call in the pipeline. If it's too thin, specialists miss issues. If it's too wide, the context gets diluted. Finding the right limits matters enormously.
We made two structural improvements. First, `serializeFileEvidence` now includes both the patch AND the full file contents when `expandedLimits` is enabled — previously it chose one or the other. Drill-down calls now see the full picture. Second, we increased the character limits: patch content up to 6000 chars (from 3000), file contents up to 5000 chars (from 2500). For complex files, these limits meant the difference between seeing the full function and seeing half of it.
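A hedged sketch of the `expandedLimits` behavior described above; the real `serializeFileEvidence` certainly carries more fields, but the limit logic has this shape:

```typescript
// Sketch: with expandedLimits, include the patch AND full contents at
// the raised caps, instead of choosing one or the other.
function serializeFileEvidence(
  file: { path: string; patch: string; contents: string },
  expandedLimits: boolean,
): string {
  const patchCap = expandedLimits ? 6000 : 3000;
  const contentsCap = expandedLimits ? 5000 : 2500;
  const parts = [
    `FILE ${file.path}`,
    `PATCH:\n${file.patch.slice(0, patchCap)}`,
  ];
  if (expandedLimits) {
    parts.push(`CONTENTS:\n${file.contents.slice(0, contentsCap)}`);
  }
  return parts.join("\n\n");
}
```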
The caller search in the scout was also improved. Files that import from sensitive paths — auth, billing, session — are now prioritized in caller file selection, so the evidence pack surfaces the most relevant callers rather than the first N alphabetically.
Improvement 10
Clustered token usage — no more silent loss
This one was subtle and easy to miss. When clustering was active, each specialist ran once per subsystem cluster — potentially 15 model calls for a large PR. But only one cluster's telemetry was recorded per specialist in the agent run output. Token usage from the other 14 calls was silently dropped.
The fix was to accumulate token usage across all cluster runs before writing the telemetry record. We added a `mergeModelUsage` helper that sums `promptTokens`, `completionTokens`, `totalTokens`, `cacheReadInputTokens`, and `cacheWriteInputTokens` across every cluster result for a given specialist. The usage dashboard now sees the full picture — all tokens, all clusters, all specialists.
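The helper described above, reconstructed from the field names in the post; the real implementation may handle missing or undefined fields differently:

```typescript
// Sum token usage across every cluster run for one specialist.
interface ModelUsage {
  promptTokens: number;
  completionTokens: number;
  totalTokens: number;
  cacheReadInputTokens: number;
  cacheWriteInputTokens: number;
}

function mergeModelUsage(runs: ModelUsage[]): ModelUsage {
  return runs.reduce(
    (acc, u) => ({
      promptTokens: acc.promptTokens + u.promptTokens,
      completionTokens: acc.completionTokens + u.completionTokens,
      totalTokens: acc.totalTokens + u.totalTokens,
      cacheReadInputTokens: acc.cacheReadInputTokens + u.cacheReadInputTokens,
      cacheWriteInputTokens: acc.cacheWriteInputTokens + u.cacheWriteInputTokens,
    }),
    {
      promptTokens: 0,
      completionTokens: 0,
      totalTokens: 0,
      cacheReadInputTokens: 0,
      cacheWriteInputTokens: 0,
    },
  );
}
```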
Model support
Seven providers, one pipeline
Alongside the pipeline improvements, we expanded the model roster. Critique's review engine now supports models from seven providers, routed by task type and credit tier.
Provider, role in the pipeline, and credit tier
| Model | Provider | Role | Tier |
|---|---|---|---|
| Claude 3.5 Sonnet | Anthropic | Lead oracle | Premium |
| GPT-5.4 | OpenAI | Lead / high-reasoning | Premium |
| Grok-4.2 | xAI | Lead / analysis | Premium |
| GLM-5-Turbo | Z.AI (Zhipu) | Specialists | Standard |
| Qwen3.6 Plus | Alibaba | Specialists / fast-path | Free |
| MiMo V2 Pro | Xiaomi | Specialists / coding | Standard |
| KAT Coder Pro V2 | KwaiPilot | Remedy | Standard |
Model routing is automatic based on review tier and policy configuration. Plan holders can set specialist and lead overrides per repository.
The high-reasoning models (Claude 3.5 Sonnet, GPT-5.4, Grok-4.2) anchor the Lead Reviewer role — final synthesis, false-positive suppression, and verdict issuance. The specialist-tier models run in parallel for faster, cheaper first-pass analysis.
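The per-repository override mechanism might look something like the following. The config shape and field names are hypothetical, not Critique's documented settings; only the idea of lead and specialist overrides comes from the post:

```typescript
// Hypothetical per-repository policy: overrides fall back to the
// tier defaults chosen by automatic routing.
interface ReviewPolicy {
  lead?: string;        // e.g. a Lead-tier model from the table above
  specialists?: string; // e.g. a specialist-tier model
}

function resolveModels(
  defaults: Required<ReviewPolicy>,
  override: ReviewPolicy,
): Required<ReviewPolicy> {
  return {
    lead: override.lead ?? defaults.lead,
    specialists: override.specialists ?? defaults.specialists,
  };
}
```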
What this means in practice
Better reviews on the PRs that matter most
The improvements compound. A large auth-heavy PR now gets clustered (so each specialist sees a focused context), has its auth-specific heuristic pack activated, gets its function signatures diff-analyzed deterministically, and has the auth files boosted in drill-down targeting — all from a single review trigger.
That same review uses two-phase reasoning throughout, sees full file contents in evidence packs, reports complete token usage across all cluster calls, and can be run by any of the seven supported models.
What comes next
We're not done. The next round of work focuses on feedback loops — letting finding history and past verdicts inform future reviews on the same repository. A codebase that has had three auth-related findings in the past month should get a higher auth heuristic weight going forward. That institutional memory exists in the data; we just need to surface it into the pipeline.
We're also working on streaming findings — surfacing individual specialist results as they arrive rather than waiting for the full pipeline to complete. For large PRs where clustering adds latency, getting the SECURITY findings 90 seconds into a 4-minute review is meaningfully better than waiting for the full verdict.
And yes — we'll keep using Critique on Critique's own PRs. The recursive loop is honest in a way that benchmarks aren't.
For investors & partners
Read our investor letter — the full picture on where we are, what we're building, and what's next for Critique.
Read the investor letter →

Try the upgraded pipeline on your next PR.
Connect your GitHub repos and let the new review engine run. The improvements are live for all users — no configuration required.
Get started →