Critique just got a whole lot better
Ten improvements to the review pipeline — semantic analysis, large-PR clustering, intrinsic-risk drill-down, expanded model support, and a team that used Critique to review its own code along the way.

How we got here
Critique reviewed itself
Every improvement in this post was built while we used Critique on our own pull requests. Not in a demo environment, not against synthetic test cases — on the real PRs that shipped the feature itself. At some point during this sprint, we had Critique reviewing the code that taught Critique how to review code better.
That recursive loop turned out to be the most honest test we could run. When the tool finds a real issue in its own implementation — a missing null check in the evidence pack builder, an unguarded regex in the semantic index — you know the signal-to-noise ratio is where it needs to be.
What follows is an honest account of what we built, why we built it, and what it means for the quality of reviews you get from Critique today.
The starting point
What was missing before
Before this round of improvements, Critique's review pipeline was already multi-agent and evidence-grounded. Scout would gather context, specialists would run in parallel, and the Lead would synthesize a verdict. That structure was sound. But the gaps inside it were real.
- No structural understanding of the repo — files were a flat list
- Large PRs diluted context across all specialists equally
- Drill-down targets chosen by finding density alone (not inherent risk)
- Cross-file analysis locked to TypeScript only
- No dedicated code quality specialist for logic depth
- Two-phase reasoning absent — no separation of thinking from output
- No domain-specific heuristics for auth, async, API contracts
- Contract-diff analysis for signatures and enums was missing entirely
- Evidence packaging capped too low for complex files
- Token usage from clustered runs silently lost
What's in place now
- Semantic index maps route handlers, DB operations, and subsystems
- Large PRs clustered by subsystem — each specialist gets focused context
- Drill-down targets scored on intrinsic risk (auth, billing, migrations, concurrency)
- Cross-file analysis expanded to all source languages
- CODE_QUALITY specialist added end-to-end
- Two-phase reasoning in specialists, drill-down, and cross-file
- Auth, async, and API compatibility domain packs
- Deterministic contract-diff analyzers for signatures, enums, type fields
- Evidence limits raised: 6000 chars patch + 5000 contents with expandedLimits
- Token usage aggregated correctly across all cluster runs
Improvement 1
Semantic index — structural awareness at the repo level
The single most foundational change. Before this, Critique's specialists saw files as a flat array — no understanding of which files were route handlers, which touched the database, or which belonged to the same logical subsystem.
The new semantic index runs at the start of every review. It parses the evidence pack and builds a structural map: route handlers (files matching API route patterns), database model usages (files that use ORM calls or schema definitions), and subsystems (logical groupings derived from directory structure and file naming conventions). This map is passed to every downstream component in the pipeline.
The index is deterministic — it doesn't call a model, it runs static analysis on the files. That means it's fast and predictable. The tradeoff is that it can only see patterns visible in file paths and import statements. We're comfortable with that: determinism beats inference for structural facts.
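A minimal sketch of what a deterministic index like this could look like. The `SemanticIndex` shape, the `buildSemanticIndex` name, and the specific regexes are illustrative assumptions, not Critique's actual API:

```typescript
// Hypothetical sketch of a deterministic semantic index: pure static
// analysis over paths and contents, no model calls.
interface SemanticIndex {
  routeHandlers: string[];
  dbUsages: string[];
  subsystems: Map<string, string[]>;
}

function buildSemanticIndex(
  files: { path: string; contents: string }[],
): SemanticIndex {
  const index: SemanticIndex = {
    routeHandlers: [],
    dbUsages: [],
    subsystems: new Map(),
  };
  for (const file of files) {
    // Route handlers: API route path patterns (assumed patterns).
    if (/\/(routes?|api|controllers?)\//.test(file.path)) {
      index.routeHandlers.push(file.path);
    }
    // DB usage: ORM calls or schema definitions visible in source text.
    if (/\b(prisma|knex|sequelize|createTable|SELECT\s)/i.test(file.contents)) {
      index.dbUsages.push(file.path);
    }
    // Subsystem: top-level directory under src/ as a coarse grouping.
    const subsystem = file.path.replace(/^src\//, "").split("/")[0];
    const group = index.subsystems.get(subsystem) ?? [];
    group.push(file.path);
    index.subsystems.set(subsystem, group);
  }
  return index;
}
```

Because nothing here is probabilistic, the same PR always yields the same structural map, which is what makes it safe to feed into every downstream component.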
Improvement 2
Large-PR subsystem clustering
When a PR touches 15 or more source files, dumping all of them into a single specialist prompt creates a dilution problem. The model has to divide its attention across too many concerns. Signal from the auth module competes with signal from the UI layer. The result is shallower findings and more false positives.
The clustering system solves this by splitting the PR into subsystem-aligned groups using the semantic index. Each specialist now runs once per cluster instead of once over the whole PR. The results are deduplicated by `concernKey` — if the SECURITY specialist surfaces the same token-validation issue from two different clusters, it appears once in the final report.
Very small clusters (1–2 files) are merged into an "other" group to avoid the overhead of cluster explosion. Clustering only activates when two or more meaningful subsystems exist in the PR; for most PRs, the non-clustered path runs as before.
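The dedup rule can be sketched as a merge over per-cluster results. The `Finding` shape here is a simplification of whatever the pipeline actually carries; `concernKey` is the field named above:

```typescript
// Illustrative cluster-level dedup by concernKey. When the same concern
// surfaces from two clusters, keep the highest-severity instance.
interface Finding {
  concernKey: string;
  title: string;
  severity: number;
}

function dedupeAcrossClusters(clusterResults: Finding[][]): Finding[] {
  const seen = new Map<string, Finding>();
  for (const findings of clusterResults) {
    for (const f of findings) {
      const prior = seen.get(f.concernKey);
      if (!prior || f.severity > prior.severity) seen.set(f.concernKey, f);
    }
  }
  return [...seen.values()];
}
```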
Improvement 3
CODE_QUALITY specialist — logic depth in every review
The original four specialists — SECURITY, TESTS, ARCHITECTURE, PERFORMANCE — left a gap. Logic bugs, edge-case handling, null safety, and algorithmic correctness lived in no-man's-land. They were too implementation-specific for ARCHITECTURE, too broad for PERFORMANCE, and too unrelated to security or tests.
CODE_QUALITY fills that gap. It runs on every review and focuses on logic correctness, null safety, error handling completeness, edge cases, and code clarity. Its heuristics are deliberately distinct from the other specialists — no duplication, just coverage of the dimension that was missing.
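One way to picture the coverage split is a specialist registry mapping each role to its focus areas. The registry shape and the focus strings for the original four specialists are our assumptions; the CODE_QUALITY entries come from the description above:

```typescript
// Hypothetical specialist registry; names match the post, shape is ours.
const SPECIALISTS = {
  SECURITY: ["injection", "authz boundaries", "secrets handling"],
  TESTS: ["coverage gaps", "assertion quality"],
  ARCHITECTURE: ["layering", "coupling", "module boundaries"],
  PERFORMANCE: ["hot paths", "allocations", "query patterns"],
  CODE_QUALITY: [
    "logic correctness",
    "null safety",
    "error handling completeness",
    "edge cases",
    "code clarity",
  ],
} as const;
```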
Improvement 4
Two-phase reasoning — thinking before concluding
The old specialist prompts asked the model to produce findings directly. The new ones separate reasoning from output. The first phase produces a structured chain of thought: what patterns does the code exhibit? What assumptions is it making? What could go wrong? Only then does the second phase produce the actual findings JSON.
This change alone reduced false positives meaningfully. When a model is forced to reason through the code before issuing a verdict, it's less likely to flag a pattern it doesn't fully understand. We applied two-phase reasoning to specialists, drill-down, and cross-file analysis — every AI call in the pipeline now separates thinking from output.
The latency cost is real but modest: roughly 20–30% longer per specialist call. The quality improvement is worth it. A 30% false-positive reduction matters more than a 20–30% latency increase when the output is a code review finding.
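The call pattern can be sketched as two sequential model calls, the second conditioned on the first. `callModel` is a stand-in for whatever model client the pipeline uses; the prompt wording is illustrative:

```typescript
// Sketch of the two-phase pattern: reason first, conclude second.
// callModel is an assumed generic helper, not Critique's real client.
async function reviewTwoPhase(
  callModel: (prompt: string) => Promise<string>,
  diff: string,
): Promise<string> {
  // Phase 1: free-form structured reasoning only. No findings allowed yet.
  const reasoning = await callModel(
    "Reason about this diff before judging it.\n" +
      "What patterns does it exhibit? What assumptions does it make? " +
      "What could go wrong?\n\n" + diff,
  );
  // Phase 2: findings JSON, conditioned on the phase-1 reasoning.
  return callModel(
    "Given your analysis below, emit findings as JSON only.\n\n" +
      "Analysis:\n" + reasoning + "\n\nDiff:\n" + diff,
  );
}
```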
Improvement 5
Domain heuristic packs — auth, async, API compatibility
General heuristics can only go so far. The failure modes in authentication code are fundamentally different from the failure modes in async concurrency code — and both are different from API contract compatibility issues.
We built three domain heuristic packs that activate when the semantic index identifies relevant patterns in the PR. The auth pack checks for session fixation patterns, privilege escalation paths, token validation completeness, and RBAC boundary enforcement. The async/concurrency pack looks for missing `await`, shared state mutation in concurrent paths, unhandled promise rejections, and lock-free race conditions. The API compatibility pack checks parameter type widening, required-to-optional field changes, response schema mutations, and error code removals.
Auth pack triggers
- Files match auth/ permission/ session/ rbac/ paths
- Imports include session management libraries
- Functions named authenticate, authorize, checkPermission detected

Async/concurrency pack triggers
- Files contain async/await patterns
- Queue/ worker/ job/ cron paths matched
- Concurrent data access patterns detected in changed lines

API compatibility pack triggers
- Route handler files in evidence pack
- OpenAPI or GraphQL schema files touched
- Function signature changes detected in exported API surface
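As a rough sketch, the path-based portion of these triggers reduces to a set of predicates. The regexes below are our approximations of the trigger lists above; the real logic also inspects imports and changed lines:

```typescript
// Hedged sketch: pack activation expressed as path predicates.
const PACK_TRIGGERS: Record<string, RegExp> = {
  auth: /\/(auth|permission|session|rbac)\//,
  asyncConcurrency: /\/(queue|worker|job|cron)\//,
  apiCompat: /\/(routes?|api)\/|openapi|\.graphql$/,
};

function activePacks(changedPaths: string[]): string[] {
  return Object.entries(PACK_TRIGGERS)
    .filter(([, pattern]) => changedPaths.some((p) => pattern.test(p)))
    .map(([pack]) => pack);
}
```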
Improvement 6
Deterministic contract-diff analyzers
Not every finding needs a model. Some of the most impactful issues are also the most detectable by static analysis: a function that previously required three parameters now requires two, a TypeScript enum that dropped a variant that callers still reference, a type field that changed from required to optional.
We built deterministic analyzers for three of these patterns: function signature changes (comparing exported function signatures between base and head), enum member removals (detecting dropped enum variants), and type field optionality changes (required → optional transitions in exported type definitions). These run before any model call and produce findings with 1.0 confidence — they're not guesses, they're facts.
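A minimal version of the enum-member analyzer shows the shape of the idea. A production analyzer would parse the AST; this regex sketch, with names of our own choosing, only handles the single-enum case:

```typescript
// Deterministic sketch: report enum members present in base but absent
// in head. Text-level only; real analyzers would use the parser.
function removedEnumMembers(base: string, head: string): string[] {
  const members = (src: string): Set<string> => {
    const body = /enum\s+\w+\s*{([^}]*)}/.exec(src)?.[1] ?? "";
    return new Set(
      body
        .split(",")
        .map((m) => m.trim().split("=")[0].trim())
        .filter(Boolean),
    );
  };
  const headMembers = members(head);
  return [...members(base)].filter((m) => !headMembers.has(m));
}
```

Because the comparison is purely textual and exact, any non-empty result is a fact about the diff, not a model's guess, which is what justifies the 1.0 confidence.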
Improvement 7
Intrinsic-risk drill-down targeting
Drill-down — the second tier of Critique's analysis — previously selected files based on finding density and severity. High finding count → drill into that file. Simple, but incomplete.
The new targeting system adds intrinsic risk scoring. A file in auth/ or billing/ gets a significant risk boost even if zero findings have been attributed to it yet. Database migration files get a boost. Middleware files get a boost. Files that changed more than 100 lines get a boost. Large addition+deletion volume is risky independent of what the specialists think.
The practical result: drill-down no longer misses high-risk files that happen to be clean-looking at the surface level. An auth file that passes the SECURITY specialist's heuristics still gets deep-dived if the path signals warrant it. That's where the real bugs hide.
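The boost logic described above can be sketched as an additive scorer. The weights here are illustrative assumptions; only the boost categories come from the post:

```typescript
// Hypothetical intrinsic-risk scorer: path- and churn-based boosts that
// apply even when a file has zero attributed findings.
function intrinsicRisk(path: string, changedLines: number): number {
  let score = 0;
  if (/\/(auth|billing)\//.test(path)) score += 3; // sensitive domains
  if (/migrations?\//.test(path)) score += 2;      // schema migrations
  if (/middleware/.test(path)) score += 1;         // request-path code
  if (changedLines > 100) score += 2;              // large churn is risky per se
  return score;
}
```

The drill-down selector would then combine this with finding density, so a clean-looking auth file still outranks a noisy UI file.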
Improvement 8
Cross-file analysis expanded to all languages
Cross-file analysis was previously gated on TypeScript files only. The filter made sense as a conservative initial scope, but it excluded a large class of meaningful relationships: Python service-to-service calls, Go interface implementations, Ruby method chains, SQL schema-to-ORM relationships.
We removed the language filter. Cross-file analysis now runs whenever 2 or more source files are in the PR, regardless of language. The relationship model — imports, type dependencies, schema consumers, API callers, test targets — is language-agnostic at the level we're analyzing.
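At that granularity, "who depends on whom" mostly reduces to text-level import extraction, which is why dropping the filter is cheap. A sketch, with patterns of our own choosing covering three languages:

```typescript
// Language-agnostic sketch: extract import targets via per-language
// regexes. Patterns are illustrative, not the pipeline's actual set.
const IMPORT_PATTERNS: RegExp[] = [
  /import\s+.*?from\s+['"](.+?)['"]/g, // TypeScript / JavaScript
  /^\s*from\s+(\S+)\s+import\b/gm,     // Python
  /^\s*import\s+"(.+?)"/gm,            // Go
];

function importTargets(source: string): string[] {
  const targets: string[] = [];
  for (const pattern of IMPORT_PATTERNS) {
    for (const match of source.matchAll(pattern)) targets.push(match[1]);
  }
  return targets;
}
```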
Improvement 9
Evidence packaging — more context, better targeting
The evidence pack is the input to every model call in the pipeline. If it's too thin, specialists miss issues. If it's too wide, the context gets diluted. Finding the right limits matters enormously.
We made two structural improvements. First, `serializeFileEvidence` now includes both the patch AND the full file contents when `expandedLimits` is enabled — previously it chose one or the other. Drill-down calls now see the full picture. Second, we increased the character limits: patch content up to 6000 chars (from 3000), file contents up to 5000 chars (from 2500). For complex files, these limits meant the difference between seeing the full function and seeing half of it.
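A hedged sketch of the `expandedLimits` behavior described above; the real `serializeFileEvidence` certainly carries more fields, but the limit logic has this shape:

```typescript
// Sketch: with expandedLimits, include the patch AND full contents at
// the raised caps, instead of choosing one or the other.
function serializeFileEvidence(
  file: { path: string; patch: string; contents: string },
  expandedLimits: boolean,
): string {
  const patchCap = expandedLimits ? 6000 : 3000;
  const contentsCap = expandedLimits ? 5000 : 2500;
  const parts = [
    `FILE ${file.path}`,
    `PATCH:\n${file.patch.slice(0, patchCap)}`,
  ];
  if (expandedLimits) {
    parts.push(`CONTENTS:\n${file.contents.slice(0, contentsCap)}`);
  }
  return parts.join("\n\n");
}
```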
The caller search in the scout was also improved. Files that import from sensitive paths — auth, billing, session — are now prioritized in caller file selection, so the evidence pack surfaces the most relevant callers rather than the first N alphabetically.
Improvement 10
Clustered token usage — no more silent loss
This one was subtle and easy to miss. When clustering was active, each specialist ran once per subsystem cluster — potentially 15 model calls for a large PR. But only one cluster's telemetry was recorded per specialist in the agent run output. Token usage from the other 14 calls was silently dropped.
The fix was to accumulate token usage across all cluster runs before writing the telemetry record. We added a `mergeModelUsage` helper that sums `promptTokens`, `completionTokens`, `totalTokens`, `cacheReadInputTokens`, and `cacheWriteInputTokens` across every cluster result for a given specialist. The usage dashboard now sees the full picture — all tokens, all clusters, all specialists.
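The helper described above, reconstructed from the field names in the post; the real implementation may handle missing or undefined fields differently:

```typescript
// Sum token usage across every cluster run for one specialist.
interface ModelUsage {
  promptTokens: number;
  completionTokens: number;
  totalTokens: number;
  cacheReadInputTokens: number;
  cacheWriteInputTokens: number;
}

function mergeModelUsage(runs: ModelUsage[]): ModelUsage {
  return runs.reduce(
    (acc, u) => ({
      promptTokens: acc.promptTokens + u.promptTokens,
      completionTokens: acc.completionTokens + u.completionTokens,
      totalTokens: acc.totalTokens + u.totalTokens,
      cacheReadInputTokens: acc.cacheReadInputTokens + u.cacheReadInputTokens,
      cacheWriteInputTokens: acc.cacheWriteInputTokens + u.cacheWriteInputTokens,
    }),
    {
      promptTokens: 0,
      completionTokens: 0,
      totalTokens: 0,
      cacheReadInputTokens: 0,
      cacheWriteInputTokens: 0,
    },
  );
}
```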
Model support
Seven providers, one pipeline
Alongside the pipeline improvements, we expanded the model roster. Critique's review engine now supports models from seven providers, routed by task type and credit tier.
Provider, role in the pipeline, and credit tier
| Model | Provider | Role | Tier |
|---|---|---|---|
| Claude 3.5 Sonnet | Anthropic | Lead oracle | Premium |
| GPT-5.4 | OpenAI | Lead / high-reasoning | Premium |
| Grok-4.2 | xAI | Lead / analysis | Premium |
| GLM-5-Turbo | Z.AI (Zhipu) | Specialists | Standard |
| Qwen3.6 Plus | Alibaba | Specialists / fast-path | Free |
| MiMo V2 Pro | Xiaomi | Specialists / coding | Standard |
| KAT Coder Pro V2 | KwaiPilot | Remedy | Standard |
Model routing is automatic based on review tier and policy configuration. Plan holders can set specialist and lead overrides per repository.
The high-reasoning models (Claude 3.5 Sonnet, GPT-5.4, Grok-4.2) anchor the Lead Reviewer role — final synthesis, false-positive suppression, and verdict issuance. The specialist-tier models run in parallel for faster, cheaper first-pass analysis.
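The per-repository override mechanism might look something like the following. The config shape and field names are hypothetical, not Critique's documented settings; only the idea of lead and specialist overrides comes from the post:

```typescript
// Hypothetical per-repository policy: overrides fall back to the
// tier defaults chosen by automatic routing.
interface ReviewPolicy {
  lead?: string;        // e.g. a Lead-tier model from the table above
  specialists?: string; // e.g. a specialist-tier model
}

function resolveModels(
  defaults: Required<ReviewPolicy>,
  override: ReviewPolicy,
): Required<ReviewPolicy> {
  return {
    lead: override.lead ?? defaults.lead,
    specialists: override.specialists ?? defaults.specialists,
  };
}
```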
What this means in practice
Better reviews on the PRs that matter most
The improvements compound. A large auth-heavy PR now gets clustered (so each specialist sees a focused context), has its auth-specific heuristic pack activated, gets its function signatures diff-analyzed deterministically, and has the auth files boosted in drill-down targeting — all from a single review trigger.
That same review uses two-phase reasoning throughout, sees full file contents in evidence packs, reports complete token usage across all cluster calls, and can be run by any of the seven supported models.
What comes next
We're not done. The next round of work focuses on feedback loops — letting finding history and past verdicts inform future reviews on the same repository. A codebase that has had three auth-related findings in the past month should get a higher auth heuristic weight going forward. That institutional memory exists in the data; we just need to surface it into the pipeline.
We're also working on streaming findings — surfacing individual specialist results as they arrive rather than waiting for the full pipeline to complete. For large PRs where clustering adds latency, getting the SECURITY findings 90 seconds into a 4-minute review is meaningfully better than waiting for the full verdict.
And yes — we'll keep using Critique on Critique's own PRs. The recursive loop is honest in a way that benchmarks aren't.
For investors & partners
Read our investor letter — the full picture on where we are, what we're building, and what's next for Critique.
Read the investor letter →

Try the upgraded pipeline on your next PR.
Connect your GitHub repos and let the new review engine run. The improvements are live for all users — no configuration required.
Get started →