The Review That Refuses to Ghost You
Most AI review tools disappear mid-run. Critique ships a DeepSeek V4 Flash session brain and a platform-level safety net so every PR gets a verdict—not a spinner. Launch pricing halves DeepSeek Flash and Pro credits; benchmarks sit next to Claude Sonnet 4.6 and Claude Opus 4.6 for context.
Half-price DeepSeek V4 lanes for both Flash and Pro.
Anthropic's Sonnet and Opus tiers still excel on frontier agentic dashboards, and Critique bundles those models everywhere you expect them. DeepSeek sits in a different place on the value curve: when the watchdog and specialist graph call Flash on a cadence measured in tens of seconds, predictable credit floors matter as much as raw Elo.

Most AI review demos are confident for ninety seconds. Then production happens. A sandbox wedges on a subprocess hang. A stream stops mid-tool-call with no exit code. The worker process exits cleanly from its own perspective and publishes nothing. From your side of the merge queue, all of that looks identical: silence. Humans interpret silence as thinking. Usually it is not thinking. It is stuck.
The consequence is specific and measurable. Your team runs three reviews before trust erodes. A bot that goes quiet once gets mentally filed as unreliable. After the second incident, developers stop waiting for it. By the third, the product is decorative infrastructure. That is the failure mode nobody talks about in AI code review—not "it gave a wrong comment" but "it just stopped."
TWO GUARDS, TWO TIME HORIZONS
The architecture separates the problem by time scale. In-session stalls—where the review workspace is still alive but quiet—need a fast, structured response before the silence becomes permanent. Run-level silence—where nothing has hit the board for long enough that the job may already be dead—needs a different enforcer: one that acts at the platform level, marks the outcome explicitly, and starts recovery without waiting for a human to notice.
Session supervisor
Fires on a tight interval while the review is in flight. Reads the live server log, checks the cadence of tool calls, and decides: wait (stream is still alive), nudge (targeted recovery prompt into the same session, preserving context), or abort (kill the workspace and surface the outcome). Designed to recover without restarting.
deepseek/deepseek-v4-flash — default brain

Platform watchdog
Calendar-scale enforcement. If the entire run stops emitting board activity beyond the stale threshold, Critique marks it failed with an explicit message, writes an audit entry, and queues fallback over the GitHub API path. You still get motion. The PR does not marinate. Nothing is left ambiguous.
Automatic recovery queue — no sandbox dependency

WHY DEEPSEEK V4 FLASH
Supervision is not synthesis. The supervisor does not write review comments. Its job is to look at traces, compare expected cadence against observed activity, and pick from a short action list with a tight time budget. That is a reasoning task, but latency is part of the specification. If the supervisor thinks at flagship cost and speed, every stall check becomes a budget event—and you end up rate-limiting the thing designed to prevent rate-limiting your reviews.
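That action list is small enough to sketch. Below is a minimal Python illustration of the wait / nudge / abort loop; the session object and every method on it are hypothetical stand-ins rather than Critique's actual API, and the thresholds are assumptions, not shipped defaults.

```python
import time

CHECK_INTERVAL_S = 15   # assumed supervisor cadence
STALL_THRESHOLD_S = 90  # assumed: silence long enough to warrant a decision
NUDGE_BUDGET = 2        # assumed: nudges allowed before aborting

def supervise(session):
    """Poll the live trace and pick wait / nudge / abort on a fixed time budget."""
    nudges_used = 0
    while session.in_flight():
        time.sleep(CHECK_INTERVAL_S)
        quiet_for = time.time() - session.last_tool_call_ts()
        if quiet_for < STALL_THRESHOLD_S:
            continue  # stream is still alive: wait
        if nudges_used < NUDGE_BUDGET:
            # Targeted recovery prompt into the same session, context preserved.
            session.send_recovery_prompt(reason=f"quiet for {quiet_for:.0f}s")
            nudges_used += 1
        else:
            # Kill the workspace and surface the outcome on the board.
            session.abort(reason="stall: nudge budget exhausted")
            session.board.log("supervisor", action="abort", quiet_s=round(quiet_for))
            return
```

The point of the shape is that every branch terminates in a visible action; there is no path where the loop gives up in silence.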
DeepSeek V4 Flash is Mixture-of-Experts at 284B total parameters with only 13B active per forward step. Hybrid attention handles long server log traces without burning the full parameter budget. And unlike prior generations, it comes with documented reasoning modes—you trade depth against latency on a dial, not by swapping providers. For a control loop that fires every few seconds, that is the right architectural profile.
- Prior generation: DeepSeek V3.2 (Reasoning) — ~1203 Elo
Strong for deep synthesis. 128K context, 37B active parameters. Good for lead review work. Too slow and too expensive to fire on a tight supervision loop—so no session brain shipped.
- V4 Flash launches: DeepSeek V4 Flash — ~1414 Elo (+211)
284B total, 13B active per call. Hybrid attention for long traces. Reasoning modes that trade depth against speed without a vendor swap. The Elo lift makes the decision clean: this is not a cost compromise, it is the right model for the role.
- Escalation path: DeepSeek V4 Pro — ~1558 Elo (+144 above Flash)
1.6T total, 49B active. Open-weights GDPval-AA leader in Artificial Analysis's April 2026 snapshot. Critique routes synthesis here when the PR is high-risk—security modules, billing logic, concurrency primitives. Three credits versus one.
BENCHMARKS
GDPval-AA from Artificial Analysis measures Elo-style performance on agentic tasks with shell and browser access, a closer proxy to PR review reality than static benchmarks models can memorize: tools fail, and recovery matters. The Claude rows below use Adaptive reasoning at Max effort, the configuration Artificial Analysis exercised when Sonnet 4.6 set a new GDPval-AA high watermark in its February 2026 publications and when Opus 4.6 reached 1606 Elo shortly before Sonnet narrowly exceeded it inside overlapping confidence intervals. The DeepSeek rows use the Reasoning Max effort figures from Artificial Analysis's DeepSeek launch notes later that spring: 1554 for V4 Pro, 1388 for V4 Flash. These are directional leaderboards; your repository is the final test.
Higher is better. Anthropic datapoints cite Artificial Analysis pre-release benchmarking (Feb 2026 Sonnet bulletins; Feb 2026 Opus 4.6 article); DeepSeek values cite Artificial Analysis DeepSeek launch reporting (April 2026).
Sonnet and Opus are still the pragmatic answer when leadership explicitly wants frontier Anthropic execution for customer-facing merges. Critique mounts them next to lanes like DeepSeek because the two answer different cost envelopes: the Claude stack pays for tighter confidence tails, while DeepSeek makes cadence workloads, such as supervisory loops that probe server logs dozens of times per review, orders of magnitude cheaper.
These are publicly quoted vendor tiers, listed for benchmarking discipline: before dynamic discounts, caches, routing surcharges, or Critique orchestration envelopes. Claude rows follow Anthropic list pricing surfaced in contemporaneous Artificial Analysis commentary; DeepSeek rows cite Artificial Analysis DeepSeek pricing tables from April 2026.
| Model | Approx. vendor output tier (USD per 1M output tokens) |
|---|---|
| Claude Opus (latest list tier, as cited by AA alongside Opus 4.6) | $25 |
| Claude Sonnet 4.6 (Sonnet-tier list output) | $15 |
| DeepSeek V4 Pro | $3.48 |
| DeepSeek V4 Flash | $0.28 |
Those vendor anchors are why we lean on DeepSeek V4 Flash for tight supervision: you can amortize stall checks without financing them like a flagship rollout. Fifty percent off Critique credit floors until the promo ends compresses adoption risk on both DeepSeek tiers while teams trial the watchdog.
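To make the gap concrete, here is back-of-envelope arithmetic against the list prices above; the per-check token count and review duration are illustrative assumptions, not measured Critique figures.

```python
# Assumed workload: a 30-minute review, a stall check every 15 seconds,
# ~300 output tokens per check. All three numbers are illustrative.
checks = 30 * 60 // 15                 # 120 checks per review
total_tokens = checks * 300            # 36,000 output tokens

price_per_m = {"Claude Opus": 25.00, "Claude Sonnet 4.6": 15.00,
               "DeepSeek V4 Pro": 3.48, "DeepSeek V4 Flash": 0.28}

for model, usd in price_per_m.items():
    print(f"{model:>18}: ${total_tokens / 1_000_000 * usd:.4f} per review")
# Flash lands near $0.01 per review and Opus near $0.90: the roughly 89x
# spread is what separates "always on" from "line item on the budget".
```

At those volumes the supervisor is effectively free on Flash, which is what makes leaving the loop on for every PR defensible.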
This is exactly why Critique standardized on DeepSeek V4 Flash for session supervision—not because it defeats Sonnet head-to-head, but because the intelligence density per marginal dollar aligns with infra that must chatter constantly. Critique couples that choice with deterministic board traces and a fallback path so stalls never disguise themselves as polite thinking. Promo pricing on Flash keeps the supervisory lane cheap enough that every team can justify turning continuity on everywhere, while Pro inherits the heavier GDPval-AA numbers when you escalate synthesis lanes.
Below we zoom in on the DeepSeek-family curves alone so you can see how reasoning modes intersect with token throughput, the detail that persuaded us to bias the supervisor toward Flash's reasoning High profile instead of blindly maxing the reasoning budget.
Higher is better. Source: Artificial Analysis April 2026. Flash (High) beats Flash (Max) at lower token cost — showing the efficiency story is real.
V4 Flash (High) outscores V4 Flash (Max) on GDPval-AA while using roughly half the output tokens. "More reasoning budget" is not the same as "better decisions."
Flash for supervision — Pro for high-stakes synthesis
You want both profiles in the stack. Flash fires often and cheaply on the control loop. Pro activates when a single wrong judgment is expensive.
Total output-token volume on the benchmark. Not a quality proxy—a reminder that efficient reasoning is a feature.
The chart is Artificial Analysis's data: Flash (High) wins on score while spending half the tokens of Flash (Max). Critique routes Flash in High mode for the supervisor by default.
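One way to picture the resulting routing is a small table keyed by lane; the schema below is a hypothetical sketch, not Critique's or DeepSeek's published configuration format.

```python
# Hypothetical lane routing. Key names and the "reasoning" field are
# illustrative; only the model choices and credit costs come from the text.
ROUTES = {
    "session_supervisor": {                 # fires every few seconds: latency-bound
        "model": "deepseek/deepseek-v4-flash",
        "reasoning": "high",                # High beats Max on score at ~half the tokens
        "credits": 1,
    },
    "high_risk_synthesis": {                # security modules, billing, concurrency
        "model": "deepseek/deepseek-v4-pro",
        "reasoning": "max",
        "credits": 3,                       # reserved for expensive mistakes
    },
}

def route(lane: str) -> dict:
    # Unknown lanes fall back to the cheap, always-on profile.
    return ROUTES.get(lane, ROUTES["session_supervisor"])
```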
DeepSeek V4 — Flash and Pro on the credit ladder
Credits meter complete review journeys: orchestration, specialist fan-out, depth routing, failover. Anchor against vendor list rates to understand the raw token economics underneath.
DeepSeek V4 Flash (1 credit): Session supervisor, specialist fan-out, and volume reviews. The always-on lane.
DeepSeek V4 Pro (3 credits): Open-weights GDPval-AA leader. Routes automatically when risk is high.
Vendor prices from Artificial Analysis April 2026 launch notes; hosted routes vary. Critique credits meter full journeys (orchestration, specialists, depth). The fifty-percent promo floors halve billed Flash and Pro credits through the advertised UTC cutoff without changing downstream vendor list prices.
WHAT YOU ACTUALLY SEE
Open a PR on a repository with Critique installed. Watch the live board. You will see beat events arriving every few seconds as the review workspace makes progress: files read, test probes run, findings staged. If the workspace goes quiet past the stall threshold, a supervisor decision entry appears on the board—labeled, timestamped, with the action taken and the reason. If a nudge fires and the workspace recovers, the beats resume. If the supervisor aborts cleanly, the board shows that too. Nothing is ambiguous.
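Concretely, a beat and a supervisor decision might look like the entries below; the field names are illustrative, not Critique's published board schema.

```python
# Illustrative board entries; field names are assumptions.
beat = {
    "kind": "beat",
    "ts": "2026-05-04T14:02:11Z",
    "detail": "read src/billing/invoice.py; staged 2 findings",
}
supervisor_decision = {
    "kind": "supervisor_decision",
    "ts": "2026-05-04T14:04:37Z",
    "action": "nudge",  # one of: wait, nudge, abort
    "reason": "no tool calls for 96s; stream socket still open",
}
```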
If the whole run goes cold—no beats for long enough that recovery is moot—the platform watchdog fires independently. The run status updates to failed with an explicit message. Another board entry logs the decision. A fallback review starts automatically, routed through the GitHub API path instead of the sandbox, so the PR still gets a verdict before the merge window closes.
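Stripped of plumbing, the run-level check is a staleness comparison plus explicit bookkeeping. A minimal sketch, with every name on the run object hypothetical:

```python
import time

STALE_THRESHOLD_S = 30 * 60  # assumed run-level threshold (~30 min, per the table below)

def watchdog_tick(run):
    """Platform-level check: if the whole run has gone cold, fail it loudly."""
    quiet_for = time.time() - run.last_board_activity_ts()
    if quiet_for < STALE_THRESHOLD_S:
        return
    # Mark the outcome explicitly instead of leaving the PR ambiguous.
    run.mark_failed(reason=f"no board activity for {quiet_for / 60:.0f} min")
    run.board.log("platform_watchdog", action="fail_and_fallback")
    # Queue the fallback review over the GitHub API path (no sandbox dependency).
    run.queue_fallback(path="github_api")
```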
Each layer handles a different kind of failure. Together they close the gaps.
| Layer | Time horizon | Success signal |
|---|---|---|
| Review workspace (lead) | Turn-scale seconds | Merge-grade findings, evidence pack, structured JSON artifact. |
| Session supervisor — Flash | Stall interval (~60–180 s) | Stall detected and recovered without restarting context, or cleanly aborted. |
| Platform watchdog | Run-level (~30 min) | Run marked failed with explicit reason; fallback path queued and publishes. |
All three layers write structured board entries. Every outcome is auditable.
Without continuity guards:
- One ghost review trains the team to treat the bot as decorative
- Security fixes wait on merges that never finalize
- On-call time leaks into babysitting automation that should be infrastructure
- Leadership loses confidence in agentic workflows broadly

With continuity guards:
- Every stall becomes a labeled, auditable event instead of a mystery
- Credit burn stops accumulating on infinite hangs
- Teams trust automation because failure modes are explicit and bounded
- PRs get verdicts even when the sandbox has a bad day
Ship reviews that finish
Give Critique a PR that used to scare your bots. The board stays noisy in the right ways, and never silent.
Run Critique on your next PR