Critique · 14 min read

The Review That Refuses to Ghost You

Most AI review tools disappear mid-run. Critique ships a DeepSeek V4 Flash session brain and a platform-level safety net so every PR gets a verdict—not a spinner. Launch pricing halves DeepSeek Flash and Pro credits; benchmarks sit next to Claude Sonnet 4.6 and Claude Opus 4.6 for context.

critique.sh · Watchdog active · DeepSeek V4 Flash
Live session — PR #2847 · auth token rotation
review-agent · sandbox
$ git diff HEAD~1..origin/main
Tracing auth middleware — 12 touched files found
Expanding impact zone across 47 call sites
$ run edge-case probes on token rotation path
! 2 assertions missing in session-expiry branch
· · · silence: 63 s — no new tool activity · · ·
[SUPERVISOR] Stall detected. Recovery nudge sent.
Activity resumed. Writing expiry coverage notes.
Artifact sealed — /output/review.json ready.
The guard

Session supervisor · deepseek/deepseek-v4-flash · WATCHING
  • Wait — stream still alive
  • Nudge — surgical recovery prompt
  • Abort — kill ghost session
Platform watchdog · Automatic recovery queue · ARMED
  • Detects run-level silence beyond the stale threshold
  • Marks run failed — explicit reason, board entry + audit trail
  • Queues fallback path — GitHub API pipeline, no sandbox
  • 1414 · GDPval-AA Elo
  • 13B · Active params/call
  • 1 cr · Critique credit
  • <20s · Stall decision time
  • Session recovery without restarting context
  • Stall visible in review board before fallback triggers
  • Every decision is a structured board entry
Launch window — treasury pricing

Half-price DeepSeek V4 lanes for both Flash and Pro.

Anthropic's Sonnet and Opus tiers still excel on frontier agentic dashboards, and Critique bundles those models everywhere you expect them. DeepSeek sits at a different point on the value curve: when the watchdog and specialist graph call Flash on a cadence measured in tens of seconds, predictable credit floors matter as much as raw Elo.

50% off credit floor · launch window
DeepSeek V4 Flash · 0.5 credits (normally 1 credit)
Ends Tuesday, May 19, 2026 (UTC cutoff)

50% off credit floor · launch window
DeepSeek V4 Pro · 1.5 credits (normally 3 credits)
Ends Tuesday, May 19, 2026 (UTC cutoff)
Launch art · DeepSeek V4 Flash & Pro, "Return of the Whale" — 50% off review credit floors on Critique through the May 2026 launch window.
2
Guard layers — session brain + platform watchdog
50%
Off DeepSeek Flash & Pro credit floors while the promo is active
1414
GDPval-AA · V4 Flash (Reasoning, High) · Artificial Analysis
13B
Active MoE parameters per supervisor call · built for cadence work

Most AI review demos are confident for ninety seconds. Then production happens. A sandbox wedges on a subprocess hang. A stream stops mid-tool-call with no exit code. The worker process exits cleanly from its own perspective and publishes nothing. From your side of the merge queue, all of that looks identical: silence. Humans interpret silence as thinking. Usually it is not thinking. It is stuck.

The consequence is specific and measurable. Your team runs three reviews before trust erodes. A bot that goes quiet once gets mentally filed as unreliable. After the second incident, developers stop waiting for it. By the third, the product is decorative infrastructure. That is the failure mode nobody talks about in AI code review—not "it gave a wrong comment" but "it just stopped."

TWO GUARDS, TWO TIME HORIZONS

The architecture separates the problem by time scale. In-session stalls—where the review workspace is still alive but quiet—need a fast, structured response before the silence becomes permanent. Run-level silence—where nothing has hit the board for long enough that the job may already be dead—needs a different enforcer: one that acts at the platform level, marks the outcome explicitly, and starts recovery without waiting for a human to notice.

Session supervisor

Fires on a tight interval while the review is in flight. Reads the live server log, checks the cadence of tool calls, and decides: wait (stream is still alive), nudge (targeted recovery prompt into the same session, preserving context), or abort (kill the workspace and surface the outcome). Designed to recover without restarting.

deepseek/deepseek-v4-flash — default brain
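The wait / nudge / abort decision described above reads naturally as a small policy function. The following is an illustrative sketch, not Critique's implementation; the threshold and nudge-cap values are assumptions.

```python
from dataclasses import dataclass

# Illustrative thresholds; not Critique's actual configuration.
STALL_SECONDS = 60   # silence beyond this counts as a stall
MAX_NUDGES = 2       # after this many nudges, abort instead

@dataclass
class SessionState:
    last_tool_call_at: float  # unix timestamp of last observed tool activity
    nudges_sent: int = 0

def supervise(state: SessionState, now: float) -> str:
    """Pick one of the three supervisor actions: wait, nudge, or abort."""
    silence = now - state.last_tool_call_at
    if silence < STALL_SECONDS:
        return "wait"    # stream still alive
    if state.nudges_sent < MAX_NUDGES:
        return "nudge"   # targeted recovery prompt into the same session
    return "abort"       # kill the ghost session and surface the outcome
```

With these example numbers, a workspace silent for 30 s reads as alive, at 70 s it gets a nudge, and after two failed nudges the session is aborted cleanly rather than left to ghost.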
Platform watchdog

Calendar-scale enforcement. If the entire run stops emitting board activity beyond the stale threshold, Critique marks it failed with an explicit message, writes an audit entry, and queues fallback over the GitHub API path. You still get motion. The PR does not marinate. Nothing is left ambiguous.

Automatic recovery queue — no sandbox dependency
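The run-level check above can be sketched the same way. A minimal sketch, assuming a 30-minute stale threshold; the field names and record shape are illustrative, not Critique's published schema.

```python
import datetime as dt

STALE_THRESHOLD = dt.timedelta(minutes=30)  # illustrative run-level threshold

def watchdog_check(last_board_entry: dt.datetime, now: dt.datetime):
    """Return None while the run is healthy, else an explicit failure record."""
    silence = now - last_board_entry
    if silence <= STALE_THRESHOLD:
        return None  # run is still emitting board activity
    return {
        "status": "failed",
        "reason": f"no board activity for {int(silence.total_seconds()) // 60} min",
        "fallback": "github-api",  # queued recovery path, no sandbox dependency
        "audit": True,             # a structured board entry is written
    }
```

The key design point is that the failure is explicit: the record carries a reason and a queued fallback, so silence never has to be interpreted.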
Without continuity discipline
Agent starts → Stream stalls → Silence reads as "thinking" → Hours pass → Engineers chase the bot
With Critique
Review workspace starts → Session supervisor reacts → Recovery nudge or clean abort → Platform watchdog on silence → Fallback publishes verdict

WHY DEEPSEEK V4 FLASH

Supervision is not synthesis. The supervisor does not write review comments. Its job is to look at traces, compare expected cadence against observed activity, and pick from a short action list with a tight time budget. That is a reasoning task, but latency is part of the specification. If the supervisor thinks at flagship cost and speed, every stall check becomes a budget event—and you end up rate-limiting the thing designed to prevent rate-limiting your reviews.

DeepSeek V4 Flash is Mixture-of-Experts at 284B total parameters with only 13B active per forward step. Hybrid attention handles long server log traces without burning the full parameter budget. And unlike prior generations, it comes with documented reasoning modes—you trade depth against latency on a dial, not by swapping providers. For a control loop that fires every few seconds, that is the right architectural profile.

  1. Prior generation
    DeepSeek V3.2 (Reasoning) — ~1203 Elo

    Strong for deep synthesis. 128K context, 37B active parameters. Good for lead review work. Too slow and too expensive to fire on a tight supervision loop—so no session brain shipped.

  2. V4 Flash launches
    DeepSeek V4 Flash — ~1414 Elo (+211)

    284B total, 13B active per call. Hybrid attention for long traces. Reasoning modes that trade depth against speed without a vendor swap. The Elo lift makes the decision clean: this is not a cost compromise, it is the right model for the role.

  3. Escalation path
    DeepSeek V4 Pro — ~1558 Elo (+144 above Flash)

    1.6T total, 49B active. Open-weights GDPval-AA leader in Artificial Analysis's April 2026 snapshot. Critique routes synthesis here when the PR is high-risk—security modules, billing logic, concurrency primitives. Three credits versus one.
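The escalation rule above reads naturally as a routing function. A sketch, assuming keyword-based path matching; the hint list and model ids are illustrative, not Critique's actual routing logic.

```python
# Illustrative risk hints; not Critique's actual rules.
HIGH_RISK_HINTS = ("auth", "billing", "payment", "crypto", "lock", "mutex")

def pick_synthesis_model(touched_paths: list[str]) -> str:
    """Route high-risk PRs to V4 Pro (3 credits); default to V4 Flash (1 credit)."""
    for path in touched_paths:
        if any(hint in path.lower() for hint in HIGH_RISK_HINTS):
            return "deepseek/deepseek-v4-pro"   # depth-first, 49B active
    return "deepseek/deepseek-v4-flash"         # throughput-first, 13B active
```

A PR touching `src/billing/invoice.py` escalates to Pro; a docs-only change stays on the cheap Flash lane.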

BENCHMARKS

GDPval-AA from Artificial Analysis measures Elo-style performance on agentic tasks with shell and browser access, a closer proxy to PR-review reality than static, memorizable benchmarks: tools fail, and recovery matters. The Claude rows below use Adaptive reasoning at Max effort, the configuration Artificial Analysis exercised when Sonnet 4.6 set a new GDPval-AA high watermark in its February 2026 publications; Opus 4.6 had reached 1606 Elo shortly before Sonnet narrowly exceeded it inside overlapping confidence intervals. The DeepSeek rows pair with the Reasoning Max-effort figures from Artificial Analysis's DeepSeek launch notes later that spring (1554 for V4 Pro, 1388 for V4 Flash). These are directional leaderboards; your repository is the final test.

GDPval-AA — Claude 4.x vs DeepSeek V4 (max-effort agentic configs)

Higher is better. Anthropic datapoints cite Artificial Analysis pre-release benchmarking (Feb 2026 Sonnet bulletins; Feb 2026 Opus 4.6 article); DeepSeek values cite Artificial Analysis DeepSeek launch reporting (April 2026).

Sonnet and Opus remain the pragmatic answer when leadership explicitly wants frontier Anthropic execution on customer-facing merges. Critique mounts them next to lanes like DeepSeek because the two stacks serve different cost envelopes: the Claude stack pays for tighter confidence tails, while DeepSeek buys far cheaper cadence workloads such as supervisory loops that probe server logs dozens of times per review.

List API economics — nominal output USD per million tokens

These are publicly quoted vendor tiers for benchmarking discipline, before dynamic discounts, caches, routing surcharges, or Critique orchestration envelopes. Claude rows follow Anthropic list pricing surfaced in contemporaneous Artificial Analysis commentary; DeepSeek cites Artificial Analysis DeepSeek pricing tables from April 2026.

Model · Approx. vendor output tier (USD per 1M output tokens)
  • Claude Opus (latest list tier AA cited alongside Opus 4.6) · $25
  • Claude Sonnet 4.6 (Sonnet-tier list output) · $15
  • DeepSeek V4 Pro · $3.48
  • DeepSeek V4 Flash · $0.28

Translating those vendor anchors into Critique stacks is why we lean on DeepSeek V4 Flash for tight supervision—you can amortize stalls without financing them like a flagship rollout. Fifty percent off Critique credit floors until the promo ends simply compresses adoption risk on both DeepSeek tiers while teams trial the watchdog.
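To make the amortization argument concrete, here is back-of-envelope arithmetic at the Flash list rates quoted above ($0.14 input / $0.28 output per million tokens). The per-check token counts are assumptions for illustration.

```python
FLASH_IN_PER_M = 0.14    # USD per 1M input tokens (list rate above)
FLASH_OUT_PER_M = 0.28   # USD per 1M output tokens (list rate above)

def supervision_cost(checks: int, in_tokens: int, out_tokens: int) -> float:
    """Raw vendor token cost in USD for a supervision loop, before any promo."""
    usd_in = checks * in_tokens / 1_000_000 * FLASH_IN_PER_M
    usd_out = checks * out_tokens / 1_000_000 * FLASH_OUT_PER_M
    return usd_in + usd_out
```

At, say, 40 stall checks per review with roughly 4K log tokens in and 200 tokens out per check, the loop costs about two and a half cents at list rates, which is why constant supervisory chatter stays affordable.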

This is exactly why Critique standardized on DeepSeek V4 Flash for session supervision—not because it defeats Sonnet head-to-head, but because the intelligence density per marginal dollar aligns with infra that must chatter constantly. Critique couples that choice with deterministic board traces and a fallback path so stalls never disguise themselves as polite thinking. Promo pricing on Flash keeps the supervisory lane cheap enough that every team can justify turning continuity on everywhere, while Pro inherits the heavier GDPval-AA numbers when you escalate synthesis lanes.

Below we zoom into strictly DeepSeek-family curves so you can see how reasoning modes intersect with token throughput—the detail that persuaded us to bias the supervisor toward Flash's reasoning High profile instead of blindly maxing reasoning budget.

GDPval-AA — agentic Elo · DeepSeek V4 modes

Higher is better. Source: Artificial Analysis April 2026. Flash (High) beats Flash (Max) at lower token cost — showing the efficiency story is real.

V4 Flash (High) outscores V4 Flash (Max) on GDPval-AA while using roughly half the output tokens. "More reasoning budget" is not the same as "better decisions."

Same architecture, two roles

Flash for supervision — Pro for high-stakes synthesis

You want both profiles in the stack. Flash fires often and cheaply on the control loop. Pro activates when a single wrong judgment is expensive.

Metric · V4 Flash — Supervisor · V4 Pro — Lead / Flagship
  • GDPval-AA Elo (Reasoning, High) — Artificial Analysis April 2026: 1414 · 1558
  • MoE active params / call — throughput-first vs depth-first: 13B · 49B
  • Context window class — 8× the old V3.2 128K class; both handle full traces: 1M tokens · 1M tokens
  • Critique credit floor — May 2026 launch window discounts through Tuesday, May 19, 2026 (UTC cutoff): 1 (0.5 promo) · 3 (1.5 promo)
  • Primary role in Critique — session brain (stall detection, nudge, abort) · merge-grade synthesis (security, billing, concurrency PRs)
GDPval-AA output tokens — reasoning mode efficiency

Total output-token volume on the benchmark. Not a quality proxy—a reminder that efficient reasoning is a feature.

Artificial Analysis's data. Flash (High) wins on score while spending half the tokens of Flash (Max). Critique routes Flash in High mode for the supervisor by default.

What this costs on Critique

DeepSeek V4 — Flash and Pro on the credit ladder

Credits meter complete review journeys: orchestration, specialist fan-out, depth routing, failover. Anchor against vendor list rates to understand the raw token economics underneath.

DeepSeek
Throughput tier
DeepSeek V4 Flash

Session supervisor, specialist fan-out, and volume reviews. The always-on lane.

Critique floor · 1 cr / run
DeepSeek API — input / 1M · $0.14
DeepSeek API — output / 1M · $0.28
Cache-hit input / 1M · $0.028
DeepSeek
Capability tier
DeepSeek V4 Pro

Open-weights GDPval-AA leader. Routes automatically when risk is high.

Critique floor · 3 cr / run
DeepSeek API — input / 1M · $1.74
DeepSeek API — output / 1M · $3.48
Cache-hit input / 1M · $0.145

Vendor prices from Artificial Analysis April 2026 launch notes; hosted routes vary. Critique credits meter full journeys (orchestration, specialists, depth). Fifty-percent promo floors halve billed Flash / Pro credits through the advertised UTC cutoff without changing downstream vendor lists.
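The promo arithmetic is simple enough to state as code. A sketch, assuming the catalog floors quoted above; the function and map names are illustrative.

```python
# Catalog credit floors quoted above; map keys are illustrative.
CREDIT_FLOORS = {"deepseek-v4-flash": 1.0, "deepseek-v4-pro": 3.0}

def billed_credits(model: str, promo_active: bool) -> float:
    """Halve the billed floor while the launch window is open."""
    floor = CREDIT_FLOORS[model]
    return floor / 2 if promo_active else floor
```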

WHAT YOU ACTUALLY SEE

Open a PR on a repository with Critique installed. Watch the live board. You will see beat events arriving every few seconds as the review workspace makes progress: files read, test probes run, findings staged. If the workspace goes quiet past the stall threshold, a supervisor decision entry appears on the board—labeled, timestamped, with the action taken and the reason. If a nudge fires and the workspace recovers, the beats resume. If the supervisor aborts cleanly, the board shows that too. Nothing is ambiguous.
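A supervisor decision entry of the kind described above might serialize like this. The schema is an assumption for illustration, not Critique's published board format.

```python
import datetime as dt
import json

def board_entry(action: str, reason: str, now: dt.datetime) -> str:
    """Serialize a labeled, timestamped supervisor decision for the board."""
    return json.dumps({
        "kind": "supervisor_decision",
        "timestamp": now.isoformat() + "Z",
        "action": action,   # wait | nudge | abort
        "reason": reason,
    })
```

Because every decision lands as a structured record rather than free text, the board stays queryable and auditable after the fact.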

If the whole run goes cold—no beats for long enough that recovery is moot—the platform watchdog fires independently. The run status updates to failed with an explicit message. Another board entry logs the decision. A fallback review starts automatically, routed through the GitHub API path instead of the sandbox, so the PR still gets a verdict before the merge window closes.

Three watchers, three time horizons

Each layer handles a different kind of failure. Together they close the gaps.

Layer · Time horizon · Success signal
  • Review workspace (lead) · turn-scale seconds · merge-grade findings, evidence pack, structured JSON artifact
  • Session supervisor — Flash · stall interval (~60–180 s) · stall detected and recovered without restarting context, or cleanly aborted
  • Platform watchdog · run-level (~30 min) · run marked failed with explicit reason; fallback path queued and publishes

All three layers write structured board entries. Every outcome is auditable.

What silent reviews cost
  • One ghost review trains the team to treat the bot as decorative
  • Security fixes wait on merges that never finalize
  • On-call time leaks into babysitting automation that should be infrastructure
  • Leadership loses confidence in agentic workflows broadly
What guaranteed continuity buys
  • Every stall becomes a labeled, auditable event instead of a mystery
  • Credit burn stops accumulating on infinite hangs
  • Teams trust automation because failure modes are explicit and bounded
  • PRs get verdicts even when the sandbox has a bad day

COMMON QUESTIONS

Does the supervisor write the review itself?
No. It monitors execution health and decides: wait, nudge, or abort. Review comments, findings, and merge-grade verdicts are the job of the lead model you configure for depth, not the supervisor.
How does the launch promo work?
Through Tuesday, May 19, 2026 (UTC cutoff), Critique halves the catalog credit floors for both DeepSeek V4 Flash (0.5 credits versus the usual single credit) and V4 Pro (1.5 credits versus the usual triple-credit flagship lane). Catalog billing applies it automatically; there is no hidden workspace flag.
Does this replace human review?
No. It replaces ambiguous silence and removes the question of whether the bot is working. Merge decisions stay with your team. Critique makes the automation behave like infrastructure, with predictable outcomes and explicit failures, rather than roulette.
Why cite GDPval-AA instead of SWE-Bench?
SWE-Bench tests single-file patch generation. A PR review is closer to multi-step agentic work: long horizons, tool failures, recovery moments. GDPval-AA is a serious proxy for that reality. We cite it directionally, with the caveats it deserves.

Ship reviews that finish

Give Critique a PR that used to scare your bots. The board stays noisy in the right ways—and silent never.

Run Critique on your next PR
