Repath Khan · 12 min read

Critique PR Review v5: Diff-First Runs & Multi-Turn OpenCode

Review runs read like a review again: diff and findings up front, OpenCode as a resumable session, and a path to a cheap thinker that nudges weak or stuck agent work before you accept an artifact.

  • Diff first: Review runs lead with the PR change set and where findings landed.
  • Multi-turn: The same OpenCode session can take targeted follow-ups before the artifact is accepted.
  • Clean stream: Live activity defaults to readable progress; raw events stay one layer down.
  • Thinker-ready: Architecture points at a cheap controller model for shallow or stuck runs.

Why v5 matters

When the default screen is a firehose of tool starts, permission prompts, and JSON blobs, the product trains people to debug the agent instead of judging the pull request. That is the wrong default for merge decisions. v5 assumes the reader’s job is review: understand the change, scan findings, and only then dig into execution detail if something looks off.

Nothing here removes depth. Telemetry, Remedy export, the full OpenCode stream, and sandbox diagnostics are still there. They simply stop competing with the diff for attention on first paint.

What changed on the review-run page

The surface is reorganized around three layers of evidence: the change itself (diff and touched files), what Critique concluded (structured findings and severity), and what the agent did in plain language (OpenCode highlights rather than raw step noise). Operators who need the deep trace can still open it; everyone else gets a page that reads top-to-bottom like a review memo.

Before v5
  • Raw sandbox event stream dominated the first screen
  • Recoverable tool errors read like hard failures
  • Findings, diff context, and telemetry fought for the same fold
v5
  • Original PR diff and model findings lead
  • OpenCode work is summarized in human language first
  • Full diagnostics remain one deliberate click away

OpenCode as a session, not a single prompt

One-shot run

Send one prompt, wait for one artifact, hope the JSON is valid and the evidence is complete. If anything is thin, you discard or rerun from zero — losing context and burning time.

Session run

A headless server-backed session gets the initial review prompt, then the runner can inspect the artifact and send follow-ups in the same conversation before accepting the result — closer to how a senior engineer actually works.

Real review is iterative. If the artifact is malformed, skips command evidence, ignores available specialist roles, or hand-waves tests, Critique can ask OpenCode to continue in-place instead of treating the run as binary success or failure.
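As a sketch of that repair loop, the following shows how a runner might check an artifact and continue in the same session instead of rerunning from zero. The `OpencodeSession` class and check helpers here are illustrative stand-ins, not the real Critique or OpenCode client API.

```python
# Hypothetical sketch of a bounded multi-turn review loop.
# OpencodeSession stands in for a headless, server-backed session;
# the real client would talk to an OpenCode server.
from dataclasses import dataclass, field

@dataclass
class OpencodeSession:
    transcript: list = field(default_factory=list)

    def send(self, prompt: str) -> dict:
        # A real implementation would return the updated review artifact;
        # this stub just records the conversation.
        self.transcript.append(prompt)
        return {"findings": [], "evidence": []}

def artifact_problems(artifact: dict) -> list[str]:
    """Cheap deterministic checks run before accepting the artifact."""
    problems = []
    if not artifact.get("findings"):
        problems.append("no findings: confirm the diff was actually reviewed")
    if not artifact.get("evidence"):
        problems.append("no command evidence: show the commands you ran")
    return problems

def run_review(session: OpencodeSession, prompt: str, max_followups: int = 2) -> dict:
    artifact = session.send(prompt)
    for _ in range(max_followups):
        problems = artifact_problems(artifact)
        if not problems:
            break
        # Continue in-place in the same conversation rather than
        # treating the run as a binary success or failure.
        artifact = session.send("Please address: " + "; ".join(problems))
    return artifact
```

The key property is that follow-ups reuse the session transcript, so the agent keeps everything it already saw and ran.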

Next: a thinker outside the sandbox

Deterministic checks catch structural failures, but they do not read tone, repetition, or “looks busy but shallow” behavior in the stream. The next layer is a small, cheap model that watches the live OpenCode output, classifies whether the agent is stuck or under-evidenced, and drafts the next best follow-up — or signals abort when the run is going nowhere.

DeepSeek V4 Flash is a practical default for that role: already in the Critique catalog, inexpensive enough to run often, and strong enough for meta-review — missing evidence, repeated claims without new commands, tests claimed but not shown, and stalls where the last meaningful beat was too long ago.

What the thinker should inspect
  • Latest OpenCode messages, tool calls, errors, and command timeline
  • Current review artifact, schema validation, and finding depth
  • Elapsed time since the last meaningful activity beat
  • Whether specialists were invoked when policy made them available
What the thinker should decide
  • Accept the artifact as good enough for the merge gate
  • Send a targeted follow-up into the same OpenCode session
  • Narrow from exploration to final synthesis
  • Abort or degrade gracefully when stuck or over budget
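The inputs and decisions above can be sketched as a small decision surface. This is an assumption-laden illustration: the signal names, thresholds, and deterministic rules stand in for what would actually be a cheap model call, and none of it is the shipped Critique interface.

```python
# Illustrative sketch of the thinker's decision surface; signals,
# thresholds, and rules are hypothetical stand-ins for a model call.
from dataclasses import dataclass
from enum import Enum

class Action(Enum):
    ACCEPT = "accept"        # good enough for the merge gate
    FOLLOW_UP = "follow_up"  # targeted prompt into the same session
    NARROW = "narrow"        # move from exploration to final synthesis
    ABORT = "abort"          # stuck or over budget; degrade gracefully

@dataclass
class RunSignals:
    artifact_valid: bool       # schema validation passed
    finding_depth: int         # e.g. findings backed by command evidence
    seconds_since_beat: float  # elapsed time since last meaningful activity
    followups_left: int        # remaining follow-up budget

def decide(s: RunSignals) -> Action:
    """Deterministic stand-in for the cheap controller model."""
    if s.seconds_since_beat > 300 or s.followups_left == 0:
        # Stalled or out of budget: stop spending, salvage what exists.
        return Action.NARROW if s.artifact_valid else Action.ABORT
    if s.artifact_valid and s.finding_depth >= 3:
        return Action.ACCEPT
    return Action.FOLLOW_UP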

What we are optimizing for

The durable goal is not more model calls for their own sake. It is fewer silent failures, fewer weak reviews, and explicit state: what the agent saw, what it ran, what it found, what it skipped, and why the controller accepted the artifact. v5 keeps OpenCode on repository work, Critique on control plane, and leaves room for a cheap thinker that keeps the session honest.
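The "explicit state" goal above could be captured in a record roughly like the following; the field names are hypothetical, not Critique's actual schema.

```python
# Hypothetical shape for the explicit run state v5 aims to record.
from dataclasses import dataclass, field

@dataclass
class RunRecord:
    files_seen: list[str] = field(default_factory=list)    # what the agent saw
    commands_run: list[str] = field(default_factory=list)  # what it ran
    findings: list[dict] = field(default_factory=list)     # what it found
    skipped: list[str] = field(default_factory=list)       # what it skipped, and why
    acceptance_reason: str = ""                            # why the controller accepted
```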

Quick sanity check after upgrading
  1. Does the first screen answer “what changed?” without scrolling past logs?
     You should see the PR diff and finding highlights immediately; expand diagnostics only when debugging.
  2. When a follow-up fires, does it stay in one OpenCode session?
     Check the run detail: continuation should reuse the session rather than starting a blank run, until the follow-up budget is exhausted.
  3. Is the live feed readable by default?
     Progress should scan quickly; switch to the raw stream when you need verbatim tool payloads.
FAQ

What does v5 change?
It reorganizes the review-run experience so the PR diff and structured findings lead, OpenCode activity is summarized before raw logs, and the runner can send bounded follow-ups in the same headless session before accepting an artifact.
Why keep one session instead of rerunning?
Reruns typically discard conversational context. v5 keeps a single server-backed session so targeted follow-ups can fix a thin artifact, schema issue, or missing evidence without throwing away everything the agent already did.
How are follow-up turns bounded?
The environment variable CRITIQUE_OPENCODE_MAX_FOLLOWUPS bounds follow-up turns (default two). That prevents runaway cost while still allowing a short repair loop inside one session.
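A minimal sketch of how that budget could be read: the variable name comes from the post, but the parsing and fallback logic here are assumptions.

```python
# Hypothetical budget parser; only the env var name is from the post.
import os

def followup_budget(default: int = 2) -> int:
    """Read the follow-up budget, falling back to the default of two."""
    raw = os.environ.get("CRITIQUE_OPENCODE_MAX_FOLLOWUPS")
    if raw is None:
        return default
    try:
        return max(0, int(raw))  # clamp negatives to zero follow-ups
    except ValueError:
        return default  # unparseable values fall back rather than crash
```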
Why DeepSeek V4 Flash as the thinker?
It is a catalog-ready, low-cost model that is strong enough to read an agent stream and judge whether evidence and depth are adequate. The architecture is model-flexible; Flash is the pragmatic starting point for a frequent meta-review pass.
Does v5 remove any diagnostics?
No. Raw diagnostics, telemetry, and the full stream remain available. They are no longer the default first impression, so merge reviewers see review evidence before operator detail.
How do operators get the old deep view?
Use the advanced or raw views on the review-run page for the same information you had before; pin that workflow in your runbook if operators need it on every review.

Try the v5 review run experience

Connect a repository, open a PR review run, and see diff-first layout plus summarized OpenCode activity. Sign up takes under a minute.

Get started →
