Product · 16 min read · Repath Khan

Critique v5.1: The Review System Starts Reviewing Itself

v5.1 is about the part of AI code review that sounds least glamorous and matters most: measuring review quality, learning from real feedback, and making the OpenCode runtime harder to break.


Agent harness · BYOA · Composer 2.5

50-200: Historical PRs in the new review evaluation gate
0: Target finding count. Publish only what clears the bar.
0: Independent critic pass outside the OpenCode session
Daily: OpenCode/E2B smoke covering server, session, JSON, usage, shutdown

Most AI review products get better by adding more model. v5.1 moves in the other direction. We are making the review system more measurable, more memory-aware, and more willing to say nothing when the evidence is thin. That sounds smaller than a new model launch. It is not.

The uncomfortable truth is that code review automation fails in two ways at once. It misses the defect a senior engineer would have caught, and it invents a comment that makes the author stop trusting the reviewer. Shipping faster only helps if both errors are visible. Otherwise the product feels alive right up until the team quietly stops reading it.

v5.0 made Critique feel like a platform: marketplace skills, merge policy in plain English, signed passport exports, Cursor handoff, persistent Coding Agent API sessions, repo-first dashboard, and Insights. The obvious follow-up would have been another surface. Another dashboard tab. Another model lane. Instead, the product pushed us toward a harder question: how do we know the reviewer itself is improving?

You cannot answer that by reading a single beautiful review. You answer it by replaying old PRs, scoring accepted comments against false positives, counting missed human issues, measuring cost and time-to-first-comment, and tracking whether a comment later disappeared into the “ignored” pile. That is the work v5.1 starts.

What changed in the mental model

v5.1 moves Critique from “agent generated a review” toward “review quality is a release gate.”

Old pressure vs. v5.1 pressure

Quantity
Old pressure: Aim for a healthy number of findings.
v5.1 pressure: Publish up to N high-confidence findings. There is no target count.

Validation
Old pressure: The same OpenCode session judges whether it is done.
v5.1 pressure: An independent critic only checks actionability, evidence, and line pinning.

Learning
Old pressure: Repository memory exists, but mostly after the run.
v5.1 pressure: Feedback-derived rules are rendered into the next review prompt with scope and decay.

Failure
Old pressure: A stuck sandbox becomes a generic stale path.
v5.1 pressure: Failures classify as clone/auth, checkout, OpenCode missing, provider timeout, OOM, invalid artifact, missing output, and more.

The new evaluation harness is deliberately boring. It samples historical review runs, turns findings and feedback into a replay set, then reports precision, false positives, accepted comments, stale failures, missed human comments where we have them, time-to-first-comment, and cost. It can fail a release gate before a prompt, model, or runtime change reaches production.
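To make the report concrete, here is a minimal sketch of how a replay set could be folded into those numbers. Everything in it is an assumption for illustration: the `ReplayedRun` and `ReplayedFinding` shapes, the `summarize` function, and the choice to define precision as accepted findings over accepted plus false positives (ignored findings are left out of the ratio). The post does not describe the harness's real data model.

```typescript
// Hypothetical shapes for illustration only; not Critique's real schema.
type Outcome = "accepted" | "false_positive" | "ignored";

interface ReplayedFinding {
  runId: string;
  outcome: Outcome;
}

interface ReplayedRun {
  runId: string;
  staleFailure: boolean; // the run ended without a usable result
  secondsToFirstComment: number | null; // null when nothing was posted
  costUsd: number;
}

interface EvalReport {
  precision: number | null; // null when no finding was judged either way
  accepted: number;
  falsePositives: number;
  staleFailures: number;
  medianSecondsToFirstComment: number | null;
  totalCostUsd: number;
}

function median(values: number[]): number | null {
  if (values.length === 0) return null;
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

function summarize(
  runs: ReplayedRun[],
  findings: ReplayedFinding[],
): EvalReport {
  const accepted = findings.filter((f) => f.outcome === "accepted").length;
  const falsePositives = findings.filter(
    (f) => f.outcome === "false_positive",
  ).length;
  const judged = accepted + falsePositives;
  const latencies = runs
    .map((r) => r.secondsToFirstComment)
    .filter((s): s is number => s !== null);

  return {
    // Assumed definition: accepted / (accepted + false positives).
    precision: judged === 0 ? null : accepted / judged,
    accepted,
    falsePositives,
    staleFailures: runs.filter((r) => r.staleFailure).length,
    medianSecondsToFirstComment: median(latencies),
    totalCostUsd: runs.reduce((sum, r) => sum + r.costUsd, 0),
  };
}
```

Returning `null` for precision when nothing was judged keeps an empty sample from looking like a perfect score.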

Release gate shape

The CLI reads recent completed/failed review runs and exits non-zero when configured thresholds fail.

CRITIQUE_REVIEW_EVAL_SAMPLE_SIZE=100 \
CRITIQUE_REVIEW_EVAL_MIN_PRECISION=0.72 \
CRITIQUE_REVIEW_EVAL_MAX_STALE_FAILURES=2 \
pnpm review:evaluate

This changes how we ship. A prompt that feels more “senior” in one demo can still lose if it raises false positives across 100 old PRs. A runtime tweak that makes one run faster can still lose if it increases stale failures. v5.1 makes those tradeoffs harder to hide, including from us.
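As a rough sketch of the gate decision itself: the function below takes an environment-like record and an already-computed report, then returns the exit code a CLI could use. The two variable names come from the command above. The `GateReport` shape, the `readThreshold` and `evaluateGate` helpers, and the fallback behavior when a threshold is unset are assumptions, not the actual `review:evaluate` implementation.

```typescript
// Hypothetical report shape; the real CLI's internals are not shown in the post.
interface GateReport {
  precision: number;
  staleFailures: number;
}

interface GateResult {
  exitCode: number; // 0 when every configured threshold holds, 1 otherwise
  failures: string[];
}

// Parse a numeric threshold, falling back when the variable is unset or invalid.
function readThreshold(raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw.trim() === "") return fallback;
  const parsed = Number(raw);
  return Number.isFinite(parsed) ? parsed : fallback;
}

function evaluateGate(
  env: Record<string, string | undefined>,
  report: GateReport,
): GateResult {
  // Assumed defaults: an unset threshold never fails the gate.
  const minPrecision = readThreshold(env.CRITIQUE_REVIEW_EVAL_MIN_PRECISION, 0);
  const maxStale = readThreshold(
    env.CRITIQUE_REVIEW_EVAL_MAX_STALE_FAILURES,
    Infinity,
  );

  const failures: string[] = [];
  if (report.precision < minPrecision) {
    failures.push(`precision ${report.precision} is below ${minPrecision}`);
  }
  if (report.staleFailures > maxStale) {
    failures.push(`stale failures ${report.staleFailures} exceed ${maxStale}`);
  }
  return { exitCode: failures.length === 0 ? 0 : 1, failures };
}
```

In a Node entry point, the result would typically be wired to `process.exitCode` so CI treats a failed threshold like any other failing check.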

Critique already had Finding Memory. The missing piece was using it before the next review, not only after feedback was recorded. v5.1 derives scoped learnings from accepted findings, false positives, suppressions, incident links, and repository conventions. The rule is intentionally modest: decayed repository priors, not hard truth.

Prefer

Accepted findings become attention

If maintainers repeatedly accept a class of issue, the next run is nudged to inspect that path or concern more carefully.

Avoid

False positives become restraint

If a pattern is repeatedly dismissed or suppressed, the next run sees that context before publishing another weak version of the same claim.

Scoped

Scope matters

Rules render as repository, path, or concern-scoped priors. A frontend false positive does not poison backend security review.

Decaying

Decay matters

Old feedback loses weight. A codebase can change its conventions without being haunted forever by a stale rule.

That is the right level of ambition for a feedback system. We are not pretending the product has discovered universal laws of software. We are saying the reviewer should remember what this repository’s maintainers taught it, while staying humble enough to let those lessons age out.
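One way to picture scoped, decaying priors is a half-life weight applied to each learning before it is rendered into the prompt. The sketch below is illustrative only: the `Learning` shape, the 90-day half-life, the 0.1 cutoff, and the rendered line format are all assumptions, since the post does not specify how decay is computed.

```typescript
// Hypothetical model of a feedback-derived learning.
type Scope =
  | { kind: "repository" }
  | { kind: "path"; prefix: string }
  | { kind: "concern"; concern: string };

interface Learning {
  scope: Scope;
  direction: "prefer" | "avoid";
  note: string;
  recordedAtMs: number;
}

const HALF_LIFE_DAYS = 90; // assumption: weight halves every 90 days
const MIN_WEIGHT = 0.1; // assumption: weaker learnings are not rendered
const MS_PER_DAY = 86_400_000;

// Exponential decay: 1.0 when fresh, 0.5 after one half-life, and so on.
function decayedWeight(learning: Learning, nowMs: number): number {
  const ageDays = Math.max(0, (nowMs - learning.recordedAtMs) / MS_PER_DAY);
  return Math.pow(0.5, ageDays / HALF_LIFE_DAYS);
}

// A learning applies repository-wide, to a path prefix, or to one concern.
function applies(scope: Scope, file: string, concern: string): boolean {
  if (scope.kind === "repository") return true;
  if (scope.kind === "path") return file.startsWith(scope.prefix);
  return scope.concern === concern;
}

// Render the priors relevant to one file and concern, strongest first.
function renderPriors(
  learnings: Learning[],
  file: string,
  concern: string,
  nowMs: number,
): string[] {
  return learnings
    .filter((l) => applies(l.scope, file, concern))
    .map((l) => ({ learning: l, weight: decayedWeight(l, nowMs) }))
    .filter((entry) => entry.weight >= MIN_WEIGHT)
    .sort((a, b) => b.weight - a.weight)
    .map(
      (entry) =>
        `${entry.learning.direction.toUpperCase()} ` +
        `(weight ${entry.weight.toFixed(2)}): ${entry.learning.note}`,
    );
}
```

Because scope is checked before weight, a dismissed frontend pattern never reaches a backend security review, and the cutoff lets old lessons age out without anyone deleting them.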

A count target sounds useful until it quietly trains the agent to fill the quota. v5.1 removes that pressure from the OpenCode prompt. The new instruction is blunt: publish up to the quality budget, defaulting to twelve, and publish zero if the evidence does not support a concrete issue.

A finding has to earn the comment
1. Is it anchored? It needs a changed file and right-side line. If GitHub cannot pin it, it is not an inline finding.
2. Can it happen? It needs a realistic trigger: input, state, order, caller, permission, environment, or command path.
3. Does it matter? It needs concrete impact on correctness, security, data integrity, performance, or operations.
4. Can the author act? It needs the smallest plausible fix direction, not a broad refactor wish.
5. Did the reviewer see evidence? It needs command output, static caller evidence, or line-level contract evidence.
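The five questions, combined with a budget and no target count, can be sketched as a filter followed by a cap. This is a simplification under stated assumptions: the `Candidate` shape and the `earnsComment` and `selectFindings` helpers are hypothetical, and checking that a field is non-empty is only a crude stand-in for the judgment the real prompt asks the agent to make. The default of twelve is the quality budget named earlier in this post.

```typescript
// Hypothetical candidate shape; field names are illustrative.
interface Candidate {
  title: string;
  file: string | null; // changed file, if the finding can be pinned
  rightSideLine: number | null; // line on the right side of the diff
  trigger: string; // realistic way the issue happens
  impact: string; // concrete consequence
  fixDirection: string; // smallest plausible fix
  evidence: string[]; // command output, caller evidence, contract evidence
  confidence: number; // 0..1, used only for ordering
}

const DEFAULT_QUALITY_BUDGET = 12;

// A candidate must answer all five questions to be published.
function earnsComment(c: Candidate): boolean {
  const anchored =
    c.file !== null && c.rightSideLine !== null && c.rightSideLine > 0;
  const canHappen = c.trigger.trim().length > 0;
  const matters = c.impact.trim().length > 0;
  const actionable = c.fixDirection.trim().length > 0;
  const evidenced = c.evidence.length > 0;
  return anchored && canHappen && matters && actionable && evidenced;
}

// Publish up to the budget. There is no minimum, so the result may be empty.
function selectFindings(
  candidates: Candidate[],
  budget: number = DEFAULT_QUALITY_BUDGET,
): Candidate[] {
  return candidates
    .filter(earnsComment)
    .sort((a, b) => b.confidence - a.confidence)
    .slice(0, budget);
}
```

The important property is the absence of any padding step: when no candidate passes, `selectFindings` returns an empty list and the review says nothing.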

We added an independent critic pass, but not to re-review the PR. That would turn quality control into another model debate. The critic only sees the artifact and asks a narrower question: are these findings actionable, line-pinned, evidence-backed, and inside budget?

This matters because weak review comments often sound plausible in the same session that produced them. The independent pass creates a second boundary. It can request one targeted follow-up: fix the weak findings or drop them. That is a small mechanism, but it attacks a common failure mode directly: the agent validating its own vibes.
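A minimal sketch of that boundary, assuming hypothetical names throughout (`ArtifactFinding`, `passesCritic`, `criticPass`, and a `followUp` callback standing in for the single targeted request): the critic sees only the artifact, asks for at most one revision of the weak findings, and drops whatever is still weak afterwards.

```typescript
// Hypothetical artifact entry; the critic never sees the PR itself.
interface ArtifactFinding {
  id: string;
  line: number | null;
  evidence: string[];
  fixDirection: string;
}

// The narrow question: line-pinned, evidence-backed, actionable.
function passesCritic(f: ArtifactFinding): boolean {
  const linePinned = f.line !== null && f.line > 0;
  const evidenceBacked = f.evidence.length > 0;
  const actionable = f.fixDirection.trim().length > 0;
  return linePinned && evidenceBacked && actionable;
}

// Stand-in for the one targeted follow-up sent back to the review session.
type FollowUp = (weak: ArtifactFinding[]) => ArtifactFinding[];

function criticPass(
  findings: ArtifactFinding[],
  budget: number,
  followUp: FollowUp,
): ArtifactFinding[] {
  const strong = findings.filter(passesCritic);
  const weak = findings.filter((f) => !passesCritic(f));

  // At most one follow-up, and only when something is actually weak.
  const revised = weak.length > 0 ? followUp(weak) : [];

  // Findings that are still weak after the follow-up are dropped.
  const rescued = revised.filter(passesCritic);
  return [...strong, ...rescued].slice(0, budget);
}
```

Keeping the follow-up to a single round is the design point: the critic can tighten the artifact, but it cannot start a debate with the session that produced it.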

OpenCode inside E2B is powerful because it can read, run, write scratch tests, and inspect the repo like a real reviewer. It is also a real runtime, which means it can drift. Tools can disappear. Node can change. The server can fail health. A model provider can stall. A command can hang. Pretending those are prompt problems is lazy engineering.

This is the part of shipping AI infrastructure that rarely gets a launch thread. But it is exactly where trust is won. If a review fails, the operator should know whether the repository token failed, the checkout failed, OpenCode was missing, the server never became healthy, the provider timed out, the sandbox hit OOM, a command hung, the artifact was invalid, or the output never appeared.
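To illustrate what classification means in practice, the sketch below maps a set of run signals to one failure class, checking them roughly in the order a run progresses. The class names mirror the list above, but the `RunSignals` fields, the check order, and the `classifyFailure` function are assumptions; the real classifier presumably works from richer sandbox and provider telemetry and covers more cases.

```typescript
// Class names follow the failure list in this post; "unknown" cases are omitted.
type FailureClass =
  | "clone_auth"
  | "checkout"
  | "opencode_missing"
  | "server_unhealthy"
  | "oom"
  | "provider_timeout"
  | "command_hung"
  | "missing_output"
  | "invalid_artifact";

// Hypothetical signals a runner might collect for one review run.
interface RunSignals {
  cloneOk: boolean;
  checkoutOk: boolean;
  opencodeFound: boolean;
  serverHealthy: boolean;
  oomKilled: boolean;
  providerTimedOut: boolean;
  commandTimedOut: boolean;
  artifact: string | null; // raw JSON text, or null if nothing was written
}

// Returns the first matching class, or null when the run looks healthy.
function classifyFailure(s: RunSignals): FailureClass | null {
  if (!s.cloneOk) return "clone_auth";
  if (!s.checkoutOk) return "checkout";
  if (!s.opencodeFound) return "opencode_missing";
  if (!s.serverHealthy) return "server_unhealthy";
  if (s.oomKilled) return "oom";
  if (s.providerTimedOut) return "provider_timeout";
  if (s.commandTimedOut) return "command_hung";
  if (s.artifact === null) return "missing_output";
  try {
    JSON.parse(s.artifact);
  } catch {
    return "invalid_artifact";
  }
  return null;
}
```

Checking earlier stages first matters: a run that never cloned will also have no output, and the operator needs to hear about the token, not the missing artifact.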

v5.1 also tightens the Coding Agent side of the product. The public API page now matches the real mental model: Builder is the UI; the Coding Agent API is the HTTP entry point for the same job system. Runs expose preview, idempotency, cursor pagination, status polling, cancel, model discovery, safety caps, webhook signatures, and draft PR handoff into review/passport surfaces.

That sounds like endpoint work, but the product meaning is bigger. Automation teams do not need a cute demo. They need budget preview before a sandbox starts, retries that do not duplicate work, streams that resume, controls that stop work, and a PR handoff that returns to the same evidence system as every other Critique review.
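"Retries that do not duplicate work" is the idempotency guarantee. The in-memory sketch below shows the idea only; it is not the Coding Agent API. The `RunStore` class, the `Run` shape, the status values, and the key handling are all hypothetical, and a real service would persist keys and compare request bodies.

```typescript
// Hypothetical run record; statuses are illustrative.
interface Run {
  id: string;
  prompt: string;
  status: "queued" | "cancelled";
}

class RunStore {
  private readonly runsByKey = new Map<string, Run>();
  private nextId = 1;

  // Creating a run twice with the same key returns the original run.
  create(idempotencyKey: string, prompt: string): { run: Run; replayed: boolean } {
    const existing = this.runsByKey.get(idempotencyKey);
    if (existing !== undefined) {
      return { run: existing, replayed: true };
    }
    const run: Run = { id: `run_${this.nextId}`, prompt, status: "queued" };
    this.nextId += 1;
    this.runsByKey.set(idempotencyKey, run);
    return { run, replayed: false };
  }

  // Cancelling is also safe to repeat: a second cancel changes nothing.
  cancel(idempotencyKey: string): boolean {
    const run = this.runsByKey.get(idempotencyKey);
    if (run === undefined) return false;
    run.status = "cancelled";
    return true;
  }

  get size(): number {
    return this.runsByKey.size;
  }
}
```

A client that times out and retries with the same key gets the same run back instead of starting a second sandbox, which is what makes automated retries safe to leave on.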

There is a temptation after a large platform release to pause, rename the roadmap, and turn the next month into ceremony. v5.1 is our refusal to do that. v5.0 shipped the visible platform. v5.1 ships the pressure system underneath it: evaluate the reviewer, teach it from feedback, constrain its comments, critic-check its artifact, smoke-test its runtime, and classify its failures.

That is how Critique will keep moving. Some releases will look like new surfaces. Some will look like reliability. Some will look like a harsh instruction that tells an agent to publish nothing unless it can prove the issue. All of them count. The product is not the marketing page. The product is whether teams can trust what happens at the merge boundary on a bad Tuesday.

Yes. The headline is review reliability: evaluation harness, feedback learnings, quality budget, critic pass, runtime smoke, and failure classification. Coding Agent API and dashboard handoff improvements are part of the same operator story.
Often, yes. That is intentional. The product should post the findings that clear the evidence bar, not fill a target count.
No. v5.1 renders scoped, editable, decaying repository learnings into the review prompt. It uses feedback as local review context, not a claim about global model training.
No. v5.1 hardens the current E2B/OpenCode path first: version recording, smoke tests, and failure classes. Modal remains a serious runtime direction to evaluate, but the immediate product need is making today’s path measurable and stable.

Read the v5.1 ship log

The changelog has the operator-facing details for PR review, Coding Agent API, marketplace attribution, dashboard navigation, and platform connections.
