20 min read · Repath Khan

Critique PR Review v4.1: Execution Depth, Live Observability, and Operator Secrets

The sandbox review no longer apologises for skipping tests. GitHub gets many small inline comments. The dashboard shows what the agent actually did. And you can finally hand it DATABASE_URL without handing it to the world.

PR Review v4.1 · Execution + observability + control plane

critique.sh

Deeper than a short bot comment, quieter than a manifesto

v4.1 is a tight release. It chases a single product truth: a sandbox review should feel as specific as a great human review on GitHub, and as legible in the control plane as any other production workload.

Exploratory sandbox

The OpenCode agent runs with edit permission inside an ephemeral E2B workspace. It is instructed to git diff, search, run existing tests, author throwaway tests under controlled paths, and record a full command timeline so the report is grounded in execution, not vibes.
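The "record a full command timeline" contract can be sketched as a thin wrapper around shell execution. All names here (TimelineEntry, runRecorded) are illustrative, not Critique's actual API; the point is that every command and its exit code become evidence in the report.

```typescript
// Minimal sketch of a command timeline: every shell command the agent
// runs is recorded so the final report is grounded in execution.
import { execSync } from "node:child_process";

interface TimelineEntry {
  cmd: string;
  exitCode: number;
  durationMs: number;
}

const timeline: TimelineEntry[] = [];

function runRecorded(cmd: string): string {
  const start = Date.now();
  let exitCode = 0;
  let output = "";
  try {
    output = execSync(cmd, { encoding: "utf8" });
  } catch (err: any) {
    exitCode = err.status ?? 1; // a failing command is still evidence
    output = err.stdout ?? "";
  }
  timeline.push({ cmd, exitCode, durationMs: Date.now() - start });
  return output;
}

// Typical exploratory sequence: inspect the change set, then run tests.
runRecorded("git diff --stat HEAD~1 2>/dev/null || true");
runRecorded("echo simulated-test-run");
```

A review that cannot show a non-empty timeline on a non-trivial repository has, by this contract, not done its job.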

GitHub-shaped review density

Structured output supports a large findings array and longer bodies. The GitHub integration raises the inline comment budget and prioritises by severity so the final review reads like a human left many small, file-pinned notes instead of one vague summary.

Live activity and a canonical feed

While the session runs, we poll OpenCode session messages and surface tool calls, reasoning chunks, and errors on the review run. Open the dedicated live page to watch the model think and tools execute in order.

Operator secrets for real tests

Per-repository secrets are stored AES-256-GCM at rest, validated for safe names, and merged into the sandbox process environment so DATABASE_URL and similar values exist only when you choose to provide them.

  • 40: Max inline review comments to GitHub (prioritised by severity; INFO can ship)
  • Live: OpenCode session message polling with tool + reasoning + error stream
  • AES-GCM: At-rest encryption for per-repository secrets before sandbox injection
  • Edit on: Sandbox write permission for scratch tests and throwaway files (still no PR commit)

Why v4.1 exists

The failure mode of “smart but shallow”

An automated review that sounds intelligent in aggregate but thin in the diff is worse than a noisy linter. It trains teams to ignore the bot. The successful pattern on GitHub has always been the same: many small, well-placed comments that map cleanly to specific lines, each carrying enough local context to be actionable. v4.1 is our push to make the sandbox path behave like that pattern even when the underlying model is doing heavy synthesis.

The other half is honesty about execution. If the runtime can clone the PR and run code, a review that only narrates the diff is leaving signal on the table. v4.1 rewrites the agent contract so “I did not run tests” is not an acceptable end state. The sandbox is temporary, but while it lives it is a real environment.

Exploratory review in the sandbox

From read-only opinion to verifiable work

We keep the same safety spine: the agent must not exfiltrate operator secrets in review text, and it must not mutate the remote in ways that are out of policy. Inside those rails, the lead model and the specialist subagents are now expected to behave like people who were handed a laptop with the repo already checked out at the right merge base.

Concretely, that means reading the change set with git and ripgrep, running existing tests or targeted subsets when the project supports it, and when necessary, creating scratch tests or one-off scripts in designated locations. The OpenCode config grants edit rights because an exploratory review without the ability to write a small failing test is a review that is guessing about edge cases.

Before v4.1 (stylised pain)
  • Single long summary on the PR
  • A few high-level findings with thin line anchors
  • Optional disclaimer that tests were not run
v4.1 (target state)
  • Larger JSON report with more findings and longer evidence
  • Many small GitHub inline comments aligned to hunks and severity
  • Command timeline and optional live tool feed while the run is in flight

Structured output that can carry the load

Schema limits are a product feature

If you cap the body at a few kilobytes and the findings list at a handful of entries, the model is forced to compress away nuance. We raised the limits on the sandbox JSON schema on purpose: longer per-finding text, more suggested tests, a wider command timeline, and more subagent detail. The lead model is still constrained by the skill definition, but the schema is no longer the bottleneck that forces everything through a round of “please shorten.”
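As a sketch, the shape of the report looks something like the following. The field names and the specific caps are assumptions for illustration; the post only says the limits were raised, not what they are.

```typescript
// Illustrative shape of the sandbox report. Names and limits are
// assumptions; the point is that the caps are large enough that the
// container does not force the model to compress away nuance.
interface Finding {
  title: string;
  severity: "FAIL" | "WARNING" | "INFO";
  body: string; // long-form evidence, not a one-liner
  file?: string;
  line?: number;
}

interface SandboxReport {
  summary: string;
  findings: Finding[];
  commandTimeline: string[]; // every command the agent ran
  suggestedTests: string[];
}

// Hypothetical v4.1-style caps.
const LIMITS = { findings: 60, bodyChars: 8000, timeline: 200 };

function validate(report: SandboxReport): string[] {
  const errors: string[] = [];
  if (report.findings.length > LIMITS.findings) errors.push("too many findings");
  for (const f of report.findings) {
    if (f.body.length > LIMITS.bodyChars) errors.push(`body too long: ${f.title}`);
  }
  if (report.commandTimeline.length > LIMITS.timeline) errors.push("timeline too long");
  return errors;
}
```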

The schema is not an academic exercise. It is the join between a messy runtime and a clean GitHub surface. A bigger container lets the model keep intermediate clarity without shoving everything into a single paragraph that humans will not read on the PR page.

GitHub: match the way humans expect reviews to look

Many inline comments, ordered by impact

We changed the GitHub integration so a single “wall of text” review is not the only shape. The inline builder now has a higher budget for comments and it ranks findings with FAIL and WARNING first, with INFO still eligible to land as a line note when the diff supports it. The intent is the screenshot you have seen from strong automated audits: a summary at the top, then a set of small threads anchored to the exact lines that matter.
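The ranking-plus-budget logic is simple enough to sketch directly. The 40-comment budget matches the number quoted earlier in this post; the type and function names are illustrative.

```typescript
// Sketch of the inline-comment builder: rank findings by severity,
// then cut to the budget. INFO is still eligible when room remains.
type Severity = "FAIL" | "WARNING" | "INFO";

interface InlineComment {
  path: string;
  line: number;
  severity: Severity;
  body: string;
}

const SEVERITY_RANK: Record<Severity, number> = { FAIL: 0, WARNING: 1, INFO: 2 };
const MAX_INLINE_COMMENTS = 40;

function selectInline(comments: InlineComment[]): InlineComment[] {
  return [...comments]
    .sort((a, b) => SEVERITY_RANK[a.severity] - SEVERITY_RANK[b.severity])
    .slice(0, MAX_INLINE_COMMENTS);
}
```

Because the sort is stable, two FAILs keep their original order; only severity tiers are rearranged before the budget is applied.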

Live activity: the operator can see the machine at work

Session messages, not a mystery spinner

Long-running model work suffers from a classic UX problem: the system is busy, and the human only knows that something is “still running.” v4.1 adds a poller on the OpenCode server side of the house that fetches new session messages while the main request is in flight, normalises them into a typed activity object, and ships them through the same progress channel that already writes agent board entries.
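A minimal version of that poller looks like the following. The message shape, the classification heuristic, and the function names are all assumptions for the sketch; the real poller inspects structured message parts from the OpenCode server rather than string prefixes.

```typescript
// Sketch of the session-message poller: while the main request is in
// flight, fetch new messages and normalise them into a typed activity
// object for the progress channel.
interface Activity {
  kind: "tool" | "reasoning" | "error";
  text: string;
  at: number;
}

function normalise(msg: { role: string; content: string }): Activity {
  // Crude classification for illustration only.
  const kind = msg.content.startsWith("tool:")
    ? "tool"
    : msg.content.startsWith("error:")
      ? "error"
      : "reasoning";
  return { kind, text: msg.content, at: Date.now() };
}

async function pollSession(
  fetchNew: () => Promise<{ role: string; content: string }[]>,
  emit: (a: Activity) => void,
  done: () => boolean,
  intervalMs = 1500,
) {
  while (!done()) {
    for (const msg of await fetchNew()) emit(normalise(msg));
    await new Promise((r) => setTimeout(r, intervalMs));
  }
}
```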

On the review run page you will see a compact preview of the feed, and a link that opens a canonical live view with filters, auto-scroll, and better formatting for long tool inputs. This is the difference between “the bot is thinking” and “here is the last command the bot ran, here is the tool output, here is a reasoning chunk, here is the failure, if any.”

What the feed is good for
  • Debugging why a test command failed in the sandbox
  • Showing compliance-minded teams that work happened in a real environment
  • Separating a model hiccup from a repository issue faster
What the feed is not
  • A second source of truth for the final review verdict. The durable artifact remains the validated JSON and the published GitHub review.
  • A public stream. It is visible to the operator through the app like other run diagnostics.

Repository secrets: test like production when you need to

The missing piece in most AI sandboxes

Until now, an honest integration test in a feature branch often needed credentials that you would never post into a public PR thread. The correct answer is the same as in CI: the platform holds secrets, the job receives them as environment variables, and the logs are redacted by policy. v4.1 brings that model to the review sandbox with a per-repository secret store, encrypted in the database, editable only by the user who installed the app for that repo, and merged into the sandbox process environment as plain env vars at run time.

The encryption key never lives in the database. We use AES-256-GCM with a v1 envelope string so rotation remains possible without changing table shapes. Names are forced to a conservative UPPER_SNAKE_CASE pattern and a blocklist rejects a handful of foot-gun variables like dynamic linker hooks and key material for our own runtimes. There is a per-repository count cap. None of that is a substitute for a security review of your own threat model, but it is a serious baseline for “not plaintext in Postgres.”
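The envelope idea can be sketched with Node's built-in crypto module. Only "AES-256-GCM with a v1 envelope" and the UPPER_SNAKE_CASE-plus-blocklist policy come from this post; the exact serialisation, the regex, and the specific blocked names below are assumptions for illustration.

```typescript
// Sketch of a v1 envelope: AES-256-GCM with a random IV, serialised as
// "v1:<iv>:<tag>:<ciphertext>" so the format stays versionable for
// key rotation. Layout is an assumption, not Critique's actual format.
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

function encryptSecret(key: Buffer, plaintext: string): string {
  const iv = randomBytes(12); // 96-bit IV, the GCM recommendation
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ct = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  const tag = cipher.getAuthTag();
  return ["v1", iv.toString("base64"), tag.toString("base64"), ct.toString("base64")].join(":");
}

function decryptSecret(key: Buffer, envelope: string): string {
  const [version, iv, tag, ct] = envelope.split(":");
  if (version !== "v1") throw new Error(`unknown envelope version: ${version}`);
  const decipher = createDecipheriv("aes-256-gcm", key, Buffer.from(iv, "base64"));
  decipher.setAuthTag(Buffer.from(tag, "base64"));
  return Buffer.concat([
    decipher.update(Buffer.from(ct, "base64")),
    decipher.final(),
  ]).toString("utf8");
}

// Conservative name policy: UPPER_SNAKE_CASE only, plus a blocklist of
// foot-gun variables (the real blocklist is longer than this sample).
const NAME_RE = /^[A-Z][A-Z0-9_]*$/;
const BLOCKED = new Set(["LD_PRELOAD", "NODE_OPTIONS"]);
const isValidSecretName = (n: string) => NAME_RE.test(n) && !BLOCKED.has(n);
```

GCM's auth tag means a rotated-away key does not decrypt to garbage: `decipher.final()` throws, which is why old rows must be re-saved after rotation rather than silently re-read.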

How to think about secrets in Critique

Plain language, no magic

In the dashboard
  • What happens: You set name and value on a card under Automation. The value is never echoed back in list APIs beyond a short hint.
  • What does not happen: Secrets are not visible to other products outside this app surface unless you have built something custom.
In transit to the job
  • What happens: The server composes a resolved env object for the sandbox. Progress logs may name which keys were injected, not the values.
  • What does not happen: The review text on GitHub should still avoid echoing environment contents; that remains a model instruction.
At rest in Postgres
  • What happens: Ciphertext plus optional hint. Decryption is only attempted when building the sandbox environment.
  • What does not happen: We do not log decrypted secret bodies to application logs in the happy path; treat any debug logging in your deployment as a policy decision.
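The "in transit to the job" layer reduces to one merge with a names-only log line. The function name is illustrative, but the invariant is the one stated above: logs may name the injected keys, never the values.

```typescript
// Sketch of composing the sandbox environment: decrypted secrets are
// merged over the base env, and only the *names* of injected keys are
// written to progress logs.
function composeSandboxEnv(
  base: Record<string, string>,
  secrets: Record<string, string>,
  log: (line: string) => void,
): Record<string, string> {
  const resolved = { ...base, ...secrets };
  const injected = Object.keys(secrets).sort();
  // Names only; the values never touch the log channel.
  log(`injected secrets: ${injected.join(", ") || "(none)"}`);
  return resolved;
}
```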

If you rotate CRITIQUE_SECRETS_ENCRYPTION_KEY without re-encrypting older rows, those rows will not decrypt. Re-save from the UI after rotation.

How this lines up with v4.0’s story

The beta did not get quieter; it got sharper

The v4 beta note was about scale and the shape of the runtime: nearly four thousand pull requests, OpenCode server semantics, public ecosystem signals. v4.1 is not a rebrand. It is the set of product mechanics that make that runtime feel fair to a senior engineer: visible execution, a GitHub surface that respects how teams read feedback, and operator-controlled environment parity so “run the test suite” is a meaningful instruction.

If you are evaluating Critique in a real team

A practical checklist
  1. Does the review thread look like a human left many notes?
     Open a run that touched more than a handful of files. You should see a summary plus multiple inline comments with distinct titles and, where the diff allows, line anchoring. If the change is small, judge the comment count against what the diff can support.
  2. Can the agent prove it ran the thing?
     On the run page, follow the live link and look for the command list and tool cards. The JSON report’s command timeline should not be empty on non-trivial repositories.
  3. Do your tests have what they need?
     Under Automation, add only the variable names the repo expects. Re-run. If a secret is wrong, you will see failures in the tool feed, not a silent pass.
  4. Are you comfortable with key rotation?
     Document how CRITIQUE_SECRETS_ENCRYPTION_KEY is generated and store it in your secrets manager. Plan a re-entry step for already saved secrets if you ever rotate the key material.

Try v4.1 on a real pull request

Point Critique at a branch you care about, add only the secrets you are willing to entrust the platform to hold encrypted, and judge the result on the three axes: GitHub comment quality, live activity, and test-backed findings.

Get started
