Critique PR Review v4.1: Execution Depth, Live Observability, and Operator Secrets
The sandbox review no longer apologises for skipping tests. GitHub gets many small inline comments. The dashboard shows what the agent actually did. And you can finally hand it DATABASE_URL without handing it to the world.
critique.sh
Deeper than a short bot comment, quieter than a manifesto
v4.1 is a tight release. It chases a single product truth: a sandbox review should feel as specific as a great human review on GitHub, and as legible in the control plane as any other production workload.
Exploratory sandbox
The OpenCode agent runs with edit permission inside an ephemeral E2B workspace. It is instructed to git diff, search, run existing tests, author throwaway tests under controlled paths, and record a full command timeline so the report is grounded in execution, not vibes.
GitHub-shaped review density
Structured output supports a large findings array and longer bodies. The GitHub integration raises the inline comment budget and prioritises by severity so the final review reads like a human left many small, file-pinned notes instead of one vague summary.
Live activity and a canonical feed
While the session runs, we poll OpenCode session messages and surface tool calls, reasoning chunks, and errors on the review run. Open the dedicated live page to watch the model think and tools execute in order.
Operator secrets for real tests
Per-repository secrets are stored AES-256-GCM at rest, validated for safe names, and merged into the sandbox process environment so DATABASE_URL and similar values exist only when you choose to provide them.
Why v4.1 exists
The failure mode of “smart but shallow”
An automated review that sounds intelligent in aggregate but thin in the diff is worse than a noisy linter. It trains teams to ignore the bot. The successful pattern on GitHub has always been the same: many small, well-placed comments that map cleanly to specific lines, each carrying enough local context to be actionable. v4.1 is our push to make the sandbox path behave like that pattern even when the underlying model is doing heavy synthesis.
The other half is honesty about execution. If the runtime can clone the PR and run code, a review that only narrates the diff is leaving signal on the table. v4.1 rewrites the agent contract so “I did not run tests” is not an acceptable end state. The sandbox is temporary, but while it lives it is a real environment.
Exploratory review in the sandbox
From read-only opinion to verifiable work
We keep the same safety spine: the agent must not exfiltrate operator secrets in review text, and it must not mutate the remote in ways that are out of policy. Inside those rails, the lead model and the specialist subagents are now expected to behave like people who were handed a laptop with the repo already checked out at the right merge base.
Concretely, that means reading the change set with git and ripgrep, running existing tests or targeted subsets when the project supports it, and when necessary, creating scratch tests or one-off scripts in designated locations. The OpenCode config grants edit rights because an exploratory review without the ability to write a small failing test is a review that is guessing about edge cases.
Structured output that can carry the load
Schema limits are a product feature
If you cap the body at a few kilobytes and the findings list at a handful of entries, the model is forced to compress away nuance. We raised the limits on the sandbox JSON schema on purpose: longer per-finding text, more suggested tests, a wider command timeline, and more subagent detail. The lead model is still constrained by the skill definition, but the container is no longer the bottleneck at ten pages of “please shorten.”
The schema is not an academic exercise. It is the join between a messy runtime and a clean GitHub surface. A bigger container lets the model keep intermediate clarity without shoving everything into a single paragraph that humans will not read on the PR page.
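To make "raised limits" concrete, here is a hypothetical validator for the report container. The field names and the numbers in `LIMITS` are illustrative assumptions, not Critique's actual schema; the point is that the caps are wide enough that the model is not forced to compress findings into one blob.

```typescript
interface Finding {
  severity: "FAIL" | "WARNING" | "INFO";
  title: string;
  body: string;       // per-finding prose, allowed to be long
  file?: string;
  line?: number;
}

interface ReviewReport {
  summary: string;
  findings: Finding[];
  commandTimeline: string[]; // every command the agent ran in the sandbox
}

// Hypothetical raised limits: many findings, long bodies, a wide timeline.
const LIMITS = { maxFindings: 50, maxBodyChars: 8_000, maxCommands: 500 };

function validateReport(r: ReviewReport): string[] {
  const errors: string[] = [];
  if (r.findings.length > LIMITS.maxFindings)
    errors.push(`too many findings: ${r.findings.length}`);
  for (const f of r.findings)
    if (f.body.length > LIMITS.maxBodyChars)
      errors.push(`finding "${f.title}" body exceeds ${LIMITS.maxBodyChars} chars`);
  if (r.commandTimeline.length > LIMITS.maxCommands)
    errors.push(`command timeline exceeds ${LIMITS.maxCommands} entries`);
  return errors;
}
```

A validator like this is the "join" in practice: the runtime can be messy, but anything that reaches GitHub has passed through one typed shape.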
GitHub: match the way humans expect reviews to look
Many inline comments, ordered by impact
We changed the GitHub integration so a single “wall of text” review is not the only shape. The inline builder now has a higher budget for comments and it ranks findings with FAIL and WARNING first, with INFO still eligible to land as a line note when the diff supports it. The intent is the screenshot you have seen from strong automated audits: a summary at the top, then a set of small threads anchored to the exact lines that matter.
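The selection logic described above can be sketched as follows. The budget number and field names are assumptions for illustration, not the integration's real code:

```typescript
type Severity = "FAIL" | "WARNING" | "INFO";

interface Finding { severity: Severity; title: string; file?: string; line?: number; }

const RANK: Record<Severity, number> = { FAIL: 0, WARNING: 1, INFO: 2 };
const INLINE_BUDGET = 40; // hypothetical raised comment budget

// Pick the findings that become inline comments: FAIL and WARNING first,
// INFO still eligible, and only findings that can anchor to a diff line.
function selectInline(findings: Finding[]): Finding[] {
  return findings
    .filter((f) => f.file !== undefined && f.line !== undefined)
    .sort((a, b) => RANK[a.severity] - RANK[b.severity])
    .slice(0, INLINE_BUDGET);
}
```

Anything that cannot anchor to a changed line falls back to the summary at the top, which is also where the ranking pays off: the threads a reviewer sees first are the ones most likely to block the merge.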
Live activity: the operator can see the machine at work
Session messages, not a mystery spinner
Long-running model work suffers from a classic UX problem: the system is busy, and the human only knows that something is “still running.” v4.1 adds a poller on the OpenCode server side of the house that fetches new session messages while the main request is in flight, normalises them into a typed activity object, and ships them through the same progress channel that already writes agent board entries.
On the review run page you will see a compact preview of the feed, and a link that opens a canonical live view with filters, auto-scroll, and better formatting for long tool inputs. This is the difference between “the bot is thinking” and “here is the last command the bot ran, here is the tool output, here is a reasoning chunk, here is the failure, if any.”
The feed is useful for:
- Debugging why a test command failed in the sandbox
- Showing compliance-minded teams that work happened in a real environment
- Separating a model hiccup from a repository issue faster

The feed is not:
- A second source of truth for the final review verdict. The durable artifact remains the validated JSON and the published GitHub review.
- A public stream. It is visible only to the operator through the app, like other run diagnostics.
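A minimal sketch of the poller described above, with the fetcher injected so the endpoint shape stays abstract. The message fields, the activity shape, and the two-second interval are all assumptions for illustration:

```typescript
interface RawMessage { id: string; role: string; tool?: string; text?: string; error?: string; }

type Activity =
  | { kind: "tool"; id: string; tool: string; detail: string }
  | { kind: "reasoning"; id: string; detail: string }
  | { kind: "error"; id: string; detail: string };

// Normalise one raw session message into the typed activity object.
function normalise(m: RawMessage): Activity {
  if (m.error) return { kind: "error", id: m.id, detail: m.error };
  if (m.tool) return { kind: "tool", id: m.id, tool: m.tool, detail: m.text ?? "" };
  return { kind: "reasoning", id: m.id, detail: m.text ?? "" };
}

// Fetch new messages past a cursor while the main request is in flight,
// and ship each one through the existing progress channel.
async function pollSession(
  fetchSince: (cursor: string | null) => Promise<RawMessage[]>,
  publish: (a: Activity) => void,
  isRunning: () => boolean,
) {
  let cursor: string | null = null;
  while (isRunning()) {
    const batch = await fetchSince(cursor);
    for (const m of batch) publish(normalise(m));
    if (batch.length > 0) cursor = batch[batch.length - 1].id;
    await new Promise((r) => setTimeout(r, 2_000)); // poll interval (assumed)
  }
}
```

Because the feed rides the existing progress channel, the live page and the run board see the same ordered stream rather than two diverging logs.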
Repository secrets: test like production when you need to
The missing piece in most AI sandboxes
Until now, an honest integration test in a feature branch often needed credentials that you would never post into a public PR thread. The correct answer is the same as in CI: the platform holds secrets, the job receives them as environment variables, and the logs are redacted by policy. v4.1 brings that model to the review sandbox with a per-repository secret store, encrypted in the database, editable only by the user who installed the app for that repo, and merged into the sandbox process environment as plain env vars at run time.
The encryption key never lives in the database. We use AES-256-GCM with a v1 envelope string so rotation remains possible without changing table shapes. Names are forced to a conservative UPPER_SNAKE_CASE pattern and a blocklist rejects a handful of foot-gun variables like dynamic linker hooks and key material for our own runtimes. There is a per-repository count cap. None of that is a substitute for a security review of your own threat model, but it is a serious baseline for “not plaintext in Postgres.”
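Node's built-in crypto can express this scheme directly. The sketch below assumes a `v1:` prefix with base64 fields and an illustrative two-entry blocklist; the real envelope layout, name rules, and blocklist are not specified here.

```typescript
import { createCipheriv, createDecipheriv, randomBytes } from "node:crypto";

const NAME_PATTERN = /^[A-Z][A-Z0-9_]*$/; // conservative UPPER_SNAKE_CASE
const BLOCKLIST = new Set(["LD_PRELOAD", "NODE_OPTIONS"]); // illustrative foot-guns

function assertSafeName(name: string): void {
  if (!NAME_PATTERN.test(name) || BLOCKLIST.has(name))
    throw new Error(`unsafe secret name: ${name}`);
}

// key: 32 raw bytes, held outside the database
// (e.g. from CRITIQUE_SECRETS_ENCRYPTION_KEY).
function encrypt(key: Buffer, plaintext: string): string {
  const iv = randomBytes(12); // 96-bit nonce, standard for GCM
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const body = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  const tag = cipher.getAuthTag();
  // Versioned envelope: a format change bumps the prefix, not the table shape.
  return `v1:${iv.toString("base64")}:${tag.toString("base64")}:${body.toString("base64")}`;
}

function decrypt(key: Buffer, envelope: string): string {
  const [version, iv, tag, body] = envelope.split(":");
  if (version !== "v1") throw new Error(`unknown envelope version: ${version}`);
  const decipher = createDecipheriv("aes-256-gcm", key, Buffer.from(iv, "base64"));
  decipher.setAuthTag(Buffer.from(tag, "base64"));
  return Buffer.concat([
    decipher.update(Buffer.from(body, "base64")),
    decipher.final(),
  ]).toString("utf8");
}
```

Note that GCM's auth tag means a tampered ciphertext fails loudly at `decipher.final()` instead of decrypting to garbage, which is exactly the behaviour you want before injecting a value into a sandbox environment.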
Plain language, no magic
| Layer | What happens | What does not happen |
|---|---|---|
| In the dashboard | You set name and value on a card under Automation. The value is never echoed back in list APIs beyond a short hint. | Secrets are not visible to other products outside this app surface unless you have built something custom. |
| In transit to the job | The server composes a resolved env object for the sandbox. Progress logs may name which keys were injected, not the values. | The review text on GitHub should still avoid echoing environment contents; that remains a model instruction. |
| At rest in Postgres | Ciphertext plus optional hint. Decryption is only attempted when building the sandbox environment. | We do not log decrypted secret bodies to application logs in the happy path; treat any debug logging in your deployment as a policy decision. |
If you rotate CRITIQUE_SECRETS_ENCRYPTION_KEY without re-encrypting older rows, those rows will not decrypt. Re-save from the UI after rotation.
How this lines up with v4.0’s story
The beta did not get quieter; it got sharper
The v4 beta note was about scale and the shape of the runtime: nearly four thousand pull requests, OpenCode server semantics, public ecosystem signals. v4.1 is not a rebrand. It is the set of product mechanics that make that runtime feel fair to a senior engineer: visible execution, a GitHub surface that respects how teams read feedback, and operator-controlled environment parity so “run the test suite” is a meaningful instruction.
If you are evaluating Critique in a real team
1. Does the review thread look like a human left many notes? Open a run that touched more than a handful of files. You should see a summary plus multiple inline comments with distinct titles and, where the diff allows, line anchoring. If the change is too small, judge the comment count against that expectation.
2. Can the agent prove it ran the thing? On the run page, follow the live link and look for the command list and tool cards. The JSON report’s command timeline should not be empty on non-trivial repositories.
3. Do your tests have what they need? Under Automation, add only the variable names the repo expects. Re-run. If a secret is wrong, you will see failures in the tool feed, not a silent pass.
4. Are you comfortable with key rotation? Document how CRITIQUE_SECRETS_ENCRYPTION_KEY is generated and store it in your secrets manager. Plan a re-entry step for already saved secrets if you ever rotate the key material.
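For the generation step, one reasonable recipe is 32 random bytes encoded for an env var. The base64 encoding here is an assumption; check what your deployment of Critique actually expects before committing to a format.

```typescript
import { randomBytes } from "node:crypto";

// 32 random bytes = the 256-bit key material AES-256-GCM requires.
const key = randomBytes(32);
console.log(`CRITIQUE_SECRETS_ENCRYPTION_KEY=${key.toString("base64")}`);
```

Generate it once, store it in your secrets manager, and never derive it from a passphrase by hand.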
Try v4.1 on a real pull request
Point Critique at a branch you care about, add only the secrets you are willing to entrust the platform to hold encrypted, and judge the result on the three axes: GitHub comment quality, live activity, and test-backed findings.
Get started