One month. A billion tokens. What it actually took.
The arc from GitHub review to always-on engineering partner — what shipped, what is still wiring together, and why the scale we are reaching makes the boring work as important as the headlines.

fuck around and find out.
May 2026 · beta · thank you
critique.sh
Thank you. That is the honest first sentence. People are using Critique while the paint is still wet: filing bugs at midnight, burning real review load on PRs that matter, and writing blunt emails about the places where the product disappoints. That feedback is not a courtesy. It is how the roadmap stays honest and how the team avoids building in an echo chamber.
A small thanks on top of that: through end of day May 19, 2026 (UTC), DeepSeek V4 Flash and Pro bill at half the usual credit floors — 0.5 and 1.5 credits instead of 1 and 3 — anywhere we meter those models. The homepage and model directory show a live countdown; when the clock hits zero, the shelf pricing in the catalog takes back over automatically.
The numbers
These are not marketing figures. They are the raw counts from internal dashboards for the period this essay covers. They matter because at this scale, every point of inaccuracy — in metering, in model routing, in credit accounting — becomes visible within days.
A billion tokens a month is not a flex. It is a contract. When someone runs a long review session and the credit ledger is wrong on day three, they notice immediately. When a sandbox run stalls for forty seconds with no signal, they assume the product is broken — because it might be. Scale that was theoretical in January is now the daily operating condition, and it has forced a different kind of engineering discipline: the "boring" reliability merges at the end of the range matter as much as any headline feature.
How we got here
The story is not a straight line. Most interesting product arcs are not. We started in one narrow place and followed the logic wherever it led — which turned out to be somewhere much larger than the original premise.
- OriginThe GitHub App: review that lives where merges happen
Everything started with one belief: the right place to enforce code quality is the pull request, not a side tool you have to open separately. The first version of Critique was a GitHub App that posted structured review comments — findings with evidence, severity, and line-level attribution — directly on the PR thread, and could gate a merge on the result. That constraint, staying on the PR rather than building a parallel workspace, shaped every architectural decision that followed.
- ExpansionSandboxes: review that can see the whole checkout
A diff is three percent of what a good reviewer actually needs to read. The other ninety-seven percent is the files the PR did not change but the change still touches: callers, tests, adjacent modules, dependency boundaries. Sandboxes gave review a real checkout to reason against, not just a patch in a vacuum. The infrastructure required to make ephemeral environments reliable at review latency turned out to be its own significant project.
- Agent layerOpenCode inside the sandbox: from reviewer to executor
Once sandboxes existed, the natural question was whether an agent could work the way a human reviewer works — pulling context, tracing call graphs, running tests, not just reading the diff. OpenCode gave us that: a terminal-native coding agent that could operate inside our sandbox environment with full tool access. The Critique agent identity sits on top of OpenCode, tuned for the specific demands of long-horizon review runs.
- RepairRemedy: when a finding is narrow enough to fix, fix it
The honest follow-through after review is: if we know what is wrong and the fix is bounded, why stop at a comment? Remedy runs isolated patch attempts inside ephemeral VMs, reruns lint and the test suite, and pushes the verified commit back to the branch. The hard design constraint was the loop ceiling — Remedy stops after two autonomous attempts and hands back to the engineer, because AI that knows when to stop is more trustworthy than AI that never does.
- BuildBuilder: execution fabric for "make this change in the repo"
Teams wanted the same execution fabric Remedy used for something broader: natural language to repository-scoped change, without starting from a PR thread. Builder exposed the OpenCode sandbox layer as a first-class surface, with the same model routing and the same credit accounting as review.
- ContextChat: repo-grounded answers between PRs
Chat arrived early and kept maturing in parallel — better models, richer repo context, speech input, and progressively tighter bridges into review when a conversation should become action. The positioning is intentional: Chat is not a search box. It carries knowledge of your codebase across sessions and can launch a review run or a Builder job from the thread.
- North starWorkspace: one surface, one ledger, one always-on engineer
The observation that finally became a conviction: Chat, Review, Remedy, and Builder rhyme. Similar model ladder. Similar need for repo context. Similar accounting. Workspace is the bet that they should not feel like four different products forever. The goal is one place that behaves like an always-on software engineer — review, explain, patch, build — with one ledger and one clear notion of trust.
The surfaces, described plainly
Each surface is independent enough to be useful on its own. The value compounds when they share context — when a Chat thread can launch a review, or when a Remedy patch knows what the review found. That integration is the work still in progress.
Multi-agent PR analysis with a merge-gate verdict.
Scout maps the blast radius first. Specialists inspect in parallel — security, tests, architecture, performance. A frontier model synthesises findings into a PASS, REQUEST CHANGES, or BLOCK verdict. All evidence stays on the record.
Repo-aware answers without burning review credits.
Grounded in your codebase across sessions. Answers questions about architecture, traces call graphs, explains the unfamiliar file — and can kick off a review run or a Builder job from the conversation.
Bounded autonomous patches on verified findings.
When a finding is narrow enough to fix, Remedy boots an ephemeral VM, writes the patch, reruns tests, and pushes back to the branch. Two-loop ceiling by design — it hands back to the engineer before overreaching.
Natural language to repository-scoped change.
Same sandbox fabric as Remedy, without the PR thread constraint. Describe a change in plain language; Builder runs it in an isolated environment with full tool access and the same model routing as review.
One place that behaves like an always-on engineer.
Still wiring together. The north star: Chat, Review, Remedy, and Builder share one ledger, one context model, and one interface — so switching between them feels like talking to the same engineer, not opening a different app.
A billion tokens is a trust contract
This section would be easy to write as a flex. We will not do that. The scale we are reaching creates obligations that are harder than the engineering that got us here.
When review traffic is light, a miscount in the credit ledger is an embarrassment. At a billion tokens a month, it is a breach. When a model call stalls during a low-volume period, it looks like a timeout. At the load we are carrying now, a stall pattern that recurs twice a week becomes a support ticket, a trust problem, and eventually a cancellation. The reliability work that looks "boring" in the PR log — stall detection, liveness probes, accounting correctness under nested agent calls, failure visibility — is the work that determines whether the product keeps the promise it is making.
The billing clarity problem is worth naming explicitly. Multi-step agentic runs — where a review spawns a Remedy attempt that spawns a Builder sub-task — are hard to meter honestly. We have invested significant work in making sure the credit dashboard reflects what actually ran, not what we estimated would run. If you see a discrepancy, report it. We treat every billing bug as a trust bug.
What is still being built
Workspace unification is the largest open thread. The four surfaces exist and work independently. Making them feel like one product — shared context model, shared credit ledger, interface continuity as you move from Chat to Review to a Remedy run and back — is architectural work that takes time to do correctly. Exciting to describe; not finished as a shipped experience.
Enterprise policy is the second large thread. Teams want to define which specialists run on which file paths, what model quality tier they buy for security-critical merges versus routine updates, and what strictness threshold triggers a block versus a comment. The configuration surface exists in early form. Making it robust enough for a four-hundred-person organisation with branch protection enforced on every repository is a different bar.
The hand-off prompt layer — the ability for Remedy to emit a structured context packet that you can paste into whichever coding agent owns your workstation, whether that is Cursor or Claude desktop or Windsurf — is working but needs polish. The goal is that review output travels as a high-quality prompt, not a proprietary chat thread. That portability matters to enterprise customers who have already standardised on a different execution stack.
Frequently asked
Where to read more
The About page has the three-beat product description. Remedy and Models carry execution and catalog detail. The Version log is the authoritative dated ledger for what changed and when. Older essays go deeper on specific threads: the trust gap argument, the PR review v5 architecture, the Builder story, the case for Chat alongside review.
The beta is real. Use it like it is.
Free chat for signed-in users. Paid credits for real review runs on real PRs. Stress the flows that matter to your team. If something breaks or disappoints, tell us — that is the value of being here before everyone else.
Create an account