Skip to content
Product14 min readCritique

One month. A billion tokens. What it actually took.

The arc from GitHub review to always-on engineering partner — what shipped, what is still wiring together, and why the scale we are reaching makes the boring work as important as the headlines.

fuck around and find out.

May 2026 · beta · thank you

critique.sh

Thank you. That is the honest first sentence. People are using Critique while the paint is still wet: filing bugs at midnight, burning real review load on PRs that matter, and writing blunt emails about the places where the product disappoints. That feedback is not a courtesy. It is how the roadmap stays honest and how the team avoids building in an echo chamber.

A small thanks on top of that: through end of day May 19, 2026 (UTC), DeepSeek V4 Flash and Pro bill at half the usual credit floors — 0.5 and 1.5 credits instead of 1 and 3 — anywhere we meter those models. The homepage and model directory show a live countdown; when the clock hits zero, the shelf pricing in the catalog takes back over automatically.

These are not marketing figures. They are the raw counts from internal dashboards for the period this essay covers. They matter because at this scale, every point of inaccuracy — in metering, in model routing, in credit accounting — becomes visible within days.

~0B
Tokens processed per month across all surfaces
~0
Pull requests merged in the build window this essay covers
0k+
Lines of code written internally, agents included
0cr
Free beta credits for every new account

A billion tokens a month is not a flex. It is a contract. When someone runs a long review session and the credit ledger is wrong on day three, they notice immediately. When a sandbox run stalls for forty seconds with no signal, they assume the product is broken — because it might be. Scale that was theoretical in January is now the daily operating condition, and it has forced a different kind of engineering discipline: the "boring" reliability merges at the end of the range matter as much as any headline feature.

The story is not a straight line. Most interesting product arcs are not. We started in one narrow place and followed the logic wherever it led — which turned out to be somewhere much larger than the original premise.

  1. Origin
    The GitHub App: review that lives where merges happen

    Everything started with one belief: the right place to enforce code quality is the pull request, not a side tool you have to open separately. The first version of Critique was a GitHub App that posted structured review comments — findings with evidence, severity, and line-level attribution — directly on the PR thread, and could gate a merge on the result. That constraint, staying on the PR rather than building a parallel workspace, shaped every architectural decision that followed.

  2. Expansion
    Sandboxes: review that can see the whole checkout

    A diff is three percent of what a good reviewer actually needs to read. The other ninety-seven percent is the files the PR did not change but the change still touches: callers, tests, adjacent modules, dependency boundaries. Sandboxes gave review a real checkout to reason against, not just a patch in a vacuum. The infrastructure required to make ephemeral environments reliable at review latency turned out to be its own significant project.

  3. Agent layer
    OpenCode inside the sandbox: from reviewer to executor

    Once sandboxes existed, the natural question was whether an agent could work the way a human reviewer works — pulling context, tracing call graphs, running tests, not just reading the diff. OpenCode gave us that: a terminal-native coding agent that could operate inside our sandbox environment with full tool access. The Critique agent identity sits on top of OpenCode, tuned for the specific demands of long-horizon review runs.

  4. Repair
    Remedy: when a finding is narrow enough to fix, fix it

    The honest follow-through after review is: if we know what is wrong and the fix is bounded, why stop at a comment? Remedy runs isolated patch attempts inside ephemeral VMs, reruns lint and the test suite, and pushes the verified commit back to the branch. The hard design constraint was the loop ceiling — Remedy stops after two autonomous attempts and hands back to the engineer, because AI that knows when to stop is more trustworthy than AI that never does.

  5. Build
    Builder: execution fabric for "make this change in the repo"

    Teams wanted the same execution fabric Remedy used for something broader: natural language to repository-scoped change, without starting from a PR thread. Builder exposed the OpenCode sandbox layer as a first-class surface, with the same model routing and the same credit accounting as review.

  6. Context
    Chat: repo-grounded answers between PRs

    Chat arrived early and kept maturing in parallel — better models, richer repo context, speech input, and progressively tighter bridges into review when a conversation should become action. The positioning is intentional: Chat is not a search box. It carries knowledge of your codebase across sessions and can launch a review run or a Builder job from the thread.

  7. North star
    Workspace: one surface, one ledger, one always-on engineer

    The observation that finally became a conviction: Chat, Review, Remedy, and Builder rhyme. Similar model ladder. Similar need for repo context. Similar accounting. Workspace is the bet that they should not feel like four different products forever. The goal is one place that behaves like an always-on software engineer — review, explain, patch, build — with one ledger and one clear notion of trust.

Each surface is independent enough to be useful on its own. The value compounds when they share context — when a Chat thread can launch a review, or when a Remedy patch knows what the review found. That integration is the work still in progress.

This section would be easy to write as a flex. We will not do that. The scale we are reaching creates obligations that are harder than the engineering that got us here.

When review traffic is light, a miscount in the credit ledger is an embarrassment. At a billion tokens a month, it is a breach. When a model call stalls during a low-volume period, it looks like a timeout. At the load we are carrying now, a stall pattern that recurs twice a week becomes a support ticket, a trust problem, and eventually a cancellation. The reliability work that looks "boring" in the PR log — stall detection, liveness probes, accounting correctness under nested agent calls, failure visibility — is the work that determines whether the product keeps the promise it is making.

The billing clarity problem is worth naming explicitly. Multi-step agentic runs — where a review spawns a Remedy attempt that spawns a Builder sub-task — are hard to meter honestly. We have invested significant work in making sure the credit dashboard reflects what actually ran, not what we estimated would run. If you see a discrepancy, report it. We treat every billing bug as a trust bug.

Workspace unification is the largest open thread. The four surfaces exist and work independently. Making them feel like one product — shared context model, shared credit ledger, interface continuity as you move from Chat to Review to a Remedy run and back — is architectural work that takes time to do correctly. Exciting to describe; not finished as a shipped experience.

Enterprise policy is the second large thread. Teams want to define which specialists run on which file paths, what model quality tier they buy for security-critical merges versus routine updates, and what strictness threshold triggers a block versus a comment. The configuration surface exists in early form. Making it robust enough for a four-hundred-person organisation with branch protection enforced on every repository is a different bar.

The hand-off prompt layer — the ability for Remedy to emit a structured context packet that you can paste into whichever coding agent owns your workstation, whether that is Cursor or Claude desktop or Windsurf — is working but needs polish. The goal is that review output travels as a high-quality prompt, not a proprietary chat thread. That portability matters to enterprise customers who have already standardised on a different execution stack.

One credit is a normalised review slice — roughly 1M input and 150k output tokens at lead-model quality. A small PR (under 200 lines) costs 3–8 credits. A large PR with a security-critical path and Remedy follow-through might cost 30–50. Chat does not spend review credits; review and execution services require paid credits or eligible billing. If you want a team pilot, email ray@critique.sh.
The default stack routes specialists on cost-effective models (recent Qwen and GLM tiers) and reserves frontier models — Claude, GPT, Kimi — for the lead synthesis pass where multi-step reasoning matters most. You can override the model at any tier from the repository settings, including pointing the whole stack at your own inference endpoints via BYOA. The Models page has the current roster and per-model credit weights.
Yes, through Remedy — when Critique has GitHub App write access to the PR head branch and the finding is classified as bounded and resolvable. Remedy runs in an ephemeral VM, writes the patch, reruns your local test suite, and pushes the commit back. If tests fail, it tries once more and then stops. It never retries indefinitely. If you prefer to keep execution on your own stack, Remedy can emit a structured hand-off prompt instead of pushing directly.
Claude Code and Cursor are generation tools: they help you write code faster. Critique is a review layer: it sits at the merge gate and decides — with evidence — whether what was written should ship. The architecture is designed for that specific job: Scout maps the blast radius before any specialist starts; specialists run in parallel with structured, evidenced findings; a frontier model synthesises the full pack into a verdict. That verdict can drive a required GitHub check. A coding agent running in your editor does not have that structure, that integration, or that governance posture.
The GitHub App is required for automated PR review and Remedy pushes. Chat and Builder work without it — you can connect a repository through the dashboard for indexing and use those surfaces immediately. Installing the GitHub App unlocks the full pipeline: review checks, Remedy, and the merge-gate verdict that shows up on your PR thread.
Bring Your Own Agent means pointing the Critique specialist graph at your own inference infrastructure — a hosted OpenAI-compatible endpoint, Azure OpenAI, Amazon Bedrock, a private GPU fleet — instead of using the default OpenRouter routes. GitHub still receives identical verdict artefacts; only where the tokens are processed changes. Enterprise customers who need data residency in a specific region, or who have already certified particular inference providers for compliance, use BYOA. For most teams, the default routing is faster to set up and cheaper to operate.
Email ray@critique.sh. Describe what you are trying to prove — which surfaces, how many repositories, what scale. We will set up the pool and stay in the thread to help with onboarding. We are not running a formal enterprise sales motion yet; the pilot process is deliberately lightweight because we are still learning what matters most to different team sizes.
Workspace is the unification of Chat, Review, Remedy, and Builder into one coherent surface — one ledger, one context model, one interface that flows between explanation and action without mode-switching. It shipped in v0.3.0 (May 2026) at /workspace. Read the launch essay at /blog/introducing-critique-workspace for the full story.

The About page has the three-beat product description. Remedy and Models carry execution and catalog detail. The Version log is the authoritative dated ledger for what changed and when. Older essays go deeper on specific threads: the trust gap argument, the PR review v5 architecture, the Builder story, the case for Chat alongside review.

The beta is real. Use it like it is.

Free chat for signed-in users. Paid credits for real review runs on real PRs. Stress the flows that matter to your team. If something breaks or disappoints, tell us — that is the value of being here before everyone else.

Create an account