Essay · June 8, 2026 · 8 min read · Critique

Code and Pray Is Not a Workflow: Why Vibe Coders Need Sandbox Verification

Q: How do I verify code runs on GitHub without merging?

Add a PR check that checks out the head SHA, runs `npm ci`, `npm run build`, and tests in Docker or an ephemeral sandbox. Block merge on failure.

Asking AI to read your code is an expensive spellchecker. Running it in a sandbox is how you learn whether it compiles.

Vibe coding feels productive until merge day. The diff looks right. The model said the tests pass. Then CI fails on a TypeScript error three files away from what you edited, or Vercel reports a build failure nobody reproduced locally. Reading code with AI is an expensive spellchecker: useful, but it cannot tell you whether the program runs.

Static review vs runtime proof

What each layer catches

Mature teams use both. Vibe coding often stops at the first row.

Method	Catches	Misses
Human skim	Obvious mistakes, wrong file touched.	Type errors, missing env vars, flaky import order.
AI diff review	Logic risks, security patterns, missing tests.	Whether `next build` succeeds on this branch.
Sandbox build + tests	Compile errors, install failures, broken scripts.	Product judgment and design intent.

Runtime proof is the only layer that answers "does this branch compile in a clean environment?"

Static review scales explanation. Runtime proof scales truth. When most of the code is model-generated, the failure mode is plausible wrongness — code that reads fine and still breaks the build. That is why searches like "verify code runs on github" and "run github pr in docker sandbox" are rising alongside vibe coding tooling.

Plan-Work-Verify

Plan

Scope the task, name verification commands, list files that must not break.

Work

Let the coding agent implement; keep changes in a PR, not direct to main.

Verify

Run build and tests in a sandbox tied to the PR head SHA.

Merge

Human approves only after verify passes and review evidence is attached.

Verify is not "ask the model if this looks good." Verify is executing the same commands CI will execute, on the same commit, in an environment without your laptop's cached node_modules and without the dev server masking import cycles.
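One way to make the Plan step concrete is to write the plan down as data, so the verify commands and the must-not-break files are named before the agent starts. The sketch below is only an illustration: `VerifyPlan`, `filesNeedingScrutiny`, the task text, and the paths are all hypothetical names, not part of any tool mentioned here.

```typescript
// Hypothetical plan shape; every name here is illustrative.
interface VerifyPlan {
  task: string;
  verifyCommands: string[]; // the commands CI will run, in order
  mustNotBreak: string[];   // path prefixes the plan flagged as sensitive
}

// Return the changed files that fall under a flagged prefix,
// so a reviewer knows where to look harder.
function filesNeedingScrutiny(plan: VerifyPlan, changedFiles: string[]): string[] {
  return changedFiles.filter((file) =>
    plan.mustNotBreak.some((prefix) => file.indexOf(prefix) === 0),
  );
}

const examplePlan: VerifyPlan = {
  task: "Add CSV export to the reports page",
  verifyCommands: ["npm ci", "npm run build", "npm test"],
  mustNotBreak: ["src/auth/", "package-lock.json"],
};

const flagged = filesNeedingScrutiny(examplePlan, [
  "src/reports/export.ts",
  "src/auth/session.ts",
]);
// flagged → ["src/auth/session.ts"]
```

The check does not replace running the commands; it only tells the reviewer which parts of a passing PR still deserve a careful read.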

E2B, Docker, and the fork PR security model

The sandbox pattern is consistent: clone the PR ref, install deps, run build and tests, report exit codes to the PR check. Docker on self-hosted runners suits teams with existing CI. E2B and similar APIs give per-PR isolation without runner fleets — each run gets a fresh machine, which matters when PRs can change install scripts.
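The pattern above can be sketched as a small fail-fast runner. This is a minimal illustration, assuming a synchronous, hypothetical `Exec` function that stands in for whatever actually runs a command inside the sandbox (Docker, E2B, or a CI step); the stub below fakes a build failure rather than calling any real API.

```typescript
// Result of one sandbox step; exitCode 0 means success.
interface StepResult {
  command: string;
  exitCode: number;
  output: string;
}

// Hypothetical executor: a real one would run the command inside the sandbox.
type Exec = (command: string) => StepResult;

interface PipelineReport {
  passed: boolean;
  results: StepResult[];
}

// Run the steps in order and stop at the first non-zero exit code.
function runPipeline(commands: string[], exec: Exec): PipelineReport {
  const results: StepResult[] = [];
  for (const command of commands) {
    const result = exec(command);
    results.push(result);
    if (result.exitCode !== 0) {
      return { passed: false, results };
    }
  }
  return { passed: true, results };
}

// Stub executor for illustration only: the build step fails.
const stubExec: Exec = (command) =>
  command === "npm run build"
    ? { command, exitCode: 1, output: "build failed (stub output)" }
    : { command, exitCode: 0, output: "ok" };

const pipelineReport = runPipeline(["npm ci", "npm run build", "npm test"], stubExec);
// pipelineReport.passed is false, and "npm test" never ran because the build failed first.
```

Keeping the executor injectable means the same ordering and exit-code logic applies whether the sandbox is a container or a hosted VM, and the collected `results` are what gets posted back to the PR check.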

Sandbox checklist for external contributions
1. Are you running untrusted fork PRs?
Never run post-checkout scripts with secrets on fork workflows. Use label-gated workflows or restricted environments until a maintainer approves.
2. Does the sandbox match production?
Align Node version, package manager, and build command with Vercel or your deploy pipeline, not just `npm test`.
3. What is the failure artifact?
Attach compile output and test logs to the PR check so the author does not re-run locally to see the same error.
4. How long should verification take?
Cache dependencies by lockfile hash; run only affected apps in monorepos; fail fast on build before e2e.
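Two of the checklist items reduce to small decisions that can be sketched in code: whether a fork PR may run at all, and which apps in a monorepo need a build. The sketch below makes several assumptions that are not from any real API: the `PullRequestInfo` shape is a simplified stand-in for a PR payload, `safe-to-test` is just one possible label name, and the monorepo is assumed to use an `apps/<name>/` layout.

```typescript
// Hypothetical, simplified view of a pull request; field names are illustrative.
interface PullRequestInfo {
  isFork: boolean;
  labels: string[];
}

// Fork PRs run only after a maintainer has applied the approval label.
function mayRunSandbox(pr: PullRequestInfo, approvalLabel = "safe-to-test"): boolean {
  return !pr.isFork || pr.labels.indexOf(approvalLabel) !== -1;
}

// Map changed file paths to the apps that own them,
// assuming every app lives under apps/<name>/.
function affectedApps(changedFiles: string[]): string[] {
  const seen: Record<string, true> = {};
  for (const file of changedFiles) {
    const match = /^apps\/([^/]+)\//.exec(file);
    if (match) {
      seen[match[1]] = true;
    }
  }
  return Object.keys(seen).sort();
}

const toBuild = affectedApps([
  "apps/web/page.tsx",
  "apps/admin/index.ts",
  "apps/web/lib/format.ts",
  "README.md",
]);
// toBuild → ["admin", "web"]
```

A path-prefix match is the simplest possible notion of "affected"; it ignores shared packages, so a real setup would likely lean on the monorepo tool's own dependency graph instead.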

Minimal GitHub Actions sandbox shape

Illustrative pattern — adapt secrets and approval gates to your threat model.

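name: verify
on:
  pull_request:   # fork PRs run without repository secrets on this trigger

permissions:
  contents: read  # least privilege: the job only needs to read the repo

jobs:
  verify:
    runs-on: ubuntu-latest
    steps:
      # Check out the exact PR head commit rather than the default merge ref.
      - uses: actions/checkout@v4
        with:
          ref: ${{ github.event.pull_request.head.sha }}
      # Pin the Node version and cache npm downloads keyed on the lockfile.
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: npm
      - run: npm ci            # clean install from the lockfile
      - run: npm run build     # fail on compile errors before running tests
      - run: npm test -- --ci  # --ci assumes a Jest-based test script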

Critique as productized sandbox review

Most teams do not want to operate sandbox infrastructure before they have paid the price of a few painful merges. Critique productizes the verify step: open a PR, Critique checks out the branch in an isolated environment, runs build and test commands configured for the repo, and posts review findings with compile evidence — not just opinions on the diff.

That closes the gap vibe coding opens. The coding agent optimizes for "done." Sandbox review optimizes for "merged without surprise." When Vercel build fails on a pull request, the failure mode is usually something runtime proof would have caught earlier — see our guide on catching those errors before push.

A week in the life

Monday: ship with Cursor, open PR, sandbox fails on a missing export, fixed in ten minutes instead of after deploy. Tuesday: Dependabot bump, sandbox catches a type break AI review missed. Wednesday: fork PR, maintainer labels safe-to-test, gated sandbox run. Thursday: monorepo change, only the affected app builds. Friday: merge queue stays green because verify ran on the merge SHA.

That rhythm is boring. Boring is the goal. Vibe coding without verify is adrenaline. Plan-Work-Verify is engineering.

Primary sources

Vercel build failed on pull request

Catch TypeScript and Next.js build errors before merge.

E2B documentation

Ephemeral sandboxes for code execution and CI workflows.

GitHub Actions: fork PR security

Official guidance on running workflows safely for pull requests from forks.

Critique introduction

How GitHub PR review, sandbox verification, and remediation fit together.

FAQ

No. AI review inspects the diff statically. Sandbox verification proves the branch installs, compiles, and passes tests in a clean environment.

Spin up a short-lived VM per workflow run, execute install and build commands there, capture logs, and destroy the environment. Useful when you want isolation without managing persistent runners.

Critique combines multi-model PR review with sandbox build verification and posts unified findings on the pull request — less plumbing, same runtime proof.

Stop merging on vibes

Install Critique on GitHub and run sandbox verification on your next AI-assisted PR. See compile and test evidence before merge, not after Vercel fails.
