Essay · 8 min read · Critique

Code and Pray Is Not a Workflow: Why Vibe Coders Need Sandbox Verification

Asking AI to read your code is an expensive spellchecker. Running it in a sandbox is how you learn whether it compiles.

Vibe coding feels productive until merge day. The diff looks right. The model said tests pass. Then CI fails on a TypeScript error three files away from what you edited, or Vercel reports a build failure nobody reproduced locally. Reading code with AI is an expensive spellchecker — useful, but it cannot tell you the program runs.

What each layer catches

Mature teams combine static review with runtime proof. Vibe coding often stops at the first row.

| Method | Catches | Misses |
| --- | --- | --- |
| Human skim | Obvious mistakes, wrong file touched. | Type errors, missing env vars, flaky import order. |
| AI diff review | Logic risks, security patterns, missing tests. | Whether `next build` succeeds on this branch. |
| Sandbox build + tests | Compile errors, install failures, broken scripts. | Product judgment and design intent. |

Runtime proof is the only layer that answers "does this branch compile in a clean environment?"

Static review scales explanation. Runtime proof scales truth. When most of the code is model-generated, the failure mode is plausible wrongness — code that reads fine and still breaks the build. That is why searches like "verify code runs on github" and "run github pr in docker sandbox" are rising alongside vibe coding tooling.

  1. Plan
     Scope the task, name verification commands, list files that must not break.
  2. Work
     Let the coding agent implement; keep changes in a PR, not direct to main.
  3. Verify
     Run build and tests in a sandbox tied to the PR head SHA.
  4. Merge
     Human approves only after verify passes and review evidence is attached.
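The Merge step above can be read as a small gate. The sketch below is one possible shape, not a description of any particular tool: `VerifyResult` and `can_merge` are hypothetical names introduced here for illustration. The point it tries to show is that evidence only counts when it was produced for the commit currently at the head of the PR.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VerifyResult:
    """Outcome of one sandbox run (hypothetical record shape)."""
    sha: str          # commit the sandbox actually built
    build_ok: bool    # build command exited with code 0
    tests_ok: bool    # test command exited with code 0


def can_merge(head_sha: str, result: Optional[VerifyResult], human_approved: bool) -> bool:
    """Allow merge only when verify passed on the current head and a human approved."""
    if result is None:
        return False  # nothing was verified yet
    if result.sha != head_sha:
        return False  # stale evidence: commits landed after the sandbox run
    return result.build_ok and result.tests_ok and human_approved
```

A green run on an earlier commit says nothing about what the agent pushed afterwards, which is why the SHA comparison comes before the pass/fail checks.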

Verify is not "ask the model if this looks good." Verify is executing the same commands CI will execute, on the same commit, in an environment without your laptop's cached `node_modules` and without the dev server masking import cycles.

The sandbox pattern is consistent: clone the PR ref, install deps, run build and tests, report exit codes to the PR check. Docker on self-hosted runners suits teams with existing CI. E2B and similar APIs give per-PR isolation without runner fleets — each run gets a fresh machine, which matters when PRs can change install scripts.
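The pattern described above (fresh workspace, ordered steps, exit codes reported back) can be sketched in a few lines. This is a minimal illustration under stated assumptions: a temporary directory stands in for the container or VM and provides no real isolation, and `StepResult` and `run_steps` are hypothetical names. A real run would pass commands such as `git clone`, `npm ci`, `npm run build`, and `npm test` as the steps.

```python
import subprocess
import sys
import tempfile
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class StepResult:
    name: str
    exit_code: int
    output: str  # combined stdout and stderr, kept as the failure artifact


def run_steps(steps: Sequence[Tuple[str, List[str]]]) -> List[StepResult]:
    """Run each step in a throwaway directory and stop at the first failure."""
    results: List[StepResult] = []
    # Assumption: a temp dir is only a stand-in for a real container or VM.
    with tempfile.TemporaryDirectory() as workdir:  # removed when the run ends
        for name, cmd in steps:
            proc = subprocess.run(
                cmd,
                cwd=workdir,
                stdout=subprocess.PIPE,
                stderr=subprocess.STDOUT,
                text=True,
            )
            results.append(StepResult(name, proc.returncode, proc.stdout))
            if proc.returncode != 0:
                break  # fail fast: later steps are skipped once one step fails
    return results


if __name__ == "__main__":
    # Placeholder commands so the sketch runs without Node or git installed.
    demo = [
        ("install", [sys.executable, "-c", "print('deps installed')"]),
        ("build", [sys.executable, "-c", "import sys; print('build failed'); sys.exit(2)"]),
        ("test", [sys.executable, "-c", "print('not reached')"]),
    ]
    for result in run_steps(demo):
        print(result.name, result.exit_code)
```

The captured `output` of the failing step is what would be attached to the PR check, so the author sees the error without reproducing it locally.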

Sandbox checklist for external contributions
  1. Are you running untrusted fork PRs?
     Never run post-checkout scripts with secrets on fork workflows. Use label-gated workflows or restricted environments until a maintainer approves.
  2. Does the sandbox match production?
     Align Node version, package manager, and build command with Vercel or your deploy pipeline, not just `npm test`.
  3. What is the failure artifact?
     Attach compile output and test logs to the PR check so the author does not re-run locally to see the same error.
  4. How long should verification take?
     Cache dependencies by lockfile hash; run only affected apps in monorepos; fail fast on build before e2e.
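Several checklist items reduce to small, testable decisions. The sketch below shows one way to express three of them: gating fork PRs on a maintainer label, keying the dependency cache on the lockfile, and selecting affected apps in a monorepo. Everything here is an assumption made for illustration: the function names, the `safe-to-test` default label, the `apps/<name>/...` layout, and the key format are not taken from any specific CI product.

```python
import hashlib
from pathlib import PurePosixPath
from typing import Iterable, List, Set


def sandbox_mode(is_fork: bool, labels: Iterable[str], gate_label: str = "safe-to-test") -> str:
    """Return "full" for trusted branches, "gated" for approved forks, else "skip"."""
    if not is_fork:
        return "full"
    if gate_label in set(labels):
        return "gated"  # maintainer approved; still run without secrets
    return "skip"       # untrusted fork code waits for a maintainer


def cache_key(lockfile_bytes: bytes, node_version: str) -> str:
    """Build a key that changes when the lockfile or the Node version changes."""
    digest = hashlib.sha256(lockfile_bytes).hexdigest()[:16]
    return f"deps-node{node_version}-{digest}"


def affected_apps(changed_files: Iterable[str], all_apps: Iterable[str], apps_dir: str = "apps") -> List[str]:
    """Pick apps to build, assuming a hypothetical apps/<name>/... layout."""
    known: Set[str] = set(all_apps)
    hit: Set[str] = set()
    for path in changed_files:
        parts = PurePosixPath(path).parts
        if len(parts) >= 3 and parts[0] == apps_dir and parts[1] in known:
            hit.add(parts[1])
        else:
            # Shared code, lockfile, or root config changed: build everything.
            return sorted(known)
    return sorted(hit)
```

Two design choices are worth noting. The affected-apps function falls back to building every app whenever a change lands outside a known app directory, because a shared package or lockfile change can break any of them. For the fork gate, a cautious setup would also clear the label when new commits are pushed, since the maintainer approved the earlier code, not the new commits.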

Minimal GitHub Actions sandbox shape

Illustrative pattern — adapt secrets and approval gates to your threat model.

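```yaml
name: verify
on:
  pull_request:        # fork PRs run without repository secrets on this trigger
permissions:
  contents: read       # least privilege: the job only needs to read the repo
jobs:
  verify:
    runs-on: ubuntu-latest
    steps:
      # Build the exact commit under review, not the auto-generated merge ref.
      - uses: actions/checkout@v4
        with:
          ref: ${{ github.event.pull_request.head.sha }}
      - uses: actions/setup-node@v4
        with:
          node-version: 20   # match the version your deploy pipeline uses
          cache: npm         # restores the npm cache keyed on the lockfile
      - run: npm ci          # clean install from the lockfile
      - run: npm run build   # fail on compile errors before running tests
      - run: npm test -- --ci  # assumes a Jest-style runner that accepts --ci
```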

Most teams do not want to operate sandbox infrastructure before they have paid the price of a few painful merges. Critique productizes the verify step: open a PR, Critique checks out the branch in an isolated environment, runs build and test commands configured for the repo, and posts review findings with compile evidence — not just opinions on the diff.

That closes the gap vibe coding opens. The coding agent optimizes for "done." Sandbox review optimizes for "merged without surprise." When Vercel build fails on a pull request, the failure mode is usually something runtime proof would have caught earlier — see our guide on catching those errors before push.

Monday: ship with Cursor, open PR, sandbox fails on a missing export — fix in ten minutes, not after deploy. Tuesday: dependabot bump, sandbox catches a type break AI review missed. Wednesday: fork PR, maintainer labels `safe-to-test`, gated sandbox run. Thursday: monorepo change, only the affected app builds. Friday: merge queue stays green because verify ran on the merge SHA.

That rhythm is boring. Boring is the goal. Vibe coding without verify is adrenaline. Plan-Work-Verify is engineering.

AI review does not replace sandbox verification. AI review inspects the diff statically; sandbox verification proves the branch installs, compiles, and passes tests in a clean environment.
To verify that a PR actually runs, add a PR check that checks out the head SHA and runs `npm ci`, `npm run build`, and the tests in Docker or an ephemeral sandbox. Block merge on failure.
An ephemeral sandbox follows one lifecycle: spin up a short-lived VM per workflow run, execute install and build commands there, capture logs, and destroy the environment. It is useful when you want isolation without managing persistent runners.
Critique combines multi-model PR review with sandbox build verification and posts unified findings on the pull request: less plumbing, same runtime proof.

Stop merging on vibes

Install Critique on GitHub and run sandbox verification on your next AI-assisted PR. See compile and test evidence before merge, not after Vercel fails.
