Repath Khan · 18 min read

Welcome Gemma-4-31B and MiMo-V2-Flash: The Permanent 0.5cr Review Tier

Blazing fast, super smart, and permanently priced at 0.5 credits. Why our new entry-level models outperform GPT-5 and Claude Sonnet 4.5 on price-to-performance in high-volume code review.

The new floor.

0.5 credits · Gemma & MiMo · permanent


Two engines for the agentic stack — now on critique.sh

The biggest barrier to shipping AI review on every PR has always been compute cost. Today we remove it. Gemma-4-31B and MiMo-V2-Flash are permanent at 0.5 credits — not a promo, not a trial, not a degraded tier. These are production-grade frontier-adjacent models with benchmarks that embarrass models that cost 20× more.

Gemma-4-31B

Google DeepMind's open-weight flagship, distilled from the same research lineage as Gemini 3. Built from the ground up for instruction-following, structured output, and agentic tool use — the exact profile critique.sh routes need.

  • Apache 2.0 open weights — audit, fine-tune, or self-host without vendor lock-in. Your model strategy isn't a hostage negotiation.
  • 256K token context with native multimodal input (text + images) — send full PR diffs, stack traces, and design screenshots in a single pass without chunking.
  • Native function calling and system-prompt support baked into training — structured JSON critique output, tool-use chaining, and repo-aware routing without prompt-engineering workarounds (a minimal sketch of the structured output follows this list).
  • LiveCodeBench v6: 80.0% — stronger than most closed models at 5–10× the credit cost. Scored 2150 Codeforces ELO on competitive programming. This is not a cheap fallback; this is a capable engine priced deliberately low.
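
What does "structured JSON critique output" buy you in practice? Here is a minimal sketch of the consuming side, assuming a hypothetical findings schema; the field names and severity levels are illustrative, not a published critique.sh or Gemma format.

```python
import json
from dataclasses import dataclass

# Hypothetical findings schema; field names and severity levels are
# illustrative, not a published critique.sh or Gemma format.
@dataclass
class Finding:
    file: str       # path the finding applies to
    line: int       # 1-indexed line in the diff
    severity: str   # "info" | "warn" | "block"
    message: str    # human-readable critique

def parse_critique(raw: str) -> list[Finding]:
    """Parse a model's JSON critique, rejecting malformed output early."""
    payload = json.loads(raw)  # raises on non-JSON output
    findings = []
    for item in payload.get("findings", []):
        if item.get("severity") not in {"info", "warn", "block"}:
            raise ValueError(f"unknown severity: {item.get('severity')!r}")
        findings.append(Finding(file=item["file"], line=int(item["line"]),
                                severity=item["severity"], message=item["message"]))
    return findings

# A well-formed response from a JSON-mode model call:
raw = ('{"findings": [{"file": "app/db.py", "line": 42, "severity": "warn", '
       '"message": "connection not closed on the error path"}]}')
print(parse_critique(raw))
```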

Benchmarks

  • LiveCodeBench v6: 80.0%
  • AIME 2026: 89.2%
  • GPQA Diamond: 84.3%
  • Codeforces ELO: 2150 (competitive programming)
MiMo-V2-Flash

Xiaomi's reasoning-optimized MoE, trained specifically for software engineering and agentic tasks. The architecture is engineered for throughput-heavy loops: big context, fast decoding, low active-param overhead. It hit #1 on SWE-bench Verified among open models on release.

  • 309B total / ~15B active parameters via sparse MoE routing — you get frontier-scale capacity with the inference cost of a mid-size dense model. That gap is why it runs at 0.5cr.
  • 5:1 hybrid attention (sliding window + global) cuts KV-cache memory by ~6× versus full attention, enabling 256K context on large codebases without the infrastructure overhead (a back-of-envelope calculation follows this list).
  • Multi-token prediction (MTP) enables self-speculative decoding — Xiaomi reports 2–2.6× decode speedups in production, which is why high-volume PR queues drain fast. Trained on 27T tokens with FP8 mixed precision.
  • Post-trained with multi-teacher on-policy distillation and large-scale agentic RL — the model is shaped for real software workflows, not just benchmark overfitting. SWE-bench Multilingual #1 at release.
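
Where could the ~6× figure come from? Here is a back-of-envelope sketch. The 4K sliding window and the per-block averaging are assumptions for illustration, not Xiaomi's published configuration.

```python
# Back-of-envelope KV-cache saving for 5:1 hybrid attention.
# ASSUMPTIONS (illustrative, not Xiaomi's published config):
#   - sliding-window layers cache only the last WINDOW tokens
#   - global layers cache the full context
#   - memory averaged over one block of 5 sliding + 1 global layer

CONTEXT = 256 * 1024   # 256K-token context
WINDOW = 4 * 1024      # assumed 4K sliding window

full_kv = CONTEXT                            # per layer, full attention
hybrid_kv = (5 * WINDOW + 1 * CONTEXT) / 6   # averaged over the 5:1 block

print(f"reduction: {full_kv / hybrid_kv:.1f}x")  # ~5.6x, in line with ~6x
```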

Benchmarks

  • SWE-bench Verified: 73.4% (#1 open on release)
  • SWE-bench Multilingual: 71.7% (#1 open on release)
  • AIME 2025: 94.1%
  • GPQA Diamond: 84.3%

Why this changes your PR routing

The classic triage problem: teams skip AI review on small PRs, hotfix branches, or draft PRs because the cost adds up. At 0.5cr per run, that calculation disappears. You can set Gemma-4-31B or MiMo-Flash as your always-on lead on every repository — every commit, every branch, every author — and reserve your heavier models for the architectural PRs that actually need GLM-5.1 or Opus-class depth. This is what 100% PR coverage actually looks like.

GLM-5.1 (Z.AI · frontier)

Z.AI's flagship model, built specifically for long-horizon agentic engineering. Not "agentic" as a marketing adjective — it is architecturally designed to sustain productive execution across hundreds of iterations, thousands of tool calls, and sessions that run for hours without losing coherence or drifting off-strategy. It scored #1 globally on SWE-Bench Pro, beating GPT-5.4 (57.7) and Claude Opus 4.6 (57.3) outright.

  • SWE-Bench Pro: 58.4 — the hardest real-world software engineering benchmark. GLM-5.1 leads every model that existed at launch, open or closed. Multi-file tracing, cross-module reasoning, and architecture critique are where it separates.
  • Terminal-Bench 2.0: 63.5 and NL2Repo: 42.7 — measures how well a model executes real multi-step terminal tasks and navigates unfamiliar repositories. Exactly what critique.sh routes need when depth matters more than speed.
  • Long-horizon stamina — demonstrated 8-hour autonomous execution without strategy drift. On critique.sh this translates to deep architectural reviews that don't miss the third-order implication buried in file 17.
  • At 4cr on critique.sh, it's the upgrade path when the merge gate matters — before burning Opus or GPT-5.4 Pro credits on every PR in the queue. Frontier judgment, not frontier pricing on everything.

Benchmarks

  • SWE-Bench Pro: 58.4 (#1 globally)
  • Terminal-Bench 2.0: 63.5
  • NL2Repo: 42.7
  • CyberGym: 68.7
  • AIME 2026: 95.3%
  • GPQA Diamond: 86.2%
MiniMax-M2.5 (MiniMax · 1cr)

Re-introduced at 1cr — half the price of M2.7 (2cr). MiniMax-M2.5 scored 80.2% SWE-bench Verified. That is Claude Sonnet-class benchmark performance at 1cr — a 22× saving versus running Sonnet on every PR.

  • SWE-bench Verified: 80.2%
  • SWE-Bench Pro: 55.4%
  • BrowseComp: 76.3%
GPT-5.3 Codex (OpenAI · 18cr)

Re-introduced to fill the gap between Codex Max (14cr) and GPT-5.4 (20cr). For teams who want Codex-family algorithmic depth beyond what Max provides, without paying full 5.4 prices on every run. The in-between that most routing strategies were missing.

  • SWE-Bench Pro: 56.8%
  • vs GPT-5.4 (20cr): −2cr
  • vs Codex Max (14cr): +4cr
Gemini 3.1 Pro (Google · 16cr)

Frontier-class at 16cr — cheaper than GPT-5.3 Codex (18cr), GPT-5.4 (20cr), and Sonnet 4.6 (22cr). A native 1M-token context window, natively multimodal (text, image, video, audio, PDF), with a configurable thinking budget. 80.6% SWE-bench Verified and 94.3% GPQA Diamond — the highest GPQA score in the catalog.

  • SWE-bench Verified: 80.6%
  • GPQA Diamond: 94.3% (highest in catalog)
  • Terminal-Bench 2.0: 68.5%
  • Context window: 1M tokens

Also in catalog: MiMo-V2-Pro and the rest of the LobeHub lineup — select any lead or specialist in your installation or repo policy.

All models route through critique.sh in the same control plane

Z.AI · Google · MiniMax · Claude · OpenAI · Qwen

0.5cr for volume, 1cr for Sonnet-class SWE-V, 4cr for frontier depth, 16cr for 1M-context Gemini — one policy file controls the whole stack.
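
For concreteness, here is what such a policy could look like, sketched as a Python dict; the keys, match rules, and model IDs are illustrative assumptions, not the actual critique.sh policy schema.

```python
# Hypothetical routing policy; keys and model IDs are illustrative,
# not the actual critique.sh policy schema.
POLICY = {
    "default_lead": "mimo-v2-flash",          # 0.5cr on every PR, branch, author
    "overrides": [
        # Escalate when the merge gate matters.
        {"match": {"labels": ["security", "architecture"]},
         "lead": "glm-5.1"},                  # 4cr frontier depth
        {"match": {"paths": ["infra/**", "migrations/**"]},
         "lead": "claude-sonnet-4.6"},        # 22cr, for the few PRs that justify it
    ],
    "specialist": "gemini-3.1-pro",           # 16cr, 1M context for repo-wide questions
}
```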

Here is the psychological trap most teams fall into without realising it: loss aversion meets mental accounting. When every review costs 14–22 credits, your brain quietly starts skipping "small" PRs, draft branches, and dependency bumps — even though those are exactly where regressions slip through. Flagship pricing anchors you to selective review. We are resetting the anchor. At 0.5cr, the default flips from "which PRs deserve AI?" to "why would we ever turn review off?"

SWE-bench Verified: the single ruler we use for every model

SWE-bench Verified is 500 human-validated real GitHub issues. It is the closest thing to a shared ruler across OpenAI, Anthropic, Google, Xiaomi, and the open-weight stack. We use it for every model in the comparison — with two honest notes. First, Google does not publish a SWE-bench Verified score for Gemma 4 31B IT; the closest official coding signal is LiveCodeBench v6 at 80.0%, which we include as a labelled proxy. Second, OpenAI only publishes SWE-Bench Pro for GPT-5.4 mini — a different suite. The independent Vals.ai lab ran GPT-5.4 mini on the standard 500-task Verified set using mini-SWE-agent and published 73.0% ± 1.99; we use that. Everything else is each vendor's own stated SWE-bench Verified figure.

SWE-bench Verified × critique.sh credit floor

Primary benchmark: SWE-bench Verified (500-task, human-validated). Gemma 4 uses LiveCodeBench v6 (no official SWE-V published). GPT-5.4 mini uses Vals.ai's standard 500-task Verified run (OpenAI only publishes SWE-Bench Pro for this model). Haiku 4.5 has no critique.sh catalog row yet.

| Model | SWE-bench Verified | Benchmark source | critique.sh cr |
| --- | --- | --- | --- |
| MiMo-V2-Flash | 73.4% | Xiaomi tech report (arXiv 2601.02780) | 0.5 cr |
| Gemma-4-31B IT | 80.0% ★ | LiveCodeBench v6 (SWE-V not published by Google) | 0.5 cr |
| MiniMax-M2.5 | 80.2% | MiniMax Hugging Face model card | 1 cr |
| Qwen3.5-27B | 72.4% | Qwen Hugging Face model card | 2 cr |
| Gemini 3 Flash | 78.0% | Google DeepMind (single attempt, avg 5 runs) | 4 cr |
| Claude Haiku 4.5 | 73.3% | Anthropic (50 trials, 128K thinking budget) | — (not in catalog) |
| GPT-5.4 Mini | 73.0% † | Vals.ai (mini-SWE-agent, 500-task Verified set) | 6 cr |
| GPT-5.1 Codex Max | 77.9% | OpenAI system card (xhigh reasoning effort) | 14 cr |
| Claude Sonnet 4.6 | 79.6% | Anthropic system card (avg 10 trials) | 22 cr |

★ Gemma 4 31B IT: Google's Gemma 4 DeepMind table and Hugging Face card report LiveCodeBench v6 (80.0%), Codeforces ELO (2150), MMMU Pro (76.9%, multimodal), and τ2-bench Retail (86.4%, agentic tool use) — not SWE-bench Verified. LCB v6 is competitive programming, a different task distribution from SWE-V; use the LCB number directionally.
† GPT-5.4 mini: OpenAI only reports "SWE-Bench Pro (public)" at 54.4% for this model. The 73.0% figure is Vals.ai's independent run on the same 500-task Verified subset using mini-SWE-agent (vals.ai/models/openai_gpt-5.4-mini-2026-03-17).
Claude Sonnet 4.5 scores 77.2% SWE-V; we list 4.6 to match the catalog SKU. Haiku 4.5 is quoted from Anthropic's announcement; it is not yet available on critique.sh.

Price-to-performance: benchmark per credit

The efficiency ratio below is simple: SWE-bench Verified % divided by the critique.sh credit floor. Higher means more benchmark per dollar of compute you spend on review. The chart makes one thing immediately obvious — the gap between the 0.5cr models and the next tier is not incremental. It is structural.

SWE-bench Verified ÷ critique.sh credits — price-to-performance ratio

Higher = more benchmark signal per credit. Gemma uses LiveCodeBench v6 (80.0%) — see table note ★. GPT-5.4 mini uses Vals.ai Verified run (73.0%) — see note †. Haiku 4.5 excluded (no catalog price).

Formula: SWE-bench Verified % ÷ critique.sh credit floor. MiMo: 73.4 ÷ 0.5 = 146.8. Gemma: 80.0 (LCB) ÷ 0.5 = 160. MiniMax: 80.2 ÷ 1 = 80.2. Qwen: 72.4 ÷ 2 = 36.2. Gemini 3 Flash: 78.0 ÷ 4 = 19.5. GPT-5.4 mini: 73.0 ÷ 6 = 12.2. Codex Max: 77.9 ÷ 14 = 5.6. Sonnet 4.6: 79.6 ÷ 22 = 3.6.
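
The whole chart reproduces in a few lines; the scores and credit floors below are the ones from the table above.

```python
# Price-to-performance: SWE-bench Verified % divided by critique.sh credits.
# Gemma uses the LiveCodeBench v6 proxy (see table note ★).
scores = {
    "MiMo-V2-Flash":     (73.4, 0.5),
    "Gemma-4-31B":       (80.0, 0.5),   # LCB v6 proxy
    "MiniMax-M2.5":      (80.2, 1),
    "Qwen3.5-27B":       (72.4, 2),
    "Gemini 3 Flash":    (78.0, 4),
    "GPT-5.4 mini":      (73.0, 6),
    "GPT-5.1 Codex Max": (77.9, 14),
    "Claude Sonnet 4.6": (79.6, 22),
}
for model, (swe_v, credits) in sorted(scores.items(),
                                      key=lambda kv: -kv[1][0] / kv[1][1]):
    print(f"{model:18s} {swe_v / credits:6.1f} per credit")
```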

Look at what the chart is actually saying: Sonnet 4.6 at 79.6% SWE-V is a better model than MiMo at 73.4% in absolute terms. But you get 40× the benchmark-per-credit from MiMo. The question is not "which model is smarter?" It is "what is the right model for this PR, given how often you are running it?" For 90% of the PR queue, volume-tier review is not a compromise — it is the strategically correct choice.

What the absolute scores tell you — and what they do not

The raw SWE-bench Verified column is tighter than the chart makes it look. MiMo (73.4%), Haiku (73.3%), Qwen (72.4%), and GPT-5.4 mini (73.0% on the Vals harness) are all within ~1 percentage point of each other on the same benchmark. Sonnet 4.6 (79.6%), Codex Max (77.9%), Gemini 3 Flash (78.0%), and MiniMax (80.2%) cluster between roughly 78 and 80. The "expensive = smarter" framing holds — but only weakly and only in absolute terms. The absolute gap between MiMo and Sonnet 4.6 is about 6 points on 500 tasks. The credit gap is 44×. That ratio does not survive scrutiny as a justification for selective review.

Where GLM-5.1 fits in the story

GLM-5.1 is not in the SWE-bench Verified table above — Z.AI reports on SWE-Bench Pro (58.4, #1 globally on that suite), Terminal-Bench 2.0 (63.5), and NL2Repo (42.7). Those benchmarks are designed for long-horizon agentic sessions: multi-step terminal tasks, repository navigation, and hours-long autonomous execution. That is the right profile for merge-gate reviews on security-critical or architecture-changing PRs, which is exactly where critique.sh routes it at 4cr.

  • 0.5cr: permanent floor for Gemma-4 & MiMo-Flash
  • ~73%: SWE-bench Verified for MiMo at that price
  • 40×: price-to-performance vs Claude Sonnet 4.6
  • 58.4: GLM-5.1 SWE-Bench Pro, #1 globally

Why this changes your entire routing strategy

The classic triage habit is a form of status-quo bias: the safe default feels like the expensive default, because that is where teams start. But status-quo bias runs both ways. Once you establish 0.5cr review as the default — on every PR, every branch, every author — the psychological load of "is this PR worth reviewing?" disappears. You are left with a simpler, better question: "does this specific merge gate need an upgrade to GLM-5.1 or Sonnet?" That escalation is opt-in, deliberate, and justified. The baseline is already on.

Volume & default review (0.5cr)
  • MiMo-V2-Flash: 73.4% SWE-bench Verified, 309B/15B MoE, 256K context — structured review at scale
  • Gemma-4-31B: 80.0% LiveCodeBench v6, Apache 2.0, native multimodal (images + text in one pass)
  • Run on every PR, every branch, every author — no triage story, no compute tax
  • Together they cover the full throughput surface; pick per-repo based on your policy
Frontier-grade review (4cr – 22cr)
  • GLM-5.1: #1 SWE-Bench Pro (58.4), Terminal-Bench 2.0 (63.5) — long-horizon, multi-file depth
  • GPT-5.1 Codex Max: 77.9% SWE-V — when you need deep Codex-family algorithmic critique
  • Claude Sonnet 4.6: 79.6% SWE-V — when vendor alignment or long context chat matters
  • Keep flagships for merges that justify the 40× multiple over the baseline; a sketch of one such escalation rule follows this list
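
A minimal sketch of such an escalation rule, assuming hypothetical PR metadata; the labels, threshold, and model IDs are illustrative, not a critique.sh API.

```python
from dataclasses import dataclass, field

# Hypothetical PR metadata; labels, threshold, and model IDs are
# illustrative, not a critique.sh API.
@dataclass
class PR:
    labels: set[str] = field(default_factory=set)
    files_changed: int = 0

def pick_model(pr: PR) -> str:
    """Route a PR to the cheapest tier that fits its risk profile."""
    if {"security", "architecture"} & pr.labels:
        return "glm-5.1"            # 4cr: long-horizon, multi-file depth
    if pr.files_changed > 40:
        return "claude-sonnet-4.6"  # 22cr: sprawling merges only
    return "mimo-v2-flash"          # 0.5cr: the always-on default

print(pick_model(PR(labels={"security"})))  # -> glm-5.1
print(pick_model(PR(files_changed=3)))      # -> mimo-v2-flash
```
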
Google does not publish a SWE-bench Verified score for Gemma 4 31B IT — not on the Hugging Face model card, not on the DeepMind Gemma 4 page, and not on the official SWE-bench leaderboard viewer (no Gemma 4 entry appears there at all). For coding, Google publishes LiveCodeBench v6 (80.0%) and Codeforces ELO (2150) on the model card; the DeepMind table also lists MMMU Pro (76.9%, multimodal) and τ2-bench Retail (86.4%, agentic tool use). LCB v6 measures competitive programming; it is not the same distribution as SWE-V. Use the LCB proxy directionally.
Vals.ai runs GPT-5.4 mini on the same 500-task Verified subset using mini-SWE-agent, the standard bash-only scaffold that the SWE-bench leaderboard uses for direct LM comparison. OpenAI publishes SWE-Bench Pro (54.4%) for mini — a different suite with different tasks. The Vals.ai 73.0% is methodologically comparable to the other Verified numbers in this table; OpenAI's own Pro figure is not. We prefer a same-suite third-party run over a first-party different-suite number.
Haiku 4.5 scores 73.3% SWE-bench Verified (Anthropic, 50 trials, 128K thinking budget) — essentially identical to MiMo on this benchmark. It is not yet in the critique.sh catalog, so there is no credit floor to compare. When we add it, this table will update.
MiniMax prices their API very aggressively. At 1cr on critique.sh it still gives an 80.2 price-to-performance index — 2× better than Qwen at 2cr, and 4× better than Gemini 3 Flash at 4cr. MiniMax is a strong value pick if you want to stay under 2cr and want a first-rate SWE-V score. The 0.5cr models still lead on the efficiency chart because the denominator is half.
SWE-bench Verified is 500 human-validated instances from the original SWE-bench dataset — the standard leaderboard format since 2024. SWE-Bench Pro is a newer, harder OpenAI-defined suite. They are not comparable; a model can score 54.4% on Pro and 73%+ on Verified because the task distributions differ. This is why we use Verified as our single ruler — it has the most cross-vendor coverage.

Update your repository policy

Gemma-4-31B and MiMo-V2-Flash are live now. Head to your dashboard to set them as your default lead or specialist models — or use the Installation Policy to make them the global default across all your repos.

Go to Dashboard →
