Welcome Gemma-4-31B and MiMo-V2-Flash: The Permanent 0.5cr Review Tier
Blazing fast, super smart, and permanently priced at 0.5 credits. Why our new entry-level models beat GPT-5 and Claude 4.5 Sonnet on price-performance in high-volume code review.

The new floor.
0.5 credits · Gemma & MiMo · permanent
Two engines for the agentic stack — now on critique.sh
The biggest barrier to shipping AI review on every PR has always been compute cost. Today we remove it. Gemma-4-31B and MiMo-V2-Flash are permanent at 0.5 credits — not a promo, not a trial, not a degraded tier. These are production-grade frontier-adjacent models with benchmarks that embarrass models that cost 20× more.
Gemma-4-31B is Google DeepMind's open-weight flagship, distilled from the same research lineage as Gemini 3. It is built from the ground up for instruction-following, structured output, and agentic tool use — the exact profile critique.sh routes need.
- Apache 2.0 open weights — audit, fine-tune, or self-host without vendor lock-in. Your model strategy isn't a hostage negotiation.
- 256K token context with native multimodal input (text + images) — send full PR diffs, stack traces, and design screenshots in a single pass without chunking.
- Native function calling and system-prompt support baked into training — structured JSON critique output, tool-use chaining, and repo-aware routing without prompt-engineering workarounds.
- LiveCodeBench v6: 80.0% — stronger than most closed models at 5–10× the credit cost. Scored 2150 Codeforces ELO on competitive programming. This is not a cheap fallback; this is a capable engine priced deliberately low.
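Native function calling is what makes "structured JSON critique output" more than a slogan: the model emits tool arguments as strict JSON, so findings parse without regex scraping. A minimal sketch of what that contract could look like — the tool name and fields below are illustrative, not the critique.sh schema:

```python
import json

# Hypothetical tool definition a reviewer route might register.
# Names and fields are our illustration, not a published API.
critique_tool = {
    "name": "submit_critique",
    "parameters": {
        "type": "object",
        "required": ["severity", "file", "line", "finding"],
        "properties": {
            "severity": {"enum": ["info", "warn", "block"]},
            "file": {"type": "string"},
            "line": {"type": "integer"},
            "finding": {"type": "string"},
        },
    },
}

# A function-calling model returns tool arguments as strict JSON:
raw = '{"severity": "warn", "file": "api/auth.py", "line": 42, "finding": "token never expires"}'
finding = json.loads(raw)

# Every required field is present — no prompt-engineering workarounds needed.
assert set(critique_tool["parameters"]["required"]) <= finding.keys()
print(finding["severity"], finding["file"])  # warn api/auth.py
```

The point of the schema is that a malformed response fails loudly at parse time instead of leaking a half-formed review comment into a PR.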
Benchmarks
MiMo-V2-Flash is Xiaomi's reasoning-optimized MoE, trained specifically for software engineering and agentic tasks. The architecture is engineered for throughput-heavy loops: big context, fast decoding, low active-param overhead. It hit #1 on SWE-bench Verified among open models on release.
- 309B total / ~15B active parameters via sparse MoE routing — you get frontier-scale capacity with the inference cost of a mid-size dense model. That gap is why it runs at 0.5cr.
- 5:1 hybrid attention (sliding window + global) cuts KV-cache memory by ~6× versus full attention, enabling 256K context on large codebases without the infrastructure overhead.
- Multi-token prediction (MTP) enables self-speculative decoding — Xiaomi reports 2–2.6× decode speedups in production, which is why high-volume PR queues drain fast. Trained on 27T tokens with FP8 mixed precision.
- Post-trained with multi-teacher on-policy distillation and large-scale agentic RL — the model is shaped for real software workflows, not just benchmark overfitting. SWE-bench Multilingual #1 at release.
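The KV-cache claim can be sanity-checked with back-of-envelope arithmetic. The sketch below assumes a repeating block of five sliding-window layers per global layer and a 4K window — our assumptions for illustration, not published specs:

```python
# Rough KV-cache comparison: full attention vs 5:1 hybrid attention.
# Window size and layer pattern are assumptions, not vendor numbers.

def kv_cache_ratio(context_len: int, window: int, sliding_per_global: int = 5) -> float:
    """KV entries cached under full attention divided by the hybrid scheme,
    per repeating block of (sliding_per_global + 1) layers."""
    layers = sliding_per_global + 1
    full = layers * context_len                              # every layer caches the full context
    hybrid = sliding_per_global * min(window, context_len) + context_len
    return full / hybrid

ratio = kv_cache_ratio(context_len=256_000, window=4_096)
print(f"{ratio:.1f}x")  # 5.6x — in line with the ~6x claim at 256K context
```

At short contexts the two schemes converge (the window never truncates anything); the savings only materialize at the 256K scale where full-repo review actually lives.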
Benchmarks
Why this changes your PR routing
The classic triage problem: teams skip AI review on small PRs, hotfix branches, or draft PRs because the cost adds up. At 0.5cr per run, that calculation disappears. You can set Gemma-4-31B or MiMo-Flash as your always-on lead on every repository — every commit, every branch, every author — and reserve your heavier models for the architectural PRs that actually need GLM-5.1 or Opus-class depth. This is what 100% PR coverage actually looks like.
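The routing logic described above — a 0.5cr always-on lead with deliberate escalation — can be sketched as a policy function. The model names come from this post; the fields and thresholds are hypothetical, not the critique.sh policy API:

```python
# Hypothetical routing sketch. Field names and thresholds are illustrative.

def pick_reviewer(pr: dict) -> str:
    """Always-on 0.5cr baseline; escalate only when the merge gate demands it."""
    if pr.get("security_sensitive") or pr.get("touches_architecture"):
        return "GLM-5.1"            # 4cr: long-horizon, multi-file depth
    if pr.get("files_changed", 0) > 30:
        return "MiniMax-M2.5"       # 1cr: Sonnet-class SWE-V for big diffs
    return "MiMo-V2-Flash"          # 0.5cr: every commit, every branch

print(pick_reviewer({"files_changed": 3}))                  # MiMo-V2-Flash
print(pick_reviewer({"touches_architecture": True}))        # GLM-5.1
```

Note the shape of the policy: the baseline is the fall-through, not a special case. Escalation is the exception you write down, which is exactly the inversion the 0.5cr tier is meant to enable.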
GLM-5.1 is Z.AI's flagship model, built specifically for long-horizon agentic engineering. Not "agentic" as a marketing adjective — it is architecturally designed to sustain productive execution across hundreds of iterations, thousands of tool calls, and sessions that run for hours without losing coherence or drifting off-strategy. It scored #1 globally on SWE-Bench Pro, beating GPT-5.4 (57.7) and Claude Opus 4.6 (57.3) outright.
- SWE-Bench Pro: 58.4 — the hardest real-world software engineering benchmark. GLM-5.1 leads every model that existed at launch, open or closed. Multi-file tracing, cross-module reasoning, and architecture critique are where it separates.
- Terminal-Bench 2.0: 63.5 and NL2Repo: 42.7 — measures how well a model executes real multi-step terminal tasks and navigates unfamiliar repositories. Exactly what critique.sh routes need when depth matters more than speed.
- Long-horizon stamina — demonstrated 8-hour autonomous execution without strategy drift. On critique.sh this translates to deep architectural reviews that don't miss the third-order implication buried in file 17.
- At 4cr on critique.sh, it's the upgrade path when the merge gate matters — before burning Opus or GPT-5.4 Pro credits on every PR in the queue. Frontier judgment, not frontier pricing on everything.
Benchmarks
MiniMax-M2.5 returns to the catalog at 1cr — half the price of M2.7 (2cr). It scored 80.2% on SWE-bench Verified: Claude Sonnet-class benchmark performance at 1cr, a 22× saving versus running Sonnet on every PR.
Re-introduced to fill the gap between Codex Max (14cr) and GPT-5.4 (20cr) — for teams that want Codex-family algorithmic depth beyond what Max provides, without paying full 5.4 prices on every run. It is the in-between that most routing strategies were missing.
Frontier-class at 16cr — cheaper than GPT-5.3 Codex (18cr), GPT-5.4 (20cr), and Sonnet 4.6 (22cr). It pairs a native 1M-token context window with full multimodal input (text, image, video, audio, PDF) and a configurable thinking budget, and posts 80.6% SWE-bench Verified and 94.3% GPQA Diamond — the highest GPQA score in the catalog.
Also in catalog: MiMo-V2-Pro and the rest of the LobeHub lineup — select any lead or specialist in your installation or repo policy.
All models route through critique.sh in the same control plane
0.5cr for volume, 1cr for Sonnet-class SWE-V, 4cr for frontier depth, 16cr for 1M-context Gemini — one policy file controls the whole stack.
Here is the psychological trap most teams fall into without realising it: loss aversion meets mental accounting. When every review costs 14–22 credits, your brain quietly starts skipping "small" PRs, draft branches, and dependency bumps — even though those are exactly where regressions slip through. Flagship pricing anchors you to selective review. We are resetting the anchor. At 0.5cr, the default flips from "which PRs deserve AI?" to "why would we ever turn review off?"
SWE-bench Verified: the single ruler we use for every model
SWE-bench Verified is 500 human-validated real GitHub issues. It is the closest thing to a shared ruler across OpenAI, Anthropic, Google, Xiaomi, and the open-weight stack. We use it for every model in the comparison — with two honest notes. First, Google does not publish a SWE-bench Verified score for Gemma 4 31B IT; the closest official coding signal is LiveCodeBench v6 at 80.0%, which we include as a labelled proxy. Second, OpenAI only publishes SWE-Bench Pro for GPT-5.4 mini — a different suite. The independent Vals.ai lab ran GPT-5.4 mini on the standard 500-task Verified set using mini-SWE-agent and published 73.0% ± 1.99; we use that. Everything else is each vendor's own stated SWE-bench Verified figure.
Primary benchmark: SWE-bench Verified (500-task, human-validated). Gemma 4 uses LiveCodeBench v6 (no official SWE-V published). GPT-5.4 mini uses Vals.ai's standard 500-task Verified run (OpenAI only publishes SWE-Bench Pro for this model). Haiku 4.5 has no critique.sh catalog row yet.
| Model | SWE-bench Verified | Benchmark source | critique.sh cr |
|---|---|---|---|
| MiMo-V2-Flash | 73.4% | Xiaomi tech report (arXiv 2601.02780) | 0.5 cr |
| Gemma-4-31B IT | 80.0% ★ | ★ LiveCodeBench v6 — SWE-V not published by Google | 0.5 cr |
| MiniMax-M2.5 | 80.2% | MiniMax Hugging Face model card | 1 cr |
| Qwen3.5-27B | 72.4% | Qwen Hugging Face model card | 2 cr |
| Gemini 3 Flash | 78.0% | Google DeepMind — single attempt, avg 5 runs | 4 cr |
| Claude Haiku 4.5 | 73.3% | Anthropic (50 trials, 128K thinking budget) | — (not in catalog) |
| GPT-5.4 Mini | 73.0% | Vals.ai (mini-SWE-agent, 500-task Verified set) † | 6 cr |
| GPT-5.1 Codex Max | 77.9% | OpenAI system card (xhigh reasoning effort) | 14 cr |
| Claude Sonnet 4.6 | 79.6% | Anthropic system card (avg 10 trials) | 22 cr |
★ Gemma 4 31B IT: Google's Gemma 4 DeepMind table and Hugging Face card report LiveCodeBench v6 (80.0%), Codeforces ELO (2150), MMMU Pro (76.9%, multimodal), and τ2-bench Retail (86.4%, agentic tool use) — not SWE-bench Verified. LCB v6 is competitive programming, a different task distribution from SWE-V; use the LCB number directionally.

† GPT-5.4 mini: OpenAI only reports "SWE-Bench Pro (public)" at 54.4% for this model. The 73.0% figure is Vals.ai's independent run on the same 500-task Verified subset using mini-SWE-agent (vals.ai/models/openai_gpt-5.4-mini-2026-03-17).

Claude Sonnet 4.5 scores 77.2% SWE-V; we list 4.6 to match the catalog SKU. Haiku 4.5 is quoted from Anthropic's announcement; it is not yet available on critique.sh.
Price-to-performance: benchmark per credit
The efficiency ratio below is simple: SWE-bench Verified % divided by the critique.sh credit floor. Higher means more benchmark signal per credit you spend on review. The chart makes one thing immediately obvious — the gap between the 0.5cr models and the next tier is not incremental. It is structural.
Higher = more benchmark signal per credit. Gemma uses LiveCodeBench v6 (80.0%) — see table note ★. GPT-5.4 mini uses Vals.ai Verified run (73.0%) — see note †. Haiku 4.5 excluded (no catalog price).
Formula: SWE-bench Verified % ÷ critique.sh credit floor. MiMo: 73.4 ÷ 0.5 = 146.8. Gemma: 80.0 (LCB) ÷ 0.5 = 160. MiniMax: 80.2 ÷ 1 = 80.2. Qwen: 72.4 ÷ 2 = 36.2. Gemini 3 Flash: 78.0 ÷ 4 = 19.5. GPT-5.4 mini: 73.0 ÷ 6 = 12.2. Codex Max: 77.9 ÷ 14 = 5.6. Sonnet 4.6: 79.6 ÷ 22 = 3.6.
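The ratios above can be reproduced in a few lines. Scores and prices are taken from the table; Gemma's figure is the LiveCodeBench v6 proxy flagged in note ★:

```python
# Benchmark-per-credit: SWE-bench Verified % (or labelled proxy) / credit floor.
scores = {
    "MiMo-V2-Flash":     (73.4, 0.5),
    "Gemma-4-31B IT":    (80.0, 0.5),   # LiveCodeBench v6 proxy, see note ★
    "MiniMax-M2.5":      (80.2, 1),
    "Qwen3.5-27B":       (72.4, 2),
    "Gemini 3 Flash":    (78.0, 4),
    "GPT-5.4 Mini":      (73.0, 6),     # Vals.ai Verified run, see note †
    "GPT-5.1 Codex Max": (77.9, 14),
    "Claude Sonnet 4.6": (79.6, 22),
}

for model, (pct, credits) in scores.items():
    print(f"{model}: {pct / credits:.1f}")
```

Dividing by the credit floor is a deliberate choice: it rewards cheap models linearly, which is the honest shape of the trade-off when you run a reviewer on every PR rather than once.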
Look at what the chart is actually saying: Sonnet 4.6 at 79.6% SWE-V is a better model than MiMo at 73.4% in absolute terms. But you get 40× the benchmark-per-credit from MiMo. The question is not "which model is smarter?" It is "what is the right model for this PR, given how often you are running it?" For 90% of the PR queue, volume-tier review is not a compromise — it is the strategically correct choice.
What the absolute scores tell you — and what they do not
The raw SWE-bench Verified column is tighter than the chart makes it look. MiMo (73.4%), Haiku (73.3%), Qwen (72.4%), and GPT-5.4 mini (73.0% on the Vals harness) sit within about one percentage point of each other on the same benchmark. Sonnet 4.6 (79.6%), Codex Max (77.9%), Gemini 3 Flash (78.0%), and MiniMax (80.2%) cluster in the high 70s to low 80s. The "expensive = smarter" framing holds — but only weakly and only in absolute terms. The absolute gap between MiMo and Sonnet 4.6 is about 6 points on 500 tasks. The credit gap is 44×. That ratio does not survive scrutiny as a justification for selective review.
Where GLM-5.1 fits in the story
GLM-5.1 is not in the SWE-bench Verified table above — Z.AI reports on SWE-Bench Pro (58.4, #1 globally on that suite), Terminal-Bench 2.0 (63.5), and NL2Repo (42.7). Those benchmarks are designed for long-horizon agentic sessions: multi-step terminal tasks, repository navigation, and hours-long autonomous execution. That is the right profile for merge-gate reviews on security-critical or architecture-changing PRs, which is exactly where critique.sh routes it at 4cr.
Why this changes your entire routing strategy
The classic triage habit is a form of status-quo bias: the safe default feels like the expensive default, because that is where teams start. But status-quo bias runs both ways. Once you establish 0.5cr review as the default — on every PR, every branch, every author — the psychological load of "is this PR worth reviewing?" disappears. You are left with a simpler, better question: "does this specific merge gate need an upgrade to GLM-5.1 or Sonnet?" That escalation is opt-in, deliberate, and justified. The baseline is already on.
- MiMo-V2-Flash: 73.4% SWE-bench Verified, 309B/15B MoE, 256K context — structured review at scale
- Gemma-4-31B: 80.0% LiveCodeBench v6, Apache 2.0, native multimodal (images + text in one pass)
- Run on every PR, every branch, every author — no triage story, no compute tax
- Together they cover the full throughput surface; pick per-repo based on your policy
- GLM-5.1: #1 SWE-Bench Pro (58.4), Terminal-Bench 2.0 (63.5) — long-horizon, multi-file depth
- GPT-5.1 Codex Max: 77.9% SWE-V — when you need deep Codex-family algorithmic critique
- Claude Sonnet 4.6: 79.6% SWE-V — when vendor alignment or long context chat matters
- Keep flagships for merges that justify the 40× multiple over the baseline
Update your repository policy
Gemma-4-31B and MiMo-V2-Flash are live now. Head to your dashboard to set them as your default lead or specialist models — or use the Installation Policy to make them the global default across all your repos.
Go to Dashboard →