Xiaomi MiMo-V2-Flash & Pro: What a Phone Company’s LLMs Mean for AI Code Review
Deep dive on architecture, coding benchmarks, aggressive API pricing, and why MiMo-V2-Flash and MiMo-V2-Pro are rolling out across every critique.sh tier — including how a 1M-token window changes repo-aware critique and Remedy-style fixes.
Two engines for the agentic stack — now on critique.sh
The biggest barrier to shipping AI review on every PR has always been compute cost. Today we remove it. Gemma-4-31B and MiMo-V2-Flash are permanent at 0.5 credits — not a promo, not a trial, not a degraded tier. These are production-grade, frontier-adjacent models with benchmark scores that embarrass models costing 20× more.
Gemma-4-31B is Google DeepMind's open-weight flagship, distilled from the same research lineage as Gemini 3. It is built from the ground up for instruction-following, structured output, and agentic tool use — the exact profile critique.sh routes need.
- Apache 2.0 open weights — audit, fine-tune, or self-host without vendor lock-in. Your model strategy isn't a hostage negotiation.
- 256K token context with native multimodal input (text + images) — send full PR diffs, stack traces, and design screenshots in a single pass without chunking.
- Native function calling and system-prompt support baked into training — structured JSON critique output (see the sketch after this list), tool-use chaining, and repo-aware routing without prompt-engineering workarounds.
- LiveCodeBench v6: 80.0% — stronger than most closed models at 5–10× the credit cost. Scored a 2150 Codeforces Elo on competitive programming. This is not a cheap fallback; this is a capable engine priced deliberately low.
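As an illustration of what structured JSON critique output can look like in practice, here is a minimal sketch of a single finding. The field names and the file path are hypothetical, not critique.sh's actual schema:

```python
# Hypothetical shape of one structured critique finding, shown as a Python
# dict. Field names and the file path are illustrative assumptions, not the
# real critique.sh output schema.
finding = {
    "severity": "high",                     # e.g. low / medium / high
    "category": "concurrency",
    "file": "services/billing/worker.py",   # hypothetical path
    "lines": [112, 131],
    "claim": "Retry loop can double-charge when the ack races the timeout.",
    "evidence": ["idempotency key unset on retry", "no fencing token"],
    "suggested_fix": "Persist the idempotency key before the first attempt.",
}
```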
Benchmarks
MiMo-V2-Flash is Xiaomi's reasoning-optimized MoE, trained specifically for software engineering and agentic tasks. The architecture is engineered for throughput-heavy loops: big context, fast decoding, low active-param overhead. It hit #1 on SWE-bench Verified among open models at release.
- 309B total / ~15B active parameters via sparse MoE routing — you get frontier-scale capacity with the inference cost of a mid-size dense model. That gap is why it runs at 0.5cr.
- 5:1 hybrid attention (sliding window + global) cuts KV-cache memory by ~6× versus full attention, enabling 256K context on large codebases without the infrastructure overhead.
- Multi-token prediction (MTP) enables self-speculative decoding — Xiaomi reports 2–2.6× decode speedups in production, which is why high-volume PR queues drain fast. Trained on 27T tokens with FP8 mixed precision.
- Post-trained with multi-teacher on-policy distillation and large-scale agentic RL — the model is shaped for real software workflows, not just benchmark overfitting. SWE-bench Multilingual #1 at release.
Benchmarks
Why this changes your PR routing
The classic triage problem: teams skip AI review on small PRs, hotfix branches, or draft PRs because the cost adds up. At 0.5cr per run, that calculation disappears. You can set Gemma-4-31B or MiMo-Flash as your always-on lead on every repository — every commit, every branch, every author — and reserve your heavier models for the architectural PRs that actually need GLM-5.1 or Opus-class depth. This is what 100% PR coverage actually looks like.
GLM-5.1 is Z.AI's flagship model, built specifically for long-horizon agentic engineering. Not "agentic" as a marketing adjective — it is architecturally designed to sustain productive execution across hundreds of iterations, thousands of tool calls, and sessions that run for hours without losing coherence or drifting off-strategy. It scored #1 globally on SWE-Bench Pro, beating GPT-5.4 (57.7) and Claude Opus 4.6 (57.3) outright.
- SWE-Bench Pro: 58.4 — the hardest real-world software engineering benchmark. GLM-5.1 leads every model that existed at launch, open or closed. Multi-file tracing, cross-module reasoning, and architecture critique are where it separates.
- Terminal-Bench 2.0: 63.5 and NL2Repo: 42.7 — together these measure how well a model executes real multi-step terminal tasks and navigates unfamiliar repositories. Exactly what critique.sh routes need when depth matters more than speed.
- Long-horizon stamina — demonstrated 8-hour autonomous execution without strategy drift. On critique.sh this translates to deep architectural reviews that don't miss the third-order implication buried in file 17.
- At 4cr on critique.sh, it's the upgrade path when the merge gate matters — before burning Opus or GPT-5.4 Pro credits on every PR in the queue. Frontier judgment, not frontier pricing on everything.
Benchmarks
MiniMax-M2.5 — re-introduced at 1cr, half the price of M2.7 (2cr). It scored 80.2% on SWE-bench Verified: Claude Sonnet-class benchmark performance at 1cr, a 22× saving versus running Sonnet on every PR.
Re-introduced to fill the gap between Codex Max (14cr) and GPT-5.4 (20cr). For teams who want Codex-family algorithmic depth beyond what Max provides, without paying full 5.4 prices on every run. The in-between that most routing strategies were missing.
Frontier-class at 16cr — cheaper than GPT-5.3 Codex (18cr), GPT-5.4 (20cr), and Sonnet 4.6 (22cr). A native 1M-token context window, multimodal input across text, image, video, audio, and PDF, and a configurable thinking budget. 80.6% SWE-bench Verified and 94.3% GPQA Diamond — the highest GPQA score in the catalog.
Also in catalog: MiMo-V2-Pro and the rest of the LobeHub lineup — select any lead or specialist in your installation or repo policy.
All models route through critique.sh in the same control plane
0.5cr for volume, 1cr for Sonnet-class SWE-bench Verified performance, 4cr for frontier depth, 16cr for 1M-context Gemini — one policy file controls the whole stack.
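A hedged sketch of what such a policy could express, written here as a Python dict. The field names, thresholds, and model slugs are assumptions for illustration, not the actual critique.sh policy format:

```python
# Hypothetical tiered routing policy. Field names, thresholds, and model
# slugs are illustrative assumptions, not the real critique.sh policy format.
ROUTING_POLICY = {
    "default_lead": "mimo-v2-flash",   # 0.5cr: always-on, every PR
    "specialists": ["gemma-4-31b"],    # 0.5cr: structured-output volume passes
    "escalations": [
        # Architectural PRs get frontier depth instead of the cheap default.
        {"when": {"min_files_changed": 20}, "lead": "glm-5.1"},        # 4cr
        # Evidence packs near a quarter-million tokens need the 1M window.
        {"when": {"min_evidence_tokens": 240_000}, "lead": "gemini"},  # 16cr
    ],
}
```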
If you still think of Xiaomi as “the phone company with aggressive hardware margins,” it is time to update the mental model. Over the last eighteen months they have moved hard into foundation models, agentic APIs, and developer-facing pricing that deliberately undercuts Western defaults. The MiMo-V2 line — especially Flash and Pro — is not a curiosity from a consumer brand. It is a credible part of the global model stack, and it intersects directly with what we build at critique.sh: AI that critiques AI-written code, explains failures, and (through Remedy) can patch verified issues in isolation.
This essay is long because the stack deserves it. We will cover why Xiaomi showing up in the LLM race is surprising only on the surface, how Flash and Pro differ architecturally, what public benchmarks say about coding ability, how API pricing compares to Claude-class models, and — most importantly — what any of that means when your organisation is drowning in pull requests and needs a merge gate that actually scales.
PART ONE
Why Xiaomi in the AI Race Is a Plot Twist — and Why It Isn’t
Xiaomi built its reputation on supply-chain execution: flagship specs at mid-market prices, ecosystem cross-sell, and relentless iteration. That playbook maps surprisingly well onto today’s inference market. Training frontier-class models is capital intensive, but the long-run moat is just as much about cost per token, latency, and distribution. Xiaomi already operates global cloud relationships, consumer distribution in the hundreds of millions, and deep ties into Chinese enterprise software — WPS Office and adjacent productivity surfaces are natural places to ship model-backed features.
The “surprise” is cultural more than technical. Western narratives often slot Chinese labs into a short list of names; Xiaomi was not, until recently, on the developer mindshare leaderboard the way a Qwen or DeepSeek might be. MiMo-V2 changes that: competitive coding scores, million-token context on Pro, and API pricing that reads like a deliberate shot across the bow of premium Western APIs. Whether you trust the marketing numbers or not, the strategic intent is clear — Xiaomi intends to be a model provider, not only a handset OEM.
PART TWO
Timeline: From MiMo-7B to MiMo-V2
MiMo-7B, the early public footprint: Xiaomi signals serious intent with open-weight, developer-credible releases rather than chat-only demos.
MiMo-V2-Flash sets the orientation: speed, cost, and practical coding benchmarks — the kind of profile that maps to high-volume critique loops.
MiMo-V2-Pro pushes context and agentic stability; Omni and TTS widen multimodal and voice surfaces (this piece focuses on Flash + Pro for code).
Vendor timelines are always partly narrative. The useful part for engineering leaders is directional: Xiaomi is shipping on cadence, not standing still. For a review platform, that matters because model drift and capability jumps show up in customer expectations before they show up in procurement spreadsheets.
PART THREE
Architecture: Hybrid Attention, MTP, and What Actually Changes at Inference
Both Flash and Pro advertise hybrid attention — a sliding window paired with full-attention regions and a learnable “attention sink” bias so distant tokens are not silently dropped. Flash uses a 5:1 ratio (window versus full); Pro tightens toward 7:1 to favour even longer coherent runs. The practical effect is reduced KV-cache pressure: Xiaomi claims roughly ~6× KV reduction versus naive full attention at comparable quality targets, which is the sort of engineering that turns “1M context” from a slide-deck bullet into something economically runnable.
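To make the ~6× figure concrete, here is a back-of-envelope sketch. The layer count, window size, head dimensions, and FP8-sized cache entries below are our assumptions, not Xiaomi's disclosed configuration:

```python
# Back-of-envelope KV-cache sizing: full attention versus a 5:1 hybrid.
# All shapes below are illustrative assumptions, not Xiaomi's disclosed config.
def kv_bytes(seq_len, n_layers, n_kv_heads=8, head_dim=128, dtype_bytes=1):
    """Bytes of K+V cache when every layer attends over seq_len tokens."""
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * dtype_bytes

def hybrid_kv_bytes(seq_len, n_layers, window=4096, ratio=(5, 1), **kw):
    """Windowed layers cap their cache at `window`; full layers cache it all."""
    win_layers = n_layers * ratio[0] // sum(ratio)
    full_layers = n_layers - win_layers
    return (kv_bytes(min(seq_len, window), win_layers, **kw)
            + kv_bytes(seq_len, full_layers, **kw))

seq, layers = 256_000, 60
full, hybrid = kv_bytes(seq, layers), hybrid_kv_bytes(seq, layers)
print(f"full: {full / 2**30:.1f} GiB, hybrid: {hybrid / 2**30:.1f} GiB, "
      f"saving: {full / hybrid:.1f}x")  # ~5.6x with these assumptions
```

With these placeholder shapes the saving lands near the quoted ~6×, and tightening the ratio toward 7:1 (the Pro setting) pushes it further.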
Multi-token prediction (MTP) is the other headline. A lightweight block (~0.33B parameters per block in public materials) predicts multiple future tokens in parallel using dense FFNs, which can materially raise tokens-per-second during decoding and shorten reinforcement-learning rollouts. For critique.sh, throughput is not vanity — it is how many PRs per hour you can run through Scout mapping, specialist passes, and lead synthesis without queuing.
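A minimal sketch of the draft-and-verify loop behind self-speculative decoding; `draft_step` and `verify_step` are interface assumptions standing in for the MTP block and the main forward pass, not Xiaomi's actual API:

```python
# Self-speculative decoding, schematically. `draft_step` (the cheap MTP head)
# and `verify_step` (one main-model pass) are interface assumptions.
def speculative_generate(prompt, draft_step, verify_step, max_new=256, k=4):
    seq = list(prompt)
    target = len(prompt) + max_new
    while len(seq) < target:
        draft = draft_step(seq, k)          # k guessed tokens, nearly free
        accepted = verify_step(seq, draft)  # agreeing prefix + 1 correction
        seq += accepted                     # always >= 1 token per main pass
    return seq[:target]
```

If the head's per-token acceptance rate is around 70%, the expected yield per verify pass is (1 - p^(k+1)) / (1 - p) ≈ 2.8 tokens at k = 4, which is the regime where reported 2–2.6× wall-clock speedups become plausible.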
PART FOUR
Coding Capability: Benchmarks With All the Usual Caveats
Benchmarks are gamed, leaked, and fine-printed. They are still the only public lingua franca when comparing models you have not yet run on your own monorepo. The charts below use vendor-reported or widely cited figures; where a number is approximate or contested, we say so in the footnotes.
Verified / comparable software-engineering repair scores where public numbers exist. Higher is better.
Pro score is an approximate placement based on vendor claims of beating Sonnet-class models where full public leaderboard entries may still be pending. Treat it as directional, not contractual.
Single-model spotlight: MiMo-V2-Flash has been quoted at mid-90s percentages on AIME 2025 — useful as a signal of chain-of-thought reliability, even though math ≠ shipping production code.
The lower rows are illustrative tiers, not specific competitor call-outs. The point is separation: Flash is not only “cheap,” it is sharp on structured reasoning tasks.
For agentic coding, Xiaomi also publicises ClawEval-style scores for Pro in the mid-70s — territory that overlaps top closed models on tool-heavy tasks. That is the axis most relevant to critique.sh: not “can it write a LeetCode solution,” but “can it keep a tool contract stable across a long episode when repository state is messy?”
Where Flash fits:
- Throughput-bound workloads: many small reviews, fast reruns, tight CI coupling
- Cost-proportional coverage: running specialists on every PR without exploding budget
- Self-host or air-gapped paths thanks to MIT weights
- Mathematically structured failures (numerical bugs, algorithmic regressions)
Where Pro fits:
- Million-token windows for whole-service incident postmortems plus code
- Long-horizon tool use where context rot usually kills cheaper models
- Lead-oracle style synthesis over large evidence packs from Scout
- Cross-module refactors where the diff is tiny but the blast radius is not
PART FIVE
Speed: Tokens per Second as a First-Class Metric
Approximate vendor / market quotes for high-throughput decoding. Actual rates depend on batching, region, and precision.
The Pro figure is an estimate, since public disclosures emphasise quality and stability over peak tok/s. For batch review, Flash is the hammer; Pro is the magnifying glass.
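Tokens per second converts directly into queue-drain time. A toy calculation, with backlog size, critique length, and decode rates as placeholder assumptions rather than measurements:

```python
# How long a review backlog takes to decode at a given tok/s, single stream.
# All numbers are placeholder assumptions; batching shortens this further.
def drain_hours(n_prs, tokens_per_critique, tokens_per_second):
    return n_prs * tokens_per_critique / tokens_per_second / 3600

backlog, critique_len = 400, 3_000  # PRs in queue, output tokens per review
for name, tps in [("flash-class", 150), ("pro-class", 60)]:
    print(f"{name}: {drain_hours(backlog, critique_len, tps):.1f} h")
```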
PART SIX
Context Windows: Why 1M Tokens Matters for Critique — Not Chat
Most “code assistants” live in a few thousand tokens of editor context. Real review is different. Scout-style exploration pulls in surrounding modules, historical incidents, configuration matrices, and sometimes entire service graphs. When the lead model can hold a million tokens in attention budget, you stop pretending that “just the diff” is sufficient for high-stakes merges.
Approximate advertised maxima; effective usable context is always lower once system prompts, tools, and safety overhead reserve space.
1024K = ~1,048,576 tokens for Pro in Xiaomi’s public materials. Your orchestration should still segment work: context is necessary, not sufficient, for correctness.
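The caveat about effective usable context is easy to make concrete; every reservation below is an assumption you should replace with your own measurements:

```python
# Advertised window minus fixed overheads = what Scout can actually pack.
# Reservation sizes are illustrative assumptions.
ADVERTISED = 1_048_576  # the "1024K" Pro window
overheads = {
    "system_prompt": 6_000,
    "tool_schemas": 10_000,
    "safety_and_routing": 8_000,
    "output_reservation": 32_000,  # leave room for the critique itself
}
usable = ADVERTISED - sum(overheads.values())
print(f"usable input budget: {usable:,} tokens "
      f"({usable / ADVERTISED:.0%} of advertised)")
```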
PART SEVEN
Pricing: How MiMo Undercuts Premium Western APIs
List prices move weekly. The shapes matter more than the last decimal. Flash is priced like a commodity inference product: sub-dollar per million tokens in both directions. Pro introduces tiered context bands — once you leave the first quarter-million tokens, input and output rates step up, reflecting the real cost of KV memory and attention compute.
Normalised to MiMo-V2-Flash input list pricing (~$0.09 / MTok). Higher means more expensive per million input tokens.
Indexed from illustrative list prices: Flash ~$0.09, Pro ~$1.00, Sonnet ~$3.00, Opus ~$5.00 per million input tokens. Confirm live rates on your provider before budgeting.
Same indexing against Flash output (~$0.29 / MTok). Output pricing dominates long-form critique and remediation plans.
Pro banded pricing (~$3 output ≤256K, higher beyond) is simplified here to the entry band. Ultra-long contexts are intentionally premium everywhere — Xiaomi is no exception.
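To see how banded pricing bites on a large evidence pack, here is a toy cost function using the illustrative rates above. The above-band Pro input rate is our assumption, so confirm live pricing before budgeting:

```python
# Toy banded-pricing comparison on input tokens. Rates in $ per million
# tokens: Flash ~$0.09 and Pro entry band ~$1.00 are the illustrative list
# prices quoted above; the above-band Pro rate of $2.00 is an assumption.
def pro_input_cost(tokens, band_edge=256_000, rate_low=1.00, rate_high=2.00):
    low, high = min(tokens, band_edge), max(tokens - band_edge, 0)
    return (low * rate_low + high * rate_high) / 1e6

def flash_input_cost(tokens, rate=0.09):
    # Flash tops out at 256K context, so a pack this size would need chunking.
    return tokens * rate / 1e6

pack = 600_000  # a large Scout evidence pack
print(f"Flash (chunked): ${flash_input_cost(pack):.2f}  "
      f"Pro (one pass): ${pro_input_cost(pack):.2f}")
```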
PART EIGHT
How This Maps to critique.sh: Critique, Fix, and the Merge Gate
We did not write this essay to celebrate a handset manufacturer. We wrote it because the economics of automated review flipped when models like Flash appeared: you can afford to run repo-aware pipelines on every pull request, not only on “critical” paths. MiMo-V2-Flash is the natural home for high-volume specialist passes — fast JSON findings, quick rechecks after Remedy pushes a fix, and tight loops in CI. MiMo-V2-Pro is the right escalation target when Scout returns an enormous evidence pack or when the failure mode is cross-module and subtle.
Remedy — our isolated, two-loop-limited repair agent — benefits twice. First, cheaper critique means more issues caught early. Second, when a fix requires reconciling logs, stack traces, and multiple packages, Pro-sized context reduces the failure mode where the model “forgets” the upstream constraint half an hour into the episode.
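In pipeline terms, that escalation reads like the following hedged sketch; the predicate names and thresholds are assumptions for illustration, not critique.sh internals:

```python
# Illustrative Flash->Pro escalation. Signals and thresholds are assumptions.
def pick_model(evidence_tokens, files_touched, remedy_recheck=False):
    if remedy_recheck:                  # fast loop after a Remedy patch lands
        return "mimo-v2-flash"
    if evidence_tokens > 200_000 or files_touched > 25:
        return "mimo-v2-pro"            # cross-module or huge evidence pack
    return "mimo-v2-flash"              # default: cheap, fast, always-on

assert pick_model(15_000, 3) == "mimo-v2-flash"
assert pick_model(450_000, 40) == "mimo-v2-pro"
```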
PART NINE
Honest Limitations
Flash is not magic. Teams report instruction-following drift on baroque, multi-step creative prompts — irrelevant for our domain, but a reminder that model choice should track task shape. Tool calling on Flash is good enough for many workflows but still behind premium closed models on adversarial tool schemas; that is exactly why we keep a portfolio and let credits route risk-proportionally.
Pro is API-first. If your compliance story requires weights on your own metal, Flash (MIT) is the lever; Pro is for when cloud inference and contractual DPAs are acceptable. And like every vendor chart, take Xiaomi’s public numbers with salt — then run your own evals on representative services.
CLOSING
The Merge Gate Is a Portfolio Problem
MiMo-V2 does not replace the need for human judgment. It raises the floor on how much context-aware critique you can afford before a human ever opens GitHub. Xiaomi entering this race with credible coding scores, aggressive pricing, and a serious Pro tier is good for buyers — and good for teams like ours who believe the future of software is generated quickly but merged carefully.
If you are already running critique.sh, select MiMo-V2-Flash or MiMo-V2-Pro in your model picker or let the control plane route you. If you are not, consider this essay a postcard from the merge gate: the models keep improving, the PR volume keeps climbing, and the organisations that win will treat review infrastructure as seriously as they treat CI.
See MiMo-V2 inside critique.sh
Sign up to run repo-aware review, parallel specialists, and Remedy-backed fixes with MiMo-V2-Flash, MiMo-V2-Pro, and the rest of our model fabric.
Get started →