MiniMax M3 and Qwen3.7 Plus on Critique: Coding Benchmarks and a Two-Week M3 Welcome Price
Vendor-reported SWE-Bench Pro, terminal, and multimodal scores for MiniMax M3 vs M2.7 — plus Qwen3.7 Plus vs Qwen3.6 Plus — with cross-reads against GLM-5.1, Kimi K2.6, Composer 2.5, and Claude Opus 4.8, and how Critique prices the new review lanes.
MiniMax M3 welcome pricing on PR review.
M3 joins the same OpenRouter-shaped review stack as every other runtime model. The welcome window matches M2.7’s credit floor so you can trial the upgrade without re-budgeting mid-sprint. Qwen3.7 Plus is the parallel story on the Alibaba lane — stronger terminal and UI benchmarks vs Qwen3.6 Plus, same 1.5-credit review shelf.
June 2026 is another “two launches, one economics story” week. MiniMax shipped M3 as an open-weight multimodal foundation model with a million-token context window and sparse attention (MSA). Alibaba positioned Qwen3.7 Plus as a multimodal agent model — not merely a vision upgrade, but a single loop that can move between terminal output, UI screenshots, and tool calls. Critique’s job is not to repeat vendor keynote charts. It is to wire the models on the surfaces that spend review credits — lead synthesis, specialist passes, Remedy execution — and price them so the median PR can afford frontier-class coding without defaulting to Opus on every file. Repo chat stays on its own free roster.
PART ONE — MINIMAX M3: WHAT CHANGED VS M2.7
MiniMax M2.7 was the self-evolving agent lane Critique teams already knew: strong SWE-Bench Verified numbers on the Hugging Face card (80.2% in our catalog snapshot), 56.22% on SWE-Bench Pro in MiniMax’s M2.7 launch materials, and 57.0% on Terminal Bench 2 in the same generation. M3 is an architectural step, not a point release. MiniMax describes M3 as natively multimodal from step zero, trained with interleaved text-image (and video-capable) data, and served with up to 1M tokens of context via MSA, a sparse-attention scheme that cuts per-token compute at long context (they quote ~1/20 the cost of the prior generation at 1M tokens, with large prefill/decode speedups in their launch post).
SWE-Bench Pro — MiniMax generation
M3 (Jun 2026 launch) vs M2.7 launch materials vs M2.5 (MiniMax official M2.5 README — same agent-scaffold family as M2.7).
- MiniMax M3: 59%
- MiniMax M2.7: 56.22%
- MiniMax M2.5: 55.4%
SWE-Bench Pro — Qwen Plus generation
Qwen3.7 Plus (Alibaba Cloud) vs Qwen3.6 Plus (Z.AI HF table) vs Qwen3.5 Plus / 27B (Qwen HF agent-scaffold table).
- Qwen3.7 Plus: 56.6%
- Qwen3.6 Plus: 56.2%
- Qwen3.5 Plus: 51.2%
SWE-Bench Pro — cross-vendor
OpenAI, Anthropic, Alibaba, Moonshot, Z.AI, DeepSeek, and MiniMax on SWE-Bench Pro.
- Claude Opus 4.8: 69.2%
- MiniMax M3: 59%
- GPT-5.5: 58.6%
- Kimi K2.6: 58.6%
- GLM-5.1: 58.4%
- GPT-5.4: 57.7%
- Claude Opus 4.6: 57.3%
- Qwen3.7 Plus: 56.6%
- MiniMax M2.7: 56.22%
- Qwen3.6 Plus: 56.2%
- DeepSeek V4 Pro Max: 55.4%
- GLM-5: 55.1%
MiniMax M2.5: official README benchmark table. Qwen3.6 SWE-Pro: Z.AI GLM-5.1 HF table; Qwen3.5 Plus: Qwen HF agent-scaffold table (51.2% SWE-Pro, 41.6% TB 2.0). Critique routes `qwen/qwen3.6-plus` and legacy `qwen/qwen3.5-27b` to `qwen/qwen3.7-plus`.
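The routing described above can be sketched as a small alias lookup. This is only an illustration: the three model ids come from the text, while `MODEL_ALIASES` and `resolveModelId` are hypothetical names, not Critique's actual implementation.

```typescript
// Hypothetical alias table. Only the model ids themselves come from the article.
const MODEL_ALIASES = new Map<string, string>([
  ["qwen/qwen3.6-plus", "qwen/qwen3.7-plus"],
  ["qwen/qwen3.5-27b", "qwen/qwen3.7-plus"],
]);

// Resolve a requested model id to the id that would actually run.
// Ids without an alias entry pass through unchanged.
function resolveModelId(requested: string): string {
  return MODEL_ALIASES.get(requested) ?? requested;
}

console.log(resolveModelId("qwen/qwen3.5-27b")); // qwen/qwen3.7-plus
console.log(resolveModelId("minimax/minimax-m3")); // minimax/minimax-m3
```

A flat map keeps the legacy ids working in saved policies while every run lands on the current Plus model.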
Terminal-Bench — MiniMax generation
M3: TB 2.1 (launch blog). M2.7: Terminal Bench 2 (M2.7 materials). M2.5: Terminal Bench 2 at 51.7% (M2.5 README).
- MiniMax M3: 66%
- MiniMax M2.7: 57%
- MiniMax M2.5: 51.7%
Terminal-Bench — Qwen Plus generation
Qwen3.7: TB 2.0-Terminus (Jun 2026 press). Qwen3.6 Plus: TB 2.0 61.6% (Alibaba). Qwen3.5 Plus: TB 2.0 41.6% (Qwen HF table).
- Qwen3.7 Plus: 70.3%
- Qwen3.6 Plus: 61.6%
- Qwen3.5 Plus: 41.6%
Terminal-Bench — cross-vendor
Vendor-published terminal-suite scores (harness names differ by vendor).
- GPT-5.5: 82.7%
- GPT-5.4: 75.1%
- Gemini 3.1 Pro: 70.3%
- Qwen3.7 Plus: 70.3%
- Composer 2.5: 69.3%
- DeepSeek V4 Pro Max: 67.9%
- Kimi K2.6: 66.7%
- MiniMax M3: 66%
- GLM-5.1: 63.5%
- Qwen3.6 Plus: 61.6%
- Claude Sonnet 4.6: 59.1%
- MiniMax M2.7: 57%
- GLM-5: 56.2%
Cursor Composer 2.5 (May 2026). Sonnet 4.6: Anthropic system card. GLM-5 vs 5.1: Z.AI Hugging Face readme.
- Gemini 3.5 Flash: 83.6%
- Claude Opus 4.8: 82.2%
- Claude Opus 4.7: 79.1%
- Gemini 3.1 Pro: 78.2%
- Qwen3.7 Max: 76.4%
- GPT-5.5: 75.3%
- MiniMax M3: 74.2%
- GLM-5.1: 71.8%
- GPT-5.4: 70.6%
- Claude Sonnet 4.6: 69.5%
Vendor May–Jun 2026 tables (Scale MCP Atlas public set where cited). MiniMax M3: launch blog + official MCP Atlas codebase.
MiniMax’s own positioning is aggressive: on SWE-Bench Pro they report beating GPT-5.5 and Gemini 3.1 Pro and approaching Claude Opus 4.7; on BrowseComp they report 83.5 vs 79.3 for Opus 4.7. Anthropic has since shipped Opus 4.8, so the correct buyer question is not “does M3 beat last month’s Opus row?” but “does M3 clear my quality bar at 1/10th the credit burn?” For many repos, the answer is now plausibly yes on coding-and-tools workloads — with the usual caveat that long-horizon autonomy tests in vendor blogs (12-hour paper reproduction, 24-hour CUDA kernel search) are demonstrations, not guarantees on your monorepo.
PART TWO — QWEN3.7 PLUS: MULTIMODAL AGENT VS QWEN3.6 PLUS
Qwen3.7 Plus is already the Alibaba workhorse in Critique’s runtime catalog: specialist passes, cheap-volume stacks, and the default free Remedy chat model id (`qwen/qwen3.7-plus`). The June 2026 launch framing from Alibaba and press coverage is different from “slightly better Qwen3.6.” Qwen3.7 Plus is pitched as a multimodal agent foundation — text, image, and video inputs — with stronger computer-use and terminal scores than the prior Plus generation.
- Claude Opus 4.8: 87.9%
- Qwen3.7 Plus: 79%
- Gemini 3.1 Pro: 72.7%
- GPT-5.4 xhigh: 67.4%
- Claude Opus 4.6: 49.5%
Qwen3.7 & GPT-5.4 xhigh: VentureBeat Jun 2026. Gemini 3.1 Pro: Google DeepMind. Opus 4.8 & 4.6: Anthropic / BenchLM snapshots.
What Qwen3.7 Plus improves over Qwen3.6 Plus (how to think about it)
Qwen3.6 Plus was already a major agentic upgrade over Qwen3.5 Plus, with better tool-call stability, a lower “think forever, write little” ratio, and stronger front-end component generation in Alibaba’s materials. Qwen3.7 Plus keeps that lane and adds multimodal perception so the same agent can reason over screenshots, recordings, and terminal transcripts without bolting on a separate vision model.
Practical guidance: keep Qwen3.7 Plus when the PR carries visual signal, such as Figma exports, failing UI screenshots, chart-heavy docs, or short screen recordings. Escalate to Qwen3.7-Max (6 credits on Critique) when the job is terminal-only, needs maximum reasoning depth, and you can afford the step up. For the lowest-burn text lanes on review, MiMo v2.5 and Ling still anchor the sub-1-credit chart from our May catalog essay.
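The guidance above could be encoded as a small decision rule. The sketch below is one possible reading, not Critique's policy engine: the `PrSignals` fields and `pickQwenLane` are hypothetical names, and only the 6-credit Max price comes from the text.

```typescript
// Hypothetical description of a pull request; none of these fields are a real Critique API.
interface PrSignals {
  hasVisualArtifacts: boolean; // screenshots, Figma exports, charts, recordings
  terminalOnly: boolean;
  needsMaxReasoning: boolean;
  creditBudget: number; // credits the team is willing to spend on this pass
}

const QWEN_MAX_CREDITS = 6; // Qwen3.7-Max price quoted in the article

type QwenLane = "Qwen3.7 Plus" | "Qwen3.7-Max";

function pickQwenLane(pr: PrSignals): QwenLane {
  // Visual signal keeps the PR on the multimodal Plus lane.
  if (pr.hasVisualArtifacts) return "Qwen3.7 Plus";
  // Escalate only when all three conditions from the guidance hold.
  const wantsMax = pr.terminalOnly && pr.needsMaxReasoning;
  const canAfford = pr.creditBudget >= QWEN_MAX_CREDITS;
  return wantsMax && canAfford ? "Qwen3.7-Max" : "Qwen3.7 Plus";
}

console.log(
  pickQwenLane({ hasVisualArtifacts: false, terminalOnly: true, needsMaxReasoning: true, creditBudget: 10 }),
); // Qwen3.7-Max
```

Plus is the fallback in every other case, which matches the article's framing of Max as an opt-in step up.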
PART THREE — CRITIQUE SURFACES AND PRICING
Signed-in users still pick Ling 2.6 Flash or DeepSeek V4 Flash for conversational repo search. Chat does not debit the monthly PR review credit pool and does not include M3 or Qwen3.7 Plus.
`lib/ai/chat-models.ts`
MiniMax M3 joins the full review/remedy runtime catalog at 1.5 credits through June 17, 2026, then 3 credits. Qwen3.7 Plus remains 1.5 credits on review. Usage metering follows the runtime catalog, not Remedy UI “free” labels.
`lib/models/catalog.ts`
Where to try them first — SWE-Bench Pro
Repo-scale repair lanes on Critique shelves. All rows are SWE-Bench Pro — not Verified, multilingual, or Terminal-Bench.
- Claude Opus 4.8 (SWE-Bench Pro): 69.2%
- MiniMax M3 (SWE-Bench Pro): 59.0%
- Kimi K2.6 (SWE-Bench Pro): 58.6%
- GPT-5.5 (SWE-Bench Pro): 58.6%
- GLM-5.1 (SWE-Bench Pro): 58.4%
- GPT-5.4 (SWE-Bench Pro): 57.7%
- DeepSeek V4 Pro (SWE-Bench Pro): 55.4%
SWE-bench scores reflect best observed performance on the toughest real-world coding tasks.
All scores are relative.
Where to try them first — Terminal-Bench
Agentic terminal / CLI lanes. TB 2.0 rows use vendor TB 2.0 tables; M3 cites TB 2.1 from the MiniMax launch — same family, different harness revision.
- GPT-5.5 (Terminal-Bench 2.0): 82.7%
- GPT-5.4 (Terminal-Bench 2.0): 75.1%
- Qwen3.7 Plus (Terminal-Bench 2.0, Terminus): 70.3%
- Kimi K2.6 (Terminal-Bench 2.0): 66.7%
- MiniMax M3 (Terminal-Bench 2.1): 66.0%
- GLM-5.1 (Terminal-Bench 2.0): 63.5%
PART FOUR — M3 VS OPUS 4.8 AND COMPOSER 2.5 (BUYING FRAME)
Three questions land in Discord every launch week. The charts above hold the scores — here is the buying frame without mixing benchmarks.
1. Is M3 actually near Opus? On SWE-Bench Pro, M3 sits in the same band as GPT-5.5 and below Opus 4.8 on vendor tables. On Critique, Opus 4.8 is a 37-credit shelf; M3 welcome is 1.5 credits. Choose Opus when policy requires Anthropic frontier depth. Choose M3 when the repo needs strong coding signal without paying frontier rent every run.
2. How does M3 compare to Composer 2.5? Compare suites, not one leaderboard. M3 is scored on SWE-Bench Pro. Composer’s public story is SWE Multilingual and Hard-AA on Artificial Analysis; those are different contests. M3’s pitch is open weights, million-token context, multimodal inputs, and OpenRouter routing via `minimax/minimax-m3`.
3. Should I drop Kimi or GLM? Kimi K2.6 and GLM-5.1 are in the same SWE-Pro neighborhood as M3 on vendor cards. Kimi’s much higher Verified number comes from a different benchmark; do not plot it on a Pro chart. M3 is the MiniMax lane upgrade when you want better terminal scores than M2.7 and sparse-attention context economics at the welcome credit floor.
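The "compare suites, not one leaderboard" rule can be made mechanical by ranking only within a named suite. The sketch below uses rows copied from the charts in this article; the `Score` shape and `rankWithinSuite` are illustrative names, not part of any Critique API.

```typescript
interface Score {
  model: string;
  suite: string; // benchmark name exactly as the vendor reports it
  percent: number;
}

// Rows copied from the article's vendor-reported charts.
const scores: Score[] = [
  { model: "Kimi K2.6", suite: "SWE-Bench Pro", percent: 58.6 },
  { model: "MiniMax M3", suite: "SWE-Bench Pro", percent: 59.0 },
  { model: "GLM-5.1", suite: "SWE-Bench Pro", percent: 58.4 },
  { model: "MiniMax M3", suite: "Terminal-Bench 2.1", percent: 66.0 },
  { model: "Kimi K2.6", suite: "Terminal-Bench 2.0", percent: 66.7 },
];

// Rank models inside one suite only. Rows from any other suite or harness
// revision are left out rather than mixed into the same ordering.
function rankWithinSuite(rows: Score[], suite: string): Score[] {
  return rows
    .filter((row) => row.suite === suite)
    .sort((a, b) => b.percent - a.percent);
}

console.log(rankWithinSuite(scores, "SWE-Bench Pro").map((row) => row.model));
// [ 'MiniMax M3', 'Kimi K2.6', 'GLM-5.1' ]
```

Matching on the exact suite string is deliberately strict: Terminal-Bench 2.0 and 2.1 rows never end up in the same ranking unless someone decides they are comparable and says so.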
MiniMax M3 welcome pricing vs Opus 4.8 (Critique shelves)
Illustrative single-pass review cost — actual bills multiply by specialist count and depth tier.
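The arithmetic behind that comparison is simple enough to sketch. The credit prices and the June 17, 2026 cutoff come from this article; everything else is an assumption made for illustration: the UTC end-of-day reading of "through June 17", the `passCount` and `depthMultiplier` parameters, and the function names.

```typescript
// Credit prices quoted in the article.
const M3_WELCOME_CREDITS = 1.5;
const M3_STANDARD_CREDITS = 3;
const OPUS_48_CREDITS = 37;

// Assumption: "through June 17, 2026" means the end of that day in UTC.
const WELCOME_ENDS_MS = Date.UTC(2026, 5, 17, 23, 59, 59, 999); // month index 5 = June

// Per-pass M3 price on a given date.
function m3CreditsAt(when: Date): number {
  return when.getTime() <= WELCOME_ENDS_MS ? M3_WELCOME_CREDITS : M3_STANDARD_CREDITS;
}

// Hypothetical bill model: per-pass price times number of model passes,
// scaled by a depth-tier multiplier. The real metering formula is not given in the article.
function estimateReviewCredits(perPassCredits: number, passCount: number, depthMultiplier: number): number {
  return perPassCredits * passCount * depthMultiplier;
}

const duringWelcome = new Date("2026-06-10T00:00:00Z");
console.log(estimateReviewCredits(m3CreditsAt(duringWelcome), 4, 1)); // 6
console.log(estimateReviewCredits(OPUS_48_CREDITS, 4, 1)); // 148
```

Under these assumptions a four-pass review costs 6 credits on M3 during the welcome window, 12 afterward, and 148 on Opus 4.8.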
PART FIVE — SOURCES AND REPRODUCIBILITY
Set M3 on your next review stack
Open the model catalog, point lead or specialist policy at MiniMax M3 while welcome pricing runs, and keep Qwen3.7 Plus on multimodal PRs. Critique Chat is still free on Ling and DeepSeek V4 Flash.