28 min read · Critique

MiniMax M3 and Qwen3.7 Plus on Critique: Coding Benchmarks and a Two-Week M3 Welcome Price

Vendor-reported SWE-Bench Pro, terminal, and multimodal scores for MiniMax M3 vs M2.7 — plus Qwen3.7 Plus vs Qwen3.6 Plus — with cross-reads against GLM-5.1, Kimi K2.6, Composer 2.5, and Claude Opus 4.8, and how Critique prices the new review lanes.

June 2026 — review catalog

MiniMax M3 welcome pricing on PR review.

M3 joins the same OpenRouter-shaped review stack as every other runtime model. The welcome window matches M2.7’s credit floor so you can trial the upgrade without re-budgeting mid-sprint. Qwen3.7 Plus is the parallel story on the Alibaba lane — stronger terminal and UI benchmarks vs Qwen3.6 Plus, same 1.5-credit review shelf.

  • MiniMax M3 (50% welcome): 1.5 credits / run, 3 credits shelf. Ends June 17, 2026 (UTC cutoff).
  • Qwen3.7 Plus (review lane): 1.5 credits / run, Qwen3.6 Plus aliased forward. Paid review + Remedy, not in Chat.
  • MiniMax M3, SWE-Bench Pro (vendor): 59%
  • Qwen3.7 Plus, Terminal-Bench 2.0-Terminus (Alibaba): 70.3%
  • M3 welcome review floor (matches M2.7): 1.5 cr
  • M3 context class on OpenRouter: 1M

June 2026 is another “two launches, one economics story” week. MiniMax shipped M3 as an open-weight multimodal foundation model with a million-token context window and sparse attention (MSA). Alibaba positioned Qwen3.7 Plus as a multimodal agent model — not merely a vision upgrade, but a single loop that can move between terminal output, UI screenshots, and tool calls. Critique’s job is not to repeat vendor keynote charts. It is to wire the models on the surfaces that spend review credits — lead synthesis, specialist passes, Remedy execution — and price them so the median PR can afford frontier-class coding without defaulting to Opus on every file. Repo chat stays on its own free roster.

MiniMax M2.7 was the self-evolving agent lane Critique teams already knew: strong SWE-Bench Verified numbers on the Hugging Face card (80.2% in our catalog snapshot), 56.22% on SWE-Bench Pro in MiniMax’s M2.7 launch materials, and 57.0% on Terminal Bench 2 in the same generation. M3 is an architectural step, not a point release. MiniMax describes M3 as natively multimodal from step zero, trained with interleaved text-image (and video-capable) data, and served with up to 1M tokens of context via MSA, a sparse-attention scheme that cuts per-token compute at long context. In its launch post, MiniMax quotes roughly 1/20 the cost of the prior generation at 1M tokens, along with large prefill and decode speedups.

SWE-Bench Pro — MiniMax M3 vs peers
SWE-Bench Pro only. Higher is better. Vendor-published repo-scale scores — not SWE-Bench Verified, multilingual, or Terminal-Bench rows.

SWE-Bench Pro — MiniMax generation

M3 (Jun 2026 launch) vs M2.7 launch materials vs M2.5 (MiniMax official M2.5 README — same agent-scaffold family as M2.7).

  • MiniMax M3: 59%
  • MiniMax M2.7: 56.22%
  • MiniMax M2.5: 55.4%

SWE-Bench Pro — Qwen Plus generation

Qwen3.7 Plus (Alibaba Cloud) vs Qwen3.6 Plus (Z.AI HF table) vs Qwen3.5 Plus / 27B (Qwen HF agent-scaffold table).

  • Qwen3.7 Plus: 56.6%
  • Qwen3.6 Plus: 56.2%
  • Qwen3.5 Plus: 51.2%

SWE-Bench Pro — cross-vendor

OpenAI, Anthropic, Alibaba, Moonshot, Z.AI, DeepSeek, and MiniMax on SWE-Bench Pro.

  • Claude Opus 4.8: 69.2%
  • MiniMax M3: 59%
  • GPT-5.5: 58.6%
  • Kimi K2.6: 58.6%
  • GLM-5.1: 58.4%
  • GPT-5.4: 57.7%
  • Claude Opus 4.6: 57.3%
  • Qwen3.7 Plus: 56.6%
  • MiniMax M2.7: 56.22%
  • Qwen3.6 Plus: 56.2%
  • DeepSeek V4 Pro Max: 55.4%
  • GLM-5: 55.1%

MiniMax M2.5: official README benchmark table. Qwen3.6 SWE-Pro: Z.AI GLM-5.1 HF table; Qwen3.5 Plus: Qwen HF agent-scaffold table (51.2% SWE-Pro, 41.6% TB 2.0). Critique routes `qwen/qwen3.6-plus` and legacy `qwen/qwen3.5-27b` to `qwen/qwen3.7-plus`.
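The alias routing mentioned above can be pictured as a small lookup. This is only a sketch: the three model ids come from this article, while `REVIEW_MODEL_ALIASES` and `resolveReviewModel` are hypothetical names, not Critique's actual catalog code.

```typescript
// Hypothetical alias table. The ids are quoted from the article; the Map
// shape and the function name are assumptions made for illustration.
const REVIEW_MODEL_ALIASES = new Map<string, string>([
  ["qwen/qwen3.6-plus", "qwen/qwen3.7-plus"],
  ["qwen/qwen3.5-27b", "qwen/qwen3.7-plus"],
]);

// Return the id a review run would resolve to. Ids without an alias
// (for example "minimax/minimax-m3") pass through unchanged.
function resolveReviewModel(requestedId: string): string {
  return REVIEW_MODEL_ALIASES.get(requestedId) ?? requestedId;
}

console.log(resolveReviewModel("qwen/qwen3.6-plus")); // qwen/qwen3.7-plus
console.log(resolveReviewModel("minimax/minimax-m3")); // minimax/minimax-m3
```

A pass-through default keeps saved policies working when a model has no alias, which is one plausible reading of "aliased forward".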

Terminal-Bench — MiniMax M3 vs peers
Terminal-Bench only (not MCP Atlas, not SWE-Bench Pro). Vendor harness names differ (TB 2.0, 2.1, Terminus) — compare directionally.

Terminal-Bench — MiniMax generation

M3: TB 2.1 (launch blog). M2.7: Terminal Bench 2 (M2.7 materials). M2.5: Terminal Bench 2 at 51.7% (M2.5 README).

  • MiniMax M3: 66%
  • MiniMax M2.7: 57%
  • MiniMax M2.5: 51.7%

Terminal-Bench — Qwen Plus generation

Qwen3.7: TB 2.0-Terminus (Jun 2026 press). Qwen3.6 Plus: TB 2.0 61.6% (Alibaba). Qwen3.5 Plus: TB 2.0 41.6% (Qwen HF table).

  • Qwen3.7 Plus: 70.3%
  • Qwen3.6 Plus: 61.6%
  • Qwen3.5 Plus: 41.6%

Terminal-Bench — cross-vendor

Vendor-published terminal-suite scores (harness names differ by vendor).

  • GPT-5.5: 82.7%
  • GPT-5.4: 75.1%
  • Gemini 3.1 Pro: 70.3%
  • Qwen3.7 Plus: 70.3%
  • Composer 2.5: 69.3%
  • DeepSeek V4 Pro Max: 67.9%
  • Kimi K2.6: 66.7%
  • MiniMax M3: 66%
  • GLM-5.1: 63.5%
  • Qwen3.6 Plus: 61.6%
  • Claude Sonnet 4.6: 59.1%
  • MiniMax M2.7: 57%
  • GLM-5: 56.2%

Cursor Composer 2.5 (May 2026). Sonnet 4.6: Anthropic system card. GLM-5 vs 5.1: Z.AI Hugging Face readme.

MCP Atlas — MiniMax M3 vs peers
MCP Atlas (Scale public set) only — multi-server MCP tool orchestration. Not Terminal-Bench, SWE-Bench Pro, or Toolathlon.
  • Gemini 3.5 Flash: 83.6%
  • Claude Opus 4.8: 82.2%
  • Claude Opus 4.7: 79.1%
  • Gemini 3.1 Pro: 78.2%
  • Qwen3.7 Max: 76.4%
  • GPT-5.5: 75.3%
  • MiniMax M3: 74.2%
  • GLM-5.1: 71.8%
  • GPT-5.4: 70.6%
  • Claude Sonnet 4.6: 69.5%

Vendor May–Jun 2026 tables (Scale MCP Atlas public set where cited). MiniMax M3: launch blog + official MCP Atlas codebase.

MiniMax’s own positioning is aggressive: on SWE-Bench Pro they report beating GPT-5.5 and Gemini 3.1 Pro and approaching Claude Opus 4.7; on BrowseComp they report 83.5 vs 79.3 for Opus 4.7. Anthropic has since shipped Opus 4.8, so the correct buyer question is not “does M3 beat last month’s Opus row?” but “does M3 clear my quality bar at well under 1/10th the credit burn?” (1.5 credits against Opus 4.8's 37 during the welcome window, 3 against 37 afterward). For many repos, the answer is now plausibly yes on coding-and-tools workloads. The usual caveat applies: long-horizon autonomy tests in vendor blogs (12-hour paper reproduction, 24-hour CUDA kernel search) are demonstrations, not guarantees on your monorepo.

Qwen3.7 Plus is already the Alibaba workhorse in Critique’s runtime catalog: specialist passes, cheap-volume stacks, and the default free Remedy chat model id (`qwen/qwen3.7-plus`). The June 2026 launch framing from Alibaba and press coverage is different from “slightly better Qwen3.6.” Qwen3.7 Plus is pitched as a multimodal agent foundation — text, image, and video inputs — with stronger computer-use and terminal scores than the prior Plus generation.

ScreenSpot Pro — Qwen3.7 Plus vs peers
ScreenSpot Pro only — localized UI understanding from the Jun 2026 press table. Not Terminal-Bench or SWE-Bench Pro.
  • Claude Opus 4.8: 87.9%
  • Qwen3.7 Plus: 79%
  • Gemini 3.1 Pro: 72.7%
  • GPT-5.4 xhigh: 67.4%
  • Claude Opus 4.6: 49.5%

Qwen3.7 & GPT-5.4 xhigh: VentureBeat Jun 2026. Gemini 3.1 Pro: Google DeepMind. Opus 4.8 & 4.6: Anthropic / BenchLM snapshots.

Generation delta

What Qwen3.7 Plus improves over Qwen3.6 Plus (how to think about it)

Qwen3.6 Plus was already a major agentic upgrade over Qwen3.5 Plus — better tool-call stability, less “think forever, write little” ratio, and stronger front-end component generation in Alibaba’s materials. Qwen3.7 Plus keeps that lane and adds multimodal perception so the same agent can reason over screenshots, recordings, and terminal transcripts without bolting on a separate vision model.

| Metric | Qwen3.6 Plus (prior Plus lane) | Qwen3.7 Plus (current Plus lane) |
| --- | --- | --- |
| Primary identity | Text-first agentic coding | Multimodal agent (vision + code + tools) |
| Terminal-Bench class (different harness names; compare directionally) | 61.6% TB 2.0 (Alibaba, in Critique catalog) | 70.3% TB 2.0-Terminus (Jun 2026 press) |
| UI / computer use | Stronger than 3.5 on agentic rows | 79.0% ScreenSpot Pro (press table) |
| Critique credit floor | 1.5 cr (aliased forward) | 1.5 cr (unchanged shelf) |

Practical guidance: keep Qwen3.7 Plus when the PR carries visual signal, such as Figma exports, failing UI screenshots, chart-heavy docs, or short screen recordings. Escalate to Qwen3.7 Max (6 credits on Critique) when the job is terminal-only, needs maximum reasoning depth, and you can afford the step up. For the lowest-burn text lanes on review, MiMo v2.5 and Ling still anchor the sub-1cr chart from our May catalog essay.
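The guidance above can be written as a small decision rule. The sketch below is one possible encoding, not a Critique feature: `PrSignal`, `LaneChoice`, and `pickQwenLane` are hypothetical names, and only the two credit floors (1.5 and 6) are taken from this article.

```typescript
// Hypothetical description of a pull request, invented for this sketch.
interface PrSignal {
  hasVisualSignal: boolean; // screenshots, Figma exports, charts, recordings
  terminalOnly: boolean; // the job is purely terminal / CLI work
  needsMaxReasoningDepth: boolean;
  creditBudgetPerRun: number; // what the team is willing to spend per run
}

interface LaneChoice {
  lane: "Qwen3.7 Plus" | "Qwen3.7 Max";
  creditsPerRun: number;
}

// Credit floors as quoted in the article.
const QWEN_PLUS: LaneChoice = { lane: "Qwen3.7 Plus", creditsPerRun: 1.5 };
const QWEN_MAX: LaneChoice = { lane: "Qwen3.7 Max", creditsPerRun: 6 };

function pickQwenLane(pr: PrSignal): LaneChoice {
  // Visual signal keeps the PR on the multimodal Plus lane.
  if (pr.hasVisualSignal) return QWEN_PLUS;
  // Escalate only when all three conditions from the guidance hold.
  const wantsMax = pr.terminalOnly && pr.needsMaxReasoningDepth;
  const canAffordMax = pr.creditBudgetPerRun >= QWEN_MAX.creditsPerRun;
  return wantsMax && canAffordMax ? QWEN_MAX : QWEN_PLUS;
}
```

Plus is the default here, so a PR moves to Max only when every stated condition is met.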

Critique Chat (free)

Signed-in users still pick Ling 2.6 Flash or DeepSeek V4 Flash for conversational repo search. Chat does not debit the monthly PR review credit pool and does not include M3 or Qwen3.7 Plus.

lib/ai/chat-models.ts

PR review + Remedy (credits)

MiniMax M3 joins the full review/remedy runtime catalog at 1.5 credits through June 17, 2026, then 3 credits. Qwen3.7 Plus remains 1.5 credits on review. Usage metering follows the runtime catalog, not Remedy UI “free” labels.

lib/models/catalog.ts
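To make the welcome window concrete, here is a minimal sketch of a date-based price lookup. It is not the contents of `lib/models/catalog.ts`. The constant and function names are hypothetical, and it assumes "through June 17, 2026 (UTC)" is inclusive, so the welcome price ends at the first instant of June 18 UTC.

```typescript
// Assumption: the welcome window includes all of June 17, 2026 in UTC.
// Date.UTC takes a 0-based month, so 5 means June.
const M3_WELCOME_END_MS = Date.UTC(2026, 5, 18);

// Credit prices quoted in the article.
const M3_WELCOME_CREDITS = 1.5;
const M3_SHELF_CREDITS = 3;

// Credits billed for one MiniMax M3 review run that starts at `at`.
function m3CreditsPerRun(at: Date): number {
  return at.getTime() < M3_WELCOME_END_MS ? M3_WELCOME_CREDITS : M3_SHELF_CREDITS;
}

console.log(m3CreditsPerRun(new Date("2026-06-17T23:59:59Z"))); // 1.5
console.log(m3CreditsPerRun(new Date("2026-06-18T00:00:00Z"))); // 3
```

Comparing epoch milliseconds avoids local-timezone surprises. If the real cutoff turns out to be the start of June 17, only the constant changes.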

Where to try them first — SWE-Bench Pro

Repo-scale repair lanes on Critique shelves. All rows are SWE-Bench Pro — not Verified, multilingual, or Terminal-Bench.

7 models · 69.2% top SWE-Pro score · 1 cr lowest floor

| Model | Benchmark | Score | Critique credits |
| --- | --- | --- | --- |
| Claude Opus 4.8 | SWE-Bench Pro | 69.2% | 37 cr |
| MiniMax M3 | SWE-Bench Pro | 59.0% | 1.5 cr |
| Kimi K2.6 | SWE-Bench Pro | 58.6% | 4 cr |
| GPT-5.5 | SWE-Bench Pro | 58.6% | 40 cr |
| GLM-5.1 | SWE-Bench Pro | 58.4% | 3 cr |
| GPT-5.4 | SWE-Bench Pro | 57.7% | 20 cr |
| DeepSeek V4 Pro | SWE-Bench Pro | 55.4% | 1 cr |

SWE-bench scores reflect best observed performance on the toughest real-world coding tasks.

All scores are relative.


Where to try them first — Terminal-Bench

Agentic terminal / CLI lanes. TB 2.0 rows use vendor TB 2.0 tables; M3 cites TB 2.1 from the MiniMax launch — same family, different harness revision.

6 models · 82.7% top Terminal-Bench score · 1.5 cr lowest floor

| Model | Benchmark | Score | Critique credits |
| --- | --- | --- | --- |
| GPT-5.5 | Terminal-Bench 2.0 | 82.7% | 40 cr |
| GPT-5.4 | Terminal-Bench 2.0 | 75.1% | 20 cr |
| Qwen3.7 Plus | Terminal-Bench 2.0-Terminus | 70.3% | 1.5 cr |
| Kimi K2.6 | Terminal-Bench 2.0 | 66.7% | 4 cr |
| MiniMax M3 | Terminal-Bench 2.1 | 66.0% | 1.5 cr |
| GLM-5.1 | Terminal-Bench 2.0 | 63.5% | 3 cr |


Three questions land in Discord every launch week. The charts above hold the scores — here is the buying frame without mixing benchmarks.

How to read M3 against Opus, Composer, Kimi, and GLM
  1. Is M3 actually near Opus?
     On SWE-Bench Pro, M3 sits in the same band as GPT-5.5 and below Opus 4.8 on vendor tables. On Critique, Opus 4.8 is a 37-credit shelf; M3 welcome is 1.5 credits. Choose Opus when policy requires Anthropic frontier depth. Choose M3 when the repo needs strong coding signal without paying frontier rent every run.
  2. How does M3 compare to Composer 2.5?
     Compare suites, not one leaderboard. M3 is scored on SWE-Bench Pro. Composer’s public story is SWE Multilingual and Hard-AA on Artificial Analysis, which are different contests. M3’s pitch is open weights, million-token context, multimodal inputs, and OpenRouter routing via `minimax/minimax-m3`.
  3. Should I drop Kimi or GLM?
     Kimi K2.6 and GLM-5.1 are in the same SWE-Pro neighborhood as M3 on vendor cards. Kimi’s much higher Verified number is a different benchmark, so do not plot it on a Pro chart. M3 is the MiniMax lane upgrade when you want better terminal scores than M2.7 and sparse-attention context economics at the welcome credit floor.
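One way to keep benchmarks from mixing, as the answers above recommend, is to tag every score with its suite and rank only within a suite. The sketch below uses scores quoted in this article. The `ScoreRow` type and the `rankWithinSuite` helper are hypothetical, not part of any Critique API.

```typescript
type Suite = "SWE-Bench Pro" | "SWE-Bench Verified" | "Terminal-Bench";

interface ScoreRow {
  model: string;
  suite: Suite;
  percent: number;
}

// Vendor-reported numbers as quoted in this article.
const vendorRows: ScoreRow[] = [
  { model: "MiniMax M3", suite: "SWE-Bench Pro", percent: 59.0 },
  { model: "Kimi K2.6", suite: "SWE-Bench Pro", percent: 58.6 },
  { model: "GLM-5.1", suite: "SWE-Bench Pro", percent: 58.4 },
  { model: "MiniMax M2.7", suite: "SWE-Bench Verified", percent: 80.2 },
  { model: "MiniMax M3", suite: "Terminal-Bench", percent: 66.0 },
];

// Rank only rows from one suite, highest first. filter() returns a new
// array, so sorting it leaves the input untouched.
function rankWithinSuite(all: ScoreRow[], suite: Suite): ScoreRow[] {
  return all.filter((row) => row.suite === suite).sort((a, b) => b.percent - a.percent);
}

const proRanking = rankWithinSuite(vendorRows, "SWE-Bench Pro").map((row) => row.model);
console.log(proRanking.join(" > ")); // MiniMax M3 > Kimi K2.6 > GLM-5.1
```

The 80.2% Verified row never appears in the Pro ranking, which is the point: a higher number from a different suite does not outrank anything. Terminal-Bench harness revisions (2.0, 2.1, Terminus) could be split further in the same way.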
Credit efficiency

MiniMax M3 welcome pricing vs Opus 4.8 (Critique shelves)

Illustrative single-pass review cost — actual bills multiply by specialist count and depth tier.

| Metric | MiniMax M3 (welcome) | Claude Opus 4.8 |
| --- | --- | --- |
| Credit floor / run | 1.5 cr | 37 cr |
| Shelf after Jun 17, 2026 | 3 cr | 37 cr |
| SWE-Bench Pro (vendor) | 59.0% | See Anthropic / Scale tables |
| Open weights + 1M ctx | Yes (MiniMax) | Proprietary API |

M3 is not available in Critique Chat. It is on the paid PR review and Remedy catalog only. Critique Chat remains Ling 2.6 Flash and DeepSeek V4 Flash at no extra chat fee. Legacy minimax model ids in saved chat preferences normalize to DeepSeek V4 Flash.
Through June 17, 2026 (UTC), M3 review runs bill at 1.5 credits — the same as M2.7 today. After that date the shelf returns to 3 credits automatically; no coupon code required.
The Qwen3.7 Plus credit floor remains 1.5 credits on Critique review and Remedy. This essay focuses on benchmark context vs Qwen3.6 Plus and where to use the Plus lane — not a new discount.
Vendor tables place M3 near Opus 4.7 on SWE-Bench Pro while trailing some GPT-5.5 terminal scores. Opus 4.8 remains the policy-mandated Anthropic lane on Critique at 37 credits. Use M3 when open-weight economics and 1M context matter; use Opus when your org requires Anthropic or maximum frontier depth regardless of cost.
Keep Plus for multimodal and mid-cost specialist work. Move to Max when you need the higher-end Alibaba reasoning lane (6 credits on Critique) and text-only depth is worth the step up.
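The credit gap between the M3 and Opus 4.8 shelves can be checked with a few lines of arithmetic. The sketch below uses the shelf floors quoted in this article. The stack model is an assumption for illustration (one lead pass plus some specialist passes on the same model, scaled by a hypothetical `depthMultiplier`), not Critique's actual billing formula.

```typescript
interface Shelf {
  model: string;
  creditsPerRun: number; // shelf floor quoted in the article
}

const M3_WELCOME: Shelf = { model: "MiniMax M3 (welcome)", creditsPerRun: 1.5 };
const M3_AFTER: Shelf = { model: "MiniMax M3 (after Jun 17, 2026)", creditsPerRun: 3 };
const OPUS_48: Shelf = { model: "Claude Opus 4.8", creditsPerRun: 37 };

// How many times more one shelf costs per run than another.
function costRatio(expensive: Shelf, cheap: Shelf): number {
  return expensive.creditsPerRun / cheap.creditsPerRun;
}

// Assumption: a review bills one lead pass plus `specialistPasses` passes at
// the same floor, times a hypothetical depth multiplier (1 = no uplift).
function stackCredits(shelf: Shelf, specialistPasses: number, depthMultiplier = 1): number {
  return shelf.creditsPerRun * (1 + specialistPasses) * depthMultiplier;
}

console.log(costRatio(OPUS_48, M3_WELCOME).toFixed(1)); // 24.7
console.log(costRatio(OPUS_48, M3_AFTER).toFixed(1)); // 12.3
console.log(stackCredits(M3_WELCOME, 3)); // 6
console.log(stackCredits(OPUS_48, 3)); // 148
```

Under these assumptions a lead-plus-three-specialists review costs 6 credits on M3 welcome pricing and 148 on Opus 4.8. Real stacks often mix models, so treat the totals as upper and lower bounds rather than a quote.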

Set M3 on your next review stack

Open the model catalog, point lead or specialist policy at MiniMax M3 while welcome pricing runs, and keep Qwen3.7 Plus on multimodal PRs. Critique Chat is still free on Ling and DeepSeek V4 Flash.
