Model update · 22 min read · Critique

GLM-5.2 Lands in Critique: 1M Context, Design Arena #1, Same 3-Credit GLM-5.1 Shelf

Z.ai’s June 2026 flagship upgrades the GLM lane to `z-ai/glm-5.2` — MIT open weights, a usable 1M-token window, and vendor-reported gains on SWE-bench Pro, Terminal-Bench 2.1, FrontierSWE, and MCP-Atlas — while Critique keeps the GLM-5.1 credit floor and inference API slot at 3 credits.

1M
Usable context: IndexShare DSA + KVShare MTP
1360
Design Arena Elo: #1 on crowdsourced design tasks
+17.5
Terminal-Bench 2.1 pts vs GLM-5.1 (81.0 vs 63.5)
3 cr
Review + Remedy floor: unchanged from GLM-5.1

June 16, 2026 is the long-horizon inflection for Z.ai. GLM-5.2 is not a token-window press release — it is a 753B-parameter MoE flagship with MIT open weights, effort-level control (High / Max), and training aimed at coding-agent trajectories that run for hours, not minutes. The architecture story matters for reviewers: IndexShare reuses sparse-attention indexers across layers (2.9× fewer FLOPs at 1M context), and an improved MTP speculative-decoding stack raises acceptance length by ~20%. Critique’s job is unchanged: wire the model through the OpenRouter adapter, keep credit economics honest, and chart Z.ai’s published numbers against the same frontier rows they evaluated — not a mash-up of unrelated leaderboards.

Critique credits — Z.AI lane

GLM-5.2 ships at the GLM-5.1 price

Same 3-credit floor for PR review and Remedy. Token burn still scales with PR size, specialist fan-out, and Max effort — but the capability jump is free at the credit layer.

Capability tier
GLM-5.1 (GLM-5.2 upstream)

Long-horizon lead or specialist: 1M context, effort levels, MIT weights on Hugging Face.

Critique floor: 3 cr / run
Z.AI API, input / 1M: $1.40
Z.AI API, output / 1M: $4.40
Throughput tier
Inference API slot

The public z-ai/glm-5.1 ID routes to GLM-5.2 at Critique list price ($0.95 input / $3.35 output per 1M tokens), with a training opt-in discount.

Critique floor: metered cr / run
Z.AI API, input / 1M: $0.95
Z.AI API, output / 1M: $3.35

OpenRouter list for GLM-5.2 is higher than Critique inference list — same pattern as other discounted inference lanes. Review credits are per agent step, not per million tokens.
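
To make the two price points above concrete, here is a minimal sketch of the per-call token arithmetic. The lane labels, the function name, and the 200K-in / 20K-out workload are illustrative assumptions; only the four per-million prices come from the cards above. It covers token cost only, not Critique review credits.

```python
# USD list prices per 1M tokens, as quoted in the cards above.
# The lane keys are illustrative labels, not real API identifiers.
PRICES_PER_M = {
    "zai_api": {"input": 1.40, "output": 4.40},
    "critique_inference": {"input": 0.95, "output": 3.35},
}


def token_cost(lane: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD token cost of one call on the given lane."""
    price = PRICES_PER_M[lane]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 200K tokens in, 20K tokens out.
    for lane in PRICES_PER_M:
        print(f"{lane}: ${token_cost(lane, 200_000, 20_000):.3f}")
```

For that assumed workload the sketch gives about $0.368 at the Z.AI API price and about $0.257 at the Critique inference price.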

Z.ai positions GLM-5.2 for “project-level engineering context” — requirements through multi-platform deployment in one agent thread. That only works if 1M context stays coherent under messy agent traces (repeated tool errors, compaction, sub-agent handoffs). Their training expansion for coding-agent scenarios — large implementation, automated research, perf optimization, complex debugging — is the substantive claim behind the window size.

753B
Total MoE parameters (Z.ai / VentureBeat)
1M
Context class — up from ~200K on GLM-5.1
MIT
Open license — weights on Hugging Face & ModelScope
Jun 2026
Launch date on z.ai/blog/glm-5.2

On the coding rows Z.ai publishes together, GLM-5.2 is the strongest open-weight model and closes much of the gap to the closed frontier. The standout delta against its predecessor is Terminal-Bench 2.1 (Terminus-2): 81.0 vs 63.5 for GLM-5.1, a gain that reflects shell iteration, error recovery, and finish rates. SWE-bench Pro moves to 62.1 (+3.7 vs GLM-5.1), ahead of GPT-5.5 (58.6) and under Opus 4.8 (69.2). NL2Repo (greenfield repo synthesis) reaches 48.9, just behind GPT-5.5’s 50.7 and well behind Opus 4.8’s 69.7, so the frontier gap on net-new repo construction is still real.

Coding

Coding benchmarks (Z.ai harness)
SWE-bench Pro, Terminal-Bench 2.1 (Terminus-2), and NL2Repo from the GLM-5.2 launch table. Higher is better.
Models charted: GLM-5.2, GLM-5.1, GPT-5.5, Opus 4.8.
Terminal-Bench 2.1: Terminus-2, temp 1.0, 256K ctx, 4 CPU / 8 GB RAM per task. SWE-bench Pro: OpenHands, 400K ctx. NL2Repo: 400K ctx, anti-hack filters.

This is where GLM-5.2’s positioning is clearest. FrontierSWE (open-ended technical projects at hour-to-tens-of-hours scale) puts GLM-5.2 at 74.4%: less than 1 point behind Opus 4.8 (75.1%), +1.8 vs GPT-5.5 (72.6%), and +43.9 vs GLM-5.1 (30.5%). PostTrainBench (H100 post-training improvement) scores 34.3%, beating GPT-5.5 (28.4%) and trailing only Opus 4.8 (37.2%). SWE-Marathon (compilers, kernels, production services) is still Opus territory (26.0% vs GLM-5.2’s 13.0%), but GLM-5.2 ranks second overall. The marathon lane is where every non-Opus model still has headroom.

Long horizon

FrontierSWE · PostTrainBench · SWE-Marathon
Z.ai’s three long-horizon suites: 1M context, max effort, 128K max output where applicable. Dominance scores as of 2026-06-16.
Models charted: GLM-5.2, GLM-5.1, GPT-5.5, Opus 4.8.
FrontierSWE: Proximal eval. PostTrainBench: PostTrainBench service. SWE-Marathon: Abundant AI. GLM-5.1 marathon at 1.0% reflects prior-gen long-horizon limits, not a typo.

For PR review orchestration, tool fidelity matters as much as raw coding score. On MCP-Atlas (500-task public subset, think mode, 10-minute timeout), GLM-5.2 reaches 76.8% vs GLM-5.1’s 71.8% — ahead of GPT-5.5 (75.3%) and under Opus 4.8 (77.8%). Tool-Decathlon lands at 48.2%, between GLM-5.1 (40.7%) and GPT-5.5 (55.6%). That pattern — near-frontier MCP scores, mid-pack on broader tool decathlon — is what we expect for a strong open-weight lead that still benefits from specialist sub-agents on security and tests.

Agentic

MCP-Atlas & Tool-Decathlon
Same Z.ai table; agentic rows are comparable across the four-model slice.
Models charted: GLM-5.2, GLM-5.1, GPT-5.5, Opus 4.8.
MCP-Atlas judge: Gemini 3.0 Pro. Tool-Decathlon: official evaluation service, 128K max tokens.

Coding benchmarks do not capture visual product craft. Z.ai and third-party coverage both highlight Design Arena: GLM-5.2 took #1 with Elo 1360, ahead of prior leaders including Claude Fable 5 (since delisted). That is not a substitute for SWE-bench — but it matters when your PRs include UI screenshots, marketing pages, or design-system regressions. Pair GLM-5.2 as lead with a vision specialist when the diff is image-backed; the Design Arena signal suggests the model has real layout and aesthetic judgment, not just codegen.

Design Arena Elo — selected frontier models
Crowdsourced head-to-head design tasks. GLM-5.2 at 1360 per Z.ai launch and VentureBeat coverage; other Elos illustrative from public leaderboard snapshots.
  • GLM-5.2: 1360 Elo
  • Claude Fable 5 (retired): 1342 Elo
  • GPT-5.5: 1288 Elo
  • Gemini 3.1 Pro: 1275 Elo
  • GLM-5.1: 1210 Elo

Non-GLM-5.2 rows are approximate peer snapshots for context — Design Arena Elo shifts as models enter and exit. Treat as directional, not a reproduced single harness run.
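
To read these gaps as head-to-head preference rates, the sketch below applies the textbook Elo expectation formula. It assumes the leaderboard uses the standard logistic curve with a 400-point scale, which is not confirmed here, and the function name is illustrative. Since the non-GLM-5.2 ratings are approximate snapshots, the outputs are directional at best.

```python
def expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Standard Elo expectation: probability that A is preferred over B.

    Assumes the classic logistic formula with a 400-point scale; the
    leaderboard's actual rating model may differ.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


if __name__ == "__main__":
    # Ratings from the list above; all but GLM-5.2 are approximate snapshots.
    glm_52 = 1360
    peers = {
        "Claude Fable 5 (retired)": 1342,
        "GPT-5.5": 1288,
        "Gemini 3.1 Pro": 1275,
        "GLM-5.1": 1210,
    }
    for name, rating in peers.items():
        print(f"GLM-5.2 vs {name}: {expected_score(glm_52, rating):.1%}")
```

Under that assumption, an 18-point lead is only slightly better than a coin flip (roughly 53%), while the 150-point gap to GLM-5.1 implies a preference rate of roughly 70%.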

Z.ai full table excerpt (selected rows)

Percent scores from z.ai/blog/glm-5.2 — June 16, 2026.

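| Benchmark | GLM-5.1 | GLM-5.2 | GPT-5.5 | Opus 4.8 |
| --- | --- | --- | --- | --- |
| SWE-bench Pro | 58.4 | 62.1 | 58.6 | 69.2 |
| Terminal-Bench 2.1 (Terminus-2) | 63.5 | 81.0 | 84.0 | 85.0 |
| FrontierSWE | 30.5 | 74.4 | 72.6 | 75.1 |
| PostTrainBench | 20.1 | 34.3 | 28.4 | 37.2 |
| SWE-Marathon | 1.0 | 13.0 | 12.0 | 26.0 |
| MCP-Atlas | 71.8 | 76.8 | 75.3 | 77.8 |
| Design Arena Elo | - | 1360 (#1) | - | - |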

Dash cells: not published for that model on Z.ai’s launch table.
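
The deltas quoted throughout this post can be rechecked from the percent rows of the table. The sketch below is one way to do that; the dictionary layout and function name are illustrative, and the figures are copied from the table rather than fetched from any source.

```python
# Percent rows copied from the table above, in column order:
# (GLM-5.1, GLM-5.2, GPT-5.5, Opus 4.8). The Elo row is omitted.
SCORES = {
    "SWE-bench Pro": (58.4, 62.1, 58.6, 69.2),
    "Terminal-Bench 2.1 (Terminus-2)": (63.5, 81.0, 84.0, 85.0),
    "FrontierSWE": (30.5, 74.4, 72.6, 75.1),
    "PostTrainBench": (20.1, 34.3, 28.4, 37.2),
    "SWE-Marathon": (1.0, 13.0, 12.0, 26.0),
    "MCP-Atlas": (71.8, 76.8, 75.3, 77.8),
}


def glm52_deltas(row: tuple) -> dict:
    """Return GLM-5.2's point difference against each other column."""
    glm51, glm52, gpt55, opus48 = row
    return {
        "vs GLM-5.1": round(glm52 - glm51, 1),
        "vs GPT-5.5": round(glm52 - gpt55, 1),
        "vs Opus 4.8": round(glm52 - opus48, 1),
    }


if __name__ == "__main__":
    for benchmark, row in SCORES.items():
        print(benchmark, glm52_deltas(row))
```

A positive value means GLM-5.2 is ahead on that row; a negative value is the remaining gap in points.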

Retiring GLM-5V-Turbo consolidates the Z.AI lane: one 3-credit model instead of split multimodal (5V) and text (5.1) SKUs. For PR review, GLM-5.2 is the right default when (1) the PR touches many files or deep call chains, (2) Remedy needs multi-hour autonomous fix attempts, or (3) you want open-weight MIT deployment without giving up frontier-adjacent Terminal-Bench and FrontierSWE scores. Keep DeepSeek V4 Flash or Trinity on specialists for cost; escalate to Opus or GPT-5.5 Pro only when the risk profile demands it.

Routing checklist
  1. Use GLM-5.2 as lead when…
     Large diffs, long-horizon Remedy, repo-wide refactors, MCP-heavy agent plans.
  2. Keep Flash specialists when…
     DeepSeek V4 Flash, Ling 2.6, Gemma: fan out security/tests cheaply.
  3. Attach vision separately when…
     GLM-5.2 is text-first; use MiMo, Gemini, or Qwen3.7 Plus when screenshots drive the review.
  4. Inference API clients
     Keep calling z-ai/glm-5.1; Critique routes to GLM-5.2 automatically.
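
The checklist can be sketched as a small routing function. This is an illustration of the decision shape only, not Critique's implementation: the `PullRequest` fields, the 25-file cutoff, and the returned dictionary are hypothetical, and the specialist and vision entries are display names from the checklist rather than real model IDs. A `lead` of `None` means "keep whatever lead you already use".

```python
from dataclasses import dataclass

# "z-ai/glm-5.2" and "z-ai/glm-5.1" are the IDs named in this post.
# Everything else below is a hypothetical label or threshold.
GLM_LEAD = "z-ai/glm-5.2"
LEGACY_ALIASES = {"z-ai/glm-5.1": GLM_LEAD}
CHEAP_SPECIALISTS = ["DeepSeek V4 Flash", "Ling 2.6", "Gemma"]
VISION_OPTIONS = ["MiMo", "Gemini", "Qwen3.7 Plus"]
LARGE_DIFF_FILES = 25  # assumed cutoff, not a Critique setting


@dataclass
class PullRequest:
    files_changed: int
    has_screenshots: bool = False
    long_horizon_remedy: bool = False
    mcp_heavy: bool = False


def resolve_model_id(model_id: str) -> str:
    """Model the alias-forward behavior: legacy IDs map to GLM-5.2."""
    return LEGACY_ALIASES.get(model_id, model_id)


def plan_review(pr: PullRequest) -> dict:
    """Apply checklist items 1 to 3 to a single pull request."""
    wants_glm_lead = (
        pr.files_changed >= LARGE_DIFF_FILES
        or pr.long_horizon_remedy
        or pr.mcp_heavy
    )
    return {
        "lead": GLM_LEAD if wants_glm_lead else None,
        "specialists": list(CHEAP_SPECIALISTS),  # cheap security/test fan-out
        "vision": list(VISION_OPTIONS) if pr.has_screenshots else [],
    }


if __name__ == "__main__":
    print(plan_review(PullRequest(files_changed=40, has_screenshots=True)))
    print(resolve_model_id("z-ai/glm-5.1"))
```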

The GLM-5.1 and GLM-5V-Turbo entries both retire in favor of a single z-ai/glm-5.2 entry displayed as GLM-5.1 at 3 credits. Legacy IDs alias forward so saved policies keep working.

The credit shelf and inference API compatibility stay the same. You get GLM-5.2 capability without repricing the lane teams already budgeted at 3 credits.

Against Kimi K2.7 Code and MiniMax M3, the strengths differ: GLM-5.2 leads on long-horizon FrontierSWE and 1M context; Kimi K2.7 Code optimizes coding-specialist efficiency; MiniMax M3 competes on SWE-bench Pro at a similar price band. Route by task shape, not brand loyalty.

The Design Arena rank is not a merge-safety signal: it measures crowdsourced design preference. Use it as a sign that GLM-5.2 has layout taste, and still run your normal review + test gates.

Try GLM-5.2 on your next PR.

Select GLM-5.1 (GLM-5.2 upstream) in the review model picker — 3 credits, 1M context, same economics as before.
