GLM-5.2 Lands in Critique: 1M Context, Design Arena #1, Same 3-Credit GLM-5.1 Shelf
Z.ai’s June 2026 flagship upgrades the GLM lane to `z-ai/glm-5.2` — MIT open weights, a usable 1M-token window, and vendor-reported gains on SWE-bench Pro, Terminal-Bench 2.1, FrontierSWE, and MCP-Atlas — while Critique keeps the GLM-5.1 credit floor and inference API slot at 3 credits.
June 16, 2026 is the long-horizon inflection for Z.ai. GLM-5.2 is not a token-window press release — it is a 753B-parameter MoE flagship with MIT open weights, effort-level control (High / Max), and training aimed at coding-agent trajectories that run for hours, not minutes. The architecture story matters for reviewers: IndexShare reuses sparse-attention indexers across layers (2.9× fewer FLOPs at 1M context), and an improved MTP speculative-decoding stack raises acceptance length by ~20%. Critique’s job is unchanged: wire the model through the OpenRouter adapter, keep credit economics honest, and chart Z.ai’s published numbers against the same frontier rows they evaluated — not a mash-up of unrelated leaderboards.
GLM-5.2 ships at the GLM-5.1 price
Same 3-credit floor for PR review and Remedy. Token burn still scales with PR size, specialist fan-out, and Max effort — but the capability jump is free at the credit layer.
Long-horizon lead or specialist: 1M context, effort levels, MIT weights on Hugging Face.
The public `z-ai/glm-5.1` ID routes to GLM-5.2 at the Critique list price ($0.95 / $3.35 per million tokens), with the training opt-in discount.
OpenRouter list for GLM-5.2 is higher than Critique inference list — same pattern as other discounted inference lanes. Review credits are per agent step, not per million tokens.
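To make the per-token side of that pricing concrete, here is a minimal sketch of a list-price estimate. It assumes that $0.95 is the input rate and $3.35 the output rate (the text gives the pair without labels), and it does not model how the training opt-in discount applies. The function name and the example token counts are hypothetical. Review credits are a separate, per-agent-step charge and are not covered here.

```python
# Critique list price for the GLM lane, USD per million tokens (from the text above).
# Assumption: $0.95 is the input rate and $3.35 is the output rate.
INPUT_USD_PER_M = 0.95
OUTPUT_USD_PER_M = 3.35


def inference_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the list-price cost of one inference call.

    The training opt-in discount is not modelled here.
    """
    total = input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical call: 200k tokens of repo context, 20k tokens of output.
    print(f"${inference_cost_usd(200_000, 20_000):.3f}")
```

Under those assumptions, filling the whole 1M-token window once costs about $0.95 in input tokens before any output is generated.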
PART ONE — WHAT GLM-5.2 IS
Z.ai positions GLM-5.2 for “project-level engineering context” — requirements through multi-platform deployment in one agent thread. That only works if 1M context stays coherent under messy agent traces (repeated tool errors, compaction, sub-agent handoffs). Their training expansion for coding-agent scenarios — large implementation, automated research, perf optimization, complex debugging — is the substantive claim behind the window size.
PART TWO — STANDARD CODING BENCHMARKS
On the coding rows Z.ai publishes together, GLM-5.2 is the strongest open-weight model and closes much of the gap to the closed frontier. The standout delta vs its predecessor is Terminal-Bench 2.1 (Terminus-2): 81.0 vs 63.5 on GLM-5.1, a 17.5-point jump on a benchmark that stresses shell iteration, error recovery, and finish rates. SWE-bench Pro moves to 62.1 (+3.7 vs GLM-5.1), edging GPT-5.5 (58.6) and landing under Opus 4.8 (69.2). NL2Repo (greenfield repo synthesis) hits 48.9, just behind GPT-5.5's 50.7 and well behind Opus's 69.7; the frontier gap on net-new repo construction is still real.
[Chart: Coding]
PART THREE — LONG-HORIZON BENCHMARKS
This is where GLM-5.2's positioning is clearest. FrontierSWE (open-ended technical projects at hour-to-tens-of-hours scale) puts GLM-5.2 at 74.4%: 0.7 points behind Opus 4.8 (75.1%), +1.8 vs GPT-5.5 (72.6%), and +43.9 vs GLM-5.1 (30.5%). PostTrainBench (H100 post-training improvement) scores 34.3%, beating GPT-5.5 (28.4%) and trailing only Opus 4.8 (37.2%). SWE-Marathon (compilers, kernels, production services) is still Opus territory (26.0% vs GLM-5.2's 13.0%), but GLM-5.2 ranks second overall, just ahead of GPT-5.5 (12.0%). The marathon lane is where every non-Opus model still has headroom.
[Chart: Long horizon]
PART FOUR — AGENTIC TOOL USE
For PR review orchestration, tool fidelity matters as much as raw coding score. On MCP-Atlas (500-task public subset, think mode, 10-minute timeout), GLM-5.2 reaches 76.8% vs GLM-5.1’s 71.8% — ahead of GPT-5.5 (75.3%) and under Opus 4.8 (77.8%). Tool-Decathlon lands at 48.2%, between GLM-5.1 (40.7%) and GPT-5.5 (55.6%). That pattern — near-frontier MCP scores, mid-pack on broader tool decathlon — is what we expect for a strong open-weight lead that still benefits from specialist sub-agents on security and tests.
[Chart: Agentic]
PART FIVE — DESIGN ARENA ELO
Coding benchmarks do not capture visual product craft. Z.ai and third-party coverage both highlight Design Arena: GLM-5.2 took #1 with Elo 1360, ahead of prior leaders including Claude Fable 5 (since delisted). That is not a substitute for SWE-bench — but it matters when your PRs include UI screenshots, marketing pages, or design-system regressions. Pair GLM-5.2 as lead with a vision specialist when the diff is image-backed; the Design Arena signal suggests the model has real layout and aesthetic judgment, not just codegen.
- GLM-5.2: 1360 Elo
- Claude Fable 5 (retired): 1342 Elo
- GPT-5.5: 1288 Elo
- Gemini 3.1 Pro: 1275 Elo
- GLM-5.1: 1210 Elo
Non-GLM-5.2 rows are approximate peer snapshots for context — Design Arena Elo shifts as models enter and exit. Treat as directional, not a reproduced single harness run.
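For a rough sense of what those Elo gaps mean, the sketch below converts a rating difference into an expected head-to-head preference rate. It assumes the classic logistic Elo formula on a 400-point scale; Design Arena's exact rating method is not stated here, so treat the output as illustrative. The function name is hypothetical, and the non-GLM-5.2 ratings are the approximate snapshots listed above.

```python
def expected_win_rate(rating_a: float, rating_b: float) -> float:
    """Expected share of head-to-head votes for A over B.

    Assumption: standard logistic Elo with a 400-point scale.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Ratings from the list above; peer rows are approximate snapshots.
RATINGS = {
    "Claude Fable 5 (retired)": 1342,
    "GPT-5.5": 1288,
    "Gemini 3.1 Pro": 1275,
    "GLM-5.1": 1210,
}

if __name__ == "__main__":
    for name, rating in RATINGS.items():
        rate = expected_win_rate(1360, rating)
        print(f"GLM-5.2 vs {name}: {rate:.1%}")
```

Under that assumption, the 18-point lead over Claude Fable 5 corresponds to roughly a 53% expected preference rate: a real lead, but a narrow one.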
Percent scores from z.ai/blog/glm-5.2 — June 16, 2026.
| Benchmark | GLM-5.1 | GLM-5.2 | GPT-5.5 | Opus 4.8 |
|---|---|---|---|---|
| SWE-bench Pro | 58.4 | 62.1 | 58.6 | 69.2 |
| Terminal-Bench 2.1 (Terminus-2) | 63.5 | 81.0 | 84.0 | 85.0 |
| FrontierSWE | 30.5 | 74.4 | 72.6 | 75.1 |
| PostTrainBench | 20.1 | 34.3 | 28.4 | 37.2 |
| SWE-Marathon | 1.0 | 13.0 | 12.0 | 26.0 |
| MCP-Atlas | 71.8 | 76.8 | 75.3 | 77.8 |
| Design Arena Elo | — | 1360 (#1) | — | — |
Dash cells: not published for that model on Z.ai’s launch table.
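The deltas quoted in the sections above can be recomputed directly from the table. The sketch below copies the six percent rows as published and derives GLM-5.2's gap to each comparison column; the data structure and function name are illustrative, not part of any Critique or Z.ai tooling.

```python
# Percent scores copied from the launch table above.
# Column order: GLM-5.1, GLM-5.2, GPT-5.5, Opus 4.8.
SCORES = {
    "SWE-bench Pro": (58.4, 62.1, 58.6, 69.2),
    "Terminal-Bench 2.1 (Terminus-2)": (63.5, 81.0, 84.0, 85.0),
    "FrontierSWE": (30.5, 74.4, 72.6, 75.1),
    "PostTrainBench": (20.1, 34.3, 28.4, 37.2),
    "SWE-Marathon": (1.0, 13.0, 12.0, 26.0),
    "MCP-Atlas": (71.8, 76.8, 75.3, 77.8),
}


def deltas(row: tuple) -> dict:
    """Return GLM-5.2's point difference against each other column."""
    glm51, glm52, gpt55, opus48 = row
    return {
        "vs GLM-5.1": round(glm52 - glm51, 1),
        "vs GPT-5.5": round(glm52 - gpt55, 1),
        "vs Opus 4.8": round(glm52 - opus48, 1),
    }


if __name__ == "__main__":
    for name, row in SCORES.items():
        d = deltas(row)
        print(
            f"{name:<34}"
            f"{d['vs GLM-5.1']:+7.1f}{d['vs GPT-5.5']:+7.1f}{d['vs Opus 4.8']:+7.1f}"
        )
```

A positive number means GLM-5.2 leads that column. On these rows it trails Opus 4.8 everywhere and trails GPT-5.5 only on Terminal-Bench 2.1.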
PART SIX — WHERE GLM-5.2 EARNS A SLOT IN CRITIQUE
Retiring GLM-5V-Turbo consolidates the Z.ai lane: one 3-credit model instead of split multimodal (5V) and text (5.1) SKUs. For PR review, GLM-5.2 is the right default when (1) the PR touches many files or deep call chains, (2) Remedy needs multi-hour autonomous fix attempts, or (3) you want open-weight MIT deployment without giving up frontier-adjacent Terminal-Bench and FrontierSWE scores. Keep DeepSeek V4 Flash or Trinity on specialists for cost; escalate to Opus or GPT-5.5 Pro only when the risk profile demands it.
1. Use GLM-5.2 as lead when… Large diffs, long-horizon Remedy, repo-wide refactors, MCP-heavy agent plans.
2. Keep Flash specialists when… DeepSeek V4 Flash, Ling 2.6, Gemma: fan out security/tests cheaply.
3. Attach vision separately when… GLM-5.2 is text-first; use MiMo, Gemini, or Qwen3.7 Plus when screenshots drive the review.
4. Inference API clients: keep calling `z-ai/glm-5.1`; Critique routes to GLM-5.2 automatically.
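The guidance above can be written down as a small decision helper. This is only a sketch of the rules as stated, not Critique's actual routing logic: the `PullRequest` and `ReviewPlan` types, the `plan_review` function, and the 20-file threshold for "large" are all hypothetical. The only identifier taken from the text is the public `z-ai/glm-5.1` model ID.

```python
from dataclasses import dataclass
from typing import Optional

# Public model ID from the text; Critique routes it to GLM-5.2.
GLM_LANE_ID = "z-ai/glm-5.1"
# Hypothetical cutoff for "the PR touches many files".
MANY_FILES = 20


@dataclass(frozen=True)
class PullRequest:
    files_changed: int
    long_horizon_remedy: bool  # Remedy expected to run multi-hour fix attempts
    mcp_heavy: bool            # agent plan leans on many MCP tool calls
    has_screenshots: bool      # review is driven by image content
    high_risk: bool            # risk profile may justify a closed frontier model


@dataclass(frozen=True)
class ReviewPlan:
    lead_model: Optional[str]  # None: this sketch makes no recommendation
    attach_vision_specialist: bool
    consider_escalation: bool


def plan_review(pr: PullRequest) -> ReviewPlan:
    """Apply the four rules above to one pull request."""
    glm_fits = (
        pr.files_changed >= MANY_FILES or pr.long_horizon_remedy or pr.mcp_heavy
    )
    return ReviewPlan(
        lead_model=GLM_LANE_ID if glm_fits else None,
        # GLM-5.2 is text-first, so screenshots call for a separate vision model.
        attach_vision_specialist=pr.has_screenshots,
        consider_escalation=pr.high_risk,
    )


if __name__ == "__main__":
    pr = PullRequest(
        files_changed=45,
        long_horizon_remedy=False,
        mcp_heavy=False,
        has_screenshots=True,
        high_risk=False,
    )
    print(plan_review(pr))
```

Keeping the vision and escalation decisions as separate flags mirrors the text: they are attached to the lead choice, not alternatives to it.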
Try GLM-5.2 on your next PR.
Select GLM-5.1 (GLM-5.2 upstream) in the review model picker — 3 credits, 1M context, same economics as before.