Model update · June 17, 2026 · 22 min read · Critique

GLM-5.2 Lands in Critique: 1M Context, Design Arena #1, Same 3-Credit GLM-5.1 Shelf

Q: What happened to GLM-5V-Turbo and GLM-5.1 in the catalog?

Both retire in favor of a single `z-ai/glm-5.2` entry displayed as GLM-5.1 at 3 credits. Legacy IDs alias forward so saved policies keep working.

Z.ai’s June 2026 flagship upgrades the GLM lane to `z-ai/glm-5.2`, bringing MIT open weights, a usable 1M-token window, and vendor-reported gains on SWE-bench Pro, Terminal-Bench 2.1, FrontierSWE, and MCP-Atlas, while Critique keeps the 3-credit GLM-5.1 floor and the existing inference API slot.

1M

Usable context (IndexShare DSA + KVShare MTP)

1360

Design Arena Elo, #1 on crowdsourced design tasks

+17.5

Terminal-Bench 2.1 pts vs GLM-5.1 (81.0 vs 63.5)

3 cr

Review + Remedy floor, unchanged from GLM-5.1

June 16, 2026 is the long-horizon inflection for Z.ai. GLM-5.2 is not a token-window press release — it is a 753B-parameter MoE flagship with MIT open weights, effort-level control (High / Max), and training aimed at coding-agent trajectories that run for hours, not minutes. The architecture story matters for reviewers: IndexShare reuses sparse-attention indexers across layers (2.9× fewer FLOPs at 1M context), and an improved MTP speculative-decoding stack raises acceptance length by ~20%. Critique’s job is unchanged: wire the model through the OpenRouter adapter, keep credit economics honest, and chart Z.ai’s published numbers against the same frontier rows they evaluated — not a mash-up of unrelated leaderboards.

Critique credits: Z.AI lane

GLM-5.2 ships at the GLM-5.1 price

Same 3-credit floor for PR review and Remedy. Token burn still scales with PR size, specialist fan-out, and Max effort, but the capability jump is free at the credit layer.

Capability tier: GLM-5.1 (GLM-5.2 upstream)
Long-horizon lead or specialist: 1M context, effort levels, MIT weights on Hugging Face.
Critique floor: 3 cr / run
Z.AI API, input / 1M: $1.40
Z.AI API, output / 1M: $4.40

Throughput tier: Inference API slot
Public z-ai/glm-5.1 ID routes to GLM-5.2 at Critique list ($0.95 / $3.35 per M) with training opt-in discount.
Critique floor: metered cr / run
Z.AI API, input / 1M: $0.95
Z.AI API, output / 1M: $3.35

OpenRouter list for GLM-5.2 is higher than Critique inference list, the same pattern as other discounted inference lanes. Review credits are per agent step, not per million tokens.
Full credit ladder and plans
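To make the per-million-token rates in the pricing card concrete, here is a minimal Python sketch that prices one request at both tiers. The rates come from the card above. The `TokenPrice` class, the tier constant names, and the 400K-input / 20K-output workload are illustrative assumptions, not Critique or Z.ai API objects. Review credits are billed per agent step, so this covers only the token-metered side.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenPrice:
    """USD per one million tokens, as quoted in the pricing card."""
    input_per_m: float
    output_per_m: float

    def cost(self, input_tokens: int, output_tokens: int) -> float:
        """Return the USD cost of a single request."""
        total = input_tokens * self.input_per_m + output_tokens * self.output_per_m
        return total / 1_000_000


# Rates copied from the card; the constant names are hypothetical.
CAPABILITY_TIER = TokenPrice(input_per_m=1.40, output_per_m=4.40)
THROUGHPUT_TIER = TokenPrice(input_per_m=0.95, output_per_m=3.35)

# Hypothetical workload: a large PR context with a short review response.
INPUT_TOKENS, OUTPUT_TOKENS = 400_000, 20_000

capability_cost = CAPABILITY_TIER.cost(INPUT_TOKENS, OUTPUT_TOKENS)
throughput_cost = THROUGHPUT_TIER.cost(INPUT_TOKENS, OUTPUT_TOKENS)

print(f"capability tier: ${capability_cost:.3f}")
print(f"throughput tier: ${throughput_cost:.3f}")
```

For this assumed workload, input tokens dominate the bill at either rate, which is why PR size matters more than response length when estimating token burn.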

PART ONE — WHAT GLM-5.2 IS

Z.ai positions GLM-5.2 for “project-level engineering context” — requirements through multi-platform deployment in one agent thread. That only works if 1M context stays coherent under messy agent traces (repeated tool errors, compaction, sub-agent handoffs). Their training expansion for coding-agent scenarios — large implementation, automated research, perf optimization, complex debugging — is the substantive claim behind the window size.

753B

Total MoE parameters (Z.ai / VentureBeat)

1M

Context class, up from ~200K on GLM-5.1

MIT

Open license — weights on Hugging Face & ModelScope

Jun 2026

Launch date on z.ai/blog/glm-5.2

PART TWO — STANDARD CODING BENCHMARKS

On the coding rows Z.ai publishes together, GLM-5.2 is the strongest open-weight model and closes much of the gap to the closed frontier. The standout delta versus its predecessor is Terminal-Bench 2.1 (Terminus-2): 81.0 vs 63.5 on GLM-5.1, on a benchmark that stresses shell iteration, error recovery, and finish rates. SWE-bench Pro moves to 62.1 (+3.7 vs GLM-5.1), edging GPT-5.5 (58.6) and landing under Opus 4.8 (69.2). NL2Repo (greenfield repo synthesis) hits 48.9, just behind GPT-5.5’s 50.7 and well behind Opus’s 69.7, so the frontier gap on net-new repo construction is still real.

Coding benchmarks: Z.ai harness

SWE-bench Pro, Terminal-Bench 2.1 (Terminus-2), and NL2Repo from the GLM-5.2 launch table. Higher is better. Series: GLM-5.2, GLM-5.1, GPT-5.5, Opus 4.8.

Terminal-Bench 2.1: Terminus-2, temp 1.0, 256K ctx, 4 CPU / 8 GB RAM per task. SWE-bench Pro: OpenHands, 400K ctx. NL2Repo: 400K ctx, anti-hack filters.

PART THREE — LONG-HORIZON BENCHMARKS

This is where GLM-5.2’s positioning is clearest. FrontierSWE (open-ended technical projects at hour-to-tens-of-hours scale) puts GLM-5.2 at 74.4%: 0.7 points behind Opus 4.8 (75.1%), +1.8 vs GPT-5.5 (72.6%), and +43.9 vs GLM-5.1 (30.5%). PostTrainBench (H100 post-training improvement) scores 34.3%, beating GPT-5.5 (28.4%) and trailing only Opus 4.8 (37.2%). SWE-Marathon (compilers, kernels, production services) is still Opus territory (26.0% vs GLM-5.2’s 13.0%), but GLM-5.2 ranks second overall. The marathon lane is where every non-Opus model still has headroom.

FrontierSWE · PostTrainBench · SWE-Marathon

Z.ai’s three long-horizon suites, run at 1M context, max effort, and 128K max output where applicable. Dominance scores as of 2026-06-16. Series: GLM-5.2, GLM-5.1, GPT-5.5, Opus 4.8.

FrontierSWE: Proximal eval. PostTrainBench: PostTrainBench service. SWE-Marathon: Abundant AI. GLM-5.1 marathon at 1.0% reflects prior-gen long-horizon limits, not a typo.

PART FOUR — AGENTIC TOOL USE

For PR review orchestration, tool fidelity matters as much as raw coding score. On MCP-Atlas (500-task public subset, think mode, 10-minute timeout), GLM-5.2 reaches 76.8% vs GLM-5.1’s 71.8% — ahead of GPT-5.5 (75.3%) and under Opus 4.8 (77.8%). Tool-Decathlon lands at 48.2%, between GLM-5.1 (40.7%) and GPT-5.5 (55.6%). That pattern — near-frontier MCP scores, mid-pack on broader tool decathlon — is what we expect for a strong open-weight lead that still benefits from specialist sub-agents on security and tests.

MCP-Atlas & Tool-Decathlon

Same Z.ai table, with agentic rows comparable across the four-model slice. Series: GLM-5.2, GLM-5.1, GPT-5.5, Opus 4.8.

MCP-Atlas judge: Gemini 3.0 Pro. Tool-Decathlon: official evaluation service, 128K max tokens.

PART FIVE — DESIGN ARENA ELO

Coding benchmarks do not capture visual product craft. Z.ai and third-party coverage both highlight Design Arena: GLM-5.2 took #1 with Elo 1360, ahead of prior leaders including Claude Fable 5 (since delisted). That is not a substitute for SWE-bench — but it matters when your PRs include UI screenshots, marketing pages, or design-system regressions. Pair GLM-5.2 as lead with a vision specialist when the diff is image-backed; the Design Arena signal suggests the model has real layout and aesthetic judgment, not just codegen.

Design Arena Elo — selected frontier models

Crowdsourced head-to-head design tasks. GLM-5.2 at 1360 per Z.ai launch and VentureBeat coverage; other Elos illustrative from public leaderboard snapshots.

Model	Elo
GLM-5.2	1360
Claude Fable 5 (retired)	1342
GPT-5.5	1288
Gemini 3.1 Pro	1275
GLM-5.1	1210

Non-GLM-5.2 rows are approximate peer snapshots for context — Design Arena Elo shifts as models enter and exit. Treat as directional, not a reproduced single harness run.
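As a rough guide to what these Elo gaps mean, the sketch below applies the classic Elo expected-score formula to the ratings listed above. It assumes Design Arena uses the standard logistic curve with a 400-point scale, which the source does not confirm, and the peer ratings are themselves approximate. Treat the outputs as directional.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected head-to-head score for A against B under classic Elo.

    Assumption: standard base-10 logistic with a 400-point scale.
    Design Arena's exact rating method may differ.
    """
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# Ratings copied from the table above (non-GLM-5.2 rows are approximate).
GLM_5_2, CLAUDE_FABLE_5, GLM_5_1 = 1360, 1342, 1210

vs_fable = elo_expected_score(GLM_5_2, CLAUDE_FABLE_5)
vs_predecessor = elo_expected_score(GLM_5_2, GLM_5_1)

print(f"GLM-5.2 vs Claude Fable 5: {vs_fable:.2f}")
print(f"GLM-5.2 vs GLM-5.1:        {vs_predecessor:.2f}")
```

Under that assumption, an 18-point lead works out to roughly a 53% expected preference rate, close to a coin flip, while the 150-point gap over GLM-5.1 is about 70%. The #1 rank is real, but the margin over the previous leader is thin.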

Z.ai full table excerpt (selected rows)

Percent scores from z.ai/blog/glm-5.2 — June 16, 2026.

Benchmark	GLM-5.1	GLM-5.2	GPT-5.5	Opus 4.8
SWE-bench Pro	58.4	62.1	58.6	69.2
Terminal-Bench 2.1 (Terminus-2)	63.5	81.0	84.0	85.0
FrontierSWE	30.5	74.4	72.6	75.1
PostTrainBench	20.1	34.3	28.4	37.2
SWE-Marathon	1.0	13.0	12.0	26.0
MCP-Atlas	71.8	76.8	75.3	77.8
Design Arena Elo	—	1360 (#1)	—	—

Dash cells: not published for that model on Z.ai’s launch table.
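The deltas quoted throughout this post can be recomputed from the table. A small Python sketch, with the percent rows transcribed from the excerpt above, is shown here. The `SCORES` dictionary and the `gap` helper are illustrative names, not part of any Z.ai or Critique tooling.

```python
# Percent scores transcribed from the table excerpt above.
SCORES = {
    "SWE-bench Pro":      {"GLM-5.1": 58.4, "GLM-5.2": 62.1, "GPT-5.5": 58.6, "Opus 4.8": 69.2},
    "Terminal-Bench 2.1": {"GLM-5.1": 63.5, "GLM-5.2": 81.0, "GPT-5.5": 84.0, "Opus 4.8": 85.0},
    "FrontierSWE":        {"GLM-5.1": 30.5, "GLM-5.2": 74.4, "GPT-5.5": 72.6, "Opus 4.8": 75.1},
    "PostTrainBench":     {"GLM-5.1": 20.1, "GLM-5.2": 34.3, "GPT-5.5": 28.4, "Opus 4.8": 37.2},
    "SWE-Marathon":       {"GLM-5.1": 1.0,  "GLM-5.2": 13.0, "GPT-5.5": 12.0, "Opus 4.8": 26.0},
    "MCP-Atlas":          {"GLM-5.1": 71.8, "GLM-5.2": 76.8, "GPT-5.5": 75.3, "Opus 4.8": 77.8},
}


def gap(benchmark: str, other: str, subject: str = "GLM-5.2") -> float:
    """Points by which `subject` leads (+) or trails (-) `other`, to one decimal."""
    row = SCORES[benchmark]
    return round(row[subject] - row[other], 1)


for name in SCORES:
    print(f"{name:<20} vs GLM-5.1: {gap(name, 'GLM-5.1'):+.1f}   "
          f"vs Opus 4.8: {gap(name, 'Opus 4.8'):+.1f}")
```

Rounding to one decimal keeps floating-point noise out of the printed deltas, since the published scores carry only one decimal place.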

PART SIX — WHERE GLM-5.2 EARNS A SLOT IN CRITIQUE

Retiring GLM-5V-Turbo consolidates the Z.AI lane: one 3-credit model instead of split multimodal (5V) and text (5.1) SKUs. For PR review, GLM-5.2 is the right default when (1) the PR touches many files or deep call chains, (2) Remedy needs multi-hour autonomous fix attempts, or (3) you want open-weight MIT deployment without giving up frontier-adjacent Terminal-Bench and FrontierSWE scores. Keep DeepSeek V4 Flash or Trinity on specialists for cost; escalate to Opus or GPT-5.5 Pro only when the risk profile demands it.
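The three conditions above can be read as a lead-model rule. The following is a minimal sketch of that reading, not Critique's actual router. The `ReviewJob` fields, the 30-file cutoff, the precedence order, and every label other than `z-ai/glm-5.2` are assumptions made for illustration.

```python
from dataclasses import dataclass

GLM_LEAD = "z-ai/glm-5.2"
# Illustrative labels only; these are not real catalog IDs.
ESCALATED_LEAD = "opus-or-gpt-5.5-pro"
UNSPECIFIED_LEAD = "team-default"  # the post does not name a lead for small, low-risk PRs

LARGE_DIFF_FILES = 30  # hypothetical cutoff for "touches many files"


@dataclass(frozen=True)
class ReviewJob:
    files_changed: int
    deep_call_chains: bool = False
    multi_hour_remedy: bool = False
    needs_open_weights: bool = False
    high_risk: bool = False


def pick_lead(job: ReviewJob) -> str:
    """Pick a lead model for a review job under the assumptions above."""
    # Assumed precedence: an open-weight requirement rules out closed models.
    if job.needs_open_weights:
        return GLM_LEAD
    # Escalate only when the risk profile demands it.
    if job.high_risk:
        return ESCALATED_LEAD
    # Conditions (1) and (2): large or deep diffs, or long autonomous Remedy runs.
    if (job.files_changed >= LARGE_DIFF_FILES
            or job.deep_call_chains
            or job.multi_hour_remedy):
        return GLM_LEAD
    return UNSPECIFIED_LEAD


print(pick_lead(ReviewJob(files_changed=80)))
print(pick_lead(ReviewJob(files_changed=3, high_risk=True)))
```

Specialist fan-out (DeepSeek V4 Flash or Trinity for cost) is a separate decision and is left out of this sketch.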

Routing checklist
1. Use GLM-5.2 as lead when…
   Large diffs, long-horizon Remedy, repo-wide refactors, MCP-heavy agent plans.
2. Keep Flash specialists when…
   DeepSeek V4 Flash, Ling 2.6, Gemma: fan out security/tests cheaply.
3. Attach vision separately when…
   GLM-5.2 is text-first; use MiMo, Gemini, or Qwen3.7 Plus when screenshots drive the review.
4. Inference API clients
   Keep calling z-ai/glm-5.1. Critique routes to GLM-5.2 automatically.
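The forward aliasing described for inference API clients can be pictured as a simple lookup. The sketch below is a guess at the shape of that behaviour, not Critique's implementation. The `z-ai/glm-5v-turbo` key is a hypothetical ID (the post names the model but not its catalog ID), and `resolve_model_id` is an invented helper.

```python
# Legacy ID -> current upstream ID. Only the z-ai/glm-5.1 -> z-ai/glm-5.2
# mapping is stated in the post; the GLM-5V-Turbo key is a hypothetical ID.
LEGACY_ALIASES = {
    "z-ai/glm-5.1": "z-ai/glm-5.2",
    "z-ai/glm-5v-turbo": "z-ai/glm-5.2",
}


def resolve_model_id(requested: str) -> str:
    """Map a legacy model ID forward; unknown IDs pass through unchanged."""
    return LEGACY_ALIASES.get(requested, requested)


# A saved policy that still references the old ID keeps working.
saved_policy = {"lead_model": "z-ai/glm-5.1"}
print(resolve_model_id(saved_policy["lead_model"]))
```

Passing unknown IDs through unchanged means the lookup only ever rewrites the retired entries and leaves every other model selection alone.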


Credit shelf and inference API compatibility. You get GLM-5.2 capability without repricing the lane teams already budgeted at 3 credits.

Different strengths: GLM-5.2 leads on long-horizon FrontierSWE and 1M context; Kimi K2.7 Code optimizes coding-specialist efficiency; MiniMax M3 competes on SWE-bench Pro at a similar price band. Route by task shape, not brand loyalty.

No — it measures crowdsourced design preference, not merge safety. Use it as a signal that GLM-5.2 has layout taste; still run your normal review + test gates.

Primary sources

GLM-5.2 launch blog (Z.ai)

GLM-5.2 developer docs

OpenRouter — z-ai/glm-5.2

VentureBeat coverage

Critique pricing

Try GLM-5.2 on your next PR.

Select GLM-5.1 (GLM-5.2 upstream) in the review model picker — 3 credits, 1M context, same economics as before.
