Model update · 16 min read · Critique

Qwen 3.6 and Grok 4.3 in Critique: cheaper routing, stronger coding signal, and one absurd xAI price cut

We replaced Grok 4.2 with Grok 4.3 at 3 credits, retired Qwen3.5-27B in favor of Qwen3.6-35B-A3B at 1 credit, added Ling-2.6-Flash at 1 credit, added Qwen3.6-Max-Preview at 8 credits, discounted GLM-5.1 by 1 credit, and moved MiMo v2.5 plus KAT Coder Pro V2 to their new floors.

This update is less about adding names to a dropdown and more about cleaning up our routing ladder. The old Qwen3.5-27B slot had become hard to justify once Qwen shipped a 35B/3B-active successor that is better on repository reasoning, better on frontend tasks, and still cheap enough to use as a routine specialist. On the xAI side, Grok 4.3 gives us a simpler story: reasoning is always on, multimodal input stays available, the context window remains large, and the credit floor falls sharply enough that it stops being a novelty lane and starts being a real option.

  • 3 cr: New Grok 4.3 floor, down from 8 cr for the old Grok 4.2 slot
  • 1 cr: Qwen3.6-35B-A3B floor, 0.5 credits cheaper than retired Qwen3.5-27B
  • 1 cr: Ling-2.6-Flash joins the catalog as a fast InclusionAI agent lane
  • 8 cr: Qwen3.6-Max-Preview floor in Critique and Remedy
  • 3 cr: GLM-5.1 now costs 1 credit less than before
  • 2 cr: KAT Coder Pro V2 returns to its normal shelf price after a month-long discount
  • 1.5 cr: New MiMo v2.5 floor after a +1 credit adjustment

New routing snapshot

Catalog floors after the May 2, 2026 refresh.

7 models · 73.4% top benchmark score · 1 cr lowest floor

  • Qwen3.6-35B-A3B (1 cr): benchmark score 73.4%
  • Ling-2.6-Flash (1 cr): 104B total · 7.4B active · fast agent lane
  • MiMo v2.5 (1.5 cr): credit floor raised by 1 cr
  • KAT Coder Pro V2 (2 cr): back to normal shelf price
  • Grok 4.3 (3 cr): 1M ctx · reasoning-only · text+image in
  • GLM-5.1 (3 cr): discounted by 1 cr
  • Qwen3.6-Max-Preview (8 cr): top score on 6 coding benchmarks (vendor summary)

SWE-bench scores reflect best observed performance on the toughest real-world coding tasks.

All scores are relative.

xAI’s current developer docs surface grok-4.3 as the active flagship in the overview flow, with a 1M-token context class and text-plus-image input support. Their model docs also keep the important behavior constraint from the Grok 4 family: reasoning is built in, and there is no separate reasoning-effort dial for the standard Grok 4 line. That makes Grok 4.3 a clean fit for Critique. We do not need to explain hidden mode switches to users, and we do not need one price for “thinking” and another for “non-thinking.”

xAI slot refresh

How the Grok lane changed

We replaced the old Grok 4.2 entry with Grok 4.3 and repriced the lane hard downward.

Throughput tier
Grok 4.3

Reasoning-only xAI flagship with 1M context and multimodal inputs, now viable for regular lead and specialist use.

  • Critique floor: 3 cr / run
  • xAI API input, per 1M tokens: $1.25
  • xAI API output, per 1M tokens: $2.50

xAI public docs show Grok 4.3 in the live developer overview as of May 2, 2026. The $1.25 input / $2.50 output prices per 1M tokens are the current xAI vendor pricing we used as input for this catalog refresh.
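To make the vendor pricing concrete, here is a minimal sketch of the per-request arithmetic at the listed $1.25 / $2.50 per 1M tokens. The function name and the token counts are hypothetical, chosen only for illustration, and the sketch says nothing about how Critique maps vendor cost to credits.

```python
from decimal import Decimal

# Vendor list prices quoted above, in USD per 1M tokens.
INPUT_USD_PER_M = Decimal("1.25")
OUTPUT_USD_PER_M = Decimal("2.50")
TOKENS_PER_M = Decimal(1_000_000)


def vendor_cost_usd(input_tokens: int, output_tokens: int) -> Decimal:
    """Raw vendor API cost for one request at the list prices above."""
    input_cost = Decimal(input_tokens) / TOKENS_PER_M * INPUT_USD_PER_M
    output_cost = Decimal(output_tokens) / TOKENS_PER_M * OUTPUT_USD_PER_M
    return input_cost + output_cost


# Hypothetical review run: 200K tokens of repository context in, 20K tokens out.
example_cost = vendor_cost_usd(200_000, 20_000)
print(f"${example_cost:.2f}")  # prints "$0.30"
```

At these prices the input side dominates a context-heavy request: in the example, $0.25 of the $0.30 comes from input tokens.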

Qwen3.6-35B-A3B is the cleaner benchmark story in the release. Qwen’s official Hugging Face card for the model lists a 35B total / 3B active MoE architecture, 262,144 native context, and extension to roughly 1.01M tokens. The published coding-agent table is mixed on the SWE rows, but it shows clear gains on the repo-scale and front-end-shaped tasks we care about most: Terminal-Bench 2.0, Claw-Eval average, SkillsBench, QwenClawBench, NL2Repo, and QwenWebBench.

Qwen3.6-35B-A3B on the rows that matter most to Critique
Official Qwen model-card scores on repo, terminal, and browser-shaped coding tasks. Higher is better.
  • Terminal-Bench 2.0 - Qwen3.6-35B-A3B: 51.5%
  • Terminal-Bench 2.0 - Qwen3.5-27B: 41.6%
  • Claw-Eval Avg - Qwen3.6-35B-A3B: 68.7%
  • Claw-Eval Avg - Qwen3.5-27B: 64.3%
  • SkillsBench Avg5 - Qwen3.6-35B-A3B: 28.7%
  • SkillsBench Avg5 - Qwen3.5-27B: 27.2%

Qwen’s table mixes percentages with benchmark-specific scales such as QwenWebBench. Qwen3.5-27B still leads on some SWE rows, but Qwen3.6-35B-A3B leads most of the broader agentic, terminal, browser, and repo-style workflow rows while also costing less in our catalog.

Why we still replaced Qwen3.5-27B

The replacement call is about the overall workflow mix, not one benchmark in isolation.

| Metric | Qwen3.5-27B | Qwen3.6-35B-A3B | Practical read |
| --- | --- | --- | --- |
| Context | 256K class | 262K native / ~1.01M extended | New model has more headroom |
| Active params | dense-style 27B slot | 3B active MoE | Cheaper inference profile |
| Terminal-Bench 2.0 | 41.6 | 51.5 | Meaningful jump for agent loops |
| NL2Repo | 27.3 | 29.4 | Better repo-scale reasoning |
| QwenWebBench | 1068 | 1397 | Better browser/front-end shaped tasks |
| Critique floor | 1.5 cr (retired) | 1 cr | Cheaper and broader |

One important correction to the simplistic launch line: the official Qwen table does not show a clean sweep on every single coding number. Qwen3.5-27B still posts higher values on some SWE rows. What it does show is that Qwen3.6-35B-A3B is stronger on the repo-scale, terminal, browser, and agentic workflow benchmarks that matter more to Critique’s review and Remedy loops, while also costing less in our catalog. That is enough to retire the older slot.
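Because QwenWebBench is reported on its own scale rather than as a percentage, absolute deltas across rows are hard to compare. One rough way to put the three scored rows from the comparison on a common footing is relative gain over the retired model. The sketch below uses only the scores quoted in this post; the helper name is hypothetical, and relative gain is our illustrative choice, not a metric Qwen publishes.

```python
# Scores quoted in the comparison above: (Qwen3.5-27B, Qwen3.6-35B-A3B).
ROWS = {
    "Terminal-Bench 2.0": (41.6, 51.5),
    "NL2Repo": (27.3, 29.4),
    "QwenWebBench": (1068, 1397),
}


def relative_gain_pct(old: float, new: float) -> float:
    """Percentage change from old to new, rounded to one decimal place."""
    return round((new - old) / old * 100, 1)


gains = {name: relative_gain_pct(old, new) for name, (old, new) in ROWS.items()}
for name, pct in gains.items():
    print(f"{name}: {pct:+.1f}%")
# Terminal-Bench 2.0: +23.8%
# NL2Repo: +7.7%
# QwenWebBench: +30.8%
```

Relative gains on differently constructed benchmarks are still not strictly comparable, so these figures are best read as a direction-and-magnitude check rather than a ranking.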

InclusionAI positions Ling-2.6-Flash as a fast instruct model for real-world agents: 104B total parameters, 7.4B active parameters, and a focus on token-efficient execution rather than theatrical benchmark chasing. The official Hugging Face card emphasizes lower token usage across coding, document processing, and lightweight workflows. That makes it a natural cheap specialist lane in Critique, especially for teams that want something faster and broader than a tiny extraction model but still do not want to pay mid-tier prices.

Alibaba positions Qwen3.6-Max-Preview as the higher-end proprietary Qwen lane. The official Model Studio docs list a 256K context window with thinking mode, function calling, and structured output support. Alibaba’s launch writeup says the preview model improves on Qwen3.7-Plus by +9.9 on SkillsBench, +6.3 on SciCode, +5.0 on NL2Repo, and +3.8 on Terminal-Bench 2.0, then summarizes the release by saying it leads six major coding benchmarks in their internal comparison set.

Qwen3.6-Max-Preview vs Qwen3.7-Plus
Vendor-reported deltas from Alibaba’s launch note. Higher is better.
  • SkillsBench: +9.9 pts
  • SciCode: +6.3 pts
  • NL2Repo: +5.0 pts
  • Terminal-Bench 2.0: +3.8 pts
  • SuperGPQA: +2.3 pts
  • ToolcallFormatIFBench: +2.8 pts

Alibaba published the improvement margins and the six-benchmark summary in the launch article, not a full plain-text score table in the page body.

Three more credit moves round out the refresh. z-ai/glm-5.1 drops from 4 credits to 3, making it a more attractive mid-tier generalist. kwaipilot/kat-coder-pro-v2 rises from 1 credit back to its normal 2-credit shelf price; the lower number was a month-long partnership discount with the Kwaipilot team, not the permanent list price. And xiaomi/mimo-v2.5 rises from 0.5 to 1.5 credits, ending the unusually cheap launch positioning it held earlier.
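For anyone tracking these moves in their own budgeting scripts, the whole refresh fits in a small table of old and new floors. The sketch below is one possible representation, using only the credit figures stated in this post; the `FloorChange` class and its field names are hypothetical and do not reflect Critique's actual catalog schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FloorChange:
    model: str
    old_floor: Optional[float]  # None: the lane is new in this refresh
    new_floor: float

    @property
    def delta(self) -> Optional[float]:
        """Credit change per run; None when there is no previous floor."""
        if self.old_floor is None:
            return None
        return self.new_floor - self.old_floor


# Floors as stated in this post. Replacement lanes are compared with the slot they replaced.
CHANGES = [
    FloorChange("Grok 4.3 (was Grok 4.2)", 8, 3),
    FloorChange("Qwen3.6-35B-A3B (was Qwen3.5-27B)", 1.5, 1),
    FloorChange("Ling-2.6-Flash", None, 1),
    FloorChange("Qwen3.6-Max-Preview", None, 8),
    FloorChange("GLM-5.1", 4, 3),
    FloorChange("KAT Coder Pro V2", 1, 2),
    FloorChange("MiMo v2.5", 0.5, 1.5),
]


def describe(change: FloorChange) -> str:
    """One-line summary of a floor change, in credits per run."""
    if change.delta is None:
        return f"{change.model}: new at {change.new_floor:g} cr"
    return f"{change.model}: {change.old_floor:g} -> {change.new_floor:g} cr ({change.delta:+g})"


# Print the ladder from cheapest to most expensive floor.
for change in sorted(CHANGES, key=lambda c: c.new_floor):
    print(describe(change))
```

Sorting by the new floor reproduces the routing snapshot order, with the two 1-credit lanes at the bottom of the ladder and Qwen3.6-Max-Preview at the top.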