Model updateJune 13, 202624 min readCritique

Kimi K2.7 Code Lands in Critique: Open-Source Coding at 4.5 Credits, With Moonshot’s Full Benchmark Table

Moonshot’s coding-specialized K2.7 release improves Kimi Code Bench v2, Program Bench, and MLS Bench Lite over K2.6 while cutting reasoning-token usage — plus how Critique routes `moonshotai/kimi-k2.7-code` alongside Kimi K2.6 at 4 and 4.5 credits.

+21.8%

Kimi Code Bench v2 vs K2.6 (Moonshot)

−30%

Reasoning-token usage vs K2.6 (Moonshot claim)

0 cr

K2.7 Code review floor on Critique

Context class — same deployment family as K2.6

June 2026’s Kimi story is narrower than April’s K2.6 launch — and that is the point. Moonshot did not ship another generalist refresh. Kimi K2.7 Code is a coding-first fork of the K2 MoE stack: ~1T total parameters, 32B activated per token, native multimodal inputs (text, image, video on the official API), and forced thinking with preserve_thinking so agent threads keep full reasoning content across turns. Weights and code are on Hugging Face under a modified MIT license; OpenRouter lists moonshotai/kimi-k2.7-code at $0.95/M input and $4.00/M output. Critique’s job is the same as always: wire the model through the existing OpenRouter-shaped adapter, price it honestly on the credit ladder, and give you charts that compare apples to apples — not a grab bag of unrelated leaderboard rows.

Critique credits — Moonshot lane
K2.7 Code costs half a credit more than K2.6Both models share the same review and Remedy surfaces. Pick K2.7 when long-horizon coding depth and tool efficiency matter; keep K2.6 when you want the general multimodal flagship at the lower floor.
Capability tier
Kimi K2.7 Code
Coding-specialized lane: forced thinking, preserve_thinking, long-horizon agent tasks.
Critique floor4.5 cr / run
MoonshotAI API — input / 1M$0.95
MoonshotAI API — output / 1M$4.00
Throughput tier
Kimi K2.6
General multimodal K2 flagship — still the default Moonshot slot for balanced review.
Critique floor4 cr / run
MoonshotAI API — input / 1M$0.75
MoonshotAI API — output / 1M$3.75
Credit floors are per agent step on paid review and Remedy. Token burn still scales with PR size, specialist fan-out, and reasoning length — K2.7’s efficiency claims target that variable cost.
Full credit ladder and plans

PART ONE — WHAT K2.7 CODE IS

Moonshot describes K2.7 Code as built on Kimi K2.6’s backbone with deployment reuse: same vLLM / SGLang / KTransformers paths, same transformers pin (>=4.57.1, <5.0.0), and the same 256K / 262,144-token evaluation context. What changes is optimization for software engineering workflows — instruction following on long tasks, end-to-end completion rates, and less “overthinking” in the reasoning trace. The model always runs in thinking mode; instant mode is not supported. Video input remains experimental outside Moonshot’s official API even though the weights are multimodal.

1T / 32B

Total / activated MoE parameters

256K

Quoted context (262,144 in eval runs)

MIT*

Modified MIT license on HF weights

Jun 2026

Open-source release on Hugging Face

PART TWO — MOONSHOT’S SIX-BENCHMARK TABLE (SAME HARNESS)

The grouped charts below reproduce Moonshot’s published table. K2.7 Code improves every row versus K2.6: +11.1 points on Kimi Code Bench v2 (50.9 → 62.0), +5.3 on Program Bench, +8.4 on MLS Bench Lite, +4.0 on Kimi Claw 24/7, +6.6 on MCP Atlas, and +8.3 on MCP Mark Verified. Against closed frontier models on the same page, GPT-5.5 still leads Program Bench and MCP Mark Verified; Opus 4.8 leads MLS Bench Lite and MCP Atlas. K2.7 nearly ties GPT-5.5 on MLS Bench Lite (35.1 vs 35.5) — the multi-language systems-programming lane Moonshot highlights for Rust, Go, and Python.

Coding

Coding benchmarks — Moonshot harness

Kimi Code Bench v2* and Kimi Claw 24/7* are Moonshot in-house suites. Higher is better. Source: moonshotai/Kimi-K2.7-Code model card.

Kimi K2.7 Code

Kimi K2.6

GPT-5.5 (xhigh)

Opus 4.8 (xhigh)

K2.6/K2.7: Kimi Code CLI, thinking on, temp 1.0, top-p 0.95. GPT-5.5: Codex xhigh. Opus 4.8: Claude Code xhigh.

Agents

Agentic benchmarks — Moonshot harness

MCP Atlas uses the official 100 tool-call budget configuration. MCP Mark Verified is Moonshot’s human-verified MCPMark edition.

Kimi K2.7 Code

Kimi K2.6

GPT-5.5 (xhigh)

Opus 4.8 (xhigh)

Kimi Claw 24/7 spans 17 professional scenarios across 610 evaluation points via OpenClaw. MCP Mark Verified covers Notion, GitHub, Filesystem, Postgres, and Playwright servers.

Full Moonshot comparison table (percent scores)

Exactly as printed on the Kimi K2.7 Code Hugging Face card — June 2026.

Benchmark	Kimi K2.6	Kimi K2.7 Code	GPT-5.5	Opus 4.8
Kimi Code Bench v2*	50.9	62.0	69.0	67.4
Program Bench	48.3	53.6	69.1	63.8
MLS Bench Lite	26.7	35.1	35.5	42.8
Kimi Claw 24/7 Bench*	42.9	46.9	52.8	50.4
MCP Atlas	69.4	76.0	79.4	81.3
MCP Mark Verified	72.8	81.1	92.9	76.4

* Moonshot in-house benchmark.

PART THREE — REASONING EFFICIENCY VS K2.6

Moonshot’s second headline is cost-shaped: roughly 30% lower reasoning-token usage versus K2.6 on equivalent tasks, with less overthinking in the trace. The scatter chart plots Moonshot’s published efficiency story on the three coding benchmarks — performance (vertical) against reasoning tokens in thousands (horizontal). Up and left is better: K2.7 moves up on score while shifting left on tokens for Kimi Code Bench v2 and Program Bench. That pattern matters for agentic PR review where reasoning tokens are billed output and compound across specialist passes.

Performance vs reasoning tokens — K2.7 Code vs K2.6

Moonshot launch materials: coding benchmarks with thinking mode enabled. Token counts are average reasoning tokens per task (thousands).

Kimi K2.6Kimi K2.7 CodeArrows in Moonshot materials: up-left = higher score, fewer reasoning tokens.

Token coordinates are from Moonshot’s K2.7 Code efficiency chart in launch materials. Treat as vendor-reported; validate on your own PR distributions before changing production routing.

PART FOUR — MCP ATLAS: ONE BENCHMARK, MANY VENDORS

Teams often ask how K2.7 compares to MiniMax M3, Qwen3.7 Max, Gemini 3.5 Flash, or the Opus 4.7 generation on tool use. Moonshot’s six-benchmark table only includes GPT-5.5 and Opus 4.8. MCP Atlas is the cleanest extension point: multiple vendors published Scale MCP Atlas public-set scores in May–June 2026 using the same benchmark name. We do not mix in SWE-Bench Pro or Terminal-Bench rows here — those are different contests with different harnesses.

MCP Atlas — cross-vendor (same benchmark only)

Scale MCP Atlas public set. Vendor tables from Google DeepMind (Gemini 3.5 Flash), MiniMax M3 launch, and Critique’s June 2026 model roundup. Kimi rows from Moonshot K2.7 card.

Gemini 3.5 Flash83.6%
Claude Opus 4.882.2%
Claude Opus 4.779.1%
Gemini 3.1 Pro78.2%
Qwen3.7 Max76.4%
Kimi K2.7 Code76%
GPT-5.575.3%
MiniMax M374.2%
GLM-5.171.8%
Kimi K2.669.4%

GPT-5.4 does not appear in the MCP Atlas tables we tracked for this essay. Opus 4.7 and 4.8 scores are from vendor pages cited in our June model refresh — not re-run by Critique.

PART FIVE — WHERE K2.7 EARNS A SLOT IN CRITIQUE

Critique’s review pipeline does not change shape when a new Moonshot SKU arrives. Webhooks still become runs; the runtime adapter still resolves OpenRouter headers; the lead model still reconciles specialist outputs. What changes is which Moonshot ID you select in policy: moonshotai/kimi-k2.7-code for coding-heavy, tool-shaped, long-horizon diffs where reasoning efficiency and MCP-style tool orchestration dominate; moonshotai/kimi-k2.6 when you want the broader multimodal flagship at 4 credits. Remedy chat lists both at their respective floors so fix loops can stay on the same family without a separate integration.

Practically, reach for K2.7 Code when your pain looks like the benchmarks it wins on: multi-file backend work, infrastructure and performance engineering, multilingual systems code, and agentic tool chains that resemble MCP Atlas or MCP Mark Verified more than a single-turn lint pass. Stay on K2.6 — or a cheaper specialist — when the PR is small, latency-sensitive, or purely presentational. Public benchmarks orient routing; your monorepo still wins the argument.

Quick routing checklist
1Is the PR dominated by long-horizon coding or tool orchestration?
Try K2.7 Code at 4.5 credits — its gains concentrate on Moonshot’s coding and MCP suites.
2Do you need video or broad multimodal synthesis at the lowest Moonshot credit floor?
Keep Kimi K2.6 at 4 credits — it remains the generalist K2 slot.
3Are you optimizing reasoning-token spend on agentic review?
Benchmark K2.7 against K2.6 on your repo; Moonshot claims ~30% fewer reasoning tokens on the same coding tasks.
4Need frontier closed-model depth on hard repo-wide refactors?
Opus 4.8 and GPT-5.5 still lead several Moonshot-table rows — use them when credits allow, not by default.

Primary sources

Moonshot — Kimi K2.7 Code on Hugging Face

Architecture, six-benchmark table, deployment notes

OpenRouter — moonshotai/kimi-k2.7-code

Aggregator pricing and API routing

Kimi API — K2.7 Code quickstart

Official API, thinking mode, preserve_thinking

Google DeepMind — Gemini 3.5 Flash benchmarks

MCP Atlas 83.6% and cross-model agentic table

Critique — Kimi K2.6 launch essay

Earlier Moonshot lane context and K2.6 integration

Compare Critique

Compare the main AI code review options.

If this article is part of a buying process, these pages compare Critique with the tools most teams evaluate for GitHub PR review.

Best AI code review tools AI code review pricing

← All essays Privacy & Terms

Get started

Ask about this essay

Nemotron-3-Super

Ask about the argument, the evidence, the structure, or how the post connects to Critique.

Not editorial advice · The essay above is the source of truth · Not saved to your account · OpenRouter privacy