Kimi K2.7 Code Lands in Critique: Open-Source Coding at 4.5 Credits, With Moonshot’s Full Benchmark Table
Moonshot’s coding-specialized K2.7 release improves Kimi Code Bench v2, Program Bench, and MLS Bench Lite over K2.6 while cutting reasoning-token usage. Also covered: how Critique routes `moonshotai/kimi-k2.7-code` at 4.5 credits alongside Kimi K2.6 at 4 credits.
June 2026’s Kimi story is narrower than April’s K2.6 launch — and that is the point. Moonshot did not ship another generalist refresh. Kimi K2.7 Code is a coding-first fork of the K2 MoE stack: ~1T total parameters, 32B activated per token, native multimodal inputs (text, image, video on the official API), and forced thinking with `preserve_thinking` so agent threads keep full reasoning content across turns. Weights and code are on Hugging Face under a modified MIT license; OpenRouter lists `moonshotai/kimi-k2.7-code` at $0.95/M input and $4.00/M output. Critique’s job is the same as always: wire the model through the existing OpenRouter-shaped adapter, price it honestly on the credit ladder, and give you charts that compare apples to apples — not a grab bag of unrelated leaderboard rows.
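The per-token prices quoted above translate into a per-call cost with simple arithmetic. The sketch below is illustrative only: the function name and the example token counts are hypothetical, and it assumes reasoning tokens are counted as output tokens, with no caching discounts or provider fees.

```python
from decimal import Decimal

# List prices quoted in this article for moonshotai/kimi-k2.7-code on OpenRouter,
# in USD per million tokens.
INPUT_USD_PER_M = Decimal("0.95")
OUTPUT_USD_PER_M = Decimal("4.00")


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> Decimal:
    """Estimate the list-price cost of one call.

    Assumption: output_tokens already includes reasoning tokens.
    """
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    million = Decimal(1_000_000)
    return (INPUT_USD_PER_M * input_tokens + OUTPUT_USD_PER_M * output_tokens) / million


# Hypothetical review step: 120k tokens of diff and context in, 15k tokens out.
print(estimate_cost_usd(120_000, 15_000))
```

This is dollar cost at the provider, which is separate from Critique’s credit floors.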
K2.7 Code costs half a credit more than K2.6
Both models share the same review and Remedy surfaces. Pick K2.7 when long-horizon coding depth and tool efficiency matter; keep K2.6 when you want the general multimodal flagship at the lower floor.
- Kimi K2.7 Code (4.5 credits): coding-specialized lane with forced thinking, `preserve_thinking`, and long-horizon agent tasks.
- Kimi K2.6 (4 credits): general multimodal K2 flagship, still the default Moonshot slot for balanced review.
Credit floors are per agent step on paid review and Remedy. Token burn still scales with PR size, specialist fan-out, and reasoning length — K2.7’s efficiency claims target that variable cost.
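To make the floor arithmetic concrete, here is a minimal sketch that multiplies the per-step floor by the number of agent steps in a run. The function name and the six-step example are hypothetical, and the sketch assumes floors simply add up across steps; variable token burn is deliberately left out.

```python
# Per-step credit floors as stated in this article.
CREDIT_FLOOR = {
    "moonshotai/kimi-k2.6": 4.0,
    "moonshotai/kimi-k2.7-code": 4.5,
}


def floor_credits(model_id: str, agent_steps: int) -> float:
    """Return the minimum credits for a run, assuming floors add linearly per step."""
    if agent_steps < 0:
        raise ValueError("agent_steps must be non-negative")
    if model_id not in CREDIT_FLOOR:
        raise KeyError(f"no credit floor known for {model_id!r}")
    return CREDIT_FLOOR[model_id] * agent_steps


# Hypothetical run with six agent steps.
steps = 6
k26 = floor_credits("moonshotai/kimi-k2.6", steps)
k27 = floor_credits("moonshotai/kimi-k2.7-code", steps)
print(k26, k27, k27 - k26)
```

Under that assumption, the half-credit gap grows by 0.5 credits for every additional step.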
PART ONE — WHAT K2.7 CODE IS
Moonshot describes K2.7 Code as built on Kimi K2.6’s backbone with deployment reuse: same vLLM / SGLang / KTransformers paths, same `transformers` pin (`>=4.57.1, <5.0.0`), and the same 256K / 262,144-token evaluation context. What changes is optimization for software engineering workflows — instruction following on long tasks, end-to-end completion rates, and less “overthinking” in the reasoning trace. The model always runs in thinking mode; instant mode is not supported. Video input remains experimental outside Moonshot’s official API even though the weights are multimodal.
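If you self-host, the `transformers` pin is easy to check before loading weights. The sketch below uses only the standard library and a simplified comparison that ignores pre-release suffixes such as `rc1` or `.dev0`; the helper names are hypothetical, and a full PEP 440 check (for example with the `packaging` library) would be stricter.

```python
import re
from importlib import metadata
from typing import Optional

LOWER = (4, 57, 1)  # inclusive lower bound from the pin
UPPER = (5, 0, 0)   # exclusive upper bound from the pin


def parse_release(version: str) -> tuple:
    """Return the first three numeric release parts, padded with zeros.

    Simplification: suffixes such as 'rc1' or '.dev0' are ignored.
    """
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"not a version string: {version!r}")
    parts = [int(p) for p in match.group().split(".")]
    return tuple((parts + [0, 0, 0])[:3])


def satisfies_pin(version: str) -> bool:
    """True when version falls inside >=4.57.1, <5.0.0."""
    return LOWER <= parse_release(version) < UPPER


def installed_transformers_ok() -> Optional[bool]:
    """Check the installed transformers version; None if it is not installed."""
    try:
        return satisfies_pin(metadata.version("transformers"))
    except metadata.PackageNotFoundError:
        return None


print(installed_transformers_ok())
```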
PART TWO — MOONSHOT’S SIX-BENCHMARK TABLE (SAME HARNESS)
The grouped charts below reproduce Moonshot’s published table. K2.7 Code improves every row versus K2.6: +11.1 points on Kimi Code Bench v2 (50.9 → 62.0), +5.3 on Program Bench, +8.4 on MLS Bench Lite, +4.0 on Kimi Claw 24/7, +6.6 on MCP Atlas, and +8.3 on MCP Mark Verified. Against the closed frontier models on the same page, K2.7 Code tops no row: GPT-5.5 leads Kimi Code Bench v2, Program Bench, Kimi Claw 24/7, and MCP Mark Verified, while Opus 4.8 leads MLS Bench Lite and MCP Atlas. The gaps are narrowest in two places. K2.7 nearly ties GPT-5.5 on MLS Bench Lite (35.1 vs 35.5), the multi-language systems-programming lane Moonshot highlights for Rust, Go, and Python, and it beats Opus 4.8 on MCP Mark Verified (81.1 vs 76.4).
Chart panels: Coding, Agents.
Exactly as printed on the Kimi K2.7 Code Hugging Face card — June 2026.
| Benchmark | Kimi K2.6 | Kimi K2.7 Code | GPT-5.5 | Opus 4.8 |
|---|---|---|---|---|
| Kimi Code Bench v2* | 50.9 | 62.0 | 69.0 | 67.4 |
| Program Bench | 48.3 | 53.6 | 69.1 | 63.8 |
| MLS Bench Lite | 26.7 | 35.1 | 35.5 | 42.8 |
| Kimi Claw 24/7 Bench* | 42.9 | 46.9 | 52.8 | 50.4 |
| MCP Atlas | 69.4 | 76.0 | 79.4 | 81.3 |
| MCP Mark Verified | 72.8 | 81.1 | 92.9 | 76.4 |
* Moonshot in-house benchmark.
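The deltas and row leaders can be recomputed from the table directly. The sketch below copies the scores as printed; the dictionary layout and function names are our own illustration, not part of Moonshot’s release.

```python
# Scores copied from the table above, in column order.
MODELS = ("Kimi K2.6", "Kimi K2.7 Code", "GPT-5.5", "Opus 4.8")
SCORES = {
    "Kimi Code Bench v2": (50.9, 62.0, 69.0, 67.4),
    "Program Bench": (48.3, 53.6, 69.1, 63.8),
    "MLS Bench Lite": (26.7, 35.1, 35.5, 42.8),
    "Kimi Claw 24/7 Bench": (42.9, 46.9, 52.8, 50.4),
    "MCP Atlas": (69.4, 76.0, 79.4, 81.3),
    "MCP Mark Verified": (72.8, 81.1, 92.9, 76.4),
}


def gain_over_k26(benchmark: str) -> float:
    """Point gain of K2.7 Code over K2.6, rounded to one decimal."""
    k26, k27, _, _ = SCORES[benchmark]
    return round(k27 - k26, 1)


def leader(benchmark: str) -> str:
    """Name of the model with the highest score on a benchmark row."""
    row = SCORES[benchmark]
    return MODELS[row.index(max(row))]


for name in SCORES:
    print(f"{name}: +{gain_over_k26(name)} vs K2.6, leader: {leader(name)}")
```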
PART THREE — REASONING EFFICIENCY VS K2.6
Moonshot’s second headline is cost-shaped: roughly 30% lower reasoning-token usage versus K2.6 on equivalent tasks, with less overthinking in the trace. The scatter chart plots Moonshot’s published efficiency story on the three coding benchmarks — performance (vertical) against reasoning tokens in thousands (horizontal). Up and left is better: K2.7 moves up on score while shifting left on tokens for Kimi Code Bench v2 and Program Bench. That pattern matters for agentic PR review where reasoning tokens are billed output and compound across specialist passes.
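To see how that claim compounds, the sketch below applies a flat percentage reduction to a per-pass reasoning budget and sums it across specialist passes. The 30% figure is Moonshot’s claim rather than something measured here, and the per-pass token count and pass count are hypothetical inputs you would replace with numbers from your own runs.

```python
CLAIMED_REDUCTION_PCT = 30  # Moonshot's rough claim versus K2.6; not measured here


def reasoning_tokens_after(baseline_tokens: int, reduction_pct: int = CLAIMED_REDUCTION_PCT) -> int:
    """Apply a flat percentage reduction to a reasoning-token count (integer math)."""
    if baseline_tokens < 0:
        raise ValueError("baseline_tokens must be non-negative")
    if not 0 <= reduction_pct <= 100:
        raise ValueError("reduction_pct must be between 0 and 100")
    return baseline_tokens * (100 - reduction_pct) // 100


def tokens_saved_per_run(baseline_per_pass: int, passes: int,
                         reduction_pct: int = CLAIMED_REDUCTION_PCT) -> int:
    """Total reasoning tokens saved across all specialist passes in one run."""
    if passes < 0:
        raise ValueError("passes must be non-negative")
    reduced = reasoning_tokens_after(baseline_per_pass, reduction_pct)
    return (baseline_per_pass - reduced) * passes


# Hypothetical run: 20k reasoning tokens per pass on K2.6, five specialist passes.
print(reasoning_tokens_after(20_000))
print(tokens_saved_per_run(20_000, 5))
```

Real savings will vary by task, so treat this as a way to size the claim, not to confirm it.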
PART FOUR — MCP ATLAS: ONE BENCHMARK, MANY VENDORS
Teams often ask how K2.7 compares to MiniMax M3, Qwen3.7 Max, Gemini 3.5 Flash, or the Opus 4.7 generation on tool use. Moonshot’s six-benchmark table only includes GPT-5.5 and Opus 4.8. MCP Atlas is the cleanest extension point: multiple vendors published Scale MCP Atlas public-set scores in May–June 2026 using the same benchmark name. We do not mix in SWE-Bench Pro or Terminal-Bench rows here — those are different contests with different harnesses.
- Gemini 3.5 Flash: 83.6%
- Claude Opus 4.8: 82.2%
- Claude Opus 4.7: 79.1%
- Gemini 3.1 Pro: 78.2%
- Qwen3.7 Max: 76.4%
- Kimi K2.7 Code: 76.0%
- GPT-5.5: 75.3%
- MiniMax M3: 74.2%
- GLM-5.1: 71.8%
- Kimi K2.6: 69.4%
GPT-5.4 does not appear in the MCP Atlas tables we tracked for this essay. Opus 4.7 and 4.8 scores are from vendor pages cited in our June model refresh — not re-run by Critique.
PART FIVE — WHERE K2.7 EARNS A SLOT IN CRITIQUE
Critique’s review pipeline does not change shape when a new Moonshot SKU arrives. Webhooks still become runs; the runtime adapter still resolves OpenRouter headers; the lead model still reconciles specialist outputs. What changes is which Moonshot ID you select in policy: `moonshotai/kimi-k2.7-code` for coding-heavy, tool-shaped, long-horizon diffs where reasoning efficiency and MCP-style tool orchestration dominate; `moonshotai/kimi-k2.6` when you want the broader multimodal flagship at 4 credits. Remedy chat lists both at their respective floors so fix loops can stay on the same family without a separate integration.
Practically, reach for K2.7 Code when your pain looks like the benchmarks it wins on: multi-file backend work, infrastructure and performance engineering, multilingual systems code, and agentic tool chains that resemble MCP Atlas or MCP Mark Verified more than a single-turn lint pass. Stay on K2.6 — or a cheaper specialist — when the PR is small, latency-sensitive, or purely presentational. Public benchmarks orient routing; your monorepo still wins the argument.
1. Is the PR dominated by long-horizon coding or tool orchestration? Try K2.7 Code at 4.5 credits; its gains concentrate on Moonshot’s coding and MCP suites.
2. Do you need video or broad multimodal synthesis at the lowest Moonshot credit floor? Keep Kimi K2.6 at 4 credits; it remains the generalist K2 slot.
3. Are you optimizing reasoning-token spend on agentic review? Benchmark K2.7 against K2.6 on your repo; Moonshot claims ~30% fewer reasoning tokens on the same coding tasks.
4. Need frontier closed-model depth on hard repo-wide refactors? Between them, Opus 4.8 and GPT-5.5 still lead every row of Moonshot’s table. Use them when credits allow, not by default.
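The checklist above can be sketched as a small routing function. This is not Critique’s policy engine: the `PullRequestProfile` fields, the function name, and the order in which the checks run are assumptions made for illustration. Only the two model IDs come from this article.

```python
from dataclasses import dataclass

K26 = "moonshotai/kimi-k2.6"
K27_CODE = "moonshotai/kimi-k2.7-code"


@dataclass(frozen=True)
class PullRequestProfile:
    """Hypothetical flags a team might derive from a PR before routing."""
    long_horizon_or_tool_heavy: bool = False
    needs_video_or_broad_multimodal: bool = False
    small_or_latency_sensitive: bool = False


def pick_moonshot_model(pr: PullRequestProfile) -> str:
    """Choose between the two Moonshot IDs; the check order is an assumption."""
    # Multimodal needs or small, latency-sensitive PRs stay on the cheaper generalist.
    if pr.needs_video_or_broad_multimodal or pr.small_or_latency_sensitive:
        return K26
    # Coding-heavy, tool-shaped, long-horizon diffs go to the coding lane.
    if pr.long_horizon_or_tool_heavy:
        return K27_CODE
    # K2.6 remains the default Moonshot slot for balanced review.
    return K26


print(pick_moonshot_model(PullRequestProfile(long_horizon_or_tool_heavy=True)))
```

A real policy would also weigh the results of benchmarking both models on your own repository, as the third question suggests.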