Model update · 24 min read · Critique

Kimi K2.7 Code Lands in Critique: Open-Source Coding at 4.5 Credits, With Moonshot’s Full Benchmark Table

Moonshot’s coding-specialized K2.7 release improves Kimi Code Bench v2, Program Bench, and MLS Bench Lite over K2.6 while cutting reasoning-token usage. Plus: how Critique routes `moonshotai/kimi-k2.7-code` at 4.5 credits alongside Kimi K2.6 at 4.

+21.8%
Kimi Code Bench v2, relative gain vs K2.6 (Moonshot)
−30%
Reasoning-token usage vs K2.6 (Moonshot claim)
4.5 cr
K2.7 Code review floor on Critique
256K
Context class — same deployment family as K2.6

June 2026’s Kimi story is narrower than April’s K2.6 launch — and that is the point. Moonshot did not ship another generalist refresh. Kimi K2.7 Code is a coding-first fork of the K2 MoE stack: ~1T total parameters, 32B activated per token, native multimodal inputs (text, image, video on the official API), and forced thinking with `preserve_thinking` so agent threads keep full reasoning content across turns. Weights and code are on Hugging Face under a modified MIT license; OpenRouter lists `moonshotai/kimi-k2.7-code` at $0.95/M input and $4.00/M output. Critique’s job is the same as always: wire the model through the existing OpenRouter-shaped adapter, price it honestly on the credit ladder, and give you charts that compare apples to apples — not a grab bag of unrelated leaderboard rows.

Critique credits — Moonshot lane

K2.7 Code costs half a credit more than K2.6

Both models share the same review and Remedy surfaces. Pick K2.7 when long-horizon coding depth and tool efficiency matter; keep K2.6 when you want the general multimodal flagship at the lower floor.

Capability tier
Kimi K2.7 Code

Coding-specialized lane: forced thinking, preserve_thinking, long-horizon agent tasks.

Critique floor: 4.5 cr / run
MoonshotAI API input / 1M: $0.95
MoonshotAI API output / 1M: $4.00
Throughput tier
Kimi K2.6

General multimodal K2 flagship — still the default Moonshot slot for balanced review.

Critique floor: 4 cr / run
MoonshotAI API input / 1M: $0.75
MoonshotAI API output / 1M: $3.75

Credit floors are per agent step on paid review and Remedy. Token burn still scales with PR size, specialist fan-out, and reasoning length — K2.7’s efficiency claims target that variable cost.
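
To make the variable-cost side concrete, here is a minimal sketch that turns the per-million list prices from the cards above into a provider-side dollar figure for one call. It covers token pricing only, not Critique credits; the `TokenPrice` class, the `token_cost_usd` helper, and the sample token counts are illustrative assumptions, not part of any Critique or Moonshot API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenPrice:
    """USD per one million tokens, as quoted in this article."""
    input_per_m: float
    output_per_m: float


# List prices copied from the pricing cards; verify against the provider page.
K2_7_CODE = TokenPrice(input_per_m=0.95, output_per_m=4.00)
K2_6 = TokenPrice(input_per_m=0.75, output_per_m=3.75)


def token_cost_usd(price: TokenPrice, input_tokens: int, output_tokens: int) -> float:
    """Token cost for one call; reasoning tokens are counted as output tokens."""
    return (input_tokens * price.input_per_m
            + output_tokens * price.output_per_m) / 1_000_000


# Hypothetical review step: 120k prompt tokens, 20k output tokens.
for name, price in (("K2.7 Code", K2_7_CODE), ("K2.6", K2_6)):
    print(f"{name}: ${token_cost_usd(price, 120_000, 20_000):.4f}")
```

At identical token counts K2.7 Code is the more expensive call, so any saving has to come from emitting fewer output tokens.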

Moonshot describes K2.7 Code as built on Kimi K2.6’s backbone with deployment reuse: same vLLM / SGLang / KTransformers paths, same `transformers` pin (`>=4.57.1, <5.0.0`), and the same 256K / 262,144-token evaluation context. What changes is optimization for software engineering workflows — instruction following on long tasks, end-to-end completion rates, and less “overthinking” in the reasoning trace. The model always runs in thinking mode; instant mode is not supported. Video input remains experimental outside Moonshot’s official API even though the weights are multimodal.
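
For teams self-hosting the weights, the pin and the context figure can be checked mechanically. The sketch below is a stdlib-only approximation that handles plain `X.Y.Z` release strings; the helper names are hypothetical, and pre-release versions such as release candidates are deliberately rejected rather than compared.

```python
CONTEXT_TOKENS = 256 * 1024  # 262,144, the evaluation context quoted above


def parse_release(version: str) -> tuple:
    """Parse a plain 'X.Y.Z' release string into a 3-tuple of ints.

    Raises ValueError for pre-release or otherwise non-numeric versions.
    """
    parts = [int(p) for p in version.split(".")]
    if not 1 <= len(parts) <= 3:
        raise ValueError(f"unsupported version string: {version!r}")
    parts += [0] * (3 - len(parts))  # pad so '5.0' compares like '5.0.0'
    return tuple(parts)


def satisfies_transformers_pin(version: str) -> bool:
    """True if the version falls inside the quoted pin >=4.57.1, <5.0.0."""
    return (4, 57, 1) <= parse_release(version) < (5, 0, 0)


print(satisfies_transformers_pin("4.57.1"))  # lower bound is inclusive
print(satisfies_transformers_pin("5.0.0"))   # upper bound is exclusive
```

In a real environment, `importlib.metadata.version("transformers")` returns the installed version string, and the `packaging` library’s `SpecifierSet` is the more robust way to evaluate the pin, including pre-releases.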

1T / 32B
Total / activated MoE parameters
256K
Quoted context (262,144 in eval runs)
MIT*
Modified MIT license on HF weights
Jun 2026
Open-source release on Hugging Face

The grouped charts below reproduce Moonshot’s published table. K2.7 Code improves every row versus K2.6: +11.1 points on Kimi Code Bench v2 (50.9 → 62.0, the +21.8% relative gain in the headline), +5.3 on Program Bench, +8.4 on MLS Bench Lite, +4.0 on Kimi Claw 24/7, +6.6 on MCP Atlas, and +8.3 on MCP Mark Verified. The closed frontier models on the same page still hold the top score in every row: GPT-5.5 leads Kimi Code Bench v2, Program Bench, Kimi Claw 24/7, and MCP Mark Verified; Opus 4.8 leads MLS Bench Lite and MCP Atlas. K2.7 does pass Opus 4.8 on MCP Mark Verified (81.1 vs 76.4) and nearly ties GPT-5.5 on MLS Bench Lite (35.1 vs 35.5), the multi-language systems-programming lane Moonshot highlights for Rust, Go, and Python.

Coding

Coding benchmarks — Moonshot harness
Kimi Code Bench v2* and Kimi Claw 24/7* are Moonshot in-house suites. Higher is better. Source: moonshotai/Kimi-K2.7-Code model card.
Series: Kimi K2.7 Code, Kimi K2.6, GPT-5.5 (xhigh), Opus 4.8 (xhigh).
K2.6/K2.7: Kimi Code CLI, thinking on, temp 1.0, top-p 0.95. GPT-5.5: Codex xhigh. Opus 4.8: Claude Code xhigh.

Agents

Agentic benchmarks — Moonshot harness
MCP Atlas uses the official 100 tool-call budget configuration. MCP Mark Verified is Moonshot’s human-verified MCPMark edition.
Series: Kimi K2.7 Code, Kimi K2.6, GPT-5.5 (xhigh), Opus 4.8 (xhigh).
Kimi Claw 24/7 spans 17 professional scenarios across 610 evaluation points via OpenClaw. MCP Mark Verified covers Notion, GitHub, Filesystem, Postgres, and Playwright servers.
Full Moonshot comparison table (percent scores)

Exactly as printed on the Kimi K2.7 Code Hugging Face card — June 2026.

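| Benchmark | Kimi K2.6 | Kimi K2.7 Code | GPT-5.5 | Opus 4.8 |
| --- | --- | --- | --- | --- |
| Kimi Code Bench v2* | 50.9 | 62.0 | 69.0 | 67.4 |
| Program Bench | 48.3 | 53.6 | 69.1 | 63.8 |
| MLS Bench Lite | 26.7 | 35.1 | 35.5 | 42.8 |
| Kimi Claw 24/7 Bench* | 42.9 | 46.9 | 52.8 | 50.4 |
| MCP Atlas | 69.4 | 76.0 | 79.4 | 81.3 |
| MCP Mark Verified | 72.8 | 81.1 | 92.9 | 76.4 |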

* Moonshot in-house benchmark.
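
The point deltas and the headline relative gain can be recomputed directly from the table. This is a small verification sketch: the scores are copied from the rows above, while the `summarize` helper and its output format are illustrative choices.

```python
# Scores copied from the comparison table: (K2.6, K2.7 Code, GPT-5.5, Opus 4.8).
MODELS = ("Kimi K2.6", "Kimi K2.7 Code", "GPT-5.5", "Opus 4.8")
SCORES = {
    "Kimi Code Bench v2": (50.9, 62.0, 69.0, 67.4),
    "Program Bench": (48.3, 53.6, 69.1, 63.8),
    "MLS Bench Lite": (26.7, 35.1, 35.5, 42.8),
    "Kimi Claw 24/7 Bench": (42.9, 46.9, 52.8, 50.4),
    "MCP Atlas": (69.4, 76.0, 79.4, 81.3),
    "MCP Mark Verified": (72.8, 81.1, 92.9, 76.4),
}


def summarize(row):
    """Return (point gain, relative gain in %, row leader) for one benchmark."""
    k26, k27 = row[0], row[1]
    points = round(k27 - k26, 1)
    relative = round(100 * (k27 - k26) / k26, 1)
    leader = MODELS[row.index(max(row))]
    return points, relative, leader


for name, row in SCORES.items():
    points, relative, leader = summarize(row)
    print(f"{name:22s} +{points:.1f} pts  (+{relative:.1f}% relative)  leader: {leader}")
```

Running it shows why the two headline numbers differ: the Kimi Code Bench v2 gain is 11.1 points in absolute terms and 21.8% relative to the K2.6 baseline.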

Moonshot’s second headline is cost-shaped: roughly 30% lower reasoning-token usage versus K2.6 on equivalent tasks, with less overthinking in the trace. The scatter chart plots Moonshot’s published efficiency story on the three coding benchmarks — performance (vertical) against reasoning tokens in thousands (horizontal). Up and left is better: K2.7 moves up on score while shifting left on tokens for Kimi Code Bench v2 and Program Bench. That pattern matters for agentic PR review where reasoning tokens are billed output and compound across specialist passes.
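
Because K2.7 Code also carries higher per-token list prices, a reasoning-token cut only lowers the bill once a task reasons enough. The sketch below solves for that break-even point under simplifying assumptions: input and final-answer tokens are identical on both models, only reasoning tokens shrink, and the 30% figure is Moonshot’s claim rather than a measured value. The function name and parameters are hypothetical.

```python
def breakeven_reasoning_tokens(
    input_tokens: float,
    answer_tokens: float,
    reduction: float = 0.30,   # claimed cut in reasoning tokens (vendor-reported)
    old_in: float = 0.75,      # K2.6 input, USD per 1M tokens
    old_out: float = 3.75,     # K2.6 output, USD per 1M tokens
    new_in: float = 0.95,      # K2.7 Code input, USD per 1M tokens
    new_out: float = 4.00,     # K2.7 Code output, USD per 1M tokens
) -> float:
    """K2.6 reasoning-token count above which K2.7 Code is the cheaper call.

    Derived from: new_in*I + new_out*((1 - r)*R + A) < old_in*I + old_out*(R + A).
    """
    saved_per_reasoning_token = old_out - new_out * (1.0 - reduction)
    if saved_per_reasoning_token <= 0:
        return float("inf")  # the reduction never offsets the higher output price
    extra_fixed = ((new_in - old_in) * input_tokens
                   + (new_out - old_out) * answer_tokens)
    return max(0.0, extra_fixed / saved_per_reasoning_token)


# Hypothetical call: 120k prompt tokens and a 4k-token final answer.
print(round(breakeven_reasoning_tokens(120_000, 4_000)))
```

For that hypothetical call the threshold is roughly 26,300 K2.6 reasoning tokens; shorter traces stay cheaper on K2.6 at list prices, which is one reason to measure on real PR distributions before switching.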

Performance vs reasoning tokens — K2.7 Code vs K2.6
Moonshot launch materials: coding benchmarks with thinking mode enabled. Token counts are average reasoning tokens per task (thousands).
Series: Kimi K2.6, Kimi K2.7 Code. Arrows in Moonshot materials: up-left = higher score, fewer reasoning tokens.
Token coordinates are from Moonshot’s K2.7 Code efficiency chart in launch materials. Treat as vendor-reported; validate on your own PR distributions before changing production routing.

Teams often ask how K2.7 compares to MiniMax M3, Qwen3.7 Max, Gemini 3.5 Flash, or the Opus 4.7 generation on tool use. Moonshot’s six-benchmark table only includes GPT-5.5 and Opus 4.8. MCP Atlas is the cleanest extension point: multiple vendors published Scale MCP Atlas public-set scores in May–June 2026 using the same benchmark name. We do not mix in SWE-Bench Pro or Terminal-Bench rows here — those are different contests with different harnesses.

MCP Atlas — cross-vendor (same benchmark only)
Scale MCP Atlas public set. Vendor tables from Google DeepMind (Gemini 3.5 Flash), MiniMax M3 launch, and Critique’s June 2026 model roundup. Kimi rows from Moonshot K2.7 card.
  • Gemini 3.5 Flash: 83.6%
  • Claude Opus 4.8: 82.2%
  • Claude Opus 4.7: 79.1%
  • Gemini 3.1 Pro: 78.2%
  • Qwen3.7 Max: 76.4%
  • Kimi K2.7 Code: 76.0%
  • GPT-5.5: 75.3%
  • MiniMax M3: 74.2%
  • GLM-5.1: 71.8%
  • Kimi K2.6: 69.4%

GPT-5.4 does not appear in the MCP Atlas tables we tracked for this essay. Opus 4.7 and 4.8 scores are from vendor pages cited in our June model refresh — not re-run by Critique.

Critique’s review pipeline does not change shape when a new Moonshot SKU arrives. Webhooks still become runs; the runtime adapter still resolves OpenRouter headers; the lead model still reconciles specialist outputs. What changes is which Moonshot ID you select in policy: `moonshotai/kimi-k2.7-code` for coding-heavy, tool-shaped, long-horizon diffs where reasoning efficiency and MCP-style tool orchestration dominate; `moonshotai/kimi-k2.6` when you want the broader multimodal flagship at 4 credits. Remedy chat lists both at their respective floors so fix loops can stay on the same family without a separate integration.

Practically, reach for K2.7 Code when your pain looks like the benchmarks it wins on: multi-file backend work, infrastructure and performance engineering, multilingual systems code, and agentic tool chains that resemble MCP Atlas or MCP Mark Verified more than a single-turn lint pass. Stay on K2.6 — or a cheaper specialist — when the PR is small, latency-sensitive, or purely presentational. Public benchmarks orient routing; your monorepo still wins the argument.

Quick routing checklist
  1. Is the PR dominated by long-horizon coding or tool orchestration?
     Try K2.7 Code at 4.5 credits; its gains concentrate on Moonshot’s coding and MCP suites.
  2. Do you need video or broad multimodal synthesis at the lowest Moonshot credit floor?
     Keep Kimi K2.6 at 4 credits; it remains the generalist K2 slot.
  3. Are you optimizing reasoning-token spend on agentic review?
     Benchmark K2.7 against K2.6 on your repo; Moonshot claims ~30% fewer reasoning tokens on the same coding tasks.
  4. Need frontier closed-model depth on hard repo-wide refactors?
     Opus 4.8 and GPT-5.5 still lead several Moonshot-table rows; use them when credits allow, not by default.
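
The checklist can be expressed as a small routing function. This is a sketch of one possible policy, not Critique’s actual policy engine: the `PullRequest` fields are hypothetical signals a team would compute for itself, the precedence between rules is an assumption, and the frontier option is a descriptive label rather than a real model ID. Only the two Moonshot IDs come from this article.

```python
from dataclasses import dataclass

K2_7_CODE = "moonshotai/kimi-k2.7-code"
K2_6 = "moonshotai/kimi-k2.6"
# Label only: the article names no routable ID for the closed models.
FRONTIER = "closed frontier model (Opus 4.8 or GPT-5.5)"


@dataclass
class PullRequest:
    """Hypothetical per-PR signals; none of these are Critique fields."""
    long_horizon_or_tool_heavy: bool = False
    needs_video_or_broad_multimodal: bool = False
    reasoning_spend_sensitive: bool = False
    hard_repo_wide_refactor: bool = False
    credits_allow_frontier: bool = False


def pick_model(pr: PullRequest) -> str:
    """Apply the four checklist questions in an assumed order of precedence."""
    if pr.hard_repo_wide_refactor and pr.credits_allow_frontier:
        return FRONTIER          # item 4: only when credits allow
    if pr.needs_video_or_broad_multimodal:
        return K2_6              # item 2: generalist slot at the lower floor
    if pr.long_horizon_or_tool_heavy or pr.reasoning_spend_sensitive:
        return K2_7_CODE         # items 1 and 3: candidate, pending your own benchmark
    return K2_6                  # default Moonshot slot for balanced review


print(pick_model(PullRequest(long_horizon_or_tool_heavy=True)))
```

Item 3 in the checklist asks for a benchmark rather than a blind switch, so a K2.7 Code result from the spend-sensitive branch is best treated as a candidate to validate on your own repo.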