DeepSeek V4 Flash and V4 Pro in Critique: 1M context, EU inference, and open-weights leadership on GDPval-AA
Why DeepSeek’s first new architecture since V3 matters for PR review and Remedy, how V4 Pro tops independent agentic work benchmarks among open weights, and what changes when specialist fallbacks gain a million-token window at 1 and 3 credits.
Most model launches are incremental: a few points on a public leaderboard, a pricing tweak, a longer context line in a table. V4 is different in structure. DeepSeek describes a clean break from the V3 MoE envelope — V4 Pro at roughly 1.6T total parameters with 49B active per forward step, and V4 Flash at 284B total with 13B active — paired with hybrid attention for long sequences and configurable reasoning modes so teams can trade latency against depth without swapping vendors. For a review platform, that combination is the interesting part: you can run a cheap, fast specialist lane on Flash, escalate synthesis to Pro when the PR is messy, and still keep an entire dependency graph or generated trace in window if the repository policy asks for it.
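The Flash-by-default, Pro-on-escalation pattern can be sketched as a tiny routing function. This is an illustrative sketch only: the function name, thresholds, and risk tags are assumptions for the example, not Critique's actual routing API.

```python
def pick_v4_model(files_changed: int, risk_tags: set[str], trace_tokens: int) -> str:
    """Route a review lane to V4 Flash by default, escalating to V4 Pro
    when the PR touches high-risk surfaces or is unusually wide.

    Thresholds and tag names are illustrative, not Critique policy.
    """
    HIGH_RISK = {"auth", "billing", "concurrency", "security"}
    if risk_tags & HIGH_RISK or files_changed > 25:
        return "deepseek/deepseek-v4-pro"    # 3-credit floor: depth lane
    # Both models share the 1M-token-class window, so a huge trace alone
    # (trace_tokens) does not force an escalation to Pro.
    return "deepseek/deepseek-v4-flash"      # 1-credit floor: volume lane
```

The point of the sketch is the shape of the decision: escalation is driven by downside risk, not by prompt size, because the context window is the same across the family.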
Efficiency-first MoE: 284B total, 13B active, hybrid attention, and reasoning modes tuned for throughput. Best when you want open-weights signal across many PRs per day, specialist fan-out, or Remedy loops where latency and burn rate dominate.
openrouter id: deepseek/deepseek-v4-flash
Capability-first MoE: 1.6T total, 49B active, same architectural family scaled for hard reasoning and long-horizon agents. Best when a single mistake is expensive — security-sensitive modules, billing, concurrency, or multi-file coherence that needs flagship open-weight depth.
openrouter id: deepseek/deepseek-v4-pro
PART ONE — WHAT SHIPPED, IN PLAIN TERMS
Both models stay text-in / text-out like V3.2, which keeps Critique’s adapters, policy fields, and sandbox contracts stable. What changes is the capacity envelope: an 8× context expansion versus the old 128K class, a new MoE scale, and hybrid attention aimed at long prompts without pretending that “long” is free — it still shows up in latency, dollars, and output-token volume when reasoning modes dig in.
DeepSeek and third-party analysts also emphasize deployment packaging: Pro is often discussed in FP4-weight form with a large on-disk footprint (on the order of hundreds of GB), while FP8-only paths exist for hardware that does not want the narrowest quantization story. GLM-5.1 and other open flagships have different native precisions; the practical lesson is to compare end-to-end latency and $/useful-output, not a single parameter count in isolation.
How V4 lands on your invoice
Critique's credit floors bundle orchestration, failover, specialist fan-out, and depth multipliers. DeepSeek's first-party API numbers anchor what "cheap" and "deep" mean before our abstraction — cache-hit input rates matter for long-context review where repeated system prompts dominate.
deepseek/deepseek-v4-flash
Default specialist and volume lane: fastest path inside the V4 family on Critique's credit ladder.
deepseek/deepseek-v4-pro
Open-weights flagship slot: GDPval-AA leader among open models in Artificial Analysis's April 2026 snapshot.
DeepSeek API prices cited from Artificial Analysis’s April 2026 launch notes; OpenRouter and regional hosts may differ. Critique credits are not 1:1 with vendor tokens — they meter full review runs (lead + specialists + depth).
PART TWO — GDPval-AA AND TOKEN ECONOMICS
GDPval-AA, published by Artificial Analysis, measures Elo-style strength on agentic evaluations designed to resemble real knowledge work — multi-step tasks with shell access and browsing via their Stirrup harness. That is closer to the messy reality of PR review than a multiple-choice knowledge exam: long horizons, tool-shaped failures, and recoveries that burn output tokens even when the final answer looks short.
Higher is better. Figures from Artificial Analysis’s April 2026 DeepSeek V4 commentary; confidence intervals apply.
High vs Max effort can swap slightly inside the error band; treat ordering as directional. Artificial Analysis also highlights that Flash (High) beats Flash (Max) on GDPval-AA while using fewer output tokens — a reminder that “more reasoning” is not always strictly better.
V4 Pro (Reasoning, Max) vs V3.2 (Reasoning)
Rounded architecture and GDPval-AA figures from public summaries; use them to orient policy, not to skip your own PR replay tests.
The Flash story is equally important for operators: Artificial Analysis reports V4 Flash (Reasoning, High) at 1414 Elo against V3.2 (Reasoning) at 1203 — a 211-point Elo lift — while occupying a smaller activated footprint designed for throughput. That is the profile you want when Critique is fanning out six specialist lanes and you still need merge-grade diligence without paying flagship rent on every file.
Artificial Analysis reports total output-token volume on the benchmark; higher bars are not automatically better scores.
Same source notes Flash (High) scores higher than Flash (Max) while using half the output tokens — an efficiency story, not a simple “more tokens equals more intelligence” curve.
PART THREE — WHERE V4 SITS IN THE FULL GDPval-AA FIELD
Open-weights leadership is the headline for teams that need inspectable weights, on-prem options, or vendor diversification. But it helps to see the adjacent proprietary frontier — the same harness, different models — so you know what you are trading away when you optimize for cost or residency.
Includes top proprietary entries and leading open-weights rows from Artificial Analysis leaderboard imagery (April 2026).
Proprietary models are shown for orientation only; Critique still routes them where policy allows. Open-weights rows are the ones many regulated teams can adopt without locking into a single US API vendor.
Verify against your provider’s live model card before freezing procurement assumptions.
| Line | Total params | Active params | Context |
|---|---|---|---|
| V3 family (V3 / V3.2 / R1 line) | 685B MoE | 37B | 128K class |
| V4 Flash | 284B MoE | 13B | 1M |
| V4 Pro | 1.6T MoE | 49B | 1M |
Third-party comparisons are approximate; exact GB depends on checkpoint packing and KV cache policy.
| Model | Precision story | Notes |
|---|---|---|
| V4 Pro | Often discussed in FP4 weights | ~865GB-class on-disk footprint vs other trillion-scale open models |
| Kimi K2.6 | INT4 ~500GB class | 1T total / 32B active (public card) |
| GLM-5.1 | BF16 native ~1.49TB | Typically served FP8 / FP4 in production |
PART FOUR — HOW CRITIQUE WIRES V4 IN PRODUCTION
We retired deepseek/deepseek-v3.2-speciale from the active catalog and mapped legacy policy IDs to deepseek/deepseek-v4-flash so existing installations keep working without a manual migration. deepseek/deepseek-v4-pro is available everywhere V3.2 was — lead reviewer, specialist lanes, and Remedy execution — at a 3-credit floor that reflects its position against other mid-tier open-weights flagships.
Default specialist fallback chains now step through Flash before Pro before the larger Qwen MoE, which preserves a cheap-first failure mode while still giving the orchestrator a depth lever if an upstream provider errors out. Marketing examples that used to say DeepSeek V3.2 now name V4 Flash when we mean throughput-first open weights, and V4 Pro when we mean maximum open-weights depth on a hard lane.
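The cheap-first chain above can be sketched in a few lines. The client shape, the error type, and the Qwen placeholder ID are assumptions for illustration; only the two DeepSeek IDs come from this post.

```python
FALLBACK_CHAIN = [
    "deepseek/deepseek-v4-flash",  # cheap-first default specialist lane
    "deepseek/deepseek-v4-pro",    # depth lever if Flash's provider errors out
    "qwen-large-moe",              # placeholder for the larger Qwen MoE
]

def run_with_fallback(call, chain=FALLBACK_CHAIN):
    """Try each model in order; re-raise the last error if all fail.

    `call` is any callable that takes a model ID and either returns a
    result or raises; RuntimeError stands in for provider-side failures.
    """
    last_err = None
    for model in chain:
        try:
            return call(model)
        except RuntimeError as err:
            last_err = err
    raise last_err
```

The design point is that the failure mode stays cheap: depth is only purchased when the cheaper lane actually errors, not preemptively.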
- Million-token context for huge traces, generated trees, and wide file packs without hand-truncation hacks
- Two credit floors (1 and 3) instead of one mid-tier slot — clearer cost–risk mapping in policy
- Stronger GDPval-AA agentic scores than V3.2 across both Flash and Pro, per Artificial Analysis
- EU-region routing path for teams that enable European-hosted inference in workspace settings
- Reasoning “Max” can spend more tokens without always improving score — meter depth like you meter model tier
- Text-only modality: attach vision lanes (GLM-5V, Gemini, MiMo) when the PR is image-backed
- Provider variance in quantization and cache hits — watch your own latency histograms, not just marketing tables
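"Watch your own latency histograms" can be as simple as logging per-request wall-clock seconds and summarizing percentiles per model. A minimal sketch, assuming you collect the samples yourself; nothing here is a Critique or provider API.

```python
def latency_summary(samples_s: list[float]) -> dict[str, float]:
    """Nearest-rank percentile summary over recorded request latencies
    (seconds). Crude on purpose: enough to spot provider variance
    between quantization or cache-hit regimes.
    """
    xs = sorted(samples_s)

    def pct(p: float) -> float:
        # Nearest-rank index into the sorted samples.
        return xs[min(len(xs) - 1, int(p * (len(xs) - 1)))]

    return {"p50": pct(0.50), "p95": pct(0.95), "max": xs[-1]}
```

Comparing these per-model, per-provider summaries over a week of real PR traffic tells you more than any launch-day marketing table.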
1. You are optimizing for PR volume and Remedy cost per merged fix. Start with V4 Flash at 1 credit — especially for specialists, re-review loops, and high-frequency repositories.
2. The change touches auth, billing, concurrency, or cross-cutting architecture and you want open weights. Promote the lead or key specialists to V4 Pro at 3 credits so the model budget matches the downside risk.
3. You still have policies pinned to the old V3.2 ID. No action required for continuity: aliases resolve to V4 Flash, but we recommend explicitly selecting Flash or Pro in the dashboard so intent is obvious to your team.
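The alias continuity for pinned policies amounts to a one-entry lookup. The model IDs are the ones from this post; the resolver function itself is an illustrative sketch, not Critique's internal code.

```python
# Legacy policy IDs resolve to the new default specialist lane.
LEGACY_ALIASES = {
    "deepseek/deepseek-v3.2-speciale": "deepseek/deepseek-v4-flash",
}

def resolve_model_id(policy_id: str) -> str:
    """Return the active model for a policy ID, passing through
    anything that is not a retired alias."""
    return LEGACY_ALIASES.get(policy_id, policy_id)
```

Explicitly writing the V4 ID into your policy, rather than relying on the alias, makes the Flash-vs-Pro intent visible in code review.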
V3, V3.1, V3.2, and the R1 variants iterated inside the same broad footprint — enormous impact per dollar, but the context and agentic ceiling was known.
New total/active scale, hybrid attention, 1M context class, and GDPval-AA numbers that reposition DeepSeek in the open-weights race.
Runtime catalog, Remedy picker, specialist fallbacks, and docs updated; v3.2-speciale maps to v4-flash for continuity.
PART FIVE — EU HOSTING, LATENCY, AND THE 1M WINDOW
Teams ask two questions immediately after a launch: where does inference run, and can we trust the context number in production? For Critique customers who enable EU-region routing, DeepSeek V4 is served from European-hosted inference partners so residency expectations map to a concrete control. Latency remains a function of prompt size, tool traffic, and reasoning mode; Flash is the honest choice when you need snappy specialist turns, while Pro is the honest choice when a single under-baked review is unacceptable.
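Mapping the residency expectation to "a concrete control" might look like the following workspace-policy fragment. Every field name here is an assumption for illustration; consult the actual dashboard settings rather than this sketch.

```python
# Hypothetical workspace-policy fragment: field names are illustrative,
# not Critique's real settings schema.
workspace_policy = {
    "region": "eu",                           # European-hosted inference partners
    "lead_model": "deepseek/deepseek-v4-pro",      # depth where mistakes are costly
    "specialist_model": "deepseek/deepseek-v4-flash",  # snappy specialist turns
    "max_context_tokens": 1_000_000,          # 1M-class window, still metered
}
```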
V4 Flash:
- High-throughput specialists
- Cheap Remedy iterations
- Huge prompts without flagship $
- Fast feedback on noisy repos

Both models:
- 1M context class
- Text-only contract
- Configurable reasoning depth
- MIT weights story

V4 Pro:
- Frontier open-weights reasoning
- Hard multi-file synthesis
- GDPval-AA leadership (open)
- When downside risk dominates