DeepSeek V4 Flash and V4 Pro in Critique: 1M context, EU inference, and open-weights leadership on GDPval-AA
Why DeepSeek’s first new architecture since V3 matters for PR review and Remedy, how V4 Pro tops independent agentic work benchmarks among open weights, and what changes when specialist fallbacks gain a million-token window at 1 and 3 credits.
Most model launches are incremental: a few points on a public leaderboard, a pricing tweak, a longer context line in a table. V4 is different in structure. DeepSeek describes a clean break from the V3 MoE envelope — V4 Pro at roughly 1.6T total parameters with 49B active per forward step, and V4 Flash at 284B total with 13B active — paired with hybrid attention for long sequences and configurable reasoning modes so teams can trade latency against depth without swapping vendors. For a review platform, that combination is the interesting part: you can run a cheap, fast specialist lane on Flash, escalate synthesis to Pro when the PR is messy, and still keep an entire dependency graph or generated trace in window if the repository policy asks for it.
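The Flash-by-default, Pro-on-escalation pattern can be sketched as a tiny routing function. This is an illustrative sketch only: the function name, thresholds, and risk tags are assumptions for the example, not Critique's actual routing API.

```python
def pick_v4_model(files_changed: int, risk_tags: set[str], trace_tokens: int) -> str:
    """Route a review lane to V4 Flash by default, escalating to V4 Pro
    when the PR touches high-risk surfaces or is unusually wide.

    Thresholds and tag names are illustrative, not Critique policy.
    """
    HIGH_RISK = {"auth", "billing", "concurrency", "security"}
    if risk_tags & HIGH_RISK or files_changed > 25:
        return "deepseek/deepseek-v4-pro"    # 3-credit floor: depth lane
    # Both models share the 1M-token-class window, so a huge trace alone
    # (trace_tokens) does not force an escalation to Pro.
    return "deepseek/deepseek-v4-flash"      # 1-credit floor: volume lane
```

The point of the sketch is the shape of the decision: escalation is driven by downside risk, not by prompt size, because the context window is the same across the family.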
Efficiency-first MoE: 284B total, 13B active, hybrid attention, and reasoning modes tuned for throughput. Best when you want open-weights signal across many PRs per day, specialist fan-out, or Remedy loops where latency and burn rate dominate.
openrouter id: deepseek/deepseek-v4-flash
Capability-first MoE: 1.6T total, 49B active, same architectural family scaled for hard reasoning and long-horizon agents. Best when a single mistake is expensive — security-sensitive modules, billing, concurrency, or multi-file coherence that needs flagship open-weight depth.
openrouter id: deepseek/deepseek-v4-pro
PART ONE — WHAT SHIPPED, IN PLAIN TERMS
Both models stay text-in / text-out like V3.2, which keeps Critique’s adapters, policy fields, and sandbox contracts stable. What changes is the capacity envelope: an 8× context expansion versus the old 128K class, a new MoE scale, and hybrid attention aimed at long prompts without pretending that “long” is free — it still shows up in latency, dollars, and output-token volume when reasoning modes dig in.
DeepSeek and third-party analysts also emphasize deployment packaging: Pro is often discussed in FP4-weight form with a large on-disk footprint (on the order of hundreds of GB), while FP8-only paths exist for hardware that does not want the narrowest quantization story. GLM-5.1 and other open flagships have different native precisions; the practical lesson is to compare end-to-end latency and $/useful-output, not a single parameter count in isolation.
How V4 lands on your invoice
Critique's credit floors bundle orchestration, failover, specialist fan-out, and depth multipliers. DeepSeek's first-party API numbers anchor what "cheap" and "deep" mean before our abstraction — cache-hit input rates matter for long-context review where repeated system prompts dominate.
deepseek/deepseek-v4-flash
Default specialist and volume lane: fastest path inside the V4 family on Critique's credit ladder.
deepseek/deepseek-v4-pro
Open-weights flagship slot: GDPval-AA leader among open models in Artificial Analysis's April 2026 snapshot.
DeepSeek API prices cited from Artificial Analysis’s April 2026 launch notes; OpenRouter and regional hosts may differ. Critique credits are not 1:1 with vendor tokens — they meter full review runs (lead + specialists + depth).
PART TWO — GDPval-AA AND TOKEN ECONOMICS
GDPval-AA, published by Artificial Analysis, measures Elo-style strength on agentic evaluations designed to resemble real knowledge work — multi-step tasks with shell access and browsing via their Stirrup harness. That is closer to the messy reality of PR review than a multiple-choice knowledge exam: long horizons, tool-shaped failures, and recoveries that burn output tokens even when the final answer looks short.
Higher is better. Figures from Artificial Analysis’s April 2026 DeepSeek V4 commentary; confidence intervals apply.
High vs Max effort can swap slightly inside the error band; treat ordering as directional. Artificial Analysis also highlights that Flash (High) beats Flash (Max) on GDPval-AA while using fewer output tokens — a reminder that “more reasoning” is not always strictly better.
V4 Pro (Reasoning, Max) vs V3.2 (Reasoning)
Rounded architecture and GDPval-AA figures from public summaries; use them to orient policy, not to skip your own PR replay tests.
The Flash story is equally important for operators: Artificial Analysis reports V4 Flash (Reasoning, High) at 1414 Elo against V3.2 (Reasoning) at 1203 — a 211-point Elo lift — while occupying a smaller activated footprint designed for throughput. That is the profile you want when Critique is fanning out six specialist lanes and you still need merge-grade diligence without paying flagship rent on every file.
Artificial Analysis reports total output-token volume on the benchmark; higher bars are not automatically better scores.
Same source notes Flash (High) scores higher than Flash (Max) while using half the output tokens — an efficiency story, not a simple “more tokens equals more intelligence” curve.
PART THREE — WHERE V4 SITS IN THE FULL GDPval-AA FIELD
Open-weights leadership is the headline for teams that need inspectable weights, on-prem options, or vendor diversification. But it helps to see the adjacent proprietary frontier — the same harness, different models — so you know what you are trading away when you optimize for cost or residency.
Includes top proprietary entries and leading open-weights rows from Artificial Analysis leaderboard imagery (April 2026).
Proprietary models are shown for orientation only; Critique still routes them where policy allows. Open-weights rows are the ones many regulated teams can adopt without locking into a single US API vendor.
Verify against your provider’s live model card before freezing procurement assumptions.
| Line | Total params | Active params | Context |
|---|---|---|---|
| V3 family (V3 / V3.2 / R1 line) | 685B MoE | 37B | 128K class |
| V4 Flash | 284B MoE | 13B | 1M |
| V4 Pro | 1.6T MoE | 49B | 1M |
Third-party comparisons are approximate; exact GB depends on checkpoint packing and KV cache policy.
| Model | Precision story | Notes |
|---|---|---|
| V4 Pro | Often discussed in FP4 weights | ~865GB-class on-disk footprint vs other trillion-scale open models |
| Kimi K2.6 | INT4 ~500GB class | 1T total / 32B active (public card) |
| GLM-5.1 | BF16 native ~1.49TB | Typically served FP8 / FP4 in production |
PART FOUR — HOW CRITIQUE WIRES V4 IN PRODUCTION
We retired deepseek/deepseek-v3.2-speciale from the active catalog and mapped legacy policy IDs to deepseek/deepseek-v4-flash so existing installations keep working without a manual migration. deepseek/deepseek-v4-pro is available everywhere V3.2 was — lead reviewer, specialist lanes, and Remedy execution — at a 3-credit floor that reflects its position against other mid-tier open-weights flagships.
Default specialist fallback chains now step through Flash before Pro before the larger Qwen MoE, which preserves a cheap-first failure mode while still giving the orchestrator a depth lever if an upstream provider errors out. Marketing examples that used to say DeepSeek V3.2 now name V4 Flash when we mean throughput-first open weights, and V4 Pro when we mean maximum open-weights depth on a hard lane.
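The cheap-first chain above can be sketched in a few lines. The client shape, the error type, and the Qwen placeholder ID are assumptions for illustration; only the two DeepSeek IDs come from this post.

```python
FALLBACK_CHAIN = [
    "deepseek/deepseek-v4-flash",  # cheap-first default specialist lane
    "deepseek/deepseek-v4-pro",    # depth lever if Flash's provider errors out
    "qwen-large-moe",              # placeholder for the larger Qwen MoE
]

def run_with_fallback(call, chain=FALLBACK_CHAIN):
    """Try each model in order; re-raise the last error if all fail.

    `call` is any callable that takes a model ID and either returns a
    result or raises; RuntimeError stands in for provider-side failures.
    """
    last_err = None
    for model in chain:
        try:
            return call(model)
        except RuntimeError as err:
            last_err = err
    raise last_err
```

The design point is that the failure mode stays cheap: depth is only purchased when the cheaper lane actually errors, not preemptively.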
- Million-token context for huge traces, generated trees, and wide file packs without hand-truncation hacks
- Two credit floors (1 and 3) instead of one mid-tier slot — clearer cost–risk mapping in policy
- Stronger GDPval-AA agentic scores than V3.2 across both Flash and Pro, per Artificial Analysis
- EU-region routing path for teams that enable European-hosted inference in workspace settings
- Reasoning “Max” can spend more tokens without always improving score — meter depth like you meter model tier
- Text-only modality: attach vision lanes (GLM-5V, Gemini, MiMo) when the PR is image-backed
- Provider variance in quantization and cache hits — watch your own latency histograms, not just marketing tables
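"Watch your own latency histograms" can be as simple as logging per-request wall-clock seconds and summarizing percentiles per model. A minimal sketch, assuming you collect the samples yourself; nothing here is a Critique or provider API.

```python
def latency_summary(samples_s: list[float]) -> dict[str, float]:
    """Nearest-rank percentile summary over recorded request latencies
    (seconds). Crude on purpose: enough to spot provider variance
    between quantization or cache-hit regimes.
    """
    xs = sorted(samples_s)

    def pct(p: float) -> float:
        # Nearest-rank index into the sorted samples.
        return xs[min(len(xs) - 1, int(p * (len(xs) - 1)))]

    return {"p50": pct(0.50), "p95": pct(0.95), "max": xs[-1]}
```

Comparing these per-model, per-provider summaries over a week of real PR traffic tells you more than any launch-day marketing table.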
1. You are optimizing for PR volume and Remedy cost per merged fix. Start with V4 Flash at 1 credit — especially for specialists, re-review loops, and high-frequency repositories.
2. The change touches auth, billing, concurrency, or cross-cutting architecture and you want open weights. Promote the lead or key specialists to V4 Pro at 3 credits so the model budget matches the downside risk.
3. You still have policies pinned to the old V3.2 ID. No action required for continuity: aliases resolve to V4 Flash, but we recommend explicitly selecting Flash or Pro in the dashboard so intent is obvious to your team.
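The alias continuity for pinned policies amounts to a one-entry lookup. The model IDs are the ones from this post; the resolver function itself is an illustrative sketch, not Critique's internal code.

```python
# Legacy policy IDs resolve to the new default specialist lane.
LEGACY_ALIASES = {
    "deepseek/deepseek-v3.2-speciale": "deepseek/deepseek-v4-flash",
}

def resolve_model_id(policy_id: str) -> str:
    """Return the active model for a policy ID, passing through
    anything that is not a retired alias."""
    return LEGACY_ALIASES.get(policy_id, policy_id)
```

Explicitly writing the V4 ID into your policy, rather than relying on the alias, makes the Flash-vs-Pro intent visible in code review.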
V3, V3.1, V3.2, and the R1 variants iterated inside the same broad footprint — enormous impact per dollar, but the context and agentic ceiling was known.
New total/active scale, hybrid attention, 1M context class, and GDPval-AA numbers that reposition DeepSeek in the open-weights race.
Runtime catalog, Remedy picker, specialist fallbacks, and docs updated; v3.2-speciale maps to v4-flash for continuity.
PART FIVE — EU HOSTING, LATENCY, AND THE 1M WINDOW
Teams ask two questions immediately after a launch: where does inference run, and can we trust the context number in production? For Critique customers who enable EU-region routing, DeepSeek V4 is served from European-hosted inference partners so residency expectations map to a concrete control. Latency remains a function of prompt size, tool traffic, and reasoning mode; Flash is the honest choice when you need snappy specialist turns, while Pro is the honest choice when a single under-baked review is unacceptable.
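Mapping the residency expectation to "a concrete control" might look like the following workspace-policy fragment. Every field name here is an assumption for illustration; consult the actual dashboard settings rather than this sketch.

```python
# Hypothetical workspace-policy fragment: field names are illustrative,
# not Critique's real settings schema.
workspace_policy = {
    "region": "eu",                           # European-hosted inference partners
    "lead_model": "deepseek/deepseek-v4-pro",      # depth where mistakes are costly
    "specialist_model": "deepseek/deepseek-v4-flash",  # snappy specialist turns
    "max_context_tokens": 1_000_000,          # 1M-class window, still metered
}
```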
V4 Flash:
- High-throughput specialists
- Cheap Remedy iterations
- Huge prompts without flagship $
- Fast feedback on noisy repos

Both models:
- 1M context class
- Text-only contract
- Configurable reasoning depth
- MIT weights story

V4 Pro:
- Frontier open-weights reasoning
- Hard multi-file synthesis
- GDPval-AA leadership (open)
- When downside risk dominates