GPT-5.5 and GPT-5.5 Pro in Critique: Benchmarks, Pricing, and When to Spend the Credits
OpenAI calls GPT-5.5 a new class of intelligence for real work. Here is what the April 2026 launch data, OpenRouter pricing, and GDPval-AA results mean for PR review, specialist agents, and Remedy.
GPT-5.5 is not just another name in the model dropdown. OpenAI describes it as a fully retrained model aimed at agentic coding, computer use, knowledge work, and scientific research. That matters for Critique because pull request review is exactly that kind of workload: the model must read messy context, use tools, compare claims against code, decide what is actually risky, and keep going until the review artifact is useful.
The important product question is not "is GPT-5.5 smart?" The public answer is yes. The practical question is where to route it. Critique already has cheap specialist lanes, long-context open models, and mid-premium OpenAI options like GPT-5.4. GPT-5.5 earns the lead slot when the PR is ambiguous, terminal-heavy, security-sensitive, or likely to need a multi-file mental model. GPT-5.5 Pro earns the escalation slot when accuracy beats cost: auth, payments, infrastructure, incident response, regulatory logic, or a Remedy plan that must be right before it touches code.
How GPT-5.5 lands in the model catalog
Critique credits meter the whole review run: lead synthesis, specialists, depth multipliers, retries, and Remedy handoff. Vendor token prices are still useful because they explain why GPT-5.5 Pro belongs behind an Ultra gate.
- openai/gpt-5.5: Standard-plan frontier OpenAI lead for hard PRs, agentic review, and Remedy runs where GPT-5.4 is not enough.
- openai/gpt-5.5-pro: Ultra-only deep reasoning lane for high-stakes review, legal/business/data work, and long-horizon accuracy.
OpenRouter lists both models with a 1.05M context class. OpenAI lists API pricing at $5/$30 per 1M tokens for GPT-5.5 and $30/$180 for GPT-5.5 Pro. Critique floors are catalog values and may scale with PR depth.
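To make those list prices concrete, here is a minimal sketch of what a single deep-review run costs at the vendor rates quoted above. The token counts are illustrative assumptions, not Critique telemetry, but the 6x ratio between the two models falls straight out of the arithmetic.

```python
# Sketch: estimate vendor token cost for one review run at the list prices
# quoted above ($ per 1M tokens). Token counts below are hypothetical.

PRICES = {
    "openai/gpt-5.5":     {"input": 5.0,  "output": 30.0},
    "openai/gpt-5.5-pro": {"input": 30.0, "output": 180.0},
}

def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return USD cost for one run at list price."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Hypothetical deep-review run: 400K tokens of context in, 20K tokens out.
base = run_cost("openai/gpt-5.5", 400_000, 20_000)      # $2.60
pro = run_cost("openai/gpt-5.5-pro", 400_000, 20_000)   # $15.60
print(f"GPT-5.5: ${base:.2f}  GPT-5.5 Pro: ${pro:.2f}  ratio: {pro/base:.0f}x")
```

The same 6x multiple shows up regardless of the input/output mix, because both rates scale together; that is the arithmetic behind the Ultra gate.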
PART ONE - WHAT OPENAI SHIPPED
OpenAI released GPT-5.5 on April 23, 2026, then updated the launch on April 24 to note API availability and an updated system card. The model is positioned around "real work" rather than chat-only intelligence: coding, debugging, online research, data analysis, documents, spreadsheets, software operation, and tool use across a task until completion.
The most relevant release detail for engineering teams is efficiency. OpenAI says GPT-5.5 matches GPT-5.4 per-token latency in real-world serving while reaching higher intelligence and using significantly fewer tokens on the same Codex tasks. That is the right axis for automated review: long-running agentic work gets expensive not only because the model is pricey, but because failed paths and repeated reasoning burn output tokens.
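The efficiency claim is worth quantifying. If per-token latency and price are held equal, the cost of an agentic run scales with tokens actually burned, so a model that finishes in fewer tokens is cheaper even at the same rate. The token counts below are illustrative assumptions; OpenAI has not published exact figures for these tasks.

```python
# Sketch: why token efficiency matters more than per-token price for
# long-running agentic work. Token counts are hypothetical; both models
# are assumed at the same $30/1M output rate for comparison.

def effective_cost(price_per_m_out: float, output_tokens: int) -> float:
    """USD output cost of a run: rate times tokens actually burned."""
    return price_per_m_out * output_tokens / 1_000_000

# Suppose a GPT-5.4-style run burns 80K output tokens on a Codex task
# (including failed paths and repeated reasoning) and a GPT-5.5-style
# run completes it in 50K.
old = effective_cost(30.0, 80_000)  # $2.40
new = effective_cost(30.0, 50_000)  # $1.50
print(f"same task: ${old:.2f} vs ${new:.2f}")
```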
PART TWO - CODING AND AGENTIC BENCHMARKS
For PR review, the first benchmark cluster to care about is not MMLU-style knowledge. It is coding and tool use: Terminal-Bench 2.0 for command-line workflows, SWE-Bench Pro for real GitHub issue repair, Expert-SWE for long-horizon internal engineering tasks, and MCP Atlas or Toolathlon for structured tool coordination.
Percent scores from OpenAI GPT-5.5 launch tables. Higher is better.
OpenAI notes that external labs have raised memorization concerns about SWE-Bench Pro; treat SWE numbers as directional, and replay your own repository tasks before making routing policy permanent.
Selected OpenAI launch-table rows that map closely to Critique review and Remedy workflows.
| Benchmark | GPT-5.5 | GPT-5.4 | Delta |
|---|---|---|---|
| Terminal-Bench 2.0 | 82.7% | 75.1% | +7.6 pts |
| Expert-SWE (internal) | 73.1% | 68.5% | +4.6 pts |
| OSWorld-Verified | 78.7% | 75.0% | +3.7 pts |
| MCP Atlas | 75.3% | 70.6% | +4.7 pts |
| Tau2-bench Telecom | 98.0% | 92.8% | +5.2 pts |
| CyberGym | 81.8% | 79.0% | +2.8 pts |
Tau2-bench Telecom is not a coding benchmark, but it is useful for judging long tool workflows with state, policy, and customer-service style constraints.
This is why GPT-5.5 is now prominent in Critique rather than hidden as a novelty model. A review lead has to do the unglamorous middle work: inspect command output, reason about blast radius, decide if a specialist finding is real, and write a verdict that a human maintainer can act on. Terminal-Bench and MCP-style evals are imperfect, but they are closer to that job than a static answer benchmark.
PART THREE - GPT-5.5 PRO IS NOT JUST "MORE GPT-5.5"
GPT-5.5 Pro is the expensive lane. OpenAI prices it at six times GPT-5.5 on input and output tokens, and Critique reflects that with an Ultra-only 237-credit floor. The reason to use it is not ordinary code review volume. The reason is high-stakes correctness, especially when the task looks more like a research partner, legal/business analyst, or deep Remedy planner than a fast review pass.
OpenAI launch-table rows where Pro is listed against GPT-5.5 or GPT-5.4 Pro.
Pro does not appear on every coding row in the launch table. Its clearest public advantages are browsing, harder math tiers, GeneBench, investment-banking modeling, and high-accuracy professional work.
GPT-5.5 vs GPT-5.5 Pro inside Critique
The default should not be "always Pro." The default should be risk-proportional routing.
PART FOUR - LONG CONTEXT IS THE QUIET STORY
Both OpenAI and OpenRouter describe GPT-5.5 as a 1M-context-class model, while OpenRouter lists the deployed shape as roughly 922K input tokens plus 128K output tokens. Long context does not remove retrieval. It changes what you can safely put in front of the lead after retrieval has narrowed the field: policy, diff, impacted files, test output, specialist reports, prior review memory, and the dependency trail that explains why a small hunk is dangerous.
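A deployment shape of roughly 922K input plus 128K output tokens suggests a simple pre-flight check before handing the assembled bundle to the lead. The section names and sizes below are illustrative, not Critique's actual prompt layout.

```python
# Sketch: a pre-flight budget check for a long-context lead prompt.
# Limits follow the OpenRouter-listed shape quoted above; the section
# sizes are hypothetical post-retrieval numbers.

INPUT_LIMIT = 922_000
OUTPUT_RESERVE = 128_000  # separate output budget on this deployment

sections = {  # tokens per bundle section (illustrative)
    "policy": 4_000,
    "diff": 35_000,
    "impacted_files": 180_000,
    "test_output": 22_000,
    "specialist_reports": 40_000,
    "review_memory": 12_000,
    "dependency_trail": 15_000,
}

total = sum(sections.values())
assert total <= INPUT_LIMIT, f"prompt too large: {total} > {INPUT_LIMIT}"
print(f"{total} input tokens of {INPUT_LIMIT} ({total / INPUT_LIMIT:.0%})")
```

Even a generous bundle like this uses about a third of the input window, which is the point: retrieval narrows the field, and the window absorbs the narrowed result without truncation games.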
F1 scores from OpenAI long-context rows. Higher is better.
These rows are not PR-review benchmarks, but they are relevant to repo-aware review because they stress retrieval across very long prompt ranges.
PART FIVE - THIRD-PARTY VIEW: GDPVAL-AA
OpenAI reports GDPval as "wins or ties"; Artificial Analysis publishes GDPval-AA as an independent agentic harness with shell and web access via Stirrup, scored with Elo from blind pairwise comparisons. The two are not the same measurement, but together they are useful: OpenAI shows GPT-5.5 at 84.9% GDPval, while Artificial Analysis ranks GPT-5.5 (xhigh) first on GDPval-AA at 1782 Elo.
Artificial Analysis Elo scores for selected top models, April 2026. Higher is better.
Artificial Analysis publishes confidence intervals; treat close rankings as directional rather than absolute. The important fact is that GPT-5.5 occupies the top cluster on an external agentic work benchmark.
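For readers less familiar with Elo-scored evals, here is the standard Elo update that ratings like these are built from. This is a minimal sketch of the general method, not Artificial Analysis's actual harness; the K-factor and match sequence are illustrative.

```python
# Sketch: how Elo ratings emerge from blind pairwise comparisons.
# Standard Elo update; K-factor and match outcomes are illustrative.

def expected(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, a_won: bool, k: float = 16.0):
    """Return new (r_a, r_b) after one blind pairwise comparison."""
    e = expected(r_a, r_b)
    delta = k * ((1.0 if a_won else 0.0) - e)
    return r_a + delta, r_b - delta

# Two models start equal; one wins most blind comparisons and pulls ahead.
a, b = 1500.0, 1500.0
for a_won in [True, True, True, False, True]:
    a, b = update(a, b, a_won)
print(round(a), round(b))
```

This is also why confidence intervals matter: with few comparisons between two close models, a handful of judge decisions can reorder them.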
PART SIX - HOW TO USE GPT-5.5 IN CRITIQUE
1. Is the PR small, local, and low-risk? Use a cheaper lead or specialist stack. GPT-5.5 is usually overkill for copy edits, one-file UI tweaks, and routine dependency bumps.
2. Does the PR require terminal work or reproduction? Prefer GPT-5.5 as lead. Terminal-Bench, OSWorld, MCP Atlas, and Tau2-bench gains point to stronger tool-loop behavior.
3. Could a wrong review miss auth, payments, security, or data-loss risk? Escalate to GPT-5.5, and consider GPT-5.5 Pro on Ultra when the cost of a false negative dwarfs the credit cost.
4. Is Remedy going to make code changes from the result? Use GPT-5.5 for most hard fixes. Reserve GPT-5.5 Pro for critical plans that need extra research, browsing, or high-accuracy reasoning before execution.
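The checklist above can be sketched as a decision function. The PR fields and lane names here are hypothetical, not Critique's actual API; the point is that the questions form an ordered policy, with risk checked before convenience.

```python
# Sketch: the four-question routing checklist as an ordered decision
# function. Field names and lane names are hypothetical.

from dataclasses import dataclass

@dataclass
class PR:
    small: bool = False
    low_risk: bool = False
    needs_terminal_or_repro: bool = False
    touches_auth_payments_or_security: bool = False
    remedy_will_edit_code: bool = False
    critical_plan: bool = False
    ultra_plan: bool = False

def pick_lane(pr: PR) -> str:
    if pr.small and pr.low_risk:
        return "cheap-lead"  # question 1: specialist stack is enough
    if pr.touches_auth_payments_or_security:
        # question 3: escalate; Pro only when the plan allows it
        return "gpt-5.5-pro" if pr.ultra_plan else "gpt-5.5"
    if pr.needs_terminal_or_repro:
        return "gpt-5.5"  # question 2: tool-loop strength
    if pr.remedy_will_edit_code and pr.critical_plan and pr.ultra_plan:
        return "gpt-5.5-pro"  # question 4: critical Remedy plan
    return "gpt-5.5"
```

Notice the ordering: the security question outranks the terminal question, because a false negative there is the expensive failure mode.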
A practical routing matrix for teams updating repository policy.
| Scenario | Best model | Why | Risk if overused |
|---|---|---|---|
| Everyday small PR | GPT-5.4 Mini, Kimi K2.6, DeepSeek V4 Flash | Fast enough signal at lower credit burn | Wasting frontier budget |
| Ambiguous feature PR | GPT-5.5 | Best balance of OpenAI frontier depth and usable cost | Slower than cheap lanes |
| Security-sensitive PR | GPT-5.5 lead plus security specialist | CyberGym and tool-loop gains map to defensive review | False confidence if tests are skipped |
| Critical platform or payment change | GPT-5.5 Pro | Ultra escalation for highest-cost mistakes | 237-credit floor can be excessive for routine work |
| Huge monorepo context | GPT-5.5 or GPT-5.5 Pro after retrieval | 1M context class helps final synthesis | Dumping raw context still creates noise |
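For teams that keep routing policy in the repository, the matrix above reduces to a scenario-to-lane lookup with a safe default. The keys and model slugs here are illustrative, not a Critique configuration schema.

```python
# Sketch: the routing matrix as a policy table a repo config could carry.
# Scenario keys and model slugs are illustrative assumptions.

ROUTING_POLICY = {
    "everyday-small":     ["gpt-5.4-mini", "kimi-k2.6", "deepseek-v4-flash"],
    "ambiguous-feature":  ["gpt-5.5"],
    "security-sensitive": ["gpt-5.5", "security-specialist"],
    "critical-platform":  ["gpt-5.5-pro"],          # Ultra plan only
    "huge-monorepo":      ["gpt-5.5", "gpt-5.5-pro"],  # after retrieval
}

def models_for(scenario: str) -> list[str]:
    # Unknown scenarios fall back to the frontier lead rather than the
    # cheap lane: a miss should fail toward accuracy, not toward savings.
    return ROUTING_POLICY.get(scenario, ["gpt-5.5"])
```

The fallback choice is the one policy decision worth debating: defaulting unknown work to GPT-5.5 burns credits, but defaulting it to the cheap lane silently downgrades exactly the PRs nobody thought to classify.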