GPT-5.5 and GPT-5.5 Pro in Critique: Benchmarks, Pricing, and When to Spend the Credits
OpenAI calls GPT-5.5 a new class of intelligence for real work. Here is what the April 2026 launch data, OpenRouter pricing, and GDPval-AA results mean for PR review, specialist agents, and Remedy.
GPT-5.5 is not just another name in the model dropdown. OpenAI describes it as a fully retrained model aimed at agentic coding, computer use, knowledge work, and scientific research. That matters for Critique because pull request review is exactly that kind of workload: the model must read messy context, use tools, compare claims against code, decide what is actually risky, and keep going until the review artifact is useful.
The important product question is not "is GPT-5.5 smart?" The public answer is yes. The practical question is where to route it. Critique already has cheap specialist lanes, long-context open models, and mid-premium OpenAI options like GPT-5.4. GPT-5.5 earns the lead slot when the PR is ambiguous, terminal-heavy, security-sensitive, or likely to need a multi-file mental model. GPT-5.5 Pro earns the escalation slot when accuracy beats cost: auth, payments, infrastructure, incident response, regulatory logic, or a Remedy plan that must be right before it touches code.
How GPT-5.5 lands in the model catalog
Critique credits meter the whole review run: lead synthesis, specialists, depth multipliers, retries, and Remedy handoff. Vendor token prices are still useful because they explain why GPT-5.5 Pro belongs behind an Ultra gate.
- openai/gpt-5.5: Standard-plan frontier OpenAI lead for hard PRs, agentic review, and Remedy runs where GPT-5.4 is not enough.
- openai/gpt-5.5-pro: Ultra-only deep reasoning lane for high-stakes review, legal/business/data work, and long-horizon accuracy.
OpenRouter lists both models with a 1.05M context class. OpenAI lists API pricing at $5/$30 per 1M tokens for GPT-5.5 and $30/$180 for GPT-5.5 Pro. Critique floors are catalog values and may scale with PR depth.
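To make those list prices concrete, here is a minimal sketch of what a single deep-review run costs at the vendor rates quoted above. The token counts are illustrative assumptions, not Critique telemetry, but the 6x ratio between the two models falls straight out of the arithmetic.

```python
# Sketch: estimate vendor token cost for one review run at the list prices
# quoted above ($ per 1M tokens). Token counts below are hypothetical.

PRICES = {
    "openai/gpt-5.5":     {"input": 5.0,  "output": 30.0},
    "openai/gpt-5.5-pro": {"input": 30.0, "output": 180.0},
}

def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return USD cost for one run at list price."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Hypothetical deep-review run: 400K tokens of context in, 20K tokens out.
base = run_cost("openai/gpt-5.5", 400_000, 20_000)      # $2.60
pro = run_cost("openai/gpt-5.5-pro", 400_000, 20_000)   # $15.60
print(f"GPT-5.5: ${base:.2f}  GPT-5.5 Pro: ${pro:.2f}  ratio: {pro/base:.0f}x")
```

The same 6x multiple shows up regardless of the input/output mix, because both rates scale together; that is the arithmetic behind the Ultra gate.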
PART ONE - WHAT OPENAI SHIPPED
OpenAI released GPT-5.5 on April 23, 2026, then updated the launch on April 24 to note API availability and an updated system card. The model is positioned around "real work" rather than chat-only intelligence: coding, debugging, online research, data analysis, documents, spreadsheets, software operation, and tool use across a task until completion.
The most relevant release detail for engineering teams is efficiency. OpenAI says GPT-5.5 matches GPT-5.4 per-token latency in real-world serving while reaching higher intelligence and using significantly fewer tokens on the same Codex tasks. That is the right axis for automated review: long-running agentic work gets expensive not only because the model is pricey, but because failed paths and repeated reasoning burn output tokens.
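The efficiency claim is worth quantifying. If per-token latency and price are held equal, the cost of an agentic run scales with tokens actually burned, so a model that finishes in fewer tokens is cheaper even at the same rate. The token counts below are illustrative assumptions; OpenAI has not published exact figures for these tasks.

```python
# Sketch: why token efficiency matters more than per-token price for
# long-running agentic work. Token counts are hypothetical; both models
# are assumed at the same $30/1M output rate for comparison.

def effective_cost(price_per_m_out: float, output_tokens: int) -> float:
    """USD output cost of a run: rate times tokens actually burned."""
    return price_per_m_out * output_tokens / 1_000_000

# Suppose a GPT-5.4-style run burns 80K output tokens on a Codex task
# (including failed paths and repeated reasoning) and a GPT-5.5-style
# run completes it in 50K.
old = effective_cost(30.0, 80_000)  # $2.40
new = effective_cost(30.0, 50_000)  # $1.50
print(f"same task: ${old:.2f} vs ${new:.2f}")
```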
PART TWO - CODING AND AGENTIC BENCHMARKS
For PR review, the first benchmark cluster to care about is not MMLU-style knowledge. It is coding and tool use: Terminal-Bench 2.0 for command-line workflows, SWE-Bench Pro for real GitHub issue repair, Expert-SWE for long-horizon internal engineering tasks, and MCP Atlas or Toolathlon for structured tool coordination.
Percent scores from OpenAI GPT-5.5 launch tables. Higher is better.
OpenAI notes that external labs have raised memorization concerns about SWE-Bench Pro; treat SWE numbers as directional, and replay your own repository tasks before making routing policy permanent.
Selected OpenAI launch-table rows that map closely to Critique review and Remedy workflows.
| Benchmark | GPT-5.5 | GPT-5.4 | Delta |
|---|---|---|---|
| Terminal-Bench 2.0 | 82.7% | 75.1% | +7.6 pts |
| Expert-SWE (internal) | 73.1% | 68.5% | +4.6 pts |
| OSWorld-Verified | 78.7% | 75.0% | +3.7 pts |
| MCP Atlas | 75.3% | 70.6% | +4.7 pts |
| Tau2-bench Telecom | 98.0% | 92.8% | +5.2 pts |
| CyberGym | 81.8% | 79.0% | +2.8 pts |
Tau2-bench Telecom is not a coding benchmark, but it is useful for judging long tool workflows with state, policy, and customer-service style constraints.
This is why GPT-5.5 is now prominent in Critique rather than hidden as a novelty model. A review lead has to do the unglamorous middle work: inspect command output, reason about blast radius, decide if a specialist finding is real, and write a verdict that a human maintainer can act on. Terminal-Bench and MCP-style evals are imperfect, but they are closer to that job than a static answer benchmark.
PART THREE - GPT-5.5 PRO IS NOT JUST "MORE GPT-5.5"
GPT-5.5 Pro is the expensive lane. OpenAI prices it at six times GPT-5.5 on input and output tokens, and Critique reflects that with an Ultra-only 237-credit floor. The reason to use it is not ordinary code review volume. The reason is high-stakes correctness, especially when the task looks more like a research partner, legal/business analyst, or deep Remedy planner than a fast review pass.
OpenAI launch-table rows where Pro is listed against GPT-5.5 or GPT-5.4 Pro.
Pro does not appear on every coding row in the launch table. Its clearest public advantages are browsing, harder math tiers, GeneBench, investment-banking modeling, and high-accuracy professional work.
GPT-5.5 vs GPT-5.5 Pro inside Critique
The default should not be "always Pro." The default should be risk-proportional routing.
PART FOUR - LONG CONTEXT IS THE QUIET STORY
Both OpenAI and OpenRouter describe GPT-5.5 as a 1M-context-class model, while OpenRouter lists the deployed shape as roughly 922K input tokens plus 128K output tokens. Long context does not remove retrieval. It changes what you can safely put in front of the lead after retrieval has narrowed the field: policy, diff, impacted files, test output, specialist reports, prior review memory, and the dependency trail that explains why a small hunk is dangerous.
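A deployment shape of roughly 922K input plus 128K output tokens suggests a simple pre-flight check before handing the assembled bundle to the lead. The section names and sizes below are illustrative, not Critique's actual prompt layout.

```python
# Sketch: a pre-flight budget check for a long-context lead prompt.
# Limits follow the OpenRouter-listed shape quoted above; the section
# sizes are hypothetical post-retrieval numbers.

INPUT_LIMIT = 922_000
OUTPUT_RESERVE = 128_000  # separate output budget on this deployment

sections = {  # tokens per bundle section (illustrative)
    "policy": 4_000,
    "diff": 35_000,
    "impacted_files": 180_000,
    "test_output": 22_000,
    "specialist_reports": 40_000,
    "review_memory": 12_000,
    "dependency_trail": 15_000,
}

total = sum(sections.values())
assert total <= INPUT_LIMIT, f"prompt too large: {total} > {INPUT_LIMIT}"
print(f"{total} input tokens of {INPUT_LIMIT} ({total / INPUT_LIMIT:.0%})")
```

Even a generous bundle like this uses about a third of the input window, which is the point: retrieval narrows the field, and the window absorbs the narrowed result without truncation games.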
F1 scores from OpenAI long-context rows. Higher is better.
These rows are not PR-review benchmarks, but they are relevant to repo-aware review because they stress retrieval across very long prompt ranges.
PART FIVE - THIRD-PARTY VIEW: GDPVAL-AA
OpenAI reports GDPval as "wins or ties"; Artificial Analysis publishes GDPval-AA as an independent agentic harness with shell and web access via Stirrup, scored with Elo from blind pairwise comparisons. The two are not the same measurement, but together they are useful: OpenAI shows GPT-5.5 at 84.9% GDPval, while Artificial Analysis ranks GPT-5.5 (xhigh) first on GDPval-AA at 1782 Elo.
Artificial Analysis Elo scores for selected top models, April 2026. Higher is better.
Artificial Analysis publishes confidence intervals; treat close rankings as directional rather than absolute. The important fact is that GPT-5.5 occupies the top cluster on an external agentic work benchmark.
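For readers less familiar with Elo-scored evals, here is the standard Elo update that ratings like these are built from. This is a minimal sketch of the general method, not Artificial Analysis's actual harness; the K-factor and match sequence are illustrative.

```python
# Sketch: how Elo ratings emerge from blind pairwise comparisons.
# Standard Elo update; K-factor and match outcomes are illustrative.

def expected(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, a_won: bool, k: float = 16.0):
    """Return new (r_a, r_b) after one blind pairwise comparison."""
    e = expected(r_a, r_b)
    delta = k * ((1.0 if a_won else 0.0) - e)
    return r_a + delta, r_b - delta

# Two models start equal; one wins most blind comparisons and pulls ahead.
a, b = 1500.0, 1500.0
for a_won in [True, True, True, False, True]:
    a, b = update(a, b, a_won)
print(round(a), round(b))
```

This is also why confidence intervals matter: with few comparisons between two close models, a handful of judge decisions can reorder them.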
PART SIX - HOW TO USE GPT-5.5 IN CRITIQUE
1. Is the PR small, local, and low-risk? Use a cheaper lead or specialist stack. GPT-5.5 is usually overkill for copy edits, one-file UI tweaks, and routine dependency bumps.
2. Does the PR require terminal work or reproduction? Prefer GPT-5.5 as lead. Terminal-Bench, OSWorld, MCP Atlas, and Tau2-bench gains point to stronger tool-loop behavior.
3. Could a wrong review miss auth, payments, security, or data-loss risk? Escalate to GPT-5.5, and consider GPT-5.5 Pro on Ultra when the cost of a false negative dwarfs the credit cost.
4. Is Remedy going to make code changes from the result? Use GPT-5.5 for most hard fixes. Reserve GPT-5.5 Pro for critical plans that need extra research, browsing, or high-accuracy reasoning before execution.
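The checklist above can be sketched as a decision function. The PR fields and lane names here are hypothetical, not Critique's actual API; the point is that the questions form an ordered policy, with risk checked before convenience.

```python
# Sketch: the four-question routing checklist as an ordered decision
# function. Field names and lane names are hypothetical.

from dataclasses import dataclass

@dataclass
class PR:
    small: bool = False
    low_risk: bool = False
    needs_terminal_or_repro: bool = False
    touches_auth_payments_or_security: bool = False
    remedy_will_edit_code: bool = False
    critical_plan: bool = False
    ultra_plan: bool = False

def pick_lane(pr: PR) -> str:
    if pr.small and pr.low_risk:
        return "cheap-lead"  # question 1: specialist stack is enough
    if pr.touches_auth_payments_or_security:
        # question 3: escalate; Pro only when the plan allows it
        return "gpt-5.5-pro" if pr.ultra_plan else "gpt-5.5"
    if pr.needs_terminal_or_repro:
        return "gpt-5.5"  # question 2: tool-loop strength
    if pr.remedy_will_edit_code and pr.critical_plan and pr.ultra_plan:
        return "gpt-5.5-pro"  # question 4: critical Remedy plan
    return "gpt-5.5"
```

Notice the ordering: the security question outranks the terminal question, because a false negative there is the expensive failure mode.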
A practical routing matrix for teams updating repository policy.
| Scenario | Best model | Why | Risk if overused |
|---|---|---|---|
| Everyday small PR | GPT-5.4 Mini, Kimi K2.6, DeepSeek V4 Flash | Fast enough signal at lower credit burn | Wasting frontier budget |
| Ambiguous feature PR | GPT-5.5 | Best balance of OpenAI frontier depth and usable cost | Slower than cheap lanes |
| Security-sensitive PR | GPT-5.5 lead plus security specialist | CyberGym and tool-loop gains map to defensive review | False confidence if tests are skipped |
| Critical platform or payment change | GPT-5.5 Pro | Ultra escalation for highest-cost mistakes | 237-credit floor can be excessive for routine work |
| Huge monorepo context | GPT-5.5 or GPT-5.5 Pro after retrieval | 1M context class helps final synthesis | Dumping raw context still creates noise |
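For teams that keep routing policy in the repository, the matrix above reduces to a scenario-to-lane lookup with a safe default. The keys and model slugs here are illustrative, not a Critique configuration schema.

```python
# Sketch: the routing matrix as a policy table a repo config could carry.
# Scenario keys and model slugs are illustrative assumptions.

ROUTING_POLICY = {
    "everyday-small":     ["gpt-5.4-mini", "kimi-k2.6", "deepseek-v4-flash"],
    "ambiguous-feature":  ["gpt-5.5"],
    "security-sensitive": ["gpt-5.5", "security-specialist"],
    "critical-platform":  ["gpt-5.5-pro"],          # Ultra plan only
    "huge-monorepo":      ["gpt-5.5", "gpt-5.5-pro"],  # after retrieval
}

def models_for(scenario: str) -> list[str]:
    # Unknown scenarios fall back to the frontier lead rather than the
    # cheap lane: a miss should fail toward accuracy, not toward savings.
    return ROUTING_POLICY.get(scenario, ["gpt-5.5"])
```

The fallback choice is the one policy decision worth debating: defaulting unknown work to GPT-5.5 burns credits, but defaulting it to the cheap lane silently downgrades exactly the PRs nobody thought to classify.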