
GPT-5.5 and GPT-5.5 Pro in Critique: Benchmarks, Pricing, and When to Spend the Credits

OpenAI calls GPT-5.5 a new class of intelligence for real work. Here is what the April 2026 launch data, OpenRouter pricing, and GDPval-AA results mean for PR review, specialist agents, and Remedy.

GPT-5.5 is not just another name in the model dropdown. OpenAI describes it as a fully retrained model aimed at agentic coding, computer use, knowledge work, and scientific research. That matters for Critique because pull request review is exactly that kind of workload: the model must read messy context, use tools, compare claims against code, decide what is actually risky, and keep going until the review artifact is useful.

The important product question is not "is GPT-5.5 smart?" The public answer is yes. The practical question is where to route it. Critique already has cheap specialist lanes, long-context open models, and mid-premium OpenAI options like GPT-5.4. GPT-5.5 earns the lead slot when the PR is ambiguous, terminal-heavy, security-sensitive, or likely to need a multi-file mental model. GPT-5.5 Pro earns the escalation slot when accuracy beats cost: auth, payments, infrastructure, incident response, regulatory logic, or a Remedy plan that must be right before it touches code.

82.7% · Terminal-Bench 2.0 for GPT-5.5, up from 75.1% for GPT-5.4 in OpenAI evals
58.6% · SWE-Bench Pro for GPT-5.5 on public real-world GitHub issue repair
90.1% · BrowseComp for GPT-5.5 Pro, the clearest Pro win in OpenAI launch tables
1.05M · OpenRouter context class for GPT-5.5 and GPT-5.5 Pro: 922K input + 128K output
OpenRouter IDs and Critique floors

How GPT-5.5 lands in the model catalog

Critique credits meter the whole review run: lead synthesis, specialists, depth multipliers, retries, and Remedy handoff. Vendor token prices are still useful because they explain why GPT-5.5 Pro belongs behind an Ultra gate.

GPT-5.5 (OpenAI · throughput tier)
OpenRouter ID: openai/gpt-5.5
Standard-plan frontier OpenAI lead for hard PRs, agentic review, and Remedy runs where GPT-5.4 is not enough.
Critique floor: 40 cr / run · OpenAI API: $5 input / $30 output per 1M tokens

GPT-5.5 Pro (OpenAI · capability tier)
OpenRouter ID: openai/gpt-5.5-pro
Ultra-only deep reasoning lane for high-stakes review, legal/business/data work, and long-horizon accuracy.
Critique floor: 237 cr / run · OpenAI API: $30 input / $180 output per 1M tokens

OpenRouter lists both models with a 1.05M context class. OpenAI lists API pricing at $5/$30 per 1M tokens for GPT-5.5 and $30/$180 for GPT-5.5 Pro. Critique floors are catalog values and may scale with PR depth.
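At these prices, a rough per-run cost model makes the Ultra gate concrete. The sketch below uses only the per-1M-token API prices quoted above; the token counts in the example are illustrative assumptions, not measured Critique workloads.

```python
# Back-of-envelope vendor cost for a single review run at the quoted
# per-1M-token API prices. Token counts below are made-up examples.

PRICES = {                       # (input $/1M, output $/1M)
    "gpt-5.5":     (5.0, 30.0),
    "gpt-5.5-pro": (30.0, 180.0),
}

def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one run at the listed per-1M-token prices."""
    in_price, out_price = PRICES[model]
    return input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price

# Example: a deep review reading 400K tokens of context, emitting 20K tokens.
print(run_cost("gpt-5.5", 400_000, 20_000))      # ~ $2.60
print(run_cost("gpt-5.5-pro", 400_000, 20_000))  # ~ $15.60, 6x the standard lane
```

Because Pro is priced at exactly six times GPT-5.5 on both input and output, the ratio holds at any token mix, which is why the 237-credit Ultra floor tracks the vendor gap rather than a flat markup.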

PART ONE - WHAT OPENAI SHIPPED

OpenAI released GPT-5.5 on April 23, 2026, then updated the launch post on April 24 to note API availability and a revised system card. The model is positioned around "real work" rather than chat-only intelligence: coding, debugging, online research, data analysis, documents, spreadsheets, software operation, and tool use across a task until completion.

The most relevant release detail for engineering teams is efficiency. OpenAI says GPT-5.5 matches GPT-5.4 per-token latency in real-world serving while reaching higher intelligence and using significantly fewer tokens on the same Codex tasks. That is the right axis for automated review: long-running agentic work gets expensive not only because the model is pricey, but because failed paths and repeated reasoning burn output tokens.

PART TWO - CODING AND AGENTIC BENCHMARKS

For PR review, the first benchmark cluster to care about is not MMLU-style knowledge. It is coding and tool use: Terminal-Bench 2.0 for command-line workflows, SWE-Bench Pro for real GitHub issue repair, Expert-SWE for long-horizon internal engineering tasks, and MCP Atlas or Toolathlon for structured tool coordination.

OpenAI coding and terminal evals

Percent scores from OpenAI GPT-5.5 launch tables. Higher is better.

OpenAI notes labs have raised memorization concerns for SWE-Bench Pro; treat SWE numbers as directional and replay your own repository tasks before making routing policy permanent.

GPT-5.5 vs GPT-5.4 on review-shaped tasks

Selected OpenAI launch-table rows that map closely to Critique review and Remedy workflows.

| Benchmark | GPT-5.5 | GPT-5.4 | Delta |
|---|---|---|---|
| Terminal-Bench 2.0 | 82.7% | 75.1% | +7.6 pts |
| Expert-SWE (internal) | 73.1% | 68.5% | +4.6 pts |
| OSWorld-Verified | 78.7% | 75.0% | +3.7 pts |
| MCP Atlas | 75.3% | 70.6% | +4.7 pts |
| Tau2-bench Telecom | 98.0% | 92.8% | +5.2 pts |
| CyberGym | 81.8% | 79.0% | +2.8 pts |

Tau2-bench Telecom is not a coding benchmark, but it is useful for judging long tool workflows with state, policy, and customer-service style constraints.

This is why GPT-5.5 is now prominent in Critique rather than hidden as a novelty model. A review lead has to do the unglamorous middle work: inspect command output, reason about blast radius, decide if a specialist finding is real, and write a verdict that a human maintainer can act on. Terminal-Bench and MCP-style evals are imperfect, but they are closer to that job than a static answer benchmark.

PART THREE - GPT-5.5 PRO IS NOT JUST "MORE GPT-5.5"

GPT-5.5 Pro is the expensive lane. OpenAI prices it at six times GPT-5.5 on input and output tokens, and Critique reflects that with an Ultra-only 237-credit floor. The reason to use it is not ordinary code review volume. The reason is high-stakes correctness, especially when the task looks more like a research partner, legal/business analyst, or deep Remedy planner than a fast review pass.

Where GPT-5.5 Pro separates

OpenAI launch-table rows where Pro is listed against GPT-5.5 or GPT-5.4 Pro.

Pro does not appear on every coding row in the launch table. Its clearest public advantages are browsing, harder math tiers, GeneBench, investment-banking modeling, and high-accuracy professional work.

Routing decision

GPT-5.5 vs GPT-5.5 Pro inside Critique

The default should not be "always Pro." The default should be risk-proportional routing.

| Metric | GPT-5.5 | GPT-5.5 Pro |
|---|---|---|
| Critique plan | Standard and above | Ultra only |
| Credit floor | 40 cr | 237 cr |
| OpenRouter ID | openai/gpt-5.5 | openai/gpt-5.5-pro |
| Vendor API price | $5 / $30 per 1M in/out | $30 / $180 per 1M in/out |
| Best lead use | Hard PRs, terminal-heavy review, large context synthesis | High-stakes architecture, auth, payments, compliance, critical Remedy plans |
| Avoid when | A cheap specialist lane already answers the question | Latency or cost matters more than marginal accuracy |

PART FOUR - LONG CONTEXT IS THE QUIET STORY

Both OpenAI and OpenRouter describe GPT-5.5 as a 1M-context-class model, while OpenRouter lists the deployed shape as roughly 922K input tokens plus 128K output tokens. Long context does not remove retrieval. It changes what you can safely put in front of the lead after retrieval has narrowed the field: policy, diff, impacted files, test output, specialist reports, prior review memory, and the dependency trail that explains why a small hunk is dangerous.
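To make the deployed shape concrete, here is a minimal post-retrieval packing sketch against a 922K-input budget. The section names, priority order, reserve headroom, and chars-per-token heuristic are all assumptions for illustration, not Critique's actual packing logic.

```python
# A minimal sketch of post-retrieval context budgeting for a 922K-input /
# 128K-output deployment shape. Section names, sizes, and the reserve
# are illustrative assumptions, not Critique internals.

INPUT_BUDGET = 922_000   # model input cap (tokens)
RESERVE = 22_000         # assumed headroom for system prompt and tool schemas

def approx_tokens(text: str) -> int:
    """Crude token estimate: roughly 4 characters per token (heuristic)."""
    return len(text) // 4 + 1

def pack_context(sections: list[tuple[str, str]]) -> list[str]:
    """Keep sections in priority order until the input budget is spent."""
    remaining = INPUT_BUDGET - RESERVE
    kept = []
    for name, text in sections:
        cost = approx_tokens(text)
        if cost <= remaining:
            kept.append(name)
            remaining -= cost
    return kept

# Priority order mirrors the list above: policy first, dependency trail last.
sections = [
    ("policy", "..." * 1_000),
    ("diff", "..." * 50_000),
    ("impacted_files", "..." * 200_000),
    ("test_output", "..." * 30_000),
    ("specialist_reports", "..." * 40_000),
    ("prior_review_memory", "..." * 3_000_000),  # too large: gets dropped
    ("dependency_trail", "..." * 100_000),
]
print(pack_context(sections))  # everything but prior_review_memory fits
```

The point of the sketch is the failure mode, not the heuristic: even at 1M-class context, one oversized section (here, unbounded review memory) crowds out material that explains why a small hunk is dangerous, which is why retrieval still has to narrow the field first.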

OpenAI long-context evals: GPT-5.5 vs GPT-5.4

F1 scores from OpenAI long-context rows. Higher is better.

These rows are not PR-review benchmarks, but they are relevant to repo-aware review because they stress retrieval across very long prompt ranges.

PART FIVE - THIRD-PARTY VIEW: GDPVAL-AA

OpenAI reports GDPval as "wins or ties"; Artificial Analysis publishes GDPval-AA as an independent agentic harness with shell and web access via Stirrup, scored with Elo from blind pairwise comparisons. The two displays are not directly comparable, but together they are useful: OpenAI shows GPT-5.5 at 84.9% GDPval, while Artificial Analysis ranks GPT-5.5 (xhigh) first on GDPval-AA at 1782 Elo.

GDPval-AA leaderboard snapshot

Artificial Analysis Elo scores for selected top models, April 2026. Higher is better.

Artificial Analysis publishes confidence intervals; treat close rankings as directional rather than absolute. The important fact is that GPT-5.5 occupies the top cluster on an external agentic work benchmark.

PART SIX - HOW TO USE GPT-5.5 IN CRITIQUE

Routing checklist
  1. Is the PR small, local, and low-risk?
     Use a cheaper lead or specialist stack. GPT-5.5 is usually overkill for copy edits, one-file UI tweaks, and routine dependency bumps.
  2. Does the PR require terminal work or reproduction?
     Prefer GPT-5.5 as lead. Terminal-Bench, OSWorld, MCP Atlas, and Tau2-bench gains point to stronger tool-loop behavior.
  3. Could a wrong review miss auth, payments, security, or data-loss risk?
     Escalate to GPT-5.5, and consider GPT-5.5 Pro on Ultra when the cost of a false negative dwarfs the credit cost.
  4. Is Remedy going to make code changes from the result?
     Use GPT-5.5 for most hard fixes. Reserve GPT-5.5 Pro for critical plans that need extra research, browsing, or high-accuracy reasoning before execution.
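The checklist above can be sketched as a small routing function. The signal names, plan flag, and lane labels are hypothetical; this is one way to encode risk-proportional routing, not Critique's implementation.

```python
# A hedged sketch of the routing checklist as code. Field names and
# lane strings are illustrative assumptions, not Critique's API.

from dataclasses import dataclass

@dataclass
class PRSignals:
    low_risk: bool            # small, local, routine change
    needs_terminal: bool      # reproduction or command-line work expected
    high_stakes: bool         # auth, payments, security, data-loss exposure
    remedy_critical: bool     # Remedy plan must be right before execution
    ultra_plan: bool          # account is on the Ultra plan

def route(pr: PRSignals) -> str:
    # Step 1: don't burn frontier credits on low-risk, local changes.
    if pr.low_risk and not (pr.high_stakes or pr.needs_terminal):
        return "cheap-lane"
    # Steps 3-4: escalate to Pro only when accuracy beats cost and the gate allows it.
    if (pr.high_stakes or pr.remedy_critical) and pr.ultra_plan:
        return "openai/gpt-5.5-pro"
    # Step 2 and the remainder: default frontier lead.
    return "openai/gpt-5.5"

print(route(PRSignals(True, False, False, False, False)))   # cheap-lane
print(route(PRSignals(False, True, False, False, False)))   # openai/gpt-5.5
print(route(PRSignals(False, False, True, True, True)))     # openai/gpt-5.5-pro
```

Note the deliberate asymmetry: a high-stakes PR without the Ultra plan still falls through to GPT-5.5 rather than failing, matching the "escalate to GPT-5.5, consider Pro on Ultra" wording in step 3.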
Recommended Critique placement

A practical routing matrix for teams updating repository policy.

| Scenario | Best model | Why | Risk if overused |
|---|---|---|---|
| Everyday small PR | GPT-5.4 Mini, Kimi K2.6, DeepSeek V4 Flash | Fast enough signal at lower credit burn | Wasting frontier budget |
| Ambiguous feature PR | GPT-5.5 | Best balance of OpenAI frontier depth and usable cost | Slower than cheap lanes |
| Security-sensitive PR | GPT-5.5 lead plus security specialist | CyberGym and tool-loop gains map to defensive review | False confidence if tests are skipped |
| Critical platform or payment change | GPT-5.5 Pro | Ultra escalation for highest-cost mistakes | 237-credit floor can be excessive for routine work |
| Huge monorepo context | GPT-5.5 or GPT-5.5 Pro after retrieval | 1M context class helps final synthesis | Dumping raw context still creates noise |
Is GPT-5.5 available in Critique today?
Yes. GPT-5.5 is in the runtime catalog as openai/gpt-5.5 with a 40-credit floor for lead, specialist, and Remedy use.

Is GPT-5.5 Pro available in Critique today?
Yes. GPT-5.5 Pro is in the runtime catalog as openai/gpt-5.5-pro with a 237-credit Ultra floor for lead, specialist, and Remedy use.

Should GPT-5.5 replace GPT-5.4 everywhere?
No. GPT-5.5 should replace GPT-5.4 only where the review is hard enough to justify the floor: ambiguous failures, tool-heavy reproduction, large context, or high-risk merge paths.

Is there a single benchmark that settles routing?
No single benchmark is enough. Terminal-Bench 2.0, SWE-Bench Pro, MCP Atlas, OSWorld, CyberGym, and GDPval-AA together give a better picture of agentic code review than one coding leaderboard.
