22 min read · Critique

Cursor as a Top-Tier Agent Harness: Composer 2.5, Cloud BYOA, and How It Compares to the Models on Critique

Deep read on Cursor’s agent runtime and Composer 2.5 — vendor benchmarks vs Opus 4.7, GPT-5.5, Kimi K2.6, MiniMax M3, and Qwen3.7 Plus — plus Critique’s SDK-backed cloud handoffs from review runs.

Default BYOA model: `composer-2.5` (Cursor Agent SDK · cloud runtime)

Harness + model · June 2026

Top-tier agent harness. Frontier coding model.

Critique now queues PR fix handoffs through the Cursor Agent SDK — same cloud agent loop as the IDE, running on Composer 2.5 against your repo and PR. Save your Cursor API key in Settings once; execution bills your Cursor plan, not Critique credits.

  • Harness: Tool-native agent loop with edit, terminal, search, and MCP inside Cursor cloud VMs.
  • Composer 2.5: Cursor’s in-house model, with RL on long coding trajectories and Kimi K2.5 lineage.
  • Cloud execution: Agents run on Cursor-hosted VMs with `prUrl` + `workOnCurrentBranch`, like our Claude BYOA path.
  • Colossus horizon: Cursor × SpaceX AI Colossus 2 training is the next compute chapter, not where BYOA runs today.
  • Critique blueprint: Findings, allowed paths, validation commands, then one queue from the review run.
Bring your own agent

One key in Settings. Cloud agents on every PR you choose.

No extra env vars or sidecar scripts for operators. The flow matches Claude Managed Agents and OpenAI Codex BYOA: encrypted key, scoped blueprint, QStash worker, status on the review run. Cursor is the harness; Composer 2.5 is the default model id we pass to the SDK.

  • Execution: Cursor BYOA. Your Cursor plan, not Critique credits. Ends: Queue from review runs.
  • Model: Composer 2.5. `composer-2.5`; Composer 2 Fast tier optional in Cursor. Ends: SDK cloud runtime.
  • Decision layer: Critique review. Findings + blueprint; Remedy optional. Ends: Same PR context.
  • 79.8%: Composer 2.5, SWE-Bench Multilingual (Cursor / DataCamp, May 2026)
  • 69.3%: Composer 2.5, Terminal-Bench 2.0 (same sources)
  • 63.2%: Composer 2.5, CursorBench v3.1
  • SDK: Critique queue path, cloud `prUrl` + `workOnCurrentBranch`

Teams argue about model leaderboards. Staff engineers argue about agent harnesses: Does the loop survive 40 minutes? Does it respect the PR branch? Does it recover from a failed test without rewriting half the repo? Cursor’s moat is the second conversation. The IDE, CLI, Cloud Agents, and the TypeScript SDK all expose the same conceptual object — an Agent with durable state, Runs per prompt, streaming tool events, and cloud VMs that clone your repository.

Critique does not try to replicate that harness inside our sandboxes for Cursor BYOA. We already have Remedy when you want Critique-managed OpenCode on E2B. Cursor BYOA is for orgs that standardized on Cursor execution: same billing relationship, same agent UX in cursor.com/agents, and Composer tuned for the tool schema Cursor actually ships.

Harness vs model on a pull request

Why Critique queues Cursor for execution but uses OpenRouter-shaped models for review.

| Layer | Cursor (BYOA) | Critique review catalog |
| --- | --- | --- |
| Question | How do we patch the PR? | What should change before merge? |
| Runtime | Cursor cloud VM + Agent SDK | Sandbox review graph + specialists |
| Default model | `composer-2.5` (Cursor) | Plan-dependent (Opus, Sonnet, M3, Qwen, …) |
| Billing | Cursor API key / plan | Critique credits or BYOK OpenRouter |
| Output | Commits on PR branch | Findings, verdict, blueprint JSON |

Composer 2.5 shipped May 18, 2026 as Cursor’s in-house agentic coding model. Public materials describe it as building on Composer 2, with more reinforcement learning on long-horizon coding tasks, better effort calibration (when to keep going vs stop), and stronger tool selection and intent understanding inside Cursor’s agent loop.

The base checkpoint is widely reported as Moonshot’s open Kimi K2.5 lineage — the same architectural family as Kimi K2.6 on Critique’s catalog. Cursor’s differentiation is post-training: Cursor states Composer 2.5 trained on roughly 25× more synthetic tasks than Composer 2, with harder synthetic problems generated dynamically as the model improved (so “easy” tasks did not dominate RL). Third-party summaries also cite a large fraction of total training compute going to Cursor’s own RL stack on top of the open checkpoint.

Composer 2.5 Standard

API list price about $0.50 / M input and $2.50 / M output tokens (Cursor docs, May 2026). Positioned for cost-sensitive batch runs.

Composer 2.5 Fast (default)

About $3 / M input and $15 / M output — same intelligence tier in Cursor’s framing, tuned for interactive agent sessions. Often cited as cheaper than other fast frontier tiers at similar latency.
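
To make the two price points concrete, here is a minimal sketch that turns the list prices quoted above into a per-run estimate. The token counts are hypothetical and the prices are the approximate May 2026 figures from the text, so treat the result as a rough order of magnitude, not a bill.

```typescript
// Approximate list prices quoted in the text (USD per million tokens).
// Confirm against current Cursor pricing before relying on them.
type Tier = "standard" | "fast";

interface TierPrice {
  inputPerMillion: number;
  outputPerMillion: number;
}

const COMPOSER_25_PRICES: Record<Tier, TierPrice> = {
  standard: { inputPerMillion: 0.5, outputPerMillion: 2.5 },
  fast: { inputPerMillion: 3, outputPerMillion: 15 },
};

// Estimated USD cost of one agent run for the given token counts.
function estimateRunCost(
  tier: Tier,
  inputTokens: number,
  outputTokens: number,
): number {
  const price = COMPOSER_25_PRICES[tier];
  return (
    (inputTokens / 1e6) * price.inputPerMillion +
    (outputTokens / 1e6) * price.outputPerMillion
  );
}

// Hypothetical run: 2M input tokens and 200k output tokens.
const standardCost = estimateRunCost("standard", 2_000_000, 200_000);
const fastCost = estimateRunCost("fast", 2_000_000, 200_000);
```

With these assumed token counts the Standard run comes to about $1.50 and the Fast run to about $9.00. At the quoted prices Fast is six times Standard on both input and output, so the ratio holds for any token mix.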

Composer 2.5 is text-first and tool-native: file edits, terminal, search, MCP when configured in Cursor. It is not on Critique’s OpenRouter review roster — it is exclusive to Cursor surfaces (IDE, CLI, Cloud Agents, SDK). That exclusivity is exactly why BYOA exists: your review can stay multi-vendor while fixes run on the stack you already bought.

SWE-Bench Multilingual — Composer 2.5 vs frontier rows
Multilingual repair suite. Scores from Cursor launch materials and DataCamp’s May 2026 comparison table — not SWE-Bench Verified or Pro.

SWE-Bench Multilingual

Composer 2.5 vs peers on the same published rows.

  • Claude Opus 4.7: 80.5%
  • Composer 2.5: 79.8%
  • GPT-5.5: 77.8%
  • Composer 2: 73.7%

Opus 4.8 may supersede 4.7 on some vendor tables; compare using the exact row your procurement packet cites.

Terminal-Bench 2.0 — agentic terminal coding
Higher is better. GPT-5.5 leads this suite in public comparisons; Composer 2.5 lands within 0.1 points of Opus 4.7.

Terminal-Bench 2.0

  • GPT-5.5: 82.7%
  • Claude Opus 4.7: 69.4%
  • Composer 2.5: 69.3%
  • Composer 2: 61.7%
  • Kimi K2.6: 66.7%
  • MiniMax M3: 66%
  • Qwen3.7 Plus: 70.3%

Qwen3.7 Plus terminal score from Alibaba Jun 2026 materials (Critique catalog). M3 from MiniMax launch blog. Harnesses differ — do not treat as interchangeable with Critique’s internal review scores.

CursorBench v3.1 — Cursor’s agent-trajectory benchmark
Designed to reflect real Cursor agent runs. Composer 2.5 at 63.2% in Cursor/DataCamp tables; Composer 2 at 52.2%.

CursorBench v3.1

  • Claude Opus 4.7 (max): 64.8%
  • GPT-5.5 (xhigh): 64.3%
  • Composer 2.5: 63.2%
  • Claude Opus 4.7 (default): 61.6%
  • GPT-5.5 (default): 59.2%
  • Composer 2: 52.2%
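
The three charts can also be read together as a generation-over-generation comparison. The sketch below uses only the Composer 2.5 and Composer 2 rows quoted in this article and computes the point gain on each suite; the object and function names are illustrative.

```typescript
// Published scores quoted in this article (percent). Vendor-reported,
// so the gains below inherit whatever caveats apply to the sources.
interface SuiteScores {
  composer25: number;
  composer2: number;
}

const SCORES: Record<string, SuiteScores> = {
  "SWE-Bench Multilingual": { composer25: 79.8, composer2: 73.7 },
  "Terminal-Bench 2.0": { composer25: 69.3, composer2: 61.7 },
  "CursorBench v3.1": { composer25: 63.2, composer2: 52.2 },
};

// Percentage-point gain, rounded to one decimal to hide float noise.
function pointGain(s: SuiteScores): number {
  return Math.round((s.composer25 - s.composer2) * 10) / 10;
}

// Map of suite name to Composer 2 -> Composer 2.5 gain in points.
const GAINS: Record<string, number> = {};
for (const suite of Object.keys(SCORES)) {
  GAINS[suite] = pointGain(SCORES[suite]);
}
```

On these rows the gains come to 6.1, 7.6, and 11.0 points respectively. The largest step is on CursorBench, the suite closest to Cursor's own agent runs.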

Artificial Analysis Coding Agent Index (May 2026) reports Composer 2.5 at **62** overall with strong cost-per-task — a different blend than CursorBench but the same narrative: near-frontier scores at lower dollars per task.

Cursor’s launch narrative for 2026 also points forward: training collaboration with SpaceX AI on Colossus 2 — public commentary describes an order-of-magnitude step up in training compute versus prior generations. That is foundation-model factory infrastructure, not the runtime path for your Tuesday afternoon PR fix.

Critique Cursor BYOA runs in Cursor cloud agents today: we call the Agent SDK with `cloud.repos[]`, your `prUrl`, `workOnCurrentBranch: true`, and `composer-2.5`. The worker runs on Critique’s backend (QStash → pipeline), but the agent loop executes in Cursor-hosted VMs — the same class of surface as “Queue Cursor agent” in the dashboard. Claude BYOA similarly mounts your repo in Anthropic managed cloud; Codex BYOA uses OpenAI Responses until a fuller Codex agent API is available.
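
As a rough illustration of the parameters named above, the sketch below assembles them into one object. Only `cloud.repos`, `prUrl`, `workOnCurrentBranch`, and the model id `composer-2.5` come from the text; the nesting, the `url` and `model` keys, the helper name, and the GitHub-style URL check are assumptions for illustration, not the Cursor Agent SDK's actual types.

```typescript
// Hypothetical parameter shape. Field names beyond cloud.repos, prUrl,
// and workOnCurrentBranch are assumptions, not the real SDK surface.
interface CloudRepo {
  url: string; // assumed key for the repository location
}

interface CursorHandoffParams {
  model: string; // assumed key carrying the model id
  cloud: { repos: CloudRepo[] };
  prUrl: string;
  workOnCurrentBranch: boolean;
}

// Builds the parameter object for one PR fix handoff.
// Assumes GitHub-style PR URLs of the form <repoUrl>/pull/<number>.
function buildHandoffParams(repoUrl: string, prUrl: string): CursorHandoffParams {
  if (!prUrl.startsWith(repoUrl + "/pull/")) {
    throw new Error("prUrl must point at a pull request in repoUrl");
  }
  return {
    model: "composer-2.5",
    cloud: { repos: [{ url: repoUrl }] },
    prUrl,
    // Commit on the PR head branch instead of opening a new branch.
    workOnCurrentBranch: true,
  };
}

// Example with placeholder URLs.
const exampleParams = buildHandoffParams(
  "https://github.com/acme/widgets",
  "https://github.com/acme/widgets/pull/42",
);
```

Keeping the repo and the PR URL in one validated object is a design choice of this sketch: it makes it harder to queue an agent against a repository that the PR does not belong to.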

Critique → Cursor cloud (BYOA)
  1. Review completes: findings + policy + allowed write paths
  2. `buildRemedyBlueprint` (backend: byoa) + cursor handoff JSON
  3. Queue: `POST /api/review-runs/{id}/cursor-agent`
  4. Cursor Agent SDK: cloud repos + `prUrl` + `composer-2.5`
  5. Open in Cursor: commits on PR head branch
Operator checklist
  1. Where do I put the API key?
     Cursor Dashboard → Integrations, then Critique Settings → Cursor agent (BYOA). Same pattern as Anthropic and OpenAI BYOA panels.
  2. When can I queue?
     After the review run completes on the PR you want fixed. Optional operator instructions narrow scope or tests.
  3. Where does the agent run?
     In Cursor cloud VMs via the Agent SDK, not on Critique Remedy sandboxes and not on Colossus training clusters.
  4. What model executes?
     Composer 2.5 (`composer-2.5`) unless Cursor changes the SDK default for your account tier.

Export remains available: `GET /api/review-runs/{reviewRunId}/byoa/cursor` returns the `critique.cursor_agent_handoff` JSON for your own CI or scripts. Most teams use the queue button.
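
For teams scripting against these routes, a minimal sketch of how the two requests might be planned is shown below. The two paths are the ones quoted in this article; the base URL, the bearer-token header, and every function name are assumptions, so check them against your Critique deployment before use.

```typescript
// Paths as quoted in the article. Auth scheme and base URL are assumed.
type HandoffAction = "queue" | "export";

interface PlannedRequest {
  method: "GET" | "POST";
  url: string;
  headers: Record<string, string>;
}

// Plans (but does not send) the request for one review run.
function planHandoffRequest(
  action: HandoffAction,
  baseUrl: string,
  reviewRunId: string,
  token: string,
): PlannedRequest {
  const root = baseUrl.replace(/\/+$/, ""); // drop trailing slashes
  const id = encodeURIComponent(reviewRunId);
  const path =
    action === "queue"
      ? `/api/review-runs/${id}/cursor-agent` // queue a Cursor cloud agent
      : `/api/review-runs/${id}/byoa/cursor`; // export the handoff JSON
  return {
    method: action === "queue" ? "POST" : "GET",
    url: root + path,
    // Assumption: bearer-token auth. The article does not specify it.
    headers: { Authorization: `Bearer ${token}` },
  };
}

// Example with placeholder values.
const exportRequest = planHandoffRequest(
  "export",
  "https://critique.example/",
  "run_123",
  "YOUR_TOKEN",
);
```

The planned request can then be passed to any HTTP client, for example `fetch(exportRequest.url, { method: exportRequest.method, headers: exportRequest.headers })`. Separating planning from sending keeps the URL logic testable without network access.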

Cursor BYOA vs Remedy vs review-only
| Path | Who executes | When |
| --- | --- | --- |
| Review only | Nobody (human or external) | You want findings without auto-fix |
| Remedy | Critique E2B / OpenCode | You want one invoice and managed sandbox |
| Cursor BYOA | Cursor cloud + Composer 2.5 | You already pay for Cursor agents |
| Claude / Codex BYOA | Anthropic / OpenAI | Same pattern, different vendor key |
Composer 2.5 does not run PR review. Review uses Critique’s multi-model graph on OpenRouter-shaped ids (Opus, Sonnet, M3, Qwen, Kimi, etc.). Composer 2.5 is only used when you queue **Cursor BYOA** execution after review.
You do not run the SDK yourself. Critique runs it server-side with your encrypted API key. You only save the key in Settings.
Composer 2.5 and Kimi K2.6 share reported K2.5-lineage DNA, but Composer 2.5 is Cursor’s post-trained agentic product with Cursor-only tool tuning. Kimi K2.6 on Critique is a separate OpenRouter runtime for review passes.
Critique falls back to the Cloud Agents REST API with the same PR attachment and model id. You still use one key in Settings.
That essay is the partnership framing. This one is the deep harness + Composer 2.5 benchmark read and the SDK queue path shipping now.