What authentication do I need?

Send Authorization: Bearer crt_… using a key that carries the read:inference or write:inference scope (both are included on new keys alongside Builder scopes). Create a key at critique.sh/inference-api while signed in.

How does billing work?

Token usage converts to Critique credits at Solo-plan economics (Solo $19/mo, Pro $49/mo, Team $149/mo). Review runs use per-run catalog floors; Inference API uses per-token math on the active model rate card.

How is Nemotron 3 Ultra priced?

Nemotron 3 Ultra is 3 credits per PR review run. Inference API private list rates are $1.00 / $5.00 per million input/output tokens, with optional log retention at 75% off ($0.25 / $1.25 per million).

Which IDEs and clients work?

Any OpenAI-compatible client with a custom base URL: Cursor, Windsurf, Zed, VS Code, JetBrains, LangChain, Vercel AI SDK, Continue, Cline, and your own agent harness. Set baseURL to https://critique.sh/api/v1 and apiKey to your crt_ secret. Default model: deepseek/deepseek-v4-flash. Also available: tencent/hy3, nvidia/nemotron-3-ultra-550b-a55b, moonshotai/kimi-k2.6, z-ai/glm-5.2, and arcee-ai/trinity-large-thinking.

How is Tencent Hy3 priced?

Hy3 is 10% below market on the Inference API ($0.126 / $0.522 per M input/output vs $0.14 / $0.58 market) and 1 credit per PR review run. 295B MoE (21B active), 262K context, Western-hosted. New Inference API users get $50 / €50 worth of bonus credits when they create their first crt_ key. Private tier has no log retention; log retention tier is 75% off list token rates.

Critique Inference

OpenAI-compatible inference. Ship from the IDE you already use.

Use Critique as raw token inference for Cursor, Windsurf, Zed, VS Code, JetBrains, your agent harness, eval loops, or any OpenAI client. List rates are built for building fast. Opt into log retention on eligible models and cut your token bill by 75% if you're OK helping improve models and the platform.

Cheap raw inference

OpenAI-compatible chat completions on frontier coding models — Nemotron 3 Ultra, Kimi K2.6, GLM-5.2, DeepSeek V4 Flash, and more. Pay per token from Critique credits. Built for agents that burn tokens all day.

75% off with log retention

Need the bill lower? Opt in to short-term log retention on eligible models and pay 25% of list price. Private-by-default stays at standard rates — no retention, no discount.

Logs improve the stack

Retained prompts and completions help us tune models, routing, and the Critique platform. You ship faster on cheaper tokens; we get signal to make the system better for everyone.

OpenAI-compatible
Bearer crt_ keys
Western-hosted
Pay per token

Cursor
Windsurf
Zed
VS Code
JetBrains

Rates

Private list rates for fast iteration, or log-retention pricing at 75% off. Same Critique credit pool.

Builder pricing

World-class model rates for Nemotron, Kimi K2.6, GLM-5.2 & more

OpenAI-compatible endpoints for Cursor, Windsurf, Zed, VS Code, JetBrains, LangChain, Vercel AI SDK, or your own harness. List rates are tuned for fast iteration. Opt into log retention on eligible models and cut the bill by 75%; those logs help us improve models and the platform.

NVIDIA Nemotron 3 Ultra

World-class MoE

Private / M: $1.00 in; $5.00 out
Log retention / M: $0.250 in; $1.25 out

Kimi K2.6

Multimodal agent

Private / M: $0.750 in; $3.75 out
Log retention / M: $0.188 in; $0.938 out

GLM-5.2

Long-horizon coding

Private / M: $0.950 in; $3.35 out
Log retention / M: $0.237 in; $0.838 out

Trinity Large Thinking

Open reasoning

Private / M: $0.250 in; $1.00 out
Log retention / M: $0.0625 in; $0.250 out

Tencent Hy3

10% below market

Private / M: $0.126 in; $0.522 out
Log retention / M: $0.0315 in; $0.131 out

DeepSeek V4 Flash

Default · fast lane

Private / M: $0.150 in; $0.300 out
Log retention / M: $0.0375 in; $0.0750 out

Per-million-token list prices on Critique Inference API. Your bill converts to Critique credits at Solo-plan economics. Log retention pricing requires opt-in — see conditions below.
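The log-retention column can be derived from the private list rates. The sketch below copies the list rates from the cards above and assumes the retention tier is exactly 25% of list before display rounding; the names LIST_RATES, LOG_RETENTION_MULTIPLIER, and logRetentionRate are illustrative and not part of the API.

```typescript
// Private list rates in USD per million tokens, copied from the rate cards above.
type Rate = { input: number; output: number };

const LIST_RATES: Record<string, Rate> = {
  "nvidia/nemotron-3-ultra-550b-a55b": { input: 1.0, output: 5.0 },
  "moonshotai/kimi-k2.6": { input: 0.75, output: 3.75 },
  "z-ai/glm-5.2": { input: 0.95, output: 3.35 },
  "arcee-ai/trinity-large-thinking": { input: 0.25, output: 1.0 },
  "tencent/hy3": { input: 0.126, output: 0.522 },
  "deepseek/deepseek-v4-flash": { input: 0.15, output: 0.3 },
};

// Assumption: the log-retention tier is exactly 25% of list (75% off),
// and the published figures are this value rounded for display.
const LOG_RETENTION_MULTIPLIER = 0.25;

function logRetentionRate(model: string): Rate {
  const list = LIST_RATES[model];
  if (!list) throw new Error(`Unknown model: ${model}`);
  return {
    input: list.input * LOG_RETENTION_MULTIPLIER,
    output: list.output * LOG_RETENTION_MULTIPLIER,
  };
}
```

For Kimi K2.6 this gives 0.1875 in and 0.9375 out, which the card shows rounded as $0.188 and $0.938.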

Log retention deal — 75% off eligible models

Standard pricing keeps prompts private with no retention. If you're building in Cursor, Windsurf, Zed, VS Code, JetBrains, your own harness, or any OpenAI client and want the lowest bill, opt in: we retain prompts and completions for a limited period to improve models and the Critique platform, and you pay 75% less per token on Nemotron 3 Ultra, Kimi K2.6, GLM-5.2, Trinity Large Thinking, DeepSeek V4 Flash, and Hy3.

Log retention applies to Inference API traffic on eligible models only — not PR review runs or Builder.
Retained prompts and completions are used to improve models, routing, and the Critique platform.
Private tier stays available at list rates with no log retention.
Revoke anytime in Settings → Connections or per request with X-Critique-DeepSeek-Training-Opt-In: false.
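The per-request override in the last condition can be sketched as a small header builder. This is a minimal sketch, not an official helper: the function name critiqueHeaders is hypothetical, and it assumes that leaving the header out lets the account-wide setting apply.

```typescript
const OPT_IN_HEADER = "X-Critique-DeepSeek-Training-Opt-In";

// retention: true opts this request in, false opts it out.
// Assumption: omitting the header leaves the account-wide setting in charge.
function critiqueHeaders(apiKey: string, retention?: boolean): Record<string, string> {
  const headers: Record<string, string> = {
    Authorization: `Bearer ${apiKey}`,
    "Content-Type": "application/json",
  };
  if (retention !== undefined) {
    headers[OPT_IN_HEADER] = retention ? "true" : "false";
  }
  return headers;
}
```

The returned object can be passed as the headers of a fetch call to https://critique.sh/api/v1/chat/completions. With the OpenAI SDK, the same opt-in header should fit in the per-request options argument, though that pairing is untested here.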

I'm OK with log retention for the 75% discount and understand I can turn it off anytime.

Checking this box updates your account billing setting when signed in.

Inference API docs →

Log retention deal · 75% off

DeepSeek V4 Flash, Tencent Hy3, Nemotron 3 Ultra, Kimi K2.6, GLM-5.2, and Trinity Large Thinking. Opt into log retention on eligible models and pay 75% less per token. Private tier stays at list rates.

Default

DeepSeek V4 Flash

DeepSeek V4 Flash is an efficiency-optimized Mixture-of-Experts model from DeepSeek with 284B total parameters and 13B activated parameters, supporting a 1M-token context window. It is designed for fast inference and high-throughput workloads, while maintaining strong reasoning and coding performance.

Context: 1M
Architecture: 284B MoE (13B active)
Review catalog: 0.5 credits / PR review

Best for

Coding agents
1M context
High-throughput inference

Input: $0.150/ M
Output: $0.300/ M

Private by default · Western-hosted

deepseek/deepseek-v4-flash

10% below market

Tencent Hy3

Hy3 is a 295B-parameter Mixture-of-Experts model from Tencent (21B active) built for reasoning, agentic workflows, and production use. Configurable reasoning effort — direct no-think by default, plus low and high chain-of-thought for complex math, coding, and multi-step problems. Strong tool-calling across agent scaffoldings with grounded, anti-hallucination behavior.

Context: 262K
Architecture: 295B MoE (21B active)
Review catalog: 1 credit / PR review

Best for

Agentic workflows
Configurable reasoning
Production code

Input: $0.126/ M
Output: $0.522/ M

Western-hosted · no logs retained

tencent/hy3

World-class MoE

NVIDIA Nemotron 3 Ultra

NVIDIA Nemotron 3 Ultra is an open frontier-reasoning and orchestration model from NVIDIA, with 55B active parameters out of 550B total (MoE). Built on a hybrid Transformer-Mamba mixture-of-experts architecture, it supports text input and output with a context window of up to 1M tokens. It is suited for long-running agentic workflows, including agent orchestration, coding agents, deep research, and complex enterprise tasks. It is particularly strong at multi-step reasoning and planning, with high-throughput inference designed for high-volume agent pipelines. It is part of the NVIDIA Nemotron family of open models for agentic AI.

Context: 1M
Architecture: 550B MoE (55B active)
Review catalog: 3 credits / PR review

Best for

Agent orchestration
Deep research
Multi-step planning

Input: $1.00/ M
Output: $5.00/ M

Western-hosted frontier MoE · private by default

nvidia/nemotron-3-ultra-550b-a55b

Multimodal agent

Kimi K2.6

Kimi K2.6 is Moonshot AI's next-generation multimodal model, designed for long-horizon coding, coding-driven UI/UX generation, and multi-agent orchestration. It handles complex end-to-end coding tasks across Python, Rust, and Go, and can convert prompts and visual inputs into production-ready interfaces. Its agent swarm architecture scales to hundreds of parallel sub-agents for autonomous task decomposition — delivering documents, websites, and spreadsheets in a single run without human oversight.

Context: 262K
Architecture: 1T MoE (32B active)
Review catalog: 4 credits / PR review

Best for

Long-horizon coding
Multimodal UI generation
Agent swarms

Input: $0.750/ M
Output: $3.75/ M

Western-hosted · multimodal coding and orchestration

moonshotai/kimi-k2.6

Long-horizon coding

GLM-5.2

GLM-5.2 is Z.ai’s flagship for long-horizon engineering: a solid 1M-token context, effort-level control, and MIT open weights. Critique exposes z-ai/glm-5.2 directly while keeping older GLM-5.1 client IDs aliased forward at the same list price.

Context: 1M
Architecture: 753B MoE · 1M context
Review catalog: 3 credits / PR review

Best for

8+ hour autonomous coding
Long-horizon tasks
Engineering-grade output

Input: $0.950/ M
Output: $3.35/ M

Western-hosted · extended autonomous coding loops

z-ai/glm-5.2

Open reasoning

Trinity Large Thinking

Trinity Large Thinking is a powerful open-source reasoning model from the team at Arcee AI. It shows strong performance on PinchBench, agentic workloads, and reasoning tasks.

Context: 262K
Architecture: 400B MoE (13B active)
Review catalog: 1 credit / PR review

Best for

Open-weight reasoning
Agentic workloads
PinchBench-scale agents

Input: $0.250/ M
Output: $1.00/ M

Apache 2.0 weights · strong agentic and reasoning signal

arcee-ai/trinity-large-thinking

Model	Tag	Model ID	Architecture	Context	Input / M	Output / M	Credits / PR review
DeepSeek V4 Flash	Default	deepseek/deepseek-v4-flash	284B MoE (13B active)	1M	$0.150	$0.300	0.5
Tencent Hy3	10% below market	tencent/hy3	295B MoE (21B active)	262K	$0.126	$0.522	1
NVIDIA Nemotron 3 Ultra	World-class MoE	nvidia/nemotron-3-ultra-550b-a55b	550B MoE (55B active)	1M	$1.00	$5.00	3
Kimi K2.6	Multimodal agent	moonshotai/kimi-k2.6	1T MoE (32B active)	262K	$0.750	$3.75	4
GLM-5.2	Long-horizon coding	z-ai/glm-5.2	753B MoE	1M	$0.950	$3.35	3
Trinity Large Thinking	Open reasoning	arcee-ai/trinity-large-thinking	400B MoE (13B active)	262K	$0.250	$1.00	1

Per-token USD equivalents · billed from Critique credits · responses include X-Critique-Credits-Charged

Private by default; log retention is optional for the discount.

Full pricing docs →

DeepSeek V4 Flash
deepseek/deepseek-v4-flash
Tencent Hy3
tencent/hy3
NVIDIA Nemotron 3 Ultra
nvidia/nemotron-3-ultra-550b-a55b
Kimi K2.6
moonshotai/kimi-k2.6
GLM-5.2
z-ai/glm-5.2
Trinity Large Thinking
arcee-ai/trinity-large-thinking
Western-hosted
No training on your prompts

Billing

Token usage converts to USD at the model rate card, then to Critique credits from your shared balance.

How we bill your account

Inference API calls draw from the same Critique credit balance as PR review and Builder. We meter actual prompt and completion tokens from the upstream response, convert that spend to USD using the active model rate card, then round up to whole credits.

1. Count tokens
prompt_tokens are billed at the model's input rate. completion_tokens (including reasoning tokens when reported) are billed at the output rate.
2. Convert to USD
inputUsd = (prompt_tokens ÷ 1,000,000) × inputRate
outputUsd = (completion_tokens ÷ 1,000,000) × outputRate
totalUsd = inputUsd + outputUsd
3. Convert to Critique credits
We use Solo-plan economics: $19 ÷ 750 credits ≈ $0.0253 per credit.
creditsCharged = max(1, ⌈totalUsd ÷ $0.0253⌉)
Non-zero usage always bills at least 1 credit. PR review runs use separate per-run catalog floors, not this token math.
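The three steps can be sketched as one function for estimating a bill before you send traffic. This is an approximation under stated assumptions, not the billing code itself: estimateCredits is a hypothetical name, the per-credit price is taken as the unrounded $19 ÷ 750, and a request with zero billable tokens is assumed to cost nothing.

```typescript
// Solo-plan economics from step 3: $19 buys 750 credits (about $0.0253 each).
const USD_PER_CREDIT = 19 / 750;

// inputRate and outputRate are USD per million tokens from the model rate card.
function estimateCredits(
  promptTokens: number,
  completionTokens: number,
  inputRate: number,
  outputRate: number,
): number {
  const inputUsd = (promptTokens / 1_000_000) * inputRate;
  const outputUsd = (completionTokens / 1_000_000) * outputRate;
  const totalUsd = inputUsd + outputUsd;

  // Assumption: no billable tokens means no charge.
  if (totalUsd <= 0) return 0;

  // Round up to whole credits; non-zero usage bills at least 1 credit.
  return Math.max(1, Math.ceil(totalUsd / USD_PER_CREDIT));
}
```

Because the result is rounded up, totals that land very close to a whole number of credits may differ by one credit from the server's figure, depending on whether it uses the rounded $0.0253 or the exact ratio. Treat the X-Critique-Credits-Charged response header as the authoritative value.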

Worked example · Kimi K2.6 standard rates

Prompt tokens: 100,000
Completion tokens: 50,000
Input USD: 0.100 M × $0.75/M = $0.0750
Output USD: 0.050 M × $3.75/M = $0.1875
Total USD: $0.2625
Credits charged: ⌈$0.2625 ÷ $0.0253⌉ = 11 credits

Non-streaming responses include X-Critique-Credits-Charged and X-Critique-Estimated-Usd. Log retention tier applies 75% off token rates on eligible models when enabled in Settings or via X-Critique-DeepSeek-Training-Opt-In: true.
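To reconcile spend client-side, the two billing headers can be read off each response. The sketch below assumes both header values are plain decimal numbers, which the text above does not specify; readBilling and HeaderReader are illustrative names.

```typescript
// Minimal shape shared by fetch's Headers object and simple test doubles.
interface HeaderReader {
  get(name: string): string | null;
}

interface BillingInfo {
  creditsCharged: number | null;
  estimatedUsd: number | null;
}

// Assumption: header values are plain decimal numbers such as "11" or "0.2625".
function toNumber(value: string | null): number | null {
  if (value === null || value.trim() === "") return null;
  const parsed = Number(value);
  return Number.isFinite(parsed) ? parsed : null;
}

function readBilling(headers: HeaderReader): BillingInfo {
  return {
    creditsCharged: toNumber(headers.get("X-Critique-Credits-Charged")),
    estimatedUsd: toNumber(headers.get("X-Critique-Estimated-Usd")),
  };
}
```

Passing response.headers from a fetch call works because Headers exposes the same get method. Where a header is missing or not numeric, the corresponding field comes back as null rather than a guessed value.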

Quickstart

Point any OpenAI-compatible client at https://critique.sh/api/v1 with a crt_ key. Works in Cursor, Windsurf, Zed, VS Code, JetBrains, LangChain, Vercel AI SDK, CI evals, or a sidecar next to review.

GET: /api/v1/models
POST: /api/v1/chat/completions

curl https://critique.sh/api/v1/chat/completions \
  -H "Authorization: Bearer crt_..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek/deepseek-v4-flash",
    "messages": [
      { "role": "user", "content": "Summarize this webhook retry design in three bullets." }
    ]
  }'
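The curl call above covers POST /api/v1/chat/completions. For GET /api/v1/models, a client will usually want to check that a model ID is offered before using it. The sketch below assumes the endpoint returns the OpenAI-style list shape with a data array of objects carrying an id; that shape is not confirmed here, and chooseModel is a hypothetical helper.

```typescript
// Assumption: GET /api/v1/models responds with { data: [{ id: "..." }, ...] },
// the list shape used by OpenAI-compatible APIs.
interface ModelList {
  data: Array<{ id: string }>;
}

// Documented default model for the Inference API.
const DEFAULT_MODEL = "deepseek/deepseek-v4-flash";

// Return the wanted model if the list offers it, otherwise the default.
function chooseModel(list: ModelList, wanted: string): string {
  const available = new Set(list.data.map((model) => model.id));
  return available.has(wanted) ? wanted : DEFAULT_MODEL;
}
```

Feed it the parsed JSON body of the models response, then pass the result as the model field of a chat completion request.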

OpenAI SDK

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.CRITIQUE_API_KEY,
  baseURL: "https://critique.sh/api/v1",
});

const response = await client.chat.completions.create({
  model: "deepseek/deepseek-v4-flash",
  messages: [{ role: "user", content: "Draft a TypeScript interface for idempotent job enqueue." }],
});

console.log(response.choices[0]?.message?.content);

OpenAI-compatible client

// OpenAI-compatible — same shape in any client with a custom base URL
import OpenAI from "openai";

export const inference = new OpenAI({
  apiKey: process.env.CRITIQUE_API_KEY,
  baseURL: "https://critique.sh/api/v1",
});

// Agent loop, sidecar, CI eval, or IDE extension
const res = await inference.chat.completions.create({
  model: "moonshotai/kimi-k2.6",
  messages: [{ role: "user", content: "Refactor this handler for idempotency." }],
});

API keys

Scopes: read:inference / write:inference

Create your Inference API key

Keys use the crt_ prefix with read:inference and write:inference scopes. Sign in to generate a key here — the full secret is shown once.

FAQ

Why use Critique as my inference layer?

Same crt_ keys and credit pool as review and Builder, OpenAI-compatible endpoints, Western-region routing, and list rates tuned for agents that burn tokens all day. Point Cursor, Windsurf, Zed, VS Code, JetBrains, your harness, or any OpenAI client at https://critique.sh/api/v1. On managed billing you need no separate provider account.

Do you store or train on my prompts?

Private tier (default): no log retention for model training — requests are metered for billing only. Log retention tier (optional, 75% off on eligible models): you opt in explicitly; retained prompts and completions help improve models, routing, and the Critique platform. Toggle in Settings → Connections or per request with X-Critique-DeepSeek-Training-Opt-In.

What is log retention pricing (75% off)?

On DeepSeek V4 Flash, Hy3, Nemotron 3 Ultra, Kimi K2.6, GLM-5.2, and Trinity Large Thinking you can opt into short-term log retention and pay 25% of list price (75% off). Accept the conditions, enable account-wide in Settings, or send X-Critique-DeepSeek-Training-Opt-In: true per request.

How is this different from the Coding Agent API?

The Coding Agent API runs a full sandboxed repo agent (clone, edit, optional draft PR). The Inference API is token-in/token-out chat completions for your own apps, agents, and tools — same crt_ keys and credit pool, lighter contract.

Where do requests run?

Critique routes Inference API traffic through Western-region infrastructure. Private and log-retention tiers both use Western-hosted routing.

Read the full API reference →

Need a full sandbox agent? Coding Agent API · Default model: deepseek/deepseek-v4-flash