OpenAI-compatible inference. Ship from the IDE you already use.
Use Critique as raw token inference for Cursor, Windsurf, Zed, VS Code, JetBrains, your agent harness, eval loops, or any OpenAI client. List rates are tuned for fast iteration. Opt into log retention on eligible models and cut your token bill by 75% if you're OK helping improve models and the platform.
Cheap raw inference
OpenAI-compatible chat completions on frontier coding models — Kimi K2.6, GLM-5.1, DeepSeek V4 Flash, and more. Pay per token from Critique credits. Built for agents that burn tokens all day.
75% off with log retention
Need the bill lower? Opt in to short-term log retention on eligible models and pay 25% of list price. Private-by-default stays at standard rates — no retention, no discount.
Logs improve the stack
Retained prompts and completions help us tune models, routing, and the Critique platform. You ship faster on cheaper tokens; we get signal to make the system better for everyone.
- OpenAI-compatible
- Bearer crt_ keys
- Western-hosted
- Pay per token
- Cursor
- Windsurf
- Zed
- VS Code
- JetBrains
Rates
Private list rates for fast iteration, or log-retention pricing at 75% off. Same Critique credit pool.
Builder pricing
Among the lowest rates we publish for Kimi K2.6, GLM-5.1 & more
OpenAI-compatible endpoints for Cursor, Windsurf, Zed, VS Code, JetBrains, LangChain, Vercel AI SDK, or your own harness. List rates are tuned for fast iteration. Opt into log retention on eligible models and cut the bill by 75%. Those logs help us improve models and the platform.
NVIDIA Nemotron 3 Ultra
Frontier MoE intro
- Private / M
- $0.250 in
- $1.25 out
- Output / M
- $1.25
Kimi K2.6
Multimodal agent
- Private / M
- $0.750 in
- $3.75 out
- Log retention / M
- $0.188 in
- $0.938 out
GLM-5.1
Long-horizon coding
- Private / M
- $0.950 in
- $3.35 out
- Log retention / M
- $0.237 in
- $0.838 out
Trinity Large Thinking
Open reasoning
- Private / M
- $0.250 in
- $1.00 out
- Log retention / M
- $0.0625 in
- $0.250 out
Tencent Hy3 Preview
10% below market
- Private / M
- $0.0567 in
- $0.189 out
- Log retention / M
- $0.0142 in
- $0.0473 out
DeepSeek V4 Flash
Default · fast lane
- Private / M
- $0.150 in
- $0.300 out
- Log retention / M
- $0.0375 in
- $0.0750 out
Per-million-token list prices on the Critique Inference API. Your bill converts to Critique credits at Solo-plan economics. Log retention pricing requires opt-in; see the conditions below.
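The relationship between the two price columns on each card can be sketched as a single multiplication. This is a minimal illustration, assuming the log-retention rate is exactly 25% of the private list rate before display rounding; `LOG_RETENTION_MULTIPLIER` and `logRetentionRate` are hypothetical names, not part of the Critique API.

```typescript
// Sketch only: derives a log-retention rate from a private list rate.
// LOG_RETENTION_MULTIPLIER and logRetentionRate are illustrative names (assumptions).
const LOG_RETENTION_MULTIPLIER = 0.25; // 75% off list price

/** Returns the discounted USD rate per million tokens, before display rounding. */
function logRetentionRate(listRatePerMillion: number): number {
  return listRatePerMillion * LOG_RETENTION_MULTIPLIER;
}

// Private list rates (USD per million tokens) copied from the cards above.
const kimiInput = logRetentionRate(0.75); // Kimi K2.6 input, listed as $0.188 after rounding
const flashOutput = logRetentionRate(0.3); // DeepSeek V4 Flash output, listed as $0.0750
```

The cards show rounded values, so a computed rate may differ from the displayed one in the last digit.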
Log retention deal — 75% off eligible models
Standard pricing keeps prompts private with no retention. If you're building in Cursor, Windsurf, Zed, VS Code, JetBrains, your own harness, or any OpenAI client and want the lowest bill, opt in: we retain prompts and completions for a limited period to improve models and the Critique platform, and you pay 75% less per token on Kimi K2.6, GLM-5.1, Trinity Large Thinking, DeepSeek V4 Flash, and Hy3 Preview.
- Log retention applies to Inference API traffic on eligible models only — not PR review runs or Builder.
- Retained prompts and completions are used to improve models, routing, and the Critique platform.
- Private tier stays available at list rates with no log retention.
- Revoke anytime in Settings → Connections or per request with X-Critique-DeepSeek-Training-Opt-In: false.
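The per-request override mentioned in the last condition can be sketched as a small header builder. This is an illustration under assumptions: `buildInferenceHeaders` is a hypothetical helper, and omitting the header is assumed (not confirmed by the text) to fall back to the account-wide setting. Only the header name and its `true` / `false` values come from this page.

```typescript
// Sketch only: buildInferenceHeaders is a hypothetical helper, not a Critique SDK function.
// The header name and its "true"/"false" values are taken from the text above.
function buildInferenceHeaders(
  apiKey: string,
  logRetention?: boolean,
): Record<string, string> {
  const headers: Record<string, string> = {
    Authorization: `Bearer ${apiKey}`,
    "Content-Type": "application/json",
  };
  // Assumption: leaving the header off defers to the account-wide setting.
  if (logRetention !== undefined) {
    headers["X-Critique-DeepSeek-Training-Opt-In"] = String(logRetention);
  }
  return headers;
}

// Opt a single request out of log retention ("crt_..." is a placeholder key).
const optedOut = buildInferenceHeaders("crt_...", false);
```

Pass the resulting object as the `headers` option of whatever HTTP client you use.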
Log retention deal · 75% off
DeepSeek V4 Flash, Tencent Hy3 Preview, Kimi K2.6, GLM-5.1, and Trinity Large Thinking. Opt into log retention on eligible models and pay 75% less per token. Private tier stays at list rates. Nemotron unchanged.
DeepSeek V4 Flash
DeepSeek V4 Flash is an efficiency-optimized Mixture-of-Experts model from DeepSeek with 284B total parameters and 13B activated parameters, supporting a 1M-token context window. It is designed for fast inference and high-throughput workloads, while maintaining strong reasoning and coding performance.
- Context
- 1M
- Architecture
- 284B MoE (13B active)
- Review catalog
- 0.5 credits / PR review
Best for
- Coding agents
- 1M context
- High-throughput inference
- Input
- $0.150 / M
- Output
- $0.300 / M
Private by default · Western-hosted
deepseek/deepseek-v4-flash
Tencent Hy3 Preview
Hy3 Preview is a high-efficiency Mixture-of-Experts model from Tencent designed for agentic workflows and production use. It supports configurable reasoning levels across disabled, low, and high modes, allowing it to balance speed and depth depending on the task, while delivering strong code generation and reliable performance across multi-step, real-world workflows.
- Context
- 262K
- Architecture
- 205B MoE
- Review catalog
- 0.5 credits / PR review
Best for
- Agentic workflows
- Configurable reasoning
- Production code
- Input
- $0.0567 / M
- Output
- $0.189 / M
Western-hosted · no logs retained
tencent/hy3-preview
NVIDIA Nemotron 3 Ultra
NVIDIA Nemotron 3 Ultra is an open frontier-reasoning and orchestration model from NVIDIA, with 55B active parameters out of 550B total (MoE). Built on a hybrid Transformer-Mamba mixture-of-experts architecture, it supports text input and output with a context window of up to 1M tokens. It is suited for long-running agentic workflows, including agent orchestration, coding agents, deep research, and complex enterprise tasks. It is particularly strong at multi-step reasoning and planning, with high-throughput inference designed for high-volume agent pipelines. It is part of the NVIDIA Nemotron family of open models for agentic AI.
- Context
- 1M
- Architecture
- 550B MoE (55B active)
- Review catalog
- 2 credits intro · 3 credits shelf
Best for
- Agent orchestration
- Deep research
- Multi-step planning
- Input
- $0.250 / M
- Output
- $1.25 / M
Intro API rates through June 19, 2026 (UTC)
nvidia/nemotron-3-ultra-550b-a55b
Kimi K2.6
Kimi K2.6 is Moonshot AI's next-generation multimodal model, designed for long-horizon coding, coding-driven UI/UX generation, and multi-agent orchestration. It handles complex end-to-end coding tasks across Python, Rust, and Go, and can convert prompts and visual inputs into production-ready interfaces. Its agent swarm architecture scales to hundreds of parallel sub-agents for autonomous task decomposition — delivering documents, websites, and spreadsheets in a single run without human oversight.
- Context
- 262K
- Architecture
- 1T MoE (32B active)
- Review catalog
- 4 credits / PR review
Best for
- Long-horizon coding
- Multimodal UI generation
- Agent swarms
- Input
- $0.750 / M
- Output
- $3.75 / M
Western-hosted · multimodal coding and orchestration
moonshotai/kimi-k2.6
GLM-5.1
GLM-5.1 delivers a major leap in coding capability, with particularly significant gains in handling long-horizon tasks. Unlike previous models built around minute-level interactions, GLM-5.1 can work independently and continuously on a single task for more than 8 hours, autonomously planning, executing, and improving itself throughout the process, ultimately delivering complete, engineering-grade results.
- Context
- 203K
- Architecture
- Frontier coding MoE
- Review catalog
- 3 credits / PR review
Best for
- 8+ hour autonomous coding
- Long-horizon tasks
- Engineering-grade output
- Input
- $0.950 / M
- Output
- $3.35 / M
Western-hosted · extended autonomous coding loops
z-ai/glm-5.1
Trinity Large Thinking
Trinity Large Thinking is a powerful open-source reasoning model from the team at Arcee AI. It shows strong performance on PinchBench, agentic workloads, and reasoning tasks.
- Context
- 262K
- Architecture
- 400B MoE (13B active)
- Review catalog
- 1 credit / PR review
Best for
- Open-weight reasoning
- Agentic workloads
- PinchBench-scale agents
- Input
- $0.250 / M
- Output
- $1.00 / M
Apache 2.0 weights · strong agentic and reasoning signal
arcee-ai/trinity-large-thinking
| Model | Architecture | Context | Input / M | Output / M | Review run |
|---|---|---|---|---|---|
| DeepSeek V4 Flash · Default · deepseek/deepseek-v4-flash | 284B MoE (13B active) | 1M | $0.150 | $0.300 | 0.5 credits / PR review |
| Tencent Hy3 Preview · 10% below market · tencent/hy3-preview | 205B MoE | 262K | $0.0567 | $0.189 | 0.5 credits / PR review |
| NVIDIA Nemotron 3 Ultra · Intro pricing · nvidia/nemotron-3-ultra-550b-a55b | 550B MoE (55B active) | 1M | $0.250 | $1.25 | 2 credits intro · 3 credits shelf |
| Kimi K2.6 · Multimodal agent · moonshotai/kimi-k2.6 | 1T MoE (32B active) | 262K | $0.750 | $3.75 | 4 credits / PR review |
| GLM-5.1 · Long-horizon coding · z-ai/glm-5.1 | Frontier coding MoE | 203K | $0.950 | $3.35 | 3 credits / PR review |
| Trinity Large Thinking · Open reasoning · arcee-ai/trinity-large-thinking | 400B MoE (13B active) | 262K | $0.250 | $1.00 | 1 credit / PR review |
Per-token USD equivalents · billed from Critique credits · responses include X-Critique-Credits-Charged
Private by default. Log retention is optional for the discount.
Full pricing docs →
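To compare models for a given workload, the rate table can be treated as data. The sketch below is illustrative only: `RATE_CARD` and `rankByCost` are hypothetical names, the rates are the private list prices copied from the table, and the ranking ignores context limits, quality, and the credit rounding applied at billing time.

```typescript
// Sketch only: RATE_CARD and rankByCost are illustrative names (assumptions).
// Rates are private list prices in USD per million tokens, copied from the table above.
interface ModelRate {
  id: string;
  inputPerMillion: number;
  outputPerMillion: number;
}

const RATE_CARD: ModelRate[] = [
  { id: "deepseek/deepseek-v4-flash", inputPerMillion: 0.15, outputPerMillion: 0.3 },
  { id: "tencent/hy3-preview", inputPerMillion: 0.0567, outputPerMillion: 0.189 },
  { id: "nvidia/nemotron-3-ultra-550b-a55b", inputPerMillion: 0.25, outputPerMillion: 1.25 },
  { id: "moonshotai/kimi-k2.6", inputPerMillion: 0.75, outputPerMillion: 3.75 },
  { id: "z-ai/glm-5.1", inputPerMillion: 0.95, outputPerMillion: 3.35 },
  { id: "arcee-ai/trinity-large-thinking", inputPerMillion: 0.25, outputPerMillion: 1.0 },
];

/** Estimates USD per model for one token mix and sorts cheapest first. */
function rankByCost(
  promptTokens: number,
  completionTokens: number,
): Array<{ id: string; usd: number }> {
  return RATE_CARD.map((model) => ({
    id: model.id,
    usd:
      (promptTokens / 1_000_000) * model.inputPerMillion +
      (completionTokens / 1_000_000) * model.outputPerMillion,
  })).sort((a, b) => a.usd - b.usd);
}

// 100,000 prompt tokens and 50,000 completion tokens, private rates.
const ranked = rankByCost(100_000, 50_000);
```

Here `ranked[0]` is the cheapest model for that token mix at list rates.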
- DeepSeek V4 Flash: deepseek/deepseek-v4-flash
- Tencent Hy3 Preview: tencent/hy3-preview
- NVIDIA Nemotron 3 Ultra: nvidia/nemotron-3-ultra-550b-a55b
- Kimi K2.6: moonshotai/kimi-k2.6
- GLM-5.1: z-ai/glm-5.1
- Trinity Large Thinking: arcee-ai/trinity-large-thinking
- Western-hosted · No training on your prompts
Billing
Token usage converts to USD at the model rate card, then to Critique credits from your shared balance.
How we bill your account
Inference API calls draw from the same Critique credit balance as PR review and Builder. We meter actual prompt and completion tokens from the upstream response, convert that spend to USD using the active model rate card, then round up to whole credits.
1. Count tokens
prompt_tokens are billed at the model's input rate. completion_tokens (including reasoning tokens when reported) are billed at the output rate.
2. Convert to USD
inputUsd = (prompt_tokens ÷ 1,000,000) × inputRate
outputUsd = (completion_tokens ÷ 1,000,000) × outputRate
totalUsd = inputUsd + outputUsd
3. Convert to Critique credits
We use Solo-plan economics: $19 ÷ 750 credits ≈ $0.0253 per credit.
creditsCharged = max(1, ⌈totalUsd ÷ $0.0253⌉)
Non-zero usage always bills at least 1 credit. PR review runs use separate per-run catalog floors, not this token math.
Worked example · Kimi K2.6 standard rates
- Prompt tokens
- 100,000
- Completion tokens
- 50,000
- Input USD
- 0.100 M × $0.75/M = $0.0750
- Output USD
- 0.050 M × $3.75/M = $0.1875
- Total USD
- $0.2625
- Credits charged
- ⌈$0.2625 ÷ $0.0253⌉ = 11 credits
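The three billing steps and the worked example above can be sketched as one function. This is a rough client-side estimate, not the billing implementation: `estimateCredits`, `Usage`, and `Rates` are hypothetical names, the per-credit value uses the rounded $0.0253 from the formula, and treating zero usage as zero credits is an assumption (the text only guarantees a 1-credit floor for non-zero usage).

```typescript
// Sketch of the three billing steps; all names here are illustrative (assumptions).
const USD_PER_CREDIT = 0.0253; // rounded Solo-plan figure used in the formula above

interface Usage {
  promptTokens: number;
  completionTokens: number;
}

interface Rates {
  inputPerMillion: number; // USD per million prompt tokens
  outputPerMillion: number; // USD per million completion tokens
}

function estimateCredits(usage: Usage, rates: Rates): { totalUsd: number; credits: number } {
  // Steps 1 and 2: price prompt and completion tokens at their own rates.
  const inputUsd = (usage.promptTokens / 1_000_000) * rates.inputPerMillion;
  const outputUsd = (usage.completionTokens / 1_000_000) * rates.outputPerMillion;
  const totalUsd = inputUsd + outputUsd;
  // Step 3: round up to whole credits with a 1-credit floor.
  // Assumption: zero usage bills nothing.
  const credits = totalUsd > 0 ? Math.max(1, Math.ceil(totalUsd / USD_PER_CREDIT)) : 0;
  return { totalUsd, credits };
}

// Kimi K2.6 at standard rates, matching the worked example: 11 credits.
const kimi = estimateCredits(
  { promptTokens: 100_000, completionTokens: 50_000 },
  { inputPerMillion: 0.75, outputPerMillion: 3.75 },
);
```

For actual charges, rely on the billing headers returned with the response rather than a local estimate.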
Non-streaming responses include X-Critique-Credits-Charged and X-Critique-Estimated-Usd. Log retention tier applies 75% off token rates on eligible models when enabled in Settings or via X-Critique-DeepSeek-Training-Opt-In: true.
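Reading the two billing headers from a non-streaming response could look like the sketch below. The header names come from the sentence above; that their values are plain numeric strings is an assumption, and `readBillingHeaders` is a hypothetical helper. It uses the standard Fetch API `Headers` class.

```typescript
// Sketch only: readBillingHeaders is a hypothetical helper.
// Header names come from the text above; numeric string values are an assumption.
interface BillingInfo {
  creditsCharged: number | null;
  estimatedUsd: number | null;
}

function readBillingHeaders(headers: Headers): BillingInfo {
  const toNumber = (value: string | null): number | null => {
    if (value === null || value.trim() === "") return null;
    const parsed = Number(value);
    return Number.isFinite(parsed) ? parsed : null;
  };
  return {
    creditsCharged: toNumber(headers.get("X-Critique-Credits-Charged")),
    estimatedUsd: toNumber(headers.get("X-Critique-Estimated-Usd")),
  };
}

// A hand-built Headers object stands in for a real response here.
const sample = readBillingHeaders(
  new Headers({
    "X-Critique-Credits-Charged": "11",
    "X-Critique-Estimated-Usd": "0.2625",
  }),
);
```

With `fetch`, pass `response.headers` directly. Missing or non-numeric values come back as `null` so callers can tell "not reported" apart from zero.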
Quickstart
Point any OpenAI-compatible client at https://critique.sh/api/v1 with a crt_ key. Works in Cursor, Windsurf, Zed, VS Code, JetBrains, LangChain, Vercel AI SDK, CI evals, or a sidecar next to review.
- GET /api/v1/models
- POST /api/v1/chat/completions
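A call to the models endpoint might be prepared as in the sketch below. The base URL, path, and Bearer scheme come from this page. The response shape is an assumption: it is taken to follow the OpenAI list-models format (`{ data: [{ id }] }`), which this page does not confirm. `modelsRequest` and `extractModelIds` are hypothetical helpers.

```typescript
// Sketch only: helper names are illustrative. The response shape is ASSUMED to match
// the OpenAI list-models format ({ data: [{ id: "..." }] }); the source does not confirm it.
const BASE_URL = "https://critique.sh/api/v1";

interface PreparedRequest {
  url: string;
  init: { method: string; headers: Record<string, string> };
}

/** Builds the URL and options for GET /api/v1/models without sending anything. */
function modelsRequest(apiKey: string): PreparedRequest {
  return {
    url: `${BASE_URL}/models`,
    init: { method: "GET", headers: { Authorization: `Bearer ${apiKey}` } },
  };
}

/** Pulls model ids out of an OpenAI-style list body; returns [] if data is missing. */
function extractModelIds(body: { data?: Array<{ id: string }> }): string[] {
  return (body.data ?? []).map((model) => model.id);
}

// Usage (network call not executed here; "crt_..." is a placeholder key):
// const { url, init } = modelsRequest("crt_...");
// const ids = extractModelIds(await (await fetch(url, init)).json());
```

Keeping request construction separate from the network call makes the helper easy to check in CI without a live key.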
curl https://critique.sh/api/v1/chat/completions \
-H "Authorization: Bearer crt_..." \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek/deepseek-v4-flash",
"messages": [
{ "role": "user", "content": "Summarize this webhook retry design in three bullets." }
]
}'
OpenAI SDK
import OpenAI from "openai";
const client = new OpenAI({
apiKey: process.env.CRITIQUE_API_KEY,
baseURL: "https://critique.sh/api/v1",
});
const response = await client.chat.completions.create({
model: "deepseek/deepseek-v4-flash",
messages: [{ role: "user", content: "Draft a TypeScript interface for idempotent job enqueue." }],
});
console.log(response.choices[0]?.message?.content);
OpenAI-compatible client
// OpenAI-compatible — same shape in any client with a custom base URL
import OpenAI from "openai";
export const inference = new OpenAI({
apiKey: process.env.CRITIQUE_API_KEY,
baseURL: "https://critique.sh/api/v1",
});
// Agent loop, sidecar, CI eval, or IDE extension
const res = await inference.chat.completions.create({
model: "moonshotai/kimi-k2.6",
messages: [{ role: "user", content: "Refactor this handler for idempotency." }],
});
API keys
Scopes: read:inference / write:inference
Create your Inference API key
Keys use the crt_ prefix with read:inference and write:inference scopes. Sign in to generate a key here — the full secret is shown once.
FAQ
Why use Critique as my inference layer?
Same crt_ keys and credit pool as review and Builder, OpenAI-compatible endpoints, Western-region routing, and list rates tuned for agents that burn tokens all day. Point Cursor, Windsurf, Zed, VS Code, JetBrains, your harness, or any OpenAI client at https://critique.sh/api/v1, with no separate provider account on managed billing.
Do you store or train on my prompts?
Private tier (default): no log retention for model training — requests are metered for billing only. Log retention tier (optional, 75% off on eligible models): you opt in explicitly; retained prompts and completions help improve models, routing, and the Critique platform. Toggle in Settings → Connections or per request with X-Critique-DeepSeek-Training-Opt-In.
What is log retention pricing (75% off)?
On DeepSeek V4 Flash, Hy3 Preview, Kimi K2.6, GLM-5.1, and Trinity Large Thinking you can opt into short-term log retention and pay 25% of list price (75% off). Nemotron rates are unchanged. Accept the conditions, enable account-wide in Settings, or send X-Critique-DeepSeek-Training-Opt-In: true per request.
How is this different from the Coding Agent API?
The Coding Agent API runs a full sandboxed repo agent (clone, edit, optional draft PR). The Inference API is token-in/token-out chat completions for your own apps, agents, and tools — same crt_ keys and credit pool, lighter contract.
Where do requests run?
Critique routes Inference API traffic through Western-region infrastructure. Private and log-retention tiers both use Western-hosted routing.
Need a full sandbox agent? Coding Agent API · Default model: deepseek/deepseek-v4-flash