Critique Inference

OpenAI-compatible inference. Ship from the IDE you already use.

Use Critique as raw token inference for Cursor, Windsurf, Zed, VS Code, JetBrains, your agent harness, eval loops, or any OpenAI client. List rates are priced for fast iteration. Opt into log retention on eligible models and cut your token bill by 75% if you're OK with your logs helping improve models and the platform.

Cheap raw inference

OpenAI-compatible chat completions on frontier coding models — Kimi K2.6, GLM-5.1, DeepSeek V4 Flash, and more. Pay per token from Critique credits. Built for agents that burn tokens all day.

75% off with log retention

Need the bill lower? Opt in to short-term log retention on eligible models and pay 25% of list price. Private-by-default stays at standard rates — no retention, no discount.

Logs improve the stack

Retained prompts and completions help us tune models, routing, and the Critique platform. You ship faster on cheaper tokens; we get signal to make the system better for everyone.

  • OpenAI-compatible
  • Bearer crt_ keys
  • Western-hosted
  • Pay per token
  • Cursor
  • Windsurf
  • Zed
  • VS Code
  • JetBrains

Rates

Private list rates for fast iteration, or log-retention pricing at 75% off. Same Critique credit pool.

Builder pricing

Among the lowest rates we publish for Kimi K2.6, GLM-5.1 & more

OpenAI-compatible endpoints for Cursor, Windsurf, Zed, VS Code, JetBrains, LangChain, Vercel AI SDK, or your own harness. List rates are tuned for fast iteration. Opt into log retention on eligible models and cut the bill by 75%; those logs help us improve models and the platform.

NVIDIA Nemotron 3 Ultra

Frontier MoE intro

Private / M
Output / M

Kimi K2.6

Multimodal agent

Private / M
Log retention / M
$0.188 in
$0.938 out

GLM-5.1

Long-horizon coding

Private / M
Log retention / M
$0.237 in
$0.838 out

Trinity Large Thinking

Open reasoning

Private / M
Log retention / M
$0.0625 in
$0.250 out

Tencent Hy3 Preview

10% below market

Private / M
Log retention / M
$0.0142 in
$0.0473 out

DeepSeek V4 Flash

Default · fast lane

Private / M
Log retention / M
$0.0375 in
$0.0750 out

Per-million-token list prices on the Critique Inference API. Your bill converts to Critique credits at Solo-plan economics. Log retention pricing requires opt-in; see conditions below.

Log retention deal — 75% off eligible models

Standard pricing keeps prompts private with no retention. If you're building in Cursor, Windsurf, Zed, VS Code, JetBrains, your own harness, or any OpenAI client and want the lowest bill, opt in: we retain prompts and completions for a limited period to improve models and the Critique platform, and you pay 75% less per token on Kimi K2.6, GLM-5.1, Trinity Large Thinking, DeepSeek V4 Flash, and Hy3 Preview.

  • Log retention applies to Inference API traffic on eligible models only — not PR review runs or Builder.
  • Retained prompts and completions are used to improve models, routing, and the Critique platform.
  • Private tier stays available at list rates with no log retention.
  • Revoke anytime in Settings → Connections or per request with X-Critique-DeepSeek-Training-Opt-In: false.
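
The per-request toggle in the last bullet can be sketched as a small request builder. This is a minimal sketch, not Critique's own client: `buildChatRequest` and its parameter names are hypothetical, and treating an omitted header as "use the account setting" is an assumption. Only the base URL, the Bearer crt_ key, and the X-Critique-DeepSeek-Training-Opt-In header with true/false values come from this page.

```typescript
// Hypothetical helper: builds a fetch-style request for the chat completions endpoint.
const BASE_URL = "https://critique.sh/api/v1";

interface ChatRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

function buildChatRequest(
  apiKey: string,
  model: string,
  prompt: string,
  logRetentionOptIn?: boolean, // undefined = send no header (assumed to defer to the account setting)
): ChatRequest {
  const headers: Record<string, string> = {
    Authorization: `Bearer ${apiKey}`,
    "Content-Type": "application/json",
  };
  // Only attach the override header when the caller made an explicit choice.
  if (logRetentionOptIn !== undefined) {
    headers["X-Critique-DeepSeek-Training-Opt-In"] = String(logRetentionOptIn);
  }
  return {
    url: `${BASE_URL}/chat/completions`,
    init: {
      method: "POST",
      headers,
      body: JSON.stringify({ model, messages: [{ role: "user", content: prompt }] }),
    },
  };
}
```

A caller could pass the result straight to `fetch(request.url, request.init)`; passing `false` forces the private tier for that one request.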

Log retention deal · 75% off

Eligible models: DeepSeek V4 Flash, Tencent Hy3 Preview, Kimi K2.6, GLM-5.1, and Trinity Large Thinking. Opt into log retention on these models and pay 75% less per token. The private tier stays at list rates. Nemotron rates are unchanged.
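
For budgeting in your own tooling, the eligibility rule can be written as a lookup. This is a rough sketch under stated assumptions: `rateMultiplier` and `LOG_RETENTION_ELIGIBLE` are hypothetical names, and the real eligibility check happens on Critique's side. The model IDs and the 25%-of-list figure are taken from this page.

```typescript
// Model IDs copied from this page; Nemotron is deliberately absent because its rates are unchanged.
const LOG_RETENTION_ELIGIBLE = new Set<string>([
  "deepseek/deepseek-v4-flash",
  "tencent/hy3-preview",
  "moonshotai/kimi-k2.6",
  "z-ai/glm-5.1",
  "arcee-ai/trinity-large-thinking",
]);

// Returns the factor applied to list token rates: 0.25 when opted in on an eligible model, else 1.
function rateMultiplier(modelId: string, optedIn: boolean): number {
  return optedIn && LOG_RETENTION_ELIGIBLE.has(modelId) ? 0.25 : 1;
}
```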

Default

DeepSeek V4 Flash

DeepSeek V4 Flash is an efficiency-optimized Mixture-of-Experts model from DeepSeek with 284B total parameters and 13B activated parameters, supporting a 1M-token context window. It is designed for fast inference and high-throughput workloads, while maintaining strong reasoning and coding performance.

Context: 1M
Architecture: 284B MoE (13B active)
Review catalog: 0.5 credits / PR review

Best for

  • Coding agents
  • 1M context
  • High-throughput inference
Input
Output

Private by default · Western-hosted

deepseek/deepseek-v4-flash

10% below market

Tencent Hy3 Preview

Hy3 Preview is a high-efficiency Mixture-of-Experts model from Tencent designed for agentic workflows and production use. It supports configurable reasoning levels across disabled, low, and high modes, allowing it to balance speed and depth depending on the task, while delivering strong code generation and reliable performance across multi-step, real-world workflows.

Context: 262K
Architecture: 205B MoE
Review catalog: 0.5 credits / PR review

Best for

  • Agentic workflows
  • Configurable reasoning
  • Production code
Input
Output

Western-hosted · no logs retained

tencent/hy3-preview

Intro pricing

NVIDIA Nemotron 3 Ultra

NVIDIA Nemotron 3 Ultra is an open frontier-reasoning and orchestration model from NVIDIA, with 55B active parameters out of 550B total (MoE). Built on a hybrid Transformer-Mamba mixture-of-experts architecture, it supports text input and output with a context window of up to 1M tokens. It is suited for long-running agentic workflows, including agent orchestration, coding agents, deep research, and complex enterprise tasks. It is particularly strong at multi-step reasoning and planning, with high-throughput inference designed for high-volume agent pipelines. It is part of the NVIDIA Nemotron family of open models for agentic AI.

Context: 1M
Architecture: 550B MoE (55B active)
Review catalog: 2 credits intro · 3 credits shelf

Best for

  • Agent orchestration
  • Deep research
  • Multi-step planning
Input
Output

Intro API rates through June 19, 2026 (UTC)

nvidia/nemotron-3-ultra-550b-a55b

Multimodal agent

Kimi K2.6

Kimi K2.6 is Moonshot AI's next-generation multimodal model, designed for long-horizon coding, coding-driven UI/UX generation, and multi-agent orchestration. It handles complex end-to-end coding tasks across Python, Rust, and Go, and can convert prompts and visual inputs into production-ready interfaces. Its agent swarm architecture scales to hundreds of parallel sub-agents for autonomous task decomposition — delivering documents, websites, and spreadsheets in a single run without human oversight.

Context: 262K
Architecture: 1T MoE (32B active)
Review catalog: 4 credits / PR review

Best for

  • Long-horizon coding
  • Multimodal UI generation
  • Agent swarms
Input
Output

Western-hosted · multimodal coding and orchestration

moonshotai/kimi-k2.6

Long-horizon coding

GLM-5.1

GLM-5.1 delivers a major leap in coding capability, with particularly significant gains in handling long-horizon tasks. Unlike previous models built around minute-level interactions, GLM-5.1 can work independently and continuously on a single task for more than 8 hours, autonomously planning, executing, and improving itself throughout the process, ultimately delivering complete, engineering-grade results.

Context: 203K
Architecture: Frontier coding MoE
Review catalog: 3 credits / PR review

Best for

  • 8+ hour autonomous coding
  • Long-horizon tasks
  • Engineering-grade output
Input
Output

Western-hosted · extended autonomous coding loops

z-ai/glm-5.1

Open reasoning

Trinity Large Thinking

Trinity Large Thinking is a powerful open-source reasoning model from Arcee AI. It shows strong performance on PinchBench, agentic workloads, and reasoning tasks.

Context: 262K
Architecture: 400B MoE (13B active)
Review catalog: 1 credit / PR review

Best for

  • Open-weight reasoning
  • Agentic workloads
  • PinchBench-scale agents
Input
Output

Apache 2.0 weights · strong agentic and reasoning signal

arcee-ai/trinity-large-thinking

Model                                    Model ID                           Architecture           Context  Input / M  Output / M
DeepSeek V4 Flash (Default)              deepseek/deepseek-v4-flash         284B MoE (13B active)  1M
Tencent Hy3 Preview (10% below market)   tencent/hy3-preview                205B MoE               262K
NVIDIA Nemotron 3 Ultra (Intro pricing)  nvidia/nemotron-3-ultra-550b-a55b  550B MoE (55B active)  1M
Kimi K2.6 (Multimodal agent)             moonshotai/kimi-k2.6               1T MoE (32B active)    262K
GLM-5.1 (Long-horizon coding)            z-ai/glm-5.1                       Frontier coding MoE    203K
Trinity Large Thinking (Open reasoning)  arcee-ai/trinity-large-thinking    400B MoE (13B active)  262K

Per-token USD equivalents · billed from Critique credits · responses include X-Critique-Credits-Charged

Private by default; log retention is optional for the discount.
Full pricing docs →

  • DeepSeek · DeepSeek V4 Flash
    deepseek/deepseek-v4-flash
  • Tencent · Tencent Hy3 Preview
    tencent/hy3-preview
  • Nvidia · NVIDIA Nemotron 3 Ultra
    nvidia/nemotron-3-ultra-550b-a55b
  • Kimi · Kimi K2.6
    moonshotai/kimi-k2.6
  • Z.ai · GLM-5.1
    z-ai/glm-5.1
  • Arcee · Trinity Large Thinking
    arcee-ai/trinity-large-thinking
  • AntGroup · Western-hosted
    No training on your prompts

Billing

Token usage converts to USD at the model rate card, then to Critique credits from your shared balance.

How we bill your account

Inference API calls draw from the same Critique credit balance as PR review and Builder. We meter actual prompt and completion tokens from the upstream response, convert that spend to USD using the active model rate card, then round up to whole credits.

  1. Count tokens

    prompt_tokens are billed at the model's input rate. completion_tokens (including reasoning tokens when reported) are billed at the output rate.

  2. Convert to USD

    Each token count is divided by 1,000,000 and multiplied by the matching per-million rate from the active model rate card.

  3. Convert to Critique credits

    We use Solo-plan economics: $19 ÷ 750 credits = $0.0253 per credit.

    Non-zero usage always bills at least 1 credit. PR review runs use separate per-run catalog floors, not this token math.

Worked example · Kimi K2.6 standard rates

Prompt tokens
100,000
Completion tokens
50,000
Input USD
Output USD
Total USD
$0.2625
Credits charged
⌈$0.2625 ÷ $0.0253⌉ = 11 credits
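
The three billing steps can be sketched as one function for estimating spend before you send traffic. This is an illustration, not Critique's metering code: `estimateCharge`, `RateCard`, and `Charge` are hypothetical names. The Kimi K2.6 list rates of $0.75 in and $3.75 out per million are an assumption, inferred from the worked total of $0.2625 and from the $0.188 / $0.938 figures on the rate card being 25% of those values. Check the live rate card before relying on them.

```typescript
// Solo-plan economics from this page: $19 buys 750 credits (about $0.0253 per credit).
const USD_PER_CREDIT = 19 / 750;

interface RateCard {
  inputUsdPerMillion: number;
  outputUsdPerMillion: number;
}

interface Charge {
  usd: number;
  credits: number;
}

function estimateCharge(promptTokens: number, completionTokens: number, rates: RateCard): Charge {
  // Steps 1 and 2: meter tokens, then price each side at its per-million rate.
  const usd =
    (promptTokens / 1_000_000) * rates.inputUsdPerMillion +
    (completionTokens / 1_000_000) * rates.outputUsdPerMillion;
  // Step 3: round up to whole credits, so any non-zero spend bills at least 1 credit.
  const credits = usd > 0 ? Math.ceil(usd / USD_PER_CREDIT) : 0;
  return { usd, credits };
}

// ASSUMPTION: inferred Kimi K2.6 list rates, not quoted from the page.
const kimiListRatesAssumed: RateCard = { inputUsdPerMillion: 0.75, outputUsdPerMillion: 3.75 };
const workedExample = estimateCharge(100_000, 50_000, kimiListRatesAssumed);
```

Under those assumed rates the sketch reproduces the worked example: about $0.2625, rounded up to 11 credits.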

Non-streaming responses include X-Critique-Credits-Charged and X-Critique-Estimated-Usd. Log retention tier applies 75% off token rates on eligible models when enabled in Settings or via X-Critique-DeepSeek-Training-Opt-In: true.
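
To log what a call actually cost, you can read those two headers off the response. This is a minimal sketch assuming both header values are plain decimal strings, which this page does not specify; `readBillingHeaders` and `BillingInfo` are hypothetical names.

```typescript
interface BillingInfo {
  creditsCharged: number | null;
  estimatedUsd: number | null;
}

// Reads the billing headers from a fetch Response's Headers object.
// Returns null for a header that is missing, empty, or not numeric.
function readBillingHeaders(headers: Headers): BillingInfo {
  const parse = (name: string): number | null => {
    const raw = headers.get(name);
    if (raw === null || raw.trim() === "") return null;
    const value = Number(raw);
    return Number.isFinite(value) ? value : null;
  };
  return {
    creditsCharged: parse("X-Critique-Credits-Charged"),
    estimatedUsd: parse("X-Critique-Estimated-Usd"),
  };
}
```

With `fetch`, pass `response.headers` directly. Since the headers are documented for non-streaming responses only, the nulls let streaming callers handle their absence.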

Quickstart

Point any OpenAI-compatible client at https://critique.sh/api/v1 with a crt_ key. Works in Cursor, Windsurf, Zed, VS Code, JetBrains, LangChain, Vercel AI SDK, CI evals, or a sidecar next to review.

/api/v1/models
/api/v1/chat/completions
curl https://critique.sh/api/v1/chat/completions \
  -H "Authorization: Bearer crt_..." \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek/deepseek-v4-flash",
    "messages": [
      { "role": "user", "content": "Summarize this webhook retry design in three bullets." }
    ]
  }'
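
The /api/v1/models endpoint has no example above, so here is a hedged sketch of calling it. It assumes the endpoint answers a GET with an OpenAI-style list payload (`{ "data": [{ "id": "..." }] }`), which follows from the OpenAI-compatible claim but is not confirmed on this page. `listModels` and `extractModelIds` are hypothetical helper names.

```typescript
const MODELS_URL = "https://critique.sh/api/v1/models";

// ASSUMPTION: OpenAI-style list payload. Anything that does not match yields an empty list.
function extractModelIds(payload: unknown): string[] {
  if (typeof payload !== "object" || payload === null) return [];
  const data = (payload as { data?: unknown }).data;
  if (!Array.isArray(data)) return [];
  return data
    .map((entry) => (entry as { id?: unknown } | null)?.id)
    .filter((id): id is string => typeof id === "string");
}

// Fetches the model list with a crt_ key and returns just the model IDs.
async function listModels(apiKey: string): Promise<string[]> {
  const res = await fetch(MODELS_URL, { headers: { Authorization: `Bearer ${apiKey}` } });
  if (!res.ok) throw new Error(`GET /models failed with HTTP ${res.status}`);
  return extractModelIds(await res.json());
}
```

The returned IDs are what you would pass as `model` in the chat completions examples.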

OpenAI SDK

import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.CRITIQUE_API_KEY,
  baseURL: "https://critique.sh/api/v1",
});

const response = await client.chat.completions.create({
  model: "deepseek/deepseek-v4-flash",
  messages: [{ role: "user", content: "Draft a TypeScript interface for idempotent job enqueue." }],
});

console.log(response.choices[0]?.message?.content);

OpenAI-compatible client

// OpenAI-compatible — same shape in any client with a custom base URL
import OpenAI from "openai";

export const inference = new OpenAI({
  apiKey: process.env.CRITIQUE_API_KEY,
  baseURL: "https://critique.sh/api/v1",
});

// Agent loop, sidecar, CI eval, or IDE extension
const res = await inference.chat.completions.create({
  model: "moonshotai/kimi-k2.6",
  messages: [{ role: "user", content: "Refactor this handler for idempotency." }],
});

API keys

Scopes: read:inference / write:inference

Create your Inference API key

Keys use the crt_ prefix with read:inference and write:inference scopes. Sign in to generate a key here — the full secret is shown once.
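
Because the full secret is shown only once, it can help to fail fast when the wrong key ends up in your environment. This is a small illustrative guard: `requireCritiqueKey` is a hypothetical name, and the only fact it relies on from this page is the crt_ prefix.

```typescript
// Hypothetical guard: validates a key string before it is handed to an OpenAI-compatible client.
function requireCritiqueKey(value: string | undefined): string {
  if (!value) throw new Error("CRITIQUE_API_KEY is not set");
  const key = value.trim();
  if (!key.startsWith("crt_")) {
    throw new Error("Expected a Critique key with the crt_ prefix");
  }
  return key;
}
```

In the SDK examples above, this would wrap the environment lookup, for example `apiKey: requireCritiqueKey(process.env.CRITIQUE_API_KEY)`.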

FAQ

Why use Critique as my inference layer?

Same crt_ keys and credit pool as review and Builder, OpenAI-compatible endpoints, Western-region routing, and list rates tuned for agents that burn tokens all day. Point Cursor, Windsurf, Zed, VS Code, JetBrains, your harness, or any OpenAI client at https://critique.sh/api/v1; managed billing needs no separate provider account.

Do you store or train on my prompts?

Private tier (default): no log retention for model training — requests are metered for billing only. Log retention tier (optional, 75% off on eligible models): you opt in explicitly; retained prompts and completions help improve models, routing, and the Critique platform. Toggle in Settings → Connections or per request with X-Critique-DeepSeek-Training-Opt-In.

What is log retention pricing (75% off)?

On DeepSeek V4 Flash, Hy3 Preview, Kimi K2.6, GLM-5.1, and Trinity Large Thinking you can opt into short-term log retention and pay 25% of list price (75% off). Nemotron rates are unchanged. Accept the conditions, enable account-wide in Settings, or send X-Critique-DeepSeek-Training-Opt-In: true per request.

How is this different from the Coding Agent API?

The Coding Agent API runs a full sandboxed repo agent (clone, edit, optional draft PR). The Inference API is token-in/token-out chat completions for your own apps, agents, and tools — same crt_ keys and credit pool, lighter contract.

Where do requests run?

Critique routes Inference API traffic through Western-region infrastructure. Private and log-retention tiers both use Western-hosted routing.

Read the full API reference →

Need a full sandbox agent? Coding Agent API · Default model: deepseek/deepseek-v4-flash