Arcee Trinity-Large-Thinking Lands in Critique at 1 Credit
A serious open-weight agent model is only strategically interesting when it is both ownable and cheap enough to route in production. Trinity-Large-Thinking clears that bar.
Trinity-Large-Thinking: open-weight agentic reasoning, priced like a default lane.
Arcee positions Trinity-Large-Thinking as a frontier open reasoning model for long-horizon agents, multi-turn tool use, and cleaner instruction following. The interesting product story is not just that it is open. It is that the OpenRouter SKU lands in Critique at 1 credit, making it an unusually cheap way to buy real agentic depth.
Agentic signal at 1 credit
When a new model arrives in Critique, the relevant question is never just whether another benchmark row went up by three points. The real question is whether the model changes routing decisions. Can it credibly sit in front of real pull requests, hold onto context across long tool loops, stay cheap enough for broad coverage, and still produce review signal that feels better than a toy? Trinity-Large-Thinking is interesting because the answer looks much closer to yes than most open releases.
PART ONE — WHAT TRINITY-LARGE-THINKING ACTUALLY IS
Arcee launched Trinity-Large-Thinking on April 1, 2026 as the reasoning-oriented release of its Trinity-Large family, with weights on Hugging Face under Apache 2.0 and hosted API access through Arcee and OpenRouter. The company frames it as a frontier open reasoning model for complex agents, multi-turn tool use, and long-horizon loops. That framing matters because the model is not being sold as a generic chat endpoint. It is explicitly being sold as agent infrastructure.
The official docs describe Trinity-Large-Thinking as a 398B-parameter sparse Mixture-of-Experts model with roughly 13B active parameters per token, 256 experts with 4 active, Grouped Query Attention, and 17 trillion training tokens. Arcee’s docs also emphasize reasoning traces as part of the contract: the model emits explicit thinking before the final answer, and those thinking tokens are supposed to remain in context during multi-turn conversations and agent loops. That is a stronger claim than “supports reasoning.” It is a statement about how the model expects to be used.
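In API terms, that contract implies an agent loop that appends the model’s full assistant turns, reasoning included, back into the message list instead of keeping only the final answer text. Here is a minimal sketch against the OpenRouter route, assuming an OpenAI-compatible chat endpoint; exactly how the trace is surfaced in the response is our assumption, not a documented API shape.

```python
# Minimal agent-loop sketch: keep reasoning traces in context across turns.
# Assumes an OpenAI-compatible endpoint via OpenRouter; which response field
# carries the reasoning trace is an assumption, not a documented contract.
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="...")
messages = [{"role": "user", "content": "Review this diff: ..."}]

for _ in range(4):  # a few tool-use turns
    resp = client.chat.completions.create(
        model="arcee-ai/trinity-large-thinking",
        messages=messages,
        # tools=[...]  # tool definitions elided for brevity
    )
    msg = resp.choices[0].message
    # Append the full assistant turn, reasoning content included,
    # rather than stripping it down to the visible answer text.
    messages.append(msg.model_dump(exclude_none=True))
    if not msg.tool_calls:
        break
    # ... execute tool calls here and append {"role": "tool", ...} results ...
```

The point of the sketch is the `messages.append` line: trimming that turn down to its visible text is exactly the pattern Arcee’s guidance warns against.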
PART TWO — THE BENCHMARK PICTURE IS GOOD, NOT MAGIC
The official Arcee docs and launch post paint a coherent benchmark story: Trinity-Large-Thinking looks strongest on agent-shaped and hard-reasoning evaluations, especially Tau2-Telecom, PinchBench, and AIME25, while landing below the very top closed frontier models on some broad-knowledge, instruction-following, and software-engineering measures. That is the right way to read it. This is not “beats Opus everywhere.” It is “more agentic signal than you would normally expect at this price, with enough openness to change the buying decision.”
Selected rows from Arcee’s Trinity-Large-Thinking docs. Percent scores shown as published there; treat cross-vendor comparisons as directional rather than perfectly apples-to-apples.
| Model | PinchBench | Tau2-Telecom | AIME25 | SWE-bench Verified |
|---|---|---|---|---|
| Trinity-Large-Thinking | 91.9 | 94.7 | 96.3 | 63.2 |
| Opus-4.6 | 93.3 | 92.1 | 99.8 | 75.6 |
| GLM-5 | 86.4 | 98.2 | 93.3 | 72.8 |
| MiniMax-M2.7 | 89.8 | 84.8 | 80.0 | 75.4 |
| Kimi-K2.5 | 84.8 | 95.9 | 96.3 | 70.8 |
A footnote in the docs notes that all SWE-bench Verified scores were produced in the `mini-swe-agent-v2` harness.
Source: Arcee’s Trinity-Large-Thinking documentation, which publishes scores across general reasoning, agentic, and software-engineering evaluations. High scores on agentic tasks do not automatically imply best-in-class coding on every harness.
The honest read is important. Trinity-Large-Thinking is not the strongest model in Arcee’s published comparison on GPQA, IFBench, or SWE-bench Verified (the first two appear in the fuller docs table rather than the excerpt above). Opus still looks stronger on several broad and coding-heavy axes, and GLM-5 or MiniMax-M2.7 can beat it on specific tasks. The interesting part is that Trinity stays close enough on the agentic evaluations that matter for tool use and planning while remaining radically cheaper and openly licensed.
PART THREE — WHY THIS FITS CRITIQUE SPECIFICALLY
Critique does not need every model to be universally best. It needs distinct lanes that make sense. Trinity-Large-Thinking naturally fits the lane where you want more long-horizon reasoning and better tool discipline than typical low-cost models provide, but you do not want to spend Claude Sonnet, GPT-5.4, or Opus credits on every review. That makes it valuable in three roles: an open-weight lead when cost matters, a specialist when a workflow is tool-heavy, and a Remedy option when the fix loop benefits from stronger multi-step planning.
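To make those lanes concrete, here is a hypothetical routing rule of the kind we mean. The policy and thresholds are illustrative, not Critique’s actual implementation; only the Trinity SKU is a real OpenRouter identifier.

```python
# Hypothetical lane-routing sketch. The policy is illustrative only;
# "premium-closed-frontier" is a placeholder, not a real SKU.
def pick_review_model(tool_heavy: bool, budget_credits: int,
                      deep_fix_loop: bool) -> str:
    if tool_heavy or deep_fix_loop:
        # Long-horizon tool loops and Remedy-style fix planning.
        return "arcee-ai/trinity-large-thinking"
    if budget_credits <= 2:
        # Broad, cheap coverage: the open-weight lead lane.
        return "arcee-ai/trinity-large-thinking"
    # Escalate only when the review genuinely warrants a premium closed model.
    return "premium-closed-frontier"
```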
The second reason it fits is ownership. Arcee’s launch post is unusually direct about why they are releasing this under Apache 2.0: developers and enterprises need models they can inspect, distill, host, and own. That matters to us because Critique sits in the part of the stack where trust and governance become product questions. Open weights do not automatically make a model better, but they absolutely make a model more governable.
What we think matters most when deciding whether to actually route this model.
| Metric | Value | Why it matters |
|---|---|---|
| License | Apache 2.0 | Open-weight availability changes the ownership story for enterprises and infra teams. |
| OpenRouter SKU | `arcee-ai/trinity-large-thinking` | This is the hosted route we exposed in Critique. |
| OpenRouter pricing | $0.22/M input · $0.85/M output | Keeps the model in the “default lane” conversation, not the “emergency escalation only” lane. |
| Context exposed in Critique | 262,144 tokens | This is the current OpenRouter route context; Arcee’s docs describe a larger extended window for direct deployments. |
| Model shape | 398B total · 13B active | Shows why the model can target frontier-style behavior without frontier dense-model inference cost. |
Arcee docs separately mention a 512K extended context window in reasoning-trace guidance. Our catalog entry reflects the OpenRouter-hosted route currently exposed in product.
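For intuition about what that pricing means in dollars, here is the arithmetic on one hypothetical large review call; the token counts are invented, and only the per-token rates come from the table above.

```python
# Back-of-envelope dollar cost for one call on the OpenRouter route.
# Rates are from the published pricing; token counts are hypothetical.
INPUT_RATE = 0.22 / 1_000_000   # USD per input token
OUTPUT_RATE = 0.85 / 1_000_000  # USD per output token

def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single call at the published per-token rates."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# A large PR: ~120K tokens of diff and context in, ~8K tokens of review out.
print(f"${call_cost(120_000, 8_000):.4f}")  # ≈ $0.0332
```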
PART FOUR — WHY 1 CREDIT IS THE ACTUAL STORY
Many model launches are strategically irrelevant because the economics never let them escape “demo mode.” Trinity-Large-Thinking is more interesting than that. A 1-credit floor means you can plausibly use it for broad review coverage, not just for the once-a-day nightmare PR. It also means teams can experiment with an openly licensed agentic model without immediately dragging review economics into the same territory as the premium closed frontier stack.
That does not mean one credit buys the whole workflow. In Critique, the total cost of a review still depends on the lead model, specialist fan-out, PR depth, and whether Remedy runs or re-review loops are needed. But the floor still matters because it determines whether a model gets considered at all. Trinity-Large-Thinking now clears that first gate comfortably.
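To see how the floor composes in practice, here is an invented credit tally for one review; every number except the 1-credit Trinity floor is hypothetical.

```python
# Illustrative credit accounting for one Critique review.
# Only the 1-credit Trinity floor comes from the catalog; the rest is invented.
lead = 1                 # Trinity-Large-Thinking as the lead reviewer
specialists = 3 * 1      # a hypothetical fan-out of three 1-credit specialists
remedy = 2               # a hypothetical Remedy fix loop
print(lead + specialists + remedy)  # 6 credits end to end, vs. a 1-credit floor
```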
PART FIVE — CAVEATS, RESEARCH NOTES, AND WHAT WE WOULD WATCH
There are at least three caveats worth keeping in view. First, benchmark comparability across vendors is always messy. Second, Arcee’s own docs make context handling a product requirement because the model is designed around preserved reasoning traces; teams that aggressively trim assistant state may undercut the exact behavior they were trying to buy. Third, the software-engineering story is good, but not yet obviously superior to the strongest closed models or the very best open competitors on every coding harness.
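As a concrete version of that second caveat, a trimming policy that drops stale tool output while keeping assistant turns, and the reasoning they carry, intact would respect the model’s contract. A sketch, with a heuristic that is ours rather than Arcee’s guidance:

```python
# Sketch of a context-trim policy that preserves reasoning-bearing turns.
# Drops older tool results first; the heuristic is ours, not Arcee's guidance.
def trim_context(messages: list[dict], keep_recent: int = 20) -> list[dict]:
    head, tail = messages[:-keep_recent], messages[-keep_recent:]
    # In older history, drop bulky tool outputs but keep assistant and user
    # turns, so the reasoning trace the model depends on survives.
    kept_head = [m for m in head if m.get("role") != "tool"]
    return kept_head + tail
```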
What we would watch in practice is not whether Trinity wins one more benchmark chart on launch week. We would watch whether it stays coherent across long specialist loops, whether it tool-calls cleanly under constraint, whether its review comments feel stable across large pull requests, and whether its open-weight posture becomes a real buying advantage for teams that care about control. Those are the questions that decide whether a model becomes infrastructure.