Arcee Trinity-Large-Thinking Lands in Critique at 1 Credit
A serious open-weight agent model is only strategically interesting when it is both ownable and cheap enough to route in production. Trinity-Large-Thinking clears that bar.
Trinity-Large-Thinking: open-weight agentic reasoning, priced like a default lane.
Arcee positions Trinity-Large-Thinking as a frontier open reasoning model for long-horizon agents, multi-turn tool use, and cleaner instruction following. The interesting product story is not just that it is open. It is that the OpenRouter SKU lands in Critique at 1 credit, making it an unusually cheap way to buy real agentic depth.
Agentic signal at 1 credit
When a new model arrives in Critique, the relevant question is never just whether another benchmark row went up by three points. The real question is whether the model changes routing decisions. Can it credibly sit in front of real pull requests, hold onto context across long tool loops, stay cheap enough for broad coverage, and still produce review signal that feels better than a toy? Trinity-Large-Thinking is interesting because the answer looks much closer to yes than most open releases.
PART ONE — WHAT TRINITY-LARGE-THINKING ACTUALLY IS
Arcee launched Trinity-Large-Thinking on April 1, 2026 as the reasoning-oriented release of its Trinity-Large family, with weights on Hugging Face under Apache 2.0 and hosted API access through Arcee and OpenRouter. The company frames it as a frontier open reasoning model for complex agents, multi-turn tool use, and long-horizon loops. That framing matters because the model is not being sold as a generic chat endpoint. It is explicitly being sold as agent infrastructure.
The official docs describe Trinity-Large-Thinking as a 398B-parameter sparse Mixture-of-Experts model with roughly 13B active parameters per token, 256 experts with 4 active, Grouped Query Attention, and 17 trillion training tokens. Arcee’s docs also emphasize reasoning traces as part of the contract: the model emits explicit thinking before the final answer, and those thinking tokens are supposed to remain in context during multi-turn conversations and agent loops. That is a stronger claim than “supports reasoning.” It is a statement about how the model expects to be used.
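In API terms, that contract implies an agent loop that appends the model’s full assistant turns, reasoning included, back into the message list instead of keeping only the final answer text. Here is a minimal sketch against the OpenRouter route, assuming an OpenAI-compatible chat endpoint; exactly how the trace is surfaced in the response is our assumption, not a documented API shape.

```python
# Minimal agent-loop sketch: keep reasoning traces in context across turns.
# Assumes an OpenAI-compatible endpoint via OpenRouter; which response field
# carries the reasoning trace is an assumption, not a documented contract.
from openai import OpenAI

client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key="...")
messages = [{"role": "user", "content": "Review this diff: ..."}]

for _ in range(4):  # a few tool-use turns
    resp = client.chat.completions.create(
        model="arcee-ai/trinity-large-thinking",
        messages=messages,
        # tools=[...]  # tool definitions elided for brevity
    )
    msg = resp.choices[0].message
    # Append the full assistant turn, reasoning content included,
    # rather than stripping it down to the visible answer text.
    messages.append(msg.model_dump(exclude_none=True))
    if not msg.tool_calls:
        break
    # ... execute tool calls here and append {"role": "tool", ...} results ...
```

The point of the sketch is the `messages.append` line: trimming that turn down to its visible text is exactly the pattern Arcee’s guidance warns against.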
PART TWO — THE BENCHMARK PICTURE IS GOOD, NOT MAGIC
The official Arcee docs and launch post paint a coherent benchmark story: Trinity-Large-Thinking looks strongest on agent-shaped and hard-reasoning evaluations, especially Tau2-Telecom, PinchBench, and AIME25, while landing below the very top closed frontier models on some broad-knowledge, instruction-following, and software-engineering measures. That is the right way to read it. This is not “beats Opus everywhere.” It is “more agentic signal than you would normally expect at this price, with enough openness to change the buying decision.”
Selected rows from Arcee’s Trinity-Large-Thinking docs. Percent scores shown as published there; treat cross-vendor comparisons as directional rather than perfectly apples-to-apples.
| Model | PinchBench | Tau2-Telecom | AIME25 | SWE-bench Verified |
|---|---|---|---|---|
| Trinity-Large-Thinking | 91.9 | 94.7 | 96.3 | 63.2 |
| Opus-4.6 | 93.3 | 92.1 | 99.8 | 75.6 |
| GLM-5 | 86.4 | 98.2 | 93.3 | 72.8 |
| MiniMax-M2.7 | 89.8 | 84.8 | 80.0 | 75.4 |
| Kimi-K2.5 | 84.8 | 95.9 | 96.3 | 70.8 |
A footnote in the docs notes that all SWE-bench Verified scores were produced in the `mini-swe-agent-v2` harness.
Source: Arcee’s Trinity-Large-Thinking documentation, which publishes scores across general reasoning, agentic, and software-engineering evaluations. High scores on agentic tasks do not automatically imply best-in-class coding on every harness.
The honest read is important. Trinity-Large-Thinking is not the strongest model in Arcee’s published comparison on GPQA, IFBench, or SWE-bench Verified (the first two appear in the fuller docs table rather than the excerpt above). Opus still looks stronger on several broad and coding-heavy axes, and GLM-5 or MiniMax-M2.7 can beat it on specific tasks. The interesting part is that Trinity stays close enough on the agentic evaluations that matter for tool use and planning while remaining radically cheaper and openly licensed.
PART THREE — WHY THIS FITS CRITIQUE SPECIFICALLY
Critique does not need every model to be universally best. It needs distinct lanes that make sense. Trinity-Large-Thinking naturally fits the lane where you want more long-horizon reasoning and better tool discipline than typical low-cost models provide, but you do not want to spend Claude Sonnet, GPT-5.4, or Opus credits on every review. That makes it valuable in three roles: an open-weight lead when cost matters, a specialist when a workflow is tool-heavy, and a Remedy option when the fix loop benefits from stronger multi-step planning.
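To make those lanes concrete, here is a hypothetical routing rule of the kind we mean. The policy and thresholds are illustrative, not Critique’s actual implementation; only the Trinity SKU is a real OpenRouter identifier.

```python
# Hypothetical lane-routing sketch. The policy is illustrative only;
# "premium-closed-frontier" is a placeholder, not a real SKU.
def pick_review_model(tool_heavy: bool, budget_credits: int,
                      deep_fix_loop: bool) -> str:
    if tool_heavy or deep_fix_loop:
        # Long-horizon tool loops and Remedy-style fix planning.
        return "arcee-ai/trinity-large-thinking"
    if budget_credits <= 2:
        # Broad, cheap coverage: the open-weight lead lane.
        return "arcee-ai/trinity-large-thinking"
    # Escalate only when the review genuinely warrants a premium closed model.
    return "premium-closed-frontier"
```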
The second reason it fits is ownership. Arcee’s launch post is unusually direct about why they are releasing this under Apache 2.0: developers and enterprises need models they can inspect, distill, host, and own. That matters to us because Critique sits in the part of the stack where trust and governance become product questions. Open weights do not automatically make a model better, but they absolutely make a model more governable.
What we think matters most when deciding whether to actually route this model.
| Metric | Value | Why it matters |
|---|---|---|
| License | Apache 2.0 | Open-weight availability changes the ownership story for enterprises and infra teams. |
| OpenRouter SKU | `arcee-ai/trinity-large-thinking` | This is the hosted route we exposed in Critique. |
| OpenRouter pricing | $0.22/M input · $0.85/M output | Keeps the model in the “default lane” conversation, not the “emergency escalation only” lane. |
| Context exposed in Critique | 262,144 tokens | This is the current OpenRouter route context; Arcee’s docs describe a larger extended window for direct deployments. |
| Model shape | 398B total · 13B active | Shows why the model can target frontier-style behavior without frontier dense-model inference cost. |
Arcee docs separately mention a 512K extended context window in reasoning-trace guidance. Our catalog entry reflects the OpenRouter-hosted route currently exposed in product.
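For intuition about what that pricing means in dollars, here is the arithmetic on one hypothetical large review call; the token counts are invented, and only the per-token rates come from the table above.

```python
# Back-of-envelope dollar cost for one call on the OpenRouter route.
# Rates are from the published pricing; token counts are hypothetical.
INPUT_RATE = 0.22 / 1_000_000   # USD per input token
OUTPUT_RATE = 0.85 / 1_000_000  # USD per output token

def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single call at the published per-token rates."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# A large PR: ~120K tokens of diff and context in, ~8K tokens of review out.
print(f"${call_cost(120_000, 8_000):.4f}")  # ≈ $0.0332
```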
PART FOUR — WHY 1 CREDIT IS THE ACTUAL STORY
Many model launches are strategically irrelevant because the economics never let them escape “demo mode.” Trinity-Large-Thinking is more interesting than that. A 1-credit floor means you can plausibly use it for broad review coverage, not just for the once-a-day nightmare PR. It also means teams can experiment with an openly licensed agentic model without immediately dragging review economics into the same territory as the premium closed frontier stack.
That does not mean one credit buys the whole workflow. In Critique, the total cost of a review still depends on the lead model, specialist fan-out, PR depth, and whether Remedy runs or re-review loops are needed. But the floor still matters because it determines whether a model gets considered at all. Trinity-Large-Thinking now clears that first gate comfortably.
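To see how the floor composes in practice, here is an invented credit tally for one review; every number except the 1-credit Trinity floor is hypothetical.

```python
# Illustrative credit accounting for one Critique review.
# Only the 1-credit Trinity floor comes from the catalog; the rest is invented.
lead = 1                 # Trinity-Large-Thinking as the lead reviewer
specialists = 3 * 1      # a hypothetical fan-out of three 1-credit specialists
remedy = 2               # a hypothetical Remedy fix loop
print(lead + specialists + remedy)  # 6 credits end to end, vs. a 1-credit floor
```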
PART FIVE — CAVEATS, RESEARCH NOTES, AND WHAT WE WOULD WATCH
There are at least three caveats worth keeping in view. First, benchmark comparability across vendors is always messy. Second, Arcee’s own docs make context handling a product requirement because the model is designed around preserved reasoning traces; teams that aggressively trim assistant state may undercut the exact behavior they were trying to buy. Third, the software-engineering story is good, but not yet obviously superior to the strongest closed models or the very best open competitors on every coding harness.
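As a concrete version of that second caveat, a trimming policy that drops stale tool output while keeping assistant turns, and the reasoning they carry, intact would respect the model’s contract. A sketch, with a heuristic that is ours rather than Arcee’s guidance:

```python
# Sketch of a context-trim policy that preserves reasoning-bearing turns.
# Drops older tool results first; the heuristic is ours, not Arcee's guidance.
def trim_context(messages: list[dict], keep_recent: int = 20) -> list[dict]:
    head, tail = messages[:-keep_recent], messages[-keep_recent:]
    # In older history, drop bulky tool outputs but keep assistant and user
    # turns, so the reasoning trace the model depends on survives.
    kept_head = [m for m in head if m.get("role") != "tool"]
    return kept_head + tail
```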
What we would watch in practice is not whether Trinity wins one more benchmark chart on launch week. We would watch whether it stays coherent across long specialist loops, whether it tool-calls cleanly under constraint, whether its review comments feel stable across large pull requests, and whether its open-weight posture becomes a real buying advantage for teams that care about control. Those are the questions that decide whether a model becomes infrastructure.