24 min read · Critique

Best Code Review Skill for Claude Code, Hermes, Codex, and Opencode

A research-backed guide to installing `critique-review` across Claude Code, Hermes Agent, Codex, and Opencode, with a same-PR Moonshot Kimi K2.6 comparison and clear guidance on when to move to Critique.

Harnesses in scope

One review skill, three different agent operating systems.

Anthropic

Claude Code

Native skills, subagents, project memory, and background delegation make Claude a strong home for a dedicated review persona.

Nous Research

Hermes Agent

Hermes treats skills as portable procedural memory and can carry the same review discipline across CLI, messaging, and long-lived remote sessions.

OpenAI

Codex

Codex gives the skill a durable place inside CLI, IDE, app, and repo-local workflows, with AGENTS.md and team-shared skills for repeatability.

Short answers for high-intent queries

These are the direct answers this page is designed to settle for engineering teams comparing review skills, review bots, and GitHub-native review workflows.

| Query | Short answer |
|-------|--------------|
| What is the best code review skill for Claude Code? | `critique-review` is a strong default when you want a portable PR review procedure inside Claude Code. Use Critique instead when you need hosted GitHub checks, policy, and merge control. |
| What is the best Codex skill for PR review? | `critique-review` fits Codex especially well because it works as a repo-local skill with `AGENTS.md`, reusable references, and a path into automations. |
| What is the best Opencode skill for pull request review? | For a portable review workflow, `critique-review` is the best fit in this article. We tested it on the same PR and same Moonshot Kimi K2.6 lane used for the baseline run. |
| Is critique-review a Cursor Bugbot alternative? | As a free portable skill, yes for agent-side review behavior. For a hosted GitHub-native review product, Critique is the closer Cursor Bugbot alternative. |
| What is a cheaper CodeRabbit alternative? | Start with the free `critique-review` skill if you want the lowest-cost entry point. Move to Critique if you need GitHub-native routing, artifacts, and PR control at team scale. |
| What is the difference between critique-review and Critique? | `critique-review` is the portable open skill. Critique is the hosted GitHub review control plane that adds checks, policy, merge-boundary controls, and team-grade review operations. |

This table is intentionally direct. Searchers at this stage are usually choosing between a free portable skill, a local agent workflow, or a hosted GitHub review layer.

Most coding agents can write code faster than most teams can reliably audit it. That is already true in 2026. The problem is not whether the agent can open files, run tests, or emit a patch. The problem is that review quality still drifts if you leave the job at the level of a generic prompt.

“Review this PR” sounds precise to a human and underspecified to a model. One harness will produce style commentary. Another will summarize the diff and call it a review. Another will confidently escalate a weak hunch into a merge blocker because nothing in its instructions told it how to separate a verified finding from an open question. That is exactly the hole a review skill is supposed to close.

Prompt-only review loop
Ask agent to review → Agent improvises rubric → Mixed-quality comments → Human re-validates everything

critique-review loop
Load skill → Establish scope + risk map → Verify before reporting → Findings first + explicit verdict

What changes for the agent
  • It stops treating review as free-form prose and starts from review mode, diff shape, and blast radius.
  • It is told to read tests, trace data flow, and verify claims before escalating them.
  • It separates findings from open questions instead of collapsing uncertainty into noise.
  • It ends with a merge-shaped artifact: severity, file or line, impact, failure mode, fix direction, verdict.

What changes for the team
  • The review standard travels across tools instead of living inside one vendor prompt box.
  • The same policy can be reused by humans, local agents, background agents, and CI-style automation.
  • Review quality becomes easier to inspect because the artifact shape is stable from run to run.
  • The team can upgrade harnesses later without throwing away its review discipline.
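
To make the "merge-shaped artifact" idea concrete, here is a minimal TypeScript sketch of what such an artifact could look like and how unverified claims might be kept out of the findings list. Everything in it is an assumption for illustration: the type names, the severity levels, the `verified` flag, the "Changes requested" verdict, and the verdict rule are hypothetical and are not taken from the `critique-review` skill itself.

```typescript
// Hypothetical artifact shape. Field names, severity levels, and the
// "Changes requested" verdict are illustrative assumptions, not the
// skill's actual output contract.
type Severity = "critical" | "high" | "medium" | "low";
type Verdict = "No objection" | "Conditionally approved" | "Changes requested";

interface Claim {
  severity: Severity;
  location: string; // file, or file:line
  impact: string;
  failureMode: string;
  fixDirection: string;
  verified: boolean; // true only if the claim was traced through code or tests
}

interface ReviewArtifact {
  findings: Claim[]; // verified claims only
  openQuestions: Claim[]; // unverified claims, reported as questions or residual risk
  verdict: Verdict;
}

// Assumed rule: only verified claims count as findings, and only verified
// critical/high findings block the merge.
function buildArtifact(claims: Claim[]): ReviewArtifact {
  const findings = claims.filter((c) => c.verified);
  const openQuestions = claims.filter((c) => !c.verified);
  const hasBlocker = findings.some(
    (c) => c.severity === "critical" || c.severity === "high"
  );

  let verdict: Verdict = "No objection";
  if (hasBlocker) {
    verdict = "Changes requested";
  } else if (findings.length > 0) {
    verdict = "Conditionally approved";
  }
  return { findings, openQuestions, verdict };
}

// A claim the reviewer could not verify from the attached context.
const unseenConsumers: Claim = {
  severity: "medium",
  location: "InstallationPolicyCard",
  impact: "Other call sites may be missing the new required props.",
  failureMode: "TypeScript compilation error at an unseen call site.",
  fixDirection: "Run a typecheck before merging.",
  verified: false,
};

const artifact = buildArtifact([unseenConsumers]);
console.log(artifact.verdict, artifact.openQuestions.length);
```

The design point is that uncertainty has its own slot in the artifact, so an unverified hunch shows up as an open question rather than changing the verdict.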

The cleanest way to test a review skill is to keep the code input fixed and change only the review procedure. So we used Opencode with the same model, the same PR, and the same attached context pack for both runs. The PR was Critique PR #144, a narrow UI fix that replaces hard-coded “Auto” model labels with labels resolved from the plan-allowed effective runtime model.

The baseline run had no project-local review skill available. The second run exposed `critique-review` through the project skill path that Opencode documents. Our terminal output confirmed the result: the harness loaded the skill and then opened the review references for output contract, intake and triage, stack lenses, and review rubric before generating its verdict.

What changed in the real Opencode run

Same PR, same fixed context pack, same Moonshot Kimi K2.6 lane. The difference below is the review procedure, not the code input.

| Question | Prompt-only Opencode | Opencode + critique-review |
|----------|----------------------|----------------------------|
| Actionable findings | 3 findings | 0 actionable findings |
| Treatment of unseen consumers | Escalated as a finding even though the attached context could not verify other call sites. | Downgraded to residual risk and suggested a typecheck instead of claiming a bug. |
| Treatment of missing tests | Escalated as its own finding. | Recorded in checks and residual risk instead of turning it into a blocker for a narrow UI-label fix. |
| Blast-radius framing | Broader, more defensive, and less bounded to the actual changed behavior. | Explicitly bounded to automation settings UI with no auth or data-path changes. |
| Verdict | Conditionally approved | No objection |
| Observed harness behavior | Direct review output only. | Loaded `critique-review` and read four supporting reference files before answering. |

Interpretation: the skill did not make the model “nicer”; it made the model stricter about evidence and more conservative about what counts as a finding.

Full Opencode review outputs

These are the verbatim final review texts from the two non-interactive Opencode runs on the same fixed PR context pack.

Prompt-only run

Opencode with the same attached PR pack and no project-local review skill.

**Findings**

1. **Breaking component interface change without updating all consumers.**  
   `InstallationPolicyCard` now requires three new string props: `autoReviewLeadLabel`, `autoReviewSpecialistLabel`, and `autoRemedyLabel`. The diff only updates the caller in `automation/page.tsx`. If other callers exist (e.g., Storybook stories, tests, or other pages), they will fail TypeScript compilation. Verify there are no other consumers before merging.

2. **Missing test updates.**  
   The diff does not touch any test files. Because this PR changes user-facing labels from static strings to values resolved by `resolveAllowedRuntimeModel`, tests should assert that:  
   - `InstallationPolicyCard` receives the correct resolved labels.  
   - The rendered `emptyLabel` text in each `RuntimeModelSelect` matches the formatted resolved model rather than the old hard-coded strings.

3. **Remedy selector label loses the “(workspace default)” qualifier.**  
   The old `emptyLabel` for the remedy selector was `"Auto (workspace default)"`. It is replaced with `` `Auto: ${formatModelLabel(autoRemedyModel)}` ``, which standardizes the format but removes the explicit “workspace default” hint. Ensure this change is intentional from a product/UX perspective (the diff suggests it is part of unifying the labels, but worth confirming).

**Checks**

| Area | Result |
|------|--------|
| **Correctness** | The use of `resolveAllowedRuntimeModel` with `requestedModelId: null` correctly forces the fallback path, yielding the effective runtime model for the installation’s plan. The final fallback to the first allowed model (via `getAllowedRuntimeModels(...)[0]?.id`) is safe. |
| **Type safety** | New props are typed as `string` and are passed as interpolated strings, so `emptyLabel` remains a string as required by `RuntimeModelSelect`. |
| **User-facing behavior** | Users will now see accurate auto labels reflecting their plan-specific model eligibility instead of hard-coded “DeepSeek V4 Pro” / “DeepSeek V4 Flash” placeholders. |
| **Edge cases** | If a plan disallows the fallback model, the helper falls back to the first allowed model in the sorted list. `formatModelLabel` handles unknown IDs via `formatRuntimeModelDisplayName`. |
| **BYOK** | The `resolveAllowedRuntimeModel` calls omit `byokProvider`, but this is consistent with the existing `getAllowedRuntimeModels` calls in the same scope. |

**Verdict**

**Conditionally approved.** The resolution logic is correct and the hard-coded labels are properly replaced. Merge only after confirming that all `InstallationPolicyCard` consumers are updated with the new required props and that tests cover the resolved label rendering.

**Summary**

This PR fixes misleading auto-model placeholders by resolving the effective runtime model for each role based on the installation’s plan and passing the resulting formatted labels into `InstallationPolicyCard`. The logic is sound, but the change to the component contract carries a regression risk for unseen consumers, and the diff lacks test coverage for the new dynamic label behavior.

Skill-guided run

Opencode with the same attached PR pack after loading the project-local `critique-review` skill and references.

**Findings**

No actionable findings. The change correctly replaces hard-coded auto model labels with labels derived from the effective runtime model resolution logic.

**Checks**

- Traced `resolveAllowedRuntimeModel` to confirm it returns the requested model if allowed, the fallback if allowed, or the cheapest plan-allowed model as a last resort — matching the intended “effective auto model” semantics.
- Verified the three new props (`autoReviewLeadLabel`, `autoReviewSpecialistLabel`, `autoRemedyLabel`) are passed from the server component and consumed in all three `RuntimeModelSelect` `emptyLabel` slots, replacing the previous hard-coded strings.
- Confirmed `resolveAllowedRuntimeModel` calls omit `byokProvider`, which is consistent with the existing `getAllowedRuntimeModels` calls already present in the same map block.
- No tests were added or modified for this UI-only label change.
- Unable to confirm from the attached context whether `InstallationPolicyCard` has other consumers; a TypeScript build would surface any missing required prop errors at other call sites.

**Verdict**

No objection

**Summary**

A safe, targeted frontend fix that removes misleading hard-coded auto labels from workspace automation settings. The labels now correctly reflect the plan-allowed effective fallback model instead of static placeholders. No data or auth paths are touched, and the blast radius is limited to the automation settings UI. Residual risk is low: run a typecheck to guard against any other `InstallationPolicyCard` consumers missing the new required props, and consider a lightweight component test asserting prop-driven label rendering.
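
Both reviews describe the same fallback order for the effective auto model: the requested model if the plan allows it, then the fallback if allowed, then the first plan-allowed model. The sketch below illustrates that order only. It is not the PR's `resolveAllowedRuntimeModel`; the function names, the input shape, the model IDs, and the bare "Auto" label for an empty plan are all hypothetical, and the real code formats the label through `formatModelLabel` rather than printing a raw ID.

```typescript
// Hypothetical sketch of the fallback order described in the reviews.
// NOT the actual `resolveAllowedRuntimeModel` implementation from the PR.
interface ResolveInput {
  requestedModelId: string | null;
  fallbackModelId: string;
  // Assumed to be sorted so that the first entry is the last-resort default.
  allowedModelIds: readonly string[];
}

function resolveEffectiveModel(input: ResolveInput): string | undefined {
  const { requestedModelId, fallbackModelId, allowedModelIds } = input;
  const isAllowed = (id: string): boolean => allowedModelIds.indexOf(id) !== -1;

  // 1. Honor an explicit request when the plan allows it.
  if (requestedModelId !== null && isAllowed(requestedModelId)) {
    return requestedModelId;
  }
  // 2. Otherwise use the role's fallback when the plan allows it.
  if (isAllowed(fallbackModelId)) {
    return fallbackModelId;
  }
  // 3. Last resort: the first plan-allowed model (undefined if none).
  return allowedModelIds[0];
}

// Passing `requestedModelId: null` forces the fallback path, which is what
// an "Auto" label needs to display.
function autoLabel(fallbackModelId: string, allowedModelIds: readonly string[]): string {
  const modelId = resolveEffectiveModel({
    requestedModelId: null,
    fallbackModelId,
    allowedModelIds,
  });
  return modelId === undefined ? "Auto" : `Auto: ${modelId}`;
}

// Placeholder model IDs, not real plan data.
console.log(autoLabel("model-pro", ["model-flash", "model-pro"]));
console.log(autoLabel("model-pro", ["model-flash"]));
```

In the second call the plan disallows the fallback, so the label falls through to the first allowed model, which is the edge case the prompt-only review listed under its checks.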

Claude Code, Hermes Agent, and Codex are not the same product category even though all three can edit repositories. Claude is especially strong when you want skills plus subagents inside a focused coding surface. Hermes is unusually strong when you want an agent with persistent memory, multi-platform reach, and portable open-standard skills. Codex is unusually strong when you want the same skill to survive across CLI, IDE, app, and repo-local automation, with a first-party story around AGENTS.md and reusable workflows.

What each harness gives critique-review natively

This is the practical compatibility view: where the skill lives, how it gets invoked, and why the harness changes the operating style.

| Question | Claude Code | Hermes Agent | Codex |
|----------|-------------|--------------|-------|
| Native skill shape | `SKILL.md` skills and markdown subagents. | `SKILL.md` skills with references, scripts, and hub installs. | `SKILL.md` skills with optional scripts, references, assets, and `agents/openai.yaml`. |
| Automatic loading | Yes; descriptions drive auto-use and direct `/skill-name` invocation. | Yes; `skills_list()` loads compactly and `skill_view()` expands on demand. | Yes; Codex includes an initial skill list, then reads the full skill when selected. |
| Project instruction layer | `CLAUDE.md`, with a documented import path for `AGENTS.md`. | Top-level `AGENTS.md` at session start, subdirectory files lazily. | `AGENTS.md` as the shared repo instruction layer for Codex surfaces. |
| Memory model | Project memory plus optional subagent memory. | Persistent built-in memory and optional external memory providers. | Repo instructions, skills, and broader Codex memories and workflows. |
| Parallelism story | Subagents, agent view, teams, background work. | Delegation, remote backends, scheduled automations, messaging surfaces. | Parallel agents, worktrees, automations, app plus CLI plus IDE. |
| Best use of critique-review | Dedicated review subagent or project skill. | Portable review procedure that follows Hermes everywhere it runs. | Repo-local review standard shared across local and cloud Codex work. |

Based on the official docs for Claude Code skills and subagents, Hermes skills and context files, and Codex skills plus AGENTS.md.

Claude Code is the cleanest fit if your goal is to turn review into a specialized persona rather than a sentence you keep retyping. Anthropic now documents a native skills system, markdown-defined subagents, project `CLAUDE.md`, and background delegation. That means `critique-review` can live in exactly the shape Claude already expects rather than being smuggled in as a huge one-off prompt.

Figure (Critique and Claude integration illustration): Claude Code already has the primitives a review skill wants: skills, subagents, project memory, and background work.

The documented operator pattern is straightforward. Put the skill in `~/.claude/skills/` for personal reuse or `.claude/skills/` for project reuse. If the repository already standardizes on `AGENTS.md` for multi-agent instructions, Anthropic explicitly documents importing that file from `CLAUDE.md`, which means you do not have to fork your team policy just to accommodate Claude. If review deserves even tighter identity, promote the same discipline into a custom review subagent. Claude’s docs go further here: subagents can have their own system prompt, memory scope, hooks, and independent tool restrictions.

That changes the quality of review in a very practical way. Instead of asking your main coding session to switch personality midstream, you give Claude a dedicated review worker. The worker can keep the main implementation thread clean, inspect the changed files in its own context window, and come back with a findings-first verdict. For a team already living in Claude Code all day, this is the lowest-friction way to stop code review from collapsing into narrative explanation.

Hermes Agent is the most interesting harness in this set if you care about portability more than polish. Nous positions Hermes as a self-improving agent with persistent memory, a skills system, top-level `AGENTS.md` loading, multiple execution backends, and delivery surfaces that range from CLI to Telegram to Slack to remote server runtimes. That is a very different contract from “one coding assistant inside one editor.”

For `critique-review`, that matters because the skill is already written as procedural memory. Hermes documents exactly the same pattern: skills are markdown files with frontmatter, they load through `skills_list()` and `skill_view()`, every installed skill becomes a slash command, and the agent can install a single-file skill directly from an HTTP URL. Hermes also explicitly frames skills as the place for reusable multi-step procedures, while memory holds facts about the user, project, and environment. That split is almost tailor-made for a review skill.
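
The compact-list-then-expand pattern can be sketched in a few lines of TypeScript. This is an illustration of the idea only, not Hermes code: `parseSkill`, `listSkills`, and `viewSkill` are hypothetical stand-ins rather than the real `skills_list()` and `skill_view()`, the parser handles only simple `key: value` frontmatter lines, and the example description and body text are invented placeholders.

```typescript
// Hypothetical sketch of progressive disclosure for SKILL.md-style files.
// These helpers are illustrative stand-ins, not any harness's real API.
interface SkillSummary {
  name: string;
  description: string;
}

interface Skill extends SkillSummary {
  body: string; // the full procedure, read only when the skill is selected
}

// Parse a "---" delimited frontmatter block of simple `key: value` lines,
// followed by a markdown body.
function parseSkill(source: string): Skill {
  const lines = source.split("\n");
  if ((lines[0] ?? "").trim() !== "---") {
    throw new Error("missing frontmatter");
  }
  let end = -1;
  for (let i = 1; i < lines.length; i++) {
    if ((lines[i] ?? "").trim() === "---") {
      end = i;
      break;
    }
  }
  if (end === -1) {
    throw new Error("unterminated frontmatter");
  }

  const meta: { [key: string]: string } = {};
  for (let i = 1; i < end; i++) {
    const line = lines[i] ?? "";
    const colon = line.indexOf(":");
    if (colon === -1) continue;
    meta[line.slice(0, colon).trim()] = line.slice(colon + 1).trim();
  }

  const name = meta["name"];
  const description = meta["description"];
  if (!name || !description) {
    throw new Error("frontmatter needs name and description");
  }
  return { name, description, body: lines.slice(end + 1).join("\n").trim() };
}

// Cheap listing: only name and description reach the agent's context.
function listSkills(skills: Skill[]): SkillSummary[] {
  return skills.map((s) => ({ name: s.name, description: s.description }));
}

// On-demand expansion: the full body is returned only for the chosen skill.
function viewSkill(skills: Skill[], name: string): string {
  for (const s of skills) {
    if (s.name === name) return s.body;
  }
  throw new Error(`unknown skill: ${name}`);
}

// Placeholder skill text for illustration.
const exampleSkill = [
  "---",
  "name: critique-review",
  "description: Findings-first pull request review.",
  "---",
  "# Procedure",
  "1. Establish scope and risk map.",
].join("\n");

const installed = [parseSkill(exampleSkill)];
console.log(listSkills(installed));
console.log(viewSkill(installed, "critique-review"));
```

The listing stays small no matter how long the procedure is, which is why an installed review skill costs little until a review request actually selects it.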

Why Hermes changes the story
  • The same review skill can run in terminal, messaging, and remote backends instead of being trapped in one local editor.
  • Hermes loads top-level `AGENTS.md`, so repo policy and review procedure can sit together cleanly.
  • Progressive disclosure keeps the skill cheap until a review request actually triggers it.
  • Persistent memory means the agent can remember repeated repo-specific review patterns across sessions.
What this looks like in practice
  • A platform lead pings Hermes from chat to review a risky infra patch while Hermes is running on a remote box.
  • The agent loads `critique-review`, reads the repo instruction layer, and applies the same severity contract it would use in terminal.
  • Follow-up sessions get stronger because Hermes can retain the conventions and failure patterns that matter for that codebase.
  • The review discipline becomes a durable capability, not a one-time conversation artifact.

That makes Hermes the best home for `critique-review` when the business problem is not just PR review in one IDE, but review discipline that needs to survive across surfaces and time. If your engineering workflow spills from terminal to chat ops to remote agents, Hermes gives the skill the widest runway.

Codex is the strongest fit when you want a review skill to become part of the repository, not just part of one user’s setup. OpenAI’s Codex docs now treat skills as first-class reusable workflows. Codex documents a skill directory with optional scripts, references, and `agents/openai.yaml`, plus explicit invocation and implicit selection. OpenAI also documents `AGENTS.md` as the custom-instruction layer for Codex across its surfaces.

Figure (Critique and Codex integration illustration): Codex is the best fit when you want the review skill to live in the repo and survive across app, CLI, IDE, and automation surfaces.

The most important Codex signal is not just that skills exist. It is that OpenAI is publicly documenting team use of repo-local skills, `AGENTS.md`, and GitHub Actions to turn repeated engineering tasks into repeatable workflows. In OpenAI’s own write-up about OSS maintenance, they report 457 merged PRs across two Agents SDK repos in the December 2025 to February 2026 window, up from 316 in the previous three months, with repo-local skills and `AGENTS.md` called out as part of the setup. That does not prove every skill boosts throughput on every team. It does prove OpenAI is operationalizing the exact pattern this review skill belongs to.

For a team using Codex app, CLI, or IDE extension, `critique-review` becomes the shared review grammar. The same repository can tell Codex how the team wants review to work, the same skill can be invoked locally or automatically, and the same procedure can be read by any collaborator who opens the repo. This is where the free skill starts to feel less like content and more like infrastructure.

Portable wins

| Benefit | Why it matters |
|---------|----------------|
| Stable review artifact | The output becomes comparable across runs because the agent is pushed into findings, checks, and verdict instead of free-form summary. |
| Less prompt drift | You stop re-explaining your review philosophy in every new session or every new tool. |
| Better false-positive control | The skill explicitly tells the agent to verify claims and downgrade uncertainty into questions or residual risk. |
| Cross-tool continuity | Teams can change harnesses or run several at once without resetting their review discipline. |
| Cleaner governance | Repo policy lives in `AGENTS.md` or the harness instruction layer; procedure lives in the skill; that separation scales better than giant monolithic prompts. |
| Independent reviewer identity | The code-writing agent no longer has to invent a review persona on the fly. |

For many teams, the free portable skill is enough at first. If you are trying to improve how Claude, Hermes, or Codex reviews diffs locally, inside chat, or in a narrow internal workflow, start there. It is cheap, transparent, and it teaches the team what better review output actually looks like.

But a skill is still only a skill. It does not by itself give you GitHub-native checks, merge-boundary control, shared policy enforcement, auditable review artifacts, review routing across multiple specialist lanes, or a product surface built specifically for high-volume pull-request operations. That is the moment where the recommendation should become explicit: move from the free skill into Critique.

No. The skill is portable. Codex can read optional OpenAI-specific metadata, but the core procedure is just `SKILL.md` plus references. Claude Code and Hermes both document compatible skill systems built on the same open standard.
Not conceptually. Claude Code now supports skills directly and can also wrap the same review procedure in a subagent when you want a dedicated reviewer identity.
Because Hermes turns the review procedure into a portable capability that can follow you into remote sessions, scheduled work, and messaging surfaces. That is useful when review requests do not stay inside one editor window.
Because the skill solves procedure. Critique solves control. When review becomes a GitHub workflow, not just an agent habit, you need the hosted layer.
If you want a portable, repo-visible PR review procedure inside Claude Code, `critique-review` is a strong default. It gives Claude a stable review contract: scope, risk map, verification, findings, checks, and verdict. If you need hosted GitHub checks, policy, and merge control, move from the skill to Critique.
For teams using Codex app, CLI, or IDE surfaces, `critique-review` is one of the best PR review skills because it fits repo-local skills, `AGENTS.md`, and shared workflow reuse. It works well when the review standard should live in the repo instead of one user prompt history.
For a portable PR review procedure in Opencode, `critique-review` is the best fit covered here. In our same-PR experiment on Moonshot Kimi K2.6, the skill-guided run produced a tighter, better-calibrated verdict than the prompt-only run.
For free, agent-side review behavior, yes. `critique-review` gives local or cloud coding agents a stronger review procedure. If you need a hosted GitHub review product rather than a portable skill, Critique is the closer alternative to Cursor Bugbot.
Partly. `critique-review` is the free portable alternative when you want better review behavior inside coding agents. Critique is the better comparison to CodeRabbit when the real job is GitHub-native PR review, review artifacts, policy, and merge control.
Install `critique-review`, keep the review procedure in the skill, and keep repository-specific rules in `AGENTS.md` or the harness instruction layer. That split makes the review easier to repeat, inspect, and reuse across sessions.
A code review skill improves how one agent reviews a diff. A GitHub review control plane manages the review as a workflow: checks, policy, artifacts, routing, and merge-boundary decisions across many pull requests and many contributors.

Start with the skill. Move to Critique when review becomes a control problem.

Download `critique-review` for Claude Code, Hermes Agent, or Codex when you want a portable review standard. Move into Critique when you want that same standard enforced on the GitHub pull request with policy, artifacts, and a real merge-boundary surface.
