Best Code Review Skill for Claude Code, Hermes, Codex, and Opencode
A research-backed guide to installing `critique-review` across Claude Code, Hermes Agent, Codex, and Opencode, with a same-PR Moonshot Kimi K2.6 comparison and clear guidance on when to move to Critique.
One review skill, three different agent operating systems.
Claude Code
Native skills, subagents, project memory, and background delegation make Claude a strong home for a dedicated review persona.
Hermes Agent
Hermes treats skills as portable procedural memory and can carry the same review discipline across CLI, messaging, and long-lived remote sessions.
Codex
Codex gives the skill a durable place inside CLI, IDE, app, and repo-local workflows, with AGENTS.md and team-shared skills for repeatability.
These are the questions this page is designed to settle, each with a direct answer, for engineering teams comparing review skills, review bots, and GitHub-native review workflows.
| Query | Short answer |
|---|---|
| What is the best code review skill for Claude Code? | `critique-review` is a strong default when you want a portable PR review procedure inside Claude Code. Use Critique instead when you need hosted GitHub checks, policy, and merge control. |
| What is the best Codex skill for PR review? | `critique-review` fits Codex especially well because it works as a repo-local skill with `AGENTS.md`, reusable references, and a path into automations. |
| What is the best Opencode skill for pull request review? | For a portable review workflow, `critique-review` is the best fit in this article. We tested it on the same PR and same Moonshot Kimi K2.6 lane used for the baseline run. |
| Is critique-review a Cursor Bugbot alternative? | Yes, for agent-side review behavior: it is a free portable skill. If you want a hosted GitHub-native review product, Critique is the closer Cursor Bugbot alternative. |
| What is a cheaper CodeRabbit alternative? | Start with the free `critique-review` skill if you want the lowest-cost entry point. Move to Critique if you need GitHub-native routing, artifacts, and PR control at team scale. |
| What is the difference between critique-review and Critique? | `critique-review` is the portable open skill. Critique is the hosted GitHub review control plane that adds checks, policy, merge-boundary controls, and team-grade review operations. |
This table is intentionally direct. Searchers at this stage are usually choosing between a free portable skill, a local agent workflow, or a hosted GitHub review layer.
Most coding agents can write code faster than most teams can reliably audit it. That is already true in 2026. The problem is not whether the agent can open files, run tests, or emit a patch. The problem is that review quality still drifts if you leave the job at the level of a generic prompt.
“Review this PR” sounds precise to a human and underspecified to a model. One harness will produce style commentary. Another will summarize the diff and call it a review. Another will confidently escalate a weak hunch into a merge blocker because nothing in its instructions told it how to separate a verified finding from an open question. That is exactly the hole a review skill is supposed to close.
What the skill actually adds
- It stops treating review as free-form prose and starts from review mode, diff shape, and blast radius.
- It is told to read tests, trace data flow, and verify claims before escalating them.
- It separates findings from open questions instead of collapsing uncertainty into noise.
- It ends with a merge-shaped artifact: severity, file or line, impact, failure mode, fix direction, verdict.
- The review standard travels across tools instead of living inside one vendor prompt box.
- The same policy can be reused by humans, local agents, background agents, and CI-style automation.
- Review quality becomes easier to inspect because the artifact shape is stable from run to run.
- The team can upgrade harnesses later without throwing away its review discipline.
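The merge-shaped artifact listed above can be sketched as a small data structure. This is a minimal illustration, not the schema `critique-review` actually ships: the class names `Finding` and `Review`, the severity values, and the rendering format are all assumptions. Only the field list (severity, file or line, impact, failure mode, fix direction, verdict) comes from the description above.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Finding:
    """One verified finding. Fields mirror the artifact shape listed above."""
    severity: str       # hypothetical scale, e.g. "high", "medium", "low"
    location: str       # file, or file:line
    impact: str
    failure_mode: str
    fix_direction: str


@dataclass
class Review:
    """A whole review: verified findings, open questions, and a verdict."""
    verdict: str
    findings: list[Finding] = field(default_factory=list)
    open_questions: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the review in a fixed order so runs stay comparable."""
        lines = ["Findings"]
        if not self.findings:
            lines.append("No actionable findings.")
        for f in self.findings:
            lines.append(
                f"[{f.severity}] {f.location}: {f.impact} "
                f"(fails when: {f.failure_mode}; fix: {f.fix_direction})"
            )
        if self.open_questions:
            lines.append("Open questions")
            lines.extend(f"- {q}" for q in self.open_questions)
        lines.append(f"Verdict: {self.verdict}")
        return "\n".join(lines)


# Example: a review with no findings still produces the same stable shape.
print(Review(verdict="No objection").render())
```

Because open questions have their own slot, an unverified concern has somewhere to go other than the findings list, which is the separation the skill is meant to enforce.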
Real experiment: same PR, same model, only the skill changed
The cleanest way to test a review skill is to keep the code input fixed and change only the review procedure. So we used Opencode with the same model, the same PR, and the same attached context pack for both runs. The PR was Critique PR #144, a narrow UI fix that replaces hard-coded “Auto” model labels with labels resolved from the plan-allowed effective runtime model.
The baseline run had no project-local review skill available. The second run exposed `critique-review` through the project skill path that Opencode documents. Our terminal output confirmed that the harness loaded the skill and then opened the review references for output contract, intake and triage, stack lenses, and review rubric before generating its verdict.
Same PR, same fixed context pack, same Moonshot Kimi K2.6 lane. The difference below is the review procedure, not the code input.
| Question | Prompt-only Opencode | Opencode + critique-review |
|---|---|---|
| Actionable findings | 3 findings | 0 actionable findings |
| Treatment of unseen consumers | Escalated as a finding even though the attached context could not verify other call sites. | Downgraded to residual risk and suggested a typecheck instead of claiming a bug. |
| Treatment of missing tests | Escalated as its own finding. | Recorded in checks and residual risk instead of turning it into a blocker for a narrow UI-label fix. |
| Blast-radius framing | Broader, more defensive, and less bounded to the actual changed behavior. | Explicitly bounded to automation settings UI with no auth or data-path changes. |
| Verdict | Conditionally approved | No objection |
| Observed harness behavior | Direct review output only. | Loaded `critique-review` and read four supporting reference files before answering. |
Interpretation: the skill did not make the model “nicer”; it made the model stricter about evidence and more conservative about what counts as a finding.
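The evidence rule behind that result can be sketched in a few lines. This is an illustration of the idea, not code from the skill: the `Claim` record, the `merge_relevant` flag, the `triage` function, and the three bucket names are hypothetical, and the two sample claims only paraphrase rows of the table above.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    summary: str
    verified: bool        # confirmed from context the reviewer could actually read
    merge_relevant: bool  # would it change the merge decision for this diff?


def triage(claim: Claim) -> str:
    """Decide where a claim belongs in the review.

    Unverified claims are never promoted to findings; they stay visible
    as residual risk. Verified claims that do not affect the merge
    decision are recorded as check notes instead of blockers.
    """
    if not claim.verified:
        return "residual risk"
    if not claim.merge_relevant:
        return "check note"
    return "finding"


# Hypothetical inputs modelled on the comparison table.
claims = [
    Claim("Other InstallationPolicyCard consumers may lack the new props",
          verified=False, merge_relevant=True),
    Claim("No tests were added for the label change",
          verified=True, merge_relevant=False),
]
for claim in claims:
    print(f"{triage(claim)}: {claim.summary}")
```

Under this rule neither sample claim becomes a finding, which is the same direction the skill run took. The actual skill expresses this as review instructions rather than as code.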
Full Opencode review outputs
These are the verbatim final review texts from the two non-interactive Opencode runs on the same fixed PR context pack.
Opencode with the same attached PR pack and no project-local review skill.
**Findings**
1. **Breaking component interface change without updating all consumers.**
`InstallationPolicyCard` now requires three new string props: `autoReviewLeadLabel`, `autoReviewSpecialistLabel`, and `autoRemedyLabel`. The diff only updates the caller in `automation/page.tsx`. If other callers exist (e.g., Storybook stories, tests, or other pages), they will fail TypeScript compilation. Verify there are no other consumers before merging.
2. **Missing test updates.**
The diff does not touch any test files. Because this PR changes user-facing labels from static strings to values resolved by `resolveAllowedRuntimeModel`, tests should assert that:
- `InstallationPolicyCard` receives the correct resolved labels.
- The rendered `emptyLabel` text in each `RuntimeModelSelect` matches the formatted resolved model rather than the old hard-coded strings.
3. **Remedy selector label loses the “(workspace default)” qualifier.**
The old `emptyLabel` for the remedy selector was `"Auto (workspace default)"`. It is replaced with `` `Auto: ${formatModelLabel(autoRemedyModel)}` ``, which standardizes the format but removes the explicit “workspace default” hint. Ensure this change is intentional from a product/UX perspective (the diff suggests it is part of unifying the labels, but worth confirming).
**Checks**
| Area | Result |
|------|--------|
| **Correctness** | The use of `resolveAllowedRuntimeModel` with `requestedModelId: null` correctly forces the fallback path, yielding the effective runtime model for the installation’s plan. The final fallback to the first allowed model (via `getAllowedRuntimeModels(...)[0]?.id`) is safe. |
| **Type safety** | New props are typed as `string` and are passed as interpolated strings, so `emptyLabel` remains a string as required by `RuntimeModelSelect`. |
| **User-facing behavior** | Users will now see accurate auto labels reflecting their plan-specific model eligibility instead of hard-coded “DeepSeek V4 Pro” / “DeepSeek V4 Flash” placeholders. |
| **Edge cases** | If a plan disallows the fallback model, the helper falls back to the first allowed model in the sorted list. `formatModelLabel` handles unknown IDs via `formatRuntimeModelDisplayName`. |
| **BYOK** | The `resolveAllowedRuntimeModel` calls omit `byokProvider`, but this is consistent with the existing `getAllowedRuntimeModels` calls in the same scope. |
**Verdict**
**Conditionally approved.** The resolution logic is correct and the hard-coded labels are properly replaced. Merge only after confirming that all `InstallationPolicyCard` consumers are updated with the new required props and that tests cover the resolved label rendering.
**Summary**
This PR fixes misleading auto-model placeholders by resolving the effective runtime model for each role based on the installation’s plan and passing the resulting formatted labels into `InstallationPolicyCard`. The logic is sound, but the change to the component contract carries a regression risk for unseen consumers, and the diff lacks test coverage for the new dynamic label behavior.
Opencode with the same attached PR pack after loading the project-local `critique-review` skill and references.
**Findings**
No actionable findings. The change correctly replaces hard-coded auto model labels with labels derived from the effective runtime model resolution logic.
**Checks**
- Traced `resolveAllowedRuntimeModel` to confirm it returns the requested model if allowed, the fallback if allowed, or the cheapest plan-allowed model as a last resort — matching the intended “effective auto model” semantics.
- Verified the three new props (`autoReviewLeadLabel`, `autoReviewSpecialistLabel`, `autoRemedyLabel`) are passed from the server component and consumed in all three `RuntimeModelSelect` `emptyLabel` slots, replacing the previous hard-coded strings.
- Confirmed `resolveAllowedRuntimeModel` calls omit `byokProvider`, which is consistent with the existing `getAllowedRuntimeModels` calls already present in the same map block.
- No tests were added or modified for this UI-only label change.
- Unable to confirm from the attached context whether `InstallationPolicyCard` has other consumers; a TypeScript build would surface any missing required prop errors at other call sites.
**Verdict**
No objection
**Summary**
A safe, targeted frontend fix that removes misleading hard-coded auto labels from workspace automation settings. The labels now correctly reflect the plan-allowed effective fallback model instead of static placeholders. No data or auth paths are touched, and the blast radius is limited to the automation settings UI. Residual risk is low: run a typecheck to guard against any other `InstallationPolicyCard` consumers missing the new required props, and consider a lightweight component test asserting prop-driven label rendering.
Why these three harnesses matter
Claude Code, Hermes Agent, and Codex are not the same product category even though all three can edit repositories. Claude is especially strong when you want skills plus subagents inside a focused coding surface. Hermes is unusually strong when you want an agent with persistent memory, multi-platform reach, and portable open-standard skills. Codex is unusually strong when you want the same skill to survive across CLI, IDE, app, and repo-local automation, with a first-party story around AGENTS.md and reusable workflows.
This is the practical compatibility view: where the skill lives, how it gets invoked, and why the harness changes the operating style.
| Question | Claude Code | Hermes Agent | Codex |
|---|---|---|---|
| Native skill shape | `SKILL.md` skills and markdown subagents. | `SKILL.md` skills with references, scripts, and hub installs. | `SKILL.md` skills with optional scripts, references, assets, and `agents/openai.yaml`. |
| Automatic loading | Yes; descriptions drive auto-use and direct `/skill-name` invocation. | Yes; `skills_list()` loads compactly and `skill_view()` expands on demand. | Yes; Codex includes an initial skill list, then reads the full skill when selected. |
| Project instruction layer | `CLAUDE.md`, with a documented import path for `AGENTS.md`. | Top-level `AGENTS.md` at session start, subdirectory files lazily. | `AGENTS.md` as the shared repo instruction layer for Codex surfaces. |
| Memory model | Project memory plus optional subagent memory. | Persistent built-in memory and optional external memory providers. | Repo instructions, skills, and broader Codex memories and workflows. |
| Parallelism story | Subagents, agent view, teams, background work. | Delegation, remote backends, scheduled automations, messaging surfaces. | Parallel agents, worktrees, automations, app plus CLI plus IDE. |
| Best use of critique-review | Dedicated review subagent or project skill. | Portable review procedure that follows Hermes everywhere it runs. | Repo-local review standard shared across local and cloud Codex work. |
Based on the official docs for Claude Code skills and subagents, Hermes skills and context files, and Codex skills plus AGENTS.md.
Case study pattern one: Claude Code as the senior reviewer inside the coding loop
Claude Code is the cleanest fit if your goal is to turn review into a specialized persona rather than a sentence you keep retyping. Anthropic now documents a native skills system, markdown-defined subagents, project `CLAUDE.md`, and background delegation. That means `critique-review` can live in exactly the shape Claude already expects rather than being smuggled in as a huge one-off prompt.

The documented operator pattern is straightforward. Put the skill in `~/.claude/skills/` for personal reuse or `.claude/skills/` for project reuse. If the repository already standardizes on `AGENTS.md` for multi-agent instructions, Anthropic explicitly documents importing that file from `CLAUDE.md`, which means you do not have to fork your team policy just to accommodate Claude. If review deserves even tighter identity, promote the same discipline into a custom review subagent. Claude’s docs go further here: subagents can have their own system prompt, memory scope, hooks, and independent tool restrictions.
That changes the quality of review in a very practical way. Instead of asking your main coding session to switch personality midstream, you give Claude a dedicated review worker. The worker can keep the main implementation thread clean, inspect the changed files in its own context window, and come back with a findings-first verdict. For a team already living in Claude Code all day, this is the lowest-friction way to stop code review from collapsing into narrative explanation.
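The two install locations mentioned above can be checked with a short script. This is a sketch under stated assumptions: it assumes the skill sits in its own folder named `critique-review` with a `SKILL.md` inside, and the helper names `claude_skill_dir` and `skill_is_installed` are hypothetical. Only the `~/.claude/skills/` and `.claude/skills/` base paths come from the text.

```python
from pathlib import Path

SKILL_NAME = "critique-review"  # assumed folder name for the skill


def claude_skill_dir(scope: str, project_root: Path) -> Path:
    """Return where the skill folder would live for the given scope.

    "personal" maps to ~/.claude/skills/<skill>,
    "project" maps to <project_root>/.claude/skills/<skill>.
    """
    if scope == "personal":
        base = Path.home() / ".claude" / "skills"
    elif scope == "project":
        base = project_root / ".claude" / "skills"
    else:
        raise ValueError(f"unknown scope: {scope!r}")
    return base / SKILL_NAME


def skill_is_installed(scope: str, project_root: Path) -> bool:
    """True if a SKILL.md file exists at the expected location."""
    return (claude_skill_dir(scope, project_root) / "SKILL.md").is_file()


if __name__ == "__main__":
    repo = Path.cwd()
    for scope in ("project", "personal"):
        status = "found" if skill_is_installed(scope, repo) else "missing"
        print(f"{scope}: {claude_skill_dir(scope, repo)} ({status})")
```

A project-scoped install travels with the repository, so it is the natural choice when the review standard should be shared by the whole team. A personal install only affects one user's sessions.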
Case study pattern two: Hermes Agent as the portable review brain
Hermes Agent is the most interesting harness in this set if you care about portability more than polish. Nous positions Hermes as a self-improving agent with persistent memory, a skills system, top-level `AGENTS.md` loading, multiple execution backends, and delivery surfaces that range from CLI to Telegram to Slack to remote server runtimes. That is a very different contract from “one coding assistant inside one editor.”
For `critique-review`, that matters because the skill is already written as procedural memory. Hermes documents exactly the same pattern: skills are markdown files with frontmatter, they load through `skills_list()` and `skill_view()`, every installed skill becomes a slash command, and the agent can install a single-file skill directly from an HTTP URL. Hermes also explicitly frames skills as the place for reusable multi-step procedures, while memory holds facts about the user, project, and environment. That split is almost tailor-made for a review skill.
- The same review skill can run in terminal, messaging, and remote backends instead of being trapped in one local editor.
- Hermes loads top-level `AGENTS.md`, so repo policy and review procedure can sit together cleanly.
- Progressive disclosure keeps the skill cheap until a review request actually triggers it.
- Persistent memory means the agent can remember repeated repo-specific review patterns across sessions.
- A platform lead pings Hermes from chat to review a risky infra patch while Hermes is running on a remote box.
- The agent loads `critique-review`, reads the repo instruction layer, and applies the same severity contract it would use in terminal.
- Follow-up sessions get stronger because Hermes can retain the conventions and failure patterns that matter for that codebase.
- The review discipline becomes a durable capability, not a one-time conversation artifact.
That makes Hermes the best home for `critique-review` when the business problem is not just PR review in one IDE, but review discipline that needs to survive across surfaces and time. If your engineering workflow spills from terminal to chat ops to remote agents, Hermes gives the skill the widest runway.
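The progressive-disclosure loading described in this section can be illustrated with a toy registry. This is not Hermes's implementation: the `Skill` and `SkillRegistry` classes are hypothetical. The method names `skills_list` and `skill_view` are borrowed from the text only to show the two-step pattern of a compact index first and the full body on demand.

```python
from dataclasses import dataclass


@dataclass
class Skill:
    name: str
    description: str
    body: str  # full instructions, only surfaced on demand


class SkillRegistry:
    """Toy registry illustrating progressive disclosure."""

    def __init__(self, skills):
        self._skills = {s.name: s for s in skills}

    def skills_list(self):
        """Compact index: names and one-line descriptions only."""
        return [(s.name, s.description) for s in self._skills.values()]

    def skill_view(self, name):
        """Full skill text, read only when a task actually needs it."""
        return self._skills[name].body


registry = SkillRegistry([
    Skill(
        name="critique-review",
        description="Structured PR review procedure",  # illustrative text
        body="(full review instructions would go here)",
    ),
])

# The cheap index is what sits in context by default.
print(registry.skills_list())
# The expensive body is fetched only once a review request arrives.
print(registry.skill_view("critique-review"))
```

The design point is cost: the index stays small no matter how long each skill is, so installing a detailed review procedure does not tax sessions that never ask for a review.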
Case study pattern three: Codex as the team-shared review standard
Codex is the strongest fit when you want a review skill to become part of the repository, not just part of one user’s setup. OpenAI’s Codex docs now treat skills as first-class reusable workflows. Codex documents a skill directory with optional scripts, references, and `agents/openai.yaml`, plus explicit invocation and implicit selection. OpenAI also documents `AGENTS.md` as the custom-instruction layer for Codex across its surfaces.

The most important Codex signal is not just that skills exist. It is that OpenAI is publicly documenting team use of repo-local skills, `AGENTS.md`, and GitHub Actions to turn repeated engineering tasks into repeatable workflows. In OpenAI’s own write-up about OSS maintenance, they report 457 merged PRs across two Agents SDK repos in the December 2025 to February 2026 window, up from 316 in the previous three months, with repo-local skills and `AGENTS.md` called out as part of the setup. That does not prove every skill boosts throughput on every team. It does show that OpenAI is operationalizing the exact pattern this review skill belongs to.
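For scale, the change between the two quoted windows works out as follows. The snippet uses only the two figures cited above; the variable names are illustrative. It measures the size of the increase and says nothing about what caused it.

```python
previous_merged = 316  # previous three months, as quoted above
current_merged = 457   # December 2025 to February 2026, as quoted above

increase = current_merged - previous_merged
relative_increase = increase / previous_merged

print(f"{increase} more merged PRs, about {relative_increase:.1%} higher")
```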
For a team using Codex app, CLI, or IDE extension, `critique-review` becomes the shared review grammar. The same repository can tell Codex how the team wants review to work, the same skill can be invoked locally or automatically, and the same procedure can be read by any collaborator who opens the repo. This is where the free skill starts to feel less like content and more like infrastructure.
What benefits show up across all three harnesses
| Benefit | Why it matters |
|---|---|
| Stable review artifact | The output becomes comparable across runs because the agent is pushed into findings, checks, and verdict instead of free-form summary. |
| Less prompt drift | You stop re-explaining your review philosophy in every new session or every new tool. |
| Better false-positive control | The skill explicitly tells the agent to verify claims and downgrade uncertainty into questions or residual risk. |
| Cross-tool continuity | Teams can change harnesses or run several at once without resetting their review discipline. |
| Cleaner governance | Repo policy lives in `AGENTS.md` or the harness instruction layer; procedure lives in the skill; that separation scales better than giant monolithic prompts. |
| Independent reviewer identity | The code-writing agent no longer has to invent a review persona on the fly. |
When the free skill is enough, and when it should turn into Critique
For many teams, the free portable skill is enough at first. If you are trying to improve how Claude, Hermes, or Codex reviews diffs locally, inside chat, or in a narrow internal workflow, start there. It is cheap, transparent, and it teaches the team what better review output actually looks like.
But a skill is still only a skill. It does not by itself give you GitHub-native checks, merge-boundary control, shared policy enforcement, auditable review artifacts, review routing across multiple specialist lanes, or a product surface built specifically for high-volume pull-request operations. That is the moment where the recommendation should become explicit: move from the free skill into Critique.
Start with the skill. Move to Critique when review becomes a control problem.
Download `critique-review` for Claude Code, Hermes Agent, or Codex when you want a portable review standard. Move into Critique when you want that same standard enforced on the GitHub pull request with policy, artifacts, and a real merge-boundary surface.