# Code Review Orchestrator
## READ-ONLY CONSTRAINT
THIS SKILL IS STRICTLY READ-ONLY. NEVER modify, write, edit, or delete any user files. NEVER run commands that have side effects (no destructive shell commands, no file writes, no git commits, no database mutations). If a fix is needed, generate a copy-pasteable fix prompt that the user can run separately. Violations of this constraint are NEVER acceptable, regardless of user request.
## 1. Command Routing
Parse the user's input to extract a subcommand and target path. The input format is:
`/codeprobe [subcommand] [path]`
### Routing Table
| Command | Behavior | Sub-skills Invoked |
|---|---|---|
| `/codeprobe audit <path>` | Full audit — visual health dashboard (category scores, codebase stats, hot spots) followed by detailed P0-P3 findings with fix prompts | All available sub-skills + the codebase stats script |
| `/codeprobe solid <path>` | SOLID principles analysis only | `codeprobe-solid` |
| `/codeprobe security <path>` | Security audit only | `codeprobe-security` |
| `/codeprobe smells <path>` | Code smells detection only | `codeprobe-smells` |
| `/codeprobe architecture <path>` | Architecture analysis only | `codeprobe-architecture` |
| `/codeprobe patterns <path>` | Design patterns analysis only | `codeprobe-patterns` |
| `/codeprobe performance <path>` | Performance audit only | `codeprobe-performance` |
| `/codeprobe errors <path>` | Error handling audit only | `codeprobe-errors` |
| `/codeprobe tests <path>` | Test quality audit only | `codeprobe-testing` |
| `/codeprobe framework <path>` | Framework best practices only | `codeprobe-framework` |
| `/codeprobe quick <path>` | Top 5 issues — run all sub-skills in scan mode, then generate full detail for top 5 | All available |
| `/codeprobe diff [branch]` | PR-style review on changed files vs a branch | All relevant (Phase 3) |
| `/codeprobe report` | Generate report from last audit | `scripts/generate_report.py` (Phase 3) |
### Default Behaviors
- No subcommand given: Ask the user what they want. Present the available commands.
- No path given: Use the current working directory.
- Phase 3 stubs: If the user invokes `/codeprobe diff` or `/codeprobe report`, respond: "This feature is coming in Phase 3. Available now: audit, solid, security, smells, architecture, patterns, performance, errors, tests, framework, quick."
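The routing and default behaviors above can be sketched as a small parser. This is an illustrative sketch, not part of the skill itself — the function name and the `KNOWN` set are assumptions:

```python
import shlex

# Subcommands from the routing table; "diff" and "report" are Phase 3 stubs.
KNOWN = {"audit", "solid", "security", "smells", "architecture", "patterns",
         "performance", "errors", "tests", "framework", "quick", "diff", "report"}

def parse_invocation(raw, cwd="."):
    """Split '/codeprobe [subcommand] [path]' into (subcommand, path).

    Returns (None, path) when no subcommand is given (the orchestrator then
    asks the user), and defaults the path to the current working directory.
    """
    parts = shlex.split(raw)
    if parts and parts[0] == "/codeprobe":
        parts = parts[1:]
    sub = parts[0] if parts and parts[0] in KNOWN else None
    rest = parts[1:] if sub else parts
    return sub, (rest[0] if rest else cwd)
```

For example, `parse_invocation("/codeprobe security src/")` yields `("security", "src/")`, and a bare `/codeprobe` yields `(None, ".")`.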
## 2. Stack Auto-Detection
Before routing to any sub-skill, detect the technology stack at the target path. This informs which reference guides to load and pass to sub-skills.
### Detection Procedure
- Use Glob to scan file extensions at the target path (recursive, reasonable depth).
- Apply the following detection rules — multiple stacks can match simultaneously:

| Signal | Stack Detected | Reference to Load |
|---|---|---|
| `.php` files | PHP / Laravel | `references/php-laravel.md` |
| `.js`, `.ts`, `.jsx`, `.tsx` files | JavaScript / TypeScript | `references/javascript-typescript.md` |
| `.py` files | Python | |
| `.jsx`, `.tsx` files + `package.json` present | React / Next.js | `references/react-nextjs.md` |
| `.sql` files or a migrations directory | SQL / Database | `references/sql-database.md` |
| API routes directory or API route patterns | API Design | |
- For each detected stack, attempt to load the corresponding reference file using Read. If the file does not exist yet (Phase 2+), skip silently.
- Collect all loaded references into a context bundle to pass to sub-skills.
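The detection pass reduces to a lookup from observed file extensions to stacks. A minimal sketch (rule set abridged to four stacks; the table above is authoritative, and the stack keys are illustrative):

```python
from pathlib import Path

# Abridged detection rules; multiple stacks may match simultaneously.
RULES = {
    "php-laravel": {".php"},
    "javascript-typescript": {".js", ".ts", ".jsx", ".tsx"},
    "python": {".py"},
    "sql-database": {".sql"},
}

def detect_stacks(paths):
    """Return every stack whose signal extensions appear among the files."""
    exts = {Path(p).suffix for p in paths}
    return {stack for stack, signals in RULES.items() if exts & signals}
```

A mixed codebase matches several stacks at once, which is exactly the behavior the reference-loading step relies on.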
### Reference Loading
References are loaded from the `references/` directory within this skill's own directory. Resolve the path relative to this SKILL.md file's location, NOT the user's project. Use Read with `references/{reference-file}.md` (this resolves to the `references/` folder next to this SKILL.md file).
If a reference file does not exist, continue without it. Never fail the review because a reference is missing.
## 3. Config Loading
Check for a codeprobe config file in the project root (the target path or its ancestor directories).
### Config Schema
```json
{
  "severity_overrides": {
    "long_method_loc": 50,
    "large_class_loc": 500,
    "deep_nesting_max": 4,
    "max_constructor_deps": 6
  },
  "skip_categories": ["codeprobe-testing"],
  "skip_rules": ["SPEC-GEN-001"],
  "framework": "laravel",
  "extra_references": [],
  "report_format": "markdown"
}
```
### Config Behavior
- If absent: All defaults apply. No error.
- `severity_overrides`: Pass to sub-skills so they adjust thresholds accordingly.
- `skip_categories`: Do not invoke the listed sub-skills, even in audit or quick mode.
- `skip_rules`: Pass to sub-skills so they suppress findings with matching IDs.
- `framework`: If set, skip auto-detection for that framework and force-load the corresponding reference. Other auto-detection still proceeds.
- `extra_references`: Additional reference file paths to load and pass to sub-skills.
- `report_format`: Output format preference (default: `markdown`).
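Config resolution is a plain overlay of user values onto defaults. A sketch (defaults mirror the schema above; the function name is an assumption):

```python
DEFAULTS = {
    "severity_overrides": {},
    "skip_categories": [],
    "skip_rules": [],
    "framework": None,
    "extra_references": [],
    "report_format": "markdown",
}

def load_config(user_config):
    """Overlay a (possibly absent) project config onto the defaults."""
    merged = dict(DEFAULTS)
    merged.update(user_config or {})
    return merged
```

An absent config (`load_config(None)`) simply yields the defaults, satisfying the "no error" rule above.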
## 4. Sub-Skill Execution
### Pre-Loading Phase (runs once before any sub-skill)
Before invoking any sub-skill, the orchestrator MUST pre-load all shared context:
- Load the shared preamble: Read the shared preamble file in this skill's directory. This contains the output contract, execution modes, and constraints shared by all sub-skills.
- Pre-load source files: Read all source files at the target path:
  - Use Glob to find all source files (source extensions for the detected stacks, plus project config files).
  - Read each file using Read.
  - Size cap: If the codebase has more than 50 source files or total LOC exceeds 10,000 lines, do NOT pre-load all files. Instead, pass only the file listing (paths + line counts) and let sub-agents read files they need. Note this in the agent prompt: "Large codebase — file listing provided, use Read for files you need to inspect."
  - Store all file contents as a map from filepath to content.
- Load references: Read all applicable reference files (already loaded during stack detection in Section 2). Store the content.
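The size cap above is a simple predicate — preload full contents only when both limits hold. An illustrative sketch (function name and signature are assumptions):

```python
def should_preload(file_line_counts, max_files=50, max_loc=10_000):
    """Preload full file contents only for codebases under both caps;
    otherwise pass just the file listing and let sub-agents Read on demand."""
    return (len(file_line_counts) <= max_files
            and sum(file_line_counts.values()) <= max_loc)
```

Either cap alone is enough to trigger listing-only mode: 60 small files fail the file-count cap, and one 20,000-line file fails the LOC cap.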
### Invocation Protocol
For each sub-skill to run, spawn an Agent with a prompt that includes:
- The shared preamble — output contract, modes, constraints.
- The sub-skill name to invoke (e.g., `codeprobe-security`).
- The mode — one of `full` or `scan`.
- Pre-loaded source files — the full content of every source file, formatted as:
  ```
  === FILE: {filepath} ===
  {content}
  === END FILE ===
  ```
- Pre-loaded references — the content of all applicable reference files.
- Config overrides — severity overrides and skip rules from the project config.
- Target path — so the sub-skill knows the project root for any targeted lookups.
The sub-skill's own SKILL.md contains only its domain-specific detection logic. All shared context (output format, modes, source code, references) comes from the orchestrator's prompt.
Collect findings returned by each sub-skill in the standard output contract format (Section 5).
### Execution Modes
| Mode | Used By | Behavior |
|---|---|---|
| `full` | `audit`, `solid`, `security`, etc. | Run complete analysis, return all findings |
| `scan` | `quick` | Count violations, identify top issues, return only counts + top 5 candidates |
### Execution Order
- `audit`: Run all 9 sub-skills sequentially, in the order listed under Available Sub-Skills below — all in `full` mode. Collect all findings. Apply deduplication (Section 7A). Derive category scores from severity counts. Compute hot spots by aggregating findings per file and ranking by distinct-categories-flagged. Also run the codebase stats script (skip gracefully if Python 3 unavailable).
- `quick`: Run all 9 sub-skills in `scan` mode. Collect candidate issues from all. Rank by severity (critical > major > minor > suggestion), then select top 5. Re-run relevant sub-skills in `full` mode for just those 5 findings to get complete detail.
### Available Sub-Skills
- `codeprobe-security` — Security vulnerability detection
- `codeprobe-errors` — Error handling & resilience
- `codeprobe-solid` — SOLID principles analysis
- `codeprobe-architecture` — Architecture analysis
- `codeprobe-patterns` — Design patterns advisor
- `codeprobe-performance` — Performance & scalability
- `codeprobe-smells` — Code smell detection
- `codeprobe-testing` — Test quality & coverage
- `codeprobe-framework` — Framework-specific best practices
## 5. Output Contract
Every finding from every sub-skill MUST include these fields:
| Field | Required | Description |
|---|---|---|
| `id` | Yes | Unique identifier in `{CATEGORY}-{NNN}` format (e.g., `SRP-001`, `SEC-007`) |
| `severity` | Yes | One of: `critical`, `major`, `minor`, `suggestion` |
| `location` | Yes | File path + line range (e.g., `src/UserService.php:45-67`) |
| `problem` | Yes | One sentence describing the issue |
| `evidence` | Yes | Concrete proof from the code — quote the relevant lines |
| `suggestion` | Yes | What to do to fix it |
| `fix_prompt` | Yes | A copy-pasteable prompt the user can give to Claude Code to apply the fix |
| `improved_code` | No | Optional code snippet showing the improved version |
### Finding Format Example
### SRP-001 | Major | `src/UserService.php:45-67`
**Problem:** UserService violates Single Responsibility — handles authentication, email sending, and database queries in one class.
**Evidence:**
> Lines 45-50: `public function authenticate($credentials) { ... }`
> Lines 52-60: `public function sendWelcomeEmail($user) { ... }`
> Lines 62-67: `public function findByUsername($name) { ... }`
**Suggestion:** Extract email logic into a dedicated `UserMailer` service and database queries into a `UserRepository`.
**Fix prompt:**
> Refactor `src/UserService.php` to follow Single Responsibility Principle: extract `sendWelcomeEmail()` into a new `UserMailer` class and `findByUsername()` into a `UserRepository` class. Keep `authenticate()` in `UserService` and inject the new dependencies.
## 6. Severity Levels
| Level | Emoji | Meaning | Examples |
|---|---|---|---|
| Critical | 🔴 | Confirmed bugs, exploitable security vulnerabilities, or data loss/corruption risks that would cause harm in production | SQL injection with user input, missing auth on data-mutating endpoint, race condition causing data corruption, unhandled crash on a core path, missing DB transaction on multi-step writes |
| Major | 🟠 | Significant maintainability, reliability, or scalability problem that increases risk but is not an immediate production defect | Missing tests for critical business logic, large classes, code duplication, missing error handling on external calls, N+1 queries, missing input validation |
| Minor | 🟡 | Code smell, low risk, worth addressing for long-term health | Magic numbers, deep nesting, poor naming, missing edge case tests, verbose error details |
| Suggestion | 🔵 | Improvement idea, nice to have, no real risk if ignored | Pattern opportunities, style improvements, speculative generality |
### Severity Guardrails
The following are NEVER Critical — classify as Major at most:
- Missing tests (even for critical business logic)
- Code duplication or large classes/files
- Code smells of any kind
- Framework convention violations
- Missing documentation, comments, or type annotations
Critical is reserved exclusively for:
- Confirmed bugs (code that produces wrong results or crashes)
- Exploitable security vulnerabilities (injection, auth bypass, IDOR with proof)
- Data loss or corruption risks (missing transactions, race conditions on writes)
- Sensitive data exposure (secrets in code, credentials in logs)
Sub-skills: do NOT escalate findings beyond the severity specified in your detection table. If your table says "Major," report it as Major even if the specific instance seems severe. The orchestrator's scoring formula accounts for finding counts at each level.
## 7. Scoring
After collecting all findings, compute scores per category and an overall score.
### Category Score Formula
Each penalty component is capped to prevent a single severity level from dominating the score:
```
crit_penalty = min(50, critical_count * 15)
major_penalty = min(30, major_count * 6)
minor_penalty = min(10, minor_count * 2)
category_score = max(0, 100 - crit_penalty - major_penalty - minor_penalty)
```
Suggestions do not affect the score.
Rationale: Diminishing returns prevent a single severity from flooring the score. A category with 4 criticals scores 50 (not 0), reflecting that problems exist but the code is not completely broken. The maximum total penalty from all three levels combined is 90, so a score of 0 requires extreme findings across all severities.
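Expressed as code, the capped formula is a direct transcription of the pseudocode above:

```python
def category_score(critical, major, minor):
    """Score one category; each severity's penalty is capped, suggestions are free."""
    crit_penalty = min(50, critical * 15)
    major_penalty = min(30, major * 6)
    minor_penalty = min(10, minor * 2)
    return max(0, 100 - crit_penalty - major_penalty - minor_penalty)
```

Note that `category_score(4, 0, 0)` evaluates to 50, because the critical penalty (4 × 15 = 60) is capped at 50.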
### Category Weights
| Category | Weight |
|---|---|
| Security | 20% |
| SOLID | 15% |
| Architecture | 15% |
| Error Handling | 12% |
| Performance | 12% |
| Test Quality | 10% |
| Code Smells | 8% |
| Design Patterns | 4% |
| Framework | 4% |
All 9 categories are active. Weights sum to 100%.
### Overall Score
```
overall = sum(category_score_i * weight_i for each active category)
```
If `skip_categories` in the project config excludes some categories, normalize by dividing by the sum of active weights:
```
overall = sum(category_score_i * weight_i for each active category) / sum(weight_i for each active category)
```
Clamp the result to the range [0, 100].
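The weighted overall score with skip-normalization can be sketched as follows (the category keys are illustrative shorthand for the nine categories in the weights table):

```python
WEIGHTS = {
    "security": 0.20, "solid": 0.15, "architecture": 0.15,
    "errors": 0.12, "performance": 0.12, "testing": 0.10,
    "smells": 0.08, "patterns": 0.04, "framework": 0.04,
}

def overall_score(category_scores):
    """Weighted mean over active categories, renormalized when some categories
    are skipped, then clamped to [0, 100]."""
    active = {c: s for c, s in category_scores.items() if c in WEIGHTS}
    total_weight = sum(WEIGHTS[c] for c in active)
    if total_weight == 0:
        return 0.0
    raw = sum(s * WEIGHTS[c] for c, s in active.items()) / total_weight
    return max(0.0, min(100.0, raw))
```

With only security (80) and solid (60) active, the score renormalizes to (80·0.20 + 60·0.15) / 0.35 ≈ 71.4 rather than being dragged down by the skipped categories.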
### Score Interpretation
| Range | Status |
|---|---|
| 80-100 | Healthy |
| 60-79 | Needs Attention |
| 0-59 | Critical |
## 7A. Cross-Category Deduplication
Before computing scores, deduplicate findings that flag the same issue from multiple categories.
### Deduplication Procedure
1. Group findings by location. Normalize each finding's location to `{file}:{start_line}-{end_line}`. Two findings overlap if they share the same file AND their line ranges overlap (i.e., `start_line_A <= end_line_B AND start_line_B <= end_line_A`).
2. For each group of overlapping findings from different categories:
   a. Select a primary finding. Use this priority order:
      - Security findings (SEC) take priority for anything involving auth, injection, or data exposure
      - Error Handling findings (ERR) take priority for exception/validation issues
      - Performance findings (PERF) take priority for query/caching issues
      - SOLID findings (SRP/OCP/LSP/ISP/DIP) take priority for structural violations
      - Architecture findings (ARCH) take priority for layer/boundary violations
      - If still ambiguous, the category with the higher weight (Section 7) wins
   b. Mark duplicates. For each non-primary finding in the group, append to its problem field: `[Duplicate of {primary_id} — counted there]` and change its severity to suggestion so it does not affect the score of its own category.
   c. Cross-reference the primary. Append to the primary finding's evidence field: `Also flagged by: {list of duplicate category:id pairs}`
3. Recount severity totals per category after deduplication, then proceed to scoring.
### Examples
- "Refresh bypasses quota" found as SEC-007, ERR-011, FW-001 at same location: keep SEC-007, mark ERR-011 and FW-001 as duplicates (severity → suggestion).
- "God component" found as SRP-001, SMELL-001, ARCH-005 at same file: keep SRP-001 (SOLID priority for structural), mark others as duplicates.
- Same SRP violation found as SRP-001 and SMELL-001: keep SRP-001, mark SMELL-001 as duplicate.
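The overlap test from step 1 is the standard interval-intersection check. A sketch using `(file, start_line, end_line)` tuples (the tuple representation is an assumption; the condition itself is the one stated above):

```python
def overlaps(a, b):
    """Two (file, start_line, end_line) locations overlap iff they share a file
    and start_A <= end_B and start_B <= end_A."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]
```

Grouping then collects findings whose locations pairwise overlap, after which the priority rules pick the primary.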
## 8. Report Rendering
Render the final output based on the command used.
### `/codeprobe audit` — Full Audit Report
Use the template at `templates/full-audit-report.md` (loaded via Read). The template opens with a visual health dashboard (category scores, codebase stats, hot spots) and then uses a tiered output format for findings to control token usage. Render order:
- Dashboard header: a `Code Health Report — {project}` title line and `Overall Health: {score}/100 {status_emoji}`, where status is derived from the thresholds in the "Status thresholds" block below.
- Category score bars: a 10-character block-character bar proportional to the score for each of the 9 categories (Architecture, Security, Framework, Performance, SOLID, Design Patterns, Code Smells, Test Quality, Error Handling), followed by the numeric score.
- Codebase Stats: output of the codebase stats script (total files, LOC, backend/frontend split, largest file, test file ratio, comment ratio). If Python 3 is unavailable, omit this block and note: "Install Python 3 for codebase statistics."
- Hot Spots: top 3 files by distinct-categories-flagged (computed from the same findings that feed the scores).
- Horizontal rule.
- Executive Summary: 2-3 sentences covering the most important findings.
- Critical findings — Full detail: Each critical finding rendered with evidence, suggestion, and fix prompt. These are the most important and justify the token cost.
- Major findings — Summary table: One row per major finding with ID, file, problem, and fix prompt. No evidence block (saves ~200 tokens per finding).
- Minor findings — Counts only: Aggregated count per category. No individual findings listed.
- Suggestions — Counts only: Aggregated count per category. No individual findings listed.
- Prioritized Fix Order: Ordered list of all critical and major fix prompts, ranked by impact.
If the template does not exist, render inline following the same structure.
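One way to produce the 10-character category score bar described above (the specific block glyphs are an assumption — any proportional bar satisfies the template):

```python
def score_bar(score, width=10):
    """Render a block-character bar proportional to a 0-100 score."""
    filled = round(score / 100 * width)
    return "█" * filled + "░" * (width - filled)
```

For example, `score_bar(73)` renders as `███████░░░`.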
Status thresholds (applied to overall health and each category score):
- 80-100 = "Healthy"
- 60-79 = "Needs Attention"
- 0-59 = "Critical"
Token budget guidance: For a codebase with ~100 findings, the tiered findings format (steps 7-10) targets ~8,000-12,000 tokens (vs ~40,000 with full detail for all findings). The dashboard adds a small fixed cost (~400 tokens). The user can drill into specific categories with `/codeprobe security <path>`, etc., for full detail on any category.
### `/codeprobe quick` — Quick Review Summary
Use the template at `templates/quick-review-summary.md` (loaded via Read). If the template does not exist yet, render inline:
- Header: Project name, "Quick Review — Top 5 Issues".
- Top 5 Findings: Full detail for the 5 most impactful issues, each with fix prompt.
- Summary Counts: Total issues found by severity across all categories.
- Next Step: Suggest running `/codeprobe audit <path>` for the complete picture.
## 9. Claude.ai Degraded Mode
Detect whether filesystem access is available. If the user has pasted or uploaded code rather than providing a file path, or if Read/Glob/Grep tools are unavailable:
- Switch to degraded mode: Analyze only the in-context code provided.
- Execute sub-skills sequentially on the pasted code (no parallel agents).
- Skip the codebase stats script and all script-dependent steps.
- Skip stack auto-detection, config loading, and the Codebase Stats row of the audit dashboard (still render scores, hot spots, and findings).
- Inform the user: "Running in Claude.ai mode — some features like codebase statistics, diff review, and multi-file analysis are unavailable. Analyzing the provided code directly."
- Still produce findings in the standard output contract format.
- Still compute scores based on findings from available sub-skills.
## 10. Phase 3 Stubs
When the user invokes a command that routes to an unbuilt feature, respond with:
Not yet available. This feature is coming in Phase 3. Currently available commands:
- `/codeprobe audit <path>` — Full code audit
- `/codeprobe solid <path>` — SOLID principles check
- `/codeprobe security <path>` — Security audit
- `/codeprobe smells <path>` — Code smells detection
- `/codeprobe architecture <path>` — Architecture analysis
- `/codeprobe patterns <path>` — Design patterns analysis
- `/codeprobe performance <path>` — Performance audit
- `/codeprobe errors <path>` — Error handling audit
- `/codeprobe tests <path>` — Test quality audit
- `/codeprobe framework <path>` — Framework best practices
- `/codeprobe quick <path>` — Top 5 issues
## 11. Execution Flow Summary
When `/codeprobe` is invoked, execute this sequence:
- Parse command: Extract subcommand and target path from user input.
- Validate command: Check routing table. If Phase 3 stub, respond with stub message.
- Resolve target path: Use provided path or default to current working directory.
- Load config: Check for the codeprobe config file at the project root. Apply defaults if absent.
- Auto-detect stack: Scan target path for technology signals. Load matching references.
- Apply config overrides: If `framework` is set in config, adjust detection. Apply `skip_categories` and `skip_rules`.
- Execute sub-skills: Route to appropriate sub-skills based on command and mode.
- Collect findings: Aggregate all findings in the output contract format.
- Deduplicate findings: Apply the cross-category deduplication procedure (Section 7A). Adjust severity of duplicates to suggestion. Recount severity totals per category.
- Compute scores: Calculate per-category and overall scores using the post-deduplication severity counts and the formulas in Section 7.
- Render report: Format output using the appropriate template or inline format. Use the tiered output format for `audit`.
- Present to user: Display the final report.
Remember: This entire process is READ-ONLY. At no point do we modify any user files.