Launch a meta-judge then a judge sub-agent to evaluate results produced in the current conversation
Install the skill with:

```bash
npx skill4agent add neolabhq/context-engineering-kit judge
```

### Phase 1: Establish Evaluation Scope

Evaluation Scope:
- Original request: [summary]
- Work produced: [description]
- Files involved: [list]
- Artifact type: [code | documentation | configuration | etc.]
- Evaluation focus: [from arguments or "general quality"]
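For illustration, a filled-in scope for a hypothetical task might read (all names and values below are illustrative):
- Original request: add retry logic to the HTTP client
- Work produced: exponential backoff added to `http_client.py`, with unit tests
- Files involved: `http_client.py`, `test_http_client.py`
- Artifact type: code
- Evaluation focus: error handling and test coverage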
### Phase 2: Generate Evaluation Criteria (Meta-Judge)

Launching meta-judge to generate evaluation criteria...

## Task
Generate an evaluation specification YAML for the following evaluation task. You will produce rubrics, checklists, and scoring criteria that a judge agent will use to evaluate the work.
CLAUDE_PLUGIN_ROOT=`${CLAUDE_PLUGIN_ROOT}`
## User Prompt
{Original task or request that prompted the work}
## Context
{Any relevant context about the work being evaluated}
## Evaluation Focus

{Evaluation focus from arguments, or "General quality assessment"}
## Artifact Type
{code | documentation | configuration | etc.}
## Instructions
Return only the final evaluation specification YAML in your response.

**Dispatch:** Use Task tool:
- description: "Meta-judge: Generate evaluation criteria for {brief work summary}"
- prompt: {meta-judge prompt}
- model: opus
- subagent_type: "sadd:meta-judge"

### Phase 3: Launch the Judge

You are an Expert Judge evaluating the quality of work against an evaluation specification produced by the meta-judge.
CLAUDE_PLUGIN_ROOT=`${CLAUDE_PLUGIN_ROOT}`
## Work Under Evaluation
[ORIGINAL TASK]
{paste the original request/task}
[/ORIGINAL TASK]
[WORK OUTPUT]
{summary of what was created/modified}
[/WORK OUTPUT]
[FILES INVOLVED]
{list of files with brief descriptions}
[/FILES INVOLVED]
## Evaluation Specification
```yaml
{meta-judge's evaluation specification YAML}
```
CRITICAL: NEVER provide score thresholds to the judge in any format. The judge must not know what threshold the score will be measured against, so that its assessment is not biased toward passing or failing the work.
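For illustration, a specification returned by the meta-judge might look like the following sketch. The criteria, weights, and field names are hypothetical, not a schema this command mandates; note the weights sum to 1.0 so the weighted total stays on the 1-5 scale:

```yaml
# Hypothetical specification - structure and names are illustrative
artifact_type: code
evaluation_focus: error handling and test coverage
criteria:
  - name: correctness
    weight: 0.4
    rubric:
      5: Handles all inputs correctly, including edge cases
      3: Core behavior is correct; some edge cases are unhandled
      1: Fails on common inputs
  - name: clarity
    weight: 0.3
    rubric:
      5: Self-explanatory names and structure throughout
      3: Readable, but some sections need comments
      1: Hard to follow without external explanation
  - name: completeness
    weight: 0.3
    rubric:
      5: All requested functionality and tests are present
      3: Main functionality present; gaps in tests or docs
      1: Major requested pieces are missing
checklist:
  - All new functions have at least one test
  - No debug output is left in committed code
```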
**Dispatch:** Launch the judge with the Task tool, mirroring the meta-judge dispatch above.
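For example (the `sadd:judge` subagent type below is an assumption inferred from the `sadd:meta-judge` naming, not something this document confirms):
- description: "Judge: Evaluate {brief work summary}"
- prompt: {judge prompt, including the meta-judge's specification}
- model: opus
- subagent_type: "sadd:judge"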
### Phase 4: Process and Present Results
After receiving the judge's evaluation:
1. **Validate the evaluation**:
- Check that all criteria have scores in the valid range (1-5)
- Verify each score has supporting justification with evidence
- Confirm the weighted total calculation is correct (a worked example follows this list)
- Check for contradictions between justification and score
- Verify self-verification was completed with documented adjustments
2. **If validation fails**:
- Note the specific issue
- Request clarification or re-evaluation if needed
3. **Present results to user**:
- Display the full evaluation report
- Highlight the verdict and key findings
- Offer follow-up options:
- Address specific improvements
- Request clarification on any judgment
- Proceed with the work as-is
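To verify the weighted total, multiply each criterion's score by its weight and sum the products; with weights that sum to 1.0, the result stays on the 1-5 scale. For a hypothetical report with correctness 4 (weight 0.4), clarity 5 (weight 0.3), and completeness 3 (weight 0.3), the total is 0.4×4 + 0.3×5 + 0.3×3 = 1.6 + 1.5 + 0.9 = 4.0, which the table below maps to GOOD.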
## Scoring Interpretation
| Score Range | Verdict | Interpretation | Recommendation |
|-------------|---------|----------------|----------------|
| 4.50 - 5.00 | EXCELLENT | Exceptional quality, exceeds expectations | Ready as-is |
| 4.00 - 4.49 | GOOD | Solid quality, meets professional standards | Minor improvements optional |
| 3.50 - 3.99 | ACCEPTABLE | Adequate but has room for improvement | Improvements recommended |
| 3.00 - 3.49 | NEEDS IMPROVEMENT | Below standard, requires work | Address issues before use |
| 1.00 - 2.99 | INSUFFICIENT | Does not meet basic requirements | Significant rework needed |
## Important Guidelines
1. **Meta-judge first**: Always generate the evaluation specification before judging - never skip the meta-judge phase
2. **Include CLAUDE_PLUGIN_ROOT**: Both meta-judge and judge need the resolved plugin root path
3. **Meta-judge YAML**: Pass the meta-judge's YAML to the judge as-is; do not modify it
4. **Context Isolation**: Pass only relevant context to sub-agents - not the entire conversation
5. **Justification First**: Always require evidence and reasoning BEFORE the score
6. **Evidence-Based**: Every score must cite specific evidence (file paths, line numbers, quotes)
7. **Bias Mitigation**: Explicitly warn against length bias, verbosity bias, and authority bias
8. **Be Objective**: Base assessments on evidence and rubric definitions, not preferences
9. **Be Specific**: Cite exact locations, not vague observations
10. **Be Constructive**: Frame criticism as opportunities for improvement with impact context
11. **Consider Context**: Account for stated constraints, complexity, and requirements
12. **Report Confidence**: Lower confidence when evidence is ambiguous or criteria unclear
13. **Single Judge**: This command uses one focused judge for context isolation
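To make the justification-first, evidence-based, and confidence guidelines concrete, a single criterion entry in the judge's report might look like the following sketch (the field names and file references are illustrative, not a schema this document defines):

```yaml
# Illustrative entry - justification and evidence precede the score
- criterion: correctness
  justification: >
    Retry logic covers connection errors and timeouts, but the backoff
    cap is never applied, so delays grow unbounded after repeated failures.
  evidence:
    - "http_client.py:58 - max_delay parameter defined but unused"
    - "test_http_client.py - no test exercises more than 3 retries"
  score: 3
  confidence: medium
```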
## Notes
- This is a **report-only** command - it evaluates but does not modify work
- The meta-judge generates criteria tailored to the specific artifact type and evaluation focus
- The judge operates with fresh context for unbiased assessment
- Scores are calibrated to professional development standards
- Low scores indicate improvement opportunities, not failures
- Use the evaluation to inform next steps and iterations
- Low confidence evaluations may warrant human review