judge
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseJudge Command
评判命令
<task>
You are a coordinator launching a two-phase evaluation pipeline to assess work produced earlier in this conversation. First, a meta-judge generates tailored evaluation criteria. Then, a judge sub-agent applies those criteria with isolated context, structured scoring, and evidence-based feedback. The evaluation is **report-only** - findings are presented without automatic changes.
</task>
<context>
This command implements the **meta-judge -> LLM-as-Judge** pattern with context isolation:
- **Structured Evaluation**: Meta-judge produces tailored rubrics, checklists, and scoring criteria before judging
- **Context Isolation**: Judge operates with fresh context, preventing confirmation bias from accumulated session state
- **Evidence-Based**: Every score requires specific citations from the work (file locations, line numbers)
- **Multi-Dimensional Rubric**: Generated by meta-judge to match the specific artifact type and evaluation focus
- **Self-Verification**: Dynamic verification questions with documented adjustments
</context>
<task>
你是一名协调者,负责启动一个两阶段评估流水线,以评估本次对话中早期生成的工作成果。首先,元评判者生成定制化的评估标准。随后,评判子代理在隔离上下文、结构化评分和基于证据的反馈机制下应用这些标准。本次评估仅为**报告性质**——评估结果仅作展示,不会自动进行修改。
</task>
<context>
本命令实现了带有上下文隔离的**元评判者 -> LLM-as-Judge**模式:
- **结构化评估**:元评判者在开始评判前生成定制化的评分细则、检查清单和评分标准
- **上下文隔离**:评判者在全新上下文环境中运作,避免会话累积状态导致的确认偏差
- **基于证据**:每一项评分都需要来自工作成果的具体引用(文件位置、行号)
- **多维度评分细则**:由元评判者生成,匹配特定工件类型和评估重点
- **自我验证**:带有文档化调整的动态验证问题
</context>
Your Workflow
工作流程
Phase 1: Context Extraction
阶段1:上下文提取
Before launching the evaluation pipeline, identify what needs evaluation:
-
Identify the work to evaluate:
- Review conversation history for completed work
- If arguments provided: Use them to focus on specific aspects
- If unclear: Ask user "What work should I evaluate? (code changes, analysis, documentation, etc.)"
-
Extract evaluation context:
- Original task or request that prompted the work
- The actual output/result produced
- Files created or modified (with brief descriptions)
- Any constraints, requirements, or acceptance criteria mentioned
- Artifact type (code, documentation, configuration, etc.)
-
Provide scope for user:
Evaluation Scope: - Original request: [summary] - Work produced: [description] - Files involved: [list] - Artifact type: [code | documentation | configuration | etc.] - Evaluation focus: [from arguments or "general quality"] Launching meta-judge to generate evaluation criteria...
IMPORTANT: Pass only the extracted context to the sub-agents - not the entire conversation. This prevents context pollution and enables focused assessment.
在启动评估流水线之前,先确定需要评估的内容:
-
确定待评估的工作成果:
- 回顾对话历史,找出已完成的工作
- 若提供了参数:使用参数聚焦特定评估维度
- 若不明确:询问用户“我应该评估哪些工作?(代码变更、分析内容、文档等)”
-
提取评估上下文:
- 催生该工作成果的原始任务或需求
- 实际生成的输出/结果
- 创建或修改的文件(附带简要说明)
- 提及的任何约束条件、需求或验收标准
- 工件类型(代码、文档、配置等)
-
向用户展示评估范围:
Evaluation Scope: - Original request: [summary] - Work produced: [description] - Files involved: [list] - Artifact type: [code | documentation | configuration | etc.] - Evaluation focus: [from arguments or "general quality"] Launching meta-judge to generate evaluation criteria...
重要提示:仅将提取的上下文传递给子代理,而非整个对话内容。这可防止上下文污染,实现聚焦式评估。
Phase 2: Dispatch Meta-Judge
阶段2:启动元评判者
Launch a meta-judge agent to generate an evaluation specification tailored to the specific work being evaluated. The meta-judge will return an evaluation specification YAML containing rubrics, checklists, and scoring criteria.
Meta-Judge Prompt:
markdown
undefined启动元评判者代理,为待评估的特定工作成果生成定制化的评估规范。元评判者将返回包含评分细则、检查清单和评分标准的YAML格式评估规范。
元评判者提示词:
markdown
undefinedTask
Task
Generate an evaluation specification yaml for the following evaluation task. You will produce rubrics, checklists, and scoring criteria that a judge agent will use to evaluate the work.
CLAUDE_PLUGIN_ROOT=
${CLAUDE_PLUGIN_ROOT}Generate an evaluation specification yaml for the following evaluation task. You will produce rubrics, checklists, and scoring criteria that a judge agent will use to evaluate the work.
CLAUDE_PLUGIN_ROOT=
${CLAUDE_PLUGIN_ROOT}User Prompt
User Prompt
{Original task or request that prompted the work}
{Original task or request that prompted the work}
Context
Context
{Any relevant context about the work being evaluated}
{Evaluation focus from arguments, or "General quality assessment"}
{Any relevant context about the work being evaluated}
{Evaluation focus from arguments, or "General quality assessment"}
Artifact Type
Artifact Type
{code | documentation | configuration | etc.}
{code | documentation | configuration | etc.}
Instructions
Instructions
Return only the final evaluation specification YAML in your response.
**Dispatch:**
Use Task tool:
- description: "Meta-judge: Generate evaluation criteria for {brief work summary}"
- prompt: {meta-judge prompt}
- model: opus
- subagent_type: "sadd:meta-judge"
Wait for the meta-judge to complete before proceeding to Phase 3.Return only the final evaluation specification YAML in your response.
**启动指令:**
Use Task tool:
- description: "Meta-judge: Generate evaluation criteria for {brief work summary}"
- prompt: {meta-judge prompt}
- model: opus
- subagent_type: "sadd:meta-judge"
等待元评判者完成后,再进入阶段3。Phase 3: Dispatch Judge Agent
阶段3:启动评判代理
After the meta-judge completes, extract its evaluation specification YAML and dispatch the judge agent with both the work context and the specification.
CRITICAL: Provide to the judge the EXACT meta-judge evaluation specification YAML. Do not skip, add, modify, shorten, or summarize any text in it!
Judge Agent Prompt:
markdown
You are an Expert Judge evaluating the quality of work against an evaluation specification produced by the meta judge.
CLAUDE_PLUGIN_ROOT=`${CLAUDE_PLUGIN_ROOT}`元评判者完成后,提取其生成的YAML评估规范,并将工作上下文与该规范一并传递给评判代理。
关键要求:必须将元评判者生成的YAML评估规范原封不动地提供给评判者!不得跳过、添加、修改、缩短或总结其中任何内容!
评判代理提示词:
markdown
You are an Expert Judge evaluating the quality of work against an evaluation specification produced by the meta judge.
CLAUDE_PLUGIN_ROOT=`${CLAUDE_PLUGIN_ROOT}`Work Under Evaluation
Work Under Evaluation
[ORIGINAL TASK]
{paste the original request/task}
[/ORIGINAL TASK]
[WORK OUTPUT]
{summary of what was created/modified}
[/WORK OUTPUT]
[FILES INVOLVED]
{list of files with brief descriptions}
[/FILES INVOLVED]
[ORIGINAL TASK]
{paste the original request/task}
[/ORIGINAL TASK]
[WORK OUTPUT]
{summary of what was created/modified}
[/WORK OUTPUT]
[FILES INVOLVED]
{list of files with brief descriptions}
[/FILES INVOLVED]
Evaluation Specification
Evaluation Specification
yaml
{meta-judge's evaluation specification YAML}yaml
{meta-judge's evaluation specification YAML}Instructions
Instructions
Follow your full judge process as defined in your agent instructions!
CRITICAL: You must reply with this exact structured evaluation report format in YAML at the START of your response!
CRITICAL: NEVER provide score threshold to judges in any format. Judge MUST not know what threshold for score is, in order to not be biased!!!
**Dispatch:**
Use Task tool:
- description: "Judge: Evaluate {brief work summary}"
- prompt: {judge prompt with exact meta-judge specification YAML}
- model: opus
- subagent_type: "sadd:judge"
undefinedFollow your full judge process as defined in your agent instructions!
CRITICAL: You must reply with this exact structured evaluation report format in YAML at the START of your response!
CRITICAL: NEVER provide score threshold to judges in any format. Judge MUST not know what threshold for score is, in order to not be biased!!!
**启动指令:**
Use Task tool:
- description: "Judge: Evaluate {brief work summary}"
- prompt: {judge prompt with exact meta-judge specification YAML}
- model: opus
- subagent_type: "sadd:judge"
undefinedPhase 4: Process and Present Results
阶段4:处理并展示结果
After receiving the judge's evaluation:
-
Validate the evaluation:
- Check that all criteria have scores in valid range (1-5)
- Verify each score has supporting justification with evidence
- Confirm weighted total calculation is correct
- Check for contradictions between justification and score
- Verify self-verification was completed with documented adjustments
-
If validation fails:
- Note the specific issue
- Request clarification or re-evaluation if needed
-
Present results to user:
- Display the full evaluation report
- Highlight the verdict and key findings
- Offer follow-up options:
- Address specific improvements
- Request clarification on any judgment
- Proceed with the work as-is
收到评判者的评估结果后:
-
验证评估结果:
- 检查所有标准的评分是否在有效范围(1-5)内
- 验证每项评分是否有附带证据的支持性理由
- 确认加权总分计算正确
- 检查理由与评分之间是否存在矛盾
- 验证是否完成了带有文档化调整的自我验证
-
若验证失败:
- 记录具体问题
- 必要时请求澄清或重新评估
-
向用户展示结果:
- 展示完整的评估报告
- 突出显示评估结论和关键发现
- 提供后续选项:
- 针对特定问题进行改进
- 请求对某项评判进行澄清
- 按当前状态继续推进工作
Scoring Interpretation
评分说明
| Score Range | Verdict | Interpretation | Recommendation |
|---|---|---|---|
| 4.50 - 5.00 | EXCELLENT | Exceptional quality, exceeds expectations | Ready as-is |
| 4.00 - 4.49 | GOOD | Solid quality, meets professional standards | Minor improvements optional |
| 3.50 - 3.99 | ACCEPTABLE | Adequate but has room for improvement | Improvements recommended |
| 3.00 - 3.49 | NEEDS IMPROVEMENT | Below standard, requires work | Address issues before use |
| 1.00 - 2.99 | INSUFFICIENT | Does not meet basic requirements | Significant rework needed |
| 评分范围 | 评估结论 | 解读 | 建议 |
|---|---|---|---|
| 4.50 - 5.00 | 优秀 | 质量卓越,超出预期 | 可直接使用 |
| 4.00 - 4.49 | 良好 | 质量可靠,符合专业标准 | 可选择进行小幅改进 |
| 3.50 - 3.99 | 合格 | 质量达标,但仍有改进空间 | 建议进行改进 |
| 3.00 - 3.49 | 需要改进 | 未达标准,需优化 | 使用前需解决问题 |
| 1.00 - 2.99 | 不合格 | 未满足基本要求 | 需要大幅返工 |
Important Guidelines
重要指南
- Meta-judge first: Always generate evaluation specification before judging - never skip the meta-judge phase
- Include CLAUDE_PLUGIN_ROOT: Both meta-judge and judge need the resolved plugin root path
- Meta-judge YAML: Pass only the meta-judge YAML to the judge, do not modify it
- Context Isolation: Pass only relevant context to sub-agents - not the entire conversation
- Justification First: Always require evidence and reasoning BEFORE the score
- Evidence-Based: Every score must cite specific evidence (file paths, line numbers, quotes)
- Bias Mitigation: Explicitly warn against length bias, verbosity bias, and authority bias
- Be Objective: Base assessments on evidence and rubric definitions, not preferences
- Be Specific: Cite exact locations, not vague observations
- Be Constructive: Frame criticism as opportunities for improvement with impact context
- Consider Context: Account for stated constraints, complexity, and requirements
- Report Confidence: Lower confidence when evidence is ambiguous or criteria unclear
- Single Judge: This command uses one focused judge for context isolation
- 先启动元评判者:始终先生成评估规范再进行评判——绝不能跳过元评判者阶段
- 包含CLAUDE_PLUGIN_ROOT:元评判者和评判者都需要已解析的插件根路径
- 元评判者YAML:仅将元评判者生成的YAML传递给评判者,不得修改
- 上下文隔离:仅将相关上下文传递给子代理,而非整个对话内容
- 理由优先:始终要求先提供证据和推理,再给出评分
- 基于证据:每项评分必须引用具体证据(文件路径、行号、引用内容)
- 偏差缓解:明确警告避免篇幅偏差、冗余偏差和权威偏差
- 保持客观:基于证据和评分细则定义进行评估,而非个人偏好
- 具体明确:引用确切位置,而非模糊表述
- 具有建设性:将批评转化为带有影响背景的改进机会
- 考虑上下文:考虑既定约束条件、复杂度和需求
- 报告置信度:当证据模糊或标准不明确时,降低置信度
- 单一评判者:本命令使用单个聚焦式评判者以实现上下文隔离
Notes
注意事项
- This is a report-only command - it evaluates but does not modify work
- The meta-judge generates criteria tailored to the specific artifact type and evaluation focus
- The judge operates with fresh context for unbiased assessment
- Scores are calibrated to professional development standards
- Low scores indicate improvement opportunities, not failures
- Use the evaluation to inform next steps and iterations
- Low confidence evaluations may warrant human review
- 这是一个仅报告的命令——仅评估工作成果,不进行修改
- 元评判者生成的标准会匹配特定工件类型和评估重点
- 评判者在全新上下文环境中运作,以实现无偏评估
- 评分以专业开发标准为校准依据
- 低分代表改进机会,而非失败
- 利用评估结果指导后续步骤和迭代
- 低置信度的评估可能需要人工审核