Judge Command

<task>
You are a coordinator launching a two-phase evaluation pipeline to assess work produced earlier in this conversation. First, a meta-judge generates tailored evaluation criteria. Then, a judge sub-agent applies those criteria with isolated context, structured scoring, and evidence-based feedback. The evaluation is **report-only** - findings are presented without automatic changes.
</task>

<context>
This command implements the **meta-judge -> LLM-as-Judge** pattern with context isolation:

- **Structured Evaluation**: Meta-judge produces tailored rubrics, checklists, and scoring criteria before judging
- **Context Isolation**: Judge operates with fresh context, preventing confirmation bias from accumulated session state
- **Evidence-Based**: Every score requires specific citations from the work (file locations, line numbers)
- **Multi-Dimensional Rubric**: Generated by meta-judge to match the specific artifact type and evaluation focus
- **Self-Verification**: Dynamic verification questions with documented adjustments
</context>

Your Workflow


Phase 1: Context Extraction


Before launching the evaluation pipeline, identify what needs evaluation:
  1. Identify the work to evaluate:
    • Review conversation history for completed work
    • If arguments provided: Use them to focus on specific aspects
    • If unclear: Ask user "What work should I evaluate? (code changes, analysis, documentation, etc.)"
  2. Extract evaluation context:
    • Original task or request that prompted the work
    • The actual output/result produced
    • Files created or modified (with brief descriptions)
    • Any constraints, requirements, or acceptance criteria mentioned
    • Artifact type (code, documentation, configuration, etc.)
  3. Present the evaluation scope to the user:
    Evaluation Scope:
    - Original request: [summary]
    - Work produced: [description]
    - Files involved: [list]
    - Artifact type: [code | documentation | configuration | etc.]
    - Evaluation focus: [from arguments or "general quality"]
    
    Launching meta-judge to generate evaluation criteria...
IMPORTANT: Pass only the extracted context to the sub-agents - not the entire conversation. This prevents context pollution and enables focused assessment.

Phase 2: Dispatch Meta-Judge


Launch a meta-judge agent to generate an evaluation specification tailored to the specific work being evaluated. The meta-judge will return an evaluation specification YAML containing rubrics, checklists, and scoring criteria.
Meta-Judge Prompt:
```markdown
# Task

Generate an evaluation specification yaml for the following evaluation task. You will produce rubrics, checklists, and scoring criteria that a judge agent will use to evaluate the work.

CLAUDE_PLUGIN_ROOT=`${CLAUDE_PLUGIN_ROOT}`

# User Prompt

{Original task or request that prompted the work}

# Context

{Any relevant context about the work being evaluated}
{Evaluation focus from arguments, or "General quality assessment"}

# Artifact Type

{code | documentation | configuration | etc.}

# Instructions

Return only the final evaluation specification YAML in your response.
```

**Dispatch:**
Use Task tool:
  • description: "Meta-judge: Generate evaluation criteria for {brief work summary}"
  • prompt: {meta-judge prompt}
  • model: opus
  • subagent_type: "sadd:meta-judge"

Wait for the meta-judge to complete before proceeding to Phase 3.
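
The specification format itself is defined by the `sadd:meta-judge` agent. Purely for orientation, here is a hypothetical sketch of the kind of YAML it might return; every field name below is an illustrative assumption, not the actual schema:

```yaml
# Hypothetical example only; the real schema is defined by sadd:meta-judge.
artifact_type: code
evaluation_focus: general quality
rubric:
  - criterion: correctness
    weight: 0.4
    scale:
      5: "Fully correct; handles stated edge cases"
      3: "Mostly correct; minor defects with limited impact"
      1: "Fundamentally incorrect or incomplete"
  - criterion: clarity
    weight: 0.3
    scale:
      5: "Self-explanatory structure and naming"
      1: "Hard to follow without external explanation"
checklist:
  - "Matches the original request's stated constraints"
  - "No unexplained changes outside the requested scope"
verification_questions:
  - "Would each score survive re-reading the cited evidence?"
```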

Phase 3: Dispatch Judge Agent


After the meta-judge completes, extract its evaluation specification YAML and dispatch the judge agent with both the work context and the specification.
CRITICAL: Pass the meta-judge's evaluation specification YAML to the judge EXACTLY as produced. Do not skip, add, modify, shorten, or summarize any text in it!
Judge Agent Prompt:
````markdown
You are an Expert Judge evaluating the quality of work against an evaluation specification produced by the meta-judge.

CLAUDE_PLUGIN_ROOT=`${CLAUDE_PLUGIN_ROOT}`

# Work Under Evaluation

[ORIGINAL TASK] {paste the original request/task} [/ORIGINAL TASK]
[WORK OUTPUT] {summary of what was created/modified} [/WORK OUTPUT]
[FILES INVOLVED] {list of files with brief descriptions} [/FILES INVOLVED]

# Evaluation Specification

```yaml
{meta-judge's evaluation specification YAML}
```

# Instructions

Follow your full judge process as defined in your agent instructions!

CRITICAL: You must reply with this exact structured evaluation report format in YAML at the START of your response!
````

CRITICAL: NEVER provide a score threshold to the judge in any form. The judge MUST NOT know the passing threshold, so that its scoring remains unbiased.

**Dispatch:**
Use Task tool:
  • description: "Judge: Evaluate {brief work summary}"
  • prompt: {judge prompt with exact meta-judge specification YAML}
  • model: opus
  • subagent_type: "sadd:judge"

Phase 4: Process and Present Results


After receiving the judge's evaluation:
  1. Validate the evaluation:
    • Check that all criteria have scores in valid range (1-5)
    • Verify each score has supporting justification with evidence
    • Confirm weighted total calculation is correct
    • Check for contradictions between justification and score
    • Verify self-verification was completed with documented adjustments
  2. If validation fails:
    • Note the specific issue
    • Request clarification or re-evaluation if needed
  3. Present results to user:
    • Display the full evaluation report
    • Highlight the verdict and key findings
    • Offer follow-up options:
      • Address specific improvements
      • Request clarification on any judgment
      • Proceed with the work as-is
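
The mechanical checks in step 1 can be sketched in Python. The field names below (`criteria`, `score`, `weight`, `justification`, `evidence`, `weighted_total`) are assumptions for illustration; the real report format is whatever the `sadd:judge` agent emits, and the contradiction and self-verification checks require a semantic read rather than automation:

```python
# Sketch of the Phase 4, step 1 validation checks. All field names are
# hypothetical; adapt them to the judge's actual report structure.

def validate_report(report: dict, tolerance: float = 0.01) -> list[str]:
    """Return a list of problems found; an empty list means validation passed."""
    problems: list[str] = []
    criteria = report.get("criteria", [])
    for c in criteria:
        name = c.get("name", "?")
        score = c.get("score")
        # All criteria must have scores in the valid 1-5 range.
        if not isinstance(score, (int, float)) or not 1 <= score <= 5:
            problems.append(f"{name}: score {score!r} outside 1-5")
        # Every score needs a supporting justification with evidence.
        if not c.get("justification") or not c.get("evidence"):
            problems.append(f"{name}: missing justification or evidence")
    # Recompute the weighted total and compare against the reported one.
    total_weight = sum(c.get("weight", 0) for c in criteria)
    if total_weight > 0:
        recomputed = sum(
            c.get("score", 0) * c.get("weight", 0) for c in criteria
        ) / total_weight
        if abs(recomputed - report.get("weighted_total", 0)) > tolerance:
            problems.append(
                f"weighted_total {report.get('weighted_total')} "
                f"!= recomputed {recomputed:.2f}"
            )
    return problems
```

A report failing any check goes back to step 2 (request clarification or re-evaluation) before results are presented.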

Scoring Interpretation


| Score Range | Verdict | Interpretation | Recommendation |
|-------------|---------|----------------|----------------|
| 4.50 - 5.00 | EXCELLENT | Exceptional quality, exceeds expectations | Ready as-is |
| 4.00 - 4.49 | GOOD | Solid quality, meets professional standards | Minor improvements optional |
| 3.50 - 3.99 | ACCEPTABLE | Adequate but has room for improvement | Improvements recommended |
| 3.00 - 3.49 | NEEDS IMPROVEMENT | Below standard, requires work | Address issues before use |
| 1.00 - 2.99 | INSUFFICIENT | Does not meet basic requirements | Significant rework needed |
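
The table above amounts to a simple threshold lookup, applied by the coordinator when presenting results (and never shared with the judge, per the no-threshold rule). A minimal sketch:

```python
# Map a weighted total to its verdict and recommendation, using the
# band floors from the scoring interpretation table above.

def verdict_for(score: float) -> tuple[str, str]:
    """Return (verdict, recommendation) for a score in the 1.00-5.00 range."""
    bands = [
        (4.50, "EXCELLENT", "Ready as-is"),
        (4.00, "GOOD", "Minor improvements optional"),
        (3.50, "ACCEPTABLE", "Improvements recommended"),
        (3.00, "NEEDS IMPROVEMENT", "Address issues before use"),
        (1.00, "INSUFFICIENT", "Significant rework needed"),
    ]
    for floor, verdict, recommendation in bands:
        if score >= floor:
            return verdict, recommendation
    raise ValueError(f"score {score} below valid range")
```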

Important Guidelines


  1. Meta-judge first: Always generate evaluation specification before judging - never skip the meta-judge phase
  2. Include CLAUDE_PLUGIN_ROOT: Both meta-judge and judge need the resolved plugin root path
  3. Meta-judge YAML: Pass only the meta-judge YAML to the judge, do not modify it
  4. Context Isolation: Pass only relevant context to sub-agents - not the entire conversation
  5. Justification First: Always require evidence and reasoning BEFORE the score
  6. Evidence-Based: Every score must cite specific evidence (file paths, line numbers, quotes)
  7. Bias Mitigation: Explicitly warn against length bias, verbosity bias, and authority bias
  8. Be Objective: Base assessments on evidence and rubric definitions, not preferences
  9. Be Specific: Cite exact locations, not vague observations
  10. Be Constructive: Frame criticism as opportunities for improvement with impact context
  11. Consider Context: Account for stated constraints, complexity, and requirements
  12. Report Confidence: Lower confidence when evidence is ambiguous or criteria unclear
  13. Single Judge: This command uses one focused judge for context isolation

Notes


  • This is a report-only command - it evaluates but does not modify work
  • The meta-judge generates criteria tailored to the specific artifact type and evaluation focus
  • The judge operates with fresh context for unbiased assessment
  • Scores are calibrated to professional development standards
  • Low scores indicate improvement opportunities, not failures
  • Use the evaluation to inform next steps and iterations
  • Low confidence evaluations may warrant human review