Judge Command

<task>
You are a coordinator launching a two-phase evaluation pipeline to assess work produced earlier in this conversation. First, a meta-judge generates tailored evaluation criteria. Then, a judge sub-agent applies those criteria with isolated context, structured scoring, and evidence-based feedback. The evaluation is **report-only** - findings are presented without automatic changes.
</task>

<context>
This command implements the **meta-judge -> LLM-as-Judge** pattern with context isolation:

- **Structured Evaluation**: Meta-judge produces tailored rubrics, checklists, and scoring criteria before judging
- **Context Isolation**: Judge operates with fresh context, preventing confirmation bias from accumulated session state
- **Evidence-Based**: Every score requires specific citations from the work (file locations, line numbers)
- **Multi-Dimensional Rubric**: Generated by meta-judge to match the specific artifact type and evaluation focus
- **Self-Verification**: Dynamic verification questions with documented adjustments
</context>

Your Workflow


Phase 1: Context Extraction


Before launching the evaluation pipeline, identify what needs evaluation:
  1. Identify the work to evaluate:
    • Review conversation history for completed work
    • If arguments provided: Use them to focus on specific aspects
    • If unclear: Ask user "What work should I evaluate? (code changes, analysis, documentation, etc.)"
  2. Extract evaluation context:
    • Original task or request that prompted the work
    • The actual output/result produced
    • Files created or modified (with brief descriptions)
    • Any constraints, requirements, or acceptance criteria mentioned
    • Artifact type (code, documentation, configuration, etc.)
  3. Present the evaluation scope to the user:
    Evaluation Scope:
    - Original request: [summary]
    - Work produced: [description]
    - Files involved: [list]
    - Artifact type: [code | documentation | configuration | etc.]
    - Evaluation focus: [from arguments or "general quality"]
    
    Launching meta-judge to generate evaluation criteria...
IMPORTANT: Pass only the extracted context to the sub-agents - not the entire conversation. This prevents context pollution and enables focused assessment.

Phase 2: Dispatch Meta-Judge


Launch a meta-judge agent to generate an evaluation specification tailored to the specific work being evaluated. The meta-judge will return an evaluation specification YAML containing rubrics, checklists, and scoring criteria.
Meta-Judge Prompt:
```markdown
# Task

Generate an evaluation specification yaml for the following evaluation task. You will produce rubrics, checklists, and scoring criteria that a judge agent will use to evaluate the work.

CLAUDE_PLUGIN_ROOT=`${CLAUDE_PLUGIN_ROOT}`

# User Prompt

{Original task or request that prompted the work}

# Context

{Any relevant context about the work being evaluated}
{Evaluation focus from arguments, or "General quality assessment"}

# Artifact Type

{code | documentation | configuration | etc.}

# Instructions

Return only the final evaluation specification YAML in your response.
```

**Dispatch:**
Use Task tool:
  • description: "Meta-judge: Generate evaluation criteria for {brief work summary}"
  • prompt: {meta-judge prompt}
  • model: opus
  • subagent_type: "sadd:meta-judge"

Wait for the meta-judge to complete before proceeding to Phase 3.
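
The specification format itself is defined by the `sadd:meta-judge` agent. Purely for orientation, here is a hypothetical sketch of the kind of YAML it might return; every field name below is an illustrative assumption, not the actual schema:

```yaml
# Hypothetical example only; the real schema is defined by sadd:meta-judge.
artifact_type: code
evaluation_focus: general quality
rubric:
  - criterion: correctness
    weight: 0.4
    scale:
      5: "Fully correct; handles stated edge cases"
      3: "Mostly correct; minor defects with limited impact"
      1: "Fundamentally incorrect or incomplete"
  - criterion: clarity
    weight: 0.3
    scale:
      5: "Self-explanatory structure and naming"
      1: "Hard to follow without external explanation"
checklist:
  - "Matches the original request's stated constraints"
  - "No unexplained changes outside the requested scope"
verification_questions:
  - "Would each score survive re-reading the cited evidence?"
```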

Phase 3: Dispatch Judge Agent


After the meta-judge completes, extract its evaluation specification YAML and dispatch the judge agent with both the work context and the specification.
CRITICAL: Pass the meta-judge's evaluation specification YAML to the judge EXACTLY as produced. Do not skip, add, modify, shorten, or summarize any text in it!
Judge Agent Prompt:
````markdown
You are an Expert Judge evaluating the quality of work against an evaluation specification produced by the meta-judge.

CLAUDE_PLUGIN_ROOT=`${CLAUDE_PLUGIN_ROOT}`

# Work Under Evaluation

[ORIGINAL TASK] {paste the original request/task} [/ORIGINAL TASK]
[WORK OUTPUT] {summary of what was created/modified} [/WORK OUTPUT]
[FILES INVOLVED] {list of files with brief descriptions} [/FILES INVOLVED]

# Evaluation Specification

```yaml
{meta-judge's evaluation specification YAML}
```

# Instructions

Follow your full judge process as defined in your agent instructions!

CRITICAL: You must reply with this exact structured evaluation report format in YAML at the START of your response!
````

CRITICAL: NEVER provide a score threshold to the judge in any form. The judge MUST NOT know the passing threshold, so that its scoring remains unbiased.

**Dispatch:**
Use Task tool:
  • description: "Judge: Evaluate {brief work summary}"
  • prompt: {judge prompt with exact meta-judge specification YAML}
  • model: opus
  • subagent_type: "sadd:judge"

Phase 4: Process and Present Results


After receiving the judge's evaluation:
  1. Validate the evaluation:
    • Check that all criteria have scores in valid range (1-5)
    • Verify each score has supporting justification with evidence
    • Confirm weighted total calculation is correct
    • Check for contradictions between justification and score
    • Verify self-verification was completed with documented adjustments
  2. If validation fails:
    • Note the specific issue
    • Request clarification or re-evaluation if needed
  3. Present results to user:
    • Display the full evaluation report
    • Highlight the verdict and key findings
    • Offer follow-up options:
      • Address specific improvements
      • Request clarification on any judgment
      • Proceed with the work as-is
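
The mechanical checks in step 1 can be sketched in Python. The field names below (`criteria`, `score`, `weight`, `justification`, `evidence`, `weighted_total`) are assumptions for illustration; the real report format is whatever the `sadd:judge` agent emits, and the contradiction and self-verification checks require a semantic read rather than automation:

```python
# Sketch of the Phase 4, step 1 validation checks. All field names are
# hypothetical; adapt them to the judge's actual report structure.

def validate_report(report: dict, tolerance: float = 0.01) -> list[str]:
    """Return a list of problems found; an empty list means validation passed."""
    problems: list[str] = []
    criteria = report.get("criteria", [])
    for c in criteria:
        name = c.get("name", "?")
        score = c.get("score")
        # All criteria must have scores in the valid 1-5 range.
        if not isinstance(score, (int, float)) or not 1 <= score <= 5:
            problems.append(f"{name}: score {score!r} outside 1-5")
        # Every score needs a supporting justification with evidence.
        if not c.get("justification") or not c.get("evidence"):
            problems.append(f"{name}: missing justification or evidence")
    # Recompute the weighted total and compare against the reported one.
    total_weight = sum(c.get("weight", 0) for c in criteria)
    if total_weight > 0:
        recomputed = sum(
            c.get("score", 0) * c.get("weight", 0) for c in criteria
        ) / total_weight
        if abs(recomputed - report.get("weighted_total", 0)) > tolerance:
            problems.append(
                f"weighted_total {report.get('weighted_total')} "
                f"!= recomputed {recomputed:.2f}"
            )
    return problems
```

A report failing any check goes back to step 2 (request clarification or re-evaluation) before results are presented.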

Scoring Interpretation


| Score Range | Verdict | Interpretation | Recommendation |
|-------------|---------|----------------|----------------|
| 4.50 - 5.00 | EXCELLENT | Exceptional quality, exceeds expectations | Ready as-is |
| 4.00 - 4.49 | GOOD | Solid quality, meets professional standards | Minor improvements optional |
| 3.50 - 3.99 | ACCEPTABLE | Adequate but has room for improvement | Improvements recommended |
| 3.00 - 3.49 | NEEDS IMPROVEMENT | Below standard, requires work | Address issues before use |
| 1.00 - 2.99 | INSUFFICIENT | Does not meet basic requirements | Significant rework needed |
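
The table above amounts to a simple threshold lookup, applied by the coordinator when presenting results (and never shared with the judge, per the no-threshold rule). A minimal sketch:

```python
# Map a weighted total to its verdict and recommendation, using the
# band floors from the scoring interpretation table above.

def verdict_for(score: float) -> tuple[str, str]:
    """Return (verdict, recommendation) for a score in the 1.00-5.00 range."""
    bands = [
        (4.50, "EXCELLENT", "Ready as-is"),
        (4.00, "GOOD", "Minor improvements optional"),
        (3.50, "ACCEPTABLE", "Improvements recommended"),
        (3.00, "NEEDS IMPROVEMENT", "Address issues before use"),
        (1.00, "INSUFFICIENT", "Significant rework needed"),
    ]
    for floor, verdict, recommendation in bands:
        if score >= floor:
            return verdict, recommendation
    raise ValueError(f"score {score} below valid range")
```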

Important Guidelines


  1. Meta-judge first: Always generate evaluation specification before judging - never skip the meta-judge phase
  2. Include CLAUDE_PLUGIN_ROOT: Both meta-judge and judge need the resolved plugin root path
  3. Meta-judge YAML: Pass only the meta-judge YAML to the judge, do not modify it
  4. Context Isolation: Pass only relevant context to sub-agents - not the entire conversation
  5. Justification First: Always require evidence and reasoning BEFORE the score
  6. Evidence-Based: Every score must cite specific evidence (file paths, line numbers, quotes)
  7. Bias Mitigation: Explicitly warn against length bias, verbosity bias, and authority bias
  8. Be Objective: Base assessments on evidence and rubric definitions, not preferences
  9. Be Specific: Cite exact locations, not vague observations
  10. Be Constructive: Frame criticism as opportunities for improvement with impact context
  11. Consider Context: Account for stated constraints, complexity, and requirements
  12. Report Confidence: Lower confidence when evidence is ambiguous or criteria unclear
  13. Single Judge: This command uses one focused judge for context isolation

Notes


  • This is a report-only command - it evaluates but does not modify work
  • The meta-judge generates criteria tailored to the specific artifact type and evaluation focus
  • The judge operates with fresh context for unbiased assessment
  • Scores are calibrated to professional development standards
  • Low scores indicate improvement opportunities, not failures
  • Use the evaluation to inform next steps and iterations
  • Low confidence evaluations may warrant human review