# judge-with-debate
<task>
Evaluate solutions through multi-agent debate where independent judges analyze, challenge each other's assessments, and iteratively refine their evaluations until reaching consensus or maximum rounds.
</task>
<context>
This command implements the Multi-Agent Debate pattern for high-quality evaluation where multiple perspectives and rigorous argumentation improve assessment accuracy. Unlike single-pass evaluation, debate forces judges to defend their positions with evidence and consider counter-arguments.
Key benefits:
- Structured evaluation - Meta-judge produces tailored rubrics and criteria before judging begins
- Multiple perspectives - Three independent judges reduce individual bias
- Evidence-based debate - Judges defend positions with specific evidence from the solution and evaluation specification
- Iterative refinement - Up to 3 debate rounds drive convergence on accurate scores
- Shared specification - Meta-judge runs once; all judges across all rounds share the same evaluation specification
</context>
## Pattern: Debate-Based Evaluation
This command implements iterative multi-judge debate:
```
Phase 0: Setup
    mkdir -p .specs/reports
         |
Phase 0.5: Dispatch Meta-Judge
    Meta-Judge (Opus)
         |
    Evaluation Specification YAML
         |
Phase 1: Independent Analysis (3 judges in parallel)
             +- Judge 1 -> {name}.1.md -+
    Solution +- Judge 2 -> {name}.2.md -+-+
             +- Judge 3 -> {name}.3.md -+ |
         |                                |
Phase 2: Debate Round (iterative)         |
    Each judge reads others' reports      |
         |                                |
    Argue + Defend + Challenge            |
    (grounded in eval specification)      |
         |                                |
    Revise if convinced ------------------+
         |                                |
    Check consensus                       |
     +- Yes -> Final Report               |
     +- No -> Next Round -----------------+
```

## Process
### Setup: Create Reports Directory
Before starting evaluation, ensure the reports directory exists:
```bash
mkdir -p .specs/reports
```

Report naming convention:

```
.specs/reports/{solution-name}-{YYYY-MM-DD}.[1|2|3].md
```

Where:
- `{solution-name}` - Derived from the solution filename (e.g., `users-api` from `src/api/users.ts`)
- `{YYYY-MM-DD}` - Current date
- `[1|2|3]` - Judge number
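To make the convention concrete, here is a minimal sketch of how a report path could be derived; the helper and its derivation rule are hypothetical, not part of the command:

```python
from datetime import date
from pathlib import Path

def report_path(solution_file: str, judge: int) -> Path:
    """Hypothetical helper illustrating the naming convention.

    Combining file stem and parent directory is one plausible way to
    get "users-api" from "src/api/users.ts"; the real derivation is
    whatever the orchestrator chooses.
    """
    p = Path(solution_file)
    name = f"{p.stem}-{p.parent.name}"         # "users" + "api" -> "users-api"
    today = date.today().strftime("%Y-%m-%d")  # {YYYY-MM-DD}
    return Path(".specs/reports") / f"{name}-{today}.{judge}.md"

# report_path("src/api/users.ts", 2)
# -> .specs/reports/users-api-2025-01-15.2.md (for that date)
```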
### Phase 0.5: Dispatch Meta-Judge

Before independent analysis, dispatch a meta-judge agent to generate a tailored evaluation specification. The meta-judge runs ONCE and produces rubrics, checklists, and scoring criteria that ALL judges will use across ALL rounds.

Meta-judge prompt template:

```markdown
## Task

Generate an evaluation specification YAML for the following evaluation task. You will produce rubrics, checklists, and scoring criteria that multiple judge agents will use to evaluate the solution through independent analysis and multi-round debate.

CLAUDE_PLUGIN_ROOT=`${CLAUDE_PLUGIN_ROOT}`

## User Prompt
{task description - what the solution was supposed to accomplish}
## Context
{Any relevant context about the solution being evaluated}
## Artifact Type
{code | documentation | configuration | etc.}
## Evaluation Mode
Multi-judge debate with consensus-seeking across rounds
## Instructions
Return only the final evaluation specification YAML in your response.
The specification should support both independent analysis and debate-based refinement.
```
**Dispatch:**
Use Task tool:
- description: "Meta-judge: generate evaluation specification for {solution-name}"
- prompt: {meta-judge prompt}
- model: opus
- subagent_type: "sadd:meta-judge"
Wait for the meta-judge to complete and extract the evaluation specification YAML from its output before proceeding to Phase 1.
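For orientation, a minimal sketch of what a parsed specification might contain, assuming a criteria list with names, weights, scales, rubrics, and checklists; the actual schema is whatever the meta-judge emits:

```python
import yaml  # PyYAML, assumed available for this illustration

# Hypothetical shape only - the real schema is defined by the meta-judge.
spec = yaml.safe_load("""
criteria:
  - name: Correctness
    weight: 0.30
    scale: 1-5
    rubric: "5 = fully satisfies the task description; 1 = does not work"
    checklist:
      - "Handles documented inputs"
      - "Covers error paths"
  - name: Security
    weight: 0.20
    scale: 1-5
    rubric: "5 = no findings against the checklist; 1 = critical flaws"
""")

weights = {c["name"]: c["weight"] for c in spec["criteria"]}  # for weighted scoring
```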
### Phase 1: Independent Analysis
Launch 3 independent judge agents in parallel (Opus for rigor):
- Each judge receives:
  - Path to solution(s) being evaluated
  - The meta-judge's evaluation specification YAML
  - Task description
- Each produces an independent assessment saved to `.specs/reports/{solution-name}-{date}.[1|2|3].md`
- Reports must include:
  - Per-criterion scores with evidence
  - Specific quotes/examples supporting ratings
  - Overall weighted score
  - Key strengths and weaknesses

**Key principle:** Independence in initial analysis prevents groupthink.
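To make the fan-out concrete, a sketch of the three parallel dispatch payloads; `prompt_for(n)` is a hypothetical callable that renders the prompt template below with the shared specification and judge number baked in, and actual dispatch still goes through the Task tool as shown under Dispatch:

```python
def judge_dispatches(solution_name: str, prompt_for) -> list[dict]:
    """Build the three parallel Task-tool payloads for Phase 1.

    All three judges share the same evaluation specification; only the
    judge number (and thus the output file) differs.
    """
    return [
        {
            "description": f"Judge {n}: independent analysis of {solution_name}",
            "prompt": prompt_for(n),
            "model": "opus",               # Opus for rigor
            "subagent_type": "sadd:judge",
        }
        for n in (1, 2, 3)
    ]
```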
Prompt template for initial judges:

````markdown
You are Judge {N} evaluating a solution independently against an evaluation specification produced by the meta-judge.

CLAUDE_PLUGIN_ROOT=`${CLAUDE_PLUGIN_ROOT}`

## Solution
{path to solution file(s)}
## Task Description
{what the solution was supposed to accomplish}
## Evaluation Specification
```yaml
{meta-judge's evaluation specification YAML}
```

## Output File
.specs/reports/{solution-name}-{date}.{N}.md
## Instructions
Follow your full judge process as defined in your agent instructions!
Additional instructions:
- Read the solution thoroughly
- For each criterion from the evaluation specification:
  - Find specific evidence (quote exact text)
  - Score on the defined scale
  - Justify with concrete examples
- Calculate the weighted overall score
- Write a comprehensive report to {output_file}

At the beginning of the report, add: `Done by Judge {N}`
````
**Dispatch each judge:**
Use Task tool:
- description: "Judge {N}: independent analysis of {solution-name}"
- prompt: {judge prompt with evaluation specification YAML}
- model: opus
- subagent_type: "sadd:judge"
### Phase 2: Debate Rounds (Iterative)
For each debate round (max 3 rounds):
Launch 3 debate agents in parallel:
- Each judge agent receives:
  - Path to their own previous report (`.specs/reports/{solution-name}-{date}.{N}.md`)
  - Paths to the other judges' reports (`.specs/reports/{solution-name}-{date}.[1|2|3].md`)
  - The original solution
  - The meta-judge's evaluation specification YAML
- Each judge:
  - Identifies disagreements with other judges (>1 point score gap on any criterion)
  - Defends their own ratings with evidence from the solution and evaluation specification
  - Challenges other judges' ratings they disagree with
  - Considers counter-arguments
  - Revises their assessment if convinced
  - Updates their report file with a new section: `## Debate Round {R}`
- After the judges reply, if they have reached agreement, move to Phase 3: Consensus Report
**Key principle:** Judges communicate only through the filesystem. The orchestrator does not mediate and should not read the report files wholesale itself, as that can overflow its context.
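A minimal sketch of the append-only update each judge makes to its own report under this principle; the helper is illustrative, and the path follows the naming convention above:

```python
from pathlib import Path

def append_debate_section(report: Path, round_no: int, body: str) -> None:
    """Append a debate-round section to the judge's own report file.

    Appending (never rewriting or creating a new file) preserves the
    full debate history that later rounds and the final synthesis read.
    """
    with report.open("a", encoding="utf-8") as f:
        f.write(f"\n## Debate Round {round_no}\n\n{body}\n")

# append_debate_section(Path(".specs/reports/users-api-2025-01-15.1.md"), 1, "...")
```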
Prompt template for debate judges:

````markdown
You are Judge {N} in debate round {R}.

CLAUDE_PLUGIN_ROOT=`${CLAUDE_PLUGIN_ROOT}`

## Your Previous Report
{path to .specs/reports/{solution-name}-{date}.{N}.md}
## Other Judges' Reports
Judge 1: .specs/reports/{solution-name}-{date}.1.md
...
## Task Description
{what the solution was supposed to accomplish}
## Solution
{path to solution}
## Evaluation Specification
```yaml
{meta-judge's evaluation specification YAML}
```

## Output File
.specs/reports/{solution-name}-{date}.{N}.md (append to existing file)
## Instructions
Follow your full judge process as defined in your agent instructions!
Additional debate instructions:
- Read your previous assessment from {your_previous_report}
- Read all other judges' reports
- Identify disagreements (where your scores differ by >1 point)
- For each major disagreement:
  - State the disagreement clearly
  - Defend your position with evidence from the solution and evaluation specification
  - Challenge the other judge's position with counter-evidence
  - Consider whether their evidence changes your view
- Update your report file by APPENDING a debate round section
- Reply whether you reached agreement, and with which judges; include your revised overall and per-criterion scores
CRITICAL:
- Ground your arguments in the evaluation specification criteria
- Only revise if you find their evidence compelling
- Defend your original scores if you still believe them
- Quote specific evidence from the solution
````
**Dispatch each debate judge:**
Use Task tool:
- description: "Judge {N}: debate round {R} for {solution-name}"
- prompt: {debate judge prompt with evaluation specification YAML}
- model: opus
- subagent_type: "sadd:judge"
### Consensus Check
After each debate round, check for consensus (a sketch of this check follows the lists below):

Consensus achieved if:
- All judges' overall scores are within 0.5 points of each other
- No criterion has a gap of more than 1 point between any two judges
- All judges explicitly state they accept the consensus
If no consensus after 3 rounds:
- Report persistent disagreements
- Provide all judge reports for human review
- Flag that automated evaluation couldn't reach consensus
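A minimal sketch of the two score conditions, assuming each report has already been reduced to an overall score plus per-criterion scores (explicit acceptance is read from the judges' replies separately):

```python
def consensus_reached(overall: list[float],
                      criteria: dict[str, list[float]]) -> bool:
    """Check both numeric consensus conditions.

    overall:  one overall weighted score per judge, e.g. [4.2, 4.4, 4.5]
    criteria: scores per criterion across judges,
              e.g. {"Security": [4, 4, 5], "Correctness": [4, 4, 4]}
    """
    if max(overall) - min(overall) > 0.5:      # overall scores within 0.5 points
        return False
    return all(max(s) - min(s) <= 1.0          # every criterion within 1 point
               for s in criteria.values())
```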
### Orchestration Instructions
Step 1: Dispatch Meta-Judge (Phase 0.5)
- Launch meta-judge agent
- Wait for meta-judge to complete
- Extract the evaluation specification YAML from meta-judge output
Step 2: Run Independent Analysis (Phase 1)
- Launch 3 judge agents in parallel (Judge 1, 2, 3) with the evaluation specification YAML
- Each writes their independent assessment to `.specs/reports/{solution-name}-{date}.[1|2|3].md`
- Wait for all 3 agents to complete
Step 3: Check for Consensus
Let's work through this systematically to ensure accurate consensus detection.
Read all three reports and extract:
- Each judge's overall weighted score
- Each judge's score for every criterion
Check consensus step by step:
- First, extract all overall scores from each report and list them explicitly
- Calculate the difference between the highest and lowest overall scores
- If difference <= 0.5 points -> overall consensus achieved
- If difference > 0.5 points -> no consensus yet
- Next, for each criterion, list all three judges' scores side by side
- For each criterion, calculate the difference between highest and lowest scores
- If any criterion has difference > 1.0 point -> no consensus on that criterion
- Finally, verify consensus is achieved only if BOTH conditions are met:
- Overall scores within 0.5 points
- All criterion scores within 1.0 point
Step 4: Decision Point
- If consensus achieved: Go to Step 6 (Generate Consensus Report)
- If no consensus AND round < 3: Go to Step 5 (Run Debate Round)
- If no consensus AND round = 3: Go to Step 7 (Report No Consensus)
Step 5: Run Debate Round
- Increment round counter (round = round + 1)
- Launch 3 judge agents in parallel with the same evaluation specification YAML
- Each agent reads:
- Their own previous report from filesystem
- Other judges' reports from filesystem
- Original solution
- Each agent appends "Debate Round {R}" section to their own report file
- Wait for all 3 agents to complete
- Go back to Step 3 (Check for Consensus)
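Steps 3-5 form a loop; a compact sketch of that control flow, where `dispatch_judges` and `read_scores` are hypothetical stand-ins for Task-tool dispatch and report parsing, and `consensus_reached` is the check sketched under Consensus Check:

```python
MAX_ROUNDS = 3

def run_debate(dispatch_judges, read_scores, consensus_reached) -> str:
    """Orchestrate Steps 3-5 after Phase 1 has completed."""
    round_no = 0
    while True:
        overall, criteria = read_scores()            # Step 3: extract scores
        if consensus_reached(overall, criteria):     # Step 4: decision point
            return "consensus"                       # -> Step 6
        if round_no == MAX_ROUNDS:
            return "no-consensus"                    # -> Step 7
        round_no += 1                                # Step 5: increment round
        dispatch_judges(phase="debate", round_no=round_no)
        # loop back to Step 3 once all three judges complete
```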
Step 6: Reply with Report
Let's synthesize the evaluation results step by step.
- Read all final reports carefully
- Before generating the report, analyze the following:
- What is the consensus status (achieved or not)?
- What were the key points of agreement across all judges?
- What were the main areas of disagreement, if any?
- How did the debate rounds change the evaluations?
- Reply to the user with a report that contains:
  - If there is consensus:
    - Consensus scores (average of all judges)
    - Consensus strengths/weaknesses
    - Number of rounds to reach consensus
    - Final recommendation with clear justification
  - If there is no consensus:
    - All judges' final scores showing disagreements
    - Specific criteria where consensus wasn't reached
    - Analysis of why consensus couldn't be reached
    - Flag for human review
- Command complete
Step 7: Report No Consensus
- Report persistent disagreements
- Provide all judge reports for human review
- Flag that automated evaluation couldn't reach consensus
### Phase 3: Consensus Report

If consensus is achieved, synthesize the final report by working through each section methodically:

```markdown
# Consensus Evaluation Report
Let's compile the final consensus by analyzing each component systematically.
## Consensus Scores
First, let's consolidate all judges' final scores:
| Criterion | Judge 1 | Judge 2 | Judge 3 | Final |
|---|---|---|---|---|
| {Name} | {X}/5 | {X}/5 | {X}/5 | {X}/5 |
| ... |
Consensus Overall Score: {avg}/5.0
## Consensus Strengths
[Review each judge's identified strengths and extract the common themes that all judges agreed upon]
## Consensus Weaknesses
[Review each judge's identified weaknesses and extract the common themes that all judges agreed upon]
## Debate Summary
Let's trace how consensus was reached:
- Rounds to consensus: {N}
- Initial disagreements: {list with specific criteria and score gaps}
- How resolved: {for each disagreement, explain what evidence or argument led to resolution}
## Final Recommendation
Based on the consensus scores and the key strengths/weaknesses identified:
{Pass/Fail/Needs Revision with clear justification tied to the evidence}
```
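The `{avg}` in the template is the weighted overall consensus score; a minimal sketch of that arithmetic, assuming per-criterion weights taken from the evaluation specification:

```python
def consensus_overall(criteria: dict[str, list[float]],
                      weights: dict[str, float]) -> float:
    """Weighted overall consensus score.

    Each criterion's consensus value is the mean of the judges' scores;
    the overall score weights those means per the specification, e.g.
    Correctness [4, 4, 5] at weight 0.30 contributes 0.30 * 4.33.
    """
    return round(sum(weights[name] * sum(scores) / len(scores)
                     for name, scores in criteria.items()), 1)
```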
<output>
The command produces:
1. **Reports directory**: `.specs/reports/` (created if not exists)
2. **Initial reports**: `.specs/reports/{solution-name}-{date}.1.md`, `.specs/reports/{solution-name}-{date}.2.md`, `.specs/reports/{solution-name}-{date}.3.md`
3. **Debate updates**: Appended sections in each report file per round
4. **Final synthesis**: Replied to user (consensus or disagreement summary)
</output>

## Best Practices
### Meta-Judge + Judge Verification
- Never skip meta-judge - Tailored evaluation criteria produce better judgments and more grounded debates
- Meta-judge runs once - Same specification for all 3 judges across all debate rounds
- Include CLAUDE_PLUGIN_ROOT - Both meta-judge and judges need the resolved plugin root path
- Meta-judge YAML - Pass only the YAML to judges, do not modify it
- Debate grounding - Judges should reference evaluation specification criteria when defending positions
### Common Pitfalls
- Judges create new reports instead of appending - Loses debate history
- Orchestrator passes reports between judges - Violates filesystem communication principle
- Weak initial assessments - Garbage in, garbage out
- Too many debate rounds - Diminishing returns after 3 rounds
- Sycophancy in debate - Judges agree too easily without real evidence
- Modifying meta-judge YAML - Specification must be passed verbatim to all judges
- Re-running meta-judge between rounds - Specification is generated once and shared
### Do This
- Judges append to their own report file
- Judges read other reports from filesystem directly
- Strong evidence-based initial assessments
- Maximum 3 debate rounds
- Require evidence for changing positions
- Ground debate arguments in the evaluation specification criteria
- Use same evaluation specification across all rounds
## Example Usage

### Evaluating an API Implementation
```bash
/judge-with-debate Implement REST API for user management --solution "src/api/users.ts"
```

Phase 0.5 - Meta-Judge (assuming date 2025-01-15):
- Meta-judge generates evaluation specification YAML with criteria:
- Correctness (30%), Design (25%), Security (20%), Performance (15%), Documentation (10%)
- Rubrics, checklists, and scoring definitions for each criterion
Phase 1 - Independent Analysis (3 judges receive specification):
- `.specs/reports/users-api-2025-01-15.1.md` - Judge 1 scores correctness 4/5, security 3/5
- `.specs/reports/users-api-2025-01-15.2.md` - Judge 2 scores correctness 4/5, security 5/5
- `.specs/reports/users-api-2025-01-15.3.md` - Judge 3 scores correctness 5/5, security 4/5
Disagreement detected: Security scores range from 3-5
Phase 2 - Debate Round 1 (judges reference evaluation specification):
- Judge 1 defends 3/5: "Missing rate limiting, input validation incomplete per specification checklist item 4"
- Judge 2 challenges: "Rate limiting exists in middleware (line 45), satisfies specification rubric"
- Judge 1 revises to 4/5: "Missed middleware, but input validation still weak per specification"
- Judge 3 defends 4/5: "Input validation adequate for requirements as defined in specification"
Debate Round 1 outputs:
- All judges now 4-5/5 on security (within 1 point)
- Disagreement on input validation remains
Debate Round 2 (same evaluation specification):
- Judges examine specific validation code against specification criteria
- Judge 2 revises to 4/5: "Upon re-examination, email validation regex is weak per specification checklist"
- Consensus: Security = 4/5
Final consensus:
- Correctness: 4.3/5
- Design: 4.5/5
- Security: 4.0/5 (2 debate rounds to consensus)
- Performance: 4.7/5
- Documentation: 4.0/5

Overall: 4.3/5 - PASS