tree-of-thoughts
Tree of Thoughts
<task>
Execute complex reasoning tasks through systematic exploration of solution space, pruning unpromising branches, expanding viable approaches, and synthesizing the best solution.
</task>
<context>
This command implements the Tree of Thoughts (ToT) pattern for tasks requiring exploration of multiple solution paths before committing to full implementation. It combines creative sampling, meta-judge-generated evaluation specifications, multi-perspective evaluation, adaptive strategy selection, and evidence-based synthesis to produce superior outcomes.
Key benefits:
- Systematic exploration - Multiple agents explore different regions of the solution space
- Structured evaluation - Meta-judges produce tailored rubrics and criteria before judging
- Independent verification - Judges apply meta-judge specifications mechanically, reducing bias
- Adaptive strategy - Clear winners get polished, split decisions get synthesized, failures get redesigned
</context>
Pattern: Tree of Thoughts (ToT)
This command implements an eight-phase systematic reasoning pattern with meta-judge evaluation and adaptive strategy selection:

```
Phase 1: Exploration (Propose Approaches)
┌─ Agent A → Proposals A1, A2 (with probabilities) ─┐
Task ───┼─ Agent B → Proposals B1, B2 (with probabilities) ─┼─┐
└─ Agent C → Proposals C1, C2 (with probabilities) ─┘ │
│
Phase 1.5: Pruning Meta-Judge (runs in parallel with Phase 1) │
Meta-Judge → Pruning Evaluation Specification YAML ───┤
│
Phase 2: Pruning (Vote for Best 3) │
┌─ Judge 1 → Votes + Rationale ─┐ │
├─ Judge 2 → Votes + Rationale ─┼─────────────────────┤
└─ Judge 3 → Votes + Rationale ─┘ │
│ │
├─→ Select Top 3 Proposals │
│ │
Phase 3: Expansion (Develop Full Solutions) │
┌─ Agent A → Solution A (from proposal X) ─┐ │
├─ Agent B → Solution B (from proposal Y) ─┼──────────┤
└─ Agent C → Solution C (from proposal Z) ─┘ │
│
Phase 3.5: Evaluation Meta-Judge (runs in parallel w/ Phase 3)│
Meta-Judge → Evaluation Specification YAML ───────────┤
│
Phase 4: Evaluation (Judge Full Solutions) │
┌─ Judge 1 → Report 1 ─┐ │
├─ Judge 2 → Report 2 ─┼──────────────────────────────┤
└─ Judge 3 → Report 3 ─┘ │
│
Phase 4.5: Adaptive Strategy Selection │
Analyze Consensus ────────────────────────────────────┤
├─ Clear Winner? → SELECT_AND_POLISH │
├─ All Flawed (<3.0)? → REDESIGN (Phase 3) │
└─ Split Decision? → FULL_SYNTHESIS │
│ │
Phase 5: Synthesis (Only if FULL_SYNTHESIS) │
   Synthesizer ────────────────────┴──────────────────────┴─→ Final Solution
```
Process
Setup: Create Directory Structure
Before starting, ensure the directory structure exists:
```bash
mkdir -p .specs/research .specs/reports
```

Naming conventions:
- Proposals: `.specs/research/{solution-name}-{YYYY-MM-DD}.proposals.[a|b|c].md`
- Pruning: `.specs/research/{solution-name}-{YYYY-MM-DD}.pruning.[1|2|3].md`
- Selection: `.specs/research/{solution-name}-{YYYY-MM-DD}.selection.md`
- Evaluation: `.specs/reports/{solution-name}-{YYYY-MM-DD}.[1|2|3].md`

Where:
- `{solution-name}` - Derived from output path (e.g., `users-api` from output `specs/api/users.md`)
- `{YYYY-MM-DD}` - Current date
Note: Solutions remain in their specified output locations; only research and evaluation files go to `.specs/`.
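For illustration, here is a minimal Python sketch of this naming scheme; the exact `{solution-name}` derivation rule is an assumption inferred from the `users-api` example above:

```python
from datetime import date
from pathlib import Path

def research_paths(output_path: str, kind: str, ids: tuple) -> list:
    """Build the dated research file names described above (sketch only)."""
    p = Path(output_path)
    # Assumed rule: "specs/api/users.md" -> stem "users" + parent "api" -> "users-api"
    solution_name = f"{p.stem}-{p.parent.name}"
    today = date.today().isoformat()  # {YYYY-MM-DD}
    return [f".specs/research/{solution_name}-{today}.{kind}.{i}.md" for i in ids]

print(research_paths("specs/api/users.md", "proposals", ("a", "b", "c")))
# e.g. ['.specs/research/users-api-2025-01-15.proposals.a.md', ...]
```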
Phase 1: Exploration (Propose Approaches)
Launch 3 independent agents in parallel (recommended: Sonnet for speed):
- Each agent receives identical task description and context
- Each agent generates 6 high-level approaches (not full implementations)
- For each approach, agent provides:
- Approach description (2-3 paragraphs)
- Key design decisions and trade-offs
- Probability estimate (0.0-1.0)
- Estimated complexity (low/medium/high)
- Potential risks and failure modes
- Proposals saved to `.specs/research/{solution-name}-{date}.proposals.[a|b|c].md`
Key principle: Systematic exploration through probabilistic sampling from the full distribution of possible approaches.
Prompt template for explorers:
```markdown
<task>
{task_description}
</task>
<constraints>
{constraints_if_any}
</constraints>
<context>
{relevant_context}
</context>
<output>
{.specs/research/{solution-name}-{date}.proposals.[a|b|c].md - each agent gets unique letter identifier}
</output>
Instructions:
Let's approach this systematically by first understanding what we're solving, then exploring the solution space.
**Step 1: Decompose the problem**
Before generating approaches, break down the task:
- What is the core problem being solved?
- What are the key constraints and requirements?
- What subproblems must any solution address?
- What are the evaluation criteria for success?
**Step 2: Map the solution space**
Identify the major dimensions along which solutions can vary:
- Architecture patterns (e.g., monolithic vs distributed)
- Implementation strategies (e.g., eager vs lazy)
- Trade-off axes (e.g., performance vs simplicity)
**Step 3: Generate 6 distinct high-level approaches**
**Sampling guidance:**
Please sample approaches at random from the [full distribution / tails of the distribution]
- For first 3 approaches aim for high probability, over 0.80
- For last 3 approaches aim for diversity - explore different regions of the solution space, such that the probability of each response is less than 0.10
For each approach, provide:
- Name and one-sentence summary
- Detailed description (2-3 paragraphs)
- Key design decisions and rationale
- Trade-offs (what you gain vs what you sacrifice)
- Probability (0.0-1.0)
- Complexity estimate (low/medium/high)
- Potential risks and failure modes
**Step 4: Verify diversity**
Before finalizing, check:
- Are approaches genuinely different, not minor variations?
- Do they span different regions of the solution space?
- Have you covered both conventional and unconventional options?
CRITICAL:
- Do NOT implement full solutions yet - only high-level approaches
- Ensure approaches are genuinely different, not minor variations
```
Phase 1.5: Dispatch Pruning Meta-Judge
CRITICAL: Launch the pruning meta-judge in parallel with Phase 1 exploration agents. The meta-judge does not need exploration output to generate pruning criteria — it only needs the original task description.
The pruning meta-judge generates an evaluation specification (rubrics, checklist, scoring criteria) tailored to evaluating high-level proposals for pruning.
Prompt template for pruning meta-judge:
```markdown
## Task

Generate an evaluation specification yaml for pruning high-level solution proposals. You will produce rubrics, checklists, and scoring criteria that judge agents will use to select the top 3 proposals for full development.

CLAUDE_PLUGIN_ROOT=${CLAUDE_PLUGIN_ROOT}

## User Prompt

{Original task description from user}

## Context

{Any relevant codebase context, file paths, constraints}

## Artifact Type

proposals (high-level approaches with probability estimates, not full implementations)

## Evaluation Focus

Feasibility, alignment with requirements, potential for high-quality result, risk manageability

## Instructions

Return only the final evaluation specification YAML in your response.
The specification should support comparative evaluation and ranking of proposals.
```

**Dispatch:**

Use Task tool:
- description: "Pruning Meta-judge: {brief task summary}"
- prompt: {pruning meta-judge prompt}
- model: opus
- subagent_type: "sadd:meta-judge"

Phase 2: Pruning (Vote for Top 3 Candidates)
Wait for BOTH Phase 1 exploration agents AND Phase 1.5 pruning meta-judge to complete before proceeding.
Launch 3 independent judges in parallel (recommended: Opus for rigor):
- Each judge receives ALL proposal files (from `.specs/research/`) and the pruning meta-judge evaluation specification YAML
- Judges evaluate each proposal against the meta-judge-generated pruning criteria
- Each judge produces:
- Scores for each proposal (with evidence)
- Vote for top 3 proposals to expand
- Rationale for selections
- Votes saved to `.specs/research/{solution-name}-{date}.pruning.[1|2|3].md`
Key principle: Independent evaluation with meta-judge-generated criteria ensures consistent, tailored assessment without hardcoded weights.
CRITICAL: Provide to each judge the EXACT pruning meta-judge's evaluation specification YAML. Do not skip, add, modify, shorten, or summarize any text in it!
Prompt template for pruning judges:
````markdown
You are evaluating {N} proposed approaches against an evaluation specification produced by the meta judge, to select the top 3 for full development.
CLAUDE_PLUGIN_ROOT=`${CLAUDE_PLUGIN_ROOT}`

## Task

{task_description}

## Proposals

{list of paths to all proposal files}

Read all proposals carefully before evaluating.

## Evaluation Specification

```yaml
{pruning meta-judge's evaluation specification YAML}
```

## Output

{.specs/research/{solution-name}-{date}.pruning.[1|2|3].md}

## Instructions

Follow your full judge process as defined in your agent instructions!
CRITICAL: You must reply with this exact structured evaluation report format in YAML at the START of your response!
````

**Dispatch:**

Use Task tool:
- description: "Pruning Judge {1|2|3}: {brief task summary}"
- prompt: {pruning judge prompt with exact meta-judge specification YAML}
- model: opus
- subagent_type: "sadd:judge"

Phase 2b: Select Top 3 Proposals
After judges complete voting:
- Aggregate votes using ranked choice (see the sketch after this list):
  - 1st choice = 3 points
  - 2nd choice = 2 points
  - 3rd choice = 1 point
- Select top 3 proposals by total points
- Handle ties by comparing average scores across criteria
- Document selection in `.specs/research/{solution-name}-{date}.selection.md`:
  - Vote tallies
  - Selected proposals
  - Consensus rationale
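A minimal Python sketch of this aggregation, using hypothetical proposal IDs, ballots, and tie-breaking scores:

```python
from collections import defaultdict

# Hypothetical ballots: each judge lists its top 3 proposal IDs in order.
ballots = [
    ["resource-rest", "pure-rest", "monolithic"],      # Judge 1
    ["pure-rest", "hybrid", "resource-rest"],          # Judge 2
    ["resource-rest", "graphql-hybrid", "pure-rest"],  # Judge 3
]
# Hypothetical average criterion scores, used only to break ties.
avg_scores = {"resource-rest": 4.1, "pure-rest": 3.9, "hybrid": 3.5,
              "graphql-hybrid": 3.4, "monolithic": 3.2}

points = defaultdict(int)
for ballot in ballots:
    for rank, proposal in enumerate(ballot):
        points[proposal] += 3 - rank  # 1st = 3, 2nd = 2, 3rd = 1

# Rank by total points; ties fall back to average score across criteria.
top3 = sorted(points, key=lambda p: (points[p], avg_scores[p]), reverse=True)[:3]
print(top3)  # ['resource-rest', 'pure-rest', 'hybrid']
```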
Phase 3: Expansion (Develop Full Solutions)
Launch 3 independent agents in parallel (recommended: Opus for quality):
- Each agent receives:
- One selected proposal to expand
- Original task description and context
- Judge feedback from pruning phase (concerns, questions)
- Agent produces complete solution implementing the proposal:
- Full implementation details
- Addresses concerns raised by judges
- Documents key decisions made during expansion
- Solutions saved to `solution.a.md`, `solution.b.md`, `solution.c.md`
Key principle: Focused development of validated approaches with awareness of evaluation feedback.
Prompt template for expansion agents:
```markdown
You are developing a full solution based on a selected proposal.
<task>
{task_description}
</task>
<selected_proposal>
{write selected proposal EXACTLY as it is. Including all details provided by the agent}
Read this carefully - it is your starting point.
</selected_proposal>
<judge_feedback>
{concerns and questions from judges about this proposal}
Address these in your implementation.
</judge_feedback>
<output>
solution.[*].md where [*] is your unique identifier (a, b, or c)
</output>
Instructions:
Let's work through this systematically to ensure we build a complete, high-quality solution.
**Step 1: Understand the proposal deeply**
Before implementing, analyze:
- What is the core insight or approach of this proposal?
- What are the key design decisions already made?
- What gaps need to be filled for a complete solution?
**Step 2: Address judge feedback**
For each concern raised by judges:
- What specific change or addition addresses this concern?
- How does this change integrate with the proposal's approach?
**Step 3: Decompose into implementation subproblems**
Break the solution into logical parts:
- What are the main components or sections?
- What must be defined first for other parts to build upon?
- What are the dependencies between parts?
**Step 4: Implement each subproblem**
For each component, work through:
- Core functionality and behavior
- Edge cases and error handling
- Integration points with other components
**Step 5: Self-verification**
Generate 3-5 verification questions about critical aspects, then answer them:
- Review solution against each question
- Identify gaps or weaknesses
- Fix identified issues
**Step 6: Document changes**
Explain what was changed from the original proposal and why.
<example>
**Example of good expansion thinking:**
Proposal: "Use event-driven architecture with message queue"
Step 1 Analysis:
- Core insight: Decouple components via async messaging
- Key decisions: Events as primary communication, eventual consistency
- Gaps: Need to define event schemas, queue technology, error handling
Step 2 - Addressing judge concern "What about message ordering?":
- Add partition keys for ordered processing within entity scope
- Document ordering guarantees and limitations
Step 3 - Subproblems:
1. Event schema definitions (foundational - others depend on this)
2. Producer interfaces (depends on schemas)
3. Consumer handlers (depends on schemas)
4. Error handling and dead letter queues (depends on both)
5. Integration patterns (builds on all above)
</example>
CRITICAL:
- Stay faithful to the selected proposal's core approach
- Do not switch to a different approach midway
- Address judge feedback explicitly
- Produce a complete, implementable solution
```
Phase 3.5: Dispatch Evaluation Meta-Judge
CRITICAL: Launch the evaluation meta-judge in parallel with Phase 3 expansion agents. The meta-judge does not need expansion output to generate evaluation criteria — it only needs the original task description.
The evaluation meta-judge generates an evaluation specification (rubrics, checklist, scoring criteria) tailored to evaluating full solution implementations.
Prompt template for evaluation meta-judge:
```markdown
## Task

Generate an evaluation specification yaml for evaluating full solution implementations. You will produce rubrics, checklists, and scoring criteria that judge agents will use to evaluate and compare competitive implementations.

CLAUDE_PLUGIN_ROOT=${CLAUDE_PLUGIN_ROOT}

## User Prompt

{Original task description from user}

## Context

{Any relevant codebase context, file paths, constraints}

## Artifact Type

{code | documentation | configuration | etc.}

## Number of Solutions

3 (full implementations developed from selected proposals)

## Instructions

Return only the final evaluation specification YAML in your response.
The specification should support comparative evaluation across multiple solutions.
```

**Dispatch:**

Use Task tool:
- description: "Evaluation Meta-judge: {brief task summary}"
- prompt: {evaluation meta-judge prompt}
- model: opus
- subagent_type: "sadd:meta-judge"

Phase 4: Evaluation (Judge Full Solutions)
Wait for BOTH Phase 3 expansion agents AND Phase 3.5 evaluation meta-judge to complete before proceeding.
Launch 3 independent judges in parallel (recommended: Opus for rigor):
- Each judge receives ALL solution files (solution.a.md, solution.b.md, solution.c.md) and the evaluation meta-judge specification YAML
- Judges evaluate against the meta-judge-generated evaluation criteria
- Each judge produces:
- Comparative analysis (which solution excels where)
- Evidence-based ratings (with specific quotes/examples)
- Final vote (which solution they prefer and why)
- Reports saved to `.specs/reports/{solution-name}-{date}.[1|2|3].md`
Key principle: Multiple independent evaluations with meta-judge-generated specifications and explicit evidence reduce bias and catch different quality aspects.
CRITICAL: Provide to each judge the EXACT evaluation meta-judge's evaluation specification YAML. Do not skip, add, modify, shorten, or summarize any text in it!
CRITICAL: NEVER provide the score threshold to judges. Judges MUST NOT know the threshold, so their scoring is not biased toward it!
Prompt template for evaluation judges:
````markdown
You are evaluating {number} full solutions against an evaluation specification produced by the meta judge.
CLAUDE_PLUGIN_ROOT=`${CLAUDE_PLUGIN_ROOT}`

## Task

{task_description}

## Solutions

{list of paths to all solution files}

Read all solutions carefully before evaluating.

## Evaluation Specification

```yaml
{evaluation meta-judge's evaluation specification YAML}
```

## Output

Write full report to: .specs/reports/{solution-name}-{date}.[1|2|3].md

CRITICAL: You must reply with this exact structured header format:

VOTE: [Solution A/B/C]
SCORES:
Solution A: [X.X]/5.0
Solution B: [X.X]/5.0
Solution C: [X.X]/5.0
CRITERIA:
- {criterion_1}: [X.X]/5.0
- {criterion_2}: [X.X]/5.0
...

[Summary of your evaluation]

## Instructions

Follow your full judge process as defined in your agent instructions!
CRITICAL: You must reply with this exact structured evaluation report format in YAML at the START of your response!
````

**Dispatch:**

Use Task tool:
- description: "Evaluation Judge {1|2|3}: {brief task summary}"
- prompt: {evaluation judge prompt with exact meta-judge specification YAML}
- model: opus
- subagent_type: "sadd:judge"

Phase 4.5: Adaptive Strategy Selection (Early Return)
The orchestrator (not a subagent) analyzes judge outputs to determine the optimal strategy.
Decision Logic
Step 1: Parse structured headers from judge replies
Parse each judge's reply.
CRITICAL: Do not read report files themselves, as they can overflow your context.
Step 2: Check for unanimous winner
Compare all three VOTE values:
- If Judge 1 VOTE = Judge 2 VOTE = Judge 3 VOTE (same solution):
- Strategy: SELECT_AND_POLISH
- Reason: Clear consensus - all three judges prefer same solution
Step 3: Check if all solutions are fundamentally flawed
If no unanimous vote, calculate average scores:
- Average Solution A scores: (Judge1_A + Judge2_A + Judge3_A) / 3
- Average Solution B scores: (Judge1_B + Judge2_B + Judge3_B) / 3
- Average Solution C scores: (Judge1_C + Judge2_C + Judge3_C) / 3
If (avg_A < 3.0) AND (avg_B < 3.0) AND (avg_C < 3.0):
- Strategy: REDESIGN
- Reason: All solutions below quality threshold, fundamental approach issues
Step 4: Default to full synthesis
If none of the above conditions met:
- Strategy: FULL_SYNTHESIS
- Reason: Split decision with merit, synthesis needed to combine best elements
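The decision logic above can be expressed as a short Python sketch; it assumes the exact VOTE/SCORES header format from Phase 4 and well-formed replies from all three judges:

```python
import re

def choose_strategy(replies):
    """Phase 4.5 decision logic over the judges' structured reply headers."""
    # Steps 1-2: parse votes and check for a unanimous winner
    votes = [re.search(r"VOTE:\s*Solution\s+([ABC])", r).group(1) for r in replies]
    if len(set(votes)) == 1:
        return "SELECT_AND_POLISH"

    # Step 3: average each solution's score across judges
    scores = {"A": [], "B": [], "C": []}
    for reply in replies:
        for sol, val in re.findall(r"Solution ([ABC]):\s*([\d.]+)/5\.0", reply):
            scores[sol].append(float(val))
    if all(sum(vals) / len(vals) < 3.0 for vals in scores.values()):
        return "REDESIGN"

    # Step 4: default
    return "FULL_SYNTHESIS"
```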
Strategy 1: SELECT_AND_POLISH
When: Clear winner (unanimous votes)
Process:
- Select the winning solution as the base
- Launch subagent to apply specific improvements from judge feedback
- Cherry-pick 1-2 best elements from runner-up solutions
- Document what was added and why
Benefits:
- Saves synthesis cost (simpler than full synthesis)
- Preserves proven quality of winning solution
- Focused improvements rather than full reconstruction
Prompt template:
```markdown
You are polishing the winning solution based on judge feedback.
<task>
{task_description}
</task>
<winning_solution>
{path_to_winning_solution}
Score: {winning_score}/5.0
Judge consensus: {why_it_won}
</winning_solution>
<runner_up_solutions>
{list of paths to all runner-up solutions}
</runner_up_solutions>
<judge_feedback>
{list of paths to all evaluation reports}
</judge_feedback>
<output>
{final_solution_path}
</output>
Instructions:
Let's approach this polishing task methodically to improve without disrupting what works.
**Step 1: Understand why this solution won**
Analyze the winning solution:
- What are its core strengths that judges praised?
- What makes its approach superior to alternatives?
- Which parts should remain untouched?
**Step 2: Catalog improvement opportunities**
From judge feedback, identify:
- Specific weaknesses mentioned (list each one)
- Missing elements judges noted
- Areas where runner-ups were praised
**Step 3: Prioritize changes by impact**
For each improvement opportunity:
- High impact: Directly addresses judge criticism
- Medium impact: Adds praised element from runner-up
- Low impact: Nice-to-have refinement
Focus on high-impact changes first.
**Step 4: Apply improvements surgically**
For each change:
- Locate the specific section to modify
- Make the minimal change needed to address the issue
- Verify the change integrates cleanly with surrounding content
**Step 5: Cherry-pick from runners-up**
Review runner-up solutions for:
- 1-2 specific elements that judges praised
- Elements that complement (not conflict with) the winning approach
- Only incorporate if clearly superior to winning solution's version
**Step 6: Document all changes**
Record:
- What was changed and why (with reference to judge feedback)
- What was added from other solutions (cite source)
- What was intentionally left unchanged
CRITICAL: Preserve the winning solution's core approach. Make targeted improvements only.
```
Strategy 2: REDESIGN
When: All solutions scored <3.0/5.0 (fundamental issues across the board)
Process:
- Launch a new agent to analyze the failure modes and lessons learned
- Return to Phase 3 (Expansion), providing the new implementation agents with the lessons learned and new constraints
Note: If redesign fails twice, escalate to user for guidance.
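As a control-flow sketch only: the retry-then-escalate behavior might look like the following, where `choose` is the strategy selector from the Phase 4.5 sketch and `rerun` is a hypothetical callable that re-executes Phases 3-4 with the lessons learned:

```python
def run_with_redesign(choose, rerun, replies, max_redesigns=2):
    """Retry Phases 3-4 on REDESIGN; escalate after two failed redesigns."""
    redesigns = 0
    while choose(replies) == "REDESIGN":
        if redesigns == max_redesigns:
            raise RuntimeError("Redesign failed twice - escalate to user for guidance")
        redesigns += 1
        replies = rerun()  # hypothetical: new expansion + evaluation round
    return choose(replies)  # SELECT_AND_POLISH or FULL_SYNTHESIS
```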
Prompt template for new implementation:
```markdown
You are analyzing why all solutions failed to meet quality standards to inform a redesign, and implementing a new solution based on that analysis.
<task>
{task_description}
</task>
<constraints>
{constraints_if_any}
</constraints>
<context>
{relevant_context}
</context>
<failed_solutions>
{list of paths to all solution files}
Average scores: A={avg_a}/5.0, B={avg_b}/5.0, C={avg_c}/5.0
</failed_solutions>
<evaluation_reports>
{list of paths to all evaluation reports}
All solutions scored below 3.0/5.0 threshold.
</evaluation_reports>
<output>
.specs/research/{solution-name}-{date}.redesign-analysis.md
</output>
Instructions:
Let's break this down systematically to understand what went wrong and how to design a new solution based on it.
1. First, analyze the task carefully - what is being asked and what are the key requirements?
2. Read through each solution and its evaluation report
3. For each solution, think step by step about:
- What was the core approach?
- What specific issues did judges identify?
- Why did this approach fail to meet the quality threshold?
4. Identify common failure patterns across all solutions:
- Are there shared misconceptions?
- Are there missing requirements that all solutions overlooked?
- Are there fundamental constraints that weren't considered?
5. Extract lessons learned:
- What approaches should be avoided?
- What constraints must be addressed?
6. Generate improved guidance for the next iteration:
- New constraints to add
- Specific approaches to try - what are the different ways to solve this?
- Key requirements to emphasize
7. Think through the tradeoffs step by step and choose the approach you believe is best
8. Implement it completely
9. Generate 5 verification questions about critical aspects
10. Answer your own questions:
- Review solution against each question
- Identify gaps or weaknesses
11. Revise solution:
- Fix identified issues
12. Explain what was changed and why
```
Strategy 3: FULL_SYNTHESIS (Default)
When: No clear winner AND solutions have merit (scores >=3.0)
Process: Proceed to Phase 5 (Evidence-Based Synthesis)
Phase 5: Synthesis (Evidence-Based Combination)
Only executed when Strategy 3 (FULL_SYNTHESIS) is selected in Phase 4.5
Launch 1 synthesis agent (recommended: Opus for quality):
- Agent receives:
- All solutions (from specified output location)
- All evaluation reports (from `.specs/reports/`)
- Selection rationale from pruning phase (from `.specs/research/`)
- Agent analyzes:
- Consensus strengths (what multiple judges praised)
- Consensus weaknesses (what multiple judges criticized)
- Complementary elements where solutions took different approaches
- Agent produces final solution by:
- Copying superior sections when one solution clearly wins
- Combining approaches when hybrid is better
- Fixing identified issues that judges caught
- Documenting decisions (what was taken from where and why)
Key principle: Evidence-based synthesis leverages collective intelligence from exploration and evaluation.
Prompt template for synthesizer:
```markdown
You are synthesizing the best solution from explored, pruned, and evaluated implementations.
<task>
{task_description}
</task>
<solutions>
{list of paths to all solution files}
</solutions>
<evaluation_reports>
{list of paths to all evaluation reports}
</evaluation_reports>
<selection_rationale>
{path to selection.md explaining why these proposals were chosen}
</selection_rationale>
<output>
{output_path} - The final synthesized solution
</output>
Instructions:
Let's approach this synthesis systematically by first analyzing, then decomposing, then building.
**Step 1: Build the evidence base**
Before synthesizing, gather evidence from judge reports:
- What did multiple judges praise? (consensus strengths)
- What did multiple judges criticize? (consensus weaknesses)
- Where did judges disagree? (areas needing careful analysis)
**Step 2: Decompose into synthesis subproblems**
Break the solution into logical sections or components. For each component:
- Which solution handles this best? (cite evidence)
- Are there complementary elements from multiple solutions?
- What issues were identified that need fixing?
**Step 3: Solve each subproblem**
For each component/section, determine the synthesis strategy:
*Strategy A - Clear winner:* If one solution is clearly superior for this component:
- Copy that section directly
- Document: "Taken from Solution X because [judge evidence]"
*Strategy B - Complementary combination:* If solutions have complementary strengths:
- Identify what each contributes
- Combine carefully, ensuring consistency
- Document: "Combined X from Solution A with Y from Solution B because [rationale]"
*Strategy C - All flawed:* If all solutions have issues in this area:
- Start with the best version
- Apply fixes based on judge criticism
- Document: "Based on Solution X, modified to address [specific issues]"
**Step 4: Integrate and verify consistency**
After synthesizing all components:
- Check that combined elements work together
- Resolve any contradictions between borrowed sections
- Ensure consistent terminology and style
**Step 5: Document synthesis decisions**
Create a synthesis log:
- What you took from each solution (with specific citations)
- Why you made those choices (reference judge feedback)
- How you addressed identified weaknesses
- Any novel combinations or improvements
<example>
**Example synthesis decision for an API design:**
Component: Authentication flow
- Solution A: JWT with refresh tokens (praised for security by 2/3 judges)
- Solution B: Session-based (praised for simplicity by 1 judge, criticized for scalability)
- Solution C: OAuth2 only (criticized as over-engineered for use case)
Decision: Take Solution A's authentication flow directly.
Evidence: Judges 1 and 3 both noted "JWT approach provides good balance of security and statelessness"
Modification: None needed - this section was rated highest across judges.
</example>
**Step 6: Revise your solution**
- Generate 5 verification questions about critical aspects
- Answer your own questions:
- Review solution against each question
- Identify gaps or weaknesses
- Revise solution:
- Fix identified issues
- Explain what was changed and why
CRITICAL:
- Do not create something entirely new - synthesize the best from what exists
- Cite your sources (which solution, which section)
- Explain every major decision
- Address all consensus weaknesses identified by judges
```
Outputs (All Strategies)
- Research directory: `.specs/research/` (created if not exists)
  - Proposals: `.specs/research/{solution-name}-{date}.proposals.[a|b|c].md` - High-level approaches with probabilities
  - Pruning: `.specs/research/{solution-name}-{date}.pruning.[1|2|3].md` - Judge evaluations and votes
  - Selection: `.specs/research/{solution-name}-{date}.selection.md` - Vote tallies and selected proposals
- Expansion outputs:
  - `solution.a.md`, `solution.b.md`, `solution.c.md` - Full implementations (in specified output location)
- Reports directory: `.specs/reports/` (created if not exists)
  - Evaluation: `.specs/reports/{solution-name}-{date}.[1|2|3].md` - Final judge reports
- Resulting solution: `{output_path}`
Strategy-Specific Outputs
- SELECT_AND_POLISH: Polished solution based on winning solution, with targeted improvements
- REDESIGN: Do not stop; return to Phase 3 with lessons learned; eventually finishes at SELECT_AND_POLISH or FULL_SYNTHESIS
- FULL_SYNTHESIS: Synthesized solution combining best elements from all solutions
Best Practices
Meta-Judge + Judge Verification
- Two meta-judges - Separate specs for pruning (proposals) and evaluation (full solutions)
- Meta-judges run in parallel with implementation - Don't block the pipeline; pruning meta-judge runs with Phase 1, evaluation meta-judge runs with Phase 3
- Include CLAUDE_PLUGIN_ROOT - Both meta-judges and judges need the resolved plugin root path
- Meta-judge YAML - Pass only the YAML to judges, do not modify it
Common Pitfalls
- Insufficient exploration - Agents propose similar approaches
- Ignoring judge feedback - Expansion ignores concerns from pruning
- Vague proposals - Can't properly evaluate without implementation details
- Over-exploration - Too many proposals, evaluation becomes expensive
- Forcing synthesis when clear winner exists - Wastes cost and risks degrading quality
- Synthesizing fundamentally flawed solutions - Better to redesign than polish garbage
Recommendations
- Encourage diverse exploration - Prompt for different regions of solution space
- Feed feedback forward - Expansion agents address pruning concerns
- Right level of detail - Proposals have enough detail to evaluate
- Prune aggressively - Only expand most promising 3 approaches
- Trust adaptive strategy selection - Polish clear winners, synthesize split decisions, redesign failures
Example: API Design
```bash
/tree-of-thoughts "Design REST API for user management (CRUD + auth)" \
  --output "specs/api/users.md" \
  --criteria "RESTfulness,security,scalability,developer-experience"
```

Phase 1 outputs (assuming date 2025-01-15):
- `.specs/research/users-api-2025-01-15.proposals.a.md` - 6 approaches from Agent A
- `.specs/research/users-api-2025-01-15.proposals.b.md` - 6 approaches from Agent B
- `.specs/research/users-api-2025-01-15.proposals.c.md` - 6 approaches from Agent C
Phase 1.5 output (runs in parallel with Phase 1):
- Pruning Meta-judge (Opus, `sadd:meta-judge`) generates pruning evaluation specification YAML
Phase 2 outputs (3 judges with pruning meta-judge spec):
- `.specs/research/users-api-2025-01-15.pruning.1.md` - Top 3: Resource-based REST, Pure REST, Monolithic
- `.specs/research/users-api-2025-01-15.pruning.2.md` - Top 3: Pure REST, Hybrid (services), Resource-based REST
- `.specs/research/users-api-2025-01-15.pruning.3.md` - Top 3: Resource-based REST, REST+GraphQL hybrid, Pure REST
- `.specs/research/users-api-2025-01-15.selection.md` - Selected: Resource-based REST (8 pts), Pure REST (7 pts), Monolithic (4 pts)
Phase 3 outputs:
- `specs/api/users.a.md` - Full resource-based design with nested routes
- `specs/api/users.b.md` - Flat REST design with simple endpoints
- `specs/api/users.c.md` - Monolithic API with service-oriented internals
Phase 3.5 output (runs in parallel with Phase 3):
- Evaluation Meta-judge (Opus, `sadd:meta-judge`) generates evaluation specification YAML
Phase 4 outputs (3 judges with evaluation meta-judge spec):
- `.specs/reports/users-api-2025-01-15.1.md`: VOTE: Solution A; SCORES: A=4.2/5.0, B=3.8/5.0, C=3.4/5.0; "Prefers A for RESTfulness, criticizes C complexity"
- `.specs/reports/users-api-2025-01-15.2.md`: VOTE: Solution B; SCORES: A=3.9/5.0, B=4.1/5.0, C=3.5/5.0; "Prefers B for simplicity, criticizes A deep nesting"
- `.specs/reports/users-api-2025-01-15.3.md`: VOTE: Solution A; SCORES: A=4.3/5.0, B=3.6/5.0, C=3.2/5.0; "Prefers A for discoverability, criticizes B lack of structure"
Phase 4.5 decision (orchestrator parses headers):
- Split votes: A, B, A (no unanimous winner)
- Average scores: A=4.1, B=3.8, C=3.4 (all >=3.0)
- Strategy: FULL_SYNTHESIS
- Reason: Split decision with merit, synthesis needed
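Worked against the Phase 4.5 sketch: votes are [A, B, A] (not unanimous), averages are A = (4.2+3.9+4.3)/3 ≈ 4.1, B = (3.8+4.1+3.6)/3 ≈ 3.8, C = (3.4+3.5+3.2)/3 ≈ 3.4 (none below 3.0), so the selector falls through to FULL_SYNTHESIS, matching the decision above.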
Phase 5 output (synthesis):
- `specs/api/users.md` - Resource-based structure (from A), max 2-level nesting (from B), internal services (from C)