Evaluation Methods for Claude Code Agents


Evaluation of agent systems requires different approaches than traditional software or even standard language model applications. Agents make dynamic decisions, are non-deterministic between runs, and often lack single correct answers. Effective evaluation must account for these characteristics while providing actionable feedback. A robust evaluation framework enables continuous improvement, catches regressions, and validates that context engineering choices achieve intended effects.

Core Concepts


Agent evaluation requires outcome-focused approaches that account for non-determinism and multiple valid paths. Multi-dimensional rubrics capture various quality aspects: factual accuracy, completeness, citation accuracy, source quality, and tool efficiency. LLM-as-judge provides scalable evaluation while human evaluation catches edge cases.
The key insight is that agents may find alternative paths to goals—the evaluation should judge whether they achieve right outcomes while following reasonable processes.
Performance Drivers: The 95% Finding

Research on the BrowseComp evaluation (which tests browsing agents' ability to locate hard-to-find information) found that three factors explain 95% of performance variance:

| Factor | Variance Explained | Implication |
| --- | --- | --- |
| Token usage | 80% | More tokens = better performance |
| Number of tool calls | ~10% | More exploration helps |
| Model choice | ~5% | Better models multiply efficiency |
Implications for Claude Code development:
  • Token budgets matter: Evaluate with realistic token constraints
  • Model upgrades beat token increases: Upgrading models provides larger gains than increasing token budgets
  • Multi-agent validation: The findings validate architectures that distribute work across subagents with separate context windows

Evaluation Challenges


Non-Determinism and Multiple Valid Paths


Agents may take completely different valid paths to reach goals. One agent might search three sources while another searches ten. They might use different tools to find the same answer. Traditional evaluations that check for specific steps fail in this context.
Solution: Evaluate outcomes, not exact execution paths. Judge whether the agent achieves the right result through a reasonable process.

Context-Dependent Failures


Agent failures often depend on context in subtle ways. An agent might succeed on complex queries but fail on simple ones. It might work well with one tool set but fail with another. Failures may emerge only after extended interaction when context accumulates.
Solution: Evaluation must cover a range of complexity levels and test extended interactions, not just isolated queries.

Composite Quality Dimensions


Agent quality is not a single dimension. It includes factual accuracy, completeness, coherence, tool efficiency, and process quality. An agent might score high on accuracy but low on efficiency, or vice versa.
Solution: Evaluation rubrics must capture multiple dimensions with appropriate weighting for the use case.

Evaluation Rubric Design


Multi-Dimensional Rubric


Effective rubrics cover key dimensions with descriptive levels:
Instruction Following (weight: 0.30)
  • Excellent (1.0): All instructions followed precisely
  • Good (0.8): Minor deviations that don't affect outcome
  • Acceptable (0.6): Major instructions followed, minor ones missed
  • Poor (0.3): Significant instructions ignored
  • Failed (0.0): Fundamentally misunderstood the task
Output Completeness (weight: 0.25)
  • Excellent: All requested aspects thoroughly covered
  • Good: Most aspects covered with minor gaps
  • Acceptable: Key aspects covered, some gaps
  • Poor: Major aspects missing
  • Failed: Fundamental aspects not addressed
Tool Efficiency (weight: 0.20)
  • Excellent: Optimal tool selection and minimal calls
  • Good: Good tool selection with minor inefficiencies
  • Acceptable: Appropriate tools with some redundancy
  • Poor: Wrong tools or excessive calls
  • Failed: Severe tool misuse or extremely excessive calls
Reasoning Quality (weight: 0.15)
  • Excellent: Clear, logical reasoning throughout
  • Good: Generally sound reasoning with minor gaps
  • Acceptable: Basic reasoning present
  • Poor: Reasoning unclear or flawed
  • Failed: No apparent reasoning
Response Coherence (weight: 0.10)
  • Excellent: Well-structured, easy to follow
  • Good: Generally coherent with minor issues
  • Acceptable: Understandable but could be clearer
  • Poor: Difficult to follow
  • Failed: Incoherent
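The rubric above can be encoded directly as data so that weights and level scores stay consistent across evaluations. A minimal sketch (dimension names, weights, and level scores are taken from the rubric; the structure itself is illustrative):

```python
# Illustrative encoding of the multi-dimensional rubric above:
# per-dimension weights plus the shared level-to-score ladder.
RUBRIC_WEIGHTS = {
    "instruction_following": 0.30,
    "output_completeness": 0.25,
    "tool_efficiency": 0.20,
    "reasoning_quality": 0.15,
    "response_coherence": 0.10,
}

# Numeric levels as given for Instruction Following; the other
# dimensions reuse the same Excellent..Failed ladder.
LEVEL_SCORES = {"excellent": 1.0, "good": 0.8, "acceptable": 0.6,
                "poor": 0.3, "failed": 0.0}

# Weights must sum to 1.0 so weighted overall scores stay on the 0-1 scale.
assert abs(sum(RUBRIC_WEIGHTS.values()) - 1.0) < 1e-9
```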

Scoring Approach


Convert dimension assessments to numeric scores (0.0 to 1.0) with appropriate weighting. Calculate weighted overall scores. Set passing thresholds based on use case requirements (typically 0.7 for general use, 0.85 for critical operations).
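As a sketch of this scoring approach (weights from the rubric section and thresholds from the paragraph above; the individual dimension scores are made up for illustration):

```python
# Weighted overall score as described above: each dimension scored 0.0-1.0,
# combined by rubric weight, then checked against a passing threshold.
def overall_score(dimension_scores: dict[str, float],
                  weights: dict[str, float]) -> float:
    """Weighted average of per-dimension scores (each on a 0.0-1.0 scale)."""
    return sum(dimension_scores[d] * w for d, w in weights.items())

weights = {"instruction_following": 0.30, "output_completeness": 0.25,
           "tool_efficiency": 0.20, "reasoning_quality": 0.15,
           "response_coherence": 0.10}
# Hypothetical dimension scores for one agent run:
scores = {"instruction_following": 1.0, "output_completeness": 0.8,
          "tool_efficiency": 0.6, "reasoning_quality": 0.8,
          "response_coherence": 1.0}

total = overall_score(scores, weights)  # 0.30 + 0.20 + 0.12 + 0.12 + 0.10 = 0.84
passes_general = total >= 0.70   # typical threshold for general use
passes_critical = total >= 0.85  # typical threshold for critical operations
```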

Evaluation Methodologies


LLM-as-Judge


Using an LLM to evaluate agent outputs scales to large test sets and provides consistent judgments. The key is designing effective evaluation prompts that capture the dimensions of interest.
Provide a clear task description, the agent output, ground truth (if available), and an evaluation scale with level descriptions, then request a structured judgment.
Evaluation Prompt Template:

```markdown
You are evaluating the output of a Claude Code agent.

## Original Task

{task_description}

## Agent Output

{agent_output}

## Ground Truth (if available)

{expected_output}

## Evaluation Criteria

For each criterion, assess the output and provide:
1. Score (1-5)
2. Specific evidence supporting your score
3. One improvement suggestion

## Criteria

1. Instruction Following: Did the agent follow all instructions?
2. Completeness: Are all requested aspects covered?
3. Tool Efficiency: Were appropriate tools used efficiently?
4. Reasoning Quality: Is the reasoning clear and sound?
5. Response Coherence: Is the output well-structured?

Provide your evaluation as a structured assessment with scores and justifications.
```

**Chain-of-Thought Requirement**: Always require justification before the score. Research shows this improves reliability by 15-25% compared to score-first approaches.

Human Evaluation


Human evaluation catches what automation misses:
  • Hallucinated answers on unusual queries
  • Subtle context misunderstandings
  • Edge cases that automated evaluation overlooks
  • Qualitative issues with tone or approach
For Claude Code development:
  • Review agent outputs manually for edge cases
  • Sample systematically across complexity levels
  • Track patterns in failures to inform prompt improvements

End-State Evaluation


For commands that produce artifacts (files, configurations, code), evaluate the final output rather than the process:
  • Does the generated code work?
  • Is the configuration valid?
  • Does the output meet requirements?
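A minimal sketch of such end-state checks, assuming the agent emits Python source and a JSON configuration (the sample outputs and required keys are hypothetical):

```python
import json


def code_is_syntactically_valid(source: str) -> bool:
    """Check that generated Python code at least parses."""
    try:
        compile(source, "<generated>", "exec")
        return True
    except SyntaxError:
        return False


def config_is_valid(text: str, required_keys: set[str]) -> bool:
    """Check that a generated JSON config parses and has the required keys."""
    try:
        cfg = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(cfg, dict) and required_keys <= cfg.keys()


# Hypothetical agent artifacts:
assert code_is_syntactically_valid("def add(a, b):\n    return a + b\n")
assert not code_is_syntactically_valid("def add(a, b) return a + b")
assert config_is_valid('{"model": "x", "max_tokens": 1000}', {"model"})
```

Deeper checks (running the generated tests, executing the code against requirements) follow the same pattern: judge the artifact, not the steps that produced it.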

Test Set Design


Sample Selection

Start with small samples during development. Early in agent development, changes have dramatic impacts because there is abundant low-hanging fruit. Small test sets reveal large effects.
Sample from real usage patterns. Add known edge cases. Ensure coverage across complexity levels.

Complexity Stratification

Test sets should span complexity levels: simple (single tool call), medium (multiple tool calls), complex (many tool calls, significant ambiguity), and very complex (extended interaction, deep reasoning).
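The stratification above can be sketched as a small test set with a coverage check; the cases and queries here are purely illustrative:

```python
# Illustrative complexity-stratified test set for an agent evaluation.
TEST_SET = [
    {"id": "t1", "complexity": "simple", "query": "What is in config.json?"},
    {"id": "t2", "complexity": "medium", "query": "Find and fix the failing test."},
    {"id": "t3", "complexity": "complex", "query": "Refactor the auth module."},
    {"id": "t4", "complexity": "very_complex",
     "query": "Diagnose the intermittent CI failure across services."},
]


def by_complexity(cases: list[dict], level: str) -> list[dict]:
    """Select the test cases at one complexity level."""
    return [c for c in cases if c["complexity"] == level]


# Ensure every complexity level is covered before running an evaluation.
levels = {"simple", "medium", "complex", "very_complex"}
assert {c["complexity"] for c in TEST_SET} == levels
```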

Context Engineering Evaluation


Testing Prompt Variations


When iterating on Claude Code prompts, evaluate systematically:
  1. Baseline: Run current prompt on test cases
  2. Variation: Run modified prompt on same cases
  3. Compare: Measure quality scores, token usage, efficiency
  4. Analyze: Identify which changes improved which dimensions
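The four steps above can be sketched as a small harness; `fake_run_and_score` is a stand-in for actually running the agent and scoring its output with a judge:

```python
# Baseline-vs-variation comparison sketch for prompt iteration.
def evaluate_prompt(prompt: str, cases: list[str], run_and_score) -> dict:
    """Run every test case with a prompt and aggregate score and token usage."""
    results = [run_and_score(prompt, case) for case in cases]
    n = len(results)
    return {
        "mean_score": sum(r["score"] for r in results) / n,
        "mean_tokens": sum(r["tokens"] for r in results) / n,
    }


# Hypothetical scorer used for illustration only: pretends that asking
# for citations improves quality scores by a fixed amount.
def fake_run_and_score(prompt: str, case: str) -> dict:
    bonus = 0.1 if "cite sources" in prompt else 0.0
    return {"score": 0.7 + bonus, "tokens": 1200}


cases = ["case-1", "case-2", "case-3"]
baseline = evaluate_prompt("Answer the question.", cases, fake_run_and_score)
variation = evaluate_prompt("Answer the question and cite sources.",
                            cases, fake_run_and_score)
delta = variation["mean_score"] - baseline["mean_score"]  # positive = improvement
```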

Testing Context Strategies


Context engineering choices should be validated through systematic evaluation. Run agents with different context strategies on the same test set. Compare quality scores, token usage, and efficiency metrics.

Degradation Testing


Test how context degradation affects performance by running agents at different context sizes. Identify performance cliffs where context becomes problematic. Establish safe operating limits.

Advanced Evaluation: LLM-as-Judge


Key insight: LLM-as-a-Judge is not a single technique but a family of approaches, each suited to different evaluation contexts. Choosing the right approach and mitigating known biases is the core competency this skill develops.

The Evaluation Taxonomy


Evaluation approaches fall into two primary categories with distinct reliability profiles:
Direct Scoring: A single LLM rates one response on a defined scale.
  • Best for: Objective criteria (factual accuracy, instruction following, toxicity)
  • Reliability: Moderate to high for well-defined criteria
  • Failure mode: Score calibration drift, inconsistent scale interpretation
Pairwise Comparison: An LLM compares two responses and selects the better one.
  • Best for: Subjective preferences (tone, style, persuasiveness)
  • Reliability: Higher than direct scoring for preferences
  • Failure mode: Position bias, length bias
Research from the MT-Bench paper (Zheng et al., 2023) establishes that pairwise comparison achieves higher agreement with human judges than direct scoring for preference-based evaluation, while direct scoring remains appropriate for objective criteria with clear ground truth.

The Bias Landscape


LLM judges exhibit systematic biases that must be actively mitigated:
Position Bias: First-position responses receive preferential treatment in pairwise comparison. Mitigation: Evaluate twice with swapped positions, use majority vote or consistency check.
Length Bias: Longer responses are rated higher regardless of quality. Mitigation: Explicit prompting to ignore length, length-normalized scoring.
Self-Enhancement Bias: Models rate their own outputs higher. Mitigation: Use different models for generation and evaluation, or acknowledge limitation.
Verbosity Bias: Detailed explanations receive higher scores even when unnecessary. Mitigation: Criteria-specific rubrics that penalize irrelevant detail.
Authority Bias: Confident, authoritative tone rated higher regardless of accuracy. Mitigation: Require evidence citation, fact-checking layer.

Metric Selection Framework


Choose metrics based on the evaluation task structure:
| Task Type | Primary Metrics | Secondary Metrics |
| --- | --- | --- |
| Binary classification (pass/fail) | Recall, Precision, F1 | Cohen's κ |
| Ordinal scale (1-5 rating) | Spearman's ρ, Kendall's τ | Cohen's κ (weighted) |
| Pairwise preference | Agreement rate, Position consistency | Confidence calibration |
| Multi-label | Macro-F1, Micro-F1 | Per-label precision/recall |
The critical insight: High absolute agreement matters less than systematic disagreement patterns. A judge that consistently disagrees with humans on specific criteria is more problematic than one with random noise.

Evaluation Metrics Reference


Classification Metrics (Pass/Fail Tasks)


Precision: Of all responses marked as passing, what fraction truly passed?
  • Use when false positives are costly
Recall: Of all actually passing responses, what fraction did we identify?
  • Use when false negatives are costly
F1 Score: Harmonic mean of precision and recall
  • Use for balanced single-number summary

Agreement Metrics (Comparing to Human Judgment)


Cohen's Kappa: Agreement adjusted for chance
  • > 0.8: Almost perfect agreement
  • 0.6-0.8: Substantial agreement
  • 0.4-0.6: Moderate agreement
  • < 0.4: Fair to poor agreement

Correlation Metrics (Ordinal Scores)


Spearman's Rank Correlation: Correlation between rankings
  • > 0.9: Very strong correlation
  • 0.7-0.9: Strong correlation
  • 0.5-0.7: Moderate correlation
  • < 0.5: Weak correlation
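Both agreement metrics can be computed without external dependencies. A tie-free sketch (production code would more likely use `scipy.stats.spearmanr` and `sklearn.metrics.cohen_kappa_score`; the example score lists are hypothetical):

```python
def spearman_rho(xs: list[float], ys: list[float]) -> float:
    """Spearman rank correlation for tie-free score lists."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))


def cohens_kappa(a: list[int], b: list[int]) -> float:
    """Chance-adjusted agreement between two binary pass/fail raters."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    pa1, pb1 = sum(a) / n, sum(b) / n
    pe = pa1 * pb1 + (1 - pa1) * (1 - pb1)              # chance agreement
    return (po - pe) / (1 - pe)


# Hypothetical judge vs. human ratings on a 1-5 scale (last two swapped):
rho = spearman_rho([1, 2, 3, 4, 5], [1, 2, 3, 5, 4])    # 0.9: strong correlation

# Hypothetical pass/fail verdicts:
kappa = cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0])        # 0.5: moderate agreement
```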

Good Evaluation System Indicators


| Metric | Good | Acceptable | Concerning |
| --- | --- | --- | --- |
| Spearman's ρ | > 0.8 | 0.6-0.8 | < 0.6 |
| Cohen's Kappa | > 0.7 | 0.5-0.7 | < 0.5 |
| Position consistency | > 0.9 | 0.8-0.9 | < 0.8 |
| Length-score correlation | < 0.2 | 0.2-0.4 | > 0.4 |

Evaluation Approaches


Direct Scoring Implementation


Direct scoring requires three components: clear criteria, a calibrated scale, and a structured output format.

Criteria Definition Pattern:

```
Criterion: [Name]
Description: [What this criterion measures]
Weight: [Relative importance, 0-1]
```

Scale Calibration:
  • 1-3 scales: Binary with neutral option, lowest cognitive load
  • 1-5 scales: Standard Likert, good balance of granularity and reliability
  • 1-10 scales: High granularity but harder to calibrate, use only with detailed rubrics

Prompt Structure for Direct Scoring:

```markdown
You are an expert evaluator assessing response quality.

## Task

Evaluate the following response against each criterion.

## Original Prompt

{prompt}

## Response to Evaluate

{response}

## Criteria

{for each criterion: name, description, weight}

## Instructions

For each criterion:
1. Find specific evidence in the response
2. Score according to the rubric (1-{max} scale)
3. Justify your score with evidence
4. Suggest one specific improvement

## Output Format

Respond with structured JSON containing scores, justifications, and summary.
```

**Chain-of-Thought Requirement**: All scoring prompts must require justification before the score. Research shows this improves reliability by 15-25% compared to score-first approaches.

Pairwise Comparison Implementation


Pairwise comparison is inherently more reliable for preference-based evaluation but requires bias mitigation.
Position Bias Mitigation Protocol:
  1. First pass: Response A in first position, Response B in second
  2. Second pass: Response B in first position, Response A in second
  3. Consistency check: If passes disagree, return TIE with reduced confidence
  4. Final verdict: Consistent winner with averaged confidence
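The four-step protocol above can be sketched as follows; `fake_judge` is a stand-in for an LLM pairwise call returning `(winner, confidence)`:

```python
# Position-swap pairwise comparison with a consistency check.
def compare_with_swap(judge, prompt: str, resp_a: str, resp_b: str) -> dict:
    w1, c1 = judge(prompt, resp_a, resp_b)       # pass 1: A in first position
    w2_raw, c2 = judge(prompt, resp_b, resp_a)   # pass 2: B in first position
    # Map the pass-2 verdict back to the original labels.
    w2 = {"A": "B", "B": "A", "TIE": "TIE"}[w2_raw]
    if w1 == w2:
        return {"winner": w1, "confidence": (c1 + c2) / 2, "consistent": True}
    return {"winner": "TIE", "confidence": 0.5, "consistent": False}


# Hypothetical judge that always prefers the response mentioning analogies,
# whichever position it appears in.
def fake_judge(prompt, first, second):
    if "analogy" in first:
        return ("A", 0.8)
    return ("B", 0.6)


result = compare_with_swap(fake_judge, "Explain ML to a beginner",
                           "jargon-heavy answer", "analogy-based answer")
# Both passes pick the analogy-based response, so the verdict is consistent:
# winner "B" with the averaged confidence (0.6 + 0.8) / 2 = 0.7.
```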
Prompt Structure for Pairwise Comparison:

```markdown
You are an expert evaluator comparing two AI responses.

## Critical Instructions

- Do NOT prefer responses because they are longer
- Do NOT prefer responses based on position (first vs second)
- Focus ONLY on quality according to the specified criteria
- Ties are acceptable when responses are genuinely equivalent

## Original Prompt

{prompt}

## Response A

{response_a}

## Response B

{response_b}

## Comparison Criteria

{criteria list}

## Instructions

1. Analyze each response independently first
2. Compare them on each criterion
3. Determine overall winner with confidence level

## Output Format

JSON with per-criterion comparison, overall winner, confidence (0-1), and reasoning.
```

**Confidence Calibration**: Confidence scores should reflect position consistency:

- Both passes agree: confidence = average of individual confidences
- Passes disagree: confidence = 0.5, verdict = TIE

Rubric Generation


Well-defined rubrics reduce evaluation variance by 40-60% compared to open-ended scoring.

Rubric Components


  1. Level descriptions: Clear boundaries for each score level
  2. Characteristics: Observable features that define each level
  3. Examples: Representative outputs for each level (when possible)
  4. Edge cases: Guidance for ambiguous situations
  5. Scoring guidelines: General principles for consistent application

Strictness Calibration


  • Lenient: Lower bar for passing scores, appropriate for encouraging iteration
  • Balanced: Fair, typical expectations for production use
  • Strict: High standards, appropriate for safety-critical or high-stakes evaluation

Domain Adaptation


Rubrics should use domain-specific terminology:
  • Code readability rubrics mention variables, functions, and comments
  • Documentation rubrics reference clarity, accuracy, completeness
  • Analysis rubrics focus on depth, accuracy, actionability

Practical Guidance


Evaluation Pipeline Design


Production evaluation systems require multiple layers:
┌─────────────────────────────────────────────────┐
│                 Evaluation Pipeline              │
├─────────────────────────────────────────────────┤
│                                                   │
│  Input: Response + Prompt + Context               │
│           │                                       │
│           ▼                                       │
│  ┌─────────────────────┐                         │
│  │   Criteria Loader   │ ◄── Rubrics, weights    │
│  └──────────┬──────────┘                         │
│             │                                     │
│             ▼                                     │
│  ┌─────────────────────┐                         │
│  │   Primary Scorer    │ ◄── Direct or Pairwise  │
│  └──────────┬──────────┘                         │
│             │                                     │
│             ▼                                     │
│  ┌─────────────────────┐                         │
│  │   Bias Mitigation   │ ◄── Position swap, etc. │
│  └──────────┬──────────┘                         │
│             │                                     │
│             ▼                                     │
│  ┌─────────────────────┐                         │
│  │ Confidence Scoring  │ ◄── Calibration         │
│  └──────────┬──────────┘                         │
│             │                                     │
│             ▼                                     │
│  Output: Scores + Justifications + Confidence     │
│                                                   │
└─────────────────────────────────────────────────┘

Avoiding Evaluation Pitfalls


Anti-pattern: Scoring without justification
  • Problem: Scores lack grounding, difficult to debug or improve
  • Solution: Always require evidence-based justification before score
Anti-pattern: Single-pass pairwise comparison
  • Problem: Position bias corrupts results
  • Solution: Always swap positions and check consistency
Anti-pattern: Overloaded criteria
  • Problem: Criteria measuring multiple things are unreliable
  • Solution: One criterion = one measurable aspect
Anti-pattern: Missing edge case guidance
  • Problem: Evaluators handle ambiguous cases inconsistently
  • Solution: Include edge cases in rubrics with explicit guidance
Anti-pattern: Ignoring confidence calibration
  • Problem: High-confidence wrong judgments are worse than low-confidence ones
  • Solution: Calibrate confidence to position consistency and evidence strength

Decision Framework: Direct vs. Pairwise


Use this decision tree:
Is there an objective ground truth?
├── Yes → Direct Scoring
│   └── Examples: factual accuracy, instruction following, format compliance
└── No → Is it a preference or quality judgment?
    ├── Yes → Pairwise Comparison
    │   └── Examples: tone, style, persuasiveness, creativity
    └── No → Consider reference-based evaluation
        └── Examples: summarization (compare to source), translation (compare to reference)
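The decision tree above reduces to a small function; the category names are illustrative:

```python
# Decision tree from above as a function: pick an evaluation approach
# from the task's structure.
def choose_method(has_ground_truth: bool, is_preference: bool) -> str:
    if has_ground_truth:
        return "direct_scoring"        # e.g. factual accuracy, format compliance
    if is_preference:
        return "pairwise_comparison"   # e.g. tone, style, persuasiveness
    return "reference_based"           # e.g. summarization, translation


assert choose_method(True, False) == "direct_scoring"
assert choose_method(False, True) == "pairwise_comparison"
assert choose_method(False, False) == "reference_based"
```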

Scaling Evaluation


For high-volume evaluation:
  1. Panel of LLMs (PoLL): Use multiple models as judges, aggregate votes
    • Reduces individual model bias
    • More expensive but more reliable for high-stakes decisions
  2. Hierarchical evaluation: Fast cheap model for screening, expensive model for edge cases
    • Cost-effective for large volumes
    • Requires calibration of screening threshold
  3. Human-in-the-loop: Automated evaluation for clear cases, human review for low-confidence
    • Best reliability for critical applications
    • Design feedback loop to improve automated evaluation
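Vote aggregation for a PoLL setup can be as simple as a majority count; a sketch with hypothetical verdicts from three judge models:

```python
from collections import Counter


# Panel-of-LLMs (PoLL) aggregation: majority vote plus an agreement rate
# that can serve as a rough confidence signal.
def aggregate_votes(verdicts: list[str]) -> tuple[str, float]:
    """Return (majority verdict, fraction of judges that agreed with it)."""
    counts = Counter(verdicts)
    winner, n = counts.most_common(1)[0]
    return winner, n / len(verdicts)


# Hypothetical verdicts from three different judge models:
verdict, agreement = aggregate_votes(["B", "B", "A"])
# Majority verdict is "B" with 2/3 agreement; low agreement can be routed
# to human review in a human-in-the-loop setup.
```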

Examples


Example 1: Direct Scoring for Accuracy


Input:

```
Prompt: "What causes seasons on Earth?"
Response: "Seasons are caused by Earth's tilted axis. As Earth orbits the Sun,
different hemispheres receive more direct sunlight at different times of year."
Criterion: Factual Accuracy (weight: 1.0)
Scale: 1-5
```

Output:

```json
{
  "criterion": "Factual Accuracy",
  "score": 5,
  "evidence": [
    "Correctly identifies axial tilt as primary cause",
    "Correctly explains differential sunlight by hemisphere",
    "No factual errors present"
  ],
  "justification": "Response accurately explains the cause of seasons with correct scientific reasoning. Both the axial tilt and its effect on sunlight distribution are correctly described.",
  "improvement": "Could add the specific tilt angle (23.5°) for completeness."
}
```

Example 2: Pairwise Comparison with Position Swap


Input:

```
Prompt: "Explain machine learning to a beginner"
Response A: [Technical explanation with jargon]
Response B: [Simple analogy-based explanation]
Criteria: ["clarity", "accessibility"]
```

First Pass (A first):

```json
{ "winner": "B", "confidence": 0.8 }
```

Second Pass (B first):

```json
{ "winner": "A", "confidence": 0.6 }
```

(Note: the winner is reported as A because B was in the first position.)

Mapped Second Pass:

```json
{ "winner": "B", "confidence": 0.6 }
```

Final Result:

```json
{
  "winner": "B",
  "confidence": 0.7,
  "positionConsistency": {
    "consistent": true,
    "firstPassWinner": "B",
    "secondPassWinner": "B"
  }
}
```
}

Example 3: Rubric Generation


Input:

```
criterionName: "Code Readability"
criterionDescription: "How easy the code is to understand and maintain"
domain: "software engineering"
scale: "1-5"
strictness: "balanced"
```

Output (abbreviated):

```json
{
  "levels": [
    {
      "score": 1,
      "label": "Poor",
      "description": "Code is difficult to understand without significant effort",
      "characteristics": [
        "No meaningful variable or function names",
        "No comments or documentation",
        "Deeply nested or convoluted logic"
      ]
    },
    {
      "score": 3,
      "label": "Adequate",
      "description": "Code is understandable with some effort",
      "characteristics": [
        "Most variables have meaningful names",
        "Basic comments present for complex sections",
        "Logic is followable but could be cleaner"
      ]
    },
    {
      "score": 5,
      "label": "Excellent",
      "description": "Code is immediately clear and maintainable",
      "characteristics": [
        "All names are descriptive and consistent",
        "Comprehensive documentation",
        "Clean, modular structure"
      ]
    }
  ],
  "edgeCases": [
    {
      "situation": "Code is well-structured but uses domain-specific abbreviations",
      "guidance": "Score based on readability for domain experts, not general audience"
    }
  ]
}
```

Iterative Improvement Workflow


  1. Identify weakness: Use evaluation to find where agent struggles
  2. Hypothesize cause: Is it the prompt? The context? The examples?
  3. Modify prompt: Make targeted changes based on hypothesis
  4. Re-evaluate: Run same test cases with modified prompt
  5. Compare: Did the change improve the target dimension?
  6. Check regression: Did other dimensions suffer?
  7. Iterate: Repeat until quality meets threshold
  1. 识别弱点:通过评估发现Agent的薄弱环节
  2. 假设原因:是提示词问题?上下文问题?示例问题?
  3. 修改提示词:基于假设进行针对性修改
  4. 重新评估:在相同测试案例上运行修改后的提示词
  5. 对比:修改是否优化了目标维度?
  6. 检查回归:其他维度是否受到影响?
  7. 迭代:重复直到质量达到阈值
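The loop above hinges on steps 5 and 6: accept a prompt revision only if the target dimension improves and no other dimension regresses. A minimal sketch of that accept/reject decision (the `accept_revision` helper and the 0.5 regression tolerance are illustrative assumptions, not from the text):

```python
def accept_revision(baseline, candidate, target, tolerance=0.5):
    """Decide whether a prompt revision should replace the baseline.

    `baseline` and `candidate` map dimension names to scores averaged
    over the same test cases before and after the prompt change.
    """
    if candidate[target] <= baseline[target]:
        return False  # step 5: the target dimension did not improve
    # Step 6: reject if any other dimension dropped by more than `tolerance`
    regressed = [d for d in baseline
                 if d != target and baseline[d] - candidate[d] > tolerance]
    return not regressed
```

Each round of the workflow then reduces to one call: keep the revision if `accept_revision(...)` returns `True`, otherwise revise the hypothesis and try again.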

Guidelines

指南

  1. Always require justification before scores - Chain-of-thought prompting improves reliability by 15-25%
  2. Always swap positions in pairwise comparison - Single-pass comparison is corrupted by position bias
  3. Match scale granularity to rubric specificity - Don't use 1-10 without detailed level descriptions
  4. Separate objective and subjective criteria - Use direct scoring for objective, pairwise for subjective
  5. Include confidence scores - Calibrate to position consistency and evidence strength
  6. Define edge cases explicitly - Ambiguous situations cause the most evaluation variance
  7. Use domain-specific rubrics - Generic rubrics produce generic (less useful) evaluations
  8. Validate against human judgments - Automated evaluation is only valuable if it correlates with human assessment
  9. Monitor for systematic bias - Track disagreement patterns by criterion and response type
  10. Design for iteration - Evaluation systems improve with feedback loops
  1. 始终要求先给理由再给分 - 思维链提示可将评估可靠性提升15-25%
  2. 成对比较时始终交换位置 - 单次比较会受位置偏见影响
  3. 量表粒度与评分表特异性匹配 - 无详细层级描述时不要使用1-10分制
  4. 区分客观与主观标准 - 客观标准用直接评分,主观标准用成对比较
  5. 包含置信度分数 - 根据位置一致性和证据强度校准
  6. 明确定义边缘案例 - 模糊场景是评估差异的主要来源
  7. 使用领域特定评分表 - 通用评分表会产生通用(价值较低)的评估结果
  8. 与人工判断对比验证 - 只有与人工评估相关时,自动化评估才有价值
  9. 监控系统性偏见 - 按标准和响应类型跟踪分歧模式
  10. 为迭代设计评估系统 - 评估系统可通过反馈环持续优化

Example: Evaluating a Claude Code Command

示例:评估Claude Code命令

Suppose you've created a /refactor command and want to evaluate its quality:
Test Cases:
  1. Simple: Rename a variable across a single file
  2. Medium: Extract a function from existing code
  3. Complex: Refactor a class to use a new design pattern
  4. Very Complex: Restructure module dependencies
Evaluation Rubric:
  • Correctness: Does the refactored code work?
  • Completeness: Were all instances updated?
  • Style: Does it follow project conventions?
  • Efficiency: Were unnecessary changes avoided?
Evaluation Prompt:
markdown
Evaluate this refactoring output:

Original Code:
{original}

Refactored Code:
{refactored}

Request:
{user_request}

Score 1-5 on each dimension with evidence:
1. Correctness: Does the code still work correctly?
2. Completeness: Were all relevant instances updated?
3. Style: Does it follow the project's coding patterns?
4. Efficiency: Were only necessary changes made?

Provide scores with specific evidence from the code.
Iteration: If evaluation reveals the command often misses instances:
  1. Add explicit instruction: "Search the entire codebase for all occurrences"
  2. Re-evaluate with same test cases
  3. Compare completeness scores
  4. Check that correctness didn't regress
假设你创建了一个 /refactor 命令并想评估其质量:
测试案例
  1. 简单:在单个文件中重命名变量
  2. 中等:从现有代码中提取函数
  3. 复杂:重构类以使用新设计模式
  4. 极复杂:重构模块依赖关系
评估评分表
  • 正确性:重构后的代码是否可运行?
  • 完整性:所有实例是否都已更新?
  • 风格:是否遵循项目规范?
  • 效率:是否避免了不必要的修改?
评估提示词
markdown
评估此重构输出:

原始代码:
{original}

重构后代码:
{refactored}

请求:
{user_request}

根据每个维度给出1-5分并提供证据:
1. 正确性:代码是否仍能正确运行?
2. 完整性:所有相关实例是否都已更新?
3. 风格:是否遵循项目的编码模式?
4. 效率:是否仅进行了必要修改?

提供分数并引用代码中的具体证据。
迭代: 若评估发现命令经常遗漏实例:
  1. 添加明确指令:"搜索整个代码库以找到所有出现的位置"
  2. 在相同测试案例上重新评估
  3. 对比完整性分数
  4. 检查正确性是否出现回归
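A small driver that averages per-dimension rubric scores across the test cases makes the before/after comparison concrete; `judge` is a hypothetical stand-in for an LLM rubric call returning a 1-5 score per dimension:

```python
# Hypothetical sketch: score the /refactor command's four rubric dimensions
# across a set of test cases and average them, so that prompt revisions can
# be compared by their completeness and correctness averages.
DIMENSIONS = ["correctness", "completeness", "style", "efficiency"]

def evaluate_refactor(judge, test_cases):
    totals = {d: 0.0 for d in DIMENSIONS}
    for case in test_cases:
        scores = judge(case)  # one rubric evaluation per test case
        for d in DIMENSIONS:
            totals[d] += scores[d]
    return {d: totals[d] / len(test_cases) for d in DIMENSIONS}
```

Run this once before and once after adding the "search the entire codebase" instruction, then compare the `completeness` averages while confirming `correctness` held steady.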

Bias Mitigation Techniques for LLM Evaluation

LLM评估的偏见缓解技术

This reference details specific techniques for mitigating known biases in LLM-as-a-Judge systems.
本参考文档详细介绍了缓解LLM-as-Judge系统中已知偏见的具体技术。

Position Bias

位置偏见

The Problem

问题

In pairwise comparison, LLMs systematically prefer responses in certain positions. Research shows:
  • GPT has mild first-position bias (~55% preference for first position in ties)
  • Claude shows similar patterns
  • Smaller models often show stronger bias
在成对比较中,LLM会系统性偏好特定位置的响应。研究表明:
  • GPT存在轻微的第一位置偏见(平局场景下约55%偏好第一位置)
  • Claude表现出类似模式
  • 小型模型通常偏见更明显

Mitigation: Position Swapping Protocol

缓解方案:位置交换流程

python
async def position_swap_comparison(response_a, response_b, prompt, criteria):
    # Pass 1: Original order
    result_ab = await compare(response_a, response_b, prompt, criteria)
    
    # Pass 2: Swapped order
    result_ba = await compare(response_b, response_a, prompt, criteria)
    
    # Map second result (A in second position → B in first)
    result_ba_mapped = {
        'winner': {'A': 'B', 'B': 'A', 'TIE': 'TIE'}[result_ba['winner']],
        'confidence': result_ba['confidence']
    }
    
    # Consistency check
    if result_ab['winner'] == result_ba_mapped['winner']:
        return {
            'winner': result_ab['winner'],
            'confidence': (result_ab['confidence'] + result_ba_mapped['confidence']) / 2,
            'position_consistent': True
        }
    else:
        # Disagreement indicates position bias was a factor
        return {
            'winner': 'TIE',
            'confidence': 0.5,
            'position_consistent': False,
            'bias_detected': True
        }
python
async def position_swap_comparison(response_a, response_b, prompt, criteria):
    # 第一轮:原始顺序
    result_ab = await compare(response_a, response_b, prompt, criteria)
    
    # 第二轮:交换顺序
    result_ba = await compare(response_b, response_a, prompt, criteria)
    
    # 映射第二轮结果(A在第二位置 → B在第一位置)
    result_ba_mapped = {
        'winner': {'A': 'B', 'B': 'A', 'TIE': 'TIE'}[result_ba['winner']],
        'confidence': result_ba['confidence']
    }
    
    # 一致性检查
    if result_ab['winner'] == result_ba_mapped['winner']:
        return {
            'winner': result_ab['winner'],
            'confidence': (result_ab['confidence'] + result_ba_mapped['confidence']) / 2,
            'position_consistent': True
        }
    else:
        # 分歧表明位置偏见是影响因素
        return {
            'winner': 'TIE',
            'confidence': 0.5,
            'position_consistent': False,
            'bias_detected': True
        }

Alternative: Multiple Shuffles

替代方案:多次打乱顺序

For higher reliability, use multiple position orderings:
python
async def multi_shuffle_comparison(response_a, response_b, prompt, criteria, n_shuffles=3):
    results = []
    for i in range(n_shuffles):
        if i % 2 == 0:
            r = await compare(response_a, response_b, prompt, criteria)
        else:
            r = await compare(response_b, response_a, prompt, criteria)
            r['winner'] = {'A': 'B', 'B': 'A', 'TIE': 'TIE'}[r['winner']]
        results.append(r)
    
    # Majority vote
    winners = [r['winner'] for r in results]
    final_winner = max(set(winners), key=winners.count)
    agreement = winners.count(final_winner) / len(winners)
    
    return {
        'winner': final_winner,
        'confidence': agreement,
        'n_shuffles': n_shuffles
    }
为提升可靠性,可使用多种位置顺序:
python
async def multi_shuffle_comparison(response_a, response_b, prompt, criteria, n_shuffles=3):
    results = []
    for i in range(n_shuffles):
        if i % 2 == 0:
            r = await compare(response_a, response_b, prompt, criteria)
        else:
            r = await compare(response_b, response_a, prompt, criteria)
            r['winner'] = {'A': 'B', 'B': 'A', 'TIE': 'TIE'}[r['winner']]
        results.append(r)
    
    # 多数投票
    winners = [r['winner'] for r in results]
    final_winner = max(set(winners), key=winners.count)
    agreement = winners.count(final_winner) / len(winners)
    
    return {
        'winner': final_winner,
        'confidence': agreement,
        'n_shuffles': n_shuffles
    }

Length Bias

长度偏见

The Problem

问题

LLMs tend to rate longer responses higher, regardless of quality. This manifests as:
  • Verbose responses receiving inflated scores
  • Concise but complete responses penalized
  • Padding and repetition being rewarded
LLM倾向于给更长的响应更高分数,无论质量如何。表现为:
  • 冗长响应获得虚高分数
  • 简洁但完整的响应被惩罚
  • 填充内容和重复内容被奖励

Mitigation: Explicit Prompting

缓解方案:明确提示

Include anti-length-bias instructions in the prompt:
CRITICAL EVALUATION GUIDELINES:
- Do NOT prefer responses because they are longer
- Concise, complete answers are as valuable as detailed ones
- Penalize unnecessary verbosity or repetition
- Focus on information density, not word count
在提示词中加入反长度偏见说明:
关键评估指南:
- 不要因为响应更长而偏好它
- 简洁完整的答案与详细答案具有同等价值
- 惩罚不必要的冗长或重复内容
- 关注信息密度而非字数

Mitigation: Length-Normalized Scoring

缓解方案:长度归一化评分

python
def length_normalized_score(score, response_length, target_length=500):
    """Adjust score based on response length."""
    length_ratio = response_length / target_length
    
    if length_ratio > 2.0:
        # Penalize excessively long responses
        penalty = (length_ratio - 2.0) * 0.1
        return max(score - penalty, 1)
    elif length_ratio < 0.3:
        # Penalize excessively short responses
        penalty = (0.3 - length_ratio) * 0.5
        return max(score - penalty, 1)
    else:
        return score
python
def length_normalized_score(score, response_length, target_length=500):
    """根据响应长度调整分数。"""
    length_ratio = response_length / target_length
    
    if length_ratio > 2.0:
        # 惩罚过长响应
        penalty = (length_ratio - 2.0) * 0.1
        return max(score - penalty, 1)
    elif length_ratio < 0.3:
        # 惩罚过短响应
        penalty = (0.3 - length_ratio) * 0.5
        return max(score - penalty, 1)
    else:
        return score

Mitigation: Separate Length Criterion

缓解方案:单独的长度标准

Make length a separate, explicit criterion so it's not implicitly rewarded:
python
criteria = [
    {"name": "Accuracy", "description": "Factual correctness", "weight": 0.4},
    {"name": "Completeness", "description": "Covers key points", "weight": 0.3},
    {"name": "Conciseness", "description": "No unnecessary content", "weight": 0.3}  # Explicit
]
将长度设为单独的明确标准,避免其被隐性奖励:
python
criteria = [
    {"name": "准确性", "description": "事实正确性", "weight": 0.4},
    {"name": "完整性", "description": "覆盖核心要点", "weight": 0.3},
    {"name": "简洁性", "description": "无不必要内容", "weight": 0.3}  # 明确标准
]

Self-Enhancement Bias

自我增强偏见

The Problem

问题

Models rate outputs generated by themselves (or similar models) higher than outputs from different models.
模型对自身(或相似模型)生成的输出评分高于其他模型的输出。

Mitigation: Cross-Model Evaluation

缓解方案:跨模型评估

Use a different model family for evaluation than generation:
python
def get_evaluator_model(generator_model):
    """Select evaluator to avoid self-enhancement bias."""
    if 'gpt' in generator_model.lower():
        return 'claude-4-5-sonnet'
    elif 'claude' in generator_model.lower():
        return 'gpt-5.2'
    else:
        return 'gpt-5.2'  # Default
使用与生成模型不同的模型家族进行评估:
python
def get_evaluator_model(generator_model):
    """选择评估模型以避免自我增强偏见。"""
    if 'gpt' in generator_model.lower():
        return 'claude-4-5-sonnet'
    elif 'claude' in generator_model.lower():
        return 'gpt-5.2'
    else:
        return 'gpt-5.2'  # 默认

Mitigation: Blind Evaluation

缓解方案:盲态评估

Remove model attribution from responses before evaluation:
python
def anonymize_response(response, model_name):
    """Remove model-identifying patterns."""
    patterns = [
        f"As {model_name}",
        "I am an AI",
        "I don't have personal opinions",
        # Model-specific patterns
    ]
    anonymized = response
    for pattern in patterns:
        anonymized = anonymized.replace(pattern, "[REDACTED]")
    return anonymized
在评估前移除响应中的模型标识:
python
def anonymize_response(response, model_name):
    """移除模型识别模式。"""
    patterns = [
        f"As {model_name}",
        "I am an AI",
        "I don't have personal opinions",
        # 模型特定模式
    ]
    anonymized = response
    for pattern in patterns:
        anonymized = anonymized.replace(pattern, "[REDACTED]")
    return anonymized

Verbosity Bias

冗余偏见

The Problem

问题

Detailed explanations receive higher scores even when the extra detail is irrelevant or incorrect.
详细解释会获得更高分数,即使额外细节无关或错误。

Mitigation: Relevance-Weighted Scoring

缓解方案:相关性加权评分

python
async def relevance_weighted_evaluation(response, prompt, criteria):
    # First, assess relevance of each segment
    relevance_scores = await assess_relevance(response, prompt)
    
    # Weight evaluation by relevance
    segments = split_into_segments(response)
    weighted_scores = []
    for segment, relevance in zip(segments, relevance_scores):
        if relevance > 0.5:  # Only count relevant segments
            score = await evaluate_segment(segment, prompt, criteria)
            weighted_scores.append(score * relevance)
    
    # Guard against the case where no segment clears the relevance cutoff
    return sum(weighted_scores) / len(weighted_scores) if weighted_scores else 0.0
python
async def relevance_weighted_evaluation(response, prompt, criteria):
    # 首先评估每个片段的相关性
    relevance_scores = await assess_relevance(response, prompt)
    
    # 按相关性加权评估
    segments = split_into_segments(response)
    weighted_scores = []
    for segment, relevance in zip(segments, relevance_scores):
        if relevance > 0.5:  # 仅计算相关片段
            score = await evaluate_segment(segment, prompt, criteria)
            weighted_scores.append(score * relevance)
    
    # 防止没有片段达到相关性阈值时发生除零
    return sum(weighted_scores) / len(weighted_scores) if weighted_scores else 0.0

Mitigation: Rubric with Verbosity Penalty

缓解方案:带冗余惩罚的评分表

Include explicit verbosity penalties in rubrics:
python
rubric_levels = [
    {
        "score": 5,
        "description": "Complete and concise. All necessary information, nothing extraneous.",
        "characteristics": ["Every sentence adds value", "No repetition", "Appropriately scoped"]
    },
    {
        "score": 3,
        "description": "Complete but verbose. Contains unnecessary detail or repetition.",
        "characteristics": ["Main points covered", "Some tangents", "Could be more concise"]
    },
    # ... etc
]
在评分表中加入明确的冗余惩罚:
python
rubric_levels = [
    {
        "score": 5,
        "description": "完整且简洁。包含所有必要信息,无多余内容。",
        "characteristics": ["每句话都有价值", "无重复", "范围恰当"]
    },
    {
        "score": 3,
        "description": "完整但冗长。包含不必要的细节或重复内容。",
        "characteristics": ["覆盖核心要点", "存在一些偏离主题的内容", "可更简洁"]
    },
    # ... 其他层级
]

Authority Bias

权威偏见

The Problem

问题

Confident, authoritative tone is rated higher regardless of accuracy.
自信、权威的语气会获得更高分数,无论准确性如何。

Mitigation: Evidence Requirement

缓解方案:证据要求

Require explicit evidence for claims:
For each claim in the response:
1. Identify whether it's a factual claim
2. Note if evidence or sources are provided
3. Score based on verifiability, not confidence

IMPORTANT: Confident claims without evidence should NOT receive higher scores than 
hedged claims with evidence.
要求为主张提供明确证据:
针对响应中的每个主张:
1. 识别其是否为事实主张
2. 记录是否提供了证据或来源
3. 基于可验证性而非自信程度评分

重要提示:无证据的自信主张不应比有证据的谨慎主张获得更高分数。

Mitigation: Fact-Checking Layer

缓解方案:事实核查环节

Add a fact-checking step before scoring:
python
async def fact_checked_evaluation(response, prompt, criteria):
    # Extract claims
    claims = await extract_claims(response)
    
    # Fact-check each claim
    fact_check_results = await asyncio.gather(*[
        verify_claim(claim) for claim in claims
    ])
    
    # Adjust score based on fact-check results
    accuracy_factor = (sum(r['verified'] for r in fact_check_results) / len(fact_check_results)
                       if fact_check_results else 1.0)  # no claims: no penalty
    
    base_score = await evaluate(response, prompt, criteria)
    return base_score * (0.7 + 0.3 * accuracy_factor)  # At least 70% of score
在评分前加入事实核查步骤:
python
async def fact_checked_evaluation(response, prompt, criteria):
    # 提取主张
    claims = await extract_claims(response)
    
    # 事实核查每个主张
    fact_check_results = await asyncio.gather(*[
        verify_claim(claim) for claim in claims
    ])
    
    # 根据事实核查结果调整分数
    accuracy_factor = (sum(r['verified'] for r in fact_check_results) / len(fact_check_results)
                       if fact_check_results else 1.0)  # 无主张时不惩罚
    
    base_score = await evaluate(response, prompt, criteria)
    return base_score * (0.7 + 0.3 * accuracy_factor)  # 分数至少为基础分的70%

Aggregate Bias Detection

聚合偏见检测

Monitor for systematic biases in production:
python
class BiasMonitor:
    def __init__(self):
        self.evaluations = []
    
    def record(self, evaluation):
        self.evaluations.append(evaluation)
    
    def detect_position_bias(self):
        """Detect if first position wins more often than expected."""
        first_wins = sum(1 for e in self.evaluations if e['first_position_winner'])
        expected = len(self.evaluations) * 0.5
        z_score = (first_wins - expected) / (expected * 0.5) ** 0.5
        return {'bias_detected': abs(z_score) > 2, 'z_score': z_score}
    
    def detect_length_bias(self):
        """Detect if longer responses score higher."""
        from scipy.stats import spearmanr
        lengths = [e['response_length'] for e in self.evaluations]
        scores = [e['score'] for e in self.evaluations]
        corr, p_value = spearmanr(lengths, scores)
        return {'bias_detected': corr > 0.3 and p_value < 0.05, 'correlation': corr}
在生产环境中监控系统性偏见:
python
class BiasMonitor:
    def __init__(self):
        self.evaluations = []
    
    def record(self, evaluation):
        self.evaluations.append(evaluation)
    
    def detect_position_bias(self):
        """检测第一位置是否比预期更易获胜。"""
        first_wins = sum(1 for e in self.evaluations if e['first_position_winner'])
        expected = len(self.evaluations) * 0.5
        z_score = (first_wins - expected) / (expected * 0.5) ** 0.5
        return {'bias_detected': abs(z_score) > 2, 'z_score': z_score}
    
    def detect_length_bias(self):
        """检测更长响应是否获得更高分数。"""
        from scipy.stats import spearmanr
        lengths = [e['response_length'] for e in self.evaluations]
        scores = [e['score'] for e in self.evaluations]
        corr, p_value = spearmanr(lengths, scores)
        return {'bias_detected': corr > 0.3 and p_value < 0.05, 'correlation': corr}

Summary Table

汇总表

| Bias | Primary Mitigation | Secondary Mitigation | Detection Method |
| --- | --- | --- | --- |
| Position | Position swapping | Multiple shuffles | Consistency check |
| Length | Explicit prompting | Length normalization | Length-score correlation |
| Self-enhancement | Cross-model evaluation | Anonymization | Model comparison study |
| Verbosity | Relevance weighting | Rubric penalties | Relevance scoring |
| Authority | Evidence requirement | Fact-checking layer | Confidence-accuracy correlation |
| 偏见 | 主要缓解方案 | 次要缓解方案 | 检测方法 |
| --- | --- | --- | --- |
| 位置偏见 | 位置交换 | 多次打乱顺序 | 一致性检查 |
| 长度偏见 | 明确提示 | 长度归一化 | 长度-分数相关性 |
| 自我增强偏见 | 跨模型评估 | 匿名化 | 模型对比研究 |
| 冗余偏见 | 相关性加权 | 评分表惩罚 | 相关性评分 |
| 权威偏见 | 证据要求 | 事实核查环节 | 置信度-准确性相关性 |

LLM-as-Judge Implementation Patterns for Claude Code

Claude Code的LLM-as-Judge实现模式

This reference provides practical prompt patterns and workflows for evaluating Claude Code commands, skills, and agents during development.
本参考文档提供了开发过程中评估Claude Code命令、技能和Agent的实用提示词模式与流程。

Pattern 1: Structured Evaluation Workflow

模式1:结构化评估流程

The most reliable evaluation follows a structured workflow that separates concerns:
Define Criteria → Gather Test Cases → Run Evaluation → Mitigate Bias → Interpret Results
最可靠的评估遵循分离关注点的结构化流程:
定义标准 → 收集测试案例 → 运行评估 → 缓解偏见 → 解读结果

Step 1: Define Evaluation Criteria

步骤1:定义评估标准

Before evaluating, establish clear criteria. Document them in a reusable format:
评估前先建立明确标准,以可复用格式记录:

Evaluation Criteria for [Command/Skill Name]

[命令/技能名称]的评估标准

Criterion 1: Instruction Following (weight: 0.30)

标准1:指令遵循度(权重:0.30)

  • Description: Does the output follow all explicit instructions?
  • 1 (Poor): Ignores or misunderstands core instructions
  • 3 (Adequate): Follows main instructions, misses some details
  • 5 (Excellent): Follows all instructions precisely
  • 描述:输出是否遵循所有明确指令?
  • 1分(差):忽略或误解核心指令
  • 3分(合格):遵循主要指令,遗漏部分细节
  • 5分(优秀):精确遵循所有指令

Criterion 2: Output Completeness (weight: 0.25)

标准2:输出完整性(权重:0.25)

  • Description: Are all requested aspects covered?
  • 1 (Poor): Major aspects missing
  • 3 (Adequate): Core aspects covered with gaps
  • 5 (Excellent): All aspects thoroughly addressed
  • 描述:是否覆盖所有请求内容?
  • 1分(差):缺失主要内容
  • 3分(合格):覆盖核心内容但存在缺口
  • 5分(优秀):全面覆盖所有内容

Criterion 3: Tool Efficiency (weight: 0.20)

标准3:工具效率(权重:0.20)

  • Description: Were appropriate tools used efficiently?
  • 1 (Poor): Wrong tools or excessive redundant calls
  • 3 (Adequate): Appropriate tools with some redundancy
  • 5 (Excellent): Optimal tool selection, minimal calls
  • 描述:是否高效使用了合适的工具?
  • 1分(差):使用错误工具或过度冗余调用
  • 3分(合格):使用合适工具但存在一些冗余
  • 5分(优秀):工具选择最优,调用次数最少

Criterion 4: Reasoning Quality (weight: 0.15)

标准4:推理质量(权重:0.15)

  • Description: Is the reasoning clear and sound?
  • 1 (Poor): No apparent reasoning or flawed logic
  • 3 (Adequate): Basic reasoning present
  • 5 (Excellent): Clear, logical reasoning throughout
  • 描述:推理是否清晰合理?
  • 1分(差):无明显推理或逻辑缺陷
  • 3分(合格):具备基础推理能力
  • 5分(优秀):全程推理清晰、逻辑严谨

Criterion 5: Response Coherence (weight: 0.10)

标准5:响应连贯性(权重:0.10)

  • Description: Is the output well-structured and clear?
  • 1 (Poor): Difficult to follow or incoherent
  • 3 (Adequate): Understandable but could be clearer
  • 5 (Excellent): Well-structured, easy to follow
  • 描述:输出结构是否清晰易懂?
  • 1分(差):难以理解或逻辑混乱
  • 3分(合格):可理解但可更清晰
  • 5分(优秀):结构清晰,易于理解

Step 2: Create Test Cases

步骤2:创建测试案例

Structure test cases by complexity level:
按复杂度层级构建测试案例:

Test Cases for /refactor Command

/refactor命令的测试案例

Simple (Single Operation)

简单(单次操作)

  • Input: Rename variable x to count in a single file
  • Expected: All instances renamed, code still runs
  • Complexity: Low
  • 输入:在单个文件中将变量 x 重命名为 count
  • 预期:所有实例均已重命名,代码仍可运行
  • 复杂度:低

Medium (Multiple Operations)

中等(多次操作)

  • Input: Extract function from 20-line code block
  • Expected: New function created, original call site updated, behavior preserved
  • Complexity: Medium
  • 输入:从20行代码块中提取函数
  • 预期:创建新函数,更新原始调用位置,保留原有行为
  • 复杂度:中

Complex (Cross-File Changes)

复杂(跨文件修改)

  • Input: Refactor class to use Strategy pattern
  • Expected: Interface created, implementations separated, all usages updated
  • Complexity: High
  • 输入:重构类以使用策略模式
  • 预期:创建接口,分离实现,更新所有调用位置
  • 复杂度:高

Edge Case

边缘案例

  • Input: Refactor code with conflicting variable names in nested scopes
  • Expected: Correct scoping preserved, no accidental shadowing
  • Complexity: Edge case
  • 输入:重构嵌套作用域中存在冲突变量名的代码
  • 预期:保留正确作用域,无意外遮蔽
  • 复杂度:边缘案例

Step 3: Run Direct Scoring Evaluation

步骤3:运行直接评分评估

Use this prompt template to evaluate a single output:
markdown
You are evaluating the output of a Claude Code command.
使用以下提示词模板评估单个输出:
markdown
你正在评估Claude Code命令的输出。

Original Task

原始任务

{paste the user's original request}
{粘贴用户的原始请求}

Command Output

命令输出

{paste the full command output including tool calls}
{粘贴完整命令输出,包括工具调用}

Evaluation Criteria

评估标准

{paste your criteria definitions from Step 1}
{粘贴步骤1中定义的标准}

Instructions

说明

For each criterion:
  1. Find specific evidence in the output that supports your assessment
  2. Assign a score (1-5) based on the rubric levels
  3. Write a 1-2 sentence justification citing the evidence
  4. Suggest one specific improvement
IMPORTANT: Provide your justification BEFORE stating the score. This improves evaluation reliability.
针对每个标准:
  1. 在输出中找到支持评估的具体证据
  2. 根据评分表层级给出1-5分
  3. 撰写1-2句引用证据的理由
  4. 提出一条具体优化建议
重要提示:先提供理由再给出分数。这能提升评估可靠性。

Output Format

输出格式

For each criterion, respond with:
针对每个标准,响应格式如下:

[Criterion Name]

[标准名称]

Evidence: [Quote or describe specific parts of the output] Justification: [Explain how the evidence maps to the rubric level] Score: [1-5] Improvement: [One actionable suggestion]
证据:[引用或描述输出中的具体部分] 理由:[解释证据如何对应评分表层级] 分数:[1-5] 优化建议:[一条可落地的建议]

Overall Assessment

整体评估

Weighted Score: [Calculate: sum of (score × weight)] Pass/Fail: [Pass if weighted score ≥ 3.5] Summary: [2-3 sentences summarizing strengths and weaknesses]
加权分数:[计算:(分数 × 权重)之和] 通过/失败:[加权分数 ≥ 3.5则通过] 总结:[2-3句话总结优势与不足]
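The weighted-score and pass/fail arithmetic from the overall assessment can be sketched directly; the function name and the weight-sum check are illustrative assumptions:

```python
def weighted_assessment(criterion_scores, weights, pass_threshold=3.5):
    """Compute the weighted score and verdict described above.

    `criterion_scores` maps criterion name -> 1-5 score; `weights` maps
    the same names -> weights that should sum to 1.0.
    """
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    weighted = sum(criterion_scores[name] * w for name, w in weights.items())
    return {"weighted_score": round(weighted, 2),
            "verdict": "PASS" if weighted >= pass_threshold else "FAIL"}
```

With the Step 1 weights (0.30/0.25/0.20/0.15/0.10) and scores 4, 3, 5, 4, 4, this yields 3.95 and a PASS verdict.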

Step 4: Mitigate Position Bias in Comparisons

步骤4:缓解成对比较中的位置偏见

When comparing two prompt variants (A vs B), use this two-pass workflow:
Pass 1 (A First):
markdown
You are comparing two outputs from different prompt variants.
对比两个提示词变体(A vs B)时,使用以下两轮流程:
第一轮(A在前)
markdown
你正在比较来自不同提示词变体的两个输出。

Original Task

原始任务

{task description}
{任务描述}

Output A (First Variant)

输出A(第一变体)

{output from prompt variant A}
{提示词变体A的输出}

Output B (Second Variant)

输出B(第二变体)

{output from prompt variant B}
{提示词变体B的输出}

Comparison Criteria

对比标准

  • Instruction Following
  • Output Completeness
  • Reasoning Quality
  • 指令遵循度
  • 输出完整性
  • 推理质量

Critical Instructions

关键说明

  • Do NOT prefer outputs because they are longer
  • Do NOT prefer outputs based on their position (first vs second)
  • Focus ONLY on quality differences
  • TIE is acceptable when outputs are equivalent
  • 不要因为输出更长而偏好它
  • 不要根据位置(第一/第二)偏好输出
  • 仅关注质量差异
  • 当输出相当时,平局是可接受的

Analysis Process

分析流程

  1. Analyze Output A independently: [strengths, weaknesses]
  2. Analyze Output B independently: [strengths, weaknesses]
  3. Compare on each criterion
  4. Determine winner with confidence (0-1)
  1. 独立分析输出A:[优势、不足]
  2. 独立分析输出B:[优势、不足]
  3. 在每个标准上对比两者
  4. 确定获胜者及置信度(0-1)

Output

输出

Reasoning: [Explain why] Winner: [A/B/TIE] Confidence: [0.0-1.0]

**Pass 2 (B First):**
Repeat the same prompt but swap the order—put Output B first and Output A second.

**Interpret Results:**
- If both passes agree → Winner confirmed, average the confidences
- If passes disagree → Result is TIE with confidence 0.5 (position bias detected)
理由:[解释原因] 获胜者:[A/B/平局] 置信度:[0.0-1.0]

**第二轮(B在前)**:
重复相同提示词,但交换顺序——将输出B放在前面,输出A放在后面。

**解读结果**:
- 若两轮结果一致 → 确认获胜者,取置信度平均值
- 若两轮结果不一致 → 结果为平局,置信度0.5(检测到位置偏见)

Pattern 2: Hierarchical Evaluation Workflow

模式2:分层评估流程

For complex evaluations, use a hierarchical approach:
Quick Screen (cheap model) → Detailed Evaluation (expensive model) → Human Review (edge cases)
针对复杂评估,使用分层方法:
快速筛选(低成本模型) → 详细评估(高成本模型) → 人工审查(边缘案例)

Tier 1: Quick Screen (Use Haiku)

第一层:快速筛选(使用Haiku)

markdown
Rate this command output 0-10 for basic adequacy.

Task: {brief task description}
Output: {command output}

Quick assessment: Does this output reasonably address the task?
Score (0-10):
One-line reasoning:
Decision rule: Score < 5 → Fail, Score ≥ 7 → Pass, 5 ≤ Score < 7 → Escalate to detailed evaluation
markdown
为该命令输出的基本充足性评分(0-10分)。

任务:{简要任务描述}
输出:{命令输出}

快速评估:该输出是否合理完成了任务?
分数(0-10):
一句话理由:
决策规则:分数 < 5 → 失败,分数 ≥ 7 → 通过,5 ≤ 分数 < 7 → 升级至详细评估
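The decision rule above can be sketched as a small router; `quick_screen` and `detailed_eval` are hypothetical wrappers around the cheap and expensive model calls:

```python
def tiered_evaluate(output, task, quick_screen, detailed_eval):
    """Route an output through the hierarchical evaluation tiers."""
    score = quick_screen(task, output)  # Tier 1: 0-10 adequacy score
    if score < 5:
        return {"result": "FAIL", "tier": 1}
    if score >= 7:
        return {"result": "PASS", "tier": 1}
    # Borderline (5 <= score < 7): escalate to the detailed evaluator
    return {"result": detailed_eval(task, output), "tier": 2}
```

Results with low confidence from Tier 2 would then be queued for the Tier 3 human review described below.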

Tier 2: Detailed Evaluation (Use Opus)

第二层:详细评估(使用Opus)

Use the full direct scoring prompt from Pattern 1 for borderline cases.
对临界(borderline)案例使用模式1中的完整直接评分提示词。

Tier 3: Human Review

第三层:人工审查

For low-confidence automated evaluations (confidence < 0.6), queue for manual review:
对低置信度自动化评估(置信度 < 0.6),提交人工审查:

Human Review Request

人工审查请求

Automated Score: 3.2/5 (Confidence: 0.45) Reason for Escalation: Low confidence, evaluator disagreed across passes
自动化分数:3.2/5(置信度:0.45) 升级原因:置信度低,评估者两轮结果不一致

What to Review

审查内容

  1. Does the output actually complete the task?
  2. Are the automated criterion scores reasonable?
  3. What did the automation miss?
  1. 输出是否实际完成了任务?
  2. 自动化标准分数是否合理?
  3. 自动化评估遗漏了什么?

Original Task

原始任务

{task}
{任务}

Output

输出

{output}
{输出}

Automated Assessment

自动化评估

{paste automated evaluation}
{粘贴自动化评估结果}

Human Override

人工覆盖

[ ] Agree with automation [ ] Override to PASS - Reason: ___ [ ] Override to FAIL - Reason: ___
[ ] 同意自动化评估结果 [ ] 覆盖为通过 - 理由:___ [ ] 覆盖为失败 - 理由:___

Pattern 3: Panel of LLM Judges (PoLL)

模式3:LLM评估专家组(PoLL)

For high-stakes evaluation, use multiple models:
针对高风险评估,使用多个模型:

Workflow

流程

  1. Run 3 independent evaluations with different prompt framings:
    • Evaluation 1: Standard criteria prompt
    • Evaluation 2: Adversarial framing ("Find problems with this output")
    • Evaluation 3: User perspective ("Would a developer be satisfied?")
  2. Aggregate results:
    • Take median score per criterion (robust to outliers)
    • Flag criteria with high variance (std > 1.0) for review
    • Overall pass requires majority agreement
  1. 运行3次独立评估,使用不同提示词框架:
    • 评估1:标准评估提示词
    • 评估2:对抗性框架("找出该输出的问题")
    • 评估3:用户视角("开发者会对该结果满意吗?")
  2. 汇总结果
    • 取每个标准的中位数分数(对异常值鲁棒)
    • 标记高方差标准(标准差 > 1.0)以进行审查
    • 整体通过需多数同意
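The aggregation step (median per criterion, variance flagging, majority vote) can be sketched as follows. The per-judge pass rule (mean score ≥ 3.5) is an assumption borrowed from the direct-scoring threshold elsewhere in this document, and criteria whose standard deviation reaches 1.0 are flagged, matching the agreement table later in this pattern:

```python
from statistics import median, stdev
from collections import Counter

def aggregate_poll(judge_results, variance_flag=1.0):
    """Aggregate a panel of judge scores.

    `judge_results` is a list of {criterion: score} dicts, one per judge.
    """
    criteria = judge_results[0].keys()
    medians = {c: median(r[c] for r in judge_results) for c in criteria}
    # Flag criteria where judges disagree strongly (std dev >= threshold)
    flagged = [c for c in criteria
               if stdev(r[c] for r in judge_results) >= variance_flag]
    # Majority vote on overall pass (per-judge mean >= 3.5, an assumed rule)
    passes = [sum(r.values()) / len(r) >= 3.5 for r in judge_results]
    verdict = Counter(passes).most_common(1)[0][0]
    return {"medians": medians, "high_variance": flagged, "pass": verdict}
```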

Multi-Judge Prompt Variants

多评估者提示词变体

Standard Framing:
markdown
Evaluate this output against the specified criteria. Be fair and balanced.
Adversarial Framing:
markdown
Your role is to find problems with this output. Be critical and thorough.
Look for: factual errors, missing requirements, inefficiencies, unclear explanations.
User Perspective:
markdown
Imagine you're a developer who requested this task.
Would you be satisfied with this result? Would you need to redo any work?
标准框架
markdown
根据指定标准评估该输出。保持公平平衡。
对抗性框架
markdown
你的角色是找出该输出的问题。保持批判性和全面性。
寻找:事实错误、缺失需求、低效、模糊解释。
用户视角
markdown
假设你是请求该任务的开发者。
你会对该结果满意吗?你需要重做任何工作吗?

Agreement Analysis

一致性分析

After running all judges, check consistency:
| Criterion | Judge 1 | Judge 2 | Judge 3 | Median | Std Dev |
| --- | --- | --- | --- | --- | --- |
| Instruction Following | 4 | 4 | 5 | 4 | 0.58 |
| Completeness | 3 | 4 | 3 | 3 | 0.58 |
| Tool Efficiency | 2 | 3 | 4 | 3 | 1.00 ⚠️ |
⚠️ High variance on Tool Efficiency suggests the criterion needs clearer definition or the output has ambiguous efficiency characteristics.
所有评估者完成评估后,检查一致性:
| 标准 | 评估者1 | 评估者2 | 评估者3 | 中位数 | 标准差 |
| --- | --- | --- | --- | --- | --- |
| 指令遵循度 | 4 | 4 | 5 | 4 | 0.58 |
| 完整性 | 3 | 4 | 3 | 3 | 0.58 |
| 工具效率 | 2 | 3 | 4 | 3 | 1.00 ⚠️ |
⚠️ 工具效率上的高方差表明,该标准需要更清晰的定义,或输出的效率特征存在歧义。

Pattern 4: Confidence Calibration

模式4:置信度校准

Confidence scores should be calibrated to actual reliability:
置信度分数应与实际可靠性校准:

Confidence Factors

置信度因素

| Factor | High Confidence | Low Confidence |
| --- | --- | --- |
| Position consistency | Both passes agree | Passes disagree |
| Evidence count | 3+ specific citations | Vague or no citations |
| Criterion agreement | All criteria align | Criteria scores vary widely |
| Edge case match | Similar to known cases | Novel situation |
| 因素 | 高置信度 | 低置信度 |
| --- | --- | --- |
| 位置一致性 | 两轮结果一致 | 两轮结果不一致 |
| 证据数量 | 3+个具体引用 | 模糊或无引用 |
| 标准一致性 | 所有标准分数一致 | 标准分数差异大 |
| 边缘案例匹配 | 与已知案例相似 | 全新场景 |
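One way to fold these factors into a single number is a simple additive heuristic; the specific weights below are illustrative assumptions, not calibrated values:

```python
def calibrated_confidence(position_consistent, evidence_citations,
                          criterion_std, novel_situation):
    """Combine the confidence factors above into a single 0-1 score.

    Starts from a neutral 0.5 and adjusts per factor; weights are
    illustrative and should be tuned against human-review outcomes.
    """
    conf = 0.5
    conf += 0.2 if position_consistent else -0.2   # swap-pass agreement
    conf += 0.15 if evidence_citations >= 3 else -0.1  # evidence strength
    conf += 0.1 if criterion_std <= 0.5 else -0.1  # criterion agreement
    conf -= 0.1 if novel_situation else 0.0        # novel edge case
    return max(0.0, min(1.0, conf))
```

Scores produced this way should themselves be validated: if outputs rated 0.9 are overturned by humans as often as those rated 0.6, the weights need recalibration.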

Calibration Prompt Addition

校准提示词补充

Add this to evaluation prompts:
在评估提示词中加入以下内容:

Confidence Assessment

置信度评估

After scoring, assess your confidence:
  1. Evidence Strength: How specific was the evidence you cited?
    • Strong: Quoted exact passages, precise observations
    • Moderate: General observations, reasonable inferences
    • Weak: Vague impressions, assumptions
  2. Criterion Clarity: How clear were the criterion boundaries?
    • Clear: Easy to map output to rubric levels
    • Ambiguous: Output fell between levels
    • Unclear: Rubric didn't fit this case
  3. Overall Confidence: [0.0-1.0]
    • 0.9+: Very confident, clear evidence, obvious rubric fit
    • 0.7-0.9: Confident, good evidence, minor ambiguity
    • 0.5-0.7: Moderate confidence, some ambiguity
    • <0.5: Low confidence, significant uncertainty
Confidence: [score] Confidence Reasoning: [explain what factors affected confidence]
评分后,评估你的置信度:
  1. 证据强度:你引用的证据有多具体?
    • 强:引用了确切段落、精确观察结果
    • 中:一般性观察、合理推断
    • 弱:模糊印象、假设
  2. 标准清晰度:标准边界有多清晰?
    • 清晰:易于将输出映射到评分表层级
    • 模糊:输出处于层级之间
    • 不清晰:评分表不适用于该案例
  3. 整体置信度:[0.0-1.0]
    • 0.9+:非常自信,证据明确,评分表匹配度高
    • 0.7-0.9:自信,证据充分,存在微小模糊
    • 0.5-0.7:中等置信度,存在一些模糊
    • <0.5:低置信度,存在显著不确定性
置信度:[分数] 置信度理由:[解释影响置信度的因素]

Pattern 5: Structured Output Format

模式5:结构化输出格式

Request consistent output structure for easier analysis:
要求一致的输出格式以简化分析:

Evaluation Output Template

评估输出模板


Evaluation Results

评估结果

Metadata

元数据

  • Evaluated: [command/skill name]
  • Test Case: [test case ID or description]
  • Evaluator: [model used]
  • Timestamp: [when evaluated]
  • 评估对象:[命令/技能名称]
  • 测试案例:[测试案例ID或描述]
  • 评估者:[使用的模型]
  • 时间戳:[评估时间]

Criterion Scores

标准分数

| Criterion | Score | Weight | Weighted | Confidence |
| --- | --- | --- | --- | --- |
| Instruction Following | 4/5 | 0.30 | 1.20 | 0.85 |
| Output Completeness | 3/5 | 0.25 | 0.75 | 0.70 |
| Tool Efficiency | 5/5 | 0.20 | 1.00 | 0.90 |
| Reasoning Quality | 4/5 | 0.15 | 0.60 | 0.75 |
| Response Coherence | 4/5 | 0.10 | 0.40 | 0.80 |
| 标准 | 分数 | 权重 | 加权分数 | 置信度 |
| --- | --- | --- | --- | --- |
| 指令遵循度 | 4/5 | 0.30 | 1.20 | 0.85 |
| 输出完整性 | 3/5 | 0.25 | 0.75 | 0.70 |
| 工具效率 | 5/5 | 0.20 | 1.00 | 0.90 |
| 推理质量 | 4/5 | 0.15 | 0.60 | 0.75 |
| 响应连贯性 | 4/5 | 0.10 | 0.40 | 0.80 |

Summary

总结

  • Overall Score: 3.95/5.0
  • Pass Threshold: 3.5/5.0
  • Result: ✅ PASS
  • 整体分数:3.95/5.0
  • 通过阈值:3.5/5.0
  • 结果:✅ 通过

Evidence Summary

证据总结

  • Strengths: [bullet points]
  • Weaknesses: [bullet points]
  • Improvements: [prioritized suggestions]
  • 优势:[要点]
  • 不足:[要点]
  • 优化建议:[按优先级排序的建议]

Confidence Assessment

置信度评估

  • Overall Confidence: 0.78
  • Flags: [any concerns or caveats]
  • 整体置信度:0.78
  • 标记:[任何关注点或注意事项]
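The overall score in the summary above is the sum of the weighted criterion scores. A minimal sketch, reusing the example table's names and weights:

上述总结中的整体分数是各加权标准分数之和。以下为最小示例,沿用示例表格中的名称与权重:

```python
def weighted_score(scores):
    """Combine per-criterion (score, weight) pairs into an overall score."""
    return sum(score * weight for score, weight in scores.values())

# Values mirror the example table above.
scores = {
    "instruction_following": (4, 0.30),
    "output_completeness":   (3, 0.25),
    "tool_efficiency":       (5, 0.20),
    "reasoning_quality":     (4, 0.15),
    "response_coherence":    (4, 0.10),
}
overall = weighted_score(scores)   # 1.20 + 0.75 + 1.00 + 0.60 + 0.40 = 3.95
passed = overall >= 3.5            # pass threshold from the template
```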

Evaluation Workflows for Claude Code Development

Claude Code开发的评估流程

Workflow: Testing a New Command

流程:测试新命令

  1. Write 5-10 test cases spanning complexity levels
  2. Run command on each test case, capture full output
  3. Quick screen all outputs with Tier 1 evaluation
  4. Detailed evaluate failures and borderline cases
  5. Identify patterns in failures to guide prompt improvements
  6. Iterate prompt based on specific weaknesses found
  7. Re-evaluate same test cases to measure improvement
  1. 编写5-10个测试案例,覆盖不同复杂度层级
  2. 在每个测试案例上运行命令,捕获完整输出
  3. 使用第一层评估快速筛选所有输出
  4. 对失败和边界案例进行详细评估
  5. 识别故障模式以指导提示词优化
  6. 基于发现的具体弱点迭代提示词
  7. 在相同测试案例上重新评估以衡量优化效果
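Steps 2-4 of this workflow can be sketched as a small harness. `run_command` and `quick_screen` are hypothetical callables standing in for your own command runner and Tier 1 screen:

该流程的第2-4步可以勾勒为一个小型测试框架。`run_command` 和 `quick_screen` 是假设的可调用对象,代表你自己的命令运行器和第一层筛选:

```python
def screen_command_outputs(run_command, quick_screen, test_cases):
    """Run each test case, quick-screen the output, and collect the
    failures and borderline cases that need detailed evaluation."""
    results = []
    for case in test_cases:
        output = run_command(case)          # step 2: capture full output
        verdict = quick_screen(output)      # step 3: Tier 1 quick screen
        results.append({"case": case, "output": output, "verdict": verdict})
    # step 4: only non-passing cases get the detailed evaluation
    needs_detail = [r for r in results if r["verdict"] != "pass"]
    return results, needs_detail

# Stubbed run over two complexity levels:
results, needs_detail = screen_command_outputs(
    run_command=lambda case: f"output for {case}",
    quick_screen=lambda out: "fail" if "complex" in out else "pass",
    test_cases=["simple case", "complex case"],
)
```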

Workflow: Comparing Prompt Variants

流程:对比提示词变体

  1. Create variant prompts (e.g., different instruction phrasings)
  2. Run both variants on identical test cases
  3. Pairwise compare with position swapping
  4. Calculate win rate for each variant
  5. Analyze which cases each variant handles better
  6. Decide: Pick winner or create hybrid
  1. 创建变体提示词(如不同指令表述)
  2. 在相同测试案例上运行两个变体
  3. 使用位置交换进行成对比较
  4. 计算每个变体的胜率
  5. 分析每个变体更擅长处理的案例
  6. 决策:选择获胜变体或创建混合变体
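Steps 3-4 above (position swapping and win rate) can be sketched as follows; `judge` is a hypothetical callable returning "first", "second", or "tie" for the output it prefers:

上述第3-4步(位置交换与胜率)可按如下方式勾勒;`judge` 是一个假设的可调用对象,返回其偏好的 "first"、"second" 或 "tie":

```python
def pairwise_with_swap(judge, output_a, output_b):
    """Judge a pair twice with positions swapped to control position bias."""
    r1 = judge(output_a, output_b)                        # A in position 1
    r2 = judge(output_b, output_a)                        # swapped: A in position 2
    v1 = {"first": "A", "second": "B", "tie": "tie"}[r1]  # map back to variants
    v2 = {"first": "B", "second": "A", "tie": "tie"}[r2]
    return v1 if v1 == v2 else "inconsistent"             # a flip signals position bias

def win_rate(verdicts, variant="A"):
    """Step 4: wins for `variant` over all decided comparisons."""
    decided = [v for v in verdicts if v in ("A", "B")]
    return sum(v == variant for v in decided) / len(decided) if decided else 0.0

# A judge that genuinely prefers one output stays consistent across the swap:
length_judge = lambda first, second: "first" if len(first) > len(second) else "second"
verdict = pairwise_with_swap(length_judge, "a much longer output", "short")  # "A"
```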

Workflow: Regression Testing

流程:回归测试

  1. Maintain test suite of representative cases
  2. Before changes: Run evaluation, record baseline scores
  3. After changes: Re-run evaluation
  4. Compare: Flag regressions (score drops > 0.5)
  5. Investigate: Why did specific cases regress?
  6. Accept or revert: Based on overall impact
  1. 维护代表性案例的测试套件
  2. 修改前:运行评估,记录基准分数
  3. 修改后:重新运行评估
  4. 对比:标记回归(分数下降 > 0.5)
  5. 调查:特定案例为何出现回归?
  6. 接受或回滚:基于整体影响决策
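Step 4's regression flag (score drops > 0.5) reduces to a comparison of per-case scores. A minimal sketch with illustrative case IDs:

第4步的回归标记(分数下降 > 0.5)可归结为逐案例分数对比。以下为使用示意性案例ID的最小示例:

```python
REGRESSION_THRESHOLD = 0.5  # flag drops larger than this (on the 1-5 scale)

def find_regressions(baseline, current, threshold=REGRESSION_THRESHOLD):
    """Compare per-case scores before and after a change; flag regressions."""
    return {
        case: (baseline[case], current[case])
        for case in baseline
        if case in current and baseline[case] - current[case] > threshold
    }

baseline = {"case-1": 4.2, "case-2": 3.8, "case-3": 4.5}
current  = {"case-1": 4.1, "case-2": 3.0, "case-3": 4.6}
regressions = find_regressions(baseline, current)  # only case-2 dropped > 0.5
```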

Workflow: Continuous Quality Monitoring

流程:持续质量监控

  1. Sample production usage (if available)
  2. Run lightweight evaluation on samples
  3. Track metrics over time:
    • Average scores by criterion
    • Failure rate
    • Low-confidence rate
  4. Alert on degradation: Score drop > 10% from baseline
  5. Periodic deep dive: Monthly detailed evaluation on random sample
  1. 抽样生产使用数据(如有)
  2. 对样本运行轻量评估
  3. 随时间跟踪指标
    • 各标准的平均分数
    • 失败率
    • 低置信度率
  4. 性能下降告警:分数比基准下降 > 10%
  5. 定期深度分析:每月对随机样本进行详细评估
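The step 4 alert condition (score drop > 10% from baseline) as a one-line check:

第4步的告警条件(分数比基准下降 > 10%)可写成单行检查:

```python
def should_alert(baseline, current, max_drop=0.10):
    """Alert when the current score drops more than `max_drop` below baseline."""
    return current < baseline * (1 - max_drop)
```

For example, with a baseline of 4.0 the alert fires at 3.5 but not at 3.7.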

Anti-Patterns to Avoid

需避免的反模式

❌ Scoring Without Justification

❌ 无理由评分

Problem: Scores lack grounding, difficult to debug Solution: Always require evidence before score
问题:分数缺乏依据,难以调试 解决方案:始终要求先提供证据再给分

❌ Single-Pass Pairwise Comparison

❌ 单次成对比较

Problem: Position bias corrupts results Solution: Always swap positions and check consistency
问题:位置偏见会影响结果 解决方案:始终交换位置并检查一致性

❌ Overloaded Criteria

❌ 过载标准

Problem: Criteria measuring multiple things are unreliable Solution: One criterion = one measurable aspect
问题:衡量多个内容的标准不可靠 解决方案:一个标准 = 一个可衡量的维度

❌ Missing Edge Case Guidance

❌ 缺失边缘案例指导

Problem: Evaluators handle ambiguous cases inconsistently Solution: Include edge cases in rubrics with explicit guidance
问题:评估者处理模糊场景的方式不一致 解决方案:在评分表中加入边缘案例及明确指导

❌ Ignoring Low Confidence

❌ 忽略低置信度

Problem: Acting on uncertain evaluations leads to wrong conclusions Solution: Escalate low-confidence cases for human review
问题:基于不确定评估采取行动会导致错误结论 解决方案:将低置信度案例升级至人工审查

❌ Generic Rubrics

❌ 通用评分表

Problem: Generic criteria produce vague, unhelpful evaluations Solution: Create domain-specific rubrics (code commands vs documentation commands vs analysis commands)
问题:通用标准会产生模糊、无用的评估结果 解决方案:创建领域特定评分表(代码命令 vs 文档命令 vs 分析命令)

Handling Evaluation Failures

处理评估失败

When evaluations fail or produce unreliable results, use these recovery strategies:
当评估失败或产生不可靠结果时,使用以下恢复策略:

Handling Malformed Output

处理格式错误的输出

When the evaluator produces unparseable or incomplete output:
  1. Mark as invalid and exclude from analysis: malformed output usually indicates hallucination during the thinking process
  2. Retry the original prompt unchanged: multiple retries are usually more consistent than a single one-shot attempt
  3. If the output is still malformed, flag for human review: mark as "evaluation failed, needs manual check" and queue for later
当评估者产生无法解析或不完整的输出时:
  1. 标记为无效并忽略分析 - 错误输出通常是思考过程中的幻觉导致
  2. 不修改原始提示词重试 - 多次重试通常比单次提示更一致
  3. 若仍产生错误输出,标记为人工审查:标记为"评估失败,需手动检查"并延后处理
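These three recovery steps can be sketched as a retry loop. `evaluate` and `parse` are hypothetical callables; `parse` is assumed to raise `ValueError` on malformed output:

这三个恢复步骤可以勾勒为一个重试循环。`evaluate` 和 `parse` 是假设的可调用对象;假定 `parse` 在输出格式错误时抛出 `ValueError`:

```python
def evaluate_with_retries(evaluate, parse, max_retries=3):
    """Retry an evaluation when its output cannot be parsed; escalate if it
    never succeeds."""
    for _attempt in range(max_retries):
        raw = evaluate()
        try:
            return parse(raw)        # valid result: use it
        except ValueError:
            continue                 # step 1-2: discard and retry unchanged
    # step 3: still malformed after retries, escalate to a human
    return {"status": "evaluation failed, needs manual check"}

# Stubbed evaluator that produces garbled output twice, then a valid score:
calls = {"n": 0}
def flaky_evaluate():
    calls["n"] += 1
    return "garbled" if calls["n"] < 3 else "score: 4"

def parse_score(raw):
    if not raw.startswith("score:"):
        raise ValueError("unparseable evaluator output")
    return {"score": int(raw.split(":")[1])}

result = evaluate_with_retries(flaky_evaluate, parse_score)
```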

Validation Checklist

验证清单

Before trusting evaluation results, verify:
  • All criteria have scores in valid range (1-5)
  • Each score has a justification referencing specific evidence
  • Confidence score is provided and reasonable
  • No contradictions between justification and assigned score
  • Weighted total calculation is correct
信任评估结果前,验证:
  • 所有标准的分数均在有效范围(1-5)内
  • 每个分数均有引用具体证据的理由
  • 提供了合理的置信度分数
  • 理由与分数之间无矛盾
  • 加权总分计算正确
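The mechanical items on this checklist can be automated; only the justification/score contradiction check still needs a human or judge model. A sketch over an assumed result structure:

清单中可机械化的检查项可以自动化;只有理由与分数的矛盾检查仍需人工或评审模型。以下为基于假设结果结构的示例:

```python
def validate_evaluation(result, weights):
    """Run the mechanical checklist items; return a list of problems found."""
    problems = []
    for criterion, entry in result["criteria"].items():
        if not 1 <= entry["score"] <= 5:
            problems.append(f"{criterion}: score out of 1-5 range")
        if not entry.get("justification"):
            problems.append(f"{criterion}: missing justification")
    if not 0.0 <= result.get("confidence", -1.0) <= 1.0:
        problems.append("confidence missing or out of range")
    expected = sum(result["criteria"][c]["score"] * w for c, w in weights.items())
    if abs(result["overall"] - expected) > 1e-6:
        problems.append("weighted total does not match criterion scores")
    return problems

result = {
    "criteria": {"accuracy": {"score": 4, "justification": "quotes exact passage"}},
    "confidence": 0.8,
    "overall": 4.0,
}
```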

Validating Evaluation Prompts (Meta-Evaluation)

验证评估提示词(元评估)

Before using an evaluation prompt in production, test it against known cases:
在生产环境中使用评估提示词前,针对已知案例进行测试:

Calibration Test Cases

校准测试案例

Create a small set of outputs with known quality levels:
| Test Type | Description | Expected Score |
| --- | --- | --- |
| Known-good | Clearly excellent output | 4.5+ / 5.0 |
| Known-bad | Clearly poor output | < 2.5 / 5.0 |
| Boundary | Borderline case | 3.0-3.5 with nuanced explanation |

创建具有已知质量水平的小型输出集:

| 测试类型 | 描述 | 预期分数 |
| --- | --- | --- |
| 已知优秀 | 明显优秀的输出 | 4.5+ / 5.0 |
| 已知较差 | 明显较差的输出 | < 2.5 / 5.0 |
| 边界 | 边界案例 | 3.0-3.5,带详细解释 |

Validation Workflow

验证流程

  1. Known-good test: Evaluate a clearly excellent output
    • If score < 4.0 → Rubric is too strict or evidence requirements unclear
  2. Known-bad test: Evaluate a clearly poor output
    • If score > 3.0 → Rubric is too lenient or criteria not specific enough
  3. Boundary test: Evaluate a borderline case
    • Should produce moderate score (3.0-3.5) with detailed explanation
    • If confident high/low score → Criteria lack nuance
  4. Consistency test: Run same evaluation 3 times
    • Score variance should be < 0.5
    • If higher variance → Criteria need tighter definitions
  1. 已知优秀测试:评估明显优秀的输出
    • 若分数 < 4.0 → 评分表过于严格或证据要求不明确
  2. 已知较差测试:评估明显较差的输出
    • 若分数 > 3.0 → 评分表过于宽松或标准不够具体
  3. 边界测试:评估边界案例
    • 应给出中等分数(3.0-3.5)及详细解释
    • 若给出高/低置信度分数 → 标准缺乏细微差别
  4. 一致性测试:重复运行3次相同评估
    • 分数差异应 < 0.5
    • 若差异更大 → 标准需要更严格的定义
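The step 4 consistency test, reading the "variance" bound as the spread (max minus min) across repeated runs on the 1-5 scale:

第4步的一致性测试,此处将"差异"上限理解为重复运行分数在1-5量表上的极差(最大值减最小值):

```python
def consistent(scores, max_spread=0.5):
    """Repeated runs of the same evaluation should land within 0.5 of each
    other; a wider spread means the criteria need tighter definitions."""
    return max(scores) - min(scores) < max_spread
```

For example, three runs scoring 4.0, 4.2, 4.1 pass; 3.0, 3.8, 3.6 do not.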

Position Bias Validation

位置偏见验证

Test for position bias before using pairwise comparisons:
在使用成对比较前测试位置偏见:

Position Bias Test

位置偏见测试

Run this test with IDENTICAL outputs in both positions:
Test Case: [Same output text] Position A: [Paste output] Position B: [Paste identical output]
Expected Result: TIE with high confidence (>0.9)
If Result Shows Winner:
  • Position bias detected
  • Add stronger anti-bias instructions to prompt
  • Re-test until TIE achieved consistently
在两个位置使用完全相同的输出运行此测试:
测试案例:[相同输出文本] 位置A:[粘贴输出] 位置B:[粘贴相同输出]
预期结果:平局且置信度高(>0.9)
若结果显示有获胜者:
  • 检测到位置偏见
  • 在提示词中加入更强的反偏见说明
  • 重新测试直到持续得到平局结果

Evaluation Prompt Iteration

评估提示词迭代

When calibration tests fail:
  1. Identify failure mode: Too strict? Too lenient? Inconsistent?
  2. Adjust specific rubric levels: Add examples, clarify boundaries
  3. Re-run calibration tests: All 4 tests must pass
  4. Document changes: Track what adjustments improved reliability
当校准测试失败时:
  1. 识别失败模式:过于严格?过于宽松?不一致?
  2. 调整特定评分表层级:加入示例、明确边界
  3. 重新运行校准测试:所有4项测试必须通过
  4. 记录修改:跟踪哪些调整提升了可靠性

Metric Selection Guide for LLM Evaluation

LLM评估的指标选择指南

This reference provides guidance on selecting appropriate metrics for different evaluation scenarios.
本参考文档为不同评估场景提供指标选择指导。

Metric Categories

指标类别

Classification Metrics

分类指标

Use for binary or multi-class evaluation tasks (pass/fail, correct/incorrect).
适用于二元或多类评估任务(通过/失败、正确/错误)。

Precision

精确率

Precision = True Positives / (True Positives + False Positives)
Interpretation: Of all responses the judge said were good, what fraction were actually good?
Use when: False positives are costly (e.g., approving unsafe content)
精确率 = 真阳性 / (真阳性 + 假阳性)
解读:评估者认为良好的响应中,真正良好的比例是多少?
适用场景:假阳性成本高(如批准不安全内容)

Recall

召回率

Recall = True Positives / (True Positives + False Negatives)
Interpretation: Of all actually good responses, what fraction did the judge identify?
Use when: False negatives are costly (e.g., missing good content in filtering)
召回率 = 真阳性 / (真阳性 + 假阴性)
解读:真正良好的响应中,评估者识别出的比例是多少?
适用场景:假阴性成本高(如过滤时遗漏优质内容)

F1 Score

F1分数

F1 = 2 * (Precision * Recall) / (Precision + Recall)
Interpretation: Harmonic mean of precision and recall
Use when: You need a single number balancing both concerns
F1 = 2 * (精确率 * 召回率) / (精确率 + 召回率)
解读:精确率和召回率的调和平均数
适用场景:需要平衡两者的单一汇总指标
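The three formulas above in code, with a small worked example (judge labels against ground truth):

以上三个公式的代码实现,附一个小型示例(评估者标签对照真实标签):

```python
def precision_recall_f1(judge, truth, positive="good"):
    """Precision, recall and F1 for binary judge decisions."""
    tp = sum(j == positive and t == positive for j, t in zip(judge, truth))
    fp = sum(j == positive and t != positive for j, t in zip(judge, truth))
    fn = sum(j != positive and t == positive for j, t in zip(judge, truth))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

judge = ["good", "good", "bad", "good"]   # what the judge said
truth = ["good", "bad", "bad", "good"]    # what was actually true
p, r, f = precision_recall_f1(judge, truth)  # p = 2/3, r = 1.0, f = 0.8
```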

Agreement Metrics

一致性指标

Use for comparing automated evaluation with human judgment.
用于对比自动化评估与人工判断。

Cohen's Kappa (κ)

Cohen's Kappa (κ)

κ = (Observed Agreement - Expected Agreement) / (1 - Expected Agreement)
Interpretation: Agreement adjusted for chance
  • κ > 0.8: Almost perfect agreement
  • κ 0.6-0.8: Substantial agreement
  • κ 0.4-0.6: Moderate agreement
  • κ < 0.4: Fair to poor agreement
Use for: Binary or categorical judgments
κ = (观察一致性 - 预期一致性) / (1 - 预期一致性)
解读:调整了随机一致性后的一致性
  • κ > 0.8:几乎完美一致
  • κ 0.6-0.8:高度一致
  • κ 0.4-0.6:中度一致
  • κ < 0.4:一致性一般或较差
适用场景:二元或分类判断
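Cohen's κ computed from two raters' labels; expected agreement comes from each rater's label frequencies:

根据两位评分者的标签计算Cohen's κ;预期一致性由各评分者的标签频率得出:

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters over the same categorical items."""
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # chance agreement: product of each rater's marginal label frequencies
    expected = sum((rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels)
    if expected == 1:
        return 1.0  # degenerate case: both raters use a single label
    return (observed - expected) / (1 - expected)

rater_a = ["pass", "pass", "fail", "pass"]
rater_b = ["pass", "fail", "fail", "pass"]
kappa = cohens_kappa(rater_a, rater_b)  # observed 0.75, expected 0.5 -> 0.5
```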

Weighted Kappa

加权Kappa

For ordinal scales where disagreement severity matters:
Interpretation: Penalizes large disagreements more than small ones
适用于分歧严重程度重要的有序量表:
解读:对严重分歧的惩罚大于轻微分歧

Correlation Metrics

相关性指标

Use for ordinal/continuous scores.
适用于有序/连续分数。

Spearman's Rank Correlation (ρ)

Spearman秩相关系数 (ρ)

Interpretation: Correlation between rankings, not absolute values
  • ρ > 0.9: Very strong correlation
  • ρ 0.7-0.9: Strong correlation
  • ρ 0.5-0.7: Moderate correlation
  • ρ < 0.5: Weak correlation
Use when: Order matters more than exact values
解读:排名之间的相关性,而非绝对值
  • ρ > 0.9:极强相关性
  • ρ 0.7-0.9:强相关性
  • ρ 0.5-0.7:中度相关性
  • ρ < 0.5:弱相关性
适用场景:顺序比精确值更重要
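Spearman's ρ is the Pearson correlation of the ranks. A stdlib-only sketch, using average ranks for ties:

Spearman's ρ 即秩的Pearson相关系数。以下为仅用标准库的示例,并列值采用平均秩:

```python
def ranks(values):
    """Average 1-based ranks, with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1                      # extend over the tied block
        avg = (i + j) / 2 + 1           # mean rank of the tied block
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

A perfectly monotonic pairing gives ρ = 1.0 and a reversed one gives ρ = -1.0, regardless of the absolute values.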

Kendall's Tau (τ)

Kendall's Tau (τ)

Interpretation: Similar to Spearman but based on pairwise concordance
Use when: You have many tied values
解读:与Spearman类似,但基于成对一致性
适用场景:存在大量并列值

Pearson Correlation (r)

Pearson相关系数 (r)

Interpretation: Linear correlation between scores
Use when: Exact score values matter, not just order
解读:分数之间的线性相关性
适用场景:精确分数值重要,而非仅顺序

Pairwise Comparison Metrics

成对比较指标

Agreement Rate

一致率

Agreement = (Matching Decisions) / (Total Comparisons)
Interpretation: Simple percentage of agreement
一致率 = (匹配决策数) / (总比较数)
解读:简单的一致百分比

Position Consistency

位置一致性

Consistency = (Consistent across position swaps) / (Total comparisons)
Interpretation: How often does the decision stay the same when positions are swapped?
一致性 = (位置交换后结果一致的次数) / (总比较数)
解读:交换位置后决策保持一致的频率

Selection Decision Tree

选择决策树

What type of evaluation task?
├── Binary classification (pass/fail)
│   └── Use: Precision, Recall, F1, Cohen's κ
├── Ordinal scale (1-5 rating)
│   ├── Comparing to human judgments?
│   │   └── Use: Spearman's ρ, Weighted κ
│   └── Comparing two automated judges?
│       └── Use: Kendall's τ, Spearman's ρ
├── Pairwise preference
│   └── Use: Agreement rate, Position consistency
└── Multi-label classification
    └── Use: Macro-F1, Micro-F1, Per-label metrics
评估任务类型是什么?
├── 二元分类(通过/失败)
│   └── 使用:精确率、召回率、F1分数、Cohen's κ
├── 有序量表(1-5评分)
│   ├── 与人工判断对比?
│   │   └── 使用:Spearman's ρ、加权κ
│   └── 对比两个自动化评估者?
│       └── 使用:Kendall's τ、Spearman's ρ
├── 成对偏好
│   └── 使用:一致率、位置一致性
└── 多标签分类
    └── 使用:Macro-F1、Micro-F1、单标签指标

Metric Selection by Use Case

按用例选择指标

Use Case 1: Validating Automated Evaluation

用例1:验证自动化评估

Goal: Ensure automated evaluation correlates with human judgment
Recommended Metrics:
  1. Primary: Spearman's ρ (for ordinal scales) or Cohen's κ (for categorical)
  2. Secondary: Per-criterion agreement
  3. Diagnostic: Confusion matrix for systematic errors
目标:确保自动化评估与人工判断相关
推荐指标
  1. 主要:Spearman's ρ(有序量表)或Cohen's κ(分类)
  2. 次要:单标准一致性
  3. 诊断:系统误差的混淆矩阵

Use Case 2: Comparing Two Models

用例2:对比两个模型

Goal: Determine which model produces better outputs
Recommended Metrics:
  1. Primary: Win rate (from pairwise comparison)
  2. Secondary: Position consistency (bias check)
  3. Diagnostic: Per-criterion breakdown
目标:确定哪个模型生成的输出更优
推荐指标
  1. 主要:胜率(来自成对比较)
  2. 次要:位置一致性(偏见检查)
  3. 诊断:单标准细分

Use Case 3: Quality Monitoring

用例3:质量监控

Goal: Track evaluation quality over time
Recommended Metrics:
  1. Primary: Rolling agreement with human spot-checks
  2. Secondary: Score distribution stability
  3. Diagnostic: Bias indicators (position, length)
目标:随时间跟踪评估质量
推荐指标
  1. 主要:与人工抽查的滚动一致性
  2. 次要:分数分布稳定性
  3. 诊断:偏见指标(位置、长度)

Interpreting Metric Results

解读指标结果

Good Evaluation System Indicators

良好评估系统的指标

| Metric | Good | Acceptable | Concerning |
| --- | --- | --- | --- |
| Spearman's ρ | > 0.8 | 0.6-0.8 | < 0.6 |
| Cohen's κ | > 0.7 | 0.5-0.7 | < 0.5 |
| Position consistency | > 0.9 | 0.8-0.9 | < 0.8 |
| Length correlation | < 0.2 | 0.2-0.4 | > 0.4 |

| 指标 | 良好 | 可接受 | 需关注 |
| --- | --- | --- | --- |
| Spearman's ρ | > 0.8 | 0.6-0.8 | < 0.6 |
| Cohen's κ | > 0.7 | 0.5-0.7 | < 0.5 |
| 位置一致性 | > 0.9 | 0.8-0.9 | < 0.8 |
| 长度相关性 | < 0.2 | 0.2-0.4 | > 0.4 |

Warning Signs

警告信号

  1. High agreement but low correlation: May indicate calibration issues
  2. Low position consistency: Position bias affecting results
  3. High length correlation: Length bias inflating scores
  4. Per-criterion variance: Some criteria may be poorly defined
  1. 高一致率但低相关性:可能表明校准问题
  2. 低位置一致性:位置偏见影响结果
  3. 高长度相关性:长度偏见导致分数虚高
  4. 单标准方差大:部分标准可能定义不佳

Reporting Template

报告模板


Evaluation System Metrics Report

评估系统指标报告

Human Agreement

人工一致性

  • Spearman's ρ: 0.82 (p < 0.001)
  • Cohen's κ: 0.74
  • Sample size: 500 evaluations
  • Spearman's ρ:0.82(p < 0.001)
  • Cohen's κ:0.74
  • 样本量:500次评估

Bias Indicators

偏见指标

  • Position consistency: 91%
  • Length-score correlation: 0.12
  • 位置一致性:91%
  • 长度-分数相关性:0.12

Per-Criterion Performance

单标准性能

| Criterion | Spearman's ρ | κ |
| --- | --- | --- |
| Accuracy | 0.88 | 0.79 |
| Clarity | 0.76 | 0.68 |
| Completeness | 0.81 | 0.72 |

| 标准 | Spearman's ρ | κ |
| --- | --- | --- |
| 准确性 | 0.88 | 0.79 |
| 清晰度 | 0.76 | 0.68 |
| 完整性 | 0.81 | 0.72 |

Recommendations

建议

  • All metrics within acceptable ranges
  • Monitor "Clarity" criterion - lower agreement may indicate need for rubric refinement
  • 所有指标均在可接受范围内
  • 监控"清晰度"标准——一致性较低可能表明需要优化评分表