Evaluation Methods for Claude Code Agents


Evaluation of agent systems requires different approaches than traditional software or even standard language model applications. Agents make dynamic decisions, are non-deterministic between runs, and often lack single correct answers. Effective evaluation must account for these characteristics while providing actionable feedback. A robust evaluation framework enables continuous improvement, catches regressions, and validates that context engineering choices achieve intended effects.

Core Concepts


Agent evaluation requires outcome-focused approaches that account for non-determinism and multiple valid paths. Multi-dimensional rubrics capture various quality aspects: factual accuracy, completeness, citation accuracy, source quality, and tool efficiency. LLM-as-judge provides scalable evaluation while human evaluation catches edge cases.
The key insight is that agents may find alternative paths to goals—the evaluation should judge whether they achieve right outcomes while following reasonable processes.
Performance Drivers: The 95% Finding

Research on the BrowseComp evaluation (which tests browsing agents' ability to locate hard-to-find information) found that three factors explain 95% of performance variance:

| Factor | Variance Explained | Implication |
| --- | --- | --- |
| Token usage | 80% | More tokens = better performance |
| Number of tool calls | ~10% | More exploration helps |
| Model choice | ~5% | Better models multiply efficiency |
Implications for Claude Code development:
  • Token budgets matter: Evaluate with realistic token constraints
  • Model upgrades beat token increases: Upgrading models provides larger gains than increasing token budgets
  • Multi-agent validation: The findings validate architectures that distribute work across subagents with separate context windows

Evaluation Challenges


Non-Determinism and Multiple Valid Paths


Agents may take completely different valid paths to reach goals. One agent might search three sources while another searches ten. They might use different tools to find the same answer. Traditional evaluations that check for specific steps fail in this context.
Solution: Evaluate outcomes, not exact execution paths. Judge whether the agent achieves the right result through a reasonable process.

Context-Dependent Failures


Agent failures often depend on context in subtle ways. An agent might succeed on complex queries but fail on simple ones. It might work well with one tool set but fail with another. Failures may emerge only after extended interaction when context accumulates.
Solution: Evaluation must cover a range of complexity levels and test extended interactions, not just isolated queries.

Composite Quality Dimensions


Agent quality is not a single dimension. It includes factual accuracy, completeness, coherence, tool efficiency, and process quality. An agent might score high on accuracy but low on efficiency, or vice versa.
Solution: Evaluation rubrics must capture multiple dimensions with appropriate weighting for the use case.

Evaluation Rubric Design


Multi-Dimensional Rubric


Effective rubrics cover key dimensions with descriptive levels:
Instruction Following (weight: 0.30)
  • Excellent (1.0): All instructions followed precisely
  • Good (0.8): Minor deviations that don't affect outcome
  • Acceptable (0.6): Major instructions followed, minor ones missed
  • Poor (0.3): Significant instructions ignored
  • Failed (0.0): Fundamentally misunderstood the task
Output Completeness (weight: 0.25)
  • Excellent: All requested aspects thoroughly covered
  • Good: Most aspects covered with minor gaps
  • Acceptable: Key aspects covered, some gaps
  • Poor: Major aspects missing
  • Failed: Fundamental aspects not addressed
Tool Efficiency (weight: 0.20)
  • Excellent: Optimal tool selection and minimal calls
  • Good: Good tool selection with minor inefficiencies
  • Acceptable: Appropriate tools with some redundancy
  • Poor: Wrong tools or excessive calls
  • Failed: Severe tool misuse or extremely excessive calls
Reasoning Quality (weight: 0.15)
  • Excellent: Clear, logical reasoning throughout
  • Good: Generally sound reasoning with minor gaps
  • Acceptable: Basic reasoning present
  • Poor: Reasoning unclear or flawed
  • Failed: No apparent reasoning
Response Coherence (weight: 0.10)
  • Excellent: Well-structured, easy to follow
  • Good: Generally coherent with minor issues
  • Acceptable: Understandable but could be clearer
  • Poor: Difficult to follow
  • Failed: Incoherent
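The rubric above can be encoded directly as data so that weights and level scores stay consistent across evaluations. A minimal sketch (dimension names, weights, and level scores are taken from the rubric; the structure itself is illustrative):

```python
# Illustrative encoding of the multi-dimensional rubric above:
# per-dimension weights plus the shared level-to-score ladder.
RUBRIC_WEIGHTS = {
    "instruction_following": 0.30,
    "output_completeness": 0.25,
    "tool_efficiency": 0.20,
    "reasoning_quality": 0.15,
    "response_coherence": 0.10,
}

# Numeric levels as given for Instruction Following; the other
# dimensions reuse the same Excellent..Failed ladder.
LEVEL_SCORES = {"excellent": 1.0, "good": 0.8, "acceptable": 0.6,
                "poor": 0.3, "failed": 0.0}

# Weights must sum to 1.0 so weighted overall scores stay on the 0-1 scale.
assert abs(sum(RUBRIC_WEIGHTS.values()) - 1.0) < 1e-9
```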

Scoring Approach


Convert dimension assessments to numeric scores (0.0 to 1.0) with appropriate weighting. Calculate weighted overall scores. Set passing thresholds based on use case requirements (typically 0.7 for general use, 0.85 for critical operations).
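As a sketch of this scoring approach (weights from the rubric section and thresholds from the paragraph above; the individual dimension scores are made up for illustration):

```python
# Weighted overall score as described above: each dimension scored 0.0-1.0,
# combined by rubric weight, then checked against a passing threshold.
def overall_score(dimension_scores: dict[str, float],
                  weights: dict[str, float]) -> float:
    """Weighted average of per-dimension scores (each on a 0.0-1.0 scale)."""
    return sum(dimension_scores[d] * w for d, w in weights.items())

weights = {"instruction_following": 0.30, "output_completeness": 0.25,
           "tool_efficiency": 0.20, "reasoning_quality": 0.15,
           "response_coherence": 0.10}
# Hypothetical dimension scores for one agent run:
scores = {"instruction_following": 1.0, "output_completeness": 0.8,
          "tool_efficiency": 0.6, "reasoning_quality": 0.8,
          "response_coherence": 1.0}

total = overall_score(scores, weights)  # 0.30 + 0.20 + 0.12 + 0.12 + 0.10 = 0.84
passes_general = total >= 0.70   # typical threshold for general use
passes_critical = total >= 0.85  # typical threshold for critical operations
```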

Evaluation Methodologies


LLM-as-Judge


Using an LLM to evaluate agent outputs scales to large test sets and provides consistent judgments. The key is designing effective evaluation prompts that capture the dimensions of interest.
Provide a clear task description, the agent output, ground truth (if available), and an evaluation scale with level descriptions, then request a structured judgment.
Evaluation Prompt Template:

```markdown
You are evaluating the output of a Claude Code agent.

## Original Task

{task_description}

## Agent Output

{agent_output}

## Ground Truth (if available)

{expected_output}

## Evaluation Criteria

For each criterion, assess the output and provide:
1. Score (1-5)
2. Specific evidence supporting your score
3. One improvement suggestion

## Criteria

1. Instruction Following: Did the agent follow all instructions?
2. Completeness: Are all requested aspects covered?
3. Tool Efficiency: Were appropriate tools used efficiently?
4. Reasoning Quality: Is the reasoning clear and sound?
5. Response Coherence: Is the output well-structured?

Provide your evaluation as a structured assessment with scores and justifications.
```

**Chain-of-Thought Requirement**: Always require justification before the score. Research shows this improves reliability by 15-25% compared to score-first approaches.

Human Evaluation


Human evaluation catches what automation misses:
  • Hallucinated answers on unusual queries
  • Subtle context misunderstandings
  • Edge cases that automated evaluation overlooks
  • Qualitative issues with tone or approach
For Claude Code development:
  • Review agent outputs manually for edge cases
  • Sample systematically across complexity levels
  • Track patterns in failures to inform prompt improvements

End-State Evaluation


For commands that produce artifacts (files, configurations, code), evaluate the final output rather than the process:
  • Does the generated code work?
  • Is the configuration valid?
  • Does the output meet requirements?
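A minimal sketch of such end-state checks, assuming the agent emits Python source and a JSON configuration (the sample outputs and required keys are hypothetical):

```python
import json


def code_is_syntactically_valid(source: str) -> bool:
    """Check that generated Python code at least parses."""
    try:
        compile(source, "<generated>", "exec")
        return True
    except SyntaxError:
        return False


def config_is_valid(text: str, required_keys: set[str]) -> bool:
    """Check that a generated JSON config parses and has the required keys."""
    try:
        cfg = json.loads(text)
    except json.JSONDecodeError:
        return False
    return isinstance(cfg, dict) and required_keys <= cfg.keys()


# Hypothetical agent artifacts:
assert code_is_syntactically_valid("def add(a, b):\n    return a + b\n")
assert not code_is_syntactically_valid("def add(a, b) return a + b")
assert config_is_valid('{"model": "x", "max_tokens": 1000}', {"model"})
```

Deeper checks (running the generated tests, executing the code against requirements) follow the same pattern: judge the artifact, not the steps that produced it.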

Test Set Design


Sample Selection

Start with small samples during development. Early in agent development, changes have dramatic impacts because there is abundant low-hanging fruit. Small test sets reveal large effects.
Sample from real usage patterns. Add known edge cases. Ensure coverage across complexity levels.

Complexity Stratification

Test sets should span complexity levels: simple (single tool call), medium (multiple tool calls), complex (many tool calls, significant ambiguity), and very complex (extended interaction, deep reasoning).
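The stratification above can be sketched as a small test set with a coverage check; the cases and queries here are purely illustrative:

```python
# Illustrative complexity-stratified test set for an agent evaluation.
TEST_SET = [
    {"id": "t1", "complexity": "simple", "query": "What is in config.json?"},
    {"id": "t2", "complexity": "medium", "query": "Find and fix the failing test."},
    {"id": "t3", "complexity": "complex", "query": "Refactor the auth module."},
    {"id": "t4", "complexity": "very_complex",
     "query": "Diagnose the intermittent CI failure across services."},
]


def by_complexity(cases: list[dict], level: str) -> list[dict]:
    """Select the test cases at one complexity level."""
    return [c for c in cases if c["complexity"] == level]


# Ensure every complexity level is covered before running an evaluation.
levels = {"simple", "medium", "complex", "very_complex"}
assert {c["complexity"] for c in TEST_SET} == levels
```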

Context Engineering Evaluation


Testing Prompt Variations


When iterating on Claude Code prompts, evaluate systematically:
  1. Baseline: Run current prompt on test cases
  2. Variation: Run modified prompt on same cases
  3. Compare: Measure quality scores, token usage, efficiency
  4. Analyze: Identify which changes improved which dimensions
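The four steps above can be sketched as a small harness; `fake_run_and_score` is a stand-in for actually running the agent and scoring its output with a judge:

```python
# Baseline-vs-variation comparison sketch for prompt iteration.
def evaluate_prompt(prompt: str, cases: list[str], run_and_score) -> dict:
    """Run every test case with a prompt and aggregate score and token usage."""
    results = [run_and_score(prompt, case) for case in cases]
    n = len(results)
    return {
        "mean_score": sum(r["score"] for r in results) / n,
        "mean_tokens": sum(r["tokens"] for r in results) / n,
    }


# Hypothetical scorer used for illustration only: pretends that asking
# for citations improves quality scores by a fixed amount.
def fake_run_and_score(prompt: str, case: str) -> dict:
    bonus = 0.1 if "cite sources" in prompt else 0.0
    return {"score": 0.7 + bonus, "tokens": 1200}


cases = ["case-1", "case-2", "case-3"]
baseline = evaluate_prompt("Answer the question.", cases, fake_run_and_score)
variation = evaluate_prompt("Answer the question and cite sources.",
                            cases, fake_run_and_score)
delta = variation["mean_score"] - baseline["mean_score"]  # positive = improvement
```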

Testing Context Strategies


Context engineering choices should be validated through systematic evaluation. Run agents with different context strategies on the same test set. Compare quality scores, token usage, and efficiency metrics.

Degradation Testing


Test how context degradation affects performance by running agents at different context sizes. Identify performance cliffs where context becomes problematic. Establish safe operating limits.

Advanced Evaluation: LLM-as-Judge


Key insight: LLM-as-a-Judge is not a single technique but a family of approaches, each suited to different evaluation contexts. Choosing the right approach and mitigating known biases is the core competency this skill develops.

The Evaluation Taxonomy


Evaluation approaches fall into two primary categories with distinct reliability profiles:
Direct Scoring: A single LLM rates one response on a defined scale.
  • Best for: Objective criteria (factual accuracy, instruction following, toxicity)
  • Reliability: Moderate to high for well-defined criteria
  • Failure mode: Score calibration drift, inconsistent scale interpretation
Pairwise Comparison: An LLM compares two responses and selects the better one.
  • Best for: Subjective preferences (tone, style, persuasiveness)
  • Reliability: Higher than direct scoring for preferences
  • Failure mode: Position bias, length bias
Research from the MT-Bench paper (Zheng et al., 2023) establishes that pairwise comparison achieves higher agreement with human judges than direct scoring for preference-based evaluation, while direct scoring remains appropriate for objective criteria with clear ground truth.

The Bias Landscape


LLM judges exhibit systematic biases that must be actively mitigated:
Position Bias: First-position responses receive preferential treatment in pairwise comparison. Mitigation: Evaluate twice with swapped positions, use majority vote or consistency check.
Length Bias: Longer responses are rated higher regardless of quality. Mitigation: Explicit prompting to ignore length, length-normalized scoring.
Self-Enhancement Bias: Models rate their own outputs higher. Mitigation: Use different models for generation and evaluation, or acknowledge limitation.
Verbosity Bias: Detailed explanations receive higher scores even when unnecessary. Mitigation: Criteria-specific rubrics that penalize irrelevant detail.
Authority Bias: Confident, authoritative tone rated higher regardless of accuracy. Mitigation: Require evidence citation, fact-checking layer.

Metric Selection Framework


Choose metrics based on the evaluation task structure:
| Task Type | Primary Metrics | Secondary Metrics |
| --- | --- | --- |
| Binary classification (pass/fail) | Recall, Precision, F1 | Cohen's κ |
| Ordinal scale (1-5 rating) | Spearman's ρ, Kendall's τ | Cohen's κ (weighted) |
| Pairwise preference | Agreement rate, Position consistency | Confidence calibration |
| Multi-label | Macro-F1, Micro-F1 | Per-label precision/recall |
The critical insight: High absolute agreement matters less than systematic disagreement patterns. A judge that consistently disagrees with humans on specific criteria is more problematic than one with random noise.

Evaluation Metrics Reference


Classification Metrics (Pass/Fail Tasks)


Precision: Of all responses marked as passing, what fraction truly passed?
  • Use when false positives are costly
Recall: Of all actually passing responses, what fraction did we identify?
  • Use when false negatives are costly
F1 Score: Harmonic mean of precision and recall
  • Use for balanced single-number summary

Agreement Metrics (Comparing to Human Judgment)


Cohen's Kappa: Agreement adjusted for chance
  • > 0.8: Almost perfect agreement
  • 0.6-0.8: Substantial agreement
  • 0.4-0.6: Moderate agreement
  • < 0.4: Fair to poor agreement

Correlation Metrics (Ordinal Scores)


Spearman's Rank Correlation: Correlation between rankings
  • > 0.9: Very strong correlation
  • 0.7-0.9: Strong correlation
  • 0.5-0.7: Moderate correlation
  • < 0.5: Weak correlation
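Both agreement metrics can be computed without external dependencies. A tie-free sketch (production code would more likely use `scipy.stats.spearmanr` and `sklearn.metrics.cohen_kappa_score`; the example score lists are hypothetical):

```python
def spearman_rho(xs: list[float], ys: list[float]) -> float:
    """Spearman rank correlation for tie-free score lists."""
    def ranks(vs):
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))


def cohens_kappa(a: list[int], b: list[int]) -> float:
    """Chance-adjusted agreement between two binary pass/fail raters."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    pa1, pb1 = sum(a) / n, sum(b) / n
    pe = pa1 * pb1 + (1 - pa1) * (1 - pb1)              # chance agreement
    return (po - pe) / (1 - pe)


# Hypothetical judge vs. human ratings on a 1-5 scale (last two swapped):
rho = spearman_rho([1, 2, 3, 4, 5], [1, 2, 3, 5, 4])    # 0.9: strong correlation

# Hypothetical pass/fail verdicts:
kappa = cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0])        # 0.5: moderate agreement
```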

Good Evaluation System Indicators


| Metric | Good | Acceptable | Concerning |
| --- | --- | --- | --- |
| Spearman's ρ | > 0.8 | 0.6-0.8 | < 0.6 |
| Cohen's Kappa | > 0.7 | 0.5-0.7 | < 0.5 |
| Position consistency | > 0.9 | 0.8-0.9 | < 0.8 |
| Length-score correlation | < 0.2 | 0.2-0.4 | > 0.4 |

Evaluation Approaches


Direct Scoring Implementation


Direct scoring requires three components: clear criteria, a calibrated scale, and a structured output format.

Criteria Definition Pattern:

```
Criterion: [Name]
Description: [What this criterion measures]
Weight: [Relative importance, 0-1]
```

Scale Calibration:
  • 1-3 scales: Binary with neutral option, lowest cognitive load
  • 1-5 scales: Standard Likert, good balance of granularity and reliability
  • 1-10 scales: High granularity but harder to calibrate, use only with detailed rubrics

Prompt Structure for Direct Scoring:

```markdown
You are an expert evaluator assessing response quality.

## Task

Evaluate the following response against each criterion.

## Original Prompt

{prompt}

## Response to Evaluate

{response}

## Criteria

{for each criterion: name, description, weight}

## Instructions

For each criterion:
1. Find specific evidence in the response
2. Score according to the rubric (1-{max} scale)
3. Justify your score with evidence
4. Suggest one specific improvement

## Output Format

Respond with structured JSON containing scores, justifications, and summary.
```

**Chain-of-Thought Requirement**: All scoring prompts must require justification before the score. Research shows this improves reliability by 15-25% compared to score-first approaches.

Pairwise Comparison Implementation


Pairwise comparison is inherently more reliable for preference-based evaluation but requires bias mitigation.
Position Bias Mitigation Protocol:
  1. First pass: Response A in first position, Response B in second
  2. Second pass: Response B in first position, Response A in second
  3. Consistency check: If passes disagree, return TIE with reduced confidence
  4. Final verdict: Consistent winner with averaged confidence
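The four-step protocol above can be sketched as follows; `fake_judge` is a stand-in for an LLM pairwise call returning `(winner, confidence)`:

```python
# Position-swap pairwise comparison with a consistency check.
def compare_with_swap(judge, prompt: str, resp_a: str, resp_b: str) -> dict:
    w1, c1 = judge(prompt, resp_a, resp_b)       # pass 1: A in first position
    w2_raw, c2 = judge(prompt, resp_b, resp_a)   # pass 2: B in first position
    # Map the pass-2 verdict back to the original labels.
    w2 = {"A": "B", "B": "A", "TIE": "TIE"}[w2_raw]
    if w1 == w2:
        return {"winner": w1, "confidence": (c1 + c2) / 2, "consistent": True}
    return {"winner": "TIE", "confidence": 0.5, "consistent": False}


# Hypothetical judge that always prefers the response mentioning analogies,
# whichever position it appears in.
def fake_judge(prompt, first, second):
    if "analogy" in first:
        return ("A", 0.8)
    return ("B", 0.6)


result = compare_with_swap(fake_judge, "Explain ML to a beginner",
                           "jargon-heavy answer", "analogy-based answer")
# Both passes pick the analogy-based response, so the verdict is consistent:
# winner "B" with the averaged confidence (0.6 + 0.8) / 2 = 0.7.
```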
Prompt Structure for Pairwise Comparison:

```markdown
You are an expert evaluator comparing two AI responses.

## Critical Instructions

- Do NOT prefer responses because they are longer
- Do NOT prefer responses based on position (first vs second)
- Focus ONLY on quality according to the specified criteria
- Ties are acceptable when responses are genuinely equivalent

## Original Prompt

{prompt}

## Response A

{response_a}

## Response B

{response_b}

## Comparison Criteria

{criteria list}

## Instructions

1. Analyze each response independently first
2. Compare them on each criterion
3. Determine overall winner with confidence level

## Output Format

JSON with per-criterion comparison, overall winner, confidence (0-1), and reasoning.
```

**Confidence Calibration**: Confidence scores should reflect position consistency:

- Both passes agree: confidence = average of individual confidences
- Passes disagree: confidence = 0.5, verdict = TIE

Rubric Generation


Well-defined rubrics reduce evaluation variance by 40-60% compared to open-ended scoring.

Rubric Components


  1. Level descriptions: Clear boundaries for each score level
  2. Characteristics: Observable features that define each level
  3. Examples: Representative outputs for each level (when possible)
  4. Edge cases: Guidance for ambiguous situations
  5. Scoring guidelines: General principles for consistent application

Strictness Calibration


  • Lenient: Lower bar for passing scores, appropriate for encouraging iteration
  • Balanced: Fair, typical expectations for production use
  • Strict: High standards, appropriate for safety-critical or high-stakes evaluation

Domain Adaptation


Rubrics should use domain-specific terminology:
  • Code readability rubrics mention variables, functions, and comments
  • Documentation rubrics reference clarity, accuracy, completeness
  • Analysis rubrics focus on depth, accuracy, actionability

Practical Guidance


Evaluation Pipeline Design


Production evaluation systems require multiple layers:
┌─────────────────────────────────────────────────┐
│                 Evaluation Pipeline              │
├─────────────────────────────────────────────────┤
│                                                   │
│  Input: Response + Prompt + Context               │
│           │                                       │
│           ▼                                       │
│  ┌─────────────────────┐                         │
│  │   Criteria Loader   │ ◄── Rubrics, weights    │
│  └──────────┬──────────┘                         │
│             │                                     │
│             ▼                                     │
│  ┌─────────────────────┐                         │
│  │   Primary Scorer    │ ◄── Direct or Pairwise  │
│  └──────────┬──────────┘                         │
│             │                                     │
│             ▼                                     │
│  ┌─────────────────────┐                         │
│  │   Bias Mitigation   │ ◄── Position swap, etc. │
│  └──────────┬──────────┘                         │
│             │                                     │
│             ▼                                     │
│  ┌─────────────────────┐                         │
│  │ Confidence Scoring  │ ◄── Calibration         │
│  └──────────┬──────────┘                         │
│             │                                     │
│             ▼                                     │
│  Output: Scores + Justifications + Confidence     │
│                                                   │
└─────────────────────────────────────────────────┘

Avoiding Evaluation Pitfalls


Anti-pattern: Scoring without justification
  • Problem: Scores lack grounding, difficult to debug or improve
  • Solution: Always require evidence-based justification before score
Anti-pattern: Single-pass pairwise comparison
  • Problem: Position bias corrupts results
  • Solution: Always swap positions and check consistency
Anti-pattern: Overloaded criteria
  • Problem: Criteria measuring multiple things are unreliable
  • Solution: One criterion = one measurable aspect
Anti-pattern: Missing edge case guidance
  • Problem: Evaluators handle ambiguous cases inconsistently
  • Solution: Include edge cases in rubrics with explicit guidance
Anti-pattern: Ignoring confidence calibration
  • Problem: High-confidence wrong judgments are worse than low-confidence ones
  • Solution: Calibrate confidence to position consistency and evidence strength

Decision Framework: Direct vs. Pairwise


Use this decision tree:
Is there an objective ground truth?
├── Yes → Direct Scoring
│   └── Examples: factual accuracy, instruction following, format compliance
└── No → Is it a preference or quality judgment?
    ├── Yes → Pairwise Comparison
    │   └── Examples: tone, style, persuasiveness, creativity
    └── No → Consider reference-based evaluation
        └── Examples: summarization (compare to source), translation (compare to reference)
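The decision tree above reduces to a small function; the category names are illustrative:

```python
# Decision tree from above as a function: pick an evaluation approach
# from the task's structure.
def choose_method(has_ground_truth: bool, is_preference: bool) -> str:
    if has_ground_truth:
        return "direct_scoring"        # e.g. factual accuracy, format compliance
    if is_preference:
        return "pairwise_comparison"   # e.g. tone, style, persuasiveness
    return "reference_based"           # e.g. summarization, translation


assert choose_method(True, False) == "direct_scoring"
assert choose_method(False, True) == "pairwise_comparison"
assert choose_method(False, False) == "reference_based"
```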

Scaling Evaluation


For high-volume evaluation:
  1. Panel of LLMs (PoLL): Use multiple models as judges, aggregate votes
    • Reduces individual model bias
    • More expensive but more reliable for high-stakes decisions
  2. Hierarchical evaluation: Fast cheap model for screening, expensive model for edge cases
    • Cost-effective for large volumes
    • Requires calibration of screening threshold
  3. Human-in-the-loop: Automated evaluation for clear cases, human review for low-confidence
    • Best reliability for critical applications
    • Design feedback loop to improve automated evaluation
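Vote aggregation for a PoLL setup can be as simple as a majority count; a sketch with hypothetical verdicts from three judge models:

```python
from collections import Counter


# Panel-of-LLMs (PoLL) aggregation: majority vote plus an agreement rate
# that can serve as a rough confidence signal.
def aggregate_votes(verdicts: list[str]) -> tuple[str, float]:
    """Return (majority verdict, fraction of judges that agreed with it)."""
    counts = Counter(verdicts)
    winner, n = counts.most_common(1)[0]
    return winner, n / len(verdicts)


# Hypothetical verdicts from three different judge models:
verdict, agreement = aggregate_votes(["B", "B", "A"])
# Majority verdict is "B" with 2/3 agreement; low agreement can be routed
# to human review in a human-in-the-loop setup.
```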

Examples


Example 1: Direct Scoring for Accuracy


Input:

```
Prompt: "What causes seasons on Earth?"
Response: "Seasons are caused by Earth's tilted axis. As Earth orbits the Sun,
different hemispheres receive more direct sunlight at different times of year."
Criterion: Factual Accuracy (weight: 1.0)
Scale: 1-5
```

Output:

```json
{
  "criterion": "Factual Accuracy",
  "score": 5,
  "evidence": [
    "Correctly identifies axial tilt as primary cause",
    "Correctly explains differential sunlight by hemisphere",
    "No factual errors present"
  ],
  "justification": "Response accurately explains the cause of seasons with correct scientific reasoning. Both the axial tilt and its effect on sunlight distribution are correctly described.",
  "improvement": "Could add the specific tilt angle (23.5°) for completeness."
}
```

Example 2: Pairwise Comparison with Position Swap


Input:

```
Prompt: "Explain machine learning to a beginner"
Response A: [Technical explanation with jargon]
Response B: [Simple analogy-based explanation]
Criteria: ["clarity", "accessibility"]
```

First Pass (A first):

```json
{ "winner": "B", "confidence": 0.8 }
```

Second Pass (B first):

```json
{ "winner": "A", "confidence": 0.6 }
```

(Note: the winner is reported as A because B was in the first position.)

Mapped Second Pass:

```json
{ "winner": "B", "confidence": 0.6 }
```

Final Result:

```json
{
  "winner": "B",
  "confidence": 0.7,
  "positionConsistency": {
    "consistent": true,
    "firstPassWinner": "B",
    "secondPassWinner": "B"
  }
}
```
}

Example 3: Rubric Generation


Input:

```
criterionName: "Code Readability"
criterionDescription: "How easy the code is to understand and maintain"
domain: "software engineering"
scale: "1-5"
strictness: "balanced"
```

Output (abbreviated):

```json
{
  "levels": [
    {
      "score": 1,
      "label": "Poor",
      "description": "Code is difficult to understand without significant effort",
      "characteristics": [
        "No meaningful variable or function names",
        "No comments or documentation",
        "Deeply nested or convoluted logic"
      ]
    },
    {
      "score": 3,
      "label": "Adequate",
      "description": "Code is understandable with some effort",
      "characteristics": [
        "Most variables have meaningful names",
        "Basic comments present for complex sections",
        "Logic is followable but could be cleaner"
      ]
    },
    {
      "score": 5,
      "label": "Excellent",
      "description": "Code is immediately clear and maintainable",
      "characteristics": [
        "All names are descriptive and consistent",
        "Comprehensive documentation",
        "Clean, modular structure"
      ]
    }
  ],
  "edgeCases": [
    {
      "situation": "Code is well-structured but uses domain-specific abbreviations",
      "guidance": "Score based on readability for domain experts, not general audience"
    }
  ]
}
```

Iterative Improvement Workflow


  1. Identify weakness: Use evaluation to find where agent struggles
  2. Hypothesize cause: Is it the prompt? The context? The examples?
  3. Modify prompt: Make targeted changes based on hypothesis
  4. Re-evaluate: Run same test cases with modified prompt
  5. Compare: Did the change improve the target dimension?
  6. Check regression: Did other dimensions suffer?
  7. Iterate: Repeat until quality meets threshold
  1. 识别弱点:通过评估发现Agent的薄弱环节
  2. 假设原因:是提示词问题?上下文问题?示例问题?
  3. 修改提示词:基于假设进行针对性修改
  4. 重新评估:在相同测试案例上运行修改后的提示词
  5. 对比:修改是否优化了目标维度?
  6. 检查回归:其他维度是否受到影响?
  7. 迭代:重复直到质量达到阈值
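The loop above hinges on steps 5 and 6: accept a prompt revision only if the target dimension improves and no other dimension regresses. A minimal sketch of that accept/reject decision (the `accept_revision` helper and the 0.5 regression tolerance are illustrative assumptions, not from the text):

```python
def accept_revision(baseline, candidate, target, tolerance=0.5):
    """Decide whether a prompt revision should replace the baseline.

    `baseline` and `candidate` map dimension names to scores averaged
    over the same test cases before and after the prompt change.
    """
    if candidate[target] <= baseline[target]:
        return False  # step 5: the target dimension did not improve
    # Step 6: reject if any other dimension dropped by more than `tolerance`
    regressed = [d for d in baseline
                 if d != target and baseline[d] - candidate[d] > tolerance]
    return not regressed
```

Each round of the workflow then reduces to one call: keep the revision if `accept_revision(...)` returns `True`, otherwise revise the hypothesis and try again.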

Guidelines

指南

  1. Always require justification before scores - Chain-of-thought prompting improves reliability by 15-25%
  2. Always swap positions in pairwise comparison - Single-pass comparison is corrupted by position bias
  3. Match scale granularity to rubric specificity - Don't use 1-10 without detailed level descriptions
  4. Separate objective and subjective criteria - Use direct scoring for objective, pairwise for subjective
  5. Include confidence scores - Calibrate to position consistency and evidence strength
  6. Define edge cases explicitly - Ambiguous situations cause the most evaluation variance
  7. Use domain-specific rubrics - Generic rubrics produce generic (less useful) evaluations
  8. Validate against human judgments - Automated evaluation is only valuable if it correlates with human assessment
  9. Monitor for systematic bias - Track disagreement patterns by criterion and response type
  10. Design for iteration - Evaluation systems improve with feedback loops
  1. 始终要求先给理由再给分 - 思维链提示可将评估可靠性提升15-25%
  2. 成对比较时始终交换位置 - 单次比较会受位置偏见影响
  3. 量表粒度与评分表特异性匹配 - 无详细层级描述时不要使用1-10分制
  4. 区分客观与主观标准 - 客观标准用直接评分,主观标准用成对比较
  5. 包含置信度分数 - 根据位置一致性和证据强度校准
  6. 明确定义边缘案例 - 模糊场景是评估差异的主要来源
  7. 使用领域特定评分表 - 通用评分表会产生通用(价值较低)的评估结果
  8. 与人工判断对比验证 - 只有与人工评估相关时,自动化评估才有价值
  9. 监控系统性偏见 - 按标准和响应类型跟踪分歧模式
  10. 为迭代设计评估系统 - 评估系统可通过反馈环持续优化

Example: Evaluating a Claude Code Command

示例:评估Claude Code命令

Suppose you've created a /refactor command and want to evaluate its quality:
Test Cases:
  1. Simple: Rename a variable across a single file
  2. Medium: Extract a function from existing code
  3. Complex: Refactor a class to use a new design pattern
  4. Very Complex: Restructure module dependencies
Evaluation Rubric:
  • Correctness: Does the refactored code work?
  • Completeness: Were all instances updated?
  • Style: Does it follow project conventions?
  • Efficiency: Were unnecessary changes avoided?
Evaluation Prompt:
markdown
Evaluate this refactoring output:

Original Code:
{original}

Refactored Code:
{refactored}

Request:
{user_request}

Score 1-5 on each dimension with evidence:
1. Correctness: Does the code still work correctly?
2. Completeness: Were all relevant instances updated?
3. Style: Does it follow the project's coding patterns?
4. Efficiency: Were only necessary changes made?

Provide scores with specific evidence from the code.
Iteration: If evaluation reveals the command often misses instances:
  1. Add explicit instruction: "Search the entire codebase for all occurrences"
  2. Re-evaluate with same test cases
  3. Compare completeness scores
  4. Check that correctness didn't regress
假设你创建了一个 /refactor 命令并想评估其质量:
测试案例
  1. 简单:在单个文件中重命名变量
  2. 中等:从现有代码中提取函数
  3. 复杂:重构类以使用新设计模式
  4. 极复杂:重构模块依赖关系
评估评分表
  • 正确性:重构后的代码是否可运行?
  • 完整性:所有实例是否都已更新?
  • 风格:是否遵循项目规范?
  • 效率:是否避免了不必要的修改?
评估提示词
markdown
评估此重构输出:

原始代码:
{original}

重构后代码:
{refactored}

请求:
{user_request}

根据每个维度给出1-5分并提供证据:
1. 正确性:代码是否仍能正确运行?
2. 完整性:所有相关实例是否都已更新?
3. 风格:是否遵循项目的编码模式?
4. 效率:是否仅进行了必要修改?

提供分数并引用代码中的具体证据。
迭代: 若评估发现命令经常遗漏实例:
  1. 添加明确指令:"搜索整个代码库以找到所有出现的位置"
  2. 在相同测试案例上重新评估
  3. 对比完整性分数
  4. 检查正确性是否出现回归
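A small driver that averages per-dimension rubric scores across the test cases makes the before/after comparison concrete; `judge` is a hypothetical stand-in for an LLM rubric call returning a 1-5 score per dimension:

```python
# Hypothetical sketch: score the /refactor command's four rubric dimensions
# across a set of test cases and average them, so that prompt revisions can
# be compared by their completeness and correctness averages.
DIMENSIONS = ["correctness", "completeness", "style", "efficiency"]

def evaluate_refactor(judge, test_cases):
    totals = {d: 0.0 for d in DIMENSIONS}
    for case in test_cases:
        scores = judge(case)  # one rubric evaluation per test case
        for d in DIMENSIONS:
            totals[d] += scores[d]
    return {d: totals[d] / len(test_cases) for d in DIMENSIONS}
```

Run this once before and once after adding the "search the entire codebase" instruction, then compare the `completeness` averages while confirming `correctness` held steady.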

Bias Mitigation Techniques for LLM Evaluation

LLM评估的偏见缓解技术

This reference details specific techniques for mitigating known biases in LLM-as-a-Judge systems.
本参考文档详细介绍了缓解LLM-as-Judge系统中已知偏见的具体技术。

Position Bias

位置偏见

The Problem

问题

In pairwise comparison, LLMs systematically prefer responses in certain positions. Research shows:
  • GPT has mild first-position bias (~55% preference for first position in ties)
  • Claude shows similar patterns
  • Smaller models often show stronger bias
在成对比较中,LLM会系统性偏好特定位置的响应。研究表明:
  • GPT存在轻微的第一位置偏见(平局场景下约55%偏好第一位置)
  • Claude表现出类似模式
  • 小型模型通常偏见更明显

Mitigation: Position Swapping Protocol

缓解方案:位置交换流程

python
async def position_swap_comparison(response_a, response_b, prompt, criteria):
    # Pass 1: Original order
    result_ab = await compare(response_a, response_b, prompt, criteria)
    
    # Pass 2: Swapped order
    result_ba = await compare(response_b, response_a, prompt, criteria)
    
    # Map second result (A in second position → B in first)
    result_ba_mapped = {
        'winner': {'A': 'B', 'B': 'A', 'TIE': 'TIE'}[result_ba['winner']],
        'confidence': result_ba['confidence']
    }
    
    # Consistency check
    if result_ab['winner'] == result_ba_mapped['winner']:
        return {
            'winner': result_ab['winner'],
            'confidence': (result_ab['confidence'] + result_ba_mapped['confidence']) / 2,
            'position_consistent': True
        }
    else:
        # Disagreement indicates position bias was a factor
        return {
            'winner': 'TIE',
            'confidence': 0.5,
            'position_consistent': False,
            'bias_detected': True
        }
python
async def position_swap_comparison(response_a, response_b, prompt, criteria):
    # 第一轮:原始顺序
    result_ab = await compare(response_a, response_b, prompt, criteria)
    
    # 第二轮:交换顺序
    result_ba = await compare(response_b, response_a, prompt, criteria)
    
    # 映射第二轮结果(A在第二位置 → B在第一位置)
    result_ba_mapped = {
        'winner': {'A': 'B', 'B': 'A', 'TIE': 'TIE'}[result_ba['winner']],
        'confidence': result_ba['confidence']
    }
    
    # 一致性检查
    if result_ab['winner'] == result_ba_mapped['winner']:
        return {
            'winner': result_ab['winner'],
            'confidence': (result_ab['confidence'] + result_ba_mapped['confidence']) / 2,
            'position_consistent': True
        }
    else:
        # 分歧表明位置偏见是影响因素
        return {
            'winner': 'TIE',
            'confidence': 0.5,
            'position_consistent': False,
            'bias_detected': True
        }

Alternative: Multiple Shuffles

替代方案:多次打乱顺序

For higher reliability, use multiple position orderings:
python
async def multi_shuffle_comparison(response_a, response_b, prompt, criteria, n_shuffles=3):
    results = []
    for i in range(n_shuffles):
        if i % 2 == 0:
            r = await compare(response_a, response_b, prompt, criteria)
        else:
            r = await compare(response_b, response_a, prompt, criteria)
            r['winner'] = {'A': 'B', 'B': 'A', 'TIE': 'TIE'}[r['winner']]
        results.append(r)
    
    # Majority vote
    winners = [r['winner'] for r in results]
    final_winner = max(set(winners), key=winners.count)
    agreement = winners.count(final_winner) / len(winners)
    
    return {
        'winner': final_winner,
        'confidence': agreement,
        'n_shuffles': n_shuffles
    }
为提升可靠性,可使用多种位置顺序:
python
async def multi_shuffle_comparison(response_a, response_b, prompt, criteria, n_shuffles=3):
    results = []
    for i in range(n_shuffles):
        if i % 2 == 0:
            r = await compare(response_a, response_b, prompt, criteria)
        else:
            r = await compare(response_b, response_a, prompt, criteria)
            r['winner'] = {'A': 'B', 'B': 'A', 'TIE': 'TIE'}[r['winner']]
        results.append(r)
    
    # 多数投票
    winners = [r['winner'] for r in results]
    final_winner = max(set(winners), key=winners.count)
    agreement = winners.count(final_winner) / len(winners)
    
    return {
        'winner': final_winner,
        'confidence': agreement,
        'n_shuffles': n_shuffles
    }

Length Bias

长度偏见

The Problem

问题

LLMs tend to rate longer responses higher, regardless of quality. This manifests as:
  • Verbose responses receiving inflated scores
  • Concise but complete responses penalized
  • Padding and repetition being rewarded
LLM倾向于给更长的响应更高分数,无论质量如何。表现为:
  • 冗长响应获得虚高分数
  • 简洁但完整的响应被惩罚
  • 填充内容和重复内容被奖励

Mitigation: Explicit Prompting

缓解方案:明确提示

Include anti-length-bias instructions in the prompt:
CRITICAL EVALUATION GUIDELINES:
- Do NOT prefer responses because they are longer
- Concise, complete answers are as valuable as detailed ones
- Penalize unnecessary verbosity or repetition
- Focus on information density, not word count
在提示词中加入反长度偏见说明:
关键评估指南:
- 不要因为响应更长而偏好它
- 简洁完整的答案与详细答案具有同等价值
- 惩罚不必要的冗长或重复内容
- 关注信息密度而非字数

Mitigation: Length-Normalized Scoring

缓解方案:长度归一化评分

python
def length_normalized_score(score, response_length, target_length=500):
    """Adjust score based on response length."""
    length_ratio = response_length / target_length
    
    if length_ratio > 2.0:
        # Penalize excessively long responses
        penalty = (length_ratio - 2.0) * 0.1
        return max(score - penalty, 1)
    elif length_ratio < 0.3:
        # Penalize excessively short responses
        penalty = (0.3 - length_ratio) * 0.5
        return max(score - penalty, 1)
    else:
        return score
python
def length_normalized_score(score, response_length, target_length=500):
    """根据响应长度调整分数。"""
    length_ratio = response_length / target_length
    
    if length_ratio > 2.0:
        # 惩罚过长响应
        penalty = (length_ratio - 2.0) * 0.1
        return max(score - penalty, 1)
    elif length_ratio < 0.3:
        # 惩罚过短响应
        penalty = (0.3 - length_ratio) * 0.5
        return max(score - penalty, 1)
    else:
        return score

Mitigation: Separate Length Criterion

缓解方案:单独的长度标准

Make length a separate, explicit criterion so it's not implicitly rewarded:
python
criteria = [
    {"name": "Accuracy", "description": "Factual correctness", "weight": 0.4},
    {"name": "Completeness", "description": "Covers key points", "weight": 0.3},
    {"name": "Conciseness", "description": "No unnecessary content", "weight": 0.3}  # Explicit
]
将长度设为单独的明确标准,避免其被隐性奖励:
python
criteria = [
    {"name": "准确性", "description": "事实正确性", "weight": 0.4},
    {"name": "完整性", "description": "覆盖核心要点", "weight": 0.3},
    {"name": "简洁性", "description": "无不必要内容", "weight": 0.3}  # 明确标准
]

Self-Enhancement Bias

自我增强偏见

The Problem

问题

Models rate outputs generated by themselves (or similar models) higher than outputs from different models.
模型对自身(或相似模型)生成的输出评分高于其他模型的输出。

Mitigation: Cross-Model Evaluation

缓解方案:跨模型评估

Use a different model family for evaluation than generation:
python
def get_evaluator_model(generator_model):
    """Select evaluator to avoid self-enhancement bias."""
    if 'gpt' in generator_model.lower():
        return 'claude-4-5-sonnet'
    elif 'claude' in generator_model.lower():
        return 'gpt-5.2'
    else:
        return 'gpt-5.2'  # Default
使用与生成模型不同的模型家族进行评估:
python
def get_evaluator_model(generator_model):
    """选择评估模型以避免自我增强偏见。"""
    if 'gpt' in generator_model.lower():
        return 'claude-4-5-sonnet'
    elif 'claude' in generator_model.lower():
        return 'gpt-5.2'
    else:
        return 'gpt-5.2'  # 默认

Mitigation: Blind Evaluation

缓解方案:盲态评估

Remove model attribution from responses before evaluation:
python
def anonymize_response(response, model_name):
    """Remove model-identifying patterns."""
    patterns = [
        f"As {model_name}",
        "I am an AI",
        "I don't have personal opinions",
        # Model-specific patterns
    ]
    anonymized = response
    for pattern in patterns:
        anonymized = anonymized.replace(pattern, "[REDACTED]")
    return anonymized
在评估前移除响应中的模型标识:
python
def anonymize_response(response, model_name):
    """移除模型识别模式。"""
    patterns = [
        f"As {model_name}",
        "I am an AI",
        "I don't have personal opinions",
        # 模型特定模式
    ]
    anonymized = response
    for pattern in patterns:
        anonymized = anonymized.replace(pattern, "[REDACTED]")
    return anonymized

Verbosity Bias

冗余偏见

The Problem

问题

Detailed explanations receive higher scores even when the extra detail is irrelevant or incorrect.
详细解释会获得更高分数,即使额外细节无关或错误。

Mitigation: Relevance-Weighted Scoring

缓解方案:相关性加权评分

python
async def relevance_weighted_evaluation(response, prompt, criteria):
    # First, assess relevance of each segment
    relevance_scores = await assess_relevance(response, prompt)
    
    # Weight evaluation by relevance
    segments = split_into_segments(response)
    weighted_scores = []
    for segment, relevance in zip(segments, relevance_scores):
        if relevance > 0.5:  # Only count relevant segments
            score = await evaluate_segment(segment, prompt, criteria)
            weighted_scores.append(score * relevance)
    
    # Guard against the case where no segment clears the relevance cutoff
    return sum(weighted_scores) / len(weighted_scores) if weighted_scores else 0.0
python
async def relevance_weighted_evaluation(response, prompt, criteria):
    # 首先评估每个片段的相关性
    relevance_scores = await assess_relevance(response, prompt)
    
    # 按相关性加权评估
    segments = split_into_segments(response)
    weighted_scores = []
    for segment, relevance in zip(segments, relevance_scores):
        if relevance > 0.5:  # 仅计算相关片段
            score = await evaluate_segment(segment, prompt, criteria)
            weighted_scores.append(score * relevance)
    
    # 防止没有片段达到相关性阈值时发生除零
    return sum(weighted_scores) / len(weighted_scores) if weighted_scores else 0.0

Mitigation: Rubric with Verbosity Penalty

缓解方案:带冗余惩罚的评分表

Include explicit verbosity penalties in rubrics:
python
rubric_levels = [
    {
        "score": 5,
        "description": "Complete and concise. All necessary information, nothing extraneous.",
        "characteristics": ["Every sentence adds value", "No repetition", "Appropriately scoped"]
    },
    {
        "score": 3,
        "description": "Complete but verbose. Contains unnecessary detail or repetition.",
        "characteristics": ["Main points covered", "Some tangents", "Could be more concise"]
    },
    # ... etc
]
在评分表中加入明确的冗余惩罚:
python
rubric_levels = [
    {
        "score": 5,
        "description": "完整且简洁。包含所有必要信息,无多余内容。",
        "characteristics": ["每句话都有价值", "无重复", "范围恰当"]
    },
    {
        "score": 3,
        "description": "完整但冗长。包含不必要的细节或重复内容。",
        "characteristics": ["覆盖核心要点", "存在一些偏离主题的内容", "可更简洁"]
    },
    # ... 其他层级
]

Authority Bias

权威偏见

The Problem

问题

Confident, authoritative tone is rated higher regardless of accuracy.
自信、权威的语气会获得更高分数,无论准确性如何。

Mitigation: Evidence Requirement

缓解方案:证据要求

Require explicit evidence for claims:
For each claim in the response:
1. Identify whether it's a factual claim
2. Note if evidence or sources are provided
3. Score based on verifiability, not confidence

IMPORTANT: Confident claims without evidence should NOT receive higher scores than 
hedged claims with evidence.
要求为主张提供明确证据:
针对响应中的每个主张:
1. 识别其是否为事实主张
2. 记录是否提供了证据或来源
3. 基于可验证性而非自信程度评分

重要提示:无证据的自信主张不应比有证据的谨慎主张获得更高分数。

Mitigation: Fact-Checking Layer

缓解方案:事实核查环节

Add a fact-checking step before scoring:
python
async def fact_checked_evaluation(response, prompt, criteria):
    # Extract claims
    claims = await extract_claims(response)
    
    # Fact-check each claim
    fact_check_results = await asyncio.gather(*[
        verify_claim(claim) for claim in claims
    ])
    
    # Adjust score based on fact-check results
    accuracy_factor = (sum(r['verified'] for r in fact_check_results) / len(fact_check_results)
                       if fact_check_results else 1.0)  # no claims: no penalty
    
    base_score = await evaluate(response, prompt, criteria)
    return base_score * (0.7 + 0.3 * accuracy_factor)  # At least 70% of score
在评分前加入事实核查步骤:
python
async def fact_checked_evaluation(response, prompt, criteria):
    # 提取主张
    claims = await extract_claims(response)
    
    # 事实核查每个主张
    fact_check_results = await asyncio.gather(*[
        verify_claim(claim) for claim in claims
    ])
    
    # 根据事实核查结果调整分数
    accuracy_factor = (sum(r['verified'] for r in fact_check_results) / len(fact_check_results)
                       if fact_check_results else 1.0)  # 无主张时不惩罚
    
    base_score = await evaluate(response, prompt, criteria)
    return base_score * (0.7 + 0.3 * accuracy_factor)  # 分数至少为基础分的70%

Aggregate Bias Detection

聚合偏见检测

Monitor for systematic biases in production:
python
class BiasMonitor:
    def __init__(self):
        self.evaluations = []
    
    def record(self, evaluation):
        self.evaluations.append(evaluation)
    
    def detect_position_bias(self):
        """Detect if first position wins more often than expected."""
        first_wins = sum(1 for e in self.evaluations if e['first_position_winner'])
        expected = len(self.evaluations) * 0.5
        z_score = (first_wins - expected) / (expected * 0.5) ** 0.5
        return {'bias_detected': abs(z_score) > 2, 'z_score': z_score}
    
    def detect_length_bias(self):
        """Detect if longer responses score higher."""
        from scipy.stats import spearmanr
        lengths = [e['response_length'] for e in self.evaluations]
        scores = [e['score'] for e in self.evaluations]
        corr, p_value = spearmanr(lengths, scores)
        return {'bias_detected': corr > 0.3 and p_value < 0.05, 'correlation': corr}
在生产环境中监控系统性偏见:
python
class BiasMonitor:
    def __init__(self):
        self.evaluations = []
    
    def record(self, evaluation):
        self.evaluations.append(evaluation)
    
    def detect_position_bias(self):
        """检测第一位置是否比预期更易获胜。"""
        first_wins = sum(1 for e in self.evaluations if e['first_position_winner'])
        expected = len(self.evaluations) * 0.5
        z_score = (first_wins - expected) / (expected * 0.5) ** 0.5
        return {'bias_detected': abs(z_score) > 2, 'z_score': z_score}
    
    def detect_length_bias(self):
        """检测更长响应是否获得更高分数。"""
        from scipy.stats import spearmanr
        lengths = [e['response_length'] for e in self.evaluations]
        scores = [e['score'] for e in self.evaluations]
        corr, p_value = spearmanr(lengths, scores)
        return {'bias_detected': corr > 0.3 and p_value < 0.05, 'correlation': corr}

Summary Table

汇总表

| Bias | Primary Mitigation | Secondary Mitigation | Detection Method |
| --- | --- | --- | --- |
| Position | Position swapping | Multiple shuffles | Consistency check |
| Length | Explicit prompting | Length normalization | Length-score correlation |
| Self-enhancement | Cross-model evaluation | Anonymization | Model comparison study |
| Verbosity | Relevance weighting | Rubric penalties | Relevance scoring |
| Authority | Evidence requirement | Fact-checking layer | Confidence-accuracy correlation |
| 偏见 | 主要缓解方案 | 次要缓解方案 | 检测方法 |
| --- | --- | --- | --- |
| 位置偏见 | 位置交换 | 多次打乱顺序 | 一致性检查 |
| 长度偏见 | 明确提示 | 长度归一化 | 长度-分数相关性 |
| 自我增强偏见 | 跨模型评估 | 匿名化 | 模型对比研究 |
| 冗余偏见 | 相关性加权 | 评分表惩罚 | 相关性评分 |
| 权威偏见 | 证据要求 | 事实核查环节 | 置信度-准确性相关性 |

LLM-as-Judge Implementation Patterns for Claude Code

Claude Code的LLM-as-Judge实现模式

This reference provides practical prompt patterns and workflows for evaluating Claude Code commands, skills, and agents during development.
本参考文档提供了开发过程中评估Claude Code命令、技能和Agent的实用提示词模式与流程。

Pattern 1: Structured Evaluation Workflow

模式1:结构化评估流程

The most reliable evaluation follows a structured workflow that separates concerns:
Define Criteria → Gather Test Cases → Run Evaluation → Mitigate Bias → Interpret Results
最可靠的评估遵循分离关注点的结构化流程:
定义标准 → 收集测试案例 → 运行评估 → 缓解偏见 → 解读结果

Step 1: Define Evaluation Criteria

步骤1:定义评估标准

Before evaluating, establish clear criteria. Document them in a reusable format:
评估前先建立明确标准,以可复用格式记录:

Evaluation Criteria for [Command/Skill Name]

[命令/技能名称]的评估标准

Criterion 1: Instruction Following (weight: 0.30)

标准1:指令遵循度(权重:0.30)

  • Description: Does the output follow all explicit instructions?
  • 1 (Poor): Ignores or misunderstands core instructions
  • 3 (Adequate): Follows main instructions, misses some details
  • 5 (Excellent): Follows all instructions precisely
  • 描述:输出是否遵循所有明确指令?
  • 1分(差):忽略或误解核心指令
  • 3分(合格):遵循主要指令,遗漏部分细节
  • 5分(优秀):精确遵循所有指令

Criterion 2: Output Completeness (weight: 0.25)

标准2:输出完整性(权重:0.25)

  • Description: Are all requested aspects covered?
  • 1 (Poor): Major aspects missing
  • 3 (Adequate): Core aspects covered with gaps
  • 5 (Excellent): All aspects thoroughly addressed
  • 描述:是否覆盖所有请求内容?
  • 1分(差):缺失主要内容
  • 3分(合格):覆盖核心内容但存在缺口
  • 5分(优秀):全面覆盖所有内容

Criterion 3: Tool Efficiency (weight: 0.20)

标准3:工具效率(权重:0.20)

  • Description: Were appropriate tools used efficiently?
  • 1 (Poor): Wrong tools or excessive redundant calls
  • 3 (Adequate): Appropriate tools with some redundancy
  • 5 (Excellent): Optimal tool selection, minimal calls
  • 描述:是否高效使用了合适的工具?
  • 1分(差):使用错误工具或过度冗余调用
  • 3分(合格):使用合适工具但存在一些冗余
  • 5分(优秀):工具选择最优,调用次数最少

Criterion 4: Reasoning Quality (weight: 0.15)

标准4:推理质量(权重:0.15)

  • Description: Is the reasoning clear and sound?
  • 1 (Poor): No apparent reasoning or flawed logic
  • 3 (Adequate): Basic reasoning present
  • 5 (Excellent): Clear, logical reasoning throughout
  • 描述:推理是否清晰合理?
  • 1分(差):无明显推理或逻辑缺陷
  • 3分(合格):具备基础推理能力
  • 5分(优秀):全程推理清晰、逻辑严谨

Criterion 5: Response Coherence (weight: 0.10)

标准5:响应连贯性(权重:0.10)

  • Description: Is the output well-structured and clear?
  • 1 (Poor): Difficult to follow or incoherent
  • 3 (Adequate): Understandable but could be clearer
  • 5 (Excellent): Well-structured, easy to follow
  • 描述:输出结构是否清晰易懂?
  • 1分(差):难以理解或逻辑混乱
  • 3分(合格):可理解但可更清晰
  • 5分(优秀):结构清晰,易于理解

Step 2: Create Test Cases

步骤2:创建测试案例

Structure test cases by complexity level:
按复杂度层级构建测试案例:

Test Cases for /refactor Command

/refactor命令的测试案例

Simple (Single Operation)

简单(单次操作)

  • Input: Rename variable x to count in a single file
  • Expected: All instances renamed, code still runs
  • Complexity: Low
  • 输入:在单个文件中将变量 x 重命名为 count
  • 预期:所有实例均已重命名,代码仍可运行
  • 复杂度:低

Medium (Multiple Operations)

中等(多次操作)

  • Input: Extract function from 20-line code block
  • Expected: New function created, original call site updated, behavior preserved
  • Complexity: Medium
  • 输入:从20行代码块中提取函数
  • 预期:创建新函数,更新原始调用位置,保留原有行为
  • 复杂度:中

Complex (Cross-File Changes)

复杂(跨文件修改)

  • Input: Refactor class to use Strategy pattern
  • Expected: Interface created, implementations separated, all usages updated
  • Complexity: High
  • 输入:重构类以使用策略模式
  • 预期:创建接口,分离实现,更新所有调用位置
  • 复杂度:高

Edge Case

边缘案例

  • Input: Refactor code with conflicting variable names in nested scopes
  • Expected: Correct scoping preserved, no accidental shadowing
  • Complexity: Edge case
  • 输入:重构嵌套作用域中存在冲突变量名的代码
  • 预期:保留正确作用域,无意外遮蔽
  • 复杂度:边缘案例

Step 3: Run Direct Scoring Evaluation

步骤3:运行直接评分评估

Use this prompt template to evaluate a single output:
markdown
You are evaluating the output of a Claude Code command.
使用以下提示词模板评估单个输出:
markdown
你正在评估Claude Code命令的输出。

Original Task

原始任务

{paste the user's original request}
{粘贴用户的原始请求}

Command Output

命令输出

{paste the full command output including tool calls}
{粘贴完整命令输出,包括工具调用}

Evaluation Criteria

评估标准

{paste your criteria definitions from Step 1}
{粘贴步骤1中定义的标准}

Instructions

说明

For each criterion:
  1. Find specific evidence in the output that supports your assessment
  2. Assign a score (1-5) based on the rubric levels
  3. Write a 1-2 sentence justification citing the evidence
  4. Suggest one specific improvement
IMPORTANT: Provide your justification BEFORE stating the score. This improves evaluation reliability.
针对每个标准:
  1. 在输出中找到支持评估的具体证据
  2. 根据评分表层级给出1-5分
  3. 撰写1-2句引用证据的理由
  4. 提出一条具体优化建议
重要提示:先提供理由再给出分数。这能提升评估可靠性。

Output Format

输出格式

For each criterion, respond with:
针对每个标准,响应格式如下:

[Criterion Name]

[标准名称]

Evidence: [Quote or describe specific parts of the output] Justification: [Explain how the evidence maps to the rubric level] Score: [1-5] Improvement: [One actionable suggestion]
证据:[引用或描述输出中的具体部分] 理由:[解释证据如何对应评分表层级] 分数:[1-5] 优化建议:[一条可落地的建议]

Overall Assessment

整体评估

Weighted Score: [Calculate: sum of (score × weight)] Pass/Fail: [Pass if weighted score ≥ 3.5] Summary: [2-3 sentences summarizing strengths and weaknesses]
加权分数:[计算:(分数 × 权重)之和] 通过/失败:[加权分数 ≥ 3.5则通过] 总结:[2-3句话总结优势与不足]
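The weighted-score and pass/fail arithmetic from the overall assessment can be sketched directly; the function name and the weight-sum check are illustrative assumptions:

```python
def weighted_assessment(criterion_scores, weights, pass_threshold=3.5):
    """Compute the weighted score and verdict described above.

    `criterion_scores` maps criterion name -> 1-5 score; `weights` maps
    the same names -> weights that should sum to 1.0.
    """
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    weighted = sum(criterion_scores[name] * w for name, w in weights.items())
    return {"weighted_score": round(weighted, 2),
            "verdict": "PASS" if weighted >= pass_threshold else "FAIL"}
```

With the Step 1 weights (0.30/0.25/0.20/0.15/0.10) and scores 4, 3, 5, 4, 4, this yields 3.95 and a PASS verdict.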

Step 4: Mitigate Position Bias in Comparisons

步骤4:缓解成对比较中的位置偏见

When comparing two prompt variants (A vs B), use this two-pass workflow:
Pass 1 (A First):
markdown
You are comparing two outputs from different prompt variants.
对比两个提示词变体(A vs B)时,使用以下两轮流程:
第一轮(A在前)
markdown
你正在比较来自不同提示词变体的两个输出。

Original Task

原始任务

{task description}
{任务描述}

Output A (First Variant)

输出A(第一变体)

{output from prompt variant A}
{提示词变体A的输出}

Output B (Second Variant)

输出B(第二变体)

{output from prompt variant B}
{提示词变体B的输出}

Comparison Criteria

对比标准

  • Instruction Following
  • Output Completeness
  • Reasoning Quality
  • 指令遵循度
  • 输出完整性
  • 推理质量

Critical Instructions

关键说明

  • Do NOT prefer outputs because they are longer
  • Do NOT prefer outputs based on their position (first vs second)
  • Focus ONLY on quality differences
  • TIE is acceptable when outputs are equivalent
  • 不要因为输出更长而偏好它
  • 不要根据位置(第一/第二)偏好输出
  • 仅关注质量差异
  • 当输出相当时,平局是可接受的

Analysis Process

分析流程

  1. Analyze Output A independently: [strengths, weaknesses]
  2. Analyze Output B independently: [strengths, weaknesses]
  3. Compare on each criterion
  4. Determine winner with confidence (0-1)
  1. 独立分析输出A:[优势、不足]
  2. 独立分析输出B:[优势、不足]
  3. 在每个标准上对比两者
  4. 确定获胜者及置信度(0-1)

Output

输出

Reasoning: [Explain why] Winner: [A/B/TIE] Confidence: [0.0-1.0]

**Pass 2 (B First):**
Repeat the same prompt but swap the order—put Output B first and Output A second.

**Interpret Results:**
- If both passes agree → Winner confirmed, average the confidences
- If passes disagree → Result is TIE with confidence 0.5 (position bias detected)
理由:[解释原因] 获胜者:[A/B/平局] 置信度:[0.0-1.0]

**第二轮(B在前)**:
重复相同提示词,但交换顺序——将输出B放在前面,输出A放在后面。

**解读结果**:
- 若两轮结果一致 → 确认获胜者,取置信度平均值
- 若两轮结果不一致 → 结果为平局,置信度0.5(检测到位置偏见)

Pattern 2: Hierarchical Evaluation Workflow

模式2:分层评估流程

For complex evaluations, use a hierarchical approach:
Quick Screen (cheap model) → Detailed Evaluation (expensive model) → Human Review (edge cases)
针对复杂评估,使用分层方法:
快速筛选(低成本模型) → 详细评估(高成本模型) → 人工审查(边缘案例)

Tier 1: Quick Screen (Use Haiku)

第一层:快速筛选(使用Haiku)

markdown
Rate this command output 0-10 for basic adequacy.

Task: {brief task description}
Output: {command output}

Quick assessment: Does this output reasonably address the task?
Score (0-10):
One-line reasoning:
Decision rule: Score < 5 → Fail, Score ≥ 7 → Pass, 5 ≤ Score < 7 → Escalate to detailed evaluation
markdown
为该命令输出的基本充足性评分(0-10分)。

任务:{简要任务描述}
输出:{命令输出}

快速评估:该输出是否合理完成了任务?
分数(0-10):
一句话理由:
决策规则:分数 < 5 → 失败,分数 ≥ 7 → 通过,5 ≤ 分数 < 7 → 升级至详细评估
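The decision rule above can be sketched as a small router; `quick_screen` and `detailed_eval` are hypothetical wrappers around the cheap and expensive model calls:

```python
def tiered_evaluate(output, task, quick_screen, detailed_eval):
    """Route an output through the hierarchical evaluation tiers."""
    score = quick_screen(task, output)  # Tier 1: 0-10 adequacy score
    if score < 5:
        return {"result": "FAIL", "tier": 1}
    if score >= 7:
        return {"result": "PASS", "tier": 1}
    # Borderline (5 <= score < 7): escalate to the detailed evaluator
    return {"result": detailed_eval(task, output), "tier": 2}
```

Results with low confidence from Tier 2 would then be queued for the Tier 3 human review described below.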

Tier 2: Detailed Evaluation (Use Opus)

第二层:详细评估(使用Opus)

Use the full direct scoring prompt from Pattern 1 for borderline cases.
对临界(borderline)案例使用模式1中的完整直接评分提示词。

Tier 3: Human Review

第三层:人工审查

For low-confidence automated evaluations (confidence < 0.6), queue for manual review:
对低置信度自动化评估(置信度 < 0.6),提交人工审查:

Human Review Request

人工审查请求

Automated Score: 3.2/5 (Confidence: 0.45) Reason for Escalation: Low confidence, evaluator disagreed across passes
自动化分数:3.2/5(置信度:0.45) 升级原因:置信度低,评估者两轮结果不一致

What to Review

审查内容

  1. Does the output actually complete the task?
  2. Are the automated criterion scores reasonable?
  3. What did the automation miss?
  1. 输出是否实际完成了任务?
  2. 自动化标准分数是否合理?
  3. 自动化评估遗漏了什么?

Original Task

原始任务

{task}
{任务}

Output

输出

{output}
{输出}

Automated Assessment

自动化评估

{paste automated evaluation}
{粘贴自动化评估结果}

Human Override

人工覆盖

[ ] Agree with automation [ ] Override to PASS - Reason: ___ [ ] Override to FAIL - Reason: ___
[ ] 同意自动化评估结果 [ ] 覆盖为通过 - 理由:___ [ ] 覆盖为失败 - 理由:___

Pattern 3: Panel of LLM Judges (PoLL)

模式3:LLM评估专家组(PoLL)

For high-stakes evaluation, use multiple models:
针对高风险评估,使用多个模型:

Workflow

流程

  1. Run 3 independent evaluations with different prompt framings:
    • Evaluation 1: Standard criteria prompt
    • Evaluation 2: Adversarial framing ("Find problems with this output")
    • Evaluation 3: User perspective ("Would a developer be satisfied?")
  2. Aggregate results:
    • Take median score per criterion (robust to outliers)
    • Flag criteria with high variance (std > 1.0) for review
    • Overall pass requires majority agreement
  1. 运行3次独立评估,使用不同提示词框架:
    • 评估1:标准评估提示词
    • 评估2:对抗性框架("找出该输出的问题")
    • 评估3:用户视角("开发者会对该结果满意吗?")
  2. 汇总结果
    • 取每个标准的中位数分数(对异常值鲁棒)
    • 标记高方差标准(标准差 > 1.0)以进行审查
    • 整体通过需多数同意
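The aggregation step (median per criterion, variance flagging, majority vote) can be sketched as follows. The per-judge pass rule (mean score ≥ 3.5) is an assumption borrowed from the direct-scoring threshold elsewhere in this document, and criteria whose standard deviation reaches 1.0 are flagged, matching the agreement table later in this pattern:

```python
from statistics import median, stdev
from collections import Counter

def aggregate_poll(judge_results, variance_flag=1.0):
    """Aggregate a panel of judge scores.

    `judge_results` is a list of {criterion: score} dicts, one per judge.
    """
    criteria = judge_results[0].keys()
    medians = {c: median(r[c] for r in judge_results) for c in criteria}
    # Flag criteria where judges disagree strongly (std dev >= threshold)
    flagged = [c for c in criteria
               if stdev(r[c] for r in judge_results) >= variance_flag]
    # Majority vote on overall pass (per-judge mean >= 3.5, an assumed rule)
    passes = [sum(r.values()) / len(r) >= 3.5 for r in judge_results]
    verdict = Counter(passes).most_common(1)[0][0]
    return {"medians": medians, "high_variance": flagged, "pass": verdict}
```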

Multi-Judge Prompt Variants

多评估者提示词变体

Standard Framing:
markdown
Evaluate this output against the specified criteria. Be fair and balanced.
Adversarial Framing:
markdown
Your role is to find problems with this output. Be critical and thorough.
Look for: factual errors, missing requirements, inefficiencies, unclear explanations.
User Perspective:
markdown
Imagine you're a developer who requested this task.
Would you be satisfied with this result? Would you need to redo any work?
标准框架
markdown
根据指定标准评估该输出。保持公平平衡。
对抗性框架
markdown
你的角色是找出该输出的问题。保持批判性和全面性。
寻找:事实错误、缺失需求、低效、模糊解释。
用户视角
markdown
假设你是请求该任务的开发者。
你会对该结果满意吗?你需要重做任何工作吗?

Agreement Analysis

一致性分析

After running all judges, check consistency:
| Criterion | Judge 1 | Judge 2 | Judge 3 | Median | Std Dev |
| --- | --- | --- | --- | --- | --- |
| Instruction Following | 4 | 4 | 5 | 4 | 0.58 |
| Completeness | 3 | 4 | 3 | 3 | 0.58 |
| Tool Efficiency | 2 | 3 | 4 | 3 | 1.00 ⚠️ |
⚠️ High variance on Tool Efficiency suggests the criterion needs clearer definition or the output has ambiguous efficiency characteristics.
所有评估者完成评估后,检查一致性:
| 标准 | 评估者1 | 评估者2 | 评估者3 | 中位数 | 标准差 |
| --- | --- | --- | --- | --- | --- |
| 指令遵循度 | 4 | 4 | 5 | 4 | 0.58 |
| 完整性 | 3 | 4 | 3 | 3 | 0.58 |
| 工具效率 | 2 | 3 | 4 | 3 | 1.00 ⚠️ |
⚠️ 工具效率上的高方差表明,该标准需要更清晰的定义,或输出的效率特征存在歧义。

Pattern 4: Confidence Calibration

模式4:置信度校准

Confidence scores should be calibrated to actual reliability:
置信度分数应与实际可靠性校准:

Confidence Factors

置信度因素

| Factor | High Confidence | Low Confidence |
| --- | --- | --- |
| Position consistency | Both passes agree | Passes disagree |
| Evidence count | 3+ specific citations | Vague or no citations |
| Criterion agreement | All criteria align | Criteria scores vary widely |
| Edge case match | Similar to known cases | Novel situation |
| 因素 | 高置信度 | 低置信度 |
| --- | --- | --- |
| 位置一致性 | 两轮结果一致 | 两轮结果不一致 |
| 证据数量 | 3+个具体引用 | 模糊或无引用 |
| 标准一致性 | 所有标准分数一致 | 标准分数差异大 |
| 边缘案例匹配 | 与已知案例相似 | 全新场景 |
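One way to fold these factors into a single number is a simple additive heuristic; the specific weights below are illustrative assumptions, not calibrated values:

```python
def calibrated_confidence(position_consistent, evidence_citations,
                          criterion_std, novel_situation):
    """Combine the confidence factors above into a single 0-1 score.

    Starts from a neutral 0.5 and adjusts per factor; weights are
    illustrative and should be tuned against human-review outcomes.
    """
    conf = 0.5
    conf += 0.2 if position_consistent else -0.2   # swap-pass agreement
    conf += 0.15 if evidence_citations >= 3 else -0.1  # evidence strength
    conf += 0.1 if criterion_std <= 0.5 else -0.1  # criterion agreement
    conf -= 0.1 if novel_situation else 0.0        # novel edge case
    return max(0.0, min(1.0, conf))
```

Scores produced this way should themselves be validated: if outputs rated 0.9 are overturned by humans as often as those rated 0.6, the weights need recalibration.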

Calibration Prompt Addition

校准提示词补充

Add this to evaluation prompts:
在评估提示词中加入以下内容:

Confidence Assessment

置信度评估

After scoring, assess your confidence:
  1. Evidence Strength: How specific was the evidence you cited?
    • Strong: Quoted exact passages, precise observations
    • Moderate: General observations, reasonable inferences
    • Weak: Vague impressions, assumptions
  2. Criterion Clarity: How clear were the criterion boundaries?
    • Clear: Easy to map output to rubric levels
    • Ambiguous: Output fell between levels
    • Unclear: Rubric didn't fit this case
  3. Overall Confidence: [0.0-1.0]
    • 0.9+: Very confident, clear evidence, obvious rubric fit
    • 0.7-0.9: Confident, good evidence, minor ambiguity
    • 0.5-0.7: Moderate confidence, some ambiguity
    • <0.5: Low confidence, significant uncertainty
Confidence: [score] Confidence Reasoning: [explain what factors affected confidence]
评分后,评估你的置信度:
  1. 证据强度:你引用的证据有多具体?
    • 强:引用了确切段落、精确观察结果
    • 中:一般性观察、合理推断
    • 弱:模糊印象、假设
  2. 标准清晰度:标准边界有多清晰?
    • 清晰:易于将输出映射到评分表层级
    • 模糊:输出处于层级之间
    • 不清晰:评分表不适用于该案例
  3. 整体置信度:[0.0-1.0]
    • 0.9+:非常自信,证据明确,评分表匹配度高
    • 0.7-0.9:自信,证据充分,存在微小模糊
    • 0.5-0.7:中等置信度,存在一些模糊
    • <0.5:低置信度,存在显著不确定性
置信度:[分数] 置信度理由:[解释影响置信度的因素]

Pattern 5: Structured Output Format

模式5:结构化输出格式

Request consistent output structure for easier analysis:
要求一致的输出格式以简化分析:

Evaluation Output Template

评估输出模板


Evaluation Results

评估结果

Metadata

元数据

  • Evaluated: [command/skill name]
  • Test Case: [test case ID or description]
  • Evaluator: [model used]
  • Timestamp: [when evaluated]
  • 评估对象:[命令/技能名称]
  • 测试案例:[测试案例ID或描述]
  • 评估者:[使用的模型]
  • 时间戳:[评估时间]

Criterion Scores

标准分数

| Criterion | Score | Weight | Weighted | Confidence |
| --- | --- | --- | --- | --- |
| Instruction Following | 4/5 | 0.30 | 1.20 | 0.85 |
| Output Completeness | 3/5 | 0.25 | 0.75 | 0.70 |
| Tool Efficiency | 5/5 | 0.20 | 1.00 | 0.90 |
| Reasoning Quality | 4/5 | 0.15 | 0.60 | 0.75 |
| Response Coherence | 4/5 | 0.10 | 0.40 | 0.80 |
| 标准 | 分数 | 权重 | 加权分数 | 置信度 |
| --- | --- | --- | --- | --- |
| 指令遵循度 | 4/5 | 0.30 | 1.20 | 0.85 |
| 输出完整性 | 3/5 | 0.25 | 0.75 | 0.70 |
| 工具效率 | 5/5 | 0.20 | 1.00 | 0.90 |
| 推理质量 | 4/5 | 0.15 | 0.60 | 0.75 |
| 响应连贯性 | 4/5 | 0.10 | 0.40 | 0.80 |

Summary

总结

  • Overall Score: 3.95/5.0
  • Pass Threshold: 3.5/5.0
  • Result: ✅ PASS
  • 整体分数:3.95/5.0
  • 通过阈值:3.5/5.0
  • 结果:✅ 通过

Evidence Summary

证据总结

  • Strengths: [bullet points]
  • Weaknesses: [bullet points]
  • Improvements: [prioritized suggestions]
  • 优势:[要点]
  • 不足:[要点]
  • 优化建议:[按优先级排序的建议]

Confidence Assessment

置信度评估

  • Overall Confidence: 0.78
  • Flags: [any concerns or caveats]
  • 整体置信度:0.78
  • 标记:[任何关注点或注意事项]
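The overall score in the summary above is the sum of the weighted criterion scores. A minimal sketch, reusing the example table's names and weights:

上述总结中的整体分数是各加权标准分数之和。以下为最小示例,沿用示例表格中的名称与权重:

```python
def weighted_score(scores):
    """Combine per-criterion (score, weight) pairs into an overall score."""
    return sum(score * weight for score, weight in scores.values())

# Values mirror the example table above.
scores = {
    "instruction_following": (4, 0.30),
    "output_completeness":   (3, 0.25),
    "tool_efficiency":       (5, 0.20),
    "reasoning_quality":     (4, 0.15),
    "response_coherence":    (4, 0.10),
}
overall = weighted_score(scores)   # 1.20 + 0.75 + 1.00 + 0.60 + 0.40 = 3.95
passed = overall >= 3.5            # pass threshold from the template
```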

Evaluation Workflows for Claude Code Development

Claude Code开发的评估流程

Workflow: Testing a New Command

流程:测试新命令

  1. Write 5-10 test cases spanning complexity levels
  2. Run command on each test case, capture full output
  3. Quick screen all outputs with Tier 1 evaluation
  4. Detailed evaluate failures and borderline cases
  5. Identify patterns in failures to guide prompt improvements
  6. Iterate prompt based on specific weaknesses found
  7. Re-evaluate same test cases to measure improvement
  1. 编写5-10个测试案例,覆盖不同复杂度层级
  2. 在每个测试案例上运行命令,捕获完整输出
  3. 使用第一层评估快速筛选所有输出
  4. 对失败和边界案例进行详细评估
  5. 识别故障模式以指导提示词优化
  6. 基于发现的具体弱点迭代提示词
  7. 在相同测试案例上重新评估以衡量优化效果
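Steps 2-4 of this workflow can be sketched as a small harness. `run_command` and `quick_screen` are hypothetical callables standing in for your own command runner and Tier 1 screen:

该流程的第2-4步可以勾勒为一个小型测试框架。`run_command` 和 `quick_screen` 是假设的可调用对象,代表你自己的命令运行器和第一层筛选:

```python
def screen_command_outputs(run_command, quick_screen, test_cases):
    """Run each test case, quick-screen the output, and collect the
    failures and borderline cases that need detailed evaluation."""
    results = []
    for case in test_cases:
        output = run_command(case)          # step 2: capture full output
        verdict = quick_screen(output)      # step 3: Tier 1 quick screen
        results.append({"case": case, "output": output, "verdict": verdict})
    # step 4: only non-passing cases get the detailed evaluation
    needs_detail = [r for r in results if r["verdict"] != "pass"]
    return results, needs_detail

# Stubbed run over two complexity levels:
results, needs_detail = screen_command_outputs(
    run_command=lambda case: f"output for {case}",
    quick_screen=lambda out: "fail" if "complex" in out else "pass",
    test_cases=["simple case", "complex case"],
)
```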

Workflow: Comparing Prompt Variants

流程:对比提示词变体

  1. Create variant prompts (e.g., different instruction phrasings)
  2. Run both variants on identical test cases
  3. Pairwise compare with position swapping
  4. Calculate win rate for each variant
  5. Analyze which cases each variant handles better
  6. Decide: Pick winner or create hybrid
  1. 创建变体提示词(如不同指令表述)
  2. 在相同测试案例上运行两个变体
  3. 使用位置交换进行成对比较
  4. 计算每个变体的胜率
  5. 分析每个变体更擅长处理的案例
  6. 决策:选择获胜变体或创建混合变体
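Steps 3-4 above (position swapping and win rate) can be sketched as follows; `judge` is a hypothetical callable returning "first", "second", or "tie" for the output it prefers:

上述第3-4步(位置交换与胜率)可按如下方式勾勒;`judge` 是一个假设的可调用对象,返回其偏好的 "first"、"second" 或 "tie":

```python
def pairwise_with_swap(judge, output_a, output_b):
    """Judge a pair twice with positions swapped to control position bias."""
    r1 = judge(output_a, output_b)                        # A in position 1
    r2 = judge(output_b, output_a)                        # swapped: A in position 2
    v1 = {"first": "A", "second": "B", "tie": "tie"}[r1]  # map back to variants
    v2 = {"first": "B", "second": "A", "tie": "tie"}[r2]
    return v1 if v1 == v2 else "inconsistent"             # a flip signals position bias

def win_rate(verdicts, variant="A"):
    """Step 4: wins for `variant` over all decided comparisons."""
    decided = [v for v in verdicts if v in ("A", "B")]
    return sum(v == variant for v in decided) / len(decided) if decided else 0.0

# A judge that genuinely prefers one output stays consistent across the swap:
length_judge = lambda first, second: "first" if len(first) > len(second) else "second"
verdict = pairwise_with_swap(length_judge, "a much longer output", "short")  # "A"
```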

Workflow: Regression Testing

流程:回归测试

  1. Maintain test suite of representative cases
  2. Before changes: Run evaluation, record baseline scores
  3. After changes: Re-run evaluation
  4. Compare: Flag regressions (score drops > 0.5)
  5. Investigate: Why did specific cases regress?
  6. Accept or revert: Based on overall impact
  1. 维护代表性案例的测试套件
  2. 修改前:运行评估,记录基准分数
  3. 修改后:重新运行评估
  4. 对比:标记回归(分数下降 > 0.5)
  5. 调查:特定案例为何出现回归?
  6. 接受或回滚:基于整体影响决策
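Step 4's regression flag (score drops > 0.5) reduces to a comparison of per-case scores. A minimal sketch with illustrative case IDs:

第4步的回归标记(分数下降 > 0.5)可归结为逐案例分数对比。以下为使用示意性案例ID的最小示例:

```python
REGRESSION_THRESHOLD = 0.5  # flag drops larger than this (on the 1-5 scale)

def find_regressions(baseline, current, threshold=REGRESSION_THRESHOLD):
    """Compare per-case scores before and after a change; flag regressions."""
    return {
        case: (baseline[case], current[case])
        for case in baseline
        if case in current and baseline[case] - current[case] > threshold
    }

baseline = {"case-1": 4.2, "case-2": 3.8, "case-3": 4.5}
current  = {"case-1": 4.1, "case-2": 3.0, "case-3": 4.6}
regressions = find_regressions(baseline, current)  # only case-2 dropped > 0.5
```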

Workflow: Continuous Quality Monitoring

流程:持续质量监控

  1. Sample production usage (if available)
  2. Run lightweight evaluation on samples
  3. Track metrics over time:
    • Average scores by criterion
    • Failure rate
    • Low-confidence rate
  4. Alert on degradation: Score drop > 10% from baseline
  5. Periodic deep dive: Monthly detailed evaluation on random sample
  1. 抽样生产使用数据(如有)
  2. 对样本运行轻量评估
  3. 随时间跟踪指标
    • 各标准的平均分数
    • 失败率
    • 低置信度率
  4. 性能下降告警:分数比基准下降 > 10%
  5. 定期深度分析:每月对随机样本进行详细评估
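The step 4 alert condition (score drop > 10% from baseline) as a one-line check:

第4步的告警条件(分数比基准下降 > 10%)可写成单行检查:

```python
def should_alert(baseline, current, max_drop=0.10):
    """Alert when the current score drops more than `max_drop` below baseline."""
    return current < baseline * (1 - max_drop)
```

For example, with a baseline of 4.0 the alert fires at 3.5 but not at 3.7.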

Anti-Patterns to Avoid

需避免的反模式

❌ Scoring Without Justification

❌ 无理由评分

Problem: Scores lack grounding, difficult to debug Solution: Always require evidence before score
问题:分数缺乏依据,难以调试 解决方案:始终要求先提供证据再给分

❌ Single-Pass Pairwise Comparison

❌ 单次成对比较

Problem: Position bias corrupts results Solution: Always swap positions and check consistency
问题:位置偏见会影响结果 解决方案:始终交换位置并检查一致性

❌ Overloaded Criteria

❌ 过载标准

Problem: Criteria measuring multiple things are unreliable Solution: One criterion = one measurable aspect
问题:衡量多个内容的标准不可靠 解决方案:一个标准 = 一个可衡量的维度

❌ Missing Edge Case Guidance

❌ 缺失边缘案例指导

Problem: Evaluators handle ambiguous cases inconsistently Solution: Include edge cases in rubrics with explicit guidance
问题:评估者处理模糊场景的方式不一致 解决方案:在评分表中加入边缘案例及明确指导

❌ Ignoring Low Confidence

❌ 忽略低置信度

Problem: Acting on uncertain evaluations leads to wrong conclusions Solution: Escalate low-confidence cases for human review
问题:基于不确定评估采取行动会导致错误结论 解决方案:将低置信度案例升级至人工审查

❌ Generic Rubrics

❌ 通用评分表

Problem: Generic criteria produce vague, unhelpful evaluations Solution: Create domain-specific rubrics (code commands vs documentation commands vs analysis commands)
问题:通用标准会产生模糊、无用的评估结果 解决方案:创建领域特定评分表(代码命令 vs 文档命令 vs 分析命令)

Handling Evaluation Failures

处理评估失败

When evaluations fail or produce unreliable results, use these recovery strategies:
当评估失败或产生不可靠结果时,使用以下恢复策略:

Handling Malformed Output

处理格式错误的输出

When the evaluator produces unparseable or incomplete output:
  1. Mark as invalid and exclude from analysis: malformed output usually indicates hallucination during the thinking process
  2. Retry the original prompt unchanged: multiple retries are usually more consistent than a single one-shot attempt
  3. If the output is still malformed, flag for human review: mark as "evaluation failed, needs manual check" and queue for later
当评估者产生无法解析或不完整的输出时:
  1. 标记为无效并忽略分析 - 错误输出通常是思考过程中的幻觉导致
  2. 不修改原始提示词重试 - 多次重试通常比单次提示更一致
  3. 若仍产生错误输出,标记为人工审查:标记为"评估失败,需手动检查"并延后处理
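These three recovery steps can be sketched as a retry loop. `evaluate` and `parse` are hypothetical callables; `parse` is assumed to raise `ValueError` on malformed output:

这三个恢复步骤可以勾勒为一个重试循环。`evaluate` 和 `parse` 是假设的可调用对象;假定 `parse` 在输出格式错误时抛出 `ValueError`:

```python
def evaluate_with_retries(evaluate, parse, max_retries=3):
    """Retry an evaluation when its output cannot be parsed; escalate if it
    never succeeds."""
    for _attempt in range(max_retries):
        raw = evaluate()
        try:
            return parse(raw)        # valid result: use it
        except ValueError:
            continue                 # step 1-2: discard and retry unchanged
    # step 3: still malformed after retries, escalate to a human
    return {"status": "evaluation failed, needs manual check"}

# Stubbed evaluator that produces garbled output twice, then a valid score:
calls = {"n": 0}
def flaky_evaluate():
    calls["n"] += 1
    return "garbled" if calls["n"] < 3 else "score: 4"

def parse_score(raw):
    if not raw.startswith("score:"):
        raise ValueError("unparseable evaluator output")
    return {"score": int(raw.split(":")[1])}

result = evaluate_with_retries(flaky_evaluate, parse_score)
```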

Validation Checklist

验证清单

Before trusting evaluation results, verify:
  • All criteria have scores in valid range (1-5)
  • Each score has a justification referencing specific evidence
  • Confidence score is provided and reasonable
  • No contradictions between justification and assigned score
  • Weighted total calculation is correct
信任评估结果前,验证:
  • 所有标准的分数均在有效范围(1-5)内
  • 每个分数均有引用具体证据的理由
  • 提供了合理的置信度分数
  • 理由与分数之间无矛盾
  • 加权总分计算正确
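The mechanical items on this checklist can be automated; only the justification/score contradiction check still needs a human or judge model. A sketch over an assumed result structure:

清单中可机械化的检查项可以自动化;只有理由与分数的矛盾检查仍需人工或评审模型。以下为基于假设结果结构的示例:

```python
def validate_evaluation(result, weights):
    """Run the mechanical checklist items; return a list of problems found."""
    problems = []
    for criterion, entry in result["criteria"].items():
        if not 1 <= entry["score"] <= 5:
            problems.append(f"{criterion}: score out of 1-5 range")
        if not entry.get("justification"):
            problems.append(f"{criterion}: missing justification")
    if not 0.0 <= result.get("confidence", -1.0) <= 1.0:
        problems.append("confidence missing or out of range")
    expected = sum(result["criteria"][c]["score"] * w for c, w in weights.items())
    if abs(result["overall"] - expected) > 1e-6:
        problems.append("weighted total does not match criterion scores")
    return problems

result = {
    "criteria": {"accuracy": {"score": 4, "justification": "quotes exact passage"}},
    "confidence": 0.8,
    "overall": 4.0,
}
```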

Validating Evaluation Prompts (Meta-Evaluation)

验证评估提示词(元评估)

Before using an evaluation prompt in production, test it against known cases:
在生产环境中使用评估提示词前,针对已知案例进行测试:

Calibration Test Cases

校准测试案例

Create a small set of outputs with known quality levels:
| Test Type | Description | Expected Score |
| --- | --- | --- |
| Known-good | Clearly excellent output | 4.5+ / 5.0 |
| Known-bad | Clearly poor output | < 2.5 / 5.0 |
| Boundary | Borderline case | 3.0-3.5 with nuanced explanation |

创建具有已知质量水平的小型输出集:

| 测试类型 | 描述 | 预期分数 |
| --- | --- | --- |
| 已知优秀 | 明显优秀的输出 | 4.5+ / 5.0 |
| 已知较差 | 明显较差的输出 | < 2.5 / 5.0 |
| 边界 | 边界案例 | 3.0-3.5,带详细解释 |

Validation Workflow

验证流程

  1. Known-good test: Evaluate a clearly excellent output
    • If score < 4.0 → Rubric is too strict or evidence requirements unclear
  2. Known-bad test: Evaluate a clearly poor output
    • If score > 3.0 → Rubric is too lenient or criteria not specific enough
  3. Boundary test: Evaluate a borderline case
    • Should produce moderate score (3.0-3.5) with detailed explanation
    • If confident high/low score → Criteria lack nuance
  4. Consistency test: Run same evaluation 3 times
    • Score variance should be < 0.5
    • If higher variance → Criteria need tighter definitions
  1. 已知优秀测试:评估明显优秀的输出
    • 若分数 < 4.0 → 评分表过于严格或证据要求不明确
  2. 已知较差测试:评估明显较差的输出
    • 若分数 > 3.0 → 评分表过于宽松或标准不够具体
  3. 边界测试:评估边界案例
    • 应给出中等分数(3.0-3.5)及详细解释
    • 若给出高/低置信度分数 → 标准缺乏细微差别
  4. 一致性测试:重复运行3次相同评估
    • 分数差异应 < 0.5
    • 若差异更大 → 标准需要更严格的定义
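The step 4 consistency test, reading the "variance" bound as the spread (max minus min) across repeated runs on the 1-5 scale:

第4步的一致性测试,此处将"差异"上限理解为重复运行分数在1-5量表上的极差(最大值减最小值):

```python
def consistent(scores, max_spread=0.5):
    """Repeated runs of the same evaluation should land within 0.5 of each
    other; a wider spread means the criteria need tighter definitions."""
    return max(scores) - min(scores) < max_spread
```

For example, three runs scoring 4.0, 4.2, 4.1 pass; 3.0, 3.8, 3.6 do not.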

Position Bias Validation

位置偏见验证

Test for position bias before using pairwise comparisons:
在使用成对比较前测试位置偏见:

Position Bias Test

位置偏见测试

Run this test with IDENTICAL outputs in both positions:
Test Case: [Same output text] Position A: [Paste output] Position B: [Paste identical output]
Expected Result: TIE with high confidence (>0.9)
If Result Shows Winner:
  • Position bias detected
  • Add stronger anti-bias instructions to prompt
  • Re-test until TIE achieved consistently
在两个位置使用完全相同的输出运行此测试:
测试案例:[相同输出文本] 位置A:[粘贴输出] 位置B:[粘贴相同输出]
预期结果:平局且置信度高(>0.9)
若结果显示有获胜者:
  • 检测到位置偏见
  • 在提示词中加入更强的反偏见说明
  • 重新测试直到持续得到平局结果

Evaluation Prompt Iteration

评估提示词迭代

When calibration tests fail:
  1. Identify failure mode: Too strict? Too lenient? Inconsistent?
  2. Adjust specific rubric levels: Add examples, clarify boundaries
  3. Re-run calibration tests: All 4 tests must pass
  4. Document changes: Track what adjustments improved reliability
当校准测试失败时:
  1. 识别失败模式:过于严格?过于宽松?不一致?
  2. 调整特定评分表层级:加入示例、明确边界
  3. 重新运行校准测试:所有4项测试必须通过
  4. 记录修改:跟踪哪些调整提升了可靠性

Metric Selection Guide for LLM Evaluation

LLM评估的指标选择指南

This reference provides guidance on selecting appropriate metrics for different evaluation scenarios.
本参考文档为不同评估场景提供指标选择指导。

Metric Categories

指标类别

Classification Metrics

分类指标

Use for binary or multi-class evaluation tasks (pass/fail, correct/incorrect).
适用于二元或多类评估任务(通过/失败、正确/错误)。

Precision

精确率

Precision = True Positives / (True Positives + False Positives)
Interpretation: Of all responses the judge said were good, what fraction were actually good?
Use when: False positives are costly (e.g., approving unsafe content)
精确率 = 真阳性 / (真阳性 + 假阳性)
解读:评估者认为良好的响应中,真正良好的比例是多少?
适用场景:假阳性成本高(如批准不安全内容)

Recall

召回率

Recall = True Positives / (True Positives + False Negatives)
Interpretation: Of all actually good responses, what fraction did the judge identify?
Use when: False negatives are costly (e.g., missing good content in filtering)
召回率 = 真阳性 / (真阳性 + 假阴性)
解读:真正良好的响应中,评估者识别出的比例是多少?
适用场景:假阴性成本高(如过滤时遗漏优质内容)

F1 Score

F1分数

F1 = 2 * (Precision * Recall) / (Precision + Recall)
Interpretation: Harmonic mean of precision and recall
Use when: You need a single number balancing both concerns
F1 = 2 * (精确率 * 召回率) / (精确率 + 召回率)
解读:精确率和召回率的调和平均数
适用场景:需要平衡两者的单一汇总指标
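The three formulas above in code, with a small worked example (judge labels against ground truth):

以上三个公式的代码实现,附一个小型示例(评估者标签对照真实标签):

```python
def precision_recall_f1(judge, truth, positive="good"):
    """Precision, recall and F1 for binary judge decisions."""
    tp = sum(j == positive and t == positive for j, t in zip(judge, truth))
    fp = sum(j == positive and t != positive for j, t in zip(judge, truth))
    fn = sum(j != positive and t == positive for j, t in zip(judge, truth))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

judge = ["good", "good", "bad", "good"]   # what the judge said
truth = ["good", "bad", "bad", "good"]    # what was actually true
p, r, f = precision_recall_f1(judge, truth)  # p = 2/3, r = 1.0, f = 0.8
```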

Agreement Metrics

一致性指标

Use for comparing automated evaluation with human judgment.
用于对比自动化评估与人工判断。

Cohen's Kappa (κ)

Cohen's Kappa (κ)

κ = (Observed Agreement - Expected Agreement) / (1 - Expected Agreement)
Interpretation: Agreement adjusted for chance
  • κ > 0.8: Almost perfect agreement
  • κ 0.6-0.8: Substantial agreement
  • κ 0.4-0.6: Moderate agreement
  • κ < 0.4: Fair to poor agreement
Use for: Binary or categorical judgments
κ = (观察一致性 - 预期一致性) / (1 - 预期一致性)
解读:调整了随机一致性后的一致性
  • κ > 0.8:几乎完美一致
  • κ 0.6-0.8:高度一致
  • κ 0.4-0.6:中度一致
  • κ < 0.4:一致性一般或较差
适用场景:二元或分类判断
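Cohen's κ computed from two raters' labels; expected agreement comes from each rater's label frequencies:

根据两位评分者的标签计算Cohen's κ;预期一致性由各评分者的标签频率得出:

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters over the same categorical items."""
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # chance agreement: product of each rater's marginal label frequencies
    expected = sum((rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels)
    if expected == 1:
        return 1.0  # degenerate case: both raters use a single label
    return (observed - expected) / (1 - expected)

rater_a = ["pass", "pass", "fail", "pass"]
rater_b = ["pass", "fail", "fail", "pass"]
kappa = cohens_kappa(rater_a, rater_b)  # observed 0.75, expected 0.5 -> 0.5
```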

Weighted Kappa

加权Kappa

For ordinal scales where disagreement severity matters:
Interpretation: Penalizes large disagreements more than small ones
适用于分歧严重程度重要的有序量表:
解读:对严重分歧的惩罚大于轻微分歧

Correlation Metrics

相关性指标

Use for ordinal/continuous scores.
适用于有序/连续分数。

Spearman's Rank Correlation (ρ)

Spearman秩相关系数 (ρ)

Interpretation: Correlation between rankings, not absolute values
  • ρ > 0.9: Very strong correlation
  • ρ 0.7-0.9: Strong correlation
  • ρ 0.5-0.7: Moderate correlation
  • ρ < 0.5: Weak correlation
Use when: Order matters more than exact values
解读:排名之间的相关性,而非绝对值
  • ρ > 0.9:极强相关性
  • ρ 0.7-0.9:强相关性
  • ρ 0.5-0.7:中度相关性
  • ρ < 0.5:弱相关性
适用场景:顺序比精确值更重要
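Spearman's ρ is the Pearson correlation of the ranks. A stdlib-only sketch, using average ranks for ties:

Spearman's ρ 即秩的Pearson相关系数。以下为仅用标准库的示例,并列值采用平均秩:

```python
def ranks(values):
    """Average 1-based ranks, with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1                      # extend over the tied block
        avg = (i + j) / 2 + 1           # mean rank of the tied block
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman_rho(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

A perfectly monotonic pairing gives ρ = 1.0 and a reversed one gives ρ = -1.0, regardless of the absolute values.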

Kendall's Tau (τ)

Kendall's Tau (τ)

Interpretation: Similar to Spearman but based on pairwise concordance
Use when: You have many tied values
解读:与Spearman类似,但基于成对一致性
适用场景:存在大量并列值

Pearson Correlation (r)

Pearson相关系数 (r)

Interpretation: Linear correlation between scores
Use when: Exact score values matter, not just order
解读:分数之间的线性相关性
适用场景:精确分数值重要,而非仅顺序

Pairwise Comparison Metrics

成对比较指标

Agreement Rate

一致率

Agreement = (Matching Decisions) / (Total Comparisons)
Interpretation: Simple percentage of agreement
一致率 = (匹配决策数) / (总比较数)
解读:简单的一致百分比

Position Consistency

位置一致性

Consistency = (Consistent across position swaps) / (Total comparisons)
Interpretation: How often does the decision stay the same when positions are swapped?
一致性 = (位置交换后结果一致的次数) / (总比较数)
解读:交换位置后决策保持一致的频率

Selection Decision Tree

选择决策树

What type of evaluation task?
├── Binary classification (pass/fail)
│   └── Use: Precision, Recall, F1, Cohen's κ
├── Ordinal scale (1-5 rating)
│   ├── Comparing to human judgments?
│   │   └── Use: Spearman's ρ, Weighted κ
│   └── Comparing two automated judges?
│       └── Use: Kendall's τ, Spearman's ρ
├── Pairwise preference
│   └── Use: Agreement rate, Position consistency
└── Multi-label classification
    └── Use: Macro-F1, Micro-F1, Per-label metrics
评估任务类型是什么?
├── 二元分类(通过/失败)
│   └── 使用:精确率、召回率、F1分数、Cohen's κ
├── 有序量表(1-5评分)
│   ├── 与人工判断对比?
│   │   └── 使用:Spearman's ρ、加权κ
│   └── 对比两个自动化评估者?
│       └── 使用:Kendall's τ、Spearman's ρ
├── 成对偏好
│   └── 使用:一致率、位置一致性
└── 多标签分类
    └── 使用:Macro-F1、Micro-F1、单标签指标

Metric Selection by Use Case

按用例选择指标

Use Case 1: Validating Automated Evaluation

用例1:验证自动化评估

Goal: Ensure automated evaluation correlates with human judgment
Recommended Metrics:
  1. Primary: Spearman's ρ (for ordinal scales) or Cohen's κ (for categorical)
  2. Secondary: Per-criterion agreement
  3. Diagnostic: Confusion matrix for systematic errors
目标:确保自动化评估与人工判断相关
推荐指标
  1. 主要:Spearman's ρ(有序量表)或Cohen's κ(分类)
  2. 次要:单标准一致性
  3. 诊断:系统误差的混淆矩阵

Use Case 2: Comparing Two Models

用例2:对比两个模型

Goal: Determine which model produces better outputs
Recommended Metrics:
  1. Primary: Win rate (from pairwise comparison)
  2. Secondary: Position consistency (bias check)
  3. Diagnostic: Per-criterion breakdown
目标:确定哪个模型生成的输出更优
推荐指标
  1. 主要:胜率(来自成对比较)
  2. 次要:位置一致性(偏见检查)
  3. 诊断:单标准细分

Use Case 3: Quality Monitoring

用例3:质量监控

Goal: Track evaluation quality over time
Recommended Metrics:
  1. Primary: Rolling agreement with human spot-checks
  2. Secondary: Score distribution stability
  3. Diagnostic: Bias indicators (position, length)
目标:随时间跟踪评估质量
推荐指标
  1. 主要:与人工抽查的滚动一致性
  2. 次要:分数分布稳定性
  3. 诊断:偏见指标(位置、长度)

Interpreting Metric Results

解读指标结果

Good Evaluation System Indicators

良好评估系统的指标

| Metric | Good | Acceptable | Concerning |
| --- | --- | --- | --- |
| Spearman's ρ | > 0.8 | 0.6-0.8 | < 0.6 |
| Cohen's κ | > 0.7 | 0.5-0.7 | < 0.5 |
| Position consistency | > 0.9 | 0.8-0.9 | < 0.8 |
| Length correlation | < 0.2 | 0.2-0.4 | > 0.4 |

| 指标 | 良好 | 可接受 | 需关注 |
| --- | --- | --- | --- |
| Spearman's ρ | > 0.8 | 0.6-0.8 | < 0.6 |
| Cohen's κ | > 0.7 | 0.5-0.7 | < 0.5 |
| 位置一致性 | > 0.9 | 0.8-0.9 | < 0.8 |
| 长度相关性 | < 0.2 | 0.2-0.4 | > 0.4 |

Warning Signs

警告信号

  1. High agreement but low correlation: May indicate calibration issues
  2. Low position consistency: Position bias affecting results
  3. High length correlation: Length bias inflating scores
  4. Per-criterion variance: Some criteria may be poorly defined
  1. 高一致率但低相关性:可能表明校准问题
  2. 低位置一致性:位置偏见影响结果
  3. 高长度相关性:长度偏见导致分数虚高
  4. 单标准方差大:部分标准可能定义不佳

Reporting Template

报告模板


Evaluation System Metrics Report

评估系统指标报告

Human Agreement

人工一致性

  • Spearman's ρ: 0.82 (p < 0.001)
  • Cohen's κ: 0.74
  • Sample size: 500 evaluations
  • Spearman's ρ:0.82(p < 0.001)
  • Cohen's κ:0.74
  • 样本量:500次评估

Bias Indicators

偏见指标

  • Position consistency: 91%
  • Length-score correlation: 0.12
  • 位置一致性:91%
  • 长度-分数相关性:0.12

Per-Criterion Performance

单标准性能

| Criterion | Spearman's ρ | κ |
| --- | --- | --- |
| Accuracy | 0.88 | 0.79 |
| Clarity | 0.76 | 0.68 |
| Completeness | 0.81 | 0.72 |

| 标准 | Spearman's ρ | κ |
| --- | --- | --- |
| 准确性 | 0.88 | 0.79 |
| 清晰度 | 0.76 | 0.68 |
| 完整性 | 0.81 | 0.72 |

Recommendations

建议

  • All metrics within acceptable ranges
  • Monitor "Clarity" criterion - lower agreement may indicate need for rubric refinement
  • 所有指标均在可接受范围内
  • 监控"清晰度"标准——一致性较低可能表明需要优化评分表