
Advanced Evaluation


LLM-as-a-Judge techniques for evaluating AI outputs. This is not a single technique but a family of approaches; choosing the right one and mitigating its biases is the core competency.

When to Activate


  • Building automated evaluation pipelines for LLM outputs
  • Comparing multiple model responses to select the best one
  • Establishing consistent quality standards
  • Debugging inconsistent evaluation results
  • Designing A/B tests for prompt or model changes
  • Creating rubrics for human or automated evaluation

Core Concepts


Evaluation Taxonomy


Direct Scoring: Single LLM rates one response on a defined scale.
  • Best for: Objective criteria (factual accuracy, instruction following, toxicity)
  • Reliability: Moderate to high for well-defined criteria
Pairwise Comparison: LLM compares two responses and selects the better one.
  • Best for: Subjective preferences (tone, style, persuasiveness)
  • Reliability: Higher than direct scoring for preferences
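The two approaches above differ mainly in what the judge LLM is asked to produce. A minimal sketch of the two corresponding prompt builders — the wording, criteria names, and output format here are illustrative assumptions, not a prescribed template:

```python
def direct_scoring_prompt(response: str, criterion: str, scale: str = "1-5") -> str:
    """Build a direct-scoring judge prompt: one response, one defined scale.

    Asks for justification before the score (see Direct Scoring Requirements).
    """
    return (
        f"Rate the following response on {criterion} using a {scale} scale.\n"
        "First write a short justification, then end with a line 'Score: N'.\n\n"
        f"Response:\n{response}\n\n"
        "Justification:"
    )


def pairwise_prompt(response_a: str, response_b: str, criterion: str) -> str:
    """Build a pairwise-comparison judge prompt: pick the better of two responses."""
    return (
        f"Compare the two responses below on {criterion}.\n"
        "Answer with exactly one of: A, B, TIE.\n\n"
        f"Response A:\n{response_a}\n\n"
        f"Response B:\n{response_b}\n\n"
        "Verdict:"
    )
```

Both builders return plain strings, so they can be unit-tested without calling a model.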

Known Biases

已知偏差

| Bias | Description | Mitigation |
| --- | --- | --- |
| Position | First-position preference | Swap positions, check consistency |
| Length | Longer responses score higher | Explicit prompting, length-normalized scoring |
| Self-Enhancement | Models rate their own outputs higher | Use a different model for evaluation |
| Verbosity | Unnecessary detail rated higher | Criteria-specific rubrics |
| Authority | Confident tone rated higher | Require evidence citation |
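One mitigation listed for length bias, length-normalized scoring, can be sketched as a post-hoc adjustment to the judge's raw score. The penalty form (log-ratio to a target length) and the coefficients below are illustrative assumptions, not a standard formula:

```python
import math


def length_normalized_score(raw_score: float, n_tokens: int,
                            target_tokens: int = 150,
                            max_penalty: float = 0.5) -> float:
    """Subtract a penalty proportional to how far the response length
    deviates (in log-ratio terms) from a target length, capped at
    `max_penalty` points and floored at the bottom of a 1-5 scale.
    Illustrative only: real pipelines should calibrate against labels.
    """
    deviation = abs(math.log(max(n_tokens, 1) / target_tokens))
    penalty = min(max_penalty, max_penalty * deviation)
    return max(1.0, raw_score - penalty)
```

A response at the target length keeps its raw score; one four times longer loses the full half-point cap.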

Decision Framework


Is there an objective ground truth?
├── Yes → Direct Scoring (factual accuracy, format compliance)
└── No → Pairwise Comparison (tone, style, creativity)

Quick Reference


Direct Scoring Requirements


  1. Clear criteria definitions
  2. Calibrated scale (1-5 recommended)
  3. Chain-of-thought: justification BEFORE score (improves reliability 15-25%)
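Requirement 3 implies the judge's output must put the justification first and the score last, and that the score position is machine-checkable. A sketch of such a parser, assuming the judge was instructed to end with a line like `Score: 4` (the format itself is an assumption of this sketch):

```python
import re


def parse_judgment(output: str, scale: tuple[int, int] = (1, 5)):
    """Extract (justification, score) from a judge response that ends
    with a trailing 'Score: N' line. Raises ValueError on malformed
    output or out-of-range scores, so bad judgments fail loudly
    instead of being silently counted.
    """
    text = output.strip()
    match = re.search(r"Score:\s*(\d+)\s*$", text)
    if not match:
        raise ValueError("no trailing 'Score: N' line found")
    score = int(match.group(1))
    if not scale[0] <= score <= scale[1]:
        raise ValueError(f"score {score} outside scale {scale}")
    justification = text[: match.start()].strip()
    return justification, score
```

Anchoring the pattern to the end of the output enforces the justification-before-score ordering.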

Pairwise Comparison Protocol


  1. First pass: A in first position
  2. Second pass: B in first position (swap)
  3. Consistency check: If passes disagree → TIE
  4. Final verdict: Consistent winner with averaged confidence
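Steps 3 and 4 of the protocol reduce to a small aggregation function. This sketch assumes each pass yields a winner label (`"A"`, `"B"`, or `"TIE"`, with the swapped pass's verdict already mapped back to the original labels) and a confidence in [0, 1]:

```python
def aggregate_pairwise(pass1: tuple[str, float],
                       pass2: tuple[str, float]) -> tuple[str, float]:
    """Combine two judge passes run with swapped positions.

    Any disagreement between the passes is treated as a TIE
    (position-bias check); a consistent winner keeps the
    averaged confidence of the two passes.
    """
    (winner1, conf1), (winner2, conf2) = pass1, pass2
    if winner1 != winner2:
        return "TIE", 0.0
    return winner1, (conf1 + conf2) / 2
```

Treating disagreement as a tie is deliberately conservative: a verdict that flips when positions swap is evidence of position bias, not of a real quality difference.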

Rubric Components


  • Level descriptions with clear boundaries
  • Observable characteristics per level
  • Edge case guidance
  • Strictness calibration (lenient/balanced/strict)
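A rubric with these components can be kept as structured data and rendered into the judge prompt. The level wording below is an illustrative example for a factual-accuracy criterion, not a recommended rubric:

```python
# Example rubric with the four components above (contents are illustrative).
RUBRIC = {
    "criterion": "factual accuracy",
    "strictness": "balanced",  # lenient / balanced / strict
    "levels": {  # level descriptions with observable characteristics
        5: "Every claim is verifiably correct; sources cited where relevant.",
        3: "Mostly correct; minor errors that do not change the conclusion.",
        1: "Central claims are wrong or unsupported.",
    },
    "edge_cases": "If the response refuses to answer, score it 1, not N/A.",
}


def render_rubric(rubric: dict) -> str:
    """Render a rubric dict into judge-prompt text, highest level first."""
    lines = [f"Criterion: {rubric['criterion']} "
             f"(strictness: {rubric['strictness']})"]
    for level in sorted(rubric["levels"], reverse=True):
        lines.append(f"  {level}: {rubric['levels'][level]}")
    lines.append(f"Edge cases: {rubric['edge_cases']}")
    return "\n".join(lines)
```

Keeping the rubric as data rather than free text makes strictness calibration a one-field change and lets the same rubric drive both human and automated evaluation.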

Integration


Works with:
  • context-fundamentals - Effective context structure
  • tool-design - Evaluation tool schemas
  • evaluation (foundational) - Core evaluation concepts

For detailed implementation patterns, prompt templates, examples, and metrics: references/full-guide.md
See also: references/implementation-patterns.md, references/bias-mitigation.md, references/metrics-guide.md