
Advanced Evaluation


LLM-as-a-Judge techniques for evaluating AI outputs. This is not a single technique but a family of approaches; choosing the right one and mitigating its biases is the core competency.

When to Activate


  • Building automated evaluation pipelines for LLM outputs
  • Comparing multiple model responses to select the best one
  • Establishing consistent quality standards
  • Debugging inconsistent evaluation results
  • Designing A/B tests for prompt or model changes
  • Creating rubrics for human or automated evaluation

Core Concepts


Evaluation Taxonomy


Direct Scoring: Single LLM rates one response on a defined scale.
  • Best for: Objective criteria (factual accuracy, instruction following, toxicity)
  • Reliability: Moderate to high for well-defined criteria
Pairwise Comparison: LLM compares two responses and selects the better one.
  • Best for: Subjective preferences (tone, style, persuasiveness)
  • Reliability: Higher than direct scoring for preferences
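The two approaches above differ mainly in what the judge LLM is asked to produce. A minimal sketch of the two corresponding prompt builders — the wording, criteria names, and output format here are illustrative assumptions, not a prescribed template:

```python
def direct_scoring_prompt(response: str, criterion: str, scale: str = "1-5") -> str:
    """Build a direct-scoring judge prompt: one response, one defined scale.

    Asks for justification before the score (see Direct Scoring Requirements).
    """
    return (
        f"Rate the following response on {criterion} using a {scale} scale.\n"
        "First write a short justification, then end with a line 'Score: N'.\n\n"
        f"Response:\n{response}\n\n"
        "Justification:"
    )


def pairwise_prompt(response_a: str, response_b: str, criterion: str) -> str:
    """Build a pairwise-comparison judge prompt: pick the better of two responses."""
    return (
        f"Compare the two responses below on {criterion}.\n"
        "Answer with exactly one of: A, B, TIE.\n\n"
        f"Response A:\n{response_a}\n\n"
        f"Response B:\n{response_b}\n\n"
        "Verdict:"
    )
```

Both builders return plain strings, so they can be unit-tested without calling a model.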

Known Biases

已知偏差

| Bias | Description | Mitigation |
| --- | --- | --- |
| Position | First-position preference | Swap positions, check consistency |
| Length | Longer responses score higher | Explicit prompting, length-normalized scoring |
| Self-Enhancement | Models rate their own outputs higher | Use a different model for evaluation |
| Verbosity | Unnecessary detail rated higher | Criteria-specific rubrics |
| Authority | Confident tone rated higher | Require evidence citation |
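One mitigation listed for length bias, length-normalized scoring, can be sketched as a post-hoc adjustment to the judge's raw score. The penalty form (log-ratio to a target length) and the coefficients below are illustrative assumptions, not a standard formula:

```python
import math


def length_normalized_score(raw_score: float, n_tokens: int,
                            target_tokens: int = 150,
                            max_penalty: float = 0.5) -> float:
    """Subtract a penalty proportional to how far the response length
    deviates (in log-ratio terms) from a target length, capped at
    `max_penalty` points and floored at the bottom of a 1-5 scale.
    Illustrative only: real pipelines should calibrate against labels.
    """
    deviation = abs(math.log(max(n_tokens, 1) / target_tokens))
    penalty = min(max_penalty, max_penalty * deviation)
    return max(1.0, raw_score - penalty)
```

A response at the target length keeps its raw score; one four times longer loses the full half-point cap.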

Decision Framework


Is there an objective ground truth?
├── Yes → Direct Scoring (factual accuracy, format compliance)
└── No → Pairwise Comparison (tone, style, creativity)

Quick Reference


Direct Scoring Requirements


  1. Clear criteria definitions
  2. Calibrated scale (1-5 recommended)
  3. Chain-of-thought: justification BEFORE score (improves reliability 15-25%)
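Requirement 3 implies the judge's output must put the justification first and the score last, and that the score position is machine-checkable. A sketch of such a parser, assuming the judge was instructed to end with a line like `Score: 4` (the format itself is an assumption of this sketch):

```python
import re


def parse_judgment(output: str, scale: tuple[int, int] = (1, 5)):
    """Extract (justification, score) from a judge response that ends
    with a trailing 'Score: N' line. Raises ValueError on malformed
    output or out-of-range scores, so bad judgments fail loudly
    instead of being silently counted.
    """
    text = output.strip()
    match = re.search(r"Score:\s*(\d+)\s*$", text)
    if not match:
        raise ValueError("no trailing 'Score: N' line found")
    score = int(match.group(1))
    if not scale[0] <= score <= scale[1]:
        raise ValueError(f"score {score} outside scale {scale}")
    justification = text[: match.start()].strip()
    return justification, score
```

Anchoring the pattern to the end of the output enforces the justification-before-score ordering.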

Pairwise Comparison Protocol


  1. First pass: A in first position
  2. Second pass: B in first position (swap)
  3. Consistency check: If passes disagree → TIE
  4. Final verdict: Consistent winner with averaged confidence
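Steps 3 and 4 of the protocol reduce to a small aggregation function. This sketch assumes each pass yields a winner label (`"A"`, `"B"`, or `"TIE"`, with the swapped pass's verdict already mapped back to the original labels) and a confidence in [0, 1]:

```python
def aggregate_pairwise(pass1: tuple[str, float],
                       pass2: tuple[str, float]) -> tuple[str, float]:
    """Combine two judge passes run with swapped positions.

    Any disagreement between the passes is treated as a TIE
    (position-bias check); a consistent winner keeps the
    averaged confidence of the two passes.
    """
    (winner1, conf1), (winner2, conf2) = pass1, pass2
    if winner1 != winner2:
        return "TIE", 0.0
    return winner1, (conf1 + conf2) / 2
```

Treating disagreement as a tie is deliberately conservative: a verdict that flips when positions swap is evidence of position bias, not of a real quality difference.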

Rubric Components


  • Level descriptions with clear boundaries
  • Observable characteristics per level
  • Edge case guidance
  • Strictness calibration (lenient/balanced/strict)
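A rubric with these components can be kept as structured data and rendered into the judge prompt. The level wording below is an illustrative example for a factual-accuracy criterion, not a recommended rubric:

```python
# Example rubric with the four components above (contents are illustrative).
RUBRIC = {
    "criterion": "factual accuracy",
    "strictness": "balanced",  # lenient / balanced / strict
    "levels": {  # level descriptions with observable characteristics
        5: "Every claim is verifiably correct; sources cited where relevant.",
        3: "Mostly correct; minor errors that do not change the conclusion.",
        1: "Central claims are wrong or unsupported.",
    },
    "edge_cases": "If the response refuses to answer, score it 1, not N/A.",
}


def render_rubric(rubric: dict) -> str:
    """Render a rubric dict into judge-prompt text, highest level first."""
    lines = [f"Criterion: {rubric['criterion']} "
             f"(strictness: {rubric['strictness']})"]
    for level in sorted(rubric["levels"], reverse=True):
        lines.append(f"  {level}: {rubric['levels'][level]}")
    lines.append(f"Edge cases: {rubric['edge_cases']}")
    return "\n".join(lines)
```

Keeping the rubric as data rather than free text makes strictness calibration a one-field change and lets the same rubric drive both human and automated evaluation.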

Integration


Works with:
  • context-fundamentals - Effective context structure
  • tool-design - Evaluation tool schemas
  • evaluation (foundational) - Core evaluation concepts

For detailed implementation patterns, prompt templates, examples, and metrics: references/full-guide.md
See also: references/implementation-patterns.md, references/bias-mitigation.md, references/metrics-guide.md