Advanced Evaluation


This skill covers production-grade techniques for evaluating LLM outputs using LLMs as judges. It synthesizes research from academic papers, industry practices, and practical implementation experience into actionable patterns for building reliable evaluation systems.
Key insight: LLM-as-a-Judge is not a single technique but a family of approaches, each suited to different evaluation contexts. Choosing the right approach and mitigating known biases is the core competency this skill develops.

When to Activate


Activate this skill when:
  • Building automated evaluation pipelines for LLM outputs
  • Comparing multiple model responses to select the best one
  • Establishing consistent quality standards across evaluation teams
  • Debugging evaluation systems that show inconsistent results
  • Designing A/B tests for prompt or model changes
  • Creating rubrics for human or automated evaluation
  • Analyzing correlation between automated and human judgments

Core Concepts


The Evaluation Taxonomy


Select between two primary approaches based on whether ground truth exists:
Direct Scoring — Use when objective criteria exist (factual accuracy, instruction following, toxicity). A single LLM rates one response on a defined scale. Achieves moderate-to-high reliability for well-defined criteria. Watch for score calibration drift and inconsistent scale interpretation.
Pairwise Comparison — Use for subjective preferences (tone, style, persuasiveness). An LLM compares two responses and selects the better one. Achieves higher human-judge agreement than direct scoring for preference tasks (Zheng et al., 2023). Watch for position bias and length bias.

The Bias Landscape


Mitigate these systematic biases in every evaluation system:
Position Bias: First-position responses get preferential treatment. Mitigate by evaluating twice with swapped positions, then applying a majority vote or consistency check.
Length Bias: Longer responses score higher regardless of quality. Mitigate by explicitly prompting to ignore length and applying length-normalized scoring.
Self-Enhancement Bias: Models rate their own outputs higher. Mitigate by using different models for generation and evaluation.
Verbosity Bias: Excessive detail scores higher even when unnecessary. Mitigate with criteria-specific rubrics that penalize irrelevant detail.
Authority Bias: Confident tone scores higher regardless of accuracy. Mitigate by requiring evidence citation and adding a fact-checking layer.
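The length-bias mitigation above can be sketched as a post-hoc score adjustment. This is a minimal illustration, not a standard formula: the penalty scheme, the `LENGTH_PENALTY` weight, and the `target_words` default are assumptions to be tuned against human-labeled pairs.

```python
# Hypothetical length-normalized scoring: dock points only when a response
# overshoots a target length, so concise answers are never penalized.
LENGTH_PENALTY = 0.5  # illustrative weight; tune against human judgments

def length_normalized_score(raw_score: float, response: str,
                            target_words: int = 150) -> float:
    """Subtract a penalty proportional to how far the response exceeds target_words."""
    words = len(response.split())
    overshoot = max(0, words - target_words) / target_words
    return max(1.0, raw_score - LENGTH_PENALTY * overshoot)
```

A response at or under the target keeps its raw score; a response at twice the target loses half a point under these assumed parameters.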

Metric Selection Framework


Match metrics to the evaluation task structure:
| Task Type | Primary Metrics | Secondary Metrics |
|---|---|---|
| Binary classification (pass/fail) | Recall, Precision, F1 | Cohen's kappa |
| Ordinal scale (1-5 rating) | Spearman's rho, Kendall's tau | Cohen's kappa (weighted) |
| Pairwise preference | Agreement rate, Position consistency | Confidence calibration |
| Multi-label | Macro-F1, Micro-F1 | Per-label precision/recall |
Prioritize systematic disagreement patterns over absolute agreement rates because a judge that consistently disagrees with humans on specific criteria is more problematic than one with random noise.
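The metrics for the binary pass/fail row can be computed directly. A self-contained sketch using only the standard library; in practice `sklearn.metrics.cohen_kappa_score` and `scipy.stats.spearmanr` cover these cases.

```python
from collections import Counter

def agreement_rate(judge: list[str], human: list[str]) -> float:
    """Fraction of items where judge and human verdicts match."""
    assert len(judge) == len(human)
    return sum(j == h for j, h in zip(judge, human)) / len(judge)

def cohens_kappa(judge: list[str], human: list[str]) -> float:
    """Chance-corrected agreement for categorical labels (undefined when
    expected agreement is exactly 1)."""
    n = len(judge)
    p_o = agreement_rate(judge, human)                    # observed agreement
    jc, hc = Counter(judge), Counter(human)
    labels = set(judge) | set(human)
    p_e = sum((jc[l] / n) * (hc[l] / n) for l in labels)  # agreement expected by chance
    return (p_o - p_e) / (1 - p_e)

judge = ["pass", "pass", "fail", "pass", "fail", "fail"]
human = ["pass", "fail", "fail", "pass", "fail", "pass"]
print(agreement_rate(judge, human))  # 0.6666666666666666
```

Here the raw agreement is 4/6 but kappa is only 1/3, illustrating why chance-corrected metrics matter for balanced label distributions.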

Evaluation Approaches


Direct Scoring Implementation


Build direct scoring with three components: clear criteria, a calibrated scale, and structured output format.
Criteria Definition Pattern:
Criterion: [Name]
Description: [What this criterion measures]
Weight: [Relative importance, 0-1]
Scale Calibration — Choose scale granularity based on rubric detail:
  • 1-3: Binary with neutral option, lowest cognitive load
  • 1-5: Standard Likert, best balance of granularity and reliability
  • 1-10: Use only with detailed per-level rubrics because calibration is harder
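The criteria-definition pattern above maps naturally onto code. A minimal sketch, assuming weights are relative (they need not sum to 1); the `Criterion` class and field names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    name: str
    description: str
    weight: float  # relative importance, 0-1

def weighted_score(scores: dict[str, float], criteria: list[Criterion]) -> float:
    """Weight-averaged score across criteria; weights are normalized by their sum."""
    total_weight = sum(c.weight for c in criteria)
    return sum(scores[c.name] * c.weight for c in criteria) / total_weight

criteria = [
    Criterion("accuracy", "Factual correctness", weight=1.0),
    Criterion("clarity", "Ease of understanding", weight=0.5),
]
print(weighted_score({"accuracy": 5, "clarity": 3}, criteria))  # (5*1.0 + 3*0.5) / 1.5 ≈ 4.33
```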
Prompt Structure for Direct Scoring:
You are an expert evaluator assessing response quality.

Task
Evaluate the following response against each criterion.

Original Prompt
{prompt}

Response to Evaluate
{response}

Criteria
{for each criterion: name, description, weight}

Instructions
For each criterion:
  1. Find specific evidence in the response
  2. Score according to the rubric (1-{max} scale)
  3. Justify your score with evidence
  4. Suggest one specific improvement

Output Format
Respond with structured JSON containing scores, justifications, and summary.

Always require justification before the score in all scoring prompts because research shows this improves reliability by 15-25% compared to score-first approaches.
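The template above can be assembled programmatically. A minimal sketch: the `JUDGE_TEMPLATE` name, the `{criteria}` placeholder, and the bullet format for rendering criteria are illustrative choices, not a prescribed API.

```python
# Hypothetical prompt builder for the direct-scoring template shown above.
JUDGE_TEMPLATE = """You are an expert evaluator assessing response quality.

Task
Evaluate the following response against each criterion.

Original Prompt
{prompt}

Response to Evaluate
{response}

Criteria
{criteria}

Instructions
For each criterion:
  1. Find specific evidence in the response
  2. Score according to the rubric (1-{max_score} scale)
  3. Justify your score with evidence
  4. Suggest one specific improvement

Output Format
Respond with structured JSON containing scores, justifications, and summary."""

def build_direct_scoring_prompt(prompt: str, response: str,
                                criteria: list[dict], max_score: int = 5) -> str:
    criteria_text = "\n".join(
        f"- {c['name']} (weight {c['weight']}): {c['description']}" for c in criteria
    )
    return JUDGE_TEMPLATE.format(prompt=prompt, response=response,
                                 criteria=criteria_text, max_score=max_score)
```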

Pairwise Comparison Implementation


Apply position bias mitigation in every pairwise evaluation:
  1. First pass: Response A in first position, Response B in second
  2. Second pass: Response B in first position, Response A in second
  3. Consistency check: If passes disagree, return TIE with reduced confidence
  4. Final verdict: Consistent winner with averaged confidence
Prompt Structure for Pairwise Comparison:
You are an expert evaluator comparing two AI responses.

Critical Instructions
  • Do NOT prefer responses because they are longer
  • Do NOT prefer responses based on position (first vs second)
  • Focus ONLY on quality according to the specified criteria
  • Ties are acceptable when responses are genuinely equivalent

Original Prompt
{prompt}

Response A
{response_a}

Response B
{response_b}

Comparison Criteria
{criteria list}

Instructions
  1. Analyze each response independently first
  2. Compare them on each criterion
  3. Determine overall winner with confidence level

Output Format
JSON with per-criterion comparison, overall winner, confidence (0-1), and reasoning.

**Confidence Calibration** — Map confidence to position consistency:
- Both passes agree: confidence = average of individual confidences
- Passes disagree: confidence = 0.5, verdict = TIE
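The four-step mitigation above can be sketched as a harness around a judge callable. The interface is an assumption for illustration: `judge(prompt, first, second)` returns `(winner_position, confidence)` where `winner_position` is `"first"`, `"second"`, or `"tie"`.

```python
def pairwise_with_swap(judge, prompt: str, a: str, b: str) -> dict:
    pos1, conf1 = judge(prompt, a, b)  # pass 1: A in first position
    pos2, conf2 = judge(prompt, b, a)  # pass 2: B in first position
    # Map position-based verdicts back to response labels.
    label1 = {"first": "A", "second": "B", "tie": "TIE"}[pos1]
    label2 = {"first": "B", "second": "A", "tie": "TIE"}[pos2]
    if label1 == label2 and label1 != "TIE":
        return {"winner": label1, "confidence": (conf1 + conf2) / 2,
                "consistent": True}
    # Passes disagree (or both tie): reduced-confidence TIE.
    return {"winner": "TIE", "confidence": 0.5, "consistent": label1 == label2}

# A stub judge (for demonstration) that always prefers the shorter response:
stub = lambda prompt, first, second: (
    "first" if len(first) < len(second) else "second", 0.8)
print(pairwise_with_swap(stub, "Explain ML", "short", "a much longer answer"))
# {'winner': 'A', 'confidence': 0.8, 'consistent': True}
```

A judge stub that always prefers the first position would produce contradictory labels across the two passes and fall through to the TIE branch, which is exactly the failure this harness is designed to catch.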

Rubric Generation


Generate rubrics to reduce evaluation variance by 40-60% compared to open-ended scoring.
Include these rubric components:
  1. Level descriptions: Clear boundaries for each score level
  2. Characteristics: Observable features that define each level
  3. Examples: Representative text for each level (optional but valuable)
  4. Edge cases: Guidance for ambiguous situations
  5. Scoring guidelines: General principles for consistent application
Set strictness calibration for the use case:
  • Lenient: Lower passing bar, appropriate for encouraging iteration
  • Balanced: Typical production expectations
  • Strict: High standards for safety-critical or high-stakes evaluation
Adapt rubrics to the domain — use domain-specific terminology. A code readability rubric mentions variables, functions, and comments. A medical accuracy rubric references clinical terminology and evidence standards.
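A rubric can be represented as plain data with a lookup that resolves a score to its level description. The dict shape loosely mirrors the generated-rubric JSON shown in the examples; the field names and the closest-level-at-or-below rule are illustrative assumptions.

```python
RUBRIC = {
    "criterion": "Code Readability",
    "levels": [
        {"score": 1, "label": "Poor", "description": "Difficult to understand"},
        {"score": 3, "label": "Adequate", "description": "Understandable with effort"},
        {"score": 5, "label": "Excellent", "description": "Immediately clear"},
    ],
}

def level_for(rubric: dict, score: int) -> dict:
    """Return the closest defined level at or below the given score."""
    eligible = [lvl for lvl in rubric["levels"] if lvl["score"] <= score]
    return max(eligible, key=lambda lvl: lvl["score"])

print(level_for(RUBRIC, 4)["label"])  # Adequate
```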

Practical Guidance


Evaluation Pipeline Design


Build production evaluation systems with these layers: Criteria Loader (rubrics + weights) -> Primary Scorer (direct or pairwise) -> Bias Mitigation (position swap, etc.) -> Confidence Scoring (calibration) -> Output (scores + justifications + confidence). See Evaluation Pipeline Diagram for the full visual layout.
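The layered design above can be sketched as a sequence of stage callables threading a shared state dict. The stage bodies here are illustrative stubs standing in for real judge calls, not an implementation.

```python
# Each stage takes and returns the evaluation state; stubs mark where real
# logic (rubric loading, judge LLM calls, position swaps) would go.
def load_criteria(state: dict) -> dict:
    state["criteria"] = [{"name": "accuracy", "weight": 1.0}]  # stub rubric
    return state

def primary_scorer(state: dict) -> dict:
    state["score"] = 4.0  # stub; in practice, call the judge model here
    return state

def bias_mitigation(state: dict) -> dict:
    state["position_swapped"] = True  # stub; e.g. run the swapped second pass
    return state

def confidence_scoring(state: dict) -> dict:
    state["confidence"] = 0.8  # stub; calibrate to consistency and evidence
    return state

PIPELINE = [load_criteria, primary_scorer, bias_mitigation, confidence_scoring]

def run_pipeline(prompt: str, response: str) -> dict:
    state = {"prompt": prompt, "response": response}
    for stage in PIPELINE:
        state = stage(state)
    return state  # scores + justifications + confidence in a real system
```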

Decision Framework: Direct vs. Pairwise


Apply this decision tree:
Is there an objective ground truth?
+-- Yes -> Direct Scoring
|   Examples: factual accuracy, instruction following, format compliance
|
+-- No -> Is it a preference or quality judgment?
    +-- Yes -> Pairwise Comparison
    |   Examples: tone, style, persuasiveness, creativity
    |
    +-- No -> Consider reference-based evaluation
        Examples: summarization (compare to source), translation (compare to reference)
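The decision tree above reduces to a small routing function. The two boolean flags mirror the tree's questions; the returned strings are just the approach names used in this document.

```python
def choose_approach(has_ground_truth: bool, is_preference_judgment: bool) -> str:
    """Route an evaluation task to an approach per the decision tree."""
    if has_ground_truth:
        return "direct_scoring"        # e.g. factual accuracy, format compliance
    if is_preference_judgment:
        return "pairwise_comparison"   # e.g. tone, style, persuasiveness
    return "reference_based"           # e.g. summarization, translation

print(choose_approach(True, False))    # direct_scoring
print(choose_approach(False, True))    # pairwise_comparison
print(choose_approach(False, False))   # reference_based
```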

Scaling Evaluation


For high-volume evaluation, apply one of these strategies:
  1. Panel of LLMs (PoLL): Use multiple models as judges and aggregate votes to reduce individual model bias. More expensive but more reliable for high-stakes decisions.
  2. Hierarchical evaluation: Use a fast cheap model for screening and an expensive model for edge cases. Requires calibration of the screening threshold.
  3. Human-in-the-loop: Automate clear cases and route low-confidence decisions to human review. Design feedback loops to improve automated evaluation over time.
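The vote-aggregation step of the Panel-of-LLMs strategy can be sketched in a few lines: collect one verdict per judge model and take a majority, falling back to TIE when no label wins outright. The verdict lists here are illustrative inputs.

```python
from collections import Counter

def poll_verdict(verdicts: list[str]) -> str:
    """Majority vote across judge verdicts; TIE when the top label is not unique."""
    counts = Counter(verdicts).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return "TIE"
    return counts[0][0]

print(poll_verdict(["A", "A", "B"]))  # A
print(poll_verdict(["A", "B"]))       # TIE
```

Using an odd number of judges avoids most ties; weighting votes by each judge's historical agreement with humans is a common refinement.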

Examples


Example 1: Direct Scoring for Accuracy


Input:
Prompt: "What causes seasons on Earth?"
Response: "Seasons are caused by Earth's tilted axis. As Earth orbits the Sun,
different hemispheres receive more direct sunlight at different times of year."
Criterion: Factual Accuracy (weight: 1.0)
Scale: 1-5
Output:
```json
{
  "criterion": "Factual Accuracy",
  "score": 5,
  "evidence": [
    "Correctly identifies axial tilt as primary cause",
    "Correctly explains differential sunlight by hemisphere",
    "No factual errors present"
  ],
  "justification": "Response accurately explains the cause of seasons with correct scientific reasoning. Both the axial tilt and its effect on sunlight distribution are correctly described.",
  "improvement": "Could add the specific tilt angle (23.5 degrees) for completeness."
}
```

Example 2: Pairwise Comparison with Position Swap


Input:
Prompt: "Explain machine learning to a beginner"
Response A: [Technical explanation with jargon]
Response B: [Simple analogy-based explanation]
Criteria: ["clarity", "accessibility"]
First Pass (A first):
```json
{ "winner": "B", "confidence": 0.8 }
```
Second Pass (B first):
```json
{ "winner": "A", "confidence": 0.6 }
```
(Note: the raw verdict "A" refers to the first position, which held Response B in this pass)
Mapped Second Pass:
```json
{ "winner": "B", "confidence": 0.6 }
```
Final Result:
```json
{
  "winner": "B",
  "confidence": 0.7,
  "positionConsistency": {
    "consistent": true,
    "firstPassWinner": "B",
    "secondPassWinner": "B"
  }
}
```

Example 3: Rubric Generation


Input:
criterionName: "Code Readability"
criterionDescription: "How easy the code is to understand and maintain"
domain: "software engineering"
scale: "1-5"
strictness: "balanced"
Output (abbreviated):
```json
{
  "levels": [
    {
      "score": 1,
      "label": "Poor",
      "description": "Code is difficult to understand without significant effort",
      "characteristics": [
        "No meaningful variable or function names",
        "No comments or documentation",
        "Deeply nested or convoluted logic"
      ]
    },
    {
      "score": 3,
      "label": "Adequate",
      "description": "Code is understandable with some effort",
      "characteristics": [
        "Most variables have meaningful names",
        "Basic comments present for complex sections",
        "Logic is followable but could be cleaner"
      ]
    },
    {
      "score": 5,
      "label": "Excellent",
      "description": "Code is immediately clear and maintainable",
      "characteristics": [
        "All names are descriptive and consistent",
        "Comprehensive documentation",
        "Clean, modular structure"
      ]
    }
  ],
  "edgeCases": [
    {
      "situation": "Code is well-structured but uses domain-specific abbreviations",
      "guidance": "Score based on readability for domain experts, not general audience"
    }
  ]
}
```

Guidelines


  1. Always require justification before scores - Chain-of-thought prompting improves reliability by 15-25%
  2. Always swap positions in pairwise comparison - Single-pass comparison is corrupted by position bias
  3. Match scale granularity to rubric specificity - Don't use 1-10 without detailed level descriptions
  4. Separate objective and subjective criteria - Use direct scoring for objective, pairwise for subjective
  5. Include confidence scores - Calibrate to position consistency and evidence strength
  6. Define edge cases explicitly - Ambiguous situations cause the most evaluation variance
  7. Use domain-specific rubrics - Generic rubrics produce generic (less useful) evaluations
  8. Validate against human judgments - Automated evaluation is only valuable if it correlates with human assessment
  9. Monitor for systematic bias - Track disagreement patterns by criterion, response type, model
  10. Design for iteration - Evaluation systems improve with feedback loops

Gotchas


  1. Scoring without justification: Scores lack grounding and are difficult to debug. Always require evidence-based justification before the score.
  2. Single-pass pairwise comparison: Position bias corrupts results when positions are not swapped. Always evaluate twice with swapped positions and check consistency.
  3. Overloaded criteria: Criteria that measure multiple things at once produce unreliable scores. Enforce one criterion = one measurable aspect.
  4. Missing edge case guidance: Evaluators handle ambiguous cases inconsistently without explicit instructions. Include edge cases in rubrics with clear resolution rules.
  5. Ignoring confidence calibration: High-confidence wrong judgments are worse than low-confidence ones. Calibrate confidence to position consistency and evidence strength.
  6. Rubric drift: Rubrics become miscalibrated as quality standards evolve or model capabilities improve. Schedule periodic rubric reviews and re-anchor score levels against fresh human-annotated examples.
  7. Evaluation prompt sensitivity: Minor wording changes in evaluation prompts (e.g., reordering instructions, changing phrasing) can cause 10-20% score swings. Version-control evaluation prompts and run regression tests before deploying prompt changes.
  8. Uncontrolled length bias: Longer responses systematically score higher even when conciseness is preferred. Add explicit length-neutrality instructions to evaluation prompts and validate with length-controlled test pairs.
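Gotcha 7 recommends version-controlling evaluation prompts. A minimal sketch of one way to catch silent prompt drift: fingerprint each template and fail fast when it changes without the recorded fingerprint being updated. The registry shape and function names are illustrative, not a specific tool's API.

```python
import hashlib

def prompt_fingerprint(template: str) -> str:
    """Short, stable hash of a prompt template."""
    return hashlib.sha256(template.encode("utf-8")).hexdigest()[:12]

REGISTERED: dict[str, str] = {}  # template name -> approved fingerprint

def check_prompt(name: str, template: str) -> None:
    """Register a prompt, or raise if it changed since last approval."""
    fp = prompt_fingerprint(template)
    approved = REGISTERED.get(name)
    if approved is not None and approved != fp:
        raise RuntimeError(
            f"Prompt '{name}' changed (fingerprint {fp}); rerun regression tests."
        )
    REGISTERED[name] = fp
```

In practice the approved fingerprints would live in version control next to the regression-test results that blessed them.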

Integration


This skill integrates with:
  • context-fundamentals - Evaluation prompts require effective context structure
  • tool-design - Evaluation tools need proper schemas and error handling
  • context-optimization - Evaluation prompts can be optimized for token efficiency
  • evaluation (foundational) - This skill extends the foundational evaluation concepts

References


Internal references:
  • LLM-as-Judge Implementation Patterns - Read when: building an evaluation pipeline from scratch or integrating LLM judges into CI/CD
  • Bias Mitigation Techniques - Read when: evaluation results show inconsistent or suspicious scoring patterns
  • Metric Selection Guide - Read when: choosing statistical metrics to validate evaluation reliability
  • Evaluation Pipeline Diagram - Read when: designing the architecture of a multi-stage evaluation system
Related skills in this collection:
  • evaluation - Foundational evaluation concepts
  • context-fundamentals - Context structure for evaluation prompts
  • tool-design - Building evaluation tools


Skill Metadata


Created: 2025-12-24 | Last Updated: 2026-03-17 | Author: Agent Skills for Context Engineering Contributors | Version: 2.0.0