Evaluation Methods for Agent Systems


Evaluation of agent systems requires different approaches than traditional software or even standard language model applications. Agents make dynamic decisions, are non-deterministic between runs, and often lack single correct answers. Effective evaluation must account for these characteristics while providing actionable feedback. A robust evaluation framework enables continuous improvement, catches regressions, and validates that context engineering choices achieve intended effects.

When to Activate


Activate this skill when:
  • Testing agent performance systematically
  • Validating context engineering choices
  • Measuring improvements over time
  • Catching regressions before deployment
  • Building quality gates for agent pipelines
  • Comparing different agent configurations
  • Evaluating production systems continuously

Core Concepts


Agent evaluation requires outcome-focused approaches that account for non-determinism and multiple valid paths. Multi-dimensional rubrics capture distinct quality aspects: factual accuracy, completeness, citation accuracy, source quality, and tool efficiency. LLM-as-judge evaluation scales to large test sets, while human evaluation catches edge cases.
The key insight is that agents may find alternative paths to the same goal; evaluation should judge whether they achieve the right outcomes while following reasonable processes.
Performance Drivers: The 95% Finding

Research on the BrowseComp evaluation (which tests browsing agents' ability to locate hard-to-find information) found that three factors explain 95% of performance variance:

  Factor                 Variance explained   Implication
  Token usage            80%                  More tokens = better performance
  Number of tool calls   ~10%                 More exploration helps
  Model choice           ~5%                  Better models multiply efficiency
This finding has significant implications for evaluation design:
  • Token budgets matter: Evaluate agents with realistic token budgets, not unlimited resources
  • Model upgrades beat token increases: Upgrading to Claude Sonnet 4.5 or GPT-5.2 provides larger gains than doubling token budgets on previous versions
  • Multi-agent validation: The finding validates architectures that distribute work across agents with separate context windows
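One way to act on the first implication is to score runs against a fixed token budget rather than unlimited resources. A minimal sketch; the linear penalty on relative overrun is an illustrative choice, not a scheme from the research above:

```python
def efficiency_adjusted_score(quality: float, tokens_used: int, token_budget: int) -> float:
    """Scale a rubric score in [0.0, 1.0] down when a run exceeds its token budget.

    Comparing configurations at a fixed budget keeps "more tokens = better
    performance" from masking genuine quality differences.
    """
    if tokens_used <= token_budget:
        return quality
    overrun = (tokens_used - token_budget) / token_budget
    return max(0.0, quality * (1.0 - overrun))

# Same raw quality, but the second run used 1.5x a 10k-token budget.
print(efficiency_adjusted_score(0.9, 8_000, 10_000))   # 0.9
print(efficiency_adjusted_score(0.9, 15_000, 10_000))  # 0.45
```

With this adjustment, two configurations are compared at comparable cost, so a token-hungry configuration only wins if its quality gain outweighs the overrun.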

Detailed Topics


Evaluation Challenges


Non-Determinism and Multiple Valid Paths

Agents may take completely different valid paths to reach the same goal. One agent might search three sources while another searches ten. They might use different tools to find the same answer. Traditional evaluations that check for specific steps fail in this context.
The solution is outcome-focused evaluation that judges whether agents achieve the right outcomes while following reasonable processes.

Context-Dependent Failures

Agent failures often depend on context in subtle ways. An agent might succeed on simple queries but fail on complex ones. It might work well with one tool set but fail with another. Failures may emerge only after extended interaction, when context accumulates.
Evaluation must cover a range of complexity levels and test extended interactions, not just isolated queries.

Composite Quality Dimensions

Agent quality is not a single dimension. It includes factual accuracy, completeness, coherence, tool efficiency, and process quality. An agent might score high on accuracy but low on efficiency, or vice versa.
Evaluation rubrics must capture multiple dimensions, with weighting appropriate to the use case.

Evaluation Rubric Design


Multi-Dimensional Rubric

Effective rubrics cover the key dimensions with descriptive levels:
  • Factual accuracy: claims match ground truth (excellent to failed)
  • Completeness: output covers the requested aspects (excellent to failed)
  • Citation accuracy: citations match the claimed sources (excellent to failed)
  • Source quality: uses appropriate primary sources (excellent to failed)
  • Tool efficiency: uses the right tools a reasonable number of times (excellent to failed)

Rubric Scoring

Convert dimension assessments to numeric scores (0.0 to 1.0) with appropriate weighting. Calculate a weighted overall score. Determine the passing threshold based on use-case requirements.
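The weighting and threshold logic can be sketched as follows; the dimension weights and the 0.7 threshold are illustrative values, not prescribed ones:

```python
# Illustrative weights: they sum to 1.0, and each dimension is scored 0.0-1.0.
RUBRIC_WEIGHTS = {
    "factual_accuracy": 0.35,
    "completeness": 0.25,
    "citation_accuracy": 0.15,
    "source_quality": 0.15,
    "tool_efficiency": 0.10,
}

def overall_score(scores: dict, weights: dict = RUBRIC_WEIGHTS) -> float:
    """Weighted average of per-dimension scores."""
    return sum(scores[dim] * w for dim, w in weights.items())

def passes(scores: dict, threshold: float = 0.7) -> bool:
    """Apply the use case's passing threshold to the weighted overall score."""
    return overall_score(scores) >= threshold

example = {
    "factual_accuracy": 0.9,
    "completeness": 0.8,
    "citation_accuracy": 0.7,
    "source_quality": 0.6,
    "tool_efficiency": 1.0,
}
print(round(overall_score(example), 2))  # 0.81
```

Shifting the weights changes which failures matter: a citation-heavy use case would move weight from tool efficiency to citation accuracy, for example.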

Evaluation Methodologies


LLM-as-Judge

LLM-based evaluation scales to large test sets and provides consistent judgments. The key is designing evaluation prompts that capture the dimensions of interest.
Provide a clear task description, the agent output, the ground truth (if available), and an evaluation scale with level descriptions, and request a structured judgment.
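Such a prompt can be assembled from those pieces. A sketch; the template wording, dimension names, and level scale are illustrative, and the resulting string would be sent to whatever LLM client is in use:

```python
import json

JUDGE_TEMPLATE = """You are evaluating an agent's output.

Task: {task}

Agent output:
{output}

Ground truth (may be empty):
{ground_truth}

Rate each dimension on this scale: excellent, good, adequate, poor, failed.
Dimensions: factual_accuracy, completeness, citation_accuracy.

Respond with JSON only, for example:
{{"factual_accuracy": "good", "completeness": "excellent", "citation_accuracy": "adequate"}}
"""

def build_judge_prompt(task: str, output: str, ground_truth: str = "") -> str:
    """Fill the template; doubled braces in the example JSON survive .format()."""
    return JUDGE_TEMPLATE.format(task=task, output=output, ground_truth=ground_truth)

def parse_judgment(raw: str) -> dict:
    """Parse the judge's reply; raises ValueError on malformed JSON."""
    return json.loads(raw)
```

Requesting JSON with an explicit example makes the judgment machine-readable, so it can feed the rubric scoring above without manual transcription.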
Human Evaluation

Human evaluation catches what automation misses: hallucinated answers on unusual queries, system-level failures, and subtle biases.
Effective human evaluation covers edge cases, samples systematically, tracks patterns, and provides contextual understanding.

End-State Evaluation

For agents that mutate persistent state, end-state evaluation focuses on whether the final state matches expectations rather than on how the agent got there.
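A minimal end-state check can diff the final state against expectations, key by key; the ticket fields below are hypothetical:

```python
def end_state_matches(final_state: dict, expected: dict):
    """Return (ok, mismatched_keys), checking only the keys named in `expected`.

    Keys the agent created incidentally along the way are ignored, so the
    check judges the outcome rather than the path taken.
    """
    mismatches = [
        key for key, want in expected.items()
        if final_state.get(key) != want
    ]
    return (not mismatches, mismatches)

# An agent asked to close a ticket and notify its owner:
final = {"ticket_status": "closed", "owner_notified": True, "scratch_notes": "..."}
ok, diffs = end_state_matches(final, {"ticket_status": "closed", "owner_notified": True})
print(ok, diffs)  # True []
```

Checking only the expected keys is the design choice that makes this outcome-focused: two agents that reach the same final state pass equally, whatever intermediate state they touched.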

Test Set Design


Sample Selection

Start with small samples during development. Early in agent development, changes have dramatic impacts because there is abundant low-hanging fruit; small test sets reveal large effects.
Sample from real usage patterns. Add known edge cases. Ensure coverage across complexity levels.

Complexity Stratification

Test sets should span complexity levels: simple (single tool call), medium (multiple tool calls), complex (many tool calls, significant ambiguity), and very complex (extended interaction, deep reasoning).
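A small helper can verify that a test set actually spans these levels before it is used; the level names mirror the stratification above:

```python
from collections import Counter

COMPLEXITY_LEVELS = ("simple", "medium", "complex", "very_complex")

def complexity_coverage(test_set):
    """Count test cases per complexity level; absent levels count as 0."""
    counts = Counter(case["complexity"] for case in test_set)
    return {level: counts.get(level, 0) for level in COMPLEXITY_LEVELS}

def is_stratified(test_set):
    """True only when every complexity level has at least one case."""
    return all(n > 0 for n in complexity_coverage(test_set).values())

cases = [{"complexity": "simple"}, {"complexity": "medium"},
         {"complexity": "complex"}, {"complexity": "very_complex"}]
print(is_stratified(cases))      # True
print(is_stratified(cases[:2]))  # False
```

Running such a check in CI keeps a growing test set from silently drifting toward only simple cases.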

Context Engineering Evaluation


Testing Context Strategies

Context engineering choices should be validated through systematic evaluation. Run agents with different context strategies on the same test set, then compare quality scores, token usage, and efficiency metrics.

Degradation Testing

Test how context degradation affects performance by running agents at different context sizes. Identify the performance cliffs where context becomes problematic, and establish safe operating limits.
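The cliff search can be sketched as follows; `run_eval` stands in for whatever harness scores the agent at a given context size, and the 0.10 drop threshold is an illustrative choice:

```python
def find_performance_cliff(run_eval, context_sizes, max_drop=0.10):
    """Return the first context size whose score drops more than `max_drop`
    below the previous size's score, or None if no cliff is found.

    run_eval: callable mapping a context size (tokens) to a quality score.
    """
    prev = None
    for size in sorted(context_sizes):
        score = run_eval(size)
        if prev is not None and prev - score > max_drop:
            return size
        prev = score
    return None

# Simulated scores: quality holds until ~100k tokens, then falls off.
simulated = {10_000: 0.90, 50_000: 0.88, 100_000: 0.85, 150_000: 0.60}
print(find_performance_cliff(simulated.get, simulated))  # 150000
```

The safe operating limit would then be set below the detected cliff, with some margin for run-to-run variance.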

Continuous Evaluation


Evaluation Pipeline

Build evaluation pipelines that run automatically on agent changes. Track results over time, and compare versions to identify improvements or regressions.

Production Monitoring

Track evaluation metrics in production by sampling interactions and evaluating a random subset. Set alerts for quality drops, and maintain dashboards for trend analysis.
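Sampling and alerting might be sketched like this; the 5% rate, 0.7 threshold, and minimum sample size are illustrative operating values:

```python
import random

def sample_for_evaluation(interactions, rate=0.05, seed=None):
    """Randomly select a fraction of production interactions for evaluation.

    A fixed seed makes the sample reproducible for a given traffic log.
    """
    rng = random.Random(seed)
    return [i for i in interactions if rng.random() < rate]

def quality_alert(scores, threshold=0.7, min_samples=20):
    """Alert when the mean of sampled scores drops below the threshold.

    Below min_samples the evidence is too thin to alert on.
    """
    if len(scores) < min_samples:
        return False
    return sum(scores) / len(scores) < threshold
```

The minimum-sample guard is what keeps a single bad interaction in a quiet hour from paging anyone; the trade-off is slower detection on low-traffic systems.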

Practical Guidance


Building Evaluation Frameworks


  1. Define quality dimensions relevant to your use case
  2. Create rubrics with clear, actionable level descriptions
  3. Build test sets from real usage patterns and edge cases
  4. Implement automated evaluation pipelines
  5. Establish baseline metrics before making changes
  6. Run evaluations on all significant changes
  7. Track metrics over time for trend analysis
  8. Supplement automated evaluation with human review

Avoiding Evaluation Pitfalls


  • Overfitting to specific paths: evaluate outcomes, not specific steps.
  • Ignoring edge cases: include diverse test scenarios.
  • Single-metric obsession: use multi-dimensional rubrics.
  • Neglecting context effects: test with realistic context sizes.
  • Skipping human evaluation: automated evaluation misses subtle issues.

Examples


Example 1: Simple Evaluation

```python
def evaluate_agent_response(response, expected):
    rubric = load_rubric()
    scores = {}
    for dimension, config in rubric.items():
        scores[dimension] = assess_dimension(response, expected, dimension)
    # Collect each dimension's weight from the rubric itself; after the loop,
    # `config` would only hold the last dimension's entry.
    weights = {dimension: config["weight"] for dimension, config in rubric.items()}
    overall = weighted_average(scores, weights)
    return {"passed": overall >= 0.7, "scores": scores}
```
Example 2: Test Set Structure

Test sets should span multiple complexity levels to ensure comprehensive evaluation:

```python
test_set = [
    {
        "name": "simple_lookup",
        "input": "What is the capital of France?",
        "expected": {"type": "fact", "answer": "Paris"},
        "complexity": "simple",
        "description": "Single tool call, factual lookup"
    },
    {
        "name": "medium_query",
        "input": "Compare the revenue of Apple and Microsoft last quarter",
        "complexity": "medium",
        "description": "Multiple tool calls, comparison logic"
    },
    {
        "name": "multi_step_reasoning",
        "input": "Analyze sales data from Q1-Q4 and create a summary report with trends",
        "complexity": "complex",
        "description": "Many tool calls, aggregation, analysis"
    },
    {
        "name": "research_synthesis",
        "input": "Research emerging AI technologies, evaluate their potential impact, and recommend adoption strategy",
        "complexity": "very_complex",
        "description": "Extended interaction, deep reasoning, synthesis"
    }
]
```

Guidelines


  1. Use multi-dimensional rubrics, not single metrics
  2. Evaluate outcomes, not specific execution paths
  3. Cover complexity levels from simple to complex
  4. Test with realistic context sizes and histories
  5. Run evaluations continuously, not just before release
  6. Supplement LLM evaluation with human review
  7. Track metrics over time for trend detection
  8. Set clear pass/fail thresholds based on use case

Integration


This skill connects to all other skills as a cross-cutting concern:
  • context-fundamentals - Evaluating context usage
  • context-degradation - Detecting degradation
  • context-optimization - Measuring optimization effectiveness
  • multi-agent-patterns - Evaluating coordination
  • tool-design - Evaluating tool effectiveness
  • memory-systems - Evaluating memory quality

References


Internal reference:
  • Metrics Reference - Detailed evaluation metrics and implementation
Internal skills:
  • All other skills connect to evaluation for quality measurement
External resources:
  • LLM evaluation benchmarks
  • Agent evaluation research papers
  • Production monitoring practices


Skill Metadata


Created: 2025-12-20
Last Updated: 2025-12-20
Author: Agent Skills for Context Engineering Contributors
Version: 1.0.0