Evaluation Methods for Agent Systems

Evaluate agent systems differently from traditional software: agents make dynamic decisions, behave non-deterministically between runs, and often face tasks with no single correct answer. Build evaluation frameworks that account for these characteristics, provide actionable feedback, catch regressions, and validate that context engineering choices achieve their intended effects.

When to Activate

Activate this skill when:
  • Testing agent performance systematically
  • Validating context engineering choices
  • Measuring improvements over time
  • Catching regressions before deployment
  • Building quality gates for agent pipelines
  • Comparing different agent configurations
  • Evaluating production systems continuously

Core Concepts

Focus evaluation on outcomes rather than execution paths, because agents may find alternative valid routes to goals. Judge whether the agent achieves the right outcome via a reasonable process, not whether it followed a specific sequence of steps.
Use multi-dimensional rubrics instead of single scores because one number hides critical failures in specific dimensions. Capture factual accuracy, completeness, citation accuracy, source quality, and tool efficiency as separate dimensions, then weight them for the use case.
Deploy LLM-as-judge for scalable evaluation across large test sets while supplementing with human review to catch edge cases, hallucinations, and subtle biases that automated evaluation misses.
Performance Drivers: The 95% Finding
Apply the BrowseComp research finding when designing evaluation budgets: three factors explain 95% of browsing agent performance variance.
Factor                 Variance Explained   Implication
Token usage            80%                  More tokens = better performance
Number of tool calls   ~10%                 More exploration helps
Model choice           ~5%                  Better models multiply efficiency
Act on these implications when designing evaluations:
  • Set realistic token budgets: Evaluate agents with production-realistic token limits, not unlimited resources, because token usage drives 80% of variance (see the sketch after this list).
  • Prioritize model upgrades over token increases: Upgrading model versions provides larger gains than doubling token budgets on previous versions because better models use tokens more efficiently.
  • Validate multi-agent architectures: The finding supports distributing work across agents with separate context windows, so evaluate multi-agent setups against single-agent baselines.
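To make the first implication concrete, here is a minimal sketch of a budget-capped evaluation run, assuming a hypothetical run_agent harness that enforces a token cap and reports usage; the 50,000-token budget is an illustrative figure, not from the research:
python
# Sketch: evaluate under a production-realistic token budget.
# `run_agent` and the 50k cap are hypothetical placeholders.

def evaluate_with_budget(run_agent, test_case, max_tokens=50_000):
    """Run one test case under a hard token cap and record resource usage."""
    result = run_agent(test_case["input"], max_tokens=max_tokens)
    return {
        "name": test_case["name"],
        "output": result["output"],
        "tokens_used": result["tokens_used"],
        "tool_calls": result["tool_calls"],
        "within_budget": result["tokens_used"] <= max_tokens,
    }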

Detailed Topics

Evaluation Challenges

Handle Non-Determinism and Multiple Valid Paths
Design evaluations that tolerate path variation because agents may take completely different valid paths to reach goals. One agent might search three sources while another searches ten; both may produce correct answers. Avoid checking for specific steps. Instead, define outcome criteria (correctness, completeness, quality) and score against those, treating the execution path as informational rather than evaluative.
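A sketch of outcome-based scoring along these lines; the criteria predicates are illustrative stand-ins for a real rubric:
python
# Sketch: score the outcome; log the path as information only.
# The criteria below are illustrative stand-ins for a real rubric.

def score_outcome(answer, trace, expected):
    criteria = {
        "correctness": expected["answer"].lower() in answer.lower(),
        "completeness": all(t.lower() in answer.lower()
                            for t in expected.get("topics", [])),
    }
    return {
        "passed": all(criteria.values()),
        "criteria": criteria,
        # Path data is recorded for debugging, never scored.
        "path_info": {"steps": len(trace)},
    }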
Test Context-Dependent Failures
Evaluate across a range of complexity levels and interaction lengths because agent failures often depend on context in subtle ways. An agent might succeed on simple queries but fail on complex ones, work well with one tool set but fail with another, or degrade after extended interaction as context accumulates. Include simple, medium, complex, and very complex test cases to surface these patterns.
Score Composite Quality Dimensions Separately
Break agent quality into separate dimensions (factual accuracy, completeness, coherence, tool efficiency, process quality) and score each independently because an agent might score high on accuracy but low on efficiency, or vice versa. Then compute weighted aggregates tuned to use-case priorities. This approach reveals which dimensions need improvement rather than averaging away the signal.

Evaluation Rubric Design

Build Multi-Dimensional Rubrics
Define rubrics covering key dimensions with descriptive levels from excellent to failed. Include these core dimensions and adapt weights per use case (a data sketch follows the list):
  • Factual accuracy: Claims match ground truth (weight heavily for knowledge tasks)
  • Completeness: Output covers requested aspects (weight heavily for research tasks)
  • Citation accuracy: Citations match claimed sources (weight for trust-sensitive contexts)
  • Source quality: Uses appropriate primary sources (weight for authoritative outputs)
  • Tool efficiency: Uses right tools a reasonable number of times (weight for cost-sensitive systems)
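A sketch of such a rubric as plain data; the weights and level descriptions are illustrative and should be retuned per use case:
python
# Sketch: rubric dimensions from the list above; weights are illustrative.
RUBRIC = {
    "factual_accuracy":  {"weight": 0.35,
                          "levels": {1.0: "All claims match ground truth",
                                     0.5: "Minor inaccuracies",
                                     0.0: "Major factual errors"}},
    "completeness":      {"weight": 0.25,
                          "levels": {1.0: "Covers all requested aspects",
                                     0.5: "Covers most aspects",
                                     0.0: "Misses key aspects"}},
    "citation_accuracy": {"weight": 0.15, "levels": {}},
    "source_quality":    {"weight": 0.15, "levels": {}},
    "tool_efficiency":   {"weight": 0.10, "levels": {}},
}

# Weights should sum to 1.0 so the aggregate stays on a 0-1 scale.
assert abs(sum(d["weight"] for d in RUBRIC.values()) - 1.0) < 1e-9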
Convert Rubrics to Numeric Scores
Map dimension assessments to numeric scores (0.0 to 1.0), apply per-dimension weights, and calculate weighted overall scores. Set passing thresholds based on use-case requirements, typically 0.7 for general use and 0.9 for high-stakes applications. Store individual dimension scores alongside the aggregate because the breakdown drives targeted improvement.
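A sketch of the conversion, with a per-dimension floor added so one collapsed dimension fails the eval even when the weighted average passes; the 0.4 floor is an illustrative choice, while 0.7 is the general-use threshold named above:
python
# Sketch: weighted aggregation with a per-dimension floor.
def aggregate(scores, weights, passing=0.7, dimension_floor=0.4):
    """scores, weights: {dimension: float in [0.0, 1.0]}."""
    overall = sum(scores[d] * weights[d] for d in weights)
    # Keep the breakdown: a high average can hide a collapsed dimension.
    floor_ok = all(s >= dimension_floor for s in scores.values())
    return {"overall": round(overall, 3),
            "passed": overall >= passing and floor_ok,
            "scores": scores}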

Evaluation Methodologies

Use LLM-as-Judge for Scale
Build LLM-based evaluation prompts that include: clear task description, the agent output under test, ground truth when available, an evaluation scale with explicit level descriptions, and a request for structured judgment with reasoning. LLM judges provide consistent, scalable evaluation across large test sets. Use a different model family than the agent being evaluated to avoid self-enhancement bias.
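A sketch of assembling such a judge prompt; the scale wording and JSON shape are illustrative, and the resulting string would go to a judge model from a different family than the agent under test:
python
# Sketch: build a judge prompt from the components listed above.
def build_judge_prompt(task, agent_output, ground_truth=None):
    parts = [f"Task: {task}", f"Agent output:\n{agent_output}"]
    if ground_truth is not None:
        parts.append(f"Ground truth:\n{ground_truth}")
    parts.append(
        "Rate the output on a 1-5 scale:\n"
        "5 = fully correct and complete\n"
        "3 = partially correct with notable gaps\n"
        "1 = incorrect or off-task\n"
        'Respond as JSON: {"score": <1-5>, "reasoning": "<why>"}'
    )
    return "\n\n".join(parts)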
Supplement with Human Evaluation
Route edge cases, unusual queries, and a random sample of production traffic to human reviewers because humans notice hallucinated answers, system failures, and subtle biases that automated evaluation misses. Track patterns across human reviews to identify systematic issues and feed findings back into automated evaluation criteria.
Apply End-State Evaluation for Stateful Agents
For agents that mutate persistent state (files, databases, configurations), evaluate whether the final state matches expectations rather than how the agent got there. Define expected end-state assertions and verify them programmatically after each test run.
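A minimal sketch of such end-state assertions for a file-writing agent; the path and expected fields are hypothetical:
python
# Sketch: assert on the final state, not the steps that produced it.
import json
from pathlib import Path

def check_end_state(workdir):
    report = Path(workdir) / "report.json"   # hypothetical expected artifact
    assert report.exists(), "agent never wrote report.json"
    data = json.loads(report.read_text())
    assert data.get("status") == "complete", f"status was {data.get('status')}"
    assert data.get("findings"), "report contains no findings"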

Test Set Design

Select Representative Samples
Start with small samples (20-30 cases) during early development when changes have dramatic impacts and low-hanging fruit is abundant. Scale to 50+ cases for reliable signal as the system matures. Sample from real usage patterns, add known edge cases, and ensure coverage across complexity levels.
Stratify by Complexity
Structure test sets across complexity levels to prevent easy examples from inflating scores:
  • Simple: single tool call, factual lookup
  • Medium: multiple tool calls, comparison logic
  • Complex: many tool calls, significant ambiguity
  • Very complex: extended interaction, deep reasoning, synthesis
Report scores per stratum alongside overall scores to reveal where the agent actually struggles.
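A sketch of per-stratum reporting over a list of results:
python
# Sketch: pass rates per complexity stratum plus the overall rate.
from collections import defaultdict

def report_by_stratum(results):
    """results: list of {"complexity": str, "passed": bool} records."""
    by_stratum = defaultdict(list)
    for r in results:
        by_stratum[r["complexity"]].append(r["passed"])
    report = {s: sum(p) / len(p) for s, p in by_stratum.items()}
    report["overall"] = sum(r["passed"] for r in results) / len(results)
    return report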

Context Engineering Evaluation

Validate Context Strategies Systematically
Run agents with different context strategies on the same test set and compare quality scores, token usage, and efficiency metrics. This isolates the effect of context engineering from other variables and prevents anecdote-driven decisions.
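A sketch of such a comparison, assuming a hypothetical run_eval(case, strategy) harness that returns quality and token metrics per run:
python
# Sketch: same test set, different context strategies, paired metrics.
def compare_strategies(run_eval, test_set, strategies):
    results = {}
    for name, strategy in strategies.items():
        runs = [run_eval(case, strategy) for case in test_set]
        results[name] = {
            "mean_quality": sum(r["quality"] for r in runs) / len(runs),
            "mean_tokens": sum(r["tokens"] for r in runs) / len(runs),
        }
    return results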
Run Degradation Tests
Test how context degradation affects performance by running agents at different context sizes. Identify performance cliffs where context becomes problematic and establish safe operating limits. Feed these limits back into context management strategies.
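A sketch of a degradation sweep; the context sizes and the 10%-drop cliff rule are illustrative choices:
python
# Sketch: sweep context sizes and flag the first sharp score drop.
def find_cliff(score_at_size, sizes=(8_000, 32_000, 64_000, 128_000)):
    scores = {size: score_at_size(size) for size in sizes}
    previous = None
    for size in sizes:
        if previous is not None and scores[size] < previous * 0.9:
            return {"scores": scores, "cliff_at": size}
        previous = scores[size]
    return {"scores": scores, "cliff_at": None}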

Continuous Evaluation

Build Automated Evaluation Pipelines
Integrate evaluation into the development workflow so evaluations run automatically on agent changes. Track results over time, compare versions, and block deployments that regress on key metrics.
Monitor Production Quality
Sample production interactions and evaluate them continuously. Set alerts for quality drops below warning (0.85 pass rate) and critical (0.70 pass rate) thresholds. Maintain dashboards showing trend analysis over time windows to detect gradual degradation.
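A sketch of classifying a sampled window's pass rate against these thresholds; the alerting hook itself is left abstract:
python
# Sketch: thresholds from the text; wiring to an alerting system is omitted.
WARNING_THRESHOLD = 0.85
CRITICAL_THRESHOLD = 0.70

def classify_pass_rate(passed, total):
    rate = passed / total
    if rate < CRITICAL_THRESHOLD:
        return "critical", rate
    if rate < WARNING_THRESHOLD:
        return "warning", rate
    return "ok", rate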

Practical Guidance

Building Evaluation Frameworks

Follow this sequence to build an evaluation framework, because skipping early steps leads to measurements that do not reflect real quality:
  1. Define quality dimensions relevant to the use case before writing any evaluation code, because dimensions chosen later tend to reflect what is easy to measure rather than what matters.
  2. Create rubrics with clear, descriptive level definitions so evaluators (human or LLM) produce consistent scores.
  3. Build test sets from real usage patterns and edge cases, stratified by complexity, with at least 50 cases for reliable signal.
  4. Implement automated evaluation pipelines that run on every significant change.
  5. Establish baseline metrics before making changes so improvements can be measured against a known reference.
  6. Run evaluations on all significant changes and compare against the baseline, as sketched after this list.
  7. Track metrics over time for trend analysis because gradual degradation is harder to notice than sudden drops.
  8. Supplement automated evaluation with human review on a regular cadence.
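A sketch of steps 5-6 as a regression gate; the 2% tolerance is an illustrative choice:
python
# Sketch: block deployment when any metric regresses past the baseline.
def regression_gate(current, baseline, tolerance=0.02):
    """current, baseline: {metric_name: score in [0.0, 1.0]}."""
    regressions = {
        metric: {"baseline": baseline[metric], "current": score}
        for metric, score in current.items()
        if metric in baseline and score < baseline[metric] - tolerance
    }
    return {"deploy": not regressions, "regressions": regressions}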

Avoiding Evaluation Pitfalls

Guard against these common failures that undermine evaluation reliability:
  • Overfitting to specific paths: Evaluate outcomes, not specific steps, because agents find novel valid paths.
  • Ignoring edge cases: Include diverse test scenarios covering the full complexity spectrum.
  • Single-metric obsession: Use multi-dimensional rubrics because a single score hides dimension-specific failures.
  • Neglecting context effects: Test with realistic context sizes and histories rather than clean-room conditions.
  • Skipping human evaluation: Automated evaluation misses subtle issues that humans catch reliably.

Examples

Example 1: Simple Evaluation
python
def evaluate_agent_response(response, expected):
    # Rubric shape: {dimension: {"weight": float, ...}, ...}
    rubric = load_rubric()

    # Score each dimension independently so per-dimension failures stay visible.
    scores = {dimension: assess_dimension(response, expected, dimension)
              for dimension in rubric}

    # Aggregate using the per-dimension weights defined in the rubric.
    weights = {dimension: cfg["weight"] for dimension, cfg in rubric.items()}
    overall = weighted_average(scores, weights)

    return {"passed": overall >= 0.7, "overall": overall, "scores": scores}
Example 2: Test Set Structure
Test sets should span multiple complexity levels to ensure comprehensive evaluation:
python
test_set = [
    {
        "name": "simple_lookup",
        "input": "What is the capital of France?",
        "expected": {"type": "fact", "answer": "Paris"},
        "complexity": "simple",
        "description": "Single tool call, factual lookup"
    },
    {
        "name": "medium_query",
        "input": "Compare the revenue of Apple and Microsoft last quarter",
        "complexity": "medium",
        "description": "Multiple tool calls, comparison logic"
    },
    {
        "name": "multi_step_reasoning",
        "input": "Analyze sales data from Q1-Q4 and create a summary report with trends",
        "complexity": "complex",
        "description": "Many tool calls, aggregation, analysis"
    },
    {
        "name": "research_synthesis",
        "input": "Research emerging AI technologies, evaluate their potential impact, and recommend adoption strategy",
        "complexity": "very_complex",
        "description": "Extended interaction, deep reasoning, synthesis"
    }
]

Guidelines

  1. Use multi-dimensional rubrics, not single metrics
  2. Evaluate outcomes, not specific execution paths
  3. Cover complexity levels from simple to complex
  4. Test with realistic context sizes and histories
  5. Run evaluations continuously, not just before release
  6. Supplement LLM evaluation with human review
  7. Track metrics over time for trend detection
  8. Set clear pass/fail thresholds based on use case

Gotchas

  1. Overfitting evals to specific code paths: Tests pass but the agent fails on slight input variations. Write eval criteria against outcomes and semantics, not surface patterns, and rotate test inputs periodically.
  2. LLM-judge self-enhancement bias: Models rate their own outputs higher than independent judges do. Use a different model family as the evaluation judge than the model being evaluated.
  3. Test set contamination: Eval examples leak into training data or prompt templates, inflating scores. Keep eval sets versioned and separate from any data used in prompts or fine-tuning.
  4. Metric gaming: Optimizing for the metric rather than actual quality produces agents that score well but disappoint users. Cross-validate automated metrics against human judgments regularly.
  5. Single-dimension scoring: One aggregate number hides critical failures in specific dimensions. Always report per-dimension scores alongside the overall score, and fail the eval if any single dimension falls below its minimum threshold.
  6. Eval set too small: Fewer than 50 examples produces unreliable signal with high variance between runs. Scale the eval set to at least 50 cases and report confidence intervals.
  7. Not stratifying by difficulty: Easy examples inflate overall scores, masking failures on hard cases. Report scores per complexity stratum and weight the overall score to prevent easy-case dominance.
  8. Treating eval as one-time: Evaluation must be continuous, not a launch gate. Agent quality drifts as models update, tools change, and usage patterns evolve. Run evals on every change and on a regular production cadence.

Integration

This skill connects to all other skills as a cross-cutting concern:
  • context-fundamentals - Evaluating context usage
  • context-degradation - Detecting degradation
  • context-optimization - Measuring optimization effectiveness
  • multi-agent-patterns - Evaluating coordination
  • tool-design - Evaluating tool effectiveness
  • memory-systems - Evaluating memory quality

References

Internal reference:
  • Metrics Reference - Read when: designing specific evaluation metrics, choosing scoring scales, or implementing weighted rubric calculations
Internal skills:
  • All other skills connect to evaluation for quality measurement
External resources:
  • LLM evaluation benchmarks - Read when: selecting or building benchmark suites for agent comparison
  • Agent evaluation research papers - Read when: adopting new evaluation methodologies or validating current approach
  • Production monitoring practices - Read when: setting up alerting, dashboards, or sampling strategies for live systems

Skill Metadata

Created: 2025-12-20
Last Updated: 2026-03-17
Author: Agent Skills for Context Engineering Contributors
Version: 1.1.0