adk-eval-guide

ADK Evaluation Guide

Scaffolded project? If you used `/adk-scaffold`, you already have `make eval`, `tests/eval/evalsets/`, and `tests/eval/eval_config.json`. Start with `make eval` and iterate from there.

Non-scaffolded? Use `adk eval` directly; see Running Evaluations below.

Reference Files

| File | Contents |
| --- | --- |
| `references/criteria-guide.md` | Complete metrics reference: all 8 criteria, match types, custom metrics, judge model config |
| `references/user-simulation.md` | Dynamic conversation testing: ConversationScenario, user simulator config, compatible metrics |
| `references/builtin-tools-eval.md` | google_search and model-internal tools: trajectory behavior, metric compatibility |
| `references/multimodal-eval.md` | Multimodal inputs: evalset schema, built-in metric limitations, custom evaluator pattern |

The Eval-Fix Loop

Evaluation is iterative. When a score is below threshold, diagnose the cause, fix it, rerun — don't just report the failure.

How to iterate

  1. Start small: begin with 1-2 eval cases, not the full suite
  2. Run eval: `make eval` (or `adk eval` if no Makefile)
  3. Read the scores: identify what failed and why
  4. Fix the code: adjust prompts, tool logic, instructions, or the evalset
  5. Rerun eval: verify the fix worked
  6. Repeat steps 3-5 until the case passes
  7. Only then add more eval cases and expand coverage

Expect 5-10+ iterations. This is normal; each iteration makes the agent better.
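
One way to follow step 1 is to trim an existing evalset down to its first couple of cases before running it. The snippet below is a minimal sketch, assuming the evalset uses the `eval_cases` layout shown in the EvalSet Schema section; the helper name `first_cases` and the sample data are hypothetical and not part of ADK.

```python
import copy


def first_cases(evalset: dict, limit: int = 2) -> dict:
    """Return a copy of an evalset dict that keeps only the first `limit` cases."""
    trimmed = copy.deepcopy(evalset)
    trimmed["eval_cases"] = trimmed.get("eval_cases", [])[:limit]
    return trimmed


# Hypothetical evalset with three cases, used only for illustration.
full = {
    "eval_set_id": "my_eval_set",
    "eval_cases": [{"eval_id": "a"}, {"eval_id": "b"}, {"eval_id": "c"}],
}
small = first_cases(full, limit=2)
print([case["eval_id"] for case in small["eval_cases"]])  # → ['a', 'b']
```

The original dict is left untouched, so the full suite is still available once the first cases pass.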

What to fix when scores fail

| Failure | What to change |
| --- | --- |
| `tool_trajectory_avg_score` low | Fix agent instructions (tool ordering), update evalset `tool_uses`, or switch to `IN_ORDER`/`ANY_ORDER` match type |
| `response_match_score` low | Adjust agent instruction wording, or relax the expected response |
| `final_response_match_v2` low | Refine agent instructions, or adjust expected response (this metric is semantic, not lexical) |
| `rubric_based` score low | Refine agent instructions to address the specific rubric that failed |
| `hallucinations_v1` low | Tighten agent instructions to stay grounded in tool output |
| Agent calls wrong tools | Fix tool descriptions, agent instructions, or `tool_config` |
| Agent calls extra tools | Use `IN_ORDER`/`ANY_ORDER` match type, add strict stop instructions, or switch to `rubric_based_tool_use_quality_v1` |

Choosing the Right Criteria

| Goal | Recommended Metric |
| --- | --- |
| Regression testing / CI/CD (fast, deterministic) | `tool_trajectory_avg_score` + `response_match_score` |
| Semantic response correctness (flexible phrasing OK) | `final_response_match_v2` |
| Response quality without reference answer | `rubric_based_final_response_quality_v1` |
| Validate tool usage reasoning | `rubric_based_tool_use_quality_v1` |
| Detect hallucinated claims | `hallucinations_v1` |
| Safety compliance | `safety_v1` |
| Dynamic multi-turn conversations | User simulation + `hallucinations_v1` / `safety_v1` (see `references/user-simulation.md`) |
| Multimodal input (image, audio, file) | `tool_trajectory_avg_score` + custom metric for response quality (see `references/multimodal-eval.md`) |

For the complete metrics reference with config examples, match types, and custom metrics, see `references/criteria-guide.md`.

Running Evaluations

```bash
# Scaffolded projects:
make eval EVALSET=tests/eval/evalsets/my_evalset.json

# Or directly via ADK CLI:
adk eval ./app <path_to_evalset.json> --config_file_path=<path_to_config.json> --print_detailed_results

# Run specific eval cases from a set:
adk eval ./app my_evalset.json:eval_1,eval_2

# With GCS storage:
adk eval ./app my_evalset.json --eval_storage_uri gs://my-bucket/evals
```

**CLI options:** `--config_file_path`, `--print_detailed_results`, `--eval_storage_uri`, `--log_level`

**Eval set management:**
```bash
adk eval_set create <agent_path> <eval_set_id>
adk eval_set add_eval_case <agent_path> <eval_set_id> --scenarios_file <path> --session_input_file <path>
```
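
When the eval run is driven from a script (for example in CI), the same command line can be assembled in Python. The sketch below only uses the flags and the `evalset.json:id1,id2` form listed in this section; the helper name `build_eval_command` and the example paths are assumptions made for illustration.

```python
import shlex


def build_eval_command(agent_dir, evalset_path, case_ids=None,
                       config_path=None, detailed=True):
    """Assemble an `adk eval` argv list from the options listed in this section."""
    target = evalset_path
    if case_ids:
        # Specific cases use the `evalset.json:id1,id2` form.
        target = f"{evalset_path}:{','.join(case_ids)}"
    cmd = ["adk", "eval", agent_dir, target]
    if config_path:
        cmd.append(f"--config_file_path={config_path}")
    if detailed:
        cmd.append("--print_detailed_results")
    return cmd


# Example paths are hypothetical; substitute your own project layout.
cmd = build_eval_command(
    "./app",
    "tests/eval/evalsets/my_evalset.json",
    case_ids=["eval_1", "eval_2"],
    config_path="tests/eval/eval_config.json",
)
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd)` to execute the evaluation; building it as a list avoids shell-quoting problems with paths.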

Configuration Schema (`eval_config.json`)

Both camelCase and snake_case field names are accepted (Pydantic aliases). The examples below use snake_case, matching the official ADK docs.

Full example

```json
{
  "criteria": {
    "tool_trajectory_avg_score": {
      "threshold": 1.0,
      "match_type": "IN_ORDER"
    },
    "final_response_match_v2": {
      "threshold": 0.8,
      "judge_model_options": {
        "judge_model": "gemini-2.5-flash",
        "num_samples": 5
      }
    },
    "rubric_based_final_response_quality_v1": {
      "threshold": 0.8,
      "rubrics": [
        {
          "rubric_id": "professionalism",
          "rubric_content": { "text_property": "The response must be professional and helpful." }
        },
        {
          "rubric_id": "safety",
          "rubric_content": { "text_property": "The agent must NEVER book without asking for confirmation." }
        }
      ]
    }
  }
}
```

Simple threshold shorthand is also valid:

```json
"response_match_score": 0.8
```

For custom metrics, `judge_model_options` details, and `user_simulator_config`, see `references/criteria-guide.md`.
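
For CI runs, the deterministic pair recommended under Choosing the Right Criteria can be generated from Python instead of edited by hand. This is a minimal sketch: the helper `ci_eval_config` and its default thresholds are illustrative assumptions, and only the field names are taken from the example in this section.

```python
import json


def ci_eval_config(trajectory_threshold=1.0, response_threshold=0.8,
                   match_type="IN_ORDER"):
    """Build a criteria dict that uses the two deterministic metrics."""
    return {
        "criteria": {
            "tool_trajectory_avg_score": {
                "threshold": trajectory_threshold,
                "match_type": match_type,
            },
            # Shorthand form: a bare number stands for the threshold.
            "response_match_score": response_threshold,
        }
    }


config = ci_eval_config()
config_text = json.dumps(config, indent=2)
print(config_text)
```

Writing `config_text` to `tests/eval/eval_config.json` (or any path passed via `--config_file_path`) keeps thresholds in one reviewable place.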


EvalSet Schema (`evalset.json`)

```json
{
  "eval_set_id": "my_eval_set",
  "name": "My Eval Set",
  "description": "Tests core capabilities",
  "eval_cases": [
    {
      "eval_id": "search_test",
      "conversation": [
        {
          "invocation_id": "inv_1",
          "user_content": { "parts": [{ "text": "Find a flight to NYC" }] },
          "final_response": {
            "role": "model",
            "parts": [{ "text": "I found a flight for $500. Want to book?" }]
          },
          "intermediate_data": {
            "tool_uses": [
              { "name": "search_flights", "args": { "destination": "NYC" } }
            ],
            "intermediate_responses": [
              ["sub_agent_name", [{ "text": "Found 3 flights to NYC." }]]
            ]
          }
        }
      ],
      "session_input": { "app_name": "my_app", "user_id": "user_1", "state": {} }
    }
  ]
}
```

Key fields:
  • `intermediate_data.tool_uses`: expected tool call trajectory (chronological order)
  • `intermediate_data.intermediate_responses`: expected sub-agent responses (for multi-agent systems)
  • `session_input.state`: initial session state (overrides Python-level initialization)
  • `conversation_scenario`: alternative to `conversation` for user simulation (see `references/user-simulation.md`)
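
A quick structural check before running an evalset can catch typos early. The sketch below is not ADK's own validation; the helper `check_evalset` is hypothetical, and the rule that a case carries exactly one of `conversation` or `conversation_scenario` is an assumption based on the field list above.

```python
def check_evalset(evalset: dict) -> list:
    """Return a list of human-readable problems found in an evalset dict."""
    problems = []
    if "eval_set_id" not in evalset:
        problems.append("missing eval_set_id")
    for case in evalset.get("eval_cases", []):
        case_id = case.get("eval_id", "<no eval_id>")
        has_conversation = "conversation" in case
        has_scenario = "conversation_scenario" in case
        # Assumption: a case is driven by a fixed conversation OR a scenario.
        if has_conversation == has_scenario:
            problems.append(f"{case_id}: need exactly one conversation source")
        for turn in case.get("conversation", []):
            if "user_content" not in turn:
                problems.append(f"{case_id}: turn without user_content")
    return problems


good = {
    "eval_set_id": "my_eval_set",
    "eval_cases": [{
        "eval_id": "search_test",
        "conversation": [{
            "invocation_id": "inv_1",
            "user_content": {"parts": [{"text": "Find a flight to NYC"}]},
        }],
    }],
}
bad = {"eval_cases": [{"eval_id": "broken", "conversation": [{"invocation_id": "inv_1"}]}]}

print(check_evalset(good))  # → []
print(check_evalset(bad))   # → ['missing eval_set_id', 'broken: turn without user_content']
```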


Common Gotchas

The Proactivity Trajectory Gap

LLMs often perform extra actions not asked for (e.g., `google_search` after `save_preferences`). This causes `tool_trajectory_avg_score` failures with `EXACT` match. Solutions:
  1. Use `IN_ORDER` or `ANY_ORDER` match type: both tolerate extra tool calls between expected ones
  2. Include ALL tools the agent might call in your expected trajectory
  3. Use `rubric_based_tool_use_quality_v1` instead of trajectory matching
  4. Add strict stop instructions: "Stop after calling save_preferences. Do NOT search."
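
To see why the match type matters, the three modes can be modeled on plain lists of tool names. This is a simplified illustration and not ADK's implementation: it compares names only (the real metric also considers arguments), and the function `matches` is a hypothetical helper.

```python
def matches(expected, actual, match_type="EXACT"):
    """Compare tool-name sequences under a simplified reading of each match type."""
    if match_type == "EXACT":
        # No extra, missing, or reordered calls are allowed.
        return expected == actual
    if match_type == "IN_ORDER":
        # Expected calls must appear as a subsequence of the actual calls.
        remaining = iter(actual)
        return all(name in remaining for name in expected)
    if match_type == "ANY_ORDER":
        # Every expected call must appear; order is ignored, extras tolerated.
        pool = list(actual)
        for name in expected:
            if name not in pool:
                return False
            pool.remove(name)
        return True
    raise ValueError(f"unknown match type: {match_type}")


expected = ["save_preferences"]
actual = ["save_preferences", "google_search"]  # the agent added a search

for mode in ("EXACT", "IN_ORDER", "ANY_ORDER"):
    print(mode, matches(expected, actual, mode))
```

With the extra `google_search` call, only `EXACT` rejects the run, which is the gap described above.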

Multi-turn conversations require tool_uses for ALL turns

The `tool_trajectory_avg_score` metric evaluates each invocation. If you don't specify expected tool calls for intermediate turns, the evaluation will fail even if the agent called the right tools.

```json
{
  "conversation": [
    {
      "invocation_id": "inv_1",
      "user_content": { "parts": [{"text": "Find me a flight from NYC to London"}] },
      "intermediate_data": {
        "tool_uses": [
          { "name": "search_flights", "args": {"origin": "NYC", "destination": "LON"} }
        ]
      }
    },
    {
      "invocation_id": "inv_2",
      "user_content": { "parts": [{"text": "Book the first option"}] },
      "final_response": { "role": "model", "parts": [{"text": "Booking confirmed!"}] },
      "intermediate_data": {
        "tool_uses": [
          { "name": "book_flight", "args": {"flight_id": "1"} }
        ]
      }
    }
  ]
}
```
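
A small lint pass can flag turns that omit `tool_uses` before the eval is run. The sketch below assumes the conversation layout shown above; `turns_missing_tool_uses` is a hypothetical helper, and treating an explicit empty list as "specified" is an assumption.

```python
def turns_missing_tool_uses(conversation):
    """Return invocation_ids whose intermediate_data has no tool_uses key."""
    missing = []
    for turn in conversation:
        data = turn.get("intermediate_data") or {}
        # An explicit empty list counts as specified (assumption).
        if "tool_uses" not in data:
            missing.append(turn.get("invocation_id", "<unknown>"))
    return missing


conversation = [
    {"invocation_id": "inv_1",
     "user_content": {"parts": [{"text": "Find me a flight from NYC to London"}]}},
    {"invocation_id": "inv_2",
     "user_content": {"parts": [{"text": "Book the first option"}]},
     "intermediate_data": {"tool_uses": [{"name": "book_flight", "args": {"flight_id": "1"}}]}},
]
print(turns_missing_tool_uses(conversation))  # → ['inv_1']
```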

App name must match directory name

The `App` object's `name` parameter MUST match the directory containing your agent:

```python
# CORRECT - matches the "app" directory
app = App(root_agent=root_agent, name="app")

# WRONG - causes "Session not found" errors
app = App(root_agent=root_agent, name="flight_booking_assistant")
```

The `before_agent_callback` Pattern (State Initialization)

Always use a callback to initialize session state variables used in your instruction template. This prevents `KeyError` crashes on the first turn:

```python
from google.adk.agents import Agent
from google.adk.agents.callback_context import CallbackContext


async def initialize_state(callback_context: CallbackContext) -> None:
    # Seed every state key that the instruction template references.
    state = callback_context.state
    if "user_preferences" not in state:
        state["user_preferences"] = {}


root_agent = Agent(
    name="my_agent",
    before_agent_callback=initialize_state,
    instruction="Based on preferences: {user_preferences}...",
)
```

Eval-State Overrides (Type Mismatch Danger)

Be careful with `session_input.state` in your evalset. It overrides Python-level initialization:

```json
// WRONG: initializes feedback_history as a string, breaks .append()
"state": { "feedback_history": "" }

// CORRECT: matches the Python type (list)
"state": { "feedback_history": [] }

// NOTE: Remove these // comments before using. JSON does not support comments.
```
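
This kind of mismatch can be caught mechanically by comparing the evalset's state against the defaults your callback would set. A minimal sketch, assuming you keep those defaults in a plain dict; `state_type_mismatches` and `python_defaults` are hypothetical names used only for illustration.

```python
def state_type_mismatches(defaults: dict, eval_state: dict) -> dict:
    """Map each overridden key to (expected_type, actual_type) when they differ."""
    mismatches = {}
    for key, value in eval_state.items():
        if key in defaults and type(value) is not type(defaults[key]):
            mismatches[key] = (type(defaults[key]).__name__, type(value).__name__)
    return mismatches


# Defaults mirroring what the Python-level initialization would set.
python_defaults = {"feedback_history": [], "user_preferences": {}}

print(state_type_mismatches(python_defaults, {"feedback_history": ""}))
# → {'feedback_history': ('list', 'str')}
print(state_type_mismatches(python_defaults, {"feedback_history": []}))
# → {}
```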

Model thinking mode may bypass tools

Models with "thinking" enabled may skip tool calls. Use `tool_config` with `mode="ANY"` to force tool usage, or switch to a non-thinking model for predictable tool calling.


Common Eval Failure Causes

| Symptom | Cause | Fix |
| --- | --- | --- |
| Missing `tool_uses` in intermediate turns | Trajectory expects match per invocation | Add expected tool calls to all turns |
| Agent mentions data not in tool output | Hallucination | Tighten agent instructions; add `hallucinations_v1` metric |
| "Session not found" error | App name mismatch | Ensure App `name` matches directory name |
| Score fluctuates between runs | Non-deterministic model | Set `temperature=0` or use rubric-based eval |
| `tool_trajectory_avg_score` always 0 | Agent uses `google_search` (model-internal) | Remove trajectory metric; see `references/builtin-tools-eval.md` |
| Trajectory fails but tools are correct | Extra tools called | Switch to `IN_ORDER`/`ANY_ORDER` match type |
| LLM judge ignores image/audio in eval | `get_text_from_content()` skips non-text parts | Use custom metric with vision-capable judge (see `references/multimodal-eval.md`) |

Deep Dive: ADK Docs

For the official evaluation documentation, fetch these pages:
  • Evaluation overview: https://google.github.io/adk-docs/evaluate/index.md
  • Criteria reference: https://google.github.io/adk-docs/evaluate/criteria/index.md
  • User simulation: https://google.github.io/adk-docs/evaluate/user-sim/index.md

Debugging Example

User says: "tool_trajectory_avg_score is 0, what's wrong?"
  1. Check if agent uses `google_search`; if so, see `references/builtin-tools-eval.md`
  2. Check if using `EXACT` match and agent calls extra tools; if so, try `IN_ORDER`
  3. Compare expected `tool_uses` in evalset with actual agent behavior
  4. Fix mismatch (update evalset or agent instructions)
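
Step 3 is easier with a small diff of tool names. The sketch below assumes both trajectories are available as lists of `{"name": ..., "args": ...}` dicts, as in the evalset `tool_uses` field; how you capture the actual calls is up to your setup, and `trajectory_diff` is a hypothetical helper.

```python
from collections import Counter


def trajectory_diff(expected_tool_uses, actual_tool_uses):
    """Summarize which tool names are missing from, or extra in, the actual run."""
    expected = Counter(call["name"] for call in expected_tool_uses)
    actual = Counter(call["name"] for call in actual_tool_uses)
    return {
        "missing": sorted((expected - actual).elements()),
        "extra": sorted((actual - expected).elements()),
    }


expected_calls = [{"name": "search_flights", "args": {"destination": "NYC"}}]
actual_calls = [
    {"name": "search_flights", "args": {"destination": "NYC"}},
    {"name": "google_search", "args": {"query": "NYC flights"}},
]
print(trajectory_diff(expected_calls, actual_calls))
# → {'missing': [], 'extra': ['google_search']}
```

An empty `missing` list with a non-empty `extra` list points to the extra-tools case from step 2; a non-empty `missing` list points to the evalset or the agent instructions.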