# ADK Evals Skill

## What are Evals?

Evals are automated conversation tests for ADK agents. Each eval defines a scenario — a sequence of user messages or events — and asserts on what the bot should do: what it says, which tools it calls, how state changes, what gets written to tables, and more.

Evals run against a live dev bot (`adk dev`), so they test the full stack — not mocks.

## When to Use This Skill

Use this skill when the developer asks about:

- Writing evals — file format, assertions, turn types, setup
- Running evals — CLI commands, filtering, output interpretation
- Testing specific primitives — how to test actions, tools, workflows, conversations, tables, state
- The testing loop — write → run → inspect traces → iterate
- CI integration — exit codes, the `--format json` flag, tagging strategies
- Eval configuration — `idleTimeout`, `judgePassThreshold`, `judgeModel`

Or when you are developing an ADK bot and need to write the equivalent of unit/end-to-end tests.

Trigger questions:

- "How do I write an eval?"
- "How do I test my workflow?"
- "How do I assert that a tool was called with specific params?"
- "My eval is failing, how do I debug it?"
- "How do I test that the bot stays silent?"
- "How do I run evals in CI?"
- "How do I seed state before an eval?"
- "How do I trigger a workflow in an eval?"

## Available Documentation

| File | Contents |
| --- | --- |
| `references/eval-format.md` | Complete file format — all fields, turn types, assertion categories, match operators, setup, outcome, options |
| `references/testing-workflow.md` | Running evals, interpreting output, using traces, the write → test → iterate loop, CI integration |
| `references/test-patterns.md` | Per-primitive patterns for actions, tools, workflows, conversations, tables, and state |

## How to Answer

1. Writing an eval → Read `eval-format.md` for structure and assertions
2. Running evals → Read `testing-workflow.md` for CLI commands and output
3. Testing a specific primitive → Read `test-patterns.md` for the relevant section
4. Debugging a failure → Combine `testing-workflow.md` (inspect traces) + `eval-format.md` (check assertion syntax)

## Quick Reference

### Eval file structure

```typescript
import { Eval } from '@botpress/adk'

export default new Eval({
  name: 'greeting',
  type: 'regression',
  tags: ['basic'],

  setup: {
    state: { bot: { welcomeSent: false } },
    workflow: { trigger: 'onboarding', input: { userId: 'test-1' } },
  },

  conversation: [
    {
      user: 'Hi!',
      assert: {
        response: [
          { not_contains: 'error' },
          { llm_judge: 'Response is friendly and offers to help' },
        ],
        tools: [{ not_called: 'createTicket' }],
        state: [{ path: 'conversation.greeted', equals: true }],
      },
    },
  ],

  outcome: {
    state: [{ path: 'conversation.greeted', equals: true }],
  },

  options: {
    idleTimeout: 20000,
    judgePassThreshold: 4,
  },
})
```

### Turn types

| Turn | When to use |
| --- | --- |
| `user: 'message'` | Standard user message |
| `event: { type, payload }` | Non-message trigger (webhook, integration event) |
| `expectSilence: true` | Assert the bot does NOT respond |
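All three turn types can be mixed in a single conversation. A sketch following the shapes above — the `payment.failed` and `audit.logged` event types, payloads, and assertion texts are illustrative, not part of the ADK API:

```typescript
conversation: [
  // Standard user message, with a response assertion
  {
    user: 'What happened to my payment?',
    assert: { response: [{ llm_judge: 'Response asks for account details' }] },
  },
  // Non-message trigger: the bot should react to this event
  {
    event: { type: 'payment.failed', payload: { invoiceId: 'inv-42' } },  // hypothetical event
    assert: { response: [{ contains: 'payment' }] },
  },
  // Internal event the bot should ignore: assert silence
  {
    event: { type: 'audit.logged', payload: {} },  // hypothetical event
    expectSilence: true,
  },
],
```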

### Assertion categories

| Category | What it checks |
| --- | --- |
| `response` | Bot reply text (`contains`, `matches`, `llm_judge`, `similar_to`) |
| `tools` | Tool calls (`called`, `not_called`, `call_order`, `params`) |
| `state` | Bot/user/conversation state (`equals`, `changed`) |
| `tables` | Table rows (`row_exists`, `row_count`) |
| `workflow` | Workflow execution (`entered`, `completed`) |
| `timing` | Response time in ms (`lte`, `gte`) |
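Several categories can be combined in one turn's `assert` block. A sketch using operators from the table — the tool and table names are made up, and the exact shapes of the `call_order` and `row_count` operators here are assumptions to confirm against `references/eval-format.md`:

```typescript
assert: {
  // Tools must be called in this order, and with the right params
  tools: [
    { call_order: ['lookupCustomer', 'createTicket'] },   // assumed shape
    { called: 'createTicket', params: { priority: { equals: 'high' } } },
  ],
  // A matching row exists, and the table holds exactly one row
  tables: [
    { table: 'ticketsTable', row_exists: { status: { equals: 'open' } } },
    { table: 'ticketsTable', row_count: { equals: 1 } },   // assumed shape
  ],
  // Bot responds within 5 seconds
  timing: [{ lte: 5000 }],
},
```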

### CLI commands

```bash
adk evals                        # run all evals
adk evals <name>                 # run one eval
adk evals --tag <tag>            # filter by tag
adk evals --type regression      # filter by type
adk evals --verbose              # show all assertions
adk evals --format json          # JSON output for CI

adk evals runs                   # list recent runs
adk evals runs --latest          # most recent run
adk evals runs --latest -v       # with full details
```


## Critical Patterns

### Every turn needs `user` or `event`

```typescript
// CORRECT
{ user: 'hello', expectSilence: true }
{ event: { type: 'payment.failed' }, expectSilence: true }
```

`expectSilence` alone is not a valid turn:

```typescript
// WRONG — missing user or event
{ expectSilence: true }
```

### Assert tool params to verify correct extraction

```typescript
// CORRECT — verifies the LLM extracted the right values
{ called: 'createTicket', params: { priority: { equals: 'high' } } }
```

Only asserting that the tool was called is not enough:

```typescript
// INCOMPLETE — doesn't verify params were correct
{ called: 'createTicket' }
```

### Use `outcome` for post-conversation state and table assertions

```typescript
// CORRECT — final state checked once after all turns
outcome: {
  state: [{ path: 'conversation.resolved', equals: true }],
  tables: [{ table: 'ticketsTable', row_exists: { status: { equals: 'open' } } }],
}
```

Do not check tables in per-turn assertions when the write happens at the end:

```typescript
// WRONG — table may not be written until after all turns
conversation: [
  {
    user: 'Create a ticket',
    assert: { tables: [{ table: 'ticketsTable', row_exists: { status: { equals: 'open' } } }] },
  },
]
```

### Seed state to test conditional behavior without running setup turns

```typescript
// CORRECT — start in a known state
setup: {
  state: {
    user: { plan: 'pro' },
    conversation: { phase: 'support' },
  },
}
```

Do not use conversation turns to set up state (slow and fragile):

```typescript
// WRONG — depends on the bot correctly processing setup turns
conversation: [
  { user: 'I am on the pro plan' },      // hoping bot sets user.plan
  { user: 'I need help with billing' },   // actual test turn
]
```
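The seeded-state and `outcome` patterns combine naturally into a single eval. A sketch modeled on the Quick Reference structure — the eval name, the `openBillingPortal` tool, and the `conversation.phase` state path are hypothetical:

```typescript
import { Eval } from '@botpress/adk'

export default new Eval({
  name: 'pro-plan-billing',
  type: 'regression',
  tags: ['billing'],

  // Seed a known state instead of relying on setup turns
  setup: {
    state: { user: { plan: 'pro' } },
  },

  conversation: [
    {
      user: 'I need help with billing',
      assert: {
        // Pro users should reach the billing tool, not a generic ticket
        tools: [
          { called: 'openBillingPortal' },   // hypothetical tool name
          { not_called: 'createTicket' },
        ],
        response: [{ llm_judge: 'Response acknowledges the pro plan' }],
      },
    },
  ],

  // Final state checked once, after all turns
  outcome: {
    state: [{ path: 'conversation.phase', equals: 'billing' }],   // hypothetical path
  },
})
```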


## Example Questions

Writing evals:

- "Write an eval that tests my `createTicket` tool is called with the right priority"
- "How do I assert that the bot stays silent after an internal event?"
- "How do I test a multi-turn conversation where context is retained?"

Running evals:

- "How do I run only regression evals?"
- "How do I see which assertions failed and why?"
- "How do I integrate evals into GitHub Actions?"

Debugging:

- "My eval says the tool wasn't called but I think it was — how do I check?"
- "How do I inspect what the bot actually did during an eval?"

Per-primitive:

- "How do I test a workflow that uses `step.sleep()`?"
- "How do I verify a row was written to a table after a conversation?"
- "How do I test that state changed from the seeded value?"

## Response Format

When helping a developer write an eval:

1. Show the complete `new Eval({})` call with realistic field values
2. Include imports (`import { Eval } from '@botpress/adk'`)
3. Explain each assertion and why it's the right choice for that scenario
4. Point out any mutual exclusivity rules if relevant (`expectSilence` vs `assert.response`, `user` vs `event`)
5. Suggest the CLI command to run it: `adk evals <name>`

When helping debug a failing eval:

1. Ask for or show the failing assertion (the `expected`/`actual` diff)
2. Suggest opening traces in the Control Panel to see what the bot did
3. Identify whether the issue is in the eval assertion or the bot's behavior