# ADK Evals Skill

## What are Evals?

Evals are automated conversation tests for ADK agents. Each eval defines a scenario — a sequence of user messages or events — and asserts on what the bot should do: what it says, which tools it calls, how state changes, what gets written to tables, and more.

Evals run against a live dev bot (`adk dev`), so they test the full stack — not mocks.

## When to Use This Skill

Use this skill when the developer asks about:

- Writing evals — file format, assertions, turn types, setup
- Running evals — CLI commands, filtering, output interpretation
- Testing specific primitives — how to test actions, tools, workflows, conversations, tables, state
- The testing loop — write → run → inspect traces → iterate
- CI integration — exit codes, the `--format json` flag, tagging strategies
- Eval configuration — `idleTimeout`, `judgePassThreshold`, `judgeModel`

Or when you are developing an ADK bot and need to write the equivalent of unit/end-to-end tests.

Trigger questions:

- "How do I write an eval?"
- "How do I test my workflow?"
- "How do I assert that a tool was called with specific params?"
- "My eval is failing, how do I debug it?"
- "How do I test that the bot stays silent?"
- "How do I run evals in CI?"
- "How do I seed state before an eval?"
- "How do I trigger a workflow in an eval?"

## Available Documentation

| File | Contents |
| --- | --- |
| `references/eval-format.md` | Complete file format — all fields, turn types, assertion categories, match operators, setup, outcome, options |
| `references/testing-workflow.md` | Running evals, interpreting output, using traces, the write → test → iterate loop, CI integration |
| `references/test-patterns.md` | Per-primitive patterns for actions, tools, workflows, conversations, tables, and state |

## How to Answer

1. Writing an eval → Read `eval-format.md` for structure and assertions
2. Running evals → Read `testing-workflow.md` for CLI commands and output
3. Testing a specific primitive → Read `test-patterns.md` for the relevant section
4. Debugging a failure → Combine `testing-workflow.md` (inspect traces) + `eval-format.md` (check assertion syntax)

## Quick Reference

### Eval file structure

```typescript
import { Eval } from '@botpress/adk'

export default new Eval({
  name: 'greeting',
  type: 'regression',
  tags: ['basic'],

  setup: {
    state: { bot: { welcomeSent: false } },
    workflow: { trigger: 'onboarding', input: { userId: 'test-1' } },
  },

  conversation: [
    {
      user: 'Hi!',
      assert: {
        response: [
          { not_contains: 'error' },
          { llm_judge: 'Response is friendly and offers to help' },
        ],
        tools: [{ not_called: 'createTicket' }],
        state: [{ path: 'conversation.greeted', equals: true }],
      },
    },
  ],

  outcome: {
    state: [{ path: 'conversation.greeted', equals: true }],
  },

  options: {
    idleTimeout: 20000,
    judgePassThreshold: 4,
  },
})
```

### Turn types

| Turn | When to use |
| --- | --- |
| `user: 'message'` | Standard user message |
| `event: { type, payload }` | Non-message trigger (webhook, integration event) |
| `expectSilence: true` | Assert the bot does NOT respond |
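All three turn types can be mixed in a single conversation. A sketch following the shapes above — the `payment.failed` and `audit.logged` event types, payloads, and assertion texts are illustrative, not part of the ADK API:

```typescript
conversation: [
  // Standard user message, with a response assertion
  {
    user: 'What happened to my payment?',
    assert: { response: [{ llm_judge: 'Response asks for account details' }] },
  },
  // Non-message trigger: the bot should react to this event
  {
    event: { type: 'payment.failed', payload: { invoiceId: 'inv-42' } },  // hypothetical event
    assert: { response: [{ contains: 'payment' }] },
  },
  // Internal event the bot should ignore: assert silence
  {
    event: { type: 'audit.logged', payload: {} },  // hypothetical event
    expectSilence: true,
  },
],
```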

### Assertion categories

| Category | What it checks |
| --- | --- |
| `response` | Bot reply text (`contains`, `matches`, `llm_judge`, `similar_to`) |
| `tools` | Tool calls (`called`, `not_called`, `call_order`, `params`) |
| `state` | Bot/user/conversation state (`equals`, `changed`) |
| `tables` | Table rows (`row_exists`, `row_count`) |
| `workflow` | Workflow execution (`entered`, `completed`) |
| `timing` | Response time in ms (`lte`, `gte`) |
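Several categories can be combined in one turn's `assert` block. A sketch using operators from the table — the tool and table names are made up, and the exact shapes of the `call_order` and `row_count` operators here are assumptions to confirm against `references/eval-format.md`:

```typescript
assert: {
  // Tools must be called in this order, and with the right params
  tools: [
    { call_order: ['lookupCustomer', 'createTicket'] },   // assumed shape
    { called: 'createTicket', params: { priority: { equals: 'high' } } },
  ],
  // A matching row exists, and the table holds exactly one row
  tables: [
    { table: 'ticketsTable', row_exists: { status: { equals: 'open' } } },
    { table: 'ticketsTable', row_count: { equals: 1 } },   // assumed shape
  ],
  // Bot responds within 5 seconds
  timing: [{ lte: 5000 }],
},
```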

### CLI commands

```bash
adk evals                        # run all evals
adk evals <name>                 # run one eval
adk evals --tag <tag>            # filter by tag
adk evals --type regression      # filter by type
adk evals --verbose              # show all assertions
adk evals --format json          # JSON output for CI

adk evals runs                   # list recent runs
adk evals runs --latest          # most recent run
adk evals runs --latest -v       # with full details
```


## Critical Patterns

### Every turn needs `user` or `event`

```typescript
// CORRECT
{ user: 'hello', expectSilence: true }
{ event: { type: 'payment.failed' }, expectSilence: true }
```

`expectSilence` alone is not a valid turn:

```typescript
// WRONG — missing user or event
{ expectSilence: true }
```

### Assert tool params to verify correct extraction

```typescript
// CORRECT — verifies the LLM extracted the right values
{ called: 'createTicket', params: { priority: { equals: 'high' } } }
```

Only asserting that the tool was called is not enough:

```typescript
// INCOMPLETE — doesn't verify params were correct
{ called: 'createTicket' }
```

### Use `outcome` for post-conversation state and table assertions

```typescript
// CORRECT — final state checked once after all turns
outcome: {
  state: [{ path: 'conversation.resolved', equals: true }],
  tables: [{ table: 'ticketsTable', row_exists: { status: { equals: 'open' } } }],
}
```

Do not check tables in per-turn assertions when the write happens at the end:

```typescript
// WRONG — table may not be written until after all turns
conversation: [
  {
    user: 'Create a ticket',
    assert: { tables: [{ table: 'ticketsTable', row_exists: { status: { equals: 'open' } } }] },
  },
]
```

### Seed state to test conditional behavior without running setup turns

```typescript
// CORRECT — start in a known state
setup: {
  state: {
    user: { plan: 'pro' },
    conversation: { phase: 'support' },
  },
}
```

Do not use conversation turns to set up state (slow and fragile):

```typescript
// WRONG — depends on the bot correctly processing setup turns
conversation: [
  { user: 'I am on the pro plan' },      // hoping bot sets user.plan
  { user: 'I need help with billing' },   // actual test turn
]
```
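The seeded-state and `outcome` patterns combine naturally into a single eval. A sketch modeled on the Quick Reference structure — the eval name, the `openBillingPortal` tool, and the `conversation.phase` state path are hypothetical:

```typescript
import { Eval } from '@botpress/adk'

export default new Eval({
  name: 'pro-plan-billing',
  type: 'regression',
  tags: ['billing'],

  // Seed a known state instead of relying on setup turns
  setup: {
    state: { user: { plan: 'pro' } },
  },

  conversation: [
    {
      user: 'I need help with billing',
      assert: {
        // Pro users should reach the billing tool, not a generic ticket
        tools: [
          { called: 'openBillingPortal' },   // hypothetical tool name
          { not_called: 'createTicket' },
        ],
        response: [{ llm_judge: 'Response acknowledges the pro plan' }],
      },
    },
  ],

  // Final state checked once, after all turns
  outcome: {
    state: [{ path: 'conversation.phase', equals: 'billing' }],   // hypothetical path
  },
})
```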


## Example Questions

Writing evals:

- "Write an eval that tests my `createTicket` tool is called with the right priority"
- "How do I assert that the bot stays silent after an internal event?"
- "How do I test a multi-turn conversation where context is retained?"

Running evals:

- "How do I run only regression evals?"
- "How do I see which assertions failed and why?"
- "How do I integrate evals into GitHub Actions?"

Debugging:

- "My eval says the tool wasn't called but I think it was — how do I check?"
- "How do I inspect what the bot actually did during an eval?"

Per-primitive:

- "How do I test a workflow that uses `step.sleep()`?"
- "How do I verify a row was written to a table after a conversation?"
- "How do I test that state changed from the seeded value?"

## Response Format

When helping a developer write an eval:

1. Show the complete `new Eval({})` call with realistic field values
2. Include imports (`import { Eval } from '@botpress/adk'`)
3. Explain each assertion and why it's the right choice for that scenario
4. Point out any mutual exclusivity rules if relevant (`expectSilence` vs `assert.response`, `user` vs `event`)
5. Suggest the CLI command to run it: `adk evals <name>`

When helping debug a failing eval:

1. Ask for or show the failing assertion (the `expected`/`actual` diff)
2. Suggest opening traces in the Control Panel to see what the bot did
3. Identify whether the issue is in the eval assertion or the bot's behavior