adk-evals
Complete reference for writing, running, and iterating on evals (automated conversation tests) for ADK agents. Covers eval file format, all assertion types, CLI usage, and per-primitive testing patterns.
Install: `npx skill4agent add botpress/skills adk-evals`
# ADK Evals Skill
## What are Evals?
Evals are automated conversation tests for ADK agents. Each eval defines a scenario — a sequence of user messages or events — and asserts on what the bot should do: what it says, which tools it calls, how state changes, what gets written to tables, and more.
Evals run against a live dev bot (`adk dev`), so they test the full stack — not mocks.

## When to Use This Skill
Use this skill when the developer asks about:
- Writing evals — file format, assertions, turn types, setup
- Running evals — CLI commands, filtering, output interpretation
- Testing specific primitives — how to test actions, tools, workflows, conversations, tables, state
- The testing loop — write → run → inspect traces → iterate
- CI integration — exit codes, the `--format json` flag, tagging strategies
- Eval configuration — `idleTimeout`, `judgePassThreshold`, `judgeModel`
Or when you are developing an ADK bot and need to write the equivalent of unit/end-to-end tests.
Trigger questions:
- "How do I write an eval?"
- "How do I test my workflow?"
- "How do I assert that a tool was called with specific params?"
- "My eval is failing, how do I debug it?"
- "How do I test that the bot stays silent?"
- "How do I run evals in CI?"
- "How do I seed state before an eval?"
- "How do I trigger a workflow in an eval?"
## Available Documentation

| File | Contents |
|---|---|
| `eval-format.md` | Complete file format — all fields, turn types, assertion categories, match operators, setup, outcome, options |
| `testing-workflow.md` | Running evals, interpreting output, using traces, the write → test → iterate loop, CI integration |
| `test-patterns.md` | Per-primitive patterns for actions, tools, workflows, conversations, tables, and state |
## How to Answer

- Writing an eval → Read `eval-format.md` for structure and assertions
- Running evals → Read `testing-workflow.md` for CLI commands and output
- Testing a specific primitive → Read `test-patterns.md` for the relevant section
- Debugging a failure → Combine `testing-workflow.md` (inspect traces) + `eval-format.md` (check assertion syntax)
## Quick Reference

### Eval file structure

```typescript
import { Eval } from '@botpress/adk'
export default new Eval({
  name: 'greeting',
  type: 'regression',
  tags: ['basic'],
  setup: {
    state: { bot: { welcomeSent: false } },
    workflow: { trigger: 'onboarding', input: { userId: 'test-1' } },
  },
  conversation: [
    {
      user: 'Hi!',
      assert: {
        response: [
          { not_contains: 'error' },
          { llm_judge: 'Response is friendly and offers to help' },
        ],
        tools: [{ not_called: 'createTicket' }],
        state: [{ path: 'conversation.greeted', equals: true }],
      },
    },
  ],
  outcome: {
    state: [{ path: 'conversation.greeted', equals: true }],
  },
  options: {
    idleTimeout: 20000,
    judgePassThreshold: 4,
  },
})
```

### Turn types
| Turn | When to use |
|---|---|
| `user` | Standard user message |
| `event` | Non-message trigger (webhook, integration event) |
| `expectSilence` | Assert bot does NOT respond |
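The turn shapes above can be sketched as a TypeScript union. The field names come from the examples in this skill; the exact ADK types may differ. A small validity check captures the "every turn needs `user` or `event`" rule:

```typescript
// Sketch of eval turn shapes, inferred from this skill's examples.
// Illustrative only — the real ADK types may differ.
type UserTurn = { user: string; assert?: unknown; expectSilence?: boolean }
type EventTurn = { event: { type: string; payload?: unknown }; assert?: unknown; expectSilence?: boolean }
type Turn = UserTurn | EventTurn

// A turn is valid only if it carries a user message or an event;
// `expectSilence` on its own is not a turn.
function isValidTurn(turn: Partial<UserTurn & EventTurn>): boolean {
  return typeof turn.user === 'string' || typeof turn.event?.type === 'string'
}
```

With this sketch, `isValidTurn({ expectSilence: true })` returns `false`, matching the "expectSilence alone is not a valid turn" rule in the Critical Patterns section.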
### Assertion categories

| Category | What it checks |
|---|---|
| `response` | Bot reply text (contains, matches, llm_judge, similar_to) |
| `tools` | Tool calls (called, not_called, call_order, params) |
| `state` | Bot/user/conversation state (equals, changed) |
| `tables` | Table rows (row_exists, row_count) |
| `workflow` | Workflow execution (entered, completed) |
| `latency` | Response time in ms (lte, gte) |
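The match operators listed above behave like simple predicates over the actual value. Here is a local sketch of their likely semantics, based on my reading of this doc's examples rather than the ADK implementation (which may differ, e.g. for deep equality):

```typescript
// Local sketch of match-operator semantics; not the actual ADK matcher code.
type Matcher =
  | { contains: string }
  | { not_contains: string }
  | { equals: unknown }
  | { lte: number }
  | { gte: number }

function matches(actual: unknown, m: Matcher): boolean {
  if ('contains' in m) return typeof actual === 'string' && actual.includes(m.contains)
  if ('not_contains' in m) return typeof actual === 'string' && !actual.includes(m.not_contains)
  if ('equals' in m) return actual === m.equals
  if ('lte' in m) return typeof actual === 'number' && actual <= m.lte
  return typeof actual === 'number' && actual >= m.gte
}
```

For example, `matches(response, { not_contains: 'error' })` mirrors the `response` assertion in the Quick Reference, and `matches(1500, { lte: 2000 })` mirrors a latency check.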
### CLI commands

```bash
adk evals # run all evals
adk evals <name> # run one eval
adk evals --tag <tag> # filter by tag
adk evals --type regression # filter by type
adk evals --verbose # show all assertions
adk evals --format json # JSON output for CI
adk evals runs # list recent runs
adk evals runs --latest # most recent run
adk evals runs --latest -v # with full details
```

## Critical Patterns
✅ Every turn needs `user` or `event`

```typescript
// CORRECT
{ user: 'hello', expectSilence: true }
{ event: { type: 'payment.failed' }, expectSilence: true }
```

❌ `expectSilence` alone is not a valid turn

```typescript
// WRONG — missing user or event
{ expectSilence: true }
```

✅ Assert tool params to verify correct extraction

```typescript
// CORRECT — verifies the LLM extracted the right values
{ called: 'createTicket', params: { priority: { equals: 'high' } } }
```

❌ Only asserting the tool was called

```typescript
// INCOMPLETE — doesn't verify params were correct
{ called: 'createTicket' }
```

✅ Use `outcome` for post-conversation state and table assertions

```typescript
// CORRECT — final state checked once after all turns
outcome: {
  state: [{ path: 'conversation.resolved', equals: true }],
  tables: [{ table: 'ticketsTable', row_exists: { status: { equals: 'open' } } }],
}
```

❌ Checking tables in per-turn assertions when the write happens at the end

```typescript
// WRONG — table may not be written until after all turns
conversation: [
  {
    user: 'Create a ticket',
    assert: { tables: [{ table: 'ticketsTable', row_exists: { status: { equals: 'open' } } }] },
  },
]
```

✅ Seed state to test conditional behavior without running setup turns

```typescript
// CORRECT — start in a known state
setup: {
  state: {
    user: { plan: 'pro' },
    conversation: { phase: 'support' },
  },
}
```

❌ Using conversation turns to set up state (slow and fragile)

```typescript
// WRONG — depends on the bot correctly processing setup turns
conversation: [
  { user: 'I am on the pro plan' }, // hoping bot sets user.plan
  { user: 'I need help with billing' }, // actual test turn
]
```

## Example Questions
Writing evals:
- "Write an eval that tests my createTicket tool is called with the right priority"
- "How do I assert that the bot stays silent after an internal event?"
- "How do I test a multi-turn conversation where context is retained?"
Running evals:
- "How do I run only regression evals?"
- "How do I see which assertions failed and why?"
- "How do I integrate evals into GitHub Actions?"
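For the CI question, the usual pattern is to rely on the CLI's exit code and optionally post-process the `--format json` output. The result shape below (`name`, `passed` fields) is a guess for illustration only — inspect your actual `adk evals --format json` output before depending on field names:

```typescript
// Hypothetical post-processing of `adk evals --format json` output in CI.
// The result shape is an assumption, not documented ADK output.
interface EvalResult {
  name: string
  passed: boolean
}

// Collect the names of failing evals so CI logs can report them.
function failedEvals(results: EvalResult[]): string[] {
  return results.filter((r) => !r.passed).map((r) => r.name)
}

// In a CI step you might read the saved JSON and fail the job, e.g.:
// const results: EvalResult[] = JSON.parse(fs.readFileSync('results.json', 'utf8'))
// if (failedEvals(results).length > 0) process.exit(1)
```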
Debugging:
- "My eval says the tool wasn't called but I think it was — how do I check?"
- "How do I inspect what the bot actually did during an eval?"
Per-primitive:
- "How do I test a workflow that uses step.sleep()?"
- "How do I verify a row was written to a table after a conversation?"
- "How do I test that state changed from the seeded value?"
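To make the table question above concrete, `row_exists` can be pictured as a predicate over rows. This is a local sketch of the semantics implied by this doc's examples (criteria use `equals` matchers per field), not the ADK implementation, which may support more operators:

```typescript
// Sketch: does any row satisfy every { field: { equals: value } } criterion?
// Illustrative only — the real ADK table assertion may behave differently.
type Row = Record<string, unknown>
type Criteria = Record<string, { equals: unknown }>

function rowExists(rows: Row[], criteria: Criteria): boolean {
  return rows.some((row) =>
    Object.entries(criteria).every(([field, m]) => row[field] === m.equals),
  )
}
```

Under this sketch, `rowExists(rows, { status: { equals: 'open' } })` mirrors the `ticketsTable` assertion shown in the Critical Patterns section.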
## Response Format

When helping a developer write an eval:

- Show the complete `new Eval({})` call with realistic field values
- Include imports (`import { Eval } from '@botpress/adk'`)
- Explain each assertion and why it's the right choice for that scenario
- Point out any mutual exclusivity rules if relevant (`expectSilence` vs `assert.response`, `user` vs `event`)
- Suggest the CLI command to run it: `adk evals <name>`
When helping debug a failing eval:
- Ask for or show the failing assertion (expected/actual diff)
- Suggest opening traces in the Control Panel to see what the bot did
- Identify whether the issue is in the eval assertion or the bot's behavior