ADK Evals Skill

What are Evals?

Evals are automated conversation tests for ADK agents. Each eval defines a scenario — a sequence of user messages or events — and asserts on what the bot should do: what it says, which tools it calls, how state changes, what gets written to tables, and more.

Evals run against a live dev bot (

adk dev

), so they test the full stack — not mocks.

When to Use This Skill

Use this skill when the developer asks about:

Writing evals — file format, assertions, turn types, setup
Running evals — CLI commands, filtering, output interpretation
Testing specific primitives — how to test actions, tools, workflows, conversations, tables, state
The testing loop — write → run → inspect traces → iterate
CI integration — exit codes,
```
--format json
```
flag, tagging strategies
Eval configuration — idleTimeout, judgePassThreshold, judgeModel

Or when you are developing an ADK bot and need to write the equivalent of unit/end-to-end tests.

Trigger questions:

"How do I write an eval?"
"How do I test my workflow?"
"How do I assert that a tool was called with specific params?"
"My eval is failing, how do I debug it?"
"How do I test that the bot stays silent?"
"How do I run evals in CI?"
"How do I seed state before an eval?"
"How do I trigger a workflow in an eval?"

Available Documentation

File	Contents
`references/eval-format.md`	Complete file format — all fields, turn types, assertion categories, match operators, setup, outcome, options
`references/testing-workflow.md`	Running evals, interpreting output, using traces, the write → test → iterate loop, CI integration
`references/test-patterns.md`	Per-primitive patterns for actions, tools, workflows, conversations, tables, and state

How to Answer

Writing an eval → Read
```
eval-format.md
```
for structure and assertions
Running evals → Read
```
testing-workflow.md
```
for CLI commands and output
Testing a specific primitive → Read
```
test-patterns.md
```
for the relevant section
Debugging a failure → Combine
```
testing-workflow.md
```
(inspect traces) +
```
eval-format.md
```
(check assertion syntax)

Quick Reference

Eval file structure

typescript

import { Eval } from '@botpress/adk'

export default new Eval({
  name: 'greeting',
  type: 'regression',
  tags: ['basic'],

  setup: {
    state: { bot: { welcomeSent: false } },
    workflow: { trigger: 'onboarding', input: { userId: 'test-1' } },
  },

  conversation: [
    {
      user: 'Hi!',
      assert: {
        response: [
          { not_contains: 'error' },
          { llm_judge: 'Response is friendly and offers to help' },
        ],
        tools: [{ not_called: 'createTicket' }],
        state: [{ path: 'conversation.greeted', equals: true }],
      },
    },
  ],

  outcome: {
    state: [{ path: 'conversation.greeted', equals: true }],
  },

  options: {
    idleTimeout: 20000,
    judgePassThreshold: 4,
  },
})

Turn types

Turn	When to use
`user: 'message'`	Standard user message
`event: { type, payload }`	Non-message trigger (webhook, integration event)
`expectSilence: true`	Assert bot does NOT respond

Assertion categories

Category	What it checks
`response`	Bot reply text (contains, matches, llm_judge, similar_to)
`tools`	Tool calls (called, not_called, call_order, params)
`state`	Bot/user/conversation state (equals, changed)
`tables`	Table rows (row_exists, row_count)
`workflow`	Workflow execution (entered, completed)
`timing`	Response time in ms (lte, gte)

CLI commands

bash

adk evals                        # run all evals
adk evals <name>                 # run one eval
adk evals --tag <tag>            # filter by tag
adk evals --type regression      # filter by type
adk evals --verbose              # show all assertions
adk evals --format json          # JSON output for CI

adk evals runs                   # list recent runs
adk evals runs --latest          # most recent run
adk evals runs --latest -v       # with full details

Critical Patterns

✅ Every turn needs
user
or
event

typescript

// CORRECT
{ user: 'hello', expectSilence: true }
{ event: { type: 'payment.failed' }, expectSilence: true }

❌ expectSilence
alone is not a valid turn

typescript

// WRONG — missing user or event
{ expectSilence: true }

✅ Assert tool params to verify correct extraction

typescript

// CORRECT — verifies the LLM extracted the right values
{ called: 'createTicket', params: { priority: { equals: 'high' } } }

❌ Only asserting the tool was called

typescript

// INCOMPLETE — doesn't verify params were correct
{ called: 'createTicket' }

✅ Use
outcome
for post-conversation state and table assertions

typescript

// CORRECT — final state checked once after all turns
outcome: {
  state: [{ path: 'conversation.resolved', equals: true }],
  tables: [{ table: 'ticketsTable', row_exists: { status: { equals: 'open' } } }],
}

❌ Checking tables in per-turn assertions when the write happens at the end

typescript

// WRONG — table may not be written until after all turns
conversation: [
  {
    user: 'Create a ticket',
    assert: { tables: [{ table: 'ticketsTable', row_exists: { status: { equals: 'open' } } }] },
  },
]

✅ Seed state to test conditional behavior without running setup turns

typescript

// CORRECT — start in a known state
setup: {
  state: {
    user: { plan: 'pro' },
    conversation: { phase: 'support' },
  },
}

❌ Using conversation turns to set up state (slow and fragile)

typescript

// WRONG — depends on the bot correctly processing setup turns
conversation: [
  { user: 'I am on the pro plan' },      // hoping bot sets user.plan
  { user: 'I need help with billing' },   // actual test turn
]

Example Questions

Writing evals:

"Write an eval that tests my createTicket tool is called with the right priority"
"How do I assert that the bot stays silent after an internal event?"
"How do I test a multi-turn conversation where context is retained?"

Running evals:

"How do I run only regression evals?"
"How do I see which assertions failed and why?"
"How do I integrate evals into GitHub Actions?"

Debugging:

"My eval says the tool wasn't called but I think it was — how do I check?"
"How do I inspect what the bot actually did during an eval?"

Per-primitive:

"How do I test a workflow that uses step.sleep()?"
"How do I verify a row was written to a table after a conversation?"
"How do I test that state changed from the seeded value?"

Response Format

When helping a developer write an eval:

Show the complete
```
new Eval({})
```
call with realistic field values
Include imports (
```
import { Eval } from '@botpress/adk'
```
)
Explain each assertion and why it's the right choice for that scenario
Point out any mutual exclusivity rules if relevant (
```
expectSilence
```
vs
```
assert.response
```
,
```
user
```
vs
```
event
```
)
Suggest the CLI command to run it:
```
adk evals <name>
```

When helping debug a failing eval:

Ask for or show the failing assertion (
```
expected
```
/
```
actual
```
diff)
Suggest opening traces in the Control Panel to see what the bot did
Identify whether the issue is in the eval assertion or the bot's behavior

adk-evals

NPX Install

Tags

SKILL.md Content

ADK Evals Skill

What are Evals?

When to Use This Skill

Available Documentation

How to Answer

Quick Reference

Eval file structure

Turn types

Assertion categories

CLI commands

Critical Patterns

Example Questions

Response Format