
# Behavioral Evals


## Overview


Behavioral evaluations (evals) are tests that validate the agent's decision-making (e.g., tool choice) rather than pure functionality. They are critical for verifying prompt changes, debugging steerability, and preventing regressions.

> [!NOTE]
> **Single Source of Truth:** For core concepts, policies, running tests, and general best practices, always refer to `evals/README.md`.


## 🔄 Workflow Decision Tree


1. Does a prompt/tool change need validation?
   - No -> Normal integration tests.
   - Yes -> Continue below.
2. Is it UI/Interaction heavy?
   - Yes -> Use `appEvalTest` (`AppRig`). See `creating.md`.
   - No -> Use `evalTest` (`TestRig`). See `creating.md`.
3. Is it a new test?
   - Yes -> Set policy to `USUALLY_PASSES`.
   - No -> `ALWAYS_PASSES` (locks in the regression guard).
4. Are you fixing a failure or promoting a test?
   - Fixing -> See `fixing.md`.
   - Promoting -> See `promoting.md`.
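The policy step of the tree (step 3) can be sketched as a test declaration. This is a hedged illustration only: `evalTest`/`appEvalTest` and the policy names come from this repo's eval framework, and the option shape below is an assumption — see `creating.md` for the real signature.

```typescript
// Assumed shape of an eval declaration; the real one lives in the eval framework.
type Policy = "USUALLY_PASSES" | "ALWAYS_PASSES";

interface EvalTestDecl {
  name: string;
  policy: Policy;
}

// Step 3: a brand-new behavioral test starts at USUALLY_PASSES,
// because agent behavior is not yet proven stable.
const newTest: EvalTestDecl = {
  name: "agent reads package.json before running scripts",
  policy: "USUALLY_PASSES",
};

// Once stable, it is promoted (see promoting.md) to ALWAYS_PASSES,
// which locks the behavior in as a regression guard.
const promotedTest: EvalTestDecl = { ...newTest, policy: "ALWAYS_PASSES" };

console.log(newTest.policy, "->", promotedTest.policy);
```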


## 📋 Quick Checklist


### 1. Setup Workspace


Seed the workspace with necessary files using the `files` object to simulate a realistic scenario (e.g., a NodeJS project with `package.json`).

- Details in `creating.md`.
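A seeded workspace might look like the sketch below. The `files` field name matches the description above, but the surrounding option shape is an assumption for illustration — `creating.md` has the actual API.

```typescript
// Hedged sketch of workspace seeding via a `files` object:
// path -> file contents, forming a minimal but realistic NodeJS project.
const workspace = {
  files: {
    "package.json": JSON.stringify(
      { name: "demo-app", version: "1.0.0", scripts: { test: "vitest run" } },
      null,
      2,
    ),
    "src/index.js": "module.exports = () => 'hello';\n",
  },
};

console.log(Object.keys(workspace.files));
```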

### 2. Write Assertions


Audit agent decisions using `rig.setBreakpoint()` (AppRig only) or index verification on `rig.readToolLogs()`.

- Details in `creating.md`.
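Index verification over tool logs can be sketched as below. The log-entry shape returned by `rig.readToolLogs()` is an assumption here (see `creating.md` for the real fields); in a real eval, `logs` would come from the rig rather than a literal.

```typescript
// Assumed log-entry shape; the real fields come from rig.readToolLogs().
interface ToolLogEntry {
  toolName: string;
  args: Record<string, unknown>;
}

// Stand-in for: const logs = rig.readToolLogs();
const logs: ToolLogEntry[] = [
  { toolName: "read_file", args: { path: "package.json" } },
  { toolName: "run_shell_command", args: { command: "npm test" } },
];

// Audit the decision by index: the agent should inspect the project
// (read_file at index 0) before acting (run_shell_command at index 1).
function assertToolOrder(entries: ToolLogEntry[], expected: string[]): void {
  expected.forEach((name, i) => {
    const actual = entries[i]?.toolName;
    if (actual !== name) {
      throw new Error(`expected ${name} at index ${i}, got ${actual}`);
    }
  });
}

assertToolOrder(logs, ["read_file", "run_shell_command"]);
```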

### 3. Verify


Run single tests locally with Vitest. Confirm stability locally before relying on CI workflows.

- See `evals/README.md` for running commands.
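Assuming the standard Vitest CLI (the authoritative commands are in `evals/README.md`), a typical single-test run looks like this; the file path and test name here are illustrative:

```shell
# Run one eval file directly (path is illustrative).
npx vitest run evals/my-eval.test.ts

# Narrow to a single test within the file by name pattern.
npx vitest run evals/my-eval.test.ts -t "reads package.json first"
```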


## 📦 Bundled Resources


Detailed procedural guides:

- `creating.md`: Assertion strategies, Rig selection, Mock MCPs.
- `fixing.md`: Step-by-step automated investigation, architecture diagnosis guidelines.
- `promoting.md`: Candidate identification criteria and threshold guidelines.