behavioral-evals
Behavioral Evals
Overview
Behavioral evaluations (evals) are tests that validate the agent's decision-making (e.g., tool choice) rather than pure functionality. They are critical for verifying prompt changes, debugging steerability, and preventing regressions.
[!NOTE] Single Source of Truth: For core concepts, policies, running tests, and general best practices, always refer to evals/README.md.
🔄 Workflow Decision Tree
- Does a prompt/tool change need validation?
  - No -> Normal integration tests.
  - Yes -> Continue below.
- Is it UI/Interaction heavy?
  - Yes -> Use `appEvalTest` (AppRig). See creating.md.
  - No -> Use `evalTest` (TestRig). See creating.md.
- Is it a new test?
  - Yes -> Set policy to `USUALLY_PASSES`.
  - No -> `ALWAYS_PASSES` (locks in regression).
- Are you fixing a failure or promoting a test?
  - Fixing -> See fixing.md.
  - Promoting -> See promoting.md.
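The first three questions of the tree can be read as a small pure function. The sketch below is only an illustration of that reading: `ChangeInfo`, `Plan`, and `planEval` are hypothetical names and not part of the eval framework. Only the identifiers `appEvalTest`, `evalTest`, `USUALLY_PASSES`, and `ALWAYS_PASSES` come from this document.

```typescript
// Hypothetical encoding of the decision tree; all type and function names are illustrative.
type Harness = "appEvalTest" | "evalTest";
type Policy = "USUALLY_PASSES" | "ALWAYS_PASSES";

interface ChangeInfo {
  needsValidation: boolean; // Does a prompt/tool change need validation?
  uiHeavy: boolean; // Is it UI/Interaction heavy?
  isNewTest: boolean; // Is it a new test?
}

type Plan =
  | { kind: "integration-test" }
  | { kind: "behavioral-eval"; harness: Harness; policy: Policy };

function planEval(change: ChangeInfo): Plan {
  // No validation needed: fall back to normal integration tests.
  if (!change.needsValidation) {
    return { kind: "integration-test" };
  }
  return {
    kind: "behavioral-eval",
    // UI-heavy scenarios use appEvalTest (AppRig); everything else uses evalTest (TestRig).
    harness: change.uiHeavy ? "appEvalTest" : "evalTest",
    // New tests start as USUALLY_PASSES; existing ones are ALWAYS_PASSES.
    policy: change.isNewTest ? "USUALLY_PASSES" : "ALWAYS_PASSES",
  };
}
```

For example, `planEval({ needsValidation: true, uiHeavy: false, isNewTest: true })` selects `evalTest` with the `USUALLY_PASSES` policy.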
📋 Quick Checklist
1. Setup Workspace
Seed the workspace with necessary files using the `files` object to simulate a realistic scenario (e.g., a Node.js project with `package.json`).
- Details in creating.md
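As a rough illustration of what seeding means, the sketch below assumes the `files` object maps relative paths to file contents; that shape is an assumption, and creating.md remains the authority on the real format. `seedWorkspace` is a hypothetical stand-alone helper written with Node's `fs` module. It is not the rig's API.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Assumed shape: relative path -> file contents.
const files: Record<string, string> = {
  "package.json": JSON.stringify(
    { name: "demo-app", version: "1.0.0", scripts: { test: "vitest run" } },
    null,
    2,
  ),
  "src/index.js": "console.log('hello');\n",
};

// Hypothetical helper: writes every entry into a fresh temp directory and returns its path.
function seedWorkspace(seed: Record<string, string>): string {
  const root = fs.mkdtempSync(path.join(os.tmpdir(), "eval-workspace-"));
  for (const [relPath, contents] of Object.entries(seed)) {
    const target = path.join(root, relPath);
    // Create parent directories so nested paths like src/index.js work.
    fs.mkdirSync(path.dirname(target), { recursive: true });
    fs.writeFileSync(target, contents, "utf8");
  }
  return root;
}
```

Keeping the seed small and specific to the behavior under test makes it easier to attribute a failure to the agent's decision and not to workspace noise.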
2. Write Assertions
Audit agent decisions using `rig.setBreakpoint()` (AppRig only) or index verification on `rig.readToolLogs()`.
- Details in creating.md
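Index verification can be pictured as comparing the positions of tool calls in the log. The sketch below does not call `rig.readToolLogs()`, because its return shape is not specified here. Instead it works on a hand-written array with an assumed entry shape (`ToolLogEntry`), and the tool names `read_file` and `write_file` are placeholders.

```typescript
// Assumed log entry shape; the real structure may differ (see creating.md).
interface ToolLogEntry {
  name: string;
  args: Record<string, unknown>;
}

// Returns the position of the first call to the named tool, or -1 if it was never called.
function toolIndex(logs: ToolLogEntry[], name: string): number {
  return logs.findIndex((entry) => entry.name === name);
}

// True only when both tools were called and the first call to `earlier` precedes the first call to `later`.
function calledBefore(logs: ToolLogEntry[], earlier: string, later: string): boolean {
  const a = toolIndex(logs, earlier);
  const b = toolIndex(logs, later);
  return a !== -1 && b !== -1 && a < b;
}

// Stand-in for the data a rig would return.
const sampleLogs: ToolLogEntry[] = [
  { name: "read_file", args: { path: "package.json" } },
  { name: "write_file", args: { path: "src/index.js" } },
];
```

In a Vitest test this would typically be wrapped in an assertion such as `expect(calledBefore(logs, "read_file", "write_file")).toBe(true)`, which checks the decision (read before write) and not just the final file state.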
3. Verify
Run single tests locally with Vitest. Confirm stability locally before relying on CI workflows.
- See evals/README.md for running commands.
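One informal way to think about local stability is the pass rate over repeated runs. The sketch below is a hypothetical helper, not a framework feature, and it sets no threshold; see promoting.md for threshold guidance. It is synchronous for brevity, whereas a real eval run would be asynchronous and launched through the commands in evals/README.md.

```typescript
// Hypothetical helper: runs a check repeatedly and reports the fraction of passing runs.
function passRate(runOnce: () => boolean, attempts: number): number {
  if (attempts <= 0) {
    throw new RangeError("attempts must be positive");
  }
  let passes = 0;
  for (let i = 0; i < attempts; i++) {
    if (runOnce()) passes++;
  }
  return passes / attempts;
}

// Stand-in for a flaky eval: every fourth run fails.
let runCount = 0;
const flakyEval = (): boolean => ++runCount % 4 !== 0;

const rate = passRate(flakyEval, 8); // 6 of 8 runs pass
```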
📦 Bundled Resources
Detailed procedural guides:
- creating.md: Assertion strategies, Rig selection, Mock MCPs.
- fixing.md: Step-by-step automated investigation, architecture diagnosis guidelines.
- promoting.md: Candidate identification criteria and threshold guidelines.