
# Behavioral Evals


## Overview


Behavioral evaluations (evals) are tests that validate the agent's decision-making (e.g., tool choice) rather than pure functionality. They are critical for verifying prompt changes, debugging steerability, and preventing regressions.

> [!NOTE]
> **Single Source of Truth:** For core concepts, policies, running tests, and general best practices, always refer to `evals/README.md`.


## 🔄 Workflow Decision Tree


1. Does a prompt/tool change need validation?
   - No -> Normal integration tests.
   - Yes -> Continue below.
2. Is it UI/Interaction heavy?
   - Yes -> Use `appEvalTest` (`AppRig`). See `creating.md`.
   - No -> Use `evalTest` (`TestRig`). See `creating.md`.
3. Is it a new test?
   - Yes -> Set policy to `USUALLY_PASSES`.
   - No -> `ALWAYS_PASSES` (locks in the regression guard).
4. Are you fixing a failure or promoting a test?
   - Fixing -> See `fixing.md`.
   - Promoting -> See `promoting.md`.
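The policy step of the tree (step 3) can be sketched as a test declaration. This is a hedged illustration only: `evalTest`/`appEvalTest` and the policy names come from this repo's eval framework, and the option shape below is an assumption — see `creating.md` for the real signature.

```typescript
// Assumed shape of an eval declaration; the real one lives in the eval framework.
type Policy = "USUALLY_PASSES" | "ALWAYS_PASSES";

interface EvalTestDecl {
  name: string;
  policy: Policy;
}

// Step 3: a brand-new behavioral test starts at USUALLY_PASSES,
// because agent behavior is not yet proven stable.
const newTest: EvalTestDecl = {
  name: "agent reads package.json before running scripts",
  policy: "USUALLY_PASSES",
};

// Once stable, it is promoted (see promoting.md) to ALWAYS_PASSES,
// which locks the behavior in as a regression guard.
const promotedTest: EvalTestDecl = { ...newTest, policy: "ALWAYS_PASSES" };

console.log(newTest.policy, "->", promotedTest.policy);
```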


## 📋 Quick Checklist


### 1. Setup Workspace


Seed the workspace with necessary files using the `files` object to simulate a realistic scenario (e.g., a NodeJS project with `package.json`).

- Details in `creating.md`.
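A seeded workspace might look like the sketch below. The `files` field name matches the description above, but the surrounding option shape is an assumption for illustration — `creating.md` has the actual API.

```typescript
// Hedged sketch of workspace seeding via a `files` object:
// path -> file contents, forming a minimal but realistic NodeJS project.
const workspace = {
  files: {
    "package.json": JSON.stringify(
      { name: "demo-app", version: "1.0.0", scripts: { test: "vitest run" } },
      null,
      2,
    ),
    "src/index.js": "module.exports = () => 'hello';\n",
  },
};

console.log(Object.keys(workspace.files));
```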

### 2. Write Assertions


Audit agent decisions using `rig.setBreakpoint()` (AppRig only) or index verification on `rig.readToolLogs()`.

- Details in `creating.md`.
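Index verification over tool logs can be sketched as below. The log-entry shape returned by `rig.readToolLogs()` is an assumption here (see `creating.md` for the real fields); in a real eval, `logs` would come from the rig rather than a literal.

```typescript
// Assumed log-entry shape; the real fields come from rig.readToolLogs().
interface ToolLogEntry {
  toolName: string;
  args: Record<string, unknown>;
}

// Stand-in for: const logs = rig.readToolLogs();
const logs: ToolLogEntry[] = [
  { toolName: "read_file", args: { path: "package.json" } },
  { toolName: "run_shell_command", args: { command: "npm test" } },
];

// Audit the decision by index: the agent should inspect the project
// (read_file at index 0) before acting (run_shell_command at index 1).
function assertToolOrder(entries: ToolLogEntry[], expected: string[]): void {
  expected.forEach((name, i) => {
    const actual = entries[i]?.toolName;
    if (actual !== name) {
      throw new Error(`expected ${name} at index ${i}, got ${actual}`);
    }
  });
}

assertToolOrder(logs, ["read_file", "run_shell_command"]);
```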

### 3. Verify


Run single tests locally with Vitest. Confirm stability locally before relying on CI workflows.

- See `evals/README.md` for running commands.
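Assuming the standard Vitest CLI (the authoritative commands are in `evals/README.md`), a typical single-test run looks like this; the file path and test name here are illustrative:

```shell
# Run one eval file directly (path is illustrative).
npx vitest run evals/my-eval.test.ts

# Narrow to a single test within the file by name pattern.
npx vitest run evals/my-eval.test.ts -t "reads package.json first"
```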


## 📦 Bundled Resources


Detailed procedural guides:

- `creating.md`: Assertion strategies, Rig selection, Mock MCPs.
- `fixing.md`: Step-by-step automated investigation, architecture diagnosis guidelines.
- `promoting.md`: Candidate identification criteria and threshold guidelines.