cxas-sim-eval

CXAS Evaluation to Simulation Converter

This skill helps convert turn-by-turn CXAS golden evaluations into high-level, goal-oriented test cases for the SCRAPI `SimulationEvals` framework. It analyzes the agent's tools to enrich expectations with specific tool calls.

Steps

1. Check Environment

Ensure `cxas_scrapi` is installed as a Python package. You can check this by running:

```bash
python -c "import cxas_scrapi"
```

Ensure `gcloud` is authenticated properly:

```bash
gcloud auth list
```

If needed, log in with:

```bash
gcloud auth login
```
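
The two checks above can also be scripted as a quick pre-flight. The following is a minimal sketch, not part of the skill: the function name `check_environment` is hypothetical, and it only confirms that the package can be located and that the `gcloud` executable is on `PATH`. It does not verify that an account is authenticated, so `gcloud auth list` is still needed for that.

```python
import importlib.util
import shutil


def check_environment(package: str = "cxas_scrapi", cli: str = "gcloud") -> dict:
    """Report whether the Python package is importable and the CLI is on PATH."""
    return {
        # find_spec returns None when a top-level package cannot be located.
        "package_installed": importlib.util.find_spec(package) is not None,
        # shutil.which returns None when the executable is not on PATH.
        "cli_on_path": shutil.which(cli) is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```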

2. Get App Name and Output Directory

[!IMPORTANT] You MUST ask the user for the full resource name of the app/agent (e.g., `projects/.../locations/.../apps/...`) and the base output directory before proceeding with any execution steps.

Ask the user for these values.
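
Before using the value the user provides, it may help to confirm that it looks like a full resource name rather than a bare app ID. The sketch below assumes the three-segment shape shown in the example above (`projects/<id>/locations/<id>/apps/<id>`); the pattern and the helper name `is_full_app_name` are illustrative, not taken from the skill's scripts.

```python
import re

# Assumed shape, based on the example above:
# projects/<id>/locations/<id>/apps/<id>, with no empty or extra segments.
APP_NAME_RE = re.compile(r"projects/[^/\s]+/locations/[^/\s]+/apps/[^/\s]+")


def is_full_app_name(name: str) -> bool:
    """Return True when the string has the full three-part resource shape."""
    return APP_NAME_RE.fullmatch(name.strip()) is not None
```

A short app ID such as `my-app` fails this check, which is a reasonable cue to ask the user again for the full name.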

3. Fetch Evaluations

Fetch the list of evaluations using the CES API. Save each evaluation as a JSON file named after its display name under `[output_dir]/golden_evals/`.
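
Display names can contain spaces or slashes that are awkward in file names, so the save step usually needs some sanitizing. The sketch below shows one way to do it under stated assumptions: evaluations are already available as a list of dictionaries, and the display name lives under a `displayName` key. Both the key and the helper names are hypothetical; the actual response shape of the CES API is not shown here.

```python
import json
import re
from pathlib import Path


def safe_filename(display_name: str) -> str:
    """Replace runs of characters that are unsafe in file names with underscores."""
    cleaned = re.sub(r"[^\w.-]+", "_", display_name.strip()).strip("_.")
    return cleaned or "unnamed_eval"


def save_evaluations(evaluations: list[dict], output_dir: str) -> list[Path]:
    """Write each evaluation to <output_dir>/golden_evals/<display name>.json."""
    target = Path(output_dir) / "golden_evals"
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for evaluation in evaluations:
        # "displayName" is an assumed key; adjust to the actual response shape.
        name = safe_filename(evaluation.get("displayName", ""))
        path = target / f"{name}.json"
        path.write_text(
            json.dumps(evaluation, indent=2, ensure_ascii=False), encoding="utf-8"
        )
        written.append(path)
    return written
```

Note that two evaluations whose display names sanitize to the same string would overwrite each other in this sketch; appending an index or ID would avoid that.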

4. Fetch Tool Schemas

Fetch the full schemas for all tools available in the app and save them under `[output_dir]/tools/`.

5. Fetch Agent Tools Configuration

Fetch the list of tools and toolsets used by the agent and save the configuration (e.g., to `[output_dir]/agent_tools.json`).

6. Convert Evaluations

Run the conversion script (`convert_eval.py`) to process the fetched evaluations and save the converted test cases under `[output_dir]/sim_evals/`.
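
After the six steps, the output directory should contain the paths named above. The sketch below checks for them; it is an illustration rather than part of the skill. The helper name `missing_outputs` is hypothetical, and `agent_tools.json` is only the example file name given in step 5, so adjust it if the configuration was saved elsewhere.

```python
from pathlib import Path

# Paths named in the steps above, relative to the base output directory.
EXPECTED_DIRS = ("golden_evals", "tools", "sim_evals")
# Example file name from step 5; treated here as an assumption.
EXPECTED_FILES = ("agent_tools.json",)


def missing_outputs(output_dir: str) -> list[str]:
    """Return the expected directories and files that do not exist yet."""
    base = Path(output_dir)
    missing = [f"{d}/" for d in EXPECTED_DIRS if not (base / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (base / f).is_file()]
    return missing
```

An empty list means every expected path exists; it says nothing about whether the files inside are complete or valid.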

Automation Scripts

Four scripts are available to automate the process:

1. Fetch Evaluations and Agent Config

`scripts/fetch_app_data.py`

Fetches evaluations and the list of tools used by the agent from the CES API.

Usage:

```bash
python .agents/skills/cxas-sim-eval/scripts/fetch_app_data.py \
  --app-name "projects/.../locations/.../apps/..." \
  --output-dir /path/to/output_directory
```

2. Fetch Tool Schemas

`scripts/fetch_tool_schemas.py`

Fetches the full schemas for all tools available in the app.

Usage:

```bash
python .agents/skills/cxas-sim-eval/scripts/fetch_tool_schemas.py \
  --app-name "projects/.../locations/.../apps/..." \
  --output-dir /path/to/output_directory
```

3. Convert Evaluations

`scripts/convert_eval.py`

Converts the fetched evaluations to simulation test cases, using the fetched tool schemas to infer expectations.

Usage:

```bash
python .agents/skills/cxas-sim-eval/scripts/convert_eval.py \
  --output-dir /path/to/output_directory \
  --parallelism 5
```

4. Run Evaluations

`scripts/run_evals.py`

Runs the simulation evaluations, logs raw results, and generates a combined HTML report.

Cognitive Diagnostics Analysis: If the agent has the `intercept_and_score_reasoning` tool enabled, this script automatically extracts and analyzes the agent's internal monologue for failed evaluations. It detects issues such as overthinking, hedging, and backtracking. It also correlates these diagnostics with the agent's instructions and adds actionable suggestions for improvement directly to the HTML report.

Usage:

```bash
python .agents/skills/cxas-sim-eval/scripts/run_evals.py \
  --app-name "projects/.../locations/.../apps/..." \
  --output-dir /path/to/output_directory \
  --parallelism 5 \
  --start-index 0 \
  --end-index 10
```

Interpreting Cognitive Diagnostics

When running evaluations with the `intercept_and_score_reasoning` tool enabled, the system extracts diagnostics to help you identify issues in agent reasoning.

Key Signals

Overthinking (Verbosity)
- Symptom: Internal monologue exceeds 350 or 600 characters.
- Meaning: The agent is struggling to process complex or circular instructions.
- Fix: Simplify instructions. Break down complex tasks into smaller, linear steps.

Hedging
- Symptom: Use of words like "might be", "guess", "unsure", "assume".
- Meaning: The agent is uncertain about its next action, often due to missing edge case handling.
- Fix: Add explicit instructions for the scenario the agent is unsure about.

Backtracking
- Symptom: Use of words like "wait", "actually", "on second thought".
- Meaning: The agent is abandoning a plan mid-turn or correcting itself, indicating unclear triggers.
- Fix: Clarify triggers and state transitions in instructions.
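
To make the three signals concrete, here is a minimal heuristic sketch built only from the symptoms listed above. It is not the logic used by `run_evals.py`: the function name `diagnose` is hypothetical, and treating 350 and 600 characters as two severity levels is an assumption, since the text gives both numbers without saying how they differ.

```python
import re

# Thresholds and phrases come from the symptom lists above. Reading 350 and 600
# as two severity levels is an assumption of this sketch.
VERBOSE_CHARS = 350
VERY_VERBOSE_CHARS = 600
HEDGING = ("might be", "guess", "unsure", "assume")
BACKTRACKING = ("wait", "actually", "on second thought")


def _contains_any(text: str, phrases: tuple[str, ...]) -> bool:
    # Word-boundary match so that "wait" does not fire on "waiting".
    return any(re.search(rf"\b{re.escape(p)}\b", text) for p in phrases)


def diagnose(monologue: str) -> list[str]:
    """Return the names of the signals detected in one internal monologue."""
    text = monologue.lower()
    signals = []
    if len(monologue) > VERY_VERBOSE_CHARS:
        signals.append("overthinking (severe)")
    elif len(monologue) > VERBOSE_CHARS:
        signals.append("overthinking")
    if _contains_any(text, HEDGING):
        signals.append("hedging")
    if _contains_any(text, BACKTRACKING):
        signals.append("backtracking")
    return signals
```

Exact word matching keeps false positives down but misses inflected forms such as "assuming", so a real detector would likely need a broader phrase list.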
