cxas-sim-eval
CXAS Evaluation to Simulation Converter
This skill converts turn-by-turn CXAS golden evaluations into high-level, goal-oriented test cases for the SCRAPI `SimulationEvals` framework. It analyzes the agent's tools to enrich expectations with specific tool calls.
Steps
1. Check Environment
Ensure `cxas_scrapi` is installed as a Python package. You can check this by running:

```bash
python -c "import cxas_scrapi"
```

Ensure `gcloud` is authenticated properly:

```bash
gcloud auth list
```

If needed, log in with:

```bash
gcloud auth login
```
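The same checks can be scripted before any other step runs. The following is a minimal sketch, not part of the skill itself; the helper names `module_installed` and `cli_available` are hypothetical. It only confirms that the package can be located and that a `gcloud` executable is on `PATH`; it does not verify which account is active, so `gcloud auth list` is still needed for that.

```python
import importlib.util
import shutil


def module_installed(name: str) -> bool:
    """Return True if a top-level module can be located (without importing it)."""
    return importlib.util.find_spec(name) is not None


def cli_available(executable: str) -> bool:
    """Return True if the executable is found on PATH."""
    return shutil.which(executable) is not None


if __name__ == "__main__":
    print("cxas_scrapi importable:", module_installed("cxas_scrapi"))
    print("gcloud on PATH:", cli_available("gcloud"))
```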
2. Get App Name and Output Directory
[!IMPORTANT] You MUST ask the user for the full resource name of the app/agent (e.g., `projects/.../locations/.../apps/...`) and the base output directory before proceeding with any execution steps.
Ask the user for these values.
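Once the user supplies the resource name, it can be validated before any API call is made. The sketch below assumes the name has exactly the three-segment shape shown in the example above (`projects/<project>/locations/<location>/apps/<app>`); the function name `parse_app_name` is hypothetical, and the real API may accept other forms.

```python
import re

# Assumed shape, taken from the example in the text:
# projects/<project>/locations/<location>/apps/<app>
APP_NAME_RE = re.compile(
    r"^projects/(?P<project>[^/]+)"
    r"/locations/(?P<location>[^/]+)"
    r"/apps/(?P<app>[^/]+)$"
)


def parse_app_name(name: str) -> dict:
    """Split a full app resource name into its parts, or raise ValueError."""
    match = APP_NAME_RE.match(name.strip())
    if match is None:
        raise ValueError(f"Not a full app resource name: {name!r}")
    return match.groupdict()


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(parse_app_name("projects/my-project/locations/us/apps/my-app"))
```

Rejecting a malformed name at this point avoids running the fetch steps against the wrong resource.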
3. Fetch Evaluations
Fetch the list of evaluations using the CES API. Save each evaluation as a JSON file named after its display name under `[output_dir]/golden_evals/`.
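The saving half of this step can be sketched as follows. The CES API call itself is left out; the sketch assumes the evaluations are already available as a list of dictionaries and that the display name lives in a `displayName` field, which is an assumption rather than a documented field name. `safe_filename` and `save_evaluations` are hypothetical helpers.

```python
import json
import re
from pathlib import Path


def safe_filename(display_name: str) -> str:
    """Replace runs of characters that are awkward in file names with '_'."""
    cleaned = re.sub(r"[^A-Za-z0-9._-]+", "_", display_name).strip("_")
    return cleaned or "unnamed"


def save_evaluations(evaluations: list, output_dir: str) -> list:
    """Write one JSON file per evaluation under <output_dir>/golden_evals/."""
    target = Path(output_dir) / "golden_evals"
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for evaluation in evaluations:
        # "displayName" is an assumed field name for the display name.
        name = safe_filename(evaluation.get("displayName", ""))
        path = target / f"{name}.json"
        path.write_text(
            json.dumps(evaluation, indent=2, ensure_ascii=False),
            encoding="utf-8",
        )
        written.append(path)
    return written
```

Two evaluations whose display names sanitize to the same string would overwrite each other in this sketch; a real implementation would need to handle that case.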
4. Fetch Tool Schemas
Fetch the full schemas for all tools available in the app and save them under `[output_dir]/tools/`.
5. Fetch Agent Tools Configuration
Fetch the list of tools and toolsets used by the agent and save the configuration (e.g., to `[output_dir]/agent_tools.json`).
6. Convert Evaluations
Run the conversion script (`convert_eval.py`) to process the fetched evaluations and save the converted test cases under `[output_dir]/sim_evals/`.
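After steps 3 to 6, the output directory should contain the paths named above. The sketch below is one way to check which of them are still missing; `missing_artifacts` is a hypothetical helper, and `agent_tools.json` is only the example file name given in step 5, so adjust it if the configuration was saved elsewhere.

```python
from pathlib import Path

# Paths named in steps 3 to 6, relative to the base output directory.
EXPECTED_LAYOUT = {
    "golden_evals": "golden_evals",
    "tools": "tools",
    "agent_tools": "agent_tools.json",  # example name from step 5
    "sim_evals": "sim_evals",
}


def missing_artifacts(output_dir: str) -> list:
    """Return the keys of EXPECTED_LAYOUT whose paths do not exist yet."""
    base = Path(output_dir)
    return [key for key, rel in EXPECTED_LAYOUT.items() if not (base / rel).exists()]
```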
Automation Scripts
Four scripts are available to automate the process:
1. Fetch Evaluations and Agent Config
`scripts/fetch_app_data.py`: Fetches evaluations and the list of tools used by the agent from the CES API.
Usage:

```bash
python .agents/skills/cxas-sim-eval/scripts/fetch_app_data.py \
  --app-name "projects/.../locations/.../apps/..." \
  --output-dir /path/to/output_directory
```
2. Fetch Tool Schemas
`scripts/fetch_tool_schemas.py`: Fetches the full schemas for all tools available in the app.
Usage:

```bash
python .agents/skills/cxas-sim-eval/scripts/fetch_tool_schemas.py \
  --app-name "projects/.../locations/.../apps/..." \
  --output-dir /path/to/output_directory
```
3. Convert Evaluations
`scripts/convert_eval.py`: Converts the fetched evaluations to simulation test cases, using the fetched tool schemas to infer expectations.
Usage:

```bash
python .agents/skills/cxas-sim-eval/scripts/convert_eval.py \
  --output-dir /path/to/output_directory \
  --parallelism 5
```
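The fetch and convert scripts can be chained from a small driver. This is a sketch under the assumption that the scripts are run from the repository root and accept exactly the flags shown in the usage examples; `build_commands` and `run_pipeline` are hypothetical names, not part of the skill.

```python
import subprocess
import sys

# Script location as shown in the usage examples.
SCRIPTS_DIR = ".agents/skills/cxas-sim-eval/scripts"


def build_commands(app_name: str, output_dir: str, parallelism: int = 5) -> list:
    """Build the fetch, schema, and convert commands in the documented order."""
    py = sys.executable
    return [
        [py, f"{SCRIPTS_DIR}/fetch_app_data.py",
         "--app-name", app_name, "--output-dir", output_dir],
        [py, f"{SCRIPTS_DIR}/fetch_tool_schemas.py",
         "--app-name", app_name, "--output-dir", output_dir],
        [py, f"{SCRIPTS_DIR}/convert_eval.py",
         "--output-dir", output_dir, "--parallelism", str(parallelism)],
    ]


def run_pipeline(app_name: str, output_dir: str, parallelism: int = 5) -> None:
    """Run each command in turn, stopping at the first non-zero exit code."""
    for command in build_commands(app_name, output_dir, parallelism):
        subprocess.run(command, check=True)
```

Passing each command as a list avoids shell quoting problems with the resource name, and `check=True` keeps the conversion from running on incomplete fetch output.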
4. Run Evaluations
`scripts/run_evals.py`: Runs the simulation evaluations, logs raw results, and generates a combined HTML report.
Cognitive Diagnostics Analysis:
If the agent has the `intercept_and_score_reasoning` tool enabled, this script automatically extracts and analyzes the agent's internal monologue for failed evaluations. It detects issues such as overthinking, hesitation, and backtracking. It also correlates these diagnostics with the agent's instructions to generate actionable suggestions for improvement directly in the HTML report.
Usage:

```bash
python .agents/skills/cxas-sim-eval/scripts/run_evals.py \
  --app-name "projects/.../locations/.../apps/..." \
  --output-dir /path/to/output_directory \
  --parallelism 5 \
  --start-index 0 \
  --end-index 10
```
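The `--start-index` and `--end-index` flags suggest that a large test suite can be run in slices. The sketch below computes consecutive index windows for that purpose. It assumes the end index is exclusive, which the usage example does not confirm, so check the script's behavior before relying on it; `index_windows` is a hypothetical helper.

```python
def index_windows(total: int, batch_size: int) -> list:
    """Split range(total) into consecutive (start, end) windows, end exclusive."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [
        (start, min(start + batch_size, total))
        for start in range(0, total, batch_size)
    ]


if __name__ == "__main__":
    # Each pair could be passed as --start-index / --end-index.
    for start, end in index_windows(25, 10):
        print(start, end)
```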
Interpreting Cognitive Diagnostics
When running evaluations with the `intercept_and_score_reasoning` tool enabled, the system extracts diagnostics to help you identify issues in agent reasoning.
Key Signals
- Overthinking (Verbosity)
  - Symptom: Internal monologue exceeds 350 or 600 characters.
  - Meaning: The agent is struggling to process complex or circular instructions.
  - Fix: Simplify instructions. Break down complex tasks into smaller, linear steps.
- Hedging
  - Symptom: Use of words like "might be", "guess", "unsure", "assume".
  - Meaning: The agent is uncertain about its next action, often due to missing edge case handling.
  - Fix: Add explicit instructions for the scenario the agent is unsure about.
- Backtracking
  - Symptom: Use of words like "wait", "actually", "on second thought".
  - Meaning: The agent is abandoning a plan mid-turn or correcting itself, indicating unclear triggers.
  - Fix: Clarify triggers and state transitions in instructions.
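The three signals can be approximated with simple length and phrase checks. This is an illustrative sketch, not the logic used by `run_evals.py`: the thresholds and phrases are taken from the list above, but treating 350 characters as a warning level and 600 as a severe level is an assumption, as is the whole-word matching. The names `diagnose` and `_find` are hypothetical.

```python
import re

# Thresholds and phrases come from the Key Signals list; reading 350 as a
# "warning" level and 600 as a "severe" level is an assumption of this sketch.
WARN_CHARS = 350
SEVERE_CHARS = 600
HEDGING = ("might be", "guess", "unsure", "assume")
BACKTRACKING = ("wait", "actually", "on second thought")


def _find(phrases, text: str) -> list:
    """Return the phrases that occur in text as whole words, case-insensitively."""
    lowered = text.lower()
    return [p for p in phrases if re.search(rf"\b{re.escape(p)}\b", lowered)]


def diagnose(monologue: str) -> dict:
    """Flag overthinking, hedging, and backtracking in one monologue string."""
    length = len(monologue)
    if length > SEVERE_CHARS:
        overthinking = "severe"
    elif length > WARN_CHARS:
        overthinking = "warning"
    else:
        overthinking = None
    return {
        "overthinking": overthinking,
        "hedging": _find(HEDGING, monologue),
        "backtracking": _find(BACKTRACKING, monologue),
    }


if __name__ == "__main__":
    print(diagnose("Wait, I guess the user might be asking about refunds."))
```

Whole-word matching keeps words such as "awaiting" from triggering the "wait" signal, at the cost of missing inflected forms such as "assuming".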