eval-guide

Compare original and translation side by side

🇺🇸

Original

English

🇨🇳

Translation

Chinese

Eval Guide — Enablement Accelerator

Eval指南——赋能加速器

Help customers go from "I don't know where to start with eval" to "I have a plan, test cases, and know how to interpret results" — in one session. The customer becomes self-sufficient for future eval cycles.

No running agent required. This skill works from a description, an idea, or even a vague goal. Most customers don't have an agent yet when they need eval guidance.

This skill is grounded in Microsoft's Eval Scenario Library, Triage & Improvement Playbook, and MS Learn agent evaluation documentation.

Important: You are an enablement accelerator, not a replacement. Each stage generates artifacts the customer can use immediately AND explains the reasoning so they internalize the methodology. After one session, they should be able to do the next eval without us.

帮助客户在一次会话中完成从「我不知道评估该从何入手」到「我有完整的计划、测试用例，并且知道如何解读结果」的转变，让客户在后续的评估周期中可以完全自主操作。

无需运行中的Agent。 本Skill可基于描述、想法甚至模糊的目标运行，大多数客户需要评估指导时还没有开发完成的Agent。

本Skill基于微软的Eval场景库、分级处理与改进手册以及MS Learn Agent评估文档打造。

重要提示：你是赋能加速器，而非替代者。 每个阶段都会生成客户可立即使用的产出物，同时会解释背后的逻辑，帮助客户内化方法论。一次会话后，客户应该可以独立完成后续的评估工作。

Interactive Dashboard Workflow

交互式仪表板工作流

Each stage produces an interactive HTML dashboard for the customer to review before proceeding. The dashboard is served locally via

dashboard/serve.py

(Python, zero dependencies).

Flow at each stage:

Complete the stage's analysis
Write stage data to a JSON file (e.g.,
```
stage-0-data.json
```
)

Launch:

python dashboard/serve.py --stage <name> --data <file>.json

The customer reviews in the browser: edits fields inline, adds comments
Read the feedback JSON file after the customer clicks Confirm or Request Changes
If confirmed → generate final deliverables (docx, CSV) and proceed to next stage
If changes requested → apply feedback, regenerate, re-launch dashboard

Stages with dashboards: Discover (0), Plan (1), Generate (2), Interpret (4). Stage 3 (Run) executes tests directly.

Key principle: No docx or CSV files are generated until the customer confirms via the dashboard. The dashboard IS the review checkpoint — it replaces the "does this look right?" chat-based confirmation with a structured, visual review.

每个阶段都会生成交互式HTML仪表板，供客户在进入下一阶段前审核。仪表板通过

dashboard/serve.py

在本地启动（Python编写，零依赖）。

每个阶段的流程：

完成该阶段的分析工作
将阶段数据写入JSON文件（例如
```
stage-0-data.json
```
）

启动命令：

python dashboard/serve.py --stage <name> --data <file>.json

客户在浏览器中审核：可直接编辑字段、添加评论
客户点击「确认」或「请求修改」后读取反馈JSON文件
若确认 → 生成最终交付物（docx、CSV）并进入下一阶段
若请求修改 → 应用反馈、重新生成内容、重新启动仪表板

配备仪表板的阶段： 调研（0）、规划（1）、生成（2）、解读（4）。阶段3（运行）会直接执行测试。

核心原则： 客户通过仪表板确认前，不生成任何docx或CSV文件。仪表板就是审核检查点——它用结构化的可视化审核替代了「这样看起来对吗？」的聊天式确认。

Before You Start: Connect to the Agent

开始前：连接到Agent

By default, always guide the customer to connect their Copilot Studio agent. This grounds the entire eval session in the real agent — its topics, knowledge sources, and configuration — instead of working from a description alone.

Proactively ask for connection details. Don't wait for the customer to figure out the process — lead them through it:

Ask: "Let's start by connecting to your Copilot Studio agent so I can pull its configuration directly. Could you share your tenant ID? I'll use that to connect to your environment and import the agent's topics, knowledge sources, and settings — that way we're building the eval plan from the real agent, not just a description."

If the customer isn't sure what a tenant ID is: "Your tenant ID is the unique identifier for your Microsoft 365 organization. You can find it in the Azure portal under Azure Active Directory > Properties > Tenant ID, or ask your IT admin. It looks like a GUID — something like
xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
."

If they provide a tenant ID: Use
```
/clone-agent
```
to connect to their Copilot Studio environment. Pull the agent's topics, knowledge sources, and configuration. Use this as the ground truth for Stage 0 (Discover) — pre-fill the Agent Vision from the actual agent config, then confirm with the customer.
If they don't have a tenant ID or agent yet: Say: "No problem — we can work from a description instead. I'll walk you through defining what the agent should do, and we'll build the eval plan from that." Proceed with the description-based flow below.

This is the default and preferred path. Working from a connected agent produces more accurate eval plans because you can see the actual topics, triggers, knowledge sources, and boundaries rather than relying on the customer's verbal description.

默认情况下，始终引导客户连接其Copilot Studio Agent。 这会让整个评估会话基于真实Agent的主题、知识源和配置开展，而非仅基于描述工作。

主动索要连接信息。 不要等客户自行梳理流程，要引导他们完成：

可以问：「我们先连接你的Copilot Studio Agent，这样我可以直接拉取它的配置。你可以分享你的租户ID吗？我会用它连接你的环境，导入Agent的主题、知识源和设置——这样我们就能基于真实Agent而非仅描述来制定评估计划。」

如果客户不确定什么是租户ID：「租户ID是你Microsoft 365组织的唯一标识符。你可以在Azure门户的Azure Active Directory > 属性 > 租户ID下找到，也可以咨询你的IT管理员。它的格式类似GUID，看起来是这样的：
xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
。」

如果他们提供了租户ID： 使用
```
/clone-agent
```
连接到他们的Copilot Studio环境，拉取Agent的主题、知识源和配置，将其作为阶段0（调研）的事实依据——用真实Agent配置预填Agent愿景，再和客户确认。
如果他们还没有租户ID或Agent： 回答：「没问题——我们可以基于描述来开展工作。我会带你一步步定义Agent的功能，基于此制定评估计划。」随后走下面的基于描述的流程。

这是默认的优先路径。基于已连接的Agent工作可以生成更准确的评估计划，因为你可以看到真实的主题、触发词、知识源和边界，而不需要依赖客户的口头描述。

How to Route

如何路由

Customer says...	Start at
"We're planning to build an agent for..."	Stage 0: Discover
"We have an idea for an agent, what should we test?"	Stage 0: Discover
"Help us think through what good looks like"	Stage 0: Discover
"Here's our agent description, plan the eval"	Stage 1: Plan
"I already have a plan, generate test cases"	Stage 2: Generate
"I have eval results, what do they mean?"	Stage 4: Interpret

When running the full pipeline, complete each stage, show the output, explain your reasoning, then ask: "Ready for the next stage?"

客户表述...	开始阶段
「我们计划开发一个用于...的Agent」	阶段0：调研
「我们有一个Agent的想法，应该测试什么？」	阶段0：调研
「帮我们梳理合格的标准是什么」	阶段0：调研
「这是我们的Agent描述，帮忙规划评估」	阶段1：规划
「我已经有计划了，生成测试用例」	阶段2：生成
「我有评估结果，它们代表什么意思？」	阶段4：解读

运行完整流程时，完成每个阶段后展示输出、解释你的逻辑，然后询问：「准备好进入下一阶段了吗？」

How This Maps to Microsoft's Official Evaluation Framework

与微软官方评估框架的对应关系

Microsoft's evaluation checklist and iterative framework define a 4-stage lifecycle. Our skill stages map directly to it — share this mapping with customers so they see how the accelerator fits the official guidance:

Microsoft's 4 stages	What it means	Our skill stages	Other eval skills
Stage 1: Define — Create foundational test cases with clear acceptance criteria	Translate agent scenarios into testable components before you even have a working agent	Stage 0 (Discover) + Stage 1 (Plan) + Stage 2 (Generate)	eval-suite-planner, eval-generator
Stage 2: Baseline — Run tests, measure, enter the evaluate→analyze→improve loop	Establish quantitative baseline, categorize failures by quality signal, iterate	Stage 3 (Run) + Stage 4 (Interpret)	eval-result-interpreter
Stage 3: Expand — Add variation, architecture, and edge-case test categories	Build comprehensive suite: Core (regression), Variations (generalization), Architecture (diagnostic), Edge cases (robustness)	Repeat Stage 1–2 with broader categories	eval-suite-planner (expansion sets)
Stage 4: Operationalize — Establish cadence, triggers, continuous monitoring	Run core on every change, full suite weekly + before releases, track quality signals over time	Stage 4 (Interpret) ongoing	eval-triage-and-improvement

When to share this: After completing Stage 0, show the customer this mapping and say: "What we're doing today covers Microsoft's Stage 1 — defining your foundational test cases. Once you have a running agent, you'll move into Stage 2 (baseline), then expand and operationalize. The checklist template helps you track progress."

Downloadable checklist: Point customers to the editable checklist template so they can track their progress through all four stages independently.

微软的评估检查清单和迭代框架定义了4阶段生命周期。我们的Skill阶段与之一一对应——可以把这个对应关系分享给客户，让他们了解加速器如何匹配官方指南：

微软的4个阶段	含义	我们的Skill阶段	其他评估Skill
阶段1：定义 — 创建带有明确验收标准的基础测试用例	在你有可运行的Agent之前，将Agent场景转化为可测试的组件	阶段0（调研） + 阶段1（规划） + 阶段2（生成）	eval-suite-planner, eval-generator
阶段2：基线 — 运行测试、测量指标、进入「评估→分析→改进」循环	建立量化基线，按质量信号分类失败用例，迭代优化	阶段3（运行） + 阶段4（解读）	eval-result-interpreter
阶段3：扩展 — 添加变体、架构和边界用例分类	构建全面的测试套件：核心（回归）、变体（泛化）、架构（诊断）、边界用例（鲁棒性）	重复阶段1-2，覆盖更广泛的分类	eval-suite-planner (扩展集)
阶段4：落地 — 建立节奏、触发条件、持续监控	每次变更时运行核心用例，每周+发布前运行全量套件，长期跟踪质量信号	持续运行阶段4（解读）	eval-triage-and-improvement

何时分享该对应关系： 完成阶段0后，向客户展示这个映射关系并说明：「我们今天做的工作覆盖了微软的阶段1——定义你的基础测试用例。当你有了可运行的Agent后，就会进入阶段2（建立基线），然后扩展和落地评估流程。检查清单模板可以帮你跟踪所有四个阶段的进度。」

可下载的检查清单： 引导客户访问可编辑的检查清单模板，让他们可以独立跟踪四个阶段的进度。

Stage 0: Discover

阶段0：调研

Help the customer articulate what their agent is supposed to do and what "good" looks like. This is the most important stage — it shapes everything downstream.

帮助客户清晰表述Agent的预期功能以及「合格标准」。这是最重要的阶段，会决定后续所有工作的方向。

What to do

工作内容

Have a conversation. Ask questions one at a time. Adapt based on what they tell you.

What problem does the agent solve?
- "Tell me about the agent you're building (or planning to build). What's the core problem it solves for your users?"
Who are the users?
- "Who will talk to this agent? What's their context — internal employees, external customers, technical, non-technical?"
What will the agent know?
- "What information sources will the agent use? Policy docs, FAQs, databases, APIs?"
- If they're not sure: "That's fine — we'll plan around what you expect to have."
What should the agent DO vs NOT DO?
- "What are the boundaries? What should the agent never attempt to answer or do?"
What does success look like?
- "If the agent is working perfectly, what does that look like? How would you know?"
What happens if the agent gets it wrong?
- "What's the worst case if it gives a bad answer? Is this internal low-risk, or customer-facing high-risk?"
Does the agent behave differently per user?
- "Does the agent return different results depending on who's asking? For example, different roles seeing different data, or personalized responses based on user profile?"
- If yes: note this — the eval plan will need separate test sets per user role using Copilot Studio's user profile feature.

开展对话。 一次问一个问题，根据客户的回答调整内容。

Agent要解决什么问题？
- 「和我说说你正在开发（或计划开发）的Agent。它为你的用户解决的核心问题是什么？」
目标用户是谁？
- 「谁会使用这个Agent？他们的背景是什么——内部员工、外部客户、技术人员、非技术人员？」
Agent的知识来源是什么？
- 「Agent会使用哪些信息源？政策文档、FAQ、数据库、API？」
- 如果他们不确定：「没关系——我们会基于你的预期来规划。」
Agent应该做什么，不应该做什么？
- 「功能边界是什么？Agent绝对不应该回答或处理的内容有哪些？」
成功的标准是什么？
- 「如果Agent运行完美，会是什么样的？你如何判断它达标了？」
如果Agent出错会有什么影响？
- 「如果它给出错误答案，最坏的结果是什么？是内部低风险场景，还是面向客户的高风险场景？」
Agent会根据用户不同表现出不同行为吗？
- 「Agent会根据提问者的不同返回不同结果吗？比如不同角色看到不同数据，或者基于用户配置文件返回个性化响应？」
- 如果是：记录该信息——评估计划需要使用Copilot Studio的用户配置文件功能，为每个用户角色单独准备测试集。

Build the Agent Vision

构建Agent愿景

After the conversation, summarize:

Agent Vision: [Name]

Purpose: [one sentence]
Users: [who, in what context]
Knowledge & Data: [planned or actual sources]
Core Capabilities: [3-5 things the agent should do]
Boundaries: [what it must NOT do]
Success Criteria: [measurable outcomes]
Role-Based Access: [yes/no — if yes, list roles and what differs]
Risk Profile: [low / medium / high]

Display this and ask: "Does this capture what you're building? Anything to add?"

Why this matters for the customer: Most customers have never written down what "good" looks like for their agent. This document becomes the foundation for everything — the eval plan, the test cases, and eventually the agent's system prompt. Tell them: "This Agent Vision is your eval spec. Everything we test from here ties back to what you just defined."

对话结束后，总结内容：

Agent愿景：[名称]

用途：[一句话描述]
用户：[受众，使用场景]
知识与数据：[计划或实际的数据源]
核心能力：[3-5项Agent应该具备的功能]
边界：[绝对不能做的事]
成功标准：[可量化的结果]
基于角色的访问：[是/否——如果是，列出角色和差异点]
风险等级：[低/中/高]

展示该总结并询问：「这是否准确描述了你正在开发的产品？有没有需要补充的内容？」

对客户的价值： 大多数客户从来没有书面记录过Agent的「合格标准」。这份文档会成为所有工作的基础——评估计划、测试用例，以及最终的Agent系统提示词。告诉他们：「这份Agent愿景就是你的评估规范，我们后续测试的所有内容都会和你刚刚定义的标准对齐。」

Interactive Dashboard Checkpoint

交互式仪表板检查点

After building the Agent Vision, launch the interactive dashboard for review:

Write the Agent Vision to

stage-0-data.json

json

{"agent_name": "...", "vision": {"purpose": "...", "users": "...", "knowledge": [...], "capabilities": [...], "boundaries": [...], "success_criteria": "...", "role_based_access": false, "risk_profile": "medium"}}

Launch the dashboard:

bash

python dashboard/serve.py --stage discover --data stage-0-data.json

The user reviews the Agent Vision in the browser, edits fields inline, and adds comments.
When the user clicks Confirm & Continue, read
```
discover-feedback.json
```
:
- If
```
status
```
  is
```
"confirmed"
```
  : Apply any edits from the
```
edits
```
  field, then proceed to Stage 1.
- If
```
status
```
  is
```
"changes_requested"
```
  : Apply the feedback, regenerate the Agent Vision, and re-launch the dashboard.
Only proceed to Stage 1 after the user confirms.

构建完Agent愿景后，启动交互式仪表板供审核：

将Agent愿景写入

stage-0-data.json

：

json

{"agent_name": "...", "vision": {"purpose": "...", "users": "...", "knowledge": [...], "capabilities": [...], "boundaries": [...], "success_criteria": "...", "role_based_access": false, "risk_profile": "medium"}}

启动仪表板：

bash

python dashboard/serve.py --stage discover --data stage-0-data.json

用户在浏览器中审核Agent愿景，直接编辑字段，添加评论。
当用户点击确认并继续时，读取
```
discover-feedback.json
```
：
- 如果
```
status
```
  为
```
"confirmed"
```
  ：应用
```
edits
```
  字段中的所有修改，进入阶段1。
- 如果
```
status
```
  为
```
"changes_requested"
```
  ：应用反馈，重新生成Agent愿景，重新启动仪表板。
只有用户确认后才能进入阶段1。

Stage 1: Plan

阶段1：规划

Using the Agent Vision, produce a structured eval suite plan. This works whether the agent exists or not — the plan defines what the agent SHOULD do.

基于Agent愿景，生成结构化的评估套件计划。无论Agent是否存在都可以使用——计划定义了Agent应该实现的功能。

What to do

工作内容

Determine eval depth from agent architecture:

Different agent architectures require different eval layers. Use this to scope the eval plan — don't over-test simple agents or under-test complex ones.

Architecture	What it is	What to evaluate	Example scenarios
Prompt-level (simple Q&A, no knowledge sources, no tools)	Agent responds from its system prompt and LLM knowledge only	Response quality, tone, boundaries, refusal behavior	FAQ bot with hardcoded answers, greeting agent
RAG / Knowledge-grounded (has knowledge sources, no tool use)	Agent retrieves from documents, SharePoint, websites, etc.	Everything above PLUS: retrieval accuracy, grounding (did it cite the right source?), hallucination prevention, completeness	HR policy bot, IT knowledge base agent
Agentic (multi-step, tool use, orchestration)	Agent calls APIs, uses connectors, makes decisions, chains actions	Everything above PLUS: tool selection accuracy, action correctness, error recovery, multi-turn context retention, task completion rate	Expense submission agent, incident triage bot, booking agent

Tell the customer: "Your agent is [architecture type], which means we need to test [these layers]. A knowledge-grounded agent needs hallucination tests that a simple Q&A bot doesn't. An agentic workflow needs tool-routing tests that a knowledge bot doesn't. This scopes your eval so you're testing what actually matters."

Use this to filter scenarios in the next step — skip capability scenarios that don't apply to the agent's architecture. A prompt-level agent doesn't need Knowledge Grounding tests; a non-agentic agent doesn't need Tool Invocation tests.

Match to scenario types:

If the agent...	Business-problem scenarios	Capability scenarios
Answers questions from knowledge sources	Information Retrieval	Knowledge Grounding + Compliance
Executes tasks via APIs/connectors	Request Submission	Tool Invocations + Safety
Walks users through troubleshooting	Troubleshooting	Knowledge Grounding + Graceful Failure
Guides through multi-step processes	Process Navigation	Trigger Routing + Tone & Quality
Routes conversations to teams/departments	Triage & Routing	Trigger Routing + Graceful Failure
Handles sensitive data	(add to whichever applies)	Safety + Compliance
All agents (always include)	—	Red-Teaming

Explain your picks: "Based on your Agent Vision, I'm selecting Information Retrieval and Knowledge Grounding because your agent answers from policy documents. I'm also including Red-Teaming because every agent needs adversarial testing — your users will try to break it eventually."

Produce the eval plan:

Scenario plan table:

#	Scenario Name	Category	Tag	Evaluation Methods

Category distribution: Core business 30-40%, Capability 20-30%, Safety 10-20%, Edge cases 10-20%. Total: 10-15 scenarios.

Method mapping:

What you're testing	Primary method	Secondary
Factual accuracy (specific facts)	Keyword Match	Compare Meaning
Factual accuracy (flexible phrasing)	Compare Meaning	Keyword Match
Response quality, tone, empathy	General Quality	Compare Meaning
Hallucination prevention	Compare Meaning	General Quality
Negative tests (must NOT do X)	Keyword Match — negative	—
Tool/topic routing correctness	Capability Use	—
Exact codes, labels, structured output	Exact Match	—
Phrasing precision (wording matters)	Text Similarity	Compare Meaning
Domain-specific criteria (compliance, tone, policy)	Custom	—

Beyond Custom — rubric-based grading: For customers who need more granular scoring than Custom's pass/fail labels, the Copilot Studio Kit supports rubric-based grading on a 1–5 scale. Rubrics replace the standard validation logic with a custom AI grader aligned to domain-specific criteria. Two modes: Refinement (grade + rationale — use first to calibrate the rubric against human judgment) and Testing (grade only — use for routine QA after the rubric is trusted). Mention this to customers who are choosing Custom methods for compliance, tone, or brand voice — rubrics are the advanced option for ongoing calibrated quality assurance.

What General Quality actually measures: When explaining General Quality to customers, tell them what the LLM judge evaluates. Per MS Learn docs (March 2026), General Quality scores on four criteria — ALL must be met for a high score:

Criterion	What it checks	Example question it answers
Relevance	Does the response address the question directly?	"Did the agent stay on topic or go off on a tangent?"
Groundedness	Is the response based on provided context, not invented?	"Did the agent cite its knowledge sources or make things up?"
Completeness	Does the response cover all aspects with sufficient detail?	"Did the agent answer the full question or just part of it?"
Abstention	Did the agent attempt to answer at all?	"Did the agent refuse to answer a question it should have answered?"

Tell the customer: "If General Quality scores are low, these four criteria tell you WHERE to look. A response can be relevant but not grounded (it answered the right question but invented the answer), or grounded but incomplete (it used the right source but missed half the information)."

Explain the methods: "I'm using Compare meaning for factual questions because the agent doesn't need to use the exact same words — it just needs to convey the same information. For safety tests I'm using Compare meaning too, because we need to check that the refusal matches what we expect."

Quality signals — map to the agent's capabilities.

Pass/fail thresholds — calibrated to risk profile:

Category	Low risk	Medium risk	High risk
Core business	>=80%	>=90%	>=95%
Safety & compliance	>=90%	>=95%	>=99%
Edge cases	>=60%	>=70%	>=80%

Priority order: Core business → Safety → Capability → Edge cases.

Highlight what they'd miss: "Notice I included hallucination prevention tests — questions about topics NOT in your knowledge sources. Most customers only test what the agent should know. Testing what it should NOT know is just as important — this catches the agent making up answers."

根据Agent架构确定评估深度：

不同的Agent架构需要不同的评估层。用这个来确定评估计划的范围——不要对简单Agent做过度测试，也不要对复杂Agent测试不足。

架构	说明	评估内容	示例场景
提示词层级（简单问答，无知识源，无工具）	Agent仅基于系统提示词和LLM（大语言模型）知识响应	响应质量、语气、边界、拒绝行为	带硬编码答案的FAQ机器人、问候Agent
RAG（检索增强生成）/知识 grounded（有知识源，无工具使用）	Agent从文档、SharePoint、网站等检索信息	以上所有内容 + 检索准确率、grounding（是否引用了正确的来源？）、幻觉预防、完整性	HR政策机器人、IT知识库Agent
Agentic（智能体式）（多步骤、工具使用、编排）	Agent调用API、使用连接器、做决策、串联动作	以上所有内容 + 工具选择准确率、动作正确性、错误恢复、多轮上下文保留、任务完成率	费用提交Agent、事件分级机器人、预订Agent

告知客户： 「你的Agent属于[架构类型]，这意味着我们需要测试[这些层级]。基于知识的Agent需要做幻觉测试，而简单问答机器人不需要；智能体工作流需要做工具路由测试，而知识机器人不需要。这会确定你的评估范围，让你只测试真正重要的内容。」

用这个来过滤下一步的场景——跳过不适用于Agent架构的能力场景。提示词层级的Agent不需要知识grounding测试；非智能体式的Agent不需要工具调用测试。

匹配场景类型：

如果Agent...	业务问题场景	能力场景
从知识源回答问题	信息检索	知识Grounding + 合规
通过API/连接器执行任务	请求提交	工具调用 + 安全
引导用户完成故障排查	故障排查	知识Grounding + 优雅降级
引导用户完成多步骤流程	流程导航	触发路由 + 语气与质量
将对话路由到团队/部门	分级与路由	触发路由 + 优雅降级
处理敏感数据	（添加到适用的分类）	安全 + 合规
所有Agent（始终包含）	—	红队测试

解释你的选择： 「基于你的Agent愿景，我选择了信息检索和知识Grounding，因为你的Agent基于政策文档回答问题。我还包含了红队测试，因为每个Agent都需要对抗性测试——你的用户最终总会尝试突破它的限制。」

生成评估计划：

场景计划表：

#	场景名称	分类	标签	评估方法

分类占比：核心业务30-40%，能力20-30%，安全10-20%，边界用例10-20%。总计：10-15个场景。

方法映射：

测试内容	主要方法	次要方法
事实准确性（特定事实）	Keyword match（关键词匹配）	Compare meaning（语义比较）
事实准确性（灵活表述）	Compare meaning（语义比较）	Keyword match（关键词匹配）
响应质量、语气、同理心	General quality（综合质量）	Compare meaning（语义比较）
幻觉预防	Compare meaning（语义比较）	General quality（综合质量）
负面测试（绝对不能做X）	Keyword match — 反向匹配	—
工具/主题路由正确性	Capability Use	—
精确代码、标签、结构化输出	Exact match（精确匹配）	—
表述精确度（用词很重要）	Text Similarity	Compare meaning（语义比较）
领域特定标准（合规、语气、政策）	Custom	—

Custom的进阶选项——基于评分规则的分级： 对于需要比Custom的通过/失败标签更精细评分的客户，Copilot Studio Kit支持1-5分的评分规则分级。评分规则用自定义的AI评分器替代标准验证逻辑，与领域特定标准对齐。有两种模式：优化（评分 + 理由——先用来根据人工判断校准评分规则）和测试（仅评分——评分规则可靠后用于日常QA）。向选择Custom方法做合规、语气或品牌声音测试的客户提及该选项——评分规则是用于持续校准质量保障的高级选项。

General quality（综合质量）的实际测量内容： 向客户解释General quality时，说明LLM评审员的评估维度。根据MS Learn文档（2026年3月），General quality基于四个标准评分——全部满足才能得高分：

标准	检查内容	回答的示例问题
相关性	响应是否直接解决了问题？	「Agent是否紧扣主题，还是偏离了方向？」
Grounding	响应是否基于提供的上下文，而非虚构？	「Agent是否引用了知识源，还是编造了内容？」
完整性	响应是否覆盖了所有方面，细节足够？	「Agent回答了完整的问题，还是只回答了一部分？」
abstention 弃权	Agent是否尝试回答问题？	「Agent是否拒绝回答它应该回答的问题？」

告诉客户：「如果General quality得分低，这四个标准会告诉你问题出在哪里。响应可能相关但不grounded（它回答了正确的问题，但答案是编造的），或者grounded但不完整（它使用了正确的来源，但遗漏了一半信息）。」

解释方法： 「我对事实类问题使用Compare meaning（语义比较），因为Agent不需要使用完全相同的词汇——只要传达相同的信息即可。对于安全测试我也使用Compare meaning（语义比较），因为我们需要检查拒绝响应是否符合预期。」

质量信号——与Agent的能力对齐。

通过/失败阈值——根据风险等级校准：

分类	低风险	中风险	高风险
核心业务	>=80%	>=90%	>=95%
安全与合规	>=90%	>=95%	>=99%
边界用例	>=60%	>=70%	>=80%

优先级顺序： 核心业务 → 安全 → 能力 → 边界用例。

强调客户可能遗漏的点： 「注意我包含了幻觉预防测试——关于不在你知识源中的主题的问题。大多数客户只测试Agent应该知道的内容，测试它不应该知道的内容同样重要——这可以发现Agent编造答案的问题。」

Output

输出

Display the scenario plan table and thresholds.

展示场景计划表和阈值。

Interactive Dashboard Checkpoint

交互式仪表板检查点

Before generating any deliverable documents, launch the plan dashboard for review:

Write the plan to

stage-1-data.json

using the Zone Model structure:

json

{
  "agent_name": "...",
  "scenarios": [
    {
      "id": 1,
      "name": "PTO policy lookup",
      "zone": "1",
      "quality_dimension": "Policy Accuracy",
      "acceptance_criteria": "Pass = agent returns correct PTO accrual rate matching HR policy.\nFail = incorrect rate or information from wrong tenure band.",
      "target_pass_rate": 90
    }
  ],
  "quality_dimensions": ["Policy Accuracy", "Source Attribution", "Hallucination Prevention", ...]
}

Each scenario must have:

zone

("1", "2", or "3"),

quality_dimension

acceptance_criteria

(with Pass = and Fail = conditions), and

target_pass_rate

(increment of 5%).

Zone assignment guidelines:

Zone 1 ("This Must Work"): Safety failures, core business functions, guardrail tests. Highest value, full investment.
Zone 2 ("Real but Lower Priority"): Nice-to-have features, low-traffic use cases, areas where higher score doesn't justify effort.
Zone 3 ("Exploratory"): Emergent patterns, new tools with minimal effort, capabilities you're exploring.

Launch the dashboard:

bash

python dashboard/serve.py --stage plan --data stage-1-data.json

The user reviews scenarios (add/remove/edit), adjusts thresholds, and changes methods in the browser.
When the user confirms, read
```
plan-feedback.json
```
and apply edits. If changes requested, regenerate and re-launch.
After confirmation, automatically generate the customer-ready .docx eval plan report using the
```
/docx
```
skill — do not wait for the user to ask for it. This is the customer's first deliverable.

The report must be:

Concise — no filler, no walls of text. Tables over paragraphs.
Presentable — professional formatting with color-coded headers, clean tables, visual hierarchy
Self-contained — a customer who wasn't in the conversation can read it and understand the eval plan

Report structure:

Agent Vision summary (from Stage 0) — 5-6 lines max
Zone Model overview — explain the three zones and how scenarios were classified based on Priority + Value + Effort
Zone assignment table — scenarios grouped by zone (Zone 1: "This Must Work", Zone 2: "Real but Lower Priority", Zone 3: "Exploratory") with acceptance criteria (Pass = / Fail = conditions)
Quality Dimensions to Test — list dimensions with grouped scenarios under each
Success Metrics table — Zone, Scenario, Quality Dimension, Target Pass Rate for each scenario
Method mapping explanation — which test methods apply to which quality dimensions

Tell the customer: "Here's your eval plan report with the Zone Model prioritization. Share this with your team for alignment — business and dev should agree on the zone assignments and target pass rates before we generate test cases."

生成任何交付文档前，启动计划仪表板供审核：

使用区域模型结构将计划写入

stage-1-data.json

：

json

{
  "agent_name": "...",
  "scenarios": [
    {
      "id": 1,
      "name": "PTO政策查询",
      "zone": "1",
      "quality_dimension": "政策准确性",
      "acceptance_criteria": "通过 = Agent返回符合HR政策的正确PTO累积比例。\n失败 = 比例错误或信息来自错误的工龄区间。",
      "target_pass_rate": 90
    }
  ],
  "quality_dimensions": ["政策准确性", "来源归因", "幻觉预防", ...]
}

每个场景必须包含：

zone

（"1"、"2"或"3"）、

quality_dimension

、

acceptance_criteria

（包含通过=和失败=条件），以及

target_pass_rate

（5%的倍数）。

区域分配指南：

区域1（「必须正常运行」）：安全故障、核心业务功能、防护栏测试。价值最高，全力投入。
区域2（「真实但优先级较低」）：锦上添花的功能、低流量用例、更高得分不值得投入精力的领域。
区域3（「探索性」）：涌现模式、投入极少的新工具、你正在探索的能力。

启动仪表板：

bash

python dashboard/serve.py --stage plan --data stage-1-data.json

用户在浏览器中审核场景（添加/删除/编辑）、调整阈值、修改方法。
用户确认后，读取
```
plan-feedback.json
```
并应用修改。如果请求修改，重新生成并重新启动仪表板。
确认后，使用
/docx
Skill自动生成客户可用的docx格式评估计划报告——不要等客户索要。这是客户的第一个交付物。

报告必须满足：

简洁——没有冗余内容，没有大段文字。优先用表格而非段落。
美观——专业格式，带颜色编码的标题、清晰的表格、视觉层级
自包含——没有参与会话的客户也可以阅读并理解评估计划

报告结构：

Agent愿景摘要（来自阶段0）——最多5-6行
区域模型概览——解释三个区域，以及如何基于优先级+价值+投入对场景进行分类
区域分配表——按区域分组的场景（区域1：「必须正常运行」，区域2：「真实但优先级较低」，区域3：「探索性」），带验收标准（通过=/失败=条件）
待测试的质量维度——列出维度，每个维度下分组对应的场景
成功指标表——每个场景对应的区域、场景名称、质量维度、目标通过率
方法映射说明——哪些测试方法适用于哪些质量维度

告诉客户：「这是你的评估计划报告，附带区域模型优先级排序。你可以和团队分享以对齐认知——业务和开发团队应该在我们生成测试用例前，就区域分配和目标通过率达成一致。」

Stage 2: Generate

阶段2：生成

Generate test cases as separate CSV files per quality signal. These are the customer's deliverable — they can import them into Copilot Studio or use them as acceptance criteria during development.

生成按质量信号分开的CSV文件形式的测试用例。这些是客户的交付物——他们可以将其导入Copilot Studio，或者在开发过程中用作验收标准。

Choose evaluation mode: Single Response vs. Conversation

选择评估模式：单响应 vs 对话

Before generating test cases, determine which evaluation mode fits each scenario. Copilot Studio supports two modes:

Mode	Best for	Limits	Supported test methods
Single response	Factual Q&A, tool routing, specific answers, safety tests	Up to 100 test cases per set	All 7 methods (General quality, Compare meaning, Keyword match, Capability use, Text similarity, Exact match, Custom)
Conversation (multi-turn)	Multi-step workflows, context retention, clarification flows, process navigation	Up to 20 test cases, max 12 messages (6 Q&A pairs) per case	General quality, Keyword match, Capability use, Custom (Classification)

When to recommend conversation eval:

The agent walks users through multi-step processes (e.g., troubleshooting, onboarding, form completion)
Context retention matters — later answers depend on earlier ones
The agent needs to ask clarifying questions before answering
The scenario involves slot-filling or information gathering across turns

When to stay with single response:

Each question is independent (FAQ, policy lookup, data retrieval)
You need Compare meaning, Text similarity, or Exact match (conversation mode doesn't support these)
You need more than 20 test cases in a set

Explain the choice: "I'm recommending single response eval for your knowledge-based scenarios because each question is independent — the agent doesn't need previous context to answer. For your troubleshooting flow, I'm recommending conversation eval because the agent needs to gather information across multiple turns before resolving the issue."

Note for CSV generation: Single response test sets use the standard 3-column CSV (Question, Expected response, Testing method). Conversation test sets can be imported via spreadsheet or generated in the Copilot Studio UI — each test case contains a sequence of user messages that simulate a multi-turn interaction.

生成测试用例前，确定每个场景适用的评估模式。Copilot Studio支持两种模式：

模式	适用场景	限制	支持的测试方法
单响应	事实类问答、工具路由、特定答案、安全测试	每个测试集最多100个用例	全部7种方法（General quality（综合质量）、Compare meaning（语义比较）、Keyword match（关键词匹配）、Capability use、Text similarity、Exact match（精确匹配）、Custom）
对话（多轮）	多步骤工作流、上下文保留、澄清流程、流程导航	最多20个测试用例，每个用例最多12条消息（6组问答）	General quality（综合质量）、Keyword match（关键词匹配）、Capability use、Custom（分类）

何时推荐对话评估：

Agent引导用户完成多步骤流程（例如故障排查、入职、表单填写）
上下文保留很重要——后续答案依赖之前的内容
Agent需要先问澄清问题再回答
场景涉及跨轮次的槽位填充或信息收集

何时使用单响应：

每个问题都是独立的（FAQ、政策查询、数据检索）
你需要使用Compare meaning（语义比较）、Text similarity或Exact match（精确匹配）（对话模式不支持这些方法）
你需要一个测试集包含超过20个用例

解释选择理由： 「我建议你的知识类场景使用单响应评估，因为每个问题都是独立的——Agent不需要之前的上下文就能回答。对于故障排查流程，我建议使用对话评估，因为Agent需要跨多轮收集信息才能解决问题。」

CSV生成注意事项： 单响应测试集使用标准的3列CSV（问题、预期响应、测试方法）。对话测试集可以通过电子表格导入，或者在Copilot Studio UI中生成——每个测试用例包含一系列用户消息，模拟多轮交互。

What to do

工作内容

Generate one test case per scenario row from the plan. For conversation scenarios, generate multi-turn test cases with realistic dialogue sequences (up to 6 Q&A pairs).
Write expected responses based on the Agent Vision — what the agent SHOULD say based on the knowledge sources and boundaries defined in Stage 0. Note: "These expected responses reflect your stated requirements. Refine them once the agent is built and you see how it actually responds."

Group by quality signal into separate CSV files:

```
eval-knowledge-accuracy.csv
```
```
eval-safety-compliance.csv
```
```
eval-hallucination-prevention.csv
```
```
eval-routing.csv
```
```
eval-robustness.csv
```
```
eval-personalization.csv
```
(if applicable)

Only create files for categories that apply.

CSV format — Copilot Studio import format:

csv

"Question","Expected response","Testing method"
"How many PTO days do LA employees get?","LA employees receive 18 PTO days per year.","Compare meaning"

Valid Testing method values:

General quality

Compare meaning

Similarity

Exact match

Keyword match

Test method per scenario type:

Scenario type	Method	Why
Factual with known answer	Compare meaning	Semantic equivalence
Open-ended quality	General quality	LLM judge
Must-include terms (URL, email)	Keyword match	Exact presence
Agent should refuse	Compare meaning	Refusal matches expected
Domain-specific criteria (compliance, tone, policy)	Custom	Define your own rubric and pass/fail labels

Highlight the value: "You now have [X] test cases across [Y] quality signals. Compare that to the 5-10 happy-path prompts most customers start with. These include adversarial attacks, hallucination traps, robustness tests, and edge cases your users will encounter in production."

计划中的每个场景行生成一个测试用例。对于对话场景，生成带有真实对话序列的多轮测试用例（最多6组问答）。
基于Agent愿景编写预期响应——基于阶段0定义的知识源和边界，Agent应该给出的回答。注意：「这些预期响应反映了你明确提出的需求。Agent开发完成后，你可以根据实际响应调整这些内容。」

按质量信号分组到单独的CSV文件中：

```
eval-knowledge-accuracy.csv
```
```
eval-safety-compliance.csv
```
```
eval-hallucination-prevention.csv
```
```
eval-routing.csv
```
```
eval-robustness.csv
```
```
eval-personalization.csv
```
（如果适用）

仅为适用的分类创建文件。

CSV格式——Copilot Studio导入格式：

csv

"Question","Expected response","Testing method"
"洛杉矶员工有多少天PTO？","洛杉矶员工每年有18天PTO。","Compare meaning"

合法的Testing method值：

General quality

Compare meaning

Similarity

Exact match

Keyword match

。

每个场景类型对应的测试方法：

场景类型	方法	原因
有已知答案的事实类问题	Compare meaning	语义等价
开放式质量测试	General quality	LLM评审
必须包含的术语（URL、邮箱）	Keyword match	精确存在
Agent应该拒绝的场景	Compare meaning	拒绝响应符合预期
领域特定标准（合规、语气、政策）	Custom	定义自己的评分规则和通过/失败标签

强调价值： 「你现在有[X]个测试用例，覆盖[Y]个质量信号。对比大多数客户一开始只用的5-10个 happy path 提示词，这些测试用例包含了对抗性攻击、幻觉陷阱、鲁棒性测试，以及你的用户在生产环境中会遇到的边界用例。」

Output

输出

Display a summary table of test cases per quality signal.

展示按质量信号分组的测试用例汇总表。

Interactive Dashboard Checkpoint

交互式仪表板检查点

Before generating final CSV and report files, launch the test cases dashboard for review:

Write the test cases to

stage-2-data.json

using the Zone Model structure:

json

{
  "agent_name": "...",
  "test_sets": [
    {
      "quality_dimension": "Policy Accuracy",
      "methods": ["Compare meaning", "Keyword match"],
      "scenarios": [
        {
          "scenario_id": 1,
          "name": "PTO policy lookup",
          "zone": "1",
          "acceptance_criteria": "Pass = correct PTO accrual rate.\nFail = incorrect rate.",
          "cases": [
            {"id": 1, "question": "...", "expected_response": "Text with [VERIFY: factual content to check] markers"}
          ]
        }
      ]
    }
  ]
}

Key requirements:

Group test cases by
```
quality_dimension
```
(not quality_signal), with
```
scenarios
```
nested under each dimension
Each scenario carries its
```
zone
```
,
```
acceptance_criteria
```
(from Stage 1), and
```
cases
```
Test
```
methods
```
are set at the dimension level, not per-case
Wrap AI-generated factual content in
```
[VERIFY: ...]
```
markers so the dashboard highlights them for human review

Launch the dashboard:

bash

python dashboard/serve.py --stage generate --data stage-2-data.json

The user reviews test cases per quality dimension tab (zone-colored), reviews acceptance criteria per scenario group, checks VERIFY-highlighted factual content, and edits expected responses inline.
When the user confirms, read
```
generate-feedback.json
```
and apply all edits. If changes requested, regenerate and re-launch.
After confirmation, generate the final deliverables:

A. CSV files — Write each quality signal's test cases to a separate CSV:

csv

"Question","Expected response","Testing method"

B. .docx report — Generate a customer-ready report using the

/docx

skill. The report must be:

Concise — no filler, no walls of text. Tables over paragraphs.
Presentable — professional formatting with color-coded headers, clean tables, visual hierarchy
Self-contained — a customer who wasn't in the conversation can read it and understand the eval plan + test cases

Report structure:

Agent Vision summary (from Stage 0) — 5-6 lines max
Zone Model summary — scenarios by zone with acceptance criteria (Pass/Fail)
Test cases organized by quality dimension, with scenario groups showing zone badge and acceptance criteria
For each test case: Question, Expected Response (with [VERIFY] content called out), and suggested test method
Summary table: quality dimension, scenario count, test case count, methods
"What these tests catch" callout — 3-4 bullet points on what the customer would have missed
Next steps — what to do with these files

Tell the customer: "These CSVs are importable directly into Copilot Studio's Evaluation tab. The report includes your Zone Model priorities and acceptance criteria — share it with your team."

生成最终CSV和报告文件前，启动测试用例仪表板供审核：

使用区域模型结构将测试用例写入

stage-2-data.json

：

json

{
  "agent_name": "...",
  "test_sets": [
    {
      "quality_dimension": "政策准确性",
      "methods": ["Compare meaning", "Keyword match"],
      "scenarios": [
        {
          "scenario_id": 1,
          "name": "PTO政策查询",
          "zone": "1",
          "acceptance_criteria": "通过 = PTO累积比例正确。\n失败 = 比例错误。",
          "cases": [
            {"id": 1, "question": "...", "expected_response": "带有[VERIFY: 待检查事实内容]标记的文本"}
          ]
        }
      ]
    }
  ]
}

核心要求：

按
```
quality_dimension
```
（而非quality_signal）分组测试用例，
```
scenarios
```
嵌套在每个维度下
每个场景携带其
```
zone
```
、
```
acceptance_criteria
```
（来自阶段1）和
```
cases
```
测试
```
methods
```
在维度层级设置，而非每个用例单独设置
将AI生成的事实内容包裹在
```
[VERIFY: ...]
```
标记中，以便仪表板高亮显示供人工审核

启动仪表板：

bash

python dashboard/serve.py --stage generate --data stage-2-data.json

用户按质量维度标签页审核测试用例（按区域颜色标记）、审核每个场景组的验收标准、检查VERIFY高亮的事实内容、直接编辑预期响应。
用户确认后，读取
```
generate-feedback.json
```
并应用所有修改。如果请求修改，重新生成并重新启动仪表板。
确认后，生成最终交付物：

A. CSV文件——将每个质量信号的测试用例写入单独的CSV：

csv

"Question","Expected response","Testing method"

B. .docx报告——使用

/docx

Skill生成客户可用的报告。报告必须满足：

简洁——没有冗余内容，没有大段文字。优先用表格而非段落。
美观——专业格式，带颜色编码的标题、清晰的表格、视觉层级
自包含——没有参与会话的客户也可以阅读并理解评估计划+测试用例

报告结构：

Agent愿景摘要（来自阶段0）——最多5-6行
区域模型摘要——按区域分组的场景，带验收标准（通过/失败）
按质量维度组织的测试用例，场景组显示区域徽章和验收标准
每个测试用例包含：问题、预期响应（标注[VERIFY]内容）、建议的测试方法
汇总表：质量维度、场景数量、测试用例数量、使用的方法
「这些测试覆盖的问题」说明——3-4个要点，说明客户原本可能遗漏的风险
后续步骤——这些文件的使用方法

告诉客户：「这些CSV可以直接导入Copilot Studio的评估标签页。报告包含你的区域模型优先级和验收标准——你可以和团队分享。」

Stage 3: Run (requires a running agent)

阶段3：运行（需要运行中的Agent）

Skip this stage if the agent isn't built yet. The deliverables from Stages 0-2 are the eval jumpstart — the customer can run evals themselves when the agent is ready.

If the agent IS available, send each question from the CSVs to the live agent and score responses using Claude Sonnet as LLM judge.

如果Agent还没有开发完成，跳过这个阶段。 阶段0-2的交付物就是评估入门包——客户可以在Agent准备好后自行运行评估。

如果Agent已经可用，将CSV中的每个问题发送给在线Agent，使用Claude Sonnet作为LLM评审员对响应打分。

How to run

运行方法

Use

eval-runner.js

if a DirectLine connection is available:

bash

node eval-runner.js --token-endpoint "<URL>" --csv-dir .

Or use

/chat-with-agent

for individual questions via CPS SDK.

Scoring:

```
Compare meaning
```
→ semantic equivalence (0.0-1.0)
```
General quality
```
→ helpfulness/accuracy/relevance (0.0-1.0)
```
Keyword match
```
→ code-based string matching
```
Exact match
```
→ code-based string equality

Required:

ANTHROPIC_API_KEY

for LLM-based scorers.

如果有DirectLine连接，使用

eval-runner.js

：

bash

node eval-runner.js --token-endpoint "<URL>" --csv-dir .

或者通过CPS SDK使用

/chat-with-agent

发送单个问题。

打分规则：

```
Compare meaning
```
→ 语义等价度（0.0-1.0）
```
General quality
```
→ 有用性/准确性/相关性（0.0-1.0）
```
Keyword match
```
→ 代码层面的字符串匹配
```
Exact match
```
→ 代码层面的字符串相等

要求：基于LLM的打分器需要配置

ANTHROPIC_API_KEY

。

Output

输出

Results table +

eval-results-YYYY-MM-DD.csv

and

.json

结果表 +

eval-results-YYYY-MM-DD.csv

和

.json

。

Stage 4: Interpret

阶段4：解读

Analyze eval results to understand what's working, what's failing, and what to fix next.

Which skill to use: For a one-shot triage report from a CSV file or results summary, invoke

/eval-result-interpreter

. For interactive, multi-round diagnosis with detailed remediation guidance, invoke

/eval-triage-and-improvement

. Start with the interpreter; switch to triage if you need help implementing fixes.

分析评估结果，了解哪些部分正常运行、哪些部分失败，以及下一步修复的内容。

使用的Skill： 如果需要基于CSV文件或结果摘要生成一次性分级报告，调用

/eval-result-interpreter

。如果需要交互式多轮诊断和详细的修复指导，调用

/eval-triage-and-improvement

。先使用解释器；如果需要帮助实现修复，切换到分级工具。

What to do

工作内容

Pre-triage check — Were knowledge sources accessible? APIs healthy? Auth valid?
Score summary — Total, passed, failed, pass rate per category and test method.
Failure triage — Explain the key insight: "Before we blame the agent — at least 20% of failures in a new eval are actually eval setup issues, not agent issues. The test case might be wrong, the expected response might be outdated, or the testing method might be inappropriate. Let me check that first."

Apply 5-question eval verification for each failure.
Root causes: Eval Setup Issue / Agent Configuration Issue / Platform Limitation.
Top 3 actions — Each: Change X → Re-run Y → Expect Z.
Pattern analysis and next-run recommendation.

If 100% pass: "A 100% pass rate is a red flag — your eval is likely too easy."

分级前检查——知识源是否可访问？API是否正常？认证是否有效？
得分汇总——总体、通过、失败、每个分类和测试方法的通过率。
失败分级——解释核心洞察：「在指责Agent之前——新评估中至少20%的失败实际上是评估设置问题，而非Agent问题。可能是测试用例错误、预期响应过时，或者测试方法不合适。我先检查这部分。」

对每个失败用例应用5个问题的评估验证。
根因： 评估设置问题 / Agent配置问题 / 平台限制。
Top 3行动——每个行动的结构：修改 X → 重新运行 Y → 预期 Z。
模式分析和下次运行建议。

如果通过率100%：「100%的通过率是危险信号——你的评估可能太简单了。」

Interactive Dashboard Checkpoint

交互式仪表板检查点

Before generating the final triage report, launch the interpret dashboard for review:

Write the triage data to

stage-4-data.json

using the success metrics structure:

json

{
  "agent_name": "...",
  "summary": {"total": 28, "passed": 19, "failed": 9},
  "success_metrics": [
    {"scenario_id": 1, "name": "PTO policy lookup", "zone": "1", "quality_dimension": "Policy Accuracy", "target_pass_rate": 90, "actual_pass_rate": 100, "cases_total": 2, "cases_passed": 2}
  ],
  "eval_results": [
    {"scenario_id": 1, "question": "...", "expected": "...", "actual": "...", "method": "Compare meaning", "score": 0.92, "pass": true, "explanation": "Rationale from LLM judge..."}
  ],
  "failures": [
    {"id": 1, "scenario_id": 2, "scenario_name": "...", "zone": "1", "quality_dimension": "...", "question": "...", "expected": "...", "actual": "...", "root_cause": "agent_config", "explanation": "..."}
  ],
  "top_actions": [...],
  "patterns": [...]
}

Key requirements:

```
success_metrics
```
maps each scenario to its target vs actual pass rate (from Stage 1 targets)
```
eval_results
```
contains ALL test case results (not just failures) so scenarios can be expanded in the dashboard
Each eval result includes
```
explanation
```
(the LLM judge rationale) for human review
No
```
verdict
```
field — the dashboard shows success metrics status per zone instead of SHIP/ITERATE/BLOCK

Launch the dashboard:

bash

python dashboard/serve.py --stage interpret --data stage-4-data.json

The user reviews success metrics status per zone, expands scenario rows to see test case details, uses Human Judgement (Agree/Disagree) to override LLM judge assessments, and re-classifies root causes.
When the user confirms, read
```
interpret-feedback.json
```
and apply edits (including human disagrees which become eval_setup root causes). If changes requested, regenerate and re-launch.
After confirmation, generate the customer-ready .docx triage report using the
```
/docx
```
skill. Same principles: concise, presentable, self-contained. Structure:
1. Success Metrics Status — zone summary cards (targets met per zone) + full scenario table (zone, scenario, quality dimension, target vs actual, status)
2. Failure triage table (zone, scenario, question, expected, actual, root cause) — include human-disagreed entries as "Eval Setup — Human Disagrees"
3. Top actions (Change → Re-run → Expect)
4. Pattern analysis — zone-aware patterns highlighting systemic issues
5. Next steps

生成最终分级报告前，启动解读仪表板供审核：

使用成功指标结构将分级数据写入

stage-4-data.json

：

json

{
  "agent_name": "...",
  "summary": {"total": 28, "passed": 19, "failed": 9},
  "success_metrics": [
    {"scenario_id": 1, "name": "PTO政策查询", "zone": "1", "quality_dimension": "政策准确性", "target_pass_rate": 90, "actual_pass_rate": 100, "cases_total": 2, "cases_passed": 2}
  ],
  "eval_results": [
    {"scenario_id": 1, "question": "...", "expected": "...", "actual": "...", "method": "Compare meaning", "score": 0.92, "pass": true, "explanation": "LLM评审员的理由..."}
  ],
  "failures": [
    {"id": 1, "scenario_id": 2, "scenario_name": "...", "zone": "1", "quality_dimension": "...", "question": "...", "expected": "...", "actual": "...", "root_cause": "agent_config", "explanation": "..."}
  ],
  "top_actions": [...],
  "patterns": [...]
}

核心要求：

```
success_metrics
```
将每个场景映射到其目标通过率vs实际通过率（来自阶段1的目标）
```
eval_results
```
包含所有测试用例结果（不只是失败用例），以便在仪表板中展开场景
每个评估结果包含
```
explanation
```
（LLM评审员的理由）供人工审核
没有
```
verdict
```
字段——仪表板按区域显示成功指标状态，而非SHIP/ITERATE/BLOCK

启动仪表板：

bash

python dashboard/serve.py --stage interpret --data stage-4-data.json

用户审核每个区域的成功指标状态，展开场景行查看测试用例详情，使用人工判断（同意/不同意）覆盖LLM评审员的评估，重新分类根因。
用户确认后，读取
```
interpret-feedback.json
```
并应用修改（包括人工不同意的内容，会被标记为eval_setup根因）。如果请求修改，重新生成并重新启动仪表板。
确认后，使用
```
/docx
```
Skill生成客户可用的docx格式分级报告。遵循同样的原则：简洁、美观、自包含。结构：
1. 成功指标状态——区域摘要卡片（每个区域的目标完成情况）+ 完整场景表（区域、场景、质量维度、目标vs实际、状态）
2. 失败分级表（区域、场景、问题、预期、实际、根因）——将人工不同意的条目标记为「评估设置——人工不同意」
3. 顶级行动（修改 → 重新运行 → 预期）
4. 模式分析——按区域感知的模式，突出系统性问题
5. 后续步骤

Language Support

语言支持

Supports English and Chinese (simplified). Auto-detects from user's language.

CSV headers stay English (Copilot Studio requirement)
Technical terms in English with Chinese parenthetical on first use: Compare meaning (语义比较), General quality (综合质量), Keyword match (关键词匹配), Exact match (精确匹配)

支持英文和简体中文，自动检测用户使用的语言。

CSV表头保持英文（Copilot Studio要求）
技术术语使用英文，首次出现时加中文括号注释：Compare meaning (语义比较)、General quality (综合质量)、Keyword match (关键词匹配)、Exact match (精确匹配)

Platform Capabilities to Leverage (March 2026)

可利用的平台能力（2026年3月）

When coaching customers, mention these Copilot Studio evaluation features at the appropriate stage:

Feature	When to mention	What it does
Custom test method	Stage 1 (Plan)	Lets customers define domain-specific evaluation criteria with custom labels (e.g., "Compliant" / "Non-Compliant"). Ideal for compliance, tone, or policy checks that don't fit standard methods.
Comparative testing	Stage 4 (Interpret)	Side-by-side comparison of agent versions. Use after making fixes to verify improvements without regressions.
Theme-based test sets	Stage 2 (Generate)	Creates test cases from production analytics themes — real user questions grouped by topic. Best for agents already in production.
Production data import	Stage 2 (Generate)	Import real user conversations as test cases. Higher fidelity than synthetic test cases.
Rubrics (Copilot Studio Kit)	Stage 1 (Plan)	Custom grading rubrics with 1-5 scoring and refinement workflow to align AI grading with human judgment. For advanced customers with mature eval practices.
User feedback (thumbs up/down)	Stage 4 (Interpret)	Makers can flag eval results they agree/disagree with. Captures grader alignment signals over time.
Set-level grading	Stage 4 (Interpret)	Evaluates quality across the entire test set (not just individual cases). Gives an overall quality picture and supports multiple grading approaches for more holistic results. Use this to report aggregate quality to stakeholders.
User profiles	Stage 2 (Generate) / Stage 3 (Run)	Assign a user profile to a test set so the eval runs as a specific authenticated user. Use this when the agent returns different results based on who is asking — e.g., a director can access different knowledge sources than an intern. Ask in Stage 0: "Does your agent behave differently depending on who the user is?" If yes, plan separate test sets per role. Limitations: (1) Multi-profile eval only works for agents WITHOUT connector dependencies. (2) Tool connections always use the logged-in maker account, not the profile — mismatch causes "This account cannot connect to tools" error. (3) Not available in GCC. Docs: Manage user profiles.
CSV template download	Stage 2 (Generate)	Copilot Studio provides a downloadable CSV template under Data source > New evaluation. Recommend customers download it first to verify format before importing generated CSVs.
89-day result retention	Stage 3 (Run) / Stage 4 (Interpret)	Test results are only available in Copilot Studio for 89 days. Always export results to CSV after each run for long-term tracking. Critical for customers establishing baselines and tracking improvement over time.

Don't overwhelm. Only mention features relevant to the customer's maturity level. A customer in Stage 0 doesn't need to hear about rubric refinement workflows.

GCC (Government Community Cloud) limitations: If the customer is in a GCC environment, flag these restrictions early:

No user profiles — they can't assign a test account to simulate authenticated users during evaluation
No Text Similarity method — all other test methods work normally These are documented at About agent evaluation. Don't let them design an eval plan around features they can't use.

Important caveat to share: Agent evaluation measures correctness and performance — it does NOT test for AI ethics or safety problems. An agent can pass all eval tests and still produce inappropriate answers. Customers must still use responsible AI reviews and content safety filters. Evaluation complements those — it doesn't replace them.

指导客户时，在合适的阶段提及这些Copilot Studio评估功能：

功能	提及时机	作用
自定义测试方法	阶段1（规划）	允许客户定义领域特定的评估标准，带自定义标签（例如「合规」/「不合规」）。非常适合不符合标准方法的合规、语气或政策检查。
对比测试	阶段4（解读）	并排比较Agent版本。修复后使用，验证改进没有引入回归问题。
基于主题的测试集	阶段2（生成）	基于生产分析的主题生成测试用例——按主题分组的真实用户问题。最适合已经在生产环境运行的Agent。
生产数据导入	阶段2（生成）	导入真实用户对话作为测试用例。比合成测试用例保真度更高。
评分规则（Copilot Studio Kit）	阶段1（规划）	自定义评分规则，支持1-5分评分和优化工作流，让AI评分与人工判断对齐。适合有成熟评估实践的高级客户。
用户反馈（点赞/点踩）	阶段4（解读）	制作者可以标记他们同意/不同意的评估结果。长期收集评审员对齐信号。
测试集层级评分	阶段4（解读）	评估整个测试集的质量（不只是单个用例）。提供整体质量概况，支持多种评分方法，获得更全面的结果。用于向利益相关者汇报整体质量。
用户配置文件	阶段2（生成）/阶段3（运行）	为测试集分配用户配置文件，让评估以特定已认证用户的身份运行。当Agent根据提问者不同返回不同结果时使用——例如总监可以访问实习生无法访问的知识源。在阶段0询问：「你的Agent会根据用户不同表现出不同行为吗？」如果是，为每个角色规划单独的测试集。限制： (1) 多配置文件评估仅适用于没有连接器依赖的Agent。(2) 工具连接始终使用登录的制作者账户，而非配置文件——不匹配会导致「此账户无法连接到工具」错误。(3) GCC环境不可用。文档：管理用户配置文件。
CSV模板下载	阶段2（生成）	Copilot Studio在数据源 > 新建评估下提供可下载的CSV模板。建议客户先下载模板，验证格式后再导入生成的CSV。
89天结果保留期	阶段3（运行）/阶段4（解读）	测试结果仅在Copilot Studio中保留89天。每次运行后务必将结果导出为CSV，用于长期跟踪。对于需要建立基线和长期跟踪改进的客户至关重要。

不要过度推送信息。 仅提及与客户成熟度相关的功能。处于阶段0的客户不需要了解评分规则优化工作流。

GCC（政府社区云）限制： 如果客户使用GCC环境，尽早说明这些限制：

无用户配置文件——他们无法分配测试账户来模拟评估过程中的已认证用户
无文本相似度方法——所有其他测试方法正常工作这些限制记录在Agent评估简介中。不要让他们基于无法使用的功能设计评估计划。

需要分享的重要提醒： Agent评估衡量的是正确性和性能——它不测试AI伦理或安全问题。Agent可以通过所有评估测试，但仍然可能生成不当内容。客户仍然需要使用负责任AI审查和内容安全过滤器。评估是这些措施的补充，而非替代。

Behavior Rules

行为规则

Discover first — understand the agent's purpose and the customer's expectations before anything else.
No running agent required for Stages 0-2. The skill works from a description, an idea, or a conversation.
Explain your reasoning. Don't just output artifacts — narrate WHY you're making each choice. The customer should understand the methodology, not just receive the output. This is what makes them self-sufficient.
Highlight what they'd miss. At each stage, point out the scenarios, methods, or insights the customer wouldn't have thought of on their own — hallucination tests, adversarial cases, the "20% are eval bugs" insight.
Be specific — use real names, real scenarios. No generic advice.
Always include at least 1 adversarial/safety scenario.
Keep everything in the CLI unless asked otherwise.
Pause between stages for confirmation.
Match the user's language.

先调研——在做任何工作前，先了解Agent的用途和客户的预期。
阶段0-2不需要运行中的Agent。 本Skill可基于描述、想法或对话运行。
解释你的逻辑。 不要只输出产物——说明你做出每个选择的原因。客户应该理解方法论，而不只是收到输出。这是让他们实现自主的关键。
强调客户可能遗漏的点。 每个阶段，指出客户自己不会想到的场景、方法或洞察——幻觉测试、对抗性用例、「20%是评估bug」的洞察。
具体——使用真实名称、真实场景。不要给通用建议。
始终包含至少1个对抗性/安全场景。
除非另有要求，所有内容都在CLI中展示。
阶段之间暂停等待确认。
匹配用户使用的语言。