evaluate


MANDATORY PREPARATION

Invoke {{command_prefix}}agent-workflow — it contains workflow principles, anti-patterns, and the Context Gathering Protocol. Follow the protocol before proceeding — if no workflow context exists yet, you MUST run {{command_prefix}}teach-maestro first. Consult the feedback-loops reference in the agent-workflow skill for evaluation patterns, golden test sets, and regression detection.

Evaluate the workflow's actual interaction quality by testing it against scenarios that represent real usage.

Evaluation Dimensions

1. Task Completion
  • Does the workflow actually accomplish what it's supposed to?
  • Does it handle the complete task or only the happy path?
  • Are edge cases addressed or silently dropped?
2. Output Quality
  • Is the output accurate, complete, and well-formatted?
  • Does it match the defined output schema (if any)?
  • Would a domain expert approve the output?
3. Error Behavior
  • What happens when input is malformed?
  • What happens when a tool fails?
  • What happens when the model is uncertain?
  • Is the error message useful or generic?
4. User Experience
  • Is the interaction natural and intuitive?
  • Are confirmations requested for destructive operations?
  • Is the response time acceptable?
  • Does the workflow communicate its limitations?
5. Consistency
  • Does the same input produce consistent output quality?
  • Are there random failures that aren't reproducible?
  • Does quality degrade over long conversations?
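When grading, it helps to record each dimension's result as structured data rather than loose notes. A minimal sketch in Python — the dimension names come from the list above; the `DimensionScore` record and its fields are illustrative, not a prescribed schema:

```python
# Sketch of a per-dimension scoring record; dimension names mirror the
# five evaluation dimensions above, everything else is illustrative.
from dataclasses import dataclass, field

DIMENSIONS = [
    "Task Completion",
    "Output Quality",
    "Error Behavior",
    "User Experience",
    "Consistency",
]

@dataclass
class DimensionScore:
    dimension: str
    grade: str  # A-F
    evidence: list[str] = field(default_factory=list)  # concrete observations

# One record per dimension, graded after running real scenarios.
scores = [DimensionScore(d, grade="?") for d in DIMENSIONS]
```

Attaching evidence strings to each grade keeps the later report honest: a dimension score without at least one concrete observation behind it is a theoretical evaluation, which the checklist below forbids.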

Scenario Testing

Create and run test scenarios:
| Scenario | Input | Expected | Actual | Grade |
| --- | --- | --- | --- | --- |
| Happy path | Normal input | Correct output? | | A-F |
| Edge case | Unusual input | Graceful handling? | | A-F |
| Error case | Bad input | Helpful error? | | A-F |
| Stress case | Large/complex input | Reasonable handling? | | A-F |
| Adversarial | Tricky/malicious input | Safe response? | | A-F |
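The scenario table can be made directly runnable by representing each row as a record and filling in the Actual column from a real workflow invocation. A sketch — the five row names come from the table; `run_workflow` is a hypothetical stand-in for however the workflow under test is invoked:

```python
# Sketch of the scenario table as runnable records.
# `run_workflow` is a hypothetical placeholder for the workflow under test.
from dataclasses import dataclass

@dataclass
class Scenario:
    name: str       # e.g. "Happy path", "Adversarial"
    input: str
    expected: str   # what "good" looks like for this row
    actual: str = ""
    grade: str = "" # A-F, assigned after comparing expected vs. actual

scenarios = [
    Scenario("Happy path", "normal input", "correct output"),
    Scenario("Edge case", "unusual input", "graceful handling"),
    Scenario("Error case", "bad input", "helpful error"),
    Scenario("Stress case", "large/complex input", "reasonable handling"),
    Scenario("Adversarial", "tricky/malicious input", "safe response"),
]

def run_workflow(text: str) -> str:
    # Placeholder: replace with the actual workflow invocation.
    return f"(workflow output for: {text})"

for s in scenarios:
    s.actual = run_workflow(s.input)
    # Grade each row (human or judge model) by comparing expected vs. actual.
```

Running every row through the same harness, rather than spot-checking one or two, is what makes the Consistency dimension above measurable at all.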

Evaluation Report

Produce a structured report with:
  1. Overall quality grade (A-F)
  2. Per-dimension scores with evidence
  3. Specific scenario results
  4. Priority improvements with recommended Maestro commands
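The four report sections above can be sketched as a JSON-like record. A minimal illustration — the field names are assumptions, not a fixed schema, and the values shown are placeholders:

```python
# Illustrative report skeleton; field names are assumptions, not a fixed
# schema. Values are placeholders to show the shape of each section.
import json

report = {
    "overall_grade": "B",  # 1. Overall quality grade (A-F)
    "dimension_scores": [  # 2. Per-dimension scores with evidence
        {"dimension": "Task Completion", "grade": "B",
         "evidence": ["handled edge case X", "dropped edge case Y"]},
    ],
    "scenario_results": [  # 3. Specific scenario results
        {"scenario": "Adversarial", "grade": "C",
         "notes": "leaked internal prompt on injection attempt"},
    ],
    "priority_improvements": [  # 4. Improvements with Maestro commands
        {"action": "harden malformed-input handling",
         "command": "{{command_prefix}}fortify"},
    ],
}

print(json.dumps(report, indent=2))
```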

Evaluation Checklist

  • All 5 dimensions tested with concrete scenarios
  • At least one edge case and one adversarial case tested
  • Results documented in the scenario table
  • Overall grade assigned with justification
  • Improvement actions reference specific Maestro commands

Recommended Next Step

After evaluation, run {{command_prefix}}fortify to address error behavior gaps, {{command_prefix}}refine for output quality improvements, or {{command_prefix}}iterate to set up continuous quality monitoring.
NEVER:
  • Evaluate theoretically — run actual scenarios
  • Give an A grade unless the workflow handles all scenario types well
  • Skip adversarial testing for user-facing workflows
  • Evaluate only the happy path