evaluate
MANDATORY PREPARATION
Invoke {{command_prefix}}agent-workflow — it contains workflow principles, anti-patterns, and the Context Gathering Protocol. Follow the protocol before proceeding — if no workflow context exists yet, you MUST run {{command_prefix}}teach-maestro first.
Consult the feedback-loops reference in the agent-workflow skill for evaluation patterns, golden test sets, and regression detection.
Evaluate the workflow's actual interaction quality by testing it against scenarios that represent real usage.
Evaluation Dimensions
1. Task Completion
- Does the workflow actually accomplish what it's supposed to?
- Does it handle the complete task or only the happy path?
- Are edge cases addressed or silently dropped?
2. Output Quality
- Is the output accurate, complete, and well-formatted?
- Does it match the defined output schema (if any)?
- Would a domain expert approve the output?
3. Error Behavior
- What happens when input is malformed?
- What happens when a tool fails?
- What happens when the model is uncertain?
- Is the error message useful or generic?
4. User Experience
- Is the interaction natural and intuitive?
- Are confirmations requested for destructive operations?
- Is the response time acceptable?
- Does the workflow communicate its limitations?
5. Consistency
- Does the same input produce consistent output quality?
- Are there random failures that aren't reproducible?
- Does quality degrade over long conversations?
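The consistency dimension can be probed by running the same input several times and comparing the results. The following is a minimal sketch, not part of the Maestro tooling: `run_workflow` is a hypothetical callable that stands in for however the workflow is invoked, and exact string equality is an assumed, deliberately crude notion of agreement. Real workflows with free-form output would likely need a looser comparison.

```python
from collections import Counter
from typing import Callable


def consistency_report(run_workflow: Callable[[str], str], prompt: str, runs: int = 5) -> dict:
    """Run the same input several times and summarize how often the outputs agree."""
    outputs = [run_workflow(prompt) for _ in range(runs)]
    counts = Counter(outputs)
    most_common_output, frequency = counts.most_common(1)[0]
    return {
        "runs": runs,
        "distinct_outputs": len(counts),
        # Share of runs that produced the most frequent output.
        "agreement_rate": frequency / runs,
        "most_common_output": most_common_output,
    }


def fake_workflow(prompt: str) -> str:
    """Hypothetical deterministic stand-in for a real workflow call."""
    return prompt.strip().lower()


report = consistency_report(fake_workflow, "  Summarize Q3 Sales ")
```

A low `agreement_rate` or a high `distinct_outputs` count is a signal to inspect the individual runs, not a grade in itself.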
Scenario Testing
Create and run test scenarios:
| Scenario | Input | Expected | Actual | Grade |
|---|---|---|---|---|
| Happy path | Normal input | Correct output | ? | A-F |
| Edge case | Unusual input | Graceful handling | ? | A-F |
| Error case | Bad input | Helpful error | ? | A-F |
| Stress case | Large/complex input | Reasonable handling | ? | A-F |
| Adversarial | Tricky/malicious input | Safe response | ? | A-F |
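One way to keep the scenario table above in sync with actual runs is to hold the rows as data and render the table from them. This is a sketch under assumptions: the `Scenario` class and `render_table` helper are illustrative names, not part of any Maestro command, and the grade is entered by the evaluator after running the scenario.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    name: str
    input: str
    expected: str
    actual: str = "?"  # filled in after the scenario is actually run
    grade: str = "?"   # A-F, assigned by the evaluator


def render_table(scenarios: list) -> str:
    """Render scenarios as a markdown table with the same columns as the template."""
    header = "| Scenario | Input | Expected | Actual | Grade |"
    divider = "|---|---|---|---|---|"
    rows = [
        f"| {s.name} | {s.input} | {s.expected} | {s.actual} | {s.grade} |"
        for s in scenarios
    ]
    return "\n".join([header, divider, *rows])


scenarios = [
    Scenario("Happy path", "Normal input", "Correct output"),
    Scenario("Edge case", "Unusual input", "Graceful handling"),
    Scenario("Error case", "Bad input", "Helpful error"),
    Scenario("Stress case", "Large/complex input", "Reasonable handling"),
    Scenario("Adversarial", "Tricky/malicious input", "Safe response"),
]

# Record the observed result for one scenario after running it.
scenarios[0].actual = "Correct output"
scenarios[0].grade = "A"

table = render_table(scenarios)
```

Rows that still show `?` make it obvious which scenarios have not been run yet.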
Evaluation Report
Produce a structured report with:
- Overall quality grade (A-F)
- Per-dimension scores with evidence
- Specific scenario results
- Priority improvements with recommended Maestro commands
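The report fields listed above could be assembled as a simple data structure. The sketch below makes two assumptions that the source does not specify: the grade scale is A, B, C, D, F, and the overall grade is taken as the lowest per-dimension grade, a conservative rule chosen here only for illustration. The function names and the example evidence strings are hypothetical.

```python
GRADE_ORDER = "ABCDF"  # assumed scale, best to worst


def overall_grade(dimension_grades: dict) -> str:
    """Return the worst grade across dimensions (assumed aggregation rule)."""
    return max(dimension_grades.values(), key=GRADE_ORDER.index)


def build_report(dimension_grades: dict, evidence: dict, scenario_results: list, improvements: list) -> dict:
    """Assemble a structured evaluation report from per-dimension findings."""
    return {
        "overall_grade": overall_grade(dimension_grades),
        "dimensions": {
            name: {"grade": grade, "evidence": evidence.get(name, "")}
            for name, grade in dimension_grades.items()
        },
        "scenario_results": scenario_results,
        "improvements": improvements,
    }


report = build_report(
    dimension_grades={
        "Task Completion": "A",
        "Output Quality": "B",
        "Error Behavior": "C",
        "User Experience": "B",
        "Consistency": "A",
    },
    evidence={"Error Behavior": "Malformed input produced a generic error message."},
    scenario_results=[{"scenario": "Error case", "grade": "C"}],
    improvements=[{"priority": 1, "issue": "Generic error messages", "command": "fortify"}],
)
```

Taking the minimum means a single weak dimension prevents an overall A; an evaluator who prefers a weighted average would swap out `overall_grade` and document the justification alongside the grade.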
Evaluation Checklist
- All 5 dimensions tested with concrete scenarios
- At least one edge case and one adversarial case tested
- Results documented in the scenario table
- Overall grade assigned with justification
- Improvement actions reference specific Maestro commands
Recommended Next Step
After evaluation, run {{command_prefix}}fortify to address error behavior gaps, {{command_prefix}}refine for output quality improvements, or {{command_prefix}}iterate to set up continuous quality monitoring.
NEVER:
- Evaluate theoretically — run actual scenarios
- Give an A grade unless the workflow handles all scenario types well
- Skip adversarial testing for user-facing workflows
- Evaluate only the happy path