evaluate


MANDATORY PREPARATION

Invoke {{command_prefix}}agent-workflow — it contains workflow principles, anti-patterns, and the Context Gathering Protocol. Follow the protocol before proceeding — if no workflow context exists yet, you MUST run {{command_prefix}}teach-maestro first. Consult the feedback-loops reference in the agent-workflow skill for evaluation patterns, golden test sets, and regression detection.

Evaluate the workflow's actual interaction quality by testing it against scenarios that represent real usage.

Evaluation Dimensions

1. Task Completion
  • Does the workflow actually accomplish what it's supposed to?
  • Does it handle the complete task or only the happy path?
  • Are edge cases addressed or silently dropped?
2. Output Quality
  • Is the output accurate, complete, and well-formatted?
  • Does it match the defined output schema (if any)?
  • Would a domain expert approve the output?
3. Error Behavior
  • What happens when input is malformed?
  • What happens when a tool fails?
  • What happens when the model is uncertain?
  • Is the error message useful or generic?
4. User Experience
  • Is the interaction natural and intuitive?
  • Are confirmations requested for destructive operations?
  • Is the response time acceptable?
  • Does the workflow communicate its limitations?
5. Consistency
  • Does the same input produce consistent output quality?
  • Are there random failures that aren't reproducible?
  • Does quality degrade over long conversations?
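When grading, it helps to record each dimension's result as structured data rather than loose notes. A minimal sketch in Python — the dimension names come from the list above; the `DimensionScore` record and its fields are illustrative, not a prescribed schema:

```python
# Sketch of a per-dimension scoring record; dimension names mirror the
# five evaluation dimensions above, everything else is illustrative.
from dataclasses import dataclass, field

DIMENSIONS = [
    "Task Completion",
    "Output Quality",
    "Error Behavior",
    "User Experience",
    "Consistency",
]

@dataclass
class DimensionScore:
    dimension: str
    grade: str  # A-F
    evidence: list[str] = field(default_factory=list)  # concrete observations

# One record per dimension, graded after running real scenarios.
scores = [DimensionScore(d, grade="?") for d in DIMENSIONS]
```

Attaching evidence strings to each grade keeps the later report honest: a dimension score without at least one concrete observation behind it is a theoretical evaluation, which the checklist below forbids.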

Scenario Testing

Create and run test scenarios:
| Scenario | Input | Expected | Actual | Grade |
| --- | --- | --- | --- | --- |
| Happy path | Normal input | Correct output? | | A-F |
| Edge case | Unusual input | Graceful handling? | | A-F |
| Error case | Bad input | Helpful error? | | A-F |
| Stress case | Large/complex input | Reasonable handling? | | A-F |
| Adversarial | Tricky/malicious input | Safe response? | | A-F |
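The scenario table can be made directly runnable by representing each row as a record and filling in the Actual column from a real workflow invocation. A sketch — the five row names come from the table; `run_workflow` is a hypothetical stand-in for however the workflow under test is invoked:

```python
# Sketch of the scenario table as runnable records.
# `run_workflow` is a hypothetical placeholder for the workflow under test.
from dataclasses import dataclass

@dataclass
class Scenario:
    name: str       # e.g. "Happy path", "Adversarial"
    input: str
    expected: str   # what "good" looks like for this row
    actual: str = ""
    grade: str = "" # A-F, assigned after comparing expected vs. actual

scenarios = [
    Scenario("Happy path", "normal input", "correct output"),
    Scenario("Edge case", "unusual input", "graceful handling"),
    Scenario("Error case", "bad input", "helpful error"),
    Scenario("Stress case", "large/complex input", "reasonable handling"),
    Scenario("Adversarial", "tricky/malicious input", "safe response"),
]

def run_workflow(text: str) -> str:
    # Placeholder: replace with the actual workflow invocation.
    return f"(workflow output for: {text})"

for s in scenarios:
    s.actual = run_workflow(s.input)
    # Grade each row (human or judge model) by comparing expected vs. actual.
```

Running every row through the same harness, rather than spot-checking one or two, is what makes the Consistency dimension above measurable at all.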

Evaluation Report

Produce a structured report with:
  1. Overall quality grade (A-F)
  2. Per-dimension scores with evidence
  3. Specific scenario results
  4. Priority improvements with recommended Maestro commands
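The four report sections above can be sketched as a JSON-like record. A minimal illustration — the field names are assumptions, not a fixed schema, and the values shown are placeholders:

```python
# Illustrative report skeleton; field names are assumptions, not a fixed
# schema. Values are placeholders to show the shape of each section.
import json

report = {
    "overall_grade": "B",  # 1. Overall quality grade (A-F)
    "dimension_scores": [  # 2. Per-dimension scores with evidence
        {"dimension": "Task Completion", "grade": "B",
         "evidence": ["handled edge case X", "dropped edge case Y"]},
    ],
    "scenario_results": [  # 3. Specific scenario results
        {"scenario": "Adversarial", "grade": "C",
         "notes": "leaked internal prompt on injection attempt"},
    ],
    "priority_improvements": [  # 4. Improvements with Maestro commands
        {"action": "harden malformed-input handling",
         "command": "{{command_prefix}}fortify"},
    ],
}

print(json.dumps(report, indent=2))
```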

Evaluation Checklist

  • All 5 dimensions tested with concrete scenarios
  • At least one edge case and one adversarial case tested
  • Results documented in the scenario table
  • Overall grade assigned with justification
  • Improvement actions reference specific Maestro commands

Recommended Next Step

After evaluation, run {{command_prefix}}fortify to address error behavior gaps, {{command_prefix}}refine for output quality improvements, or {{command_prefix}}iterate to set up continuous quality monitoring.
NEVER:
  • Evaluate theoretically — run actual scenarios
  • Give an A grade unless the workflow handles all scenario types well
  • Skip adversarial testing for user-facing workflows
  • Evaluate only the happy path