iterate
MANDATORY PREPARATION
Invoke {{command_prefix}}agent-workflow — it contains workflow principles, anti-patterns, and the Context Gathering Protocol. Follow the protocol before proceeding — if no workflow context exists yet, you MUST run {{command_prefix}}teach-maestro first.
Consult the feedback-loops reference in the agent-workflow skill for evaluation patterns and self-correction strategies.
Set up feedback loops that make workflows self-correcting and continuously improving. Iteration transforms one-shot gambles into convergent, reliable systems.
Feedback Loop Design
Step 1: Define Quality Criteria
What does "good output" look like? Score dimensions:
| Dimension | Weight | Threshold | Measurement |
|---|---|---|---|
| Accuracy | 0.4 | ≥ 0.8 | Factual correctness check |
| Completeness | 0.3 | ≥ 0.7 | Required fields present |
| Format | 0.2 | ≥ 0.9 | Schema compliance |
| Tone | 0.1 | ≥ 0.6 | Appropriate for audience |
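The weighted table above can be turned into a scoring function. This is a minimal sketch: the dimension names, weights, and thresholds mirror the table, but the idea that *every* dimension must clear its own threshold (rather than just the weighted average) is one reasonable reading of the table, not a mandate.

```python
DIMENSIONS = {
    # name: (weight, per-dimension threshold) — values from the table above
    "accuracy":     (0.4, 0.8),
    "completeness": (0.3, 0.7),
    "format":       (0.2, 0.9),
    "tone":         (0.1, 0.6),
}

def quality_score(scores: dict[str, float]) -> tuple[float, bool]:
    """Return (weighted total, passed). Passing requires every
    dimension to clear its own threshold, not just the average."""
    total = sum(DIMENSIONS[d][0] * s for d, s in scores.items())
    passed = all(s >= DIMENSIONS[d][1] for d, s in scores.items())
    return round(total, 3), passed

# Strong accuracy cannot mask a formatting failure:
print(quality_score({"accuracy": 0.9, "completeness": 0.8,
                     "format": 0.5, "tone": 0.7}))  # → (0.77, False)
```

Note the design choice: the per-dimension thresholds act as hard gates, so an output with a decent average but one failing dimension still triggers the correction loop.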
Step 2: Choose Evaluator Type
Match evaluator to requirements:
- Rule-based: Schema validation, field presence, value ranges (fast, free)
- Self-check: Same model evaluates own output (fast, cheap, less reliable)
- Cross-model: Different model evaluates (slower, more reliable)
- Human-in-the-loop: Human review (slowest, most reliable, doesn't scale)
- Hybrid: Rules first, then model check for what rules can't catch
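The hybrid pattern can be sketched as cheap rule checks gating the expensive model check. Everything here is illustrative: the required fields and the `model_evaluate` hook are hypothetical placeholders for whatever schema and self-check or cross-model call you actually use.

```python
def rule_check(output: dict) -> list[str]:
    """Fast, free rule-based checks: schema, field presence, value ranges."""
    problems = []
    for field in ("summary", "score"):  # hypothetical required fields
        if field not in output:
            problems.append(f"missing field: {field}")
    if "score" in output and not 0.0 <= output["score"] <= 1.0:
        problems.append("score out of range [0, 1]")
    return problems

def hybrid_evaluate(output: dict, model_evaluate) -> tuple[bool, list[str]]:
    problems = rule_check(output)
    if problems:
        return False, problems          # rules fail fast; no model call spent
    return model_evaluate(output)       # rules can't judge accuracy or tone

# Usage with a stub model check: the range rule rejects before any model call.
ok, notes = hybrid_evaluate({"summary": "draft", "score": 1.4},
                            lambda o: (True, []))
```

The ordering matters: rule failures are caught for free, so the model evaluator only runs on outputs that are at least structurally valid.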
Step 3: Design the Correction Loop
```text
generate(input) → evaluate(output) → score
if score ≥ threshold → return output
if score < threshold AND attempts < max →
    enrich input with evaluator feedback
    generate again (with feedback)
if attempts ≥ max → fallback or escalate
```

Critical: The retry input MUST be different from the original. Include:
- The evaluator's specific feedback
- What was wrong and why
- A suggestion for how to fix it
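The loop above can be sketched in a few lines. `generate` and `evaluate` are hypothetical hooks for your own model call and evaluator; `evaluate` is assumed to return a `(score, feedback)` pair. The key line is the prompt enrichment: the retry carries the evaluator's feedback instead of resending the original input.

```python
def correction_loop(task, generate, evaluate,
                    threshold=0.8, max_attempts=3, fallback=None):
    prompt = task
    for attempt in range(1, max_attempts + 1):
        output = generate(prompt)
        score, feedback = evaluate(output)
        if score >= threshold:
            return output
        # Critical: enrich the retry input — never resend the original.
        prompt = (f"{task}\n\nPrevious attempt {attempt} scored {score:.2f}. "
                  f"Evaluator feedback:\n{feedback}\n"
                  f"Fix these issues and try again.")
    return fallback  # attempts exhausted: fall back or escalate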
Step 4: Set Up Regression Detection
When changing prompts, models, or tools:
- Run golden test set with OLD config → baseline scores
- Run golden test set with NEW config → new scores
- Compare: improvement ≥ 5% → accept; regression ≥ 5% → reject
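The accept/reject gate above can be sketched as a comparison of mean golden-set scores under the old and new configs. The 5% margins follow the rule above; treating the in-between zone as "inconclusive" (inspect per-case diffs before deciding) is an assumption, since the rule only specifies the two extremes.

```python
def regression_gate(old_scores: list[float], new_scores: list[float],
                    margin: float = 0.05) -> str:
    old_mean = sum(old_scores) / len(old_scores)
    new_mean = sum(new_scores) / len(new_scores)
    delta = (new_mean - old_mean) / old_mean   # relative change
    if delta >= margin:
        return "accept"      # improvement ≥ 5%
    if delta <= -margin:
        return "reject"      # regression ≥ 5%
    return "inconclusive"    # within noise: inspect per-case diffs

# Old config averages 0.75, new config 0.83 — a ~10% relative gain.
print(regression_gate([0.70, 0.80, 0.75], [0.80, 0.85, 0.84]))  # → accept
```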
Step 5: Continuous Monitoring
For production workflows:
- Sample 1-5% of outputs for automated evaluation
- Track quality scores over time
- Alert on downward trends
- A/B test changes before full rollout
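The sampling-and-alerting idea can be sketched as below. The 5% sampling rate matches the range above; the 100-score window, the 20-score "recent" slice, and the 0.05 drop margin are illustrative choices, and `evaluate` and `alert` are hypothetical hooks for your evaluator and alerting channel.

```python
import random
from collections import deque

SAMPLE_RATE = 0.05            # evaluate ~5% of production outputs
window = deque(maxlen=100)    # rolling window of recent quality scores

def maybe_monitor(output, evaluate, alert, rng=random.random):
    if rng() >= SAMPLE_RATE:
        return                       # skip the unsampled ~95%
    window.append(evaluate(output))
    if len(window) == window.maxlen:
        recent = sum(list(window)[-20:]) / 20
        overall = sum(window) / len(window)
        if recent < overall - 0.05:  # downward trend: recent lags baseline
            alert(f"quality drop: recent {recent:.2f} "
                  f"vs baseline {overall:.2f}")
```

A rolling window keeps the baseline adaptive; for real deployments you would likely persist scores and alert on a statistical trend test rather than a fixed margin.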
Iteration Checklist
- Quality criteria defined with weights and thresholds
- Evaluator selected and configured
- Correction loop has max attempts limit
- Feedback is injected into retries (not identical retry)
- Golden test set exists with ≥ 10 cases
- Regression detection configured for changes
- Production monitoring in place
Recommended Next Step
After setting up feedback loops, run {{command_prefix}}evaluate to validate the loop with real scenarios, then {{command_prefix}}refine for final polish.

NEVER:
- Retry with the exact same input (definition of insanity)
- Use the same weak model to both generate and evaluate
- Skip the max attempts limit (infinite loops are real)
- Deploy changes without regression testing against golden set
- Monitor only errors — track quality scores over time