iterate


MANDATORY PREPARATION

Invoke {{command_prefix}}agent-workflow — it contains workflow principles, anti-patterns, and the Context Gathering Protocol. Follow the protocol before proceeding — if no workflow context exists yet, you MUST run {{command_prefix}}teach-maestro first.
Consult the feedback-loops reference in the agent-workflow skill for evaluation patterns and self-correction strategies.

Set up feedback loops that make workflows self-correcting and continuously improving. Iteration transforms one-shot gambles into convergent, reliable systems.

Feedback Loop Design

Step 1: Define Quality Criteria

What does "good output" look like? Score dimensions:
| Dimension    | Weight | Threshold | Measurement                |
|--------------|--------|-----------|----------------------------|
| Accuracy     | 0.4    | ≥ 0.8     | Factual correctness check  |
| Completeness | 0.3    | ≥ 0.7     | Required fields present    |
| Format       | 0.2    | ≥ 0.9     | Schema compliance          |
| Tone         | 0.1    | ≥ 0.6     | Appropriate for audience   |
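
The weighted dimensions above can be combined into a single score. A minimal sketch (the dimension names, weights, and thresholds come from the table; how each per-dimension score is produced is left to your evaluator):

```python
# Weighted quality scoring: each dimension is scored 0.0-1.0,
# must clear its own threshold, and contributes weight * score
# to the overall result.
DIMENSIONS = {
    # name: (weight, threshold)
    "accuracy":     (0.4, 0.8),
    "completeness": (0.3, 0.7),
    "format":       (0.2, 0.9),
    "tone":         (0.1, 0.6),
}

def combine(scores: dict[str, float]) -> tuple[float, list[str]]:
    """Return (weighted total, dimensions that fell below their threshold)."""
    total = 0.0
    failures = []
    for name, (weight, threshold) in DIMENSIONS.items():
        score = scores[name]
        total += weight * score
        if score < threshold:
            failures.append(name)
    return total, failures

total, failures = combine(
    {"accuracy": 0.9, "completeness": 0.8, "format": 1.0, "tone": 0.5}
)
# tone (0.5) misses its 0.6 threshold, so it is flagged even though
# the weighted total (0.85) looks healthy.
```

Tracking per-dimension failures separately from the total matters: a high aggregate can hide a failing dimension, and the failure list is exactly what you feed back into the retry in Step 3.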

Step 2: Choose Evaluator Type

Match evaluator to requirements:
  • Rule-based: Schema validation, field presence, value ranges (fast, free)
  • Self-check: Same model evaluates own output (fast, cheap, less reliable)
  • Cross-model: Different model evaluates (slower, more reliable)
  • Human-in-the-loop: Human review (slowest, most reliable, doesn't scale)
  • Hybrid: Rules first, then model check for what rules can't catch
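
The hybrid pattern can be sketched as a cheap rule pass that short-circuits before any model call. `model_check` here is a hypothetical stand-in for whichever self-check or cross-model evaluator you plug in:

```python
import json

def rule_check(output: str, required_fields: set[str]) -> list[str]:
    """Deterministic checks first: valid JSON, required fields present."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    missing = required_fields - data.keys()
    return [f"missing field: {f}" for f in sorted(missing)]

def hybrid_evaluate(output, required_fields, model_check):
    # Rules are fast and free; only escalate to a model for
    # qualities rules cannot measure (accuracy, tone).
    problems = rule_check(output, required_fields)
    if problems:
        return False, problems        # fail fast, no model cost
    return model_check(output)        # (passed: bool, feedback: list[str])

ok, feedback = hybrid_evaluate('{"title": "x"}', {"title", "body"},
                               model_check=lambda o: (True, []))
# rejected by rules alone; the model is never invoked
```

The ordering is the point: the deterministic layer filters out the cheap failures so the slower, costlier evaluator only sees outputs that are at least structurally plausible.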

Step 3: Design the Correction Loop

```text
generate(input) → evaluate(output) → score
  if score ≥ threshold → return output
  if score < threshold AND attempts < max →
    enrich input with evaluator feedback
    generate again (with feedback)
  if attempts ≥ max → fallback or escalate
```
Critical: The retry input MUST be different from the original. Include:
  • The evaluator's specific feedback
  • What was wrong and why
  • A suggestion for how to fix it
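
The loop above, including feedback injection on retry, might look like this in Python. `generate` and `evaluate` are hypothetical callables standing in for your model call and your evaluator from Step 2:

```python
def correction_loop(task, generate, evaluate, threshold=0.8, max_attempts=3):
    """Retry with evaluator feedback folded into the prompt, never verbatim."""
    prompt = task
    for attempt in range(1, max_attempts + 1):
        output = generate(prompt)
        score, feedback = evaluate(output)
        if score >= threshold:
            return output
        # The retry input MUST differ from the original: it carries
        # the score, what was wrong, and an instruction to fix it.
        prompt = (
            f"{task}\n\nYour previous attempt scored {score:.2f} "
            f"(threshold {threshold}). Problems: {feedback}\n"
            "Fix these issues and try again."
        )
    # attempts exhausted: escalate rather than loop forever
    raise RuntimeError(f"No passing output after {max_attempts} attempts")
```

Note the hard `max_attempts` bound and the terminal `raise`: the fallback/escalate branch is not optional, it is what keeps a misbehaving evaluator from turning into an infinite loop.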

Step 4: Set Up Regression Detection

When changing prompts, models, or tools:
  1. Run golden test set with OLD config → baseline scores
  2. Run golden test set with NEW config → new scores
  3. Compare: improvement ≥ 5% → accept; regression ≥ 5% → reject
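
Sketched as a comparison over the golden set (`score_with` is a hypothetical function that runs one golden case under a given config and returns its quality score):

```python
def regression_check(golden_set, score_with, old_config, new_config,
                     improve_at=0.05, regress_at=0.05):
    """Accept a change only if the golden-set average improves by >= 5%;
    reject on a >= 5% regression; anything in between stays inconclusive."""
    baseline = sum(score_with(old_config, c) for c in golden_set) / len(golden_set)
    candidate = sum(score_with(new_config, c) for c in golden_set) / len(golden_set)
    delta = (candidate - baseline) / baseline   # relative change
    if delta >= improve_at:
        return "accept", delta
    if delta <= -regress_at:
        return "reject", delta
    return "inconclusive", delta
```

The inconclusive band is deliberate: a ±5% wobble on a small golden set is often noise, and treating it as either a win or a regression invites churn.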

Step 5: Continuous Monitoring

For production workflows:
  • Sample 1-5% of outputs for automated evaluation
  • Track quality scores over time
  • Alert on downward trends
  • A/B test changes before full rollout
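
The sampling-plus-alert pattern might be sketched like this; the evaluator and the alert hook are placeholders for your Step 2 evaluator and whatever paging/alerting channel you use:

```python
import random
from collections import deque

class QualityMonitor:
    """Evaluate a random sample of production outputs (1-5%) and
    alert when the rolling average drops below a quality floor."""

    def __init__(self, evaluate, alert, sample_rate=0.02,
                 window=100, floor=0.75):
        self.evaluate = evaluate          # output -> score in [0, 1]
        self.alert = alert                # called with a warning message
        self.sample_rate = sample_rate
        self.scores = deque(maxlen=window)  # rolling score window
        self.floor = floor

    def observe(self, output):
        if random.random() >= self.sample_rate:
            return                        # not sampled: zero added cost
        self.scores.append(self.evaluate(output))
        avg = sum(self.scores) / len(self.scores)
        # Wait for a minimum sample before alerting to avoid noise.
        if len(self.scores) >= 10 and avg < self.floor:
            self.alert(f"quality trending down: rolling avg {avg:.2f}")
```

Sampling keeps evaluation cost proportional to the rate, not to traffic, and the rolling window is what turns isolated bad outputs into a detectable trend rather than a flood of one-off alerts.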

Iteration Checklist

  • Quality criteria defined with weights and thresholds
  • Evaluator selected and configured
  • Correction loop has max attempts limit
  • Feedback is injected into retries (not identical retry)
  • Golden test set exists with ≥ 10 cases
  • Regression detection configured for changes
  • Production monitoring in place

Recommended Next Step

After setting up feedback loops, run {{command_prefix}}evaluate to validate the loop with real scenarios, then {{command_prefix}}refine for final polish.

NEVER:
  • Retry with the exact same input (definition of insanity)
  • Use the same weak model to both generate and evaluate
  • Skip the max attempts limit (infinite loops are real)
  • Deploy changes without regression testing against golden set
  • Monitor only errors — track quality scores over time