iterate


MANDATORY PREPARATION

Invoke {{command_prefix}}agent-workflow — it contains workflow principles, anti-patterns, and the Context Gathering Protocol. Follow the protocol before proceeding — if no workflow context exists yet, you MUST run {{command_prefix}}teach-maestro first.
Consult the feedback-loops reference in the agent-workflow skill for evaluation patterns and self-correction strategies.

Set up feedback loops that make workflows self-correcting and continuously improving. Iteration transforms one-shot gambles into convergent, reliable systems.

Feedback Loop Design

Step 1: Define Quality Criteria

What does "good output" look like? Score dimensions:
| Dimension    | Weight | Threshold | Measurement                |
|--------------|--------|-----------|----------------------------|
| Accuracy     | 0.4    | ≥ 0.8     | Factual correctness check  |
| Completeness | 0.3    | ≥ 0.7     | Required fields present    |
| Format       | 0.2    | ≥ 0.9     | Schema compliance          |
| Tone         | 0.1    | ≥ 0.6     | Appropriate for audience   |
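
The weighted dimensions above can be combined into a single score. A minimal sketch (the dimension names, weights, and thresholds come from the table; how each per-dimension score is produced is left to your evaluator):

```python
# Weighted quality scoring: each dimension is scored 0.0-1.0,
# must clear its own threshold, and contributes weight * score
# to the overall result.
DIMENSIONS = {
    # name: (weight, threshold)
    "accuracy":     (0.4, 0.8),
    "completeness": (0.3, 0.7),
    "format":       (0.2, 0.9),
    "tone":         (0.1, 0.6),
}

def combine(scores: dict[str, float]) -> tuple[float, list[str]]:
    """Return (weighted total, dimensions that fell below their threshold)."""
    total = 0.0
    failures = []
    for name, (weight, threshold) in DIMENSIONS.items():
        score = scores[name]
        total += weight * score
        if score < threshold:
            failures.append(name)
    return total, failures

total, failures = combine(
    {"accuracy": 0.9, "completeness": 0.8, "format": 1.0, "tone": 0.5}
)
# tone (0.5) misses its 0.6 threshold, so it is flagged even though
# the weighted total (0.85) looks healthy.
```

Tracking per-dimension failures separately from the total matters: a high aggregate can hide a failing dimension, and the failure list is exactly what you feed back into the retry in Step 3.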

Step 2: Choose Evaluator Type

Match evaluator to requirements:
  • Rule-based: Schema validation, field presence, value ranges (fast, free)
  • Self-check: Same model evaluates own output (fast, cheap, less reliable)
  • Cross-model: Different model evaluates (slower, more reliable)
  • Human-in-the-loop: Human review (slowest, most reliable, doesn't scale)
  • Hybrid: Rules first, then model check for what rules can't catch
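
The hybrid pattern can be sketched as a cheap rule pass that short-circuits before any model call. `model_check` here is a hypothetical stand-in for whichever self-check or cross-model evaluator you plug in:

```python
import json

def rule_check(output: str, required_fields: set[str]) -> list[str]:
    """Deterministic checks first: valid JSON, required fields present."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    missing = required_fields - data.keys()
    return [f"missing field: {f}" for f in sorted(missing)]

def hybrid_evaluate(output, required_fields, model_check):
    # Rules are fast and free; only escalate to a model for
    # qualities rules cannot measure (accuracy, tone).
    problems = rule_check(output, required_fields)
    if problems:
        return False, problems        # fail fast, no model cost
    return model_check(output)        # (passed: bool, feedback: list[str])

ok, feedback = hybrid_evaluate('{"title": "x"}', {"title", "body"},
                               model_check=lambda o: (True, []))
# rejected by rules alone; the model is never invoked
```

The ordering is the point: the deterministic layer filters out the cheap failures so the slower, costlier evaluator only sees outputs that are at least structurally plausible.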

Step 3: Design the Correction Loop

```text
generate(input) → evaluate(output) → score
  if score ≥ threshold → return output
  if score < threshold AND attempts < max →
    enrich input with evaluator feedback
    generate again (with feedback)
  if attempts ≥ max → fallback or escalate
```
Critical: The retry input MUST be different from the original. Include:
  • The evaluator's specific feedback
  • What was wrong and why
  • A suggestion for how to fix it
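
The loop above, including feedback injection on retry, might look like this in Python. `generate` and `evaluate` are hypothetical callables standing in for your model call and your evaluator from Step 2:

```python
def correction_loop(task, generate, evaluate, threshold=0.8, max_attempts=3):
    """Retry with evaluator feedback folded into the prompt, never verbatim."""
    prompt = task
    for attempt in range(1, max_attempts + 1):
        output = generate(prompt)
        score, feedback = evaluate(output)
        if score >= threshold:
            return output
        # The retry input MUST differ from the original: it carries
        # the score, what was wrong, and an instruction to fix it.
        prompt = (
            f"{task}\n\nYour previous attempt scored {score:.2f} "
            f"(threshold {threshold}). Problems: {feedback}\n"
            "Fix these issues and try again."
        )
    # attempts exhausted: escalate rather than loop forever
    raise RuntimeError(f"No passing output after {max_attempts} attempts")
```

Note the hard `max_attempts` bound and the terminal `raise`: the fallback/escalate branch is not optional, it is what keeps a misbehaving evaluator from turning into an infinite loop.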

Step 4: Set Up Regression Detection

When changing prompts, models, or tools:
  1. Run golden test set with OLD config → baseline scores
  2. Run golden test set with NEW config → new scores
  3. Compare: improvement ≥ 5% → accept; regression ≥ 5% → reject
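
Sketched as a comparison over the golden set (`score_with` is a hypothetical function that runs one golden case under a given config and returns its quality score):

```python
def regression_check(golden_set, score_with, old_config, new_config,
                     improve_at=0.05, regress_at=0.05):
    """Accept a change only if the golden-set average improves by >= 5%;
    reject on a >= 5% regression; anything in between stays inconclusive."""
    baseline = sum(score_with(old_config, c) for c in golden_set) / len(golden_set)
    candidate = sum(score_with(new_config, c) for c in golden_set) / len(golden_set)
    delta = (candidate - baseline) / baseline   # relative change
    if delta >= improve_at:
        return "accept", delta
    if delta <= -regress_at:
        return "reject", delta
    return "inconclusive", delta
```

The inconclusive band is deliberate: a ±5% wobble on a small golden set is often noise, and treating it as either a win or a regression invites churn.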

Step 5: Continuous Monitoring

For production workflows:
  • Sample 1-5% of outputs for automated evaluation
  • Track quality scores over time
  • Alert on downward trends
  • A/B test changes before full rollout
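
The sampling-plus-alert pattern might be sketched like this; the evaluator and the alert hook are placeholders for your Step 2 evaluator and whatever paging/alerting channel you use:

```python
import random
from collections import deque

class QualityMonitor:
    """Evaluate a random sample of production outputs (1-5%) and
    alert when the rolling average drops below a quality floor."""

    def __init__(self, evaluate, alert, sample_rate=0.02,
                 window=100, floor=0.75):
        self.evaluate = evaluate          # output -> score in [0, 1]
        self.alert = alert                # called with a warning message
        self.sample_rate = sample_rate
        self.scores = deque(maxlen=window)  # rolling score window
        self.floor = floor

    def observe(self, output):
        if random.random() >= self.sample_rate:
            return                        # not sampled: zero added cost
        self.scores.append(self.evaluate(output))
        avg = sum(self.scores) / len(self.scores)
        # Wait for a minimum sample before alerting to avoid noise.
        if len(self.scores) >= 10 and avg < self.floor:
            self.alert(f"quality trending down: rolling avg {avg:.2f}")
```

Sampling keeps evaluation cost proportional to the rate, not to traffic, and the rolling window is what turns isolated bad outputs into a detectable trend rather than a flood of one-off alerts.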

Iteration Checklist

  • Quality criteria defined with weights and thresholds
  • Evaluator selected and configured
  • Correction loop has max attempts limit
  • Feedback is injected into retries (not identical retry)
  • Golden test set exists with ≥ 10 cases
  • Regression detection configured for changes
  • Production monitoring in place

Recommended Next Step

After setting up feedback loops, run {{command_prefix}}evaluate to validate the loop with real scenarios, then {{command_prefix}}refine for final polish.

NEVER:
  • Retry with the exact same input (definition of insanity)
  • Use the same weak model to both generate and evaluate
  • Skip the max attempts limit (infinite loops are real)
  • Deploy changes without regression testing against golden set
  • Monitor only errors — track quality scores over time