

# Self-Improving Agent Builder


## Purpose


Run a closed-loop improvement cycle on any goal-seeking agent implementation:

EVAL -> ANALYZE -> RESEARCH -> IMPROVE -> RE-EVAL -> DECIDE -> (repeat)

Each iteration measures L1-L12 progressive test scores, identifies failures with `error_analyzer.py`, runs a research step with hypothesis/evidence/counter-arguments, applies targeted fixes, and gates promotion through regression checks.
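
The cycle above can be sketched as a small control loop. This is a hypothetical skeleton, not the real runner: the phase internals (`evaluate`, `improve_once`), the score shape, and the uniform +3 bump are illustrative assumptions; only the control flow mirrors the description.

```python
# Hypothetical skeleton of EVAL -> ANALYZE -> RESEARCH -> IMPROVE ->
# RE-EVAL -> DECIDE. Phase internals are stubbed for illustration.

def evaluate(agent):
    # EVAL / RE-EVAL: per-level scores (stubbed from agent state)
    return dict(agent["scores"])

def improve_once(agent):
    # ANALYZE + RESEARCH + IMPROVE collapsed into one illustrative bump
    agent["scores"] = {k: min(100.0, v + 3.0) for k, v in agent["scores"].items()}

def mean(scores):
    return sum(scores.values()) / len(scores)

def run_loop(agent, iterations=3, improve_pct=2.0, regress_pct=5.0):
    baseline = evaluate(agent)
    verdicts = []
    for _ in range(iterations):
        improve_once(agent)
        after = evaluate(agent)
        worst = max(baseline[k] - after[k] for k in baseline)  # largest per-level drop
        if worst > regress_pct:
            agent["scores"] = dict(baseline)   # REVERT all changes
            verdicts.append("REVERT")
        elif mean(after) - mean(baseline) >= improve_pct:
            baseline = after                   # COMMIT
            verdicts.append("COMMIT")
        else:
            baseline = after                   # COMMIT with marginal note
            verdicts.append("COMMIT_MARGINAL")
    return verdicts

agent = {"scores": {"L1": 80.0, "L2": 60.0, "L3": 40.0}}
print(run_loop(agent))  # -> ['COMMIT', 'COMMIT', 'COMMIT']
```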

## When I Activate


  • "improve agent" or "self-improving loop"
  • "agent eval loop" or "run improvement cycle"
  • "benchmark agents" or "compare SDK implementations"
  • "iterate on agent scores" or "fix agent regressions"

## Quick Start


User: "Run the self-improving loop on the mini-framework agent for 3 iterations"

Skill: Executes 3 iterations of EVAL->ANALYZE->RESEARCH->IMPROVE->RE-EVAL->DECIDE
       Reports per-iteration scores, net improvement, and commits/reverts.

## Runner Script


The self-improvement loop is implemented as a Python CLI:

### Basic usage


```bash
python -m amplihack.eval.self_improve.runner --sdk mini --iterations 3
```

### Full options


```bash
python -m amplihack.eval.self_improve.runner \
  --sdk mini \
  --iterations 5 \
  --improvement-threshold 2.0 \
  --regression-tolerance 5.0 \
  --levels L1 L2 L3 L4 L5 L6 \
  --output-dir ./eval_results/self_improve \
  --dry-run  # evaluate only, don't apply changes
```

**Source:** `src/amplihack/eval/self_improve/runner.py`

## The Loop (6 Phases per Iteration)


### Phase 1: EVAL


Run the L1-L12 progressive test suite on the current agent implementation.

Execution:

```bash
python -m amplihack.eval.progressive_test_suite \
  --agent-name <agent_name> \
  --output-dir <output_dir>/iteration_N/eval \
  --levels L1 L2 L3 L4 L5 L6
```

Output: Per-level scores and overall baseline.
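
The overall baseline can be derived from the per-level scores. A minimal sketch, assuming scores are 0-100 percentages and the overall value is their unweighted mean (an assumption for illustration, not the suite's documented formula; the level scores are invented):

```python
# Aggregate per-level scores into an overall baseline.
# Unweighted mean and the score values are illustrative assumptions.
level_scores = {"L1": 92.0, "L2": 75.0, "L3": 61.0, "L4": 48.0, "L5": 40.0, "L6": 33.0}

overall = sum(level_scores.values()) / len(level_scores)
print(f"overall baseline: {overall:.1f}%")  # -> overall baseline: 58.2%
```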

### Phase 2: ANALYZE


Classify failures using `error_analyzer.py`. Maps each failed question to a failure taxonomy (retrieval_insufficient, temporal_ordering_wrong, etc.) and the specific code component responsible.

```python
from amplihack.eval.self_improve import analyze_eval_results

analyses = analyze_eval_results(level_results, score_threshold=0.6)
```

Each ErrorAnalysis maps to:

failure_mode -> affected_component -> prompt_template
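
The mapping can be pictured as a small record type. The field names `failure_mode`, `affected_component`, and `prompt_template` come from this document; the dataclass itself and the example values are a hypothetical sketch, not the actual class in `error_analyzer.py`.

```python
from dataclasses import dataclass

@dataclass
class ErrorAnalysis:
    # Field names from the mapping above; this shape is an illustrative sketch.
    question_id: str
    failure_mode: str        # e.g. "retrieval_insufficient"
    affected_component: str  # code component responsible for the failure
    prompt_template: str     # candidate fix template to apply

ea = ErrorAnalysis("q17", "temporal_ordering_wrong", "memory.timeline", "ordering_fix_v2")
print(ea.failure_mode, "->", ea.affected_component, "->", ea.prompt_template)
```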

### Phase 3: RESEARCH (New)


The critical thinking step that prevents blind changes. For each proposed improvement:
  1. State hypothesis: What specific change will fix the failure?
  2. Gather evidence: From eval results, failure patterns, baseline scores
  3. Consider counter-arguments: What could go wrong? Risk of regression?
  4. Make decision: Apply, skip, or defer with full reasoning
Decisions are logged in `research_decisions.json` for auditability.
Decision criteria:
  • Apply: Clear failure pattern + prompt template available + low score
  • Skip: Score above 50% (likely stochastic variation)
  • Defer: Ambiguous evidence, needs more data
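
The decision criteria above could be expressed as one small function. This is a sketch under stated assumptions: scores are on a 0-1 scale (matching `score_threshold=0.6` in the ANALYZE phase), and "clear failure pattern" and "prompt template available" are modeled as booleans; the function and record shape are hypothetical, not the real module.

```python
import json

def research_decision(score, clear_pattern, template_available):
    """Illustrative apply/skip/defer criteria (not the actual implementation)."""
    if score > 0.5:
        return "skip"   # likely stochastic variation
    if clear_pattern and template_available:
        return "apply"  # clear failure pattern + template available + low score
    return "defer"      # ambiguous evidence, needs more data

decisions = [
    {"question": "q3", "decision": research_decision(0.2, True, True)},    # apply
    {"question": "q9", "decision": research_decision(0.7, True, True)},    # skip
    {"question": "q12", "decision": research_decision(0.3, True, False)},  # defer
]
# Serialized for auditability, mirroring the research_decisions.json log
print(json.dumps(decisions, indent=2))
```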

### Phase 4: IMPROVE


Apply the improvements approved by the research step. Priority order:
  1. Prompt template improvements (safest, highest impact)
  2. Retrieval strategy adjustments
  3. Code logic fixes (most risky, needs careful review)
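
The priority order can be enforced by sorting approved improvements before applying them. A minimal sketch: the three category names follow the list above, while the improvement records and `target` fields are hypothetical.

```python
# Apply approved improvements in the safety-first order described above.
# Category names from the text; records are illustrative placeholders.
PRIORITY = {"prompt_template": 0, "retrieval_strategy": 1, "code_logic": 2}

improvements = [
    {"kind": "code_logic", "target": "planner.route"},
    {"kind": "prompt_template", "target": "summarize_v3"},
    {"kind": "retrieval_strategy", "target": "top_k=8"},
]

for imp in sorted(improvements, key=lambda i: PRIORITY[i["kind"]]):
    print(imp["kind"], imp["target"])  # prompt_template first, code_logic last
```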

### Phase 5: RE-EVAL


Re-run the same eval suite after applying fixes to measure impact.

### Phase 6: DECIDE


Promotion gate:
  • Net improvement >= +2% overall score: COMMIT the changes
  • Any single level regression > 5%: REVERT all changes
  • Otherwise: COMMIT with marginal improvement note
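
The gate can be sketched as a pure function over before/after scores (percentages, per the thresholds above). The function and the verdict labels `COMMIT_MARGINAL` etc. are illustrative, not the runner's actual API:

```python
def promotion_gate(before, after, improve_threshold=2.0, regression_tolerance=5.0):
    """Illustrative promotion gate; scores are per-level percentages."""
    # REVERT if any single level regresses by more than the tolerance
    if any(after[k] < before[k] - regression_tolerance for k in before):
        return "REVERT"
    # Net change in overall (mean) score decides commit vs. marginal commit
    net = sum(after.values()) / len(after) - sum(before.values()) / len(before)
    return "COMMIT" if net >= improve_threshold else "COMMIT_MARGINAL"

before = {"L1": 90.0, "L2": 70.0, "L3": 50.0}
print(promotion_gate(before, {"L1": 92.0, "L2": 74.0, "L3": 54.0}))  # -> COMMIT
print(promotion_gate(before, {"L1": 95.0, "L2": 80.0, "L3": 40.0}))  # -> REVERT (L3 drops 10 > 5)
print(promotion_gate(before, {"L1": 90.5, "L2": 70.5, "L3": 50.5}))  # -> COMMIT_MARGINAL
```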

## Configuration


| Parameter | Default | Description |
|---|---|---|
| `sdk_type` | `mini` | Which SDK: mini/claude/copilot/microsoft |
| `max_iterations` | `5` | Maximum improvement iterations |
| `improvement_threshold` | `2.0` | Minimum % improvement to commit |
| `regression_tolerance` | `5.0` | Maximum % regression on any level |
| `levels` | `L1-L6` | Which levels to evaluate |
| `output_dir` | `./eval_results/self_improve` | Results directory |
| `dry_run` | `false` | Evaluate only, don't apply changes |

## Programmatic Usage


```python
from amplihack.eval.self_improve import run_self_improvement, RunnerConfig

config = RunnerConfig(
    sdk_type="mini",
    max_iterations=3,
    improvement_threshold=2.0,
    regression_tolerance=5.0,
    levels=["L1", "L2", "L3", "L4", "L5", "L6"],
    output_dir="./eval_results/self_improve",
    dry_run=False,
)

result = run_self_improvement(config)
print(f"Total improvement: {result.total_improvement:+.1f}%")
print(f"Final scores: {result.final_scores}")
```

## 4-Way Benchmark Mode


Compare all SDK implementations side by side:
User: "Run a 4-way benchmark comparing all SDK implementations"

Skill: Runs eval suite on mini, claude, copilot, microsoft
       Generates comparison table with scores, LOC, and coverage.
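
A benchmark driver could loop the eval suite over each adapter and tabulate the results. A hypothetical sketch: the four SDK names come from this document, but `run_eval` is a stand-in and its score values are invented placeholders, not real benchmark numbers.

```python
# Hypothetical 4-way benchmark tabulation; scores are invented placeholders.
SDKS = ["mini", "claude", "copilot", "microsoft"]

def run_eval(sdk):
    # Stand-in for running the progressive test suite on one adapter
    fake_scores = {"mini": 72.4, "claude": 75.1, "copilot": 69.8, "microsoft": 71.0}
    return fake_scores[sdk]

rows = [(sdk, run_eval(sdk)) for sdk in SDKS]
print(f"{'SDK':<10} {'Overall %':>9}")
for sdk, score in sorted(rows, key=lambda r: -r[1]):
    print(f"{sdk:<10} {score:>9.1f}")
```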

## Integration Points


  • `src/amplihack/eval/self_improve/runner.py`: Self-improvement loop runner
  • `src/amplihack/eval/self_improve/error_analyzer.py`: Failure classification
  • `src/amplihack/eval/progressive_test_suite.py`: L1-L12 eval runner
  • `src/amplihack/agents/goal_seeking/sdk_adapters/`: All 4 SDK implementations
  • `src/amplihack/eval/metacognition_grader.py`: Advanced eval dimensions
  • `src/amplihack/eval/teaching_session.py`: L7 teaching quality eval