# skill-refiner

Skill Refiner: Iterative Self-Improvement Loop
Adaptive evaluation loop for AI skill collections, inspired by Karpathy's AutoResearch.
Orchestrates repeated score-improve-verify cycles using skill-creator as the engine
and mandatory peer review as an adversarial check (cross-model when a secondary harness
is available, fresh-context self-review as the minimum fallback).
## When to use
- Batch-improving the entire skill collection after a period of manual edits
- Running quality sweeps before a release or publish
- Triggering a self-improvement cycle where skills bootstrap each other
- After adding several new skills that need polish and consistency alignment
- When cross-model perspective would catch single-model blind spots
- Periodic maintenance: scheduled improvement runs to keep skills current
## When NOT to use
- Single skill review or improvement - use skill-creator (Mode 2)
- Creating a new skill from scratch - use skill-creator (Mode 1)
- One-off collection audit without iteration - use skill-creator (Mode 3)
- Full codebase review (code, not skills) - use full-review
- Style/slop audit on application code - use anti-slop
## Configuration
```
skill-refiner [--iterations N] [--mode MODE] [--secondary HARNESS] [--threshold N] [--plateau N]
```

| Flag | Default | Description |
|---|---|---|
| `--iterations` | 10 | Maximum iterations for phase 1 |
| `--mode` | circuit-breaker | Checkpoint mode: `circuit-breaker`, `auto`, or `step` |
| `--secondary` | auto-detect | Secondary harness for cross-model review |
| `--threshold` | 85 | Focus threshold - skip skills scoring above this (user can override max) |
| `--plateau` | 2 | Minimum score delta to keep iterating |
Environment override: `SKILL_REFINER_SECONDARY=<harness>` (CLI flag takes precedence)
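The precedence rule can be sketched as follows (the helper name is illustrative, not the skill's actual implementation): the CLI flag wins over the `SKILL_REFINER_SECONDARY` environment variable, which wins over auto-detection.

```python
import os

# Illustrative sketch of the override precedence described above:
# CLI flag > SKILL_REFINER_SECONDARY env var > auto-detect default.
def resolve_secondary(cli_flag=None, env=None):
    env = os.environ if env is None else env
    if cli_flag is not None:  # explicit CLI flag takes precedence
        return cli_flag
    return env.get("SKILL_REFINER_SECONDARY", "auto-detect")
```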
## Checkpoint Modes
circuit-breaker (default): runs autonomously, auto-pauses on score regression,
contested major flags, or plateau. Always pauses before phase 2.
auto: fully autonomous through phase 1. Still pauses before phase 2 and on
contested major flags (non-configurable).
step: pauses after every iteration for manual review. Best for first run or learning.
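The three modes above reduce to a small pause policy. A minimal sketch (event names are hypothetical labels, not the skill's real API) that encodes the non-configurable checkpoints first:

```python
# Sketch of the checkpoint policy: phase-2 entry and contested major
# flags always pause, regardless of mode; the rest depends on the mode.
def should_pause(mode, event):
    if event in ("phase2-entry", "contested-major-flag"):
        return True  # non-configurable checkpoints
    if mode == "step":
        return event == "iteration-end"  # pause after every iteration
    if mode == "circuit-breaker":
        return event in ("score-regression", "plateau")
    return False  # mode == "auto": fully autonomous through phase 1
```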
## Workflow
### Phase 0: Setup
1. Create feature branch: `skill-refiner/YYYY-MM-DD-HHMMSS` from current HEAD
2. Load run history: read `.refiner-runs.json` from the collection root (if it exists). Use previous run data for: baseline score comparison (detect regressions from external changes), model/harness change detection (flag if the primary or secondary model changed since last run - new model = new baseline, not a comparable delta), and skip analysis (don't re-attempt improvements that were already tried and reverted in a recent run).
3. Build skill inventory: list all skills, exclude phase-2 targets (skill-creator, skill-refiner) from the improvement pool
4. Detect primary harness: check environment to identify which AI CLI is running this session
5. Probe for secondary harness: run three-step validation (PATH check, config check, smoke test) per `references/harness-detection.md`. Announce result.
6. If no secondary found: always fall back to self-review. Spawn a fresh agent on the current harness with the review prompt template from `references/harness-detection.md`. Label as "same-model fresh-context review" in scoring, weight at 3% instead of 5% (composite becomes gate/40/55/3; renormalize the missing 2% proportionally to AI Self-Check and Behavioral). This catches confirmation bias but shares the primary model's blind spots. Skipping review entirely is not an option - a fresh-context self-review is the minimum bar. If the harness doesn't support subagents, run the review prompt as a separate CLI invocation (`claude -p`, `codex exec`, etc.).
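The fallback reweighting can be made concrete. A minimal sketch, assuming the stated normal weights (AI Self-Check 40, Behavioral 55, Cross-model 5) and the 3% fallback, with the freed 2% split in proportion to the 40/55 ratio:

```python
# Sketch of the fallback reweighting: cross-model review drops from 5%
# to 3%, and the missing 2% is redistributed proportionally between
# AI Self-Check and Behavioral.
def fallback_weights(ai=40.0, behavioral=55.0, cross=5.0, cross_fallback=3.0):
    freed = cross - cross_fallback   # the 2% to redistribute
    total = ai + behavioral          # 95 under the defaults
    return {
        "ai_self_check": ai + freed * ai / total,
        "behavioral": behavioral + freed * behavioral / total,
        "cross_model": cross_fallback,
    }
```

Under the defaults this yields roughly 40.84 / 56.16 / 3, which still sums to 100.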
### Phase 1: Regular Iterations
7. Iteration 1 - full sweep: score every skill in the pool using the four-component model from `references/evaluation-criteria.md`:
   - Structural: run lint-skills.sh + validate-spec.sh
   - AI Self-Check: invoke skill-creator review mode on each skill
   - Behavioral: run test prompts from `references/test-cases.md`. For skills without pre-written test cases, auto-generate 2-3 test prompts from the skill's "When to use" section and quality signals from its AI Self-Check. Log a warning that generated tests are lower quality than hand-written ones. Optionally save generated tests to a test-cases-local.md file alongside test-cases.md so they accumulate across runs.
   - Cross-model: skip on first iteration (no diff to review yet)
8. Log baseline scores: record per-skill and aggregate scores
9. Iteration 2+ - enter adaptive focus mode. Select targets: identify skills scoring below the focus threshold
10. For each targeted skill, run the improvement cycle:
    a. Read current SKILL.md and all reference files
    b. Invoke skill-creator review mode - collect findings
    c. Run behavioral test - score current output quality
    d. Propose targeted improvements based on findings (not random changes)
    e. Apply changes to SKILL.md (and references if needed)
    f. Re-score: run lint + AI Self-Check + behavioral test
    g. Karpathy gate: if score improved, keep. If not, revert. No exceptions.
    h. If cross-model review available, send the diff to secondary harness
    i. Process flags per the verification protocol in `references/harness-detection.md`
    j. If secondary flags major issue and primary agrees: revert
    k. If secondary flags major issue and primary disagrees: escalate to circuit breaker
11. Commit iteration: one commit with all improvements from this iteration.
    Format: `refactor(skill-refiner): iteration N - skill1(+X), skill2(+Y)`
12. Log iteration summary:

    ```
    --- iteration N / max -------------------------------------------
    improved: skill1 (72 > 80 | G:pass A:76 B:78 X:90), skill2 (68 > 73 | G:pass A:70 B:72 X:100)
    gated: skillZ (lint/spec failed - excluded from scoring)
    skipped: M skills above threshold
    reverted: skill3 (proposed change scored -2, rolled back | G:pass A:74 B:69 X:100)
    contested: skill4 (secondary flagged major, primary disagreed)
    plateau: yes/no (max delta: +X)
    -----------------------------------------------------------------
    ```

13. Check termination conditions (phase 1 always flows into phase 2 on termination, except on circuit-breaker pauses which wait for user input first):
    - Plateau detected (max delta < plateau threshold)? Terminate phase 1.
    - All skills above focus threshold? Bump threshold by 5 and continue. If threshold is already at max (95) and all skills still clear it, terminate phase 1.
    - Iteration cap reached? Terminate phase 1.
    - Circuit breaker triggered? Pause for user input.
14. Repeat from step 9 until terminated
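The termination checks above can be sketched as a single decision function (an illustrative sketch, assuming the default plateau delta of 2, the bump of 5, and the threshold cap of 95 - not the skill's actual code):

```python
# Sketch of the phase-1 termination logic. Returns an (action, threshold)
# pair: "terminate" ends phase 1, "continue" may carry a bumped threshold.
def check_termination(scores, threshold, max_delta, iteration, max_iterations,
                      plateau=2, threshold_cap=95):
    if max_delta < plateau:
        return "terminate", threshold          # plateau detected
    if all(s > threshold for s in scores):
        if threshold >= threshold_cap:
            return "terminate", threshold      # everything clears the cap
        return "continue", min(threshold + 5, threshold_cap)  # bump and go on
    if iteration >= max_iterations:
        return "terminate", threshold          # iteration cap reached
    return "continue", threshold
```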
### Phase 2: Meta-Improvement
- Announce: "Entering phase 2 - meta-improvement. This always requires human review."
- Snapshot evaluation criteria:
  - Copy skill-creator's AI Self-Check section to a temp location
  - Copy `references/evaluation-criteria.md` to a temp location
  - Copy skill-creator's `conventions.md` reference to a temp location

  These snapshots are the evaluation baseline for phase 2.
- Improve skill-creator: run the improvement cycle (steps 10a-10k) using the snapshot as the evaluation criteria, not skill-creator's live version
- Improve skill-refiner: same process, using the snapshot
- Improve lint scripts (lint-skills.sh, validate-spec.sh):
  - Capture baseline: run both scripts, save full output
  - Propose improvements
  - Apply changes
  - Run regression: compare output to baseline
  - If false positives or false negatives introduced: revert
  - If clean: keep
- Commit phase 2: one commit per target.
  Format: `refactor(skill-refiner): meta - improve <target> (+N)`
- Pause for human review: display phase 2 changes, wait for approval. This checkpoint is non-configurable - it fires even in `--mode auto`. A direct user approval such as "continue" or "proceed" counts as approval to resume.
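The lint-script regression check amounts to comparing finding sets before and after the change. A strict sketch (illustrative; the real check may tolerate intended diffs that a human approves):

```python
# Sketch of the regression gate for lint-script changes: any finding that
# newly appears (potential false positive) or vanishes (potential false
# negative) relative to the baseline output triggers a revert.
def regression_verdict(baseline_lines, new_lines):
    baseline, new = set(baseline_lines), set(new_lines)
    false_positives = new - baseline   # findings introduced by the change
    false_negatives = baseline - new   # findings the change no longer reports
    return "keep" if not false_positives and not false_negatives else "revert"
```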
### Phase 3: Summary
- Final report:

  ```
  === skill-refiner run complete ===================================
  Branch: skill-refiner/YYYY-MM-DD-HHMMSS
  Primary: <harness> <version> (<model>, effort: <level>)
  Secondary: <harness> <version> (<model>, effort: <level>) | none
  Pool: N skills (skill-creator, skill-refiner excluded)
  Config: iterations=M, threshold=T, mode=MODE, plateau=P
  Iterations: N (of max M)
  Terminated: plateau / threshold / cap / user
  Score changes:
    skill1: 62 > 88 (+26) [G:pass A:84 B:86 X:90]
    skill2: 71 > 85 (+14) [G:pass A:82 B:79 X:100]
    ...
    skill-creator: 80 > 84 (+4) [G:pass A:82 B:81 X:100] [meta]
    skill-refiner: 78 > 83 (+5) [G:pass A:80 B:79 X:100] [meta]
  Aggregate: avg X.X | min X.X | max X.X
  Reverted: X changes across Y iterations
  Contested: Z flags escalated to human
  =================================================================
  ```

- Write run history: append this run's metadata to `.refiner-runs.json` in the collection root. Include: run_id, branch, date, primary/secondary harness+model+effort, config, pool size, termination reason, cross-model flag counts, before/after per-skill scores (component breakdown + composite, or clearly labeled estimates if the run used a targeted manual rubric instead of the full automated sweep), and a changes summary. Commit with the phase 3 summary.
- Announce branch: remind user to review and merge when ready
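The run-history append is a simple read-modify-write on a JSON array. A minimal sketch (field names are illustrative; the real `.refiner-runs.json` schema is whatever previous runs wrote):

```python
import json
from pathlib import Path

# Sketch of appending a run record to .refiner-runs.json. Creates the
# file on first run, otherwise appends to the existing array.
def append_run(history_path, run_record):
    path = Path(history_path)
    runs = json.loads(path.read_text()) if path.exists() else []
    runs.append(run_record)
    path.write_text(json.dumps(runs, indent=2))
    return len(runs)  # total runs recorded so far
```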
## AI Self-Check
Before committing any skill modification, verify:
- Lint passes: lint-skills.sh exits 0 for the modified skill
- Spec valid: validate-spec.sh exits 0 for the modified skill
- Score improved: composite score is strictly higher than before the change
- No content regression: change does not remove critical sections, warnings, or cross-references without replacement
- Simplicity maintained: change does not add unnecessary complexity for marginal gains
- Cross-references intact: all skill names in bold still resolve to existing skills
- Target ~500 lines: modified SKILL.md stays near 500 lines. Hard max 600
- ASCII only: no non-ASCII characters introduced (except allowed emoji indicators)
- Immutability respected: no phase-1 modification to evaluation criteria, test cases, lint scripts, skill-creator, or skill-refiner
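Two of the mechanical checks above (the size budget and the ASCII constraint) can be sketched directly; the emoji allow-list is a hypothetical parameter, since the original does not enumerate the permitted indicators:

```python
# Sketch of the size and character checks from the AI Self-Check list:
# SKILL.md should stay near 500 lines with a hard max of 600, and contain
# only ASCII apart from an explicit allow-list of emoji indicators.
def size_ok(text, hard_max=600):
    return len(text.splitlines()) <= hard_max

def ascii_ok(text, allowed=frozenset()):
    return all(ord(c) < 128 or c in allowed for c in text)
```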
## Rules
- Immutability in phase 1: never modify `references/evaluation-criteria.md`, `references/test-cases.md`, lint-skills.sh, validate-spec.sh, skill-creator, or skill-refiner during phase 1. Violation = abort the run.
- Karpathy gate: only directional improvements survive. If a change does not improve the composite score, revert it. No exceptions, no "it looks better."
- Verify flags: never take cross-model flags at face value. Primary reviews every flag independently. Disagreements on major flags go to human.
- Snapshot before meta: always snapshot evaluation criteria before phase 2. Evaluate against the snapshot, never the live version being modified.
- Phase 2 always pauses: even in `--mode auto`. Non-configurable.
- Contested major flags always pause: even in `--mode auto`. Non-configurable.
- Simplicity criterion: all else being equal, simpler is better. Deletions that maintain score are preferred over additions that marginally improve it.
- One commit per iteration: bundle improvements, include score deltas in message.
- Branch isolation: all work on a feature branch. Never modify main directly.
- Read before edit: always read the full skill before proposing changes. Never edit from memory or assumption.
## Related Skills
- **skill-creator** - the evaluation and improvement engine. skill-refiner invokes skill-creator's review mode (Mode 2) for scoring and its improve mode for generating changes. skill-creator handles individual skill quality; skill-refiner handles iteration, prioritization, and orchestration. Primary dependency.
- **full-review** - one-off collection audit across code-review, anti-slop, security-audit, and update-docs. Use full-review for a single pass over application code; use skill-refiner for iterative improvement of skill files.
- **anti-slop** - code quality patterns. skill-refiner may invoke anti-slop principles through skill-creator during improvement, but does not call anti-slop directly. Different domain: anti-slop audits application code, skill-refiner audits skill files.