# skill-refiner

Skill Refiner: Iterative Self-Improvement Loop
Adaptive evaluation loop for AI skill collections, inspired by Karpathy's AutoResearch.
Orchestrates repeated score-improve-verify cycles using skill-creator as the engine
and mandatory peer review as an adversarial check (cross-model when a secondary harness
is available, fresh-context self-review as the minimum fallback).
## When to use
- Batch-improving the entire skill collection after a period of manual edits
- Running quality sweeps before a release or publish
- Triggering a self-improvement cycle where skills bootstrap each other
- After adding several new skills that need polish and consistency alignment
- When cross-model perspective would catch single-model blind spots
- Periodic maintenance: scheduled improvement runs to keep skills current
## When NOT to use
- Single skill review or improvement - use skill-creator (Mode 2)
- Creating a new skill from scratch - use skill-creator (Mode 1)
- One-off collection audit without iteration - use skill-creator (Mode 3)
- Full codebase review (code, not skills) - use full-review
- Style/slop audit on application code - use anti-slop
## Configuration
```
skill-refiner [--iterations N] [--mode MODE] [--secondary HARNESS] [--threshold N] [--plateau N]
```

| Flag | Default | Description |
|---|---|---|
| `--iterations` | 10 | Maximum iterations for phase 1 |
| `--mode` | circuit-breaker | Checkpoint mode: `circuit-breaker`, `auto`, or `step` |
| `--secondary` | auto-detect | Secondary harness for cross-model review |
| `--threshold` | 85 | Focus threshold - skip skills scoring above this (user can override max) |
| `--plateau` | 2 | Minimum score delta to keep iterating |
Environment override: `SKILL_REFINER_SECONDARY=<harness>` (CLI flag takes precedence)
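The precedence rule can be sketched as follows (the helper name is illustrative, not the skill's actual implementation): the CLI flag wins over the `SKILL_REFINER_SECONDARY` environment variable, which wins over auto-detection.

```python
import os

# Illustrative sketch of the override precedence described above:
# CLI flag > SKILL_REFINER_SECONDARY env var > auto-detect default.
def resolve_secondary(cli_flag=None, env=None):
    env = os.environ if env is None else env
    if cli_flag is not None:  # explicit CLI flag takes precedence
        return cli_flag
    return env.get("SKILL_REFINER_SECONDARY", "auto-detect")
```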
## Checkpoint Modes
circuit-breaker (default): runs autonomously, auto-pauses on score regression,
contested major flags, or plateau. Always pauses before phase 2.
auto: fully autonomous through phase 1. Still pauses before phase 2 and on
contested major flags (non-configurable).
step: pauses after every iteration for manual review. Best for first run or learning.
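The three modes above reduce to a small pause policy. A minimal sketch (event names are hypothetical labels, not the skill's real API) that encodes the non-configurable checkpoints first:

```python
# Sketch of the checkpoint policy: phase-2 entry and contested major
# flags always pause, regardless of mode; the rest depends on the mode.
def should_pause(mode, event):
    if event in ("phase2-entry", "contested-major-flag"):
        return True  # non-configurable checkpoints
    if mode == "step":
        return event == "iteration-end"  # pause after every iteration
    if mode == "circuit-breaker":
        return event in ("score-regression", "plateau")
    return False  # mode == "auto": fully autonomous through phase 1
```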
## Workflow
### Phase 0: Setup
1. Create feature branch: `skill-refiner/YYYY-MM-DD-HHMMSS` from current HEAD
2. Load run history: read `.refiner-runs.json` from the collection root (if it exists). Use previous run data for: baseline score comparison (detect regressions from external changes), model/harness change detection (flag if the primary or secondary model changed since last run - new model = new baseline, not a comparable delta), and skip analysis (don't re-attempt improvements that were already tried and reverted in a recent run).
3. Build skill inventory: list all skills, exclude phase-2 targets (skill-creator, skill-refiner) from the improvement pool
4. Detect primary harness: check environment to identify which AI CLI is running this session
5. Probe for secondary harness: run three-step validation (PATH check, config check, smoke test) per `references/harness-detection.md`. Announce result.
6. If no secondary found: always fall back to self-review. Spawn a fresh agent on the current harness with the review prompt template from `references/harness-detection.md`. Label as "same-model fresh-context review" in scoring, weight at 3% instead of 5% (composite becomes gate/40/55/3; renormalize the missing 2% proportionally to AI Self-Check and Behavioral). This catches confirmation bias but shares the primary model's blind spots. Skipping review entirely is not an option - a fresh-context self-review is the minimum bar. If the harness doesn't support subagents, run the review prompt as a separate CLI invocation (`claude -p`, `codex exec`, etc.).
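The fallback reweighting can be made concrete. A minimal sketch, assuming the stated normal weights (AI Self-Check 40, Behavioral 55, Cross-model 5) and the 3% fallback, with the freed 2% split in proportion to the 40/55 ratio:

```python
# Sketch of the fallback reweighting: cross-model review drops from 5%
# to 3%, and the missing 2% is redistributed proportionally between
# AI Self-Check and Behavioral.
def fallback_weights(ai=40.0, behavioral=55.0, cross=5.0, cross_fallback=3.0):
    freed = cross - cross_fallback   # the 2% to redistribute
    total = ai + behavioral          # 95 under the defaults
    return {
        "ai_self_check": ai + freed * ai / total,
        "behavioral": behavioral + freed * behavioral / total,
        "cross_model": cross_fallback,
    }
```

Under the defaults this yields roughly 40.84 / 56.16 / 3, which still sums to 100.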
### Phase 1: Regular Iterations
7. Iteration 1 - full sweep: score every skill in the pool using the four-component model from `references/evaluation-criteria.md`:
   - Structural: run lint-skills.sh + validate-spec.sh
   - AI Self-Check: invoke skill-creator review mode on each skill
   - Behavioral: run test prompts from `references/test-cases.md`. For skills without pre-written test cases, auto-generate 2-3 test prompts from the skill's "When to use" section and quality signals from its AI Self-Check. Log a warning that generated tests are lower quality than hand-written ones. Optionally save generated tests to a test-cases-local.md file alongside test-cases.md so they accumulate across runs.
   - Cross-model: skip on first iteration (no diff to review yet)
8. Log baseline scores: record per-skill and aggregate scores
9. Iteration 2+ - enter adaptive focus mode. Select targets: identify skills scoring below the focus threshold
10. For each targeted skill, run the improvement cycle:
    a. Read current SKILL.md and all reference files
    b. Invoke skill-creator review mode - collect findings
    c. Run behavioral test - score current output quality
    d. Propose targeted improvements based on findings (not random changes)
    e. Apply changes to SKILL.md (and references if needed)
    f. Re-score: run lint + AI Self-Check + behavioral test
    g. Karpathy gate: if score improved, keep. If not, revert. No exceptions.
    h. If cross-model review available, send the diff to secondary harness
    i. Process flags per the verification protocol in `references/harness-detection.md`
    j. If secondary flags major issue and primary agrees: revert
    k. If secondary flags major issue and primary disagrees: escalate to circuit breaker
11. Commit iteration: one commit with all improvements from this iteration.
    Format: `refactor(skill-refiner): iteration N - skill1(+X), skill2(+Y)`
12. Log iteration summary:

    ```
    --- iteration N / max -------------------------------------------
    improved: skill1 (72 > 80 | G:pass A:76 B:78 X:90), skill2 (68 > 73 | G:pass A:70 B:72 X:100)
    gated: skillZ (lint/spec failed - excluded from scoring)
    skipped: M skills above threshold
    reverted: skill3 (proposed change scored -2, rolled back | G:pass A:74 B:69 X:100)
    contested: skill4 (secondary flagged major, primary disagreed)
    plateau: yes/no (max delta: +X)
    -----------------------------------------------------------------
    ```

13. Check termination conditions (phase 1 always flows into phase 2 on termination, except on circuit-breaker pauses which wait for user input first):
    - Plateau detected (max delta < plateau threshold)? Terminate phase 1.
    - All skills above focus threshold? Bump threshold by 5 and continue. If threshold is already at max (95) and all skills still clear it, terminate phase 1.
    - Iteration cap reached? Terminate phase 1.
    - Circuit breaker triggered? Pause for user input.
14. Repeat from step 9 until terminated
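The termination checks above can be sketched as a single decision function (an illustrative sketch, assuming the default plateau delta of 2, the bump of 5, and the threshold cap of 95 - not the skill's actual code):

```python
# Sketch of the phase-1 termination logic. Returns an (action, threshold)
# pair: "terminate" ends phase 1, "continue" may carry a bumped threshold.
def check_termination(scores, threshold, max_delta, iteration, max_iterations,
                      plateau=2, threshold_cap=95):
    if max_delta < plateau:
        return "terminate", threshold          # plateau detected
    if all(s > threshold for s in scores):
        if threshold >= threshold_cap:
            return "terminate", threshold      # everything clears the cap
        return "continue", min(threshold + 5, threshold_cap)  # bump and go on
    if iteration >= max_iterations:
        return "terminate", threshold          # iteration cap reached
    return "continue", threshold
```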
### Phase 2: Meta-Improvement
- Announce: "Entering phase 2 - meta-improvement. This always requires human review."
- Snapshot evaluation criteria:
  - Copy skill-creator's AI Self-Check section to a temp location
  - Copy `references/evaluation-criteria.md` to a temp location
  - Copy skill-creator's `conventions.md` reference to a temp location

  These snapshots are the evaluation baseline for phase 2.
- Improve skill-creator: run the improvement cycle (steps 10a-10k) using the snapshot as the evaluation criteria, not skill-creator's live version
- Improve skill-refiner: same process, using the snapshot
- Improve lint scripts (lint-skills.sh, validate-spec.sh):
  - Capture baseline: run both scripts, save full output
  - Propose improvements
  - Apply changes
  - Run regression: compare output to baseline
  - If false positives or false negatives introduced: revert
  - If clean: keep
- Commit phase 2: one commit per target.
  Format: `refactor(skill-refiner): meta - improve <target> (+N)`
- Pause for human review: display phase 2 changes, wait for approval. This checkpoint is non-configurable - it fires even in `--mode auto`. A direct user approval such as "continue" or "proceed" counts as approval to resume.
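The lint-script regression check amounts to comparing finding sets before and after the change. A strict sketch (illustrative; the real check may tolerate intended diffs that a human approves):

```python
# Sketch of the regression gate for lint-script changes: any finding that
# newly appears (potential false positive) or vanishes (potential false
# negative) relative to the baseline output triggers a revert.
def regression_verdict(baseline_lines, new_lines):
    baseline, new = set(baseline_lines), set(new_lines)
    false_positives = new - baseline   # findings introduced by the change
    false_negatives = baseline - new   # findings the change no longer reports
    return "keep" if not false_positives and not false_negatives else "revert"
```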
### Phase 3: Summary
- Final report:

  ```
  === skill-refiner run complete ===================================
  Branch: skill-refiner/YYYY-MM-DD-HHMMSS
  Primary: <harness> <version> (<model>, effort: <level>)
  Secondary: <harness> <version> (<model>, effort: <level>) | none
  Pool: N skills (skill-creator, skill-refiner excluded)
  Config: iterations=M, threshold=T, mode=MODE, plateau=P
  Iterations: N (of max M)
  Terminated: plateau / threshold / cap / user
  Score changes:
    skill1: 62 > 88 (+26) [G:pass A:84 B:86 X:90]
    skill2: 71 > 85 (+14) [G:pass A:82 B:79 X:100]
    ...
    skill-creator: 80 > 84 (+4) [G:pass A:82 B:81 X:100] [meta]
    skill-refiner: 78 > 83 (+5) [G:pass A:80 B:79 X:100] [meta]
  Aggregate: avg X.X | min X.X | max X.X
  Reverted: X changes across Y iterations
  Contested: Z flags escalated to human
  =================================================================
  ```

- Write run history: append this run's metadata to `.refiner-runs.json` in the collection root. Include: run_id, branch, date, primary/secondary harness+model+effort, config, pool size, termination reason, cross-model flag counts, before/after per-skill scores (component breakdown + composite, or clearly labeled estimates if the run used a targeted manual rubric instead of the full automated sweep), and a changes summary. Commit with the phase 3 summary.
- Announce branch: remind user to review and merge when ready
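The run-history append is a simple read-modify-write on a JSON array. A minimal sketch (field names are illustrative; the real `.refiner-runs.json` schema is whatever previous runs wrote):

```python
import json
from pathlib import Path

# Sketch of appending a run record to .refiner-runs.json. Creates the
# file on first run, otherwise appends to the existing array.
def append_run(history_path, run_record):
    path = Path(history_path)
    runs = json.loads(path.read_text()) if path.exists() else []
    runs.append(run_record)
    path.write_text(json.dumps(runs, indent=2))
    return len(runs)  # total runs recorded so far
```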
## AI Self-Check
Before committing any skill modification, verify:
- Lint passes: lint-skills.sh exits 0 for the modified skill
- Spec valid: validate-spec.sh exits 0 for the modified skill
- Score improved: composite score is strictly higher than before the change
- No content regression: change does not remove critical sections, warnings, or cross-references without replacement
- Simplicity maintained: change does not add unnecessary complexity for marginal gains
- Cross-references intact: all skill names in bold still resolve to existing skills
- Target ~500 lines: modified SKILL.md stays near 500 lines. Hard max 600
- ASCII only: no non-ASCII characters introduced (except allowed emoji indicators)
- Immutability respected: no phase-1 modification to evaluation criteria, test cases, lint scripts, skill-creator, or skill-refiner
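Two of the mechanical checks above (the size budget and the ASCII constraint) can be sketched directly; the emoji allow-list is a hypothetical parameter, since the original does not enumerate the permitted indicators:

```python
# Sketch of the size and character checks from the AI Self-Check list:
# SKILL.md should stay near 500 lines with a hard max of 600, and contain
# only ASCII apart from an explicit allow-list of emoji indicators.
def size_ok(text, hard_max=600):
    return len(text.splitlines()) <= hard_max

def ascii_ok(text, allowed=frozenset()):
    return all(ord(c) < 128 or c in allowed for c in text)
```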
## Rules
- Immutability in phase 1: never modify `references/evaluation-criteria.md`, `references/test-cases.md`, lint-skills.sh, validate-spec.sh, skill-creator, or skill-refiner during phase 1. Violation = abort the run.
- Karpathy gate: only directional improvements survive. If a change does not improve the composite score, revert it. No exceptions, no "it looks better."
- Verify flags: never take cross-model flags at face value. Primary reviews every flag independently. Disagreements on major flags go to human.
- Snapshot before meta: always snapshot evaluation criteria before phase 2. Evaluate against the snapshot, never the live version being modified.
- Phase 2 always pauses: even in `--mode auto`. Non-configurable.
- Contested major flags always pause: even in `--mode auto`. Non-configurable.
- Simplicity criterion: all else being equal, simpler is better. Deletions that maintain score are preferred over additions that marginally improve it.
- One commit per iteration: bundle improvements, include score deltas in message.
- Branch isolation: all work on a feature branch. Never modify main directly.
- Read before edit: always read the full skill before proposing changes. Never edit from memory or assumption.
## Related Skills
- **skill-creator** - the evaluation and improvement engine. skill-refiner invokes skill-creator's review mode (Mode 2) for scoring and its improve mode for generating changes. skill-creator handles individual skill quality; skill-refiner handles iteration, prioritization, and orchestration. Primary dependency.
- **full-review** - one-off collection audit across code-review, anti-slop, security-audit, and update-docs. Use full-review for a single pass over application code; use skill-refiner for iterative improvement of skill files.
- **anti-slop** - code quality patterns. skill-refiner may invoke anti-slop principles through skill-creator during improvement, but does not call anti-slop directly. Different domain: anti-slop audits application code, skill-refiner audits skill files.