skill-refiner

Skill Refiner: Iterative Self-Improvement Loop

Adaptive evaluation loop for AI skill collections, inspired by Karpathy's AutoResearch. Orchestrates repeated score-improve-verify cycles using skill-creator as the engine and mandatory peer review as an adversarial check (cross-model when a secondary harness is available, fresh-context self-review as the minimum fallback).

When to use

  • Batch-improving the entire skill collection after a period of manual edits
  • Running quality sweeps before a release or publish
  • Triggering a self-improvement cycle where skills bootstrap each other
  • After adding several new skills that need polish and consistency alignment
  • When cross-model perspective would catch single-model blind spots
  • Periodic maintenance: scheduled improvement runs to keep skills current

When NOT to use

  • Single skill review or improvement - use skill-creator (Mode 2)
  • Creating a new skill from scratch - use skill-creator (Mode 1)
  • One-off collection audit without iteration - use skill-creator (Mode 3)
  • Full codebase review (code, not skills) - use full-review
  • Style/slop audit on application code - use anti-slop

Configuration

skill-refiner [--iterations N] [--mode MODE] [--secondary HARNESS] [--threshold N] [--plateau N]

Flag          Default          Description
--iterations  10               Maximum iterations for phase 1
--mode        circuit-breaker  auto, circuit-breaker, or step
--secondary   auto-detect      Secondary harness for cross-model review, or none
--threshold   85               Focus threshold - skip skills scoring above this (user can override max)
--plateau     2                Minimum score delta to keep iterating

Environment override: SKILL_REFINER_SECONDARY=<harness> (CLI flag takes precedence)
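The precedence rule between the --secondary flag and the SKILL_REFINER_SECONDARY environment override can be sketched as follows. This is an illustrative sketch, not the tool's implementation; the function name and the env dict parameter are assumptions.

```python
def resolve_secondary(cli_flag, env):
    """Resolve the secondary harness: the --secondary CLI flag wins over
    the SKILL_REFINER_SECONDARY env override, which wins over auto-detect."""
    if cli_flag is not None:  # --secondary given explicitly on the CLI
        return cli_flag
    return env.get("SKILL_REFINER_SECONDARY", "auto-detect")
```

Passing "none" through either channel disables cross-model review, falling back to the fresh-context self-review described in Phase 0.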

Checkpoint Modes

circuit-breaker (default): runs autonomously, auto-pauses on score regression, contested major flags, or plateau. Always pauses before phase 2.
auto: fully autonomous through phase 1. Still pauses before phase 2 and on contested major flags (non-configurable).
step: pauses after every iteration for manual review. Best for first run or learning.
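The three modes can be read as a pause predicate over run events. A minimal sketch, with event names chosen for illustration (they are not identifiers from the tool):

```python
def should_pause(mode, event):
    """Decide whether the run pauses on a given event under each checkpoint mode."""
    # Non-configurable pauses fire in every mode, including auto.
    if event in ("phase2_start", "contested_major_flag"):
        return True
    if mode == "step":  # pause after every iteration for manual review
        return event == "iteration_end"
    if mode == "circuit-breaker":  # auto-pause only on anomalies
        return event in ("score_regression", "plateau")
    return False  # auto: fully autonomous through phase 1
```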

Workflow

Phase 0: Setup

  1. Create feature branch: skill-refiner/YYYY-MM-DD-HHMMSS from current HEAD
  2. Load run history: read .refiner-runs.json from the collection root (if it exists). Use previous run data for: baseline score comparison (detect regressions from external changes), model/harness change detection (flag if the primary or secondary model changed since last run - new model = new baseline, not a comparable delta), and skip analysis (don't re-attempt improvements that were already tried and reverted in a recent run).
  3. Build skill inventory: list all skills, exclude phase-2 targets (skill-creator, skill-refiner) from the improvement pool
  4. Detect primary harness: check environment to identify which AI CLI is running this session
  5. Probe for secondary harness: run three-step validation (PATH check, config check, smoke test) per references/harness-detection.md. Announce result.
  6. If no secondary found: always fall back to self-review. Spawn a fresh agent on the current harness with the review prompt template from references/harness-detection.md. Label as "same-model fresh-context review" in scoring, weight at 3% instead of 5% (composite becomes gate/40/55/3, renormalize the missing 2% proportionally to AI Self-Check and Behavioral). This catches confirmation bias but shares the primary model's blind spots. Skipping review entirely is not an option - a fresh-context self-review is the minimum bar. If the harness doesn't support subagents, run the review prompt as a separate CLI invocation (claude -p, codex exec, etc.).

Phase 1: Regular Iterations

  1. Iteration 1 - full sweep: score every skill in the pool using the four-component model from references/evaluation-criteria.md
    • Structural: run lint-skills.sh + validate-spec.sh
    • AI Self-Check: invoke skill-creator review mode on each skill
    • Behavioral: run test prompts from references/test-cases.md. For skills without pre-written test cases, auto-generate 2-3 test prompts from the skill's "When to use" section and quality signals from its AI Self-Check. Log a warning that generated tests are lower quality than hand-written ones. Optionally save generated tests to a test-cases-local.md file alongside test-cases.md so they accumulate across runs.
    • Cross-model: skip on first iteration (no diff to review yet)
  2. Log baseline scores: record per-skill and aggregate scores
  3. Iteration 2+: enter adaptive focus mode
  4. Select targets: identify skills scoring below the focus threshold
  5. For each targeted skill, run the improvement cycle:
    a. Read current SKILL.md and all reference files
    b. Invoke skill-creator review mode - collect findings
    c. Run behavioral test - score current output quality
    d. Propose targeted improvements based on findings (not random changes)
    e. Apply changes to SKILL.md (and references if needed)
    f. Re-score: run lint + AI Self-Check + behavioral test
    g. Karpathy gate: if score improved, keep. If not, revert. No exceptions.
    h. If cross-model review available, send the diff to secondary harness
    i. Process flags per the references/harness-detection.md verification protocol
    j. If secondary flags major issue and primary agrees: revert
    k. If secondary flags major issue and primary disagrees: escalate to circuit breaker
  6. Commit iteration: one commit with all improvements from this iteration. Format: refactor(skill-refiner): iteration N - skill1(+X), skill2(+Y)
  7. Log iteration summary:
    --- iteration N / max -------------------------------------------
    improved:  skill1 (72 > 80 | G:pass A:76 B:78 X:90), skill2 (68 > 73 | G:pass A:70 B:72 X:100)
    gated:     skillZ (lint/spec failed - excluded from scoring)
    skipped:   M skills above threshold
    reverted:  skill3 (proposed change scored -2, rolled back | G:pass A:74 B:69 X:100)
    contested: skill4 (secondary flagged major, primary disagreed)
    plateau:   yes/no (max delta: +X)
    -----------------------------------------------------------------
  8. Check termination conditions (phase 1 always flows into phase 2 on termination, except on circuit-breaker pauses, which wait for user input first):
    • Plateau detected (max delta < plateau threshold)? Terminate phase 1.
    • All skills above focus threshold? Bump threshold by 5 and continue. If threshold is already at max (95) and all skills still clear it, terminate phase 1.
    • Iteration cap reached? Terminate phase 1.
    • Circuit breaker triggered? Pause for user input.
  9. Repeat from step 3 until terminated
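The keep-or-revert gate and the termination check above reduce to a pair of small decisions. A minimal sketch, assuming scores are numeric composites and deltas are the per-skill score changes from the current iteration (function names are illustrative):

```python
def karpathy_gate(score_before, score_after):
    """Improvement cycle step g: keep a change only if the composite
    score strictly improved; an equal score is a revert, not a keep."""
    return "keep" if score_after > score_before else "revert"

def check_termination(deltas, scores, threshold, iteration, max_iterations, plateau=2):
    """Evaluate the phase 1 termination conditions after an iteration."""
    if deltas and max(deltas) < plateau:
        return "plateau"  # terminate phase 1
    if scores and all(s > threshold for s in scores):
        # bump threshold by 5 and continue, unless already at the max of 95
        return "threshold" if threshold >= 95 else "bump"
    if iteration >= max_iterations:
        return "cap"      # terminate phase 1
    return "continue"
```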

Phase 2: Meta-Improvement

  1. Announce: "Entering phase 2 - meta-improvement. This always requires human review."
  2. Snapshot evaluation criteria:
    • Copy skill-creator's AI Self-Check section to a temp location
    • Copy references/evaluation-criteria.md to a temp location
    • Copy skill-creator's conventions.md reference to a temp location
    These snapshots are the evaluation baseline for phase 2.
  3. Improve skill-creator: run the improvement cycle (phase 1, steps 5a-5k) using the snapshot as the evaluation criteria, not skill-creator's live version
  4. Improve skill-refiner: same process, using the snapshot
  5. Improve lint scripts (lint-skills.sh, validate-spec.sh):
    • Capture baseline: run both scripts, save full output
    • Propose improvements
    • Apply changes
    • Run regression: compare output to baseline
    • If false positives or false negatives introduced: revert
    • If clean: keep
  6. Commit phase 2: one commit per target. Format: refactor(skill-refiner): meta - improve <target> (+N)
  7. Pause for human review: display phase 2 changes, wait for approval. This checkpoint is non-configurable - it fires even in --mode auto. A direct user approval such as "continue" or "proceed" counts as approval to resume.
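The regression check in step 5 amounts to comparing the lint scripts' full output before and after the change: any new line is a potential false positive, any vanished line a potential false negative. A simplified sketch using set differences (a real run would capture the scripts' stdout, and this ignores duplicate and reordered lines):

```python
def lint_regression(baseline, current):
    """Compare lint output line-by-line against the captured baseline."""
    before = set(baseline.splitlines())
    after = set(current.splitlines())
    return {
        "new_findings": sorted(after - before),   # possible false positives
        "lost_findings": sorted(before - after),  # possible false negatives
        "clean": before == after,                 # keep the change only if clean
    }
```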

Phase 3: Summary

  1. Final report:
    === skill-refiner run complete ===================================
    Branch:     skill-refiner/YYYY-MM-DD-HHMMSS
    Primary:    <harness> <version> (<model>, effort: <level>)
    Secondary:  <harness> <version> (<model>, effort: <level>) | none
    Pool:       N skills (skill-creator, skill-refiner excluded)
    Config:     iterations=M, threshold=T, mode=MODE, plateau=P
    
    Iterations: N (of max M)
    Terminated: plateau / threshold / cap / user
    
    Score changes:
      skill1:  62 > 88 (+26)  [G:pass A:84 B:86 X:90]
      skill2:  71 > 85 (+14)  [G:pass A:82 B:79 X:100]
      ...
      skill-creator: 80 > 84 (+4)  [G:pass A:82 B:81 X:100] [meta]
      skill-refiner: 78 > 83 (+5)  [G:pass A:80 B:79 X:100] [meta]
    
    Aggregate:  avg X.X | min X.X | max X.X
    Reverted:   X changes across Y iterations
    Contested:  Z flags escalated to human
    =================================================================
  2. Write run history: append this run's metadata to .refiner-runs.json in the collection root. Include: run_id, branch, date, primary/secondary harness+model+effort, config, pool size, termination reason, cross-model flag counts, before/after per-skill scores (component breakdown + composite, or clearly labeled estimates if the run used a targeted manual rubric instead of the full automated sweep), and a changes summary. Commit with the phase 3 summary.
  3. Announce branch: remind user to review and merge when ready
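Appending to .refiner-runs.json can be sketched as a read-modify-write on a JSON array, creating the file on first use. The helper name and any fields beyond those listed above are illustrative, not part of the tool:

```python
import json
from pathlib import Path

def append_run(history_path, run):
    """Append one run's metadata to .refiner-runs.json, creating it if absent."""
    history_path = Path(history_path)
    runs = json.loads(history_path.read_text()) if history_path.exists() else []
    runs.append(run)  # the file holds a flat array of run records
    history_path.write_text(json.dumps(runs, indent=2) + "\n")
```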

AI Self-Check

Before committing any skill modification, verify:
  • Lint passes: lint-skills.sh exits 0 for the modified skill
  • Spec valid: validate-spec.sh exits 0 for the modified skill
  • Score improved: composite score is strictly higher than before the change
  • No content regression: change does not remove critical sections, warnings, or cross-references without replacement
  • Simplicity maintained: change does not add unnecessary complexity for marginal gains
  • Cross-references intact: all skill names in bold still resolve to existing skills
  • Target ~500 lines: modified SKILL.md stays near 500 lines. Hard max 600
  • ASCII only: no non-ASCII characters introduced (except allowed emoji indicators)
  • Immutability respected: no phase-1 modification to evaluation criteria, test cases, lint scripts, skill-creator, or skill-refiner

Rules

  1. Immutability in phase 1: never modify references/evaluation-criteria.md, references/test-cases.md, lint-skills.sh, validate-spec.sh, skill-creator, or skill-refiner during phase 1. Violation = abort the run.
  2. Karpathy gate: only directional improvements survive. If a change does not improve the composite score, revert it. No exceptions, no "it looks better."
  3. Verify flags: never take cross-model flags at face value. Primary reviews every flag independently. Disagreements on major flags go to human.
  4. Snapshot before meta: always snapshot evaluation criteria before phase 2. Evaluate against the snapshot, never the live version being modified.
  5. Phase 2 always pauses: even in --mode auto. Non-configurable.
  6. Contested major flags always pause: even in --mode auto. Non-configurable.
  7. Simplicity criterion: all else being equal, simpler is better. Deletions that maintain score are preferred over additions that marginally improve it.
  8. One commit per iteration: bundle improvements, include score deltas in message.
  9. Branch isolation: all work on a feature branch. Never modify main directly.
  10. Read before edit: always read the full skill before proposing changes. Never edit from memory or assumption.

Related Skills

  • skill-creator - the evaluation and improvement engine. skill-refiner invokes skill-creator's review mode (Mode 2) for scoring and its improve mode for generating changes. skill-creator handles individual skill quality; skill-refiner handles iteration, prioritization, and orchestration. Primary dependency.
  • full-review - one-off collection audit across code-review, anti-slop, security-audit, and update-docs. Use full-review for a single pass over application code; use skill-refiner for iterative improvement of skill files.
  • anti-slop - code quality patterns. skill-refiner may invoke anti-slop principles through skill-creator during improvement, but does not call anti-slop directly. Different domain: anti-slop audits application code, skill-refiner audits skill files.