autoresearch


Autoresearch for Skills

Most skills work about 70% of the time. The other 30% you get garbage. The fix isn't to rewrite the skill from scratch. It's to let an agent run it dozens of times, score every output, and tighten the prompt until that 30% disappears.
This skill adapts Andrej Karpathy's autoresearch methodology (autonomous experimentation loops) to Claude Code skills. Instead of optimizing ML training code, we optimize skill prompts.

the core job

Take any existing skill, define what "good output" looks like as binary yes/no checks, then run an autonomous loop that:
  1. Generates outputs from the skill using test inputs
  2. Scores every output against the eval criteria
  3. Mutates the skill prompt to fix failures
  4. Keeps mutations that improve the score, discards the rest
  5. Repeats until the score ceiling is hit or the user stops it
Output: An improved SKILL.md + a results.tsv score log + a changelog.md of every mutation attempted + a live HTML dashboard you can watch in your browser.

before starting: gather context

STOP. Do not run any experiments until all fields below are confirmed with the user. Ask for any missing fields before proceeding.
  1. Target skill — Which skill do you want to optimize? (need the exact path to SKILL.md)
  2. Test inputs — What 3-5 different prompts/scenarios should we test the skill with? (variety matters — pick inputs that cover different use cases so we don't overfit to one scenario)
  3. Eval criteria — What 3-6 binary yes/no checks define a good output? (these are your "test questions" — see references/eval-guide.md for how to write good evals)
  4. Runs per experiment — How many times should we run the skill per mutation? Default: 5. (more runs = more reliable scores, but slower and more expensive. 5 is the sweet spot for most skills.)
  5. Run interval — How often should experiments cycle? Default: every 2 minutes. (shorter = faster iteration, but costs more)
  6. Budget cap — Optional. Max number of experiment cycles before stopping. Default: no cap (runs until you stop it).

step 1: read the skill

Before changing anything, read and understand the target skill completely.
  1. Read the full SKILL.md file
  2. Read any files in references/ that the skill links to
  3. Identify the skill's core job, process steps, and output format
  4. Note any existing quality checks or anti-patterns already in the skill
Do NOT skip this. You need to understand what the skill does before you can improve it.

step 2: build the eval suite

Convert the user's eval criteria into a structured test. Every check must be binary — pass or fail, no scales.
Format each eval as:
EVAL [number]: [Short name]
Question: [Yes/no question about the output]
Pass condition: [What "yes" looks like — be specific]
Fail condition: [What triggers a "no"]
Rules for good evals:
  • Binary only. Yes or no. No "rate 1-7" scales. Scales compound variability and give unreliable results.
  • Specific enough to be consistent. "Is the text readable?" is too vague. "Are all words spelled correctly with no truncated sentences?" is testable.
  • Not so narrow that the skill games the eval. "Contains fewer than 200 words" will make the skill optimize for brevity at the expense of everything else.
  • 3-6 evals is the sweet spot. More than that and the skill starts parroting eval criteria back instead of actually improving.
See references/eval-guide.md for detailed examples of good vs bad evals.
Max score calculation:
max_score = [number of evals] × [runs per experiment]
Example: 4 evals × 5 runs = max score of 20.
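The score arithmetic can be sketched in a few lines (a minimal sketch; the eval names and per-run pass/fail values below are hypothetical):

```python
# Score one experiment: every run is checked against every binary eval.
evals = ["Text legibility", "Pastel colors", "Linear layout", "No numbering"]
runs_per_experiment = 5

max_score = len(evals) * runs_per_experiment  # 4 evals x 5 runs = 20

# One boolean per (run, eval): True = pass, False = fail.
# Here every run passes 3 of the 4 evals.
results = [[True, True, False, True] for _ in range(runs_per_experiment)]

score = sum(passed for run in results for passed in run)
pass_rate = 100.0 * score / max_score
print(f"{score}/{max_score} ({pass_rate:.1f}%)")  # → 15/20 (75.0%)
```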

step 3: generate the live dashboard

Before running any experiments, create a live HTML dashboard at autoresearch-[skill-name]/dashboard.html and open it in the browser.
The dashboard must:
  • Auto-refresh every 10 seconds (reads from results.tsv)
  • Show a score progression line chart (experiment number on X axis, pass rate % on Y axis)
  • Show a colored bar for each experiment: green = keep, red = discard, blue = baseline
  • Show a table of all experiments with: experiment #, score, pass rate, status, description
  • Show per-eval breakdown: which evals pass most/least across all runs
  • Show current status: "Running experiment [N]..." or "Idle"
  • Use clean styling with soft colors (white background, pastel accents, clean sans-serif font)
Generate the dashboard as a single self-contained HTML file with inline CSS and JavaScript. Use Chart.js loaded from CDN for the line chart. The JS should fetch results.json (which you update after each experiment alongside results.tsv) and re-render.
Open it immediately after creating it: open dashboard.html (macOS) so the user can see it in their browser.
Update results.json after every experiment so the dashboard stays current. The JSON format:

```json
{
  "skill_name": "[name]",
  "status": "running",
  "current_experiment": 3,
  "baseline_score": 70.0,
  "best_score": 90.0,
  "experiments": [
    {
      "id": 0,
      "score": 14,
      "max_score": 20,
      "pass_rate": 70.0,
      "status": "baseline",
      "description": "original skill — no changes"
    }
  ],
  "eval_breakdown": [
    { "name": "Text legibility", "pass_count": 8, "total": 10 },
    { "name": "Pastel colors", "pass_count": 9, "total": 10 }
  ]
}
```
When the run finishes (user stops it or ceiling hit), update status to "complete" so the dashboard shows a "Done" state with final summary.
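A minimal Python sketch of the results.json update (the write-then-rename step is an assumption, added so the auto-refreshing dashboard never reads a half-written file; the sample values follow the format above):

```python
import json
import os
import tempfile

def update_results(path, data):
    """Rewrite results.json atomically: dump to a temp file in the
    same directory, then rename over the target in one step."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(data, f, indent=2)
    os.replace(tmp, path)  # atomic rename on POSIX and Windows

data = {
    "skill_name": "diagram-generator",
    "status": "running",
    "current_experiment": 3,
    "baseline_score": 70.0,
    "best_score": 90.0,
    "experiments": [],
    "eval_breakdown": [],
}
update_results("results.json", data)

# When the loop finishes, flip the status so the dashboard shows "Done":
data["status"] = "complete"
update_results("results.json", data)
```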

step 4: establish baseline

Run the skill AS-IS before changing anything. This is experiment #0.
  1. Create a working directory: autoresearch-[skill-name]/ inside the skill's folder
  2. Create results.tsv with the header row
  3. Create results.json and dashboard.html, then open the dashboard in the browser
  4. Back up the original SKILL.md as SKILL.md.baseline
  5. Run the skill [N] times using the test inputs
  6. Score every output against every eval
  7. Record the baseline score and update both results.tsv and results.json
results.tsv format (tab-separated):
experiment	score	max_score	pass_rate	status	description
0	14	20	70.0%	baseline	original skill — no changes
IMPORTANT: After establishing baseline, confirm the score with the user before proceeding. If baseline is already 90%+, the skill may not need optimization — ask the user if they want to continue.
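Appending rows to results.tsv might look like this in Python (a minimal sketch; the helper name is illustrative):

```python
import os

HEADER = "experiment\tscore\tmax_score\tpass_rate\tstatus\tdescription"

def log_row(path, exp, score, max_score, status, description):
    """Append one tab-separated experiment row to results.tsv,
    writing the header row on first use."""
    pass_rate = 100.0 * score / max_score
    row = f"{exp}\t{score}\t{max_score}\t{pass_rate:.1f}%\t{status}\t{description}"
    write_header = not os.path.exists(path)
    with open(path, "a") as f:
        if write_header:
            f.write(HEADER + "\n")
        f.write(row + "\n")

# The baseline row from the example above:
log_row("results.tsv", 0, 14, 20, "baseline", "original skill — no changes")
```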

step 5: run the experiment loop

This is the core autoresearch loop. Once started, run autonomously until stopped.
LOOP:
  1. Analyze failures. Look at which evals are failing most. Read the actual outputs that failed. Identify the pattern — is it a formatting issue? A missing instruction? An ambiguous directive?
  2. Form a hypothesis. Pick ONE thing to change. Don't change 5 things at once — you won't know what helped.
    Good mutations:
    • Add a specific instruction that addresses the most common failure
    • Reword an ambiguous instruction to be more explicit
    • Add an anti-pattern ("Do NOT do X") for a recurring mistake
    • Move a buried instruction higher in the skill (priority = position)
    • Add or improve an example that shows the correct behavior
    • Remove an instruction that's causing the skill to over-optimize for one thing at the expense of others
    Bad mutations:
    • Rewriting the entire skill from scratch
    • Adding 10 new rules at once
    • Making the skill longer without a specific reason
    • Adding vague instructions like "make it better" or "be more creative"
  3. Make the change. Edit SKILL.md with ONE targeted mutation.
  4. Run the experiment. Execute the skill [N] times with the same test inputs.
  5. Score it. Run every output through every eval. Calculate total score.
  6. Decide: keep or discard.
    • Score improved → KEEP. Log it. This is the new baseline.
    • Score stayed the same → DISCARD. Revert SKILL.md to previous version. The change added complexity without improvement.
    • Score got worse → DISCARD. Revert SKILL.md to previous version.
  7. Log the result in results.tsv.
  8. Repeat. Go back to step 1 of the loop.
NEVER STOP. Once the loop starts, do not pause to ask the user if you should continue. They may be away from the computer. Run autonomously until:
  • The user manually stops you
  • You hit the budget cap (if one was set)
  • You hit 95%+ pass rate for 3 consecutive experiments (diminishing returns)
If you run out of ideas: Re-read the failing outputs. Try combining two previous near-miss mutations. Try a completely different approach to the same problem. Try removing things instead of adding them. Simplification that maintains the score is a win.
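The keep/discard rule and the stopping condition above can be sketched as (function names are illustrative):

```python
def decide(new_score, best_score):
    """Step 6 rule: only a strict improvement is kept. Ties are
    discarded, so complexity never grows without a score gain."""
    return "keep" if new_score > best_score else "discard"

def hit_ceiling(pass_rates, threshold=95.0, streak=3):
    """Stopping rule: the last `streak` experiments all cleared the
    threshold, so further mutations have diminishing returns."""
    return len(pass_rates) >= streak and all(
        rate >= threshold for rate in pass_rates[-streak:]
    )
```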

step 6: write the changelog

After each experiment (whether kept or discarded), append to changelog.md:

```markdown
## Experiment [N] — [keep/discard]

Score: [X]/[max] ([percent]%)
Change: [One sentence describing what was changed]
Reasoning: [Why this change was expected to help]
Result: [What actually happened — which evals improved/declined]
Failing outputs: [Brief description of what still fails, if anything]

---
```

This changelog is the most valuable artifact. It's a research log that any future agent (or smarter future model) can pick up and continue from.
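Appending an entry might look like this (a sketch; the sample values are illustrative, loosely following the diagram-generator example later in this document):

```python
# Hypothetical changelog entry, filled from one experiment's results.
ENTRY = """## Experiment {n} — {verdict}

Score: {score}/{max_score} ({rate:.1f}%)
Change: {change}
Reasoning: {reasoning}
Result: {result}
Failing outputs: {failing}

---
"""

with open("changelog.md", "a") as f:
    f.write(ENTRY.format(
        n=1, verdict="keep", score=16, max_score=20, rate=80.0,
        change="added explicit instruction to avoid numbering in diagrams",
        reasoning="numbering was the most common baseline failure",
        result="numbering failures dropped from 3 to 1; other evals held",
        failing="one complex diagram still numbered its sub-steps",
    ))
```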

step 7: deliver results

When the user returns or the loop stops, present:
  1. Score summary: Baseline score → Final score (percent improvement)
  2. Total experiments run: How many mutations were tried
  3. Keep rate: How many mutations were kept vs discarded
  4. Top 3 changes that helped most (from the changelog)
  5. Remaining failure patterns (what the skill still gets wrong, if anything)
  6. The improved SKILL.md (already saved in place)
  7. Location of results.tsv and changelog.md for reference

output format

The skill produces five files in autoresearch-[skill-name]/:
autoresearch-[skill-name]/
├── dashboard.html       # live browser dashboard (auto-refreshes)
├── results.json         # data file powering the dashboard
├── results.tsv          # score log for every experiment
├── changelog.md         # detailed mutation log
└── SKILL.md.baseline    # original skill before optimization
Plus the improved SKILL.md saved back to its original location.
results.tsv example:
experiment	score	max_score	pass_rate	status	description
0	14	20	70.0%	baseline	original skill — no changes
1	16	20	80.0%	keep	added explicit instruction to avoid numbering in diagrams
2	16	20	80.0%	discard	tried enforcing left-to-right layout — no improvement
3	18	20	90.0%	keep	added color palette hex codes instead of vague "pastel" description
4	18	20	90.0%	discard	added anti-pattern for neon colors — no improvement
5	19	20	95.0%	keep	added worked example showing correct label formatting

example: optimizing a diagram-generator skill

Context gathered:
  • Target skill: ~/.claude/skills/diagram-generator/SKILL.md
  • Test inputs: "OAuth flow diagram", "CI/CD pipeline", "microservices architecture", "user onboarding funnel", "database schema relationships"
  • Evals: (1) All text legible and spelled correctly? (2) Uses only pastel/soft colors? (3) Linear layout — left-to-right or top-to-bottom? (4) Free of numbers, ordinals, and ordering?
  • Runs per experiment: 10
  • Max score: 40
Baseline run (experiment 0): Generated 10 diagrams. Scored each against 4 evals. Result: 32/40 (80%). Common failures: 3 diagrams had numbered steps, 2 had bright red elements, 3 had illegible small text.
Experiment 1 — KEEP (35/40, 87.5%): Change: Added "NEVER include step numbers, ordinal numbers (1st, 2nd), or any numerical ordering in diagrams" to the anti-patterns section. Result: Numbering failures dropped from 3 to 1. Other evals held steady.
Experiment 2 — DISCARD (34/40, 85%): Change: Added "All text must be minimum 14px font size." Result: Legibility improved by 1, but color compliance dropped by 2. Reverted.
Experiment 3 — KEEP (37/40, 92.5%): Change: Replaced vague "pastel colors" instruction with specific hex codes: #A8D8EA, #AA96DA, #FCBAD3, #FFFFD2, #B5EAD7. Result: Color eval went from 8/10 to 10/10. Other evals held.
Experiment 4 — DISCARD (37/40, 92.5%): Change: Added anti-pattern "Do NOT use red (#FF0000), orange (#FF8C00), or neon green (#39FF14)." Result: No change. The hex codes from experiment 3 already solved the color problem. Reverted to keep skill simpler.
Experiment 5 — KEEP (39/40, 97.5%): Change: Added a worked example showing a correct diagram with properly formatted labels (no numbers, pastel fills, left-to-right flow, legible text). Result: Hit 39/40. One remaining failure: a complex diagram with overlapping labels. Diminishing returns — stopped.
Final delivery:
  • Baseline: 32/40 (80%) → Final: 39/40 (97.5%)
  • 5 experiments, 3 kept, 2 discarded
  • Top changes: specific hex codes for colors, explicit anti-numbering rule, worked example
  • Remaining issue: very complex diagrams occasionally get overlapping labels (1/40 failure rate)

how this connects to other skills

What feeds into autoresearch:
  • Any existing skill that needs optimization
  • User-defined eval criteria (or help them define evals using the eval guide)
What autoresearch feeds into:
  • The improved skill replaces the original
  • The changelog can be passed to future models for continued optimization
  • The eval suite can be reused whenever the skill is updated

the test

A good autoresearch run:
  1. Started with a baseline — never changed anything before measuring the starting point
  2. Used binary evals only — no scales, no vibes, no "rate this 1-10"
  3. Changed one thing at a time — so you know exactly what helped
  4. Kept a complete log — every experiment recorded, kept or discarded
  5. Improved the score — measurable improvement from baseline to final
  6. Didn't overfit — the skill got better at the actual job, not just at passing the specific test inputs
  7. Ran autonomously — didn't stop to ask permission between experiments
If the skill "passes" all evals but the actual output quality hasn't improved — the evals are bad, not the skill. Go back to step 2 and write better evals.