skill-creator


Skill Creator

Create new skills and iteratively improve them. The lifecycle:
  1. Decide what the skill should do and how
  2. Draft the skill
  3. Run test prompts through claude-with-the-skill
  4. Evaluate results qualitatively and quantitatively with the user
  5. Rewrite based on feedback
  6. Repeat until satisfied, then expand the test set
  7. Package and deliver
Determine where the user is in this lifecycle and help them progress. Starting from scratch ("I want a skill for X") -- help narrow scope, draft, test, iterate. Already have a draft -- go straight to eval/iterate. Just want a better description -- jump to Description Optimization.
Be flexible: if the user says "just vibe with me, no evals," do that.

Available Resources

Before starting, verify what is available in the environment. The full workflow requires these bundled resources:
| Resource | Path | Used for |
| --- | --- | --- |
| Grader instructions | agents/grader.md | Assertion evaluation |
| Comparator instructions | agents/comparator.md | Blind A/B comparison |
| Analyzer instructions | agents/analyzer.md | Benchmark pattern analysis |
| Schema reference | references/schemas.md | evals.json, grading.json formats |
| Assertion patterns | references/eval-patterns.md | Writing discriminating assertions |
| Troubleshooting | references/troubleshooting.md | Common errors and fixes |
| Eval review template | assets/eval_review.html | Description optimization UI |
| Benchmark aggregator | scripts/aggregate_benchmark | `python -m scripts.aggregate_benchmark` |
| Description optimizer | scripts/run_loop | `python -m scripts.run_loop` |
| Eval viewer | eval-viewer/generate_review.py | Human review interface |
If resources are missing: The core loop (draft -> test -> review -> improve) still works without them -- grade inline and present results in-conversation instead of the browser viewer. Note what is unavailable to the user and adapt.

Communicating with the user

The skill creator serves people across a wide range of technical backgrounds. Non-developers are opening terminals for the first time because Claude makes things possible; experienced engineers are also common.
Pay attention to context cues:
  • "evaluation" and "benchmark" are borderline but generally OK
  • Avoid "JSON" or "assertion" without a brief explanation unless the user has already used those terms
Err toward clarity. A short inline definition never hurts.

Creating a skill

Capture Intent

The current conversation may already contain the workflow to capture. If the user says "turn this into a skill" or "make this repeatable," mine the conversation history first:
  • What tools were used, in what sequence?
  • What corrections did the user make?
  • What were the input and output formats?
  • What would a different user need to know to reproduce this?
Fill gaps with the user and confirm before drafting.
If starting from scratch, establish:
  1. What should this skill enable Claude to do?
  2. When should this skill trigger? (what user phrases, contexts, task types)
  3. What is the expected output -- files, structured data, prose, or a workflow?
  4. Should test cases be set up? Skills with verifiable outputs (file transforms, code generation, fixed workflows) benefit from them. Skills with subjective outputs (writing style, creative direction) often do not. Suggest the appropriate default based on skill type, but let the user decide.

Interview and Research

Ask about edge cases, input/output examples, success criteria, and dependencies before writing test prompts. If useful MCPs are available (docs search, similar skill lookup), research in parallel via subagents. Come prepared so the interview is lightweight.

Write the SKILL.md

Based on the interview, compose the skill. Every skill has:
  • name: Identifier (kebab-case)
  • description: The primary triggering mechanism. Describe what the skill does AND when to use it. All "when to trigger" information goes here, not in the body. Apply the "pushy principle" -- err toward including trigger contexts rather than omitting them. Example: instead of "Builds REST APIs," write "Builds REST APIs. Invoke whenever users ask about endpoints, routes, HTTP methods, API design, or want to add a backend -- even if they don't say 'REST.'"
  • compatibility: Required tools or dependencies (optional, rarely needed)
  • skill body: Instructions, workflow, examples, references

Skill output template

Use this as the starting structure -- fill in or remove sections as needed:

```markdown
---
name: skill-name
description: >
  [What this skill does and when to trigger it. One to three sentences.
  Apply the pushy principle: include adjacent trigger contexts.]
---

# [Skill Name]

[One paragraph: what this skill enables and when to use it.]

## [Core workflow or main section]

[Instructions in imperative form. Explain the why behind each step.]

## Output format

[Concrete template for what this skill produces.]

## Examples

Example 1: Input: [realistic user prompt] Output: [expected result]

## Reference files

- Read references/advanced.md when [specific situation]
- Run scripts/transform.py for [repeatable operation]
```

Anatomy of a skill directory

```
skill-name/
+-- SKILL.md (required)
+-- Optional bundled resources:
    +-- scripts/    - Scripts for deterministic/repetitive tasks
    +-- references/ - Docs loaded into context as needed
    +-- assets/     - Templates, icons, fonts
```

Progressive loading:
  1. Metadata (name + description) -- always in context
  2. SKILL.md body -- in context when the skill triggers (<500 lines ideal)
  3. Bundled resources -- loaded as needed; scripts can run without being loaded into context

Keep SKILL.md under 500 lines. If approaching that limit, add a layer of hierarchy with clear pointers to reference files. For reference files over 300 lines, add a table of contents.
Domain organization -- when a skill covers multiple frameworks/variants:

```
cloud-deploy/
+-- SKILL.md (workflow + selection logic)
+-- references/
    +-- aws.md
    +-- gcp.md
    +-- azure.md
```

Claude reads only the relevant reference file.

Writing patterns

**Output templates** -- be concrete:

```markdown
## Report structure
ALWAYS use this template:
# [Title]
## Executive summary
## Key findings
## Recommendations
```

**Examples** -- include realistic ones:

```markdown
## Commit message format
Example: Input: Added user authentication with JWT tokens Output: feat(auth): implement JWT-based authentication
```

**Writing style** -- use imperative form. Explain the *why* rather than issuing
rules. Avoid ALWAYS/NEVER in all caps where the reasoning can be explained
instead -- models respond better to understanding than mandates. Write a draft,
then read it fresh and improve it.

**Security** -- skills must not contain malware, exploit code, or anything that
would surprise the user if they read the description. Do not create skills
designed to facilitate unauthorized access or data exfiltration. Roleplay skills
are fine.

Test Cases

After drafting, write 2-3 realistic test prompts -- things a real user would actually type. Share them: "Here are a few test cases I'd like to try. Do these look right, or do you want to add more?"
Save them to `evals/evals.json`. Do not write assertions yet -- just prompts. Assertions get drafted while runs are in progress.

```json
{
  "skill_name": "example-skill",
  "evals": [
    {
      "id": 1,
      "prompt": "User's task prompt",
      "expected_output": "Description of expected result",
      "files": []
    }
  ]
}
```

See `references/schemas.md` for the full schema, including the `assertions` field.
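A quick sanity check of the file before spawning any runs can be sketched like this (an illustrative helper, not part of the bundled scripts; the only fields assumed required are the ones shown above -- references/schemas.md remains authoritative):

```python
import json

def validate_evals(path="evals/evals.json"):
    """Return a list of problems with evals.json; empty means it looks OK."""
    with open(path) as f:
        data = json.load(f)
    problems = []
    if "skill_name" not in data:
        problems.append("missing top-level 'skill_name'")
    for i, ev in enumerate(data.get("evals", [])):
        # Check each eval for the minimal fields shown in the example above.
        for field in ("id", "prompt", "expected_output"):
            if field not in ev:
                problems.append(f"eval {i}: missing '{field}'")
    return problems
```

Calling `validate_evals()` right after writing the file catches malformed JSON and missing fields before any subagent time is spent.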

Running and evaluating test cases

This is one continuous sequence -- do not stop partway through. Do NOT use `/skill-test` or any other testing skill.
Put results in `<skill-name>-workspace/` as a sibling to the skill directory. Organize by iteration, then by eval name, then by configuration, then by run. Create directories as you go, not all upfront.

Workspace layout

```
<skill-name>-workspace/
+-- iteration-1/
|   +-- eval-<name>/                    <-- one per test case (descriptive name)
|   |   +-- eval_metadata.json          <-- eval ID, prompt, assertions
|   |   +-- with_skill/
|   |   |   +-- run-1/
|   |   |       +-- outputs/            <-- files the subagent produced
|   |   |       |   +-- metrics.json
|   |   |       +-- timing.json         <-- from task notification
|   |   |       +-- grading.json        <-- written by grader agent
|   |   +-- without_skill/              <-- or old_skill/ when improving
|   |       +-- run-1/
|   |           +-- outputs/
|   |           +-- timing.json
|   |           +-- grading.json
|   +-- benchmark.json                  <-- written by aggregate_benchmark.py
|   +-- benchmark.md
+-- iteration-2/
    +-- ...
```

Step 1: Spawn all runs (with-skill AND baseline) in the same turn

For each test case, spawn two subagents simultaneously -- one with the skill, one without. Do not do with-skill runs first and baseline runs later. Launch everything at once so results arrive around the same time.
With-skill run:

```
Execute this task:
- Skill path: <path-to-skill>
- Task: <eval prompt>
- Input files: <eval files if any, or "none">
- Save outputs to: <workspace>/iteration-<N>/eval-<name>/with_skill/run-1/outputs/
- Outputs to save: <what the user cares about>
```

Baseline run -- depends on context:
  • New skill: no skill at all. Same prompt, save to `<workspace>/iteration-<N>/eval-<name>/without_skill/run-1/outputs/`
  • Improving an existing skill: snapshot the old version first (`cp -r <skill-path> <workspace>/skill-snapshot/`), point the baseline at the snapshot, save to `old_skill/run-1/outputs/`

Write `eval_metadata.json` at the eval level (e.g., `<workspace>/iteration-N/eval-<name>/eval_metadata.json`). Give each eval a descriptive name based on what it tests -- not just "eval-0":

```json
{
  "eval_id": 0,
  "eval_name": "descriptive-name-here",
  "prompt": "The user's task prompt",
  "assertions": []
}
```

Step 2: While runs are in progress, draft assertions

Do not just wait for runs to finish -- use this time productively. Draft quantitative assertions for each test case and explain them to the user. If assertions already exist in `evals/evals.json`, review and explain them.
Good assertions are objectively verifiable and have descriptive names -- someone glancing at benchmark results should immediately understand what each one checks. For subjective skills (writing style, design quality), skip assertions and rely on qualitative review. See `references/eval-patterns.md` for assertion patterns by task type and the discriminating-assertion test.
Update the `eval_metadata.json` files and `evals/evals.json` with assertions once drafted. Preview for the user: explain what they will see in the viewer.

Step 3: As runs complete, capture timing data

When each subagent task completes, save timing data immediately to `timing.json` in the run directory (e.g., `with_skill/run-1/timing.json`):

```json
{
  "total_tokens": 84852,
  "duration_ms": 23332,
  "total_duration_seconds": 23.3
}
```

This data only comes through task completion notifications -- capture it then.
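A minimal helper for this capture step might look like the following (a sketch; it assumes the token and duration numbers have already been read from the task completion notification, and uses the field names shown above):

```python
import json
from pathlib import Path

def save_timing(run_dir, total_tokens, duration_ms):
    """Write timing.json for one run, in the format shown above.

    `total_tokens` and `duration_ms` come from the task completion
    notification -- call this as soon as the notification arrives,
    since the data is not available later.
    """
    run_dir = Path(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)
    payload = {
        "total_tokens": total_tokens,
        "duration_ms": duration_ms,
        "total_duration_seconds": round(duration_ms / 1000, 1),
    }
    (run_dir / "timing.json").write_text(json.dumps(payload, indent=2))
    return payload

# Example: save_timing("with_skill/run-1", total_tokens=84852, duration_ms=23332)
```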

Step 4: Grade, aggregate, and launch the viewer

GENERATE THE EVAL VIEWER BEFORE EVALUATING OUTPUTS YOURSELF. Get outputs in front of the human first.
Once all runs complete:
  1. Grade each run -- spawn a grader subagent for each run directory (both with_skill and without_skill/old_skill get their own grader). Use this prompt:

    ```
    You are a grader agent. Read agents/grader.md from <skill-creator-path>
    to load your full instructions, then grade this eval run:

    - Expectations: ["assertion 1", "assertion 2", ...]
    - Transcript path: <workspace>/iteration-N/eval-<name>/with_skill/run-1/transcript.md
    - Outputs dir:    <workspace>/iteration-N/eval-<name>/with_skill/run-1/outputs/

    Save grading.json to <workspace>/iteration-N/eval-<name>/with_skill/run-1/grading.json.
    ```

    Required field names in grading.json: `text`, `passed`, `evidence` (not `name`/`met`/`details` -- the viewer depends on these exact names). For assertions that can be checked programmatically, write and run a script.
  2. Aggregate -- run:

    ```bash
    python -m scripts.aggregate_benchmark <workspace>/iteration-N --skill-name <name>
    ```

    This produces `benchmark.json` and `benchmark.md`. List with_skill before the baseline. If generating `benchmark.json` manually, see `references/schemas.md` for the exact schema the viewer expects.
  3. Analyst pass -- read the benchmark data and surface patterns the aggregate stats might hide: non-discriminating assertions (those that always pass regardless of the skill), high-variance evals (possibly flaky), and time/token tradeoffs. See `agents/analyzer.md` ("Analyzing Benchmark Results" section).
  4. Launch the viewer:

    ```bash
    nohup python <skill-creator-path>/eval-viewer/generate_review.py \
      <workspace>/iteration-N \
      --skill-name "my-skill" \
      --benchmark <workspace>/iteration-N/benchmark.json \
      > /dev/null 2>&1 &
    VIEWER_PID=$!
    ```

    For iteration 2+, add `--previous-workspace <workspace>/iteration-<N-1>`.
    No display / headless: use `--static <output_path>` to write a standalone HTML file. The "Submit All Reviews" button downloads `feedback.json` -- copy it into the workspace directory when done.
  5. Tell the user: "I've opened the results in your browser. The 'Outputs' tab lets you review each test case and leave feedback. The 'Benchmark' tab shows the quantitative comparison. Come back when you're done."
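A programmatic assertion check that emits grading entries with the viewer's exact field names can be sketched like this (illustrative helpers, not bundled scripts; whether grading.json holds a bare list or a wrapper object is an assumption here -- check references/schemas.md):

```python
import json
from pathlib import Path

def grade_file_exists(outputs_dir, filename, assertion_text):
    """Build one grading entry using the viewer's required field names."""
    target = Path(outputs_dir) / filename
    return {
        "text": assertion_text,               # assertion description (not "name")
        "passed": target.exists(),            # boolean result (not "met")
        "evidence": f"checked for {target}",  # what was inspected (not "details")
    }

def write_grading(run_dir, entries):
    """Save the entries as grading.json in the run directory."""
    path = Path(run_dir) / "grading.json"
    path.write_text(json.dumps(entries, indent=2))
    return path

# entries = [grade_file_exists("with_skill/run-1/outputs", "metrics.json",
#                              "run produces metrics.json")]
# write_grading("with_skill/run-1", entries)
```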

What the user sees

Outputs tab: one test case at a time -- prompt, output, previous output (collapsed, iteration 2+), formal grades (collapsed), feedback textbox, previous feedback.
Benchmark tab: pass rates, timing, token usage per configuration, per-eval breakdowns, analyst observations.
Navigation: prev/next buttons or arrow keys. "Submit All Reviews" saves `feedback.json`.

Step 5: Read the feedback

```json
{
  "reviews": [
    {"run_id": "eval-0-with_skill", "feedback": "chart is missing axis labels"},
    {"run_id": "eval-1-with_skill", "feedback": ""},
    {"run_id": "eval-2-with_skill", "feedback": "perfect, love this"}
  ],
  "status": "complete"
}
```

Empty feedback means the user was satisfied. Focus improvements on cases with specific complaints. Kill the viewer when done:

```bash
kill $VIEWER_PID 2>/dev/null
```
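Splitting the reviews into "needs work" and "satisfied" can be sketched as follows (a helper for illustration; it assumes the feedback.json shape shown above):

```python
import json

def summarize_feedback(path="feedback.json"):
    """Return (needs_work, satisfied) lists from a feedback.json file.

    A review with non-empty feedback text needs attention; empty
    feedback means the user was happy with that run.
    """
    with open(path) as f:
        feedback = json.load(f)
    needs_work = [r for r in feedback["reviews"] if r["feedback"].strip()]
    satisfied = [r for r in feedback["reviews"] if not r["feedback"].strip()]
    return needs_work, satisfied

# needs_work, satisfied = summarize_feedback()
# for r in needs_work:
#     print(f"{r['run_id']}: {r['feedback']}")
```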

Improving the skill

This is the heart of the loop. Tests have been run, the user has reviewed -- now make the skill better.

How to think about improvements

**Generalize from feedback; do not overfit to examples.** The skill will run across countless different prompts. A skill that works only for the test examples is useless. Rather than fiddly constraints, try different metaphors, different working patterns, different framings -- cheap to try, potentially transformative.
**Keep the prompt lean.** Read the transcripts, not just final outputs. If the skill makes the model waste time on unproductive steps, remove those instructions. Deadweight is not neutral -- it adds noise and slows execution.
**Explain the why, not just the what.** Models have good theory of mind. They perform better with reasoning than rules. If ALWAYS or NEVER in all caps appears, try instead to explain why the thing matters. "Show the loading state before fetching so users aren't left wondering if anything happened" outperforms "ALWAYS show loading state."
**Bundle repeated work.** If all test case transcripts show the subagent writing the same helper script, write it once, put it in `scripts/`, and point the skill at it. Every future invocation benefits.
**Take time here.** Read the draft revision with fresh eyes. Try to genuinely understand what the user wants and what is blocking them, then transmit that understanding into the instructions.

The iteration loop

  1. Apply improvements to the skill
  2. Rerun all test cases into `iteration-<N+1>/`, including baselines
  3. Launch the viewer with `--previous-workspace` pointing at the previous iteration
  4. Wait for the user to review
  5. Read feedback, improve again, repeat

Stop when: the user says they are happy, feedback is all empty, or meaningful progress has stopped.

Advanced: Blind comparison

For rigorous comparison between two skill versions, use the blind comparison system: read `agents/comparator.md` and `agents/analyzer.md`. An independent agent judges quality without knowing which version is which. Most users do not need this -- the human review loop is usually sufficient.

Description Optimization

The `description` field is the primary mechanism controlling whether Claude invokes a skill. After creating or improving a skill, offer to optimize the description for triggering accuracy.

Step 1: Generate trigger eval queries

Create 20 queries -- a mix of should-trigger and should-not-trigger. Save as JSON:

```json
[
  {"query": "the user prompt", "should_trigger": true},
  {"query": "another prompt", "should_trigger": false}
]
```

Make queries realistic and specific. Include file paths, personal context, column names, company names, casual speech, typos, abbreviations, varying lengths. Focus on edge cases, not clear-cut cases.
Bad: "Format this data", "Create a chart"
Good: "ok so my boss just sent me this xlsx file (its in my downloads, called something like 'Q4 sales final FINAL v2.xlsx') and she wants me to add a column that shows the profit margin as a percentage. The revenue is in column C and costs are in column D i think"
Should-trigger queries (8-10): Cover different phrasings of the same intent. Include cases where the user clearly needs the skill but does not name it. Include uncommon use cases and cases where this skill competes with another but should win.
Should-not-trigger queries (8-10): The most valuable are near-misses -- queries that share keywords but need something different. Adjacent domains, ambiguous phrasing where keyword-matching would fire but should not. Do not use obviously irrelevant negatives -- they test nothing.

Step 2: Review with user

  1. Read `assets/eval_review.html`
  2. Replace `__EVAL_DATA_PLACEHOLDER__` with the JSON array (no quotes -- it is a JS variable assignment)
  3. Replace `__SKILL_NAME_PLACEHOLDER__` and `__SKILL_DESCRIPTION_PLACEHOLDER__`
  4. Write to `/tmp/eval_review_<skill-name>.html` and open it
  5. The user can edit queries, toggle should-trigger, add/remove entries, then click "Export Eval Set"
  6. Check Downloads for the most recent `eval_set.json`

Step 3: Run the optimization loop

```bash
python -m scripts.run_loop \
  --eval-set <path-to-trigger-eval.json> \
  --skill-path <path-to-skill> \
  --model <model-id-powering-this-session> \
  --max-iterations 5 \
  --verbose
```

Use the model ID from the system prompt so the test matches what the user actually experiences. Run in the background and tail the output periodically for updates.
The loop: 60/40 train/test split, evaluate the current description (3 runs per query to account for variance), propose improvements from failures, re-evaluate, pick the best by test score (not train score, to avoid overfitting), repeat up to 5 iterations.
How triggering works: Claude sees skills as `(name, description)` pairs and decides whether to consult a skill based on that. It only consults skills for tasks it cannot easily handle alone -- eval queries should be substantive enough that a skill would actually help. Simple one-step queries will not trigger skills even with perfect description matching.

Step 4: Apply the result

Take `best_description` from the JSON output and update the skill's SKILL.md frontmatter. Show the before/after to the user and report the scores.

Package and Present

If the `present_files` tool is available:

```bash
python -m scripts.package_skill <path/to/skill-folder>
```

Present the resulting `.skill` file to the user for installation.


Claude.ai-specific instructions

Same core workflow, different mechanics:
**Running test cases:** No subagents. Read SKILL.md and follow its instructions to complete each test prompt directly, one at a time. Skip baselines. This is less rigorous (the skill author is also running the tests), but useful as a sanity check -- human review compensates.
**Reviewing results:** If no browser is available, present results directly in conversation. For file outputs (.docx, .xlsx), save them to the filesystem and tell the user where to find them. Ask for feedback inline.
**Benchmarking:** Skip quantitative benchmarking -- baseline comparisons are not meaningful without subagents. Focus on qualitative user feedback.
**Description optimization:** Requires the `claude -p` CLI. Skip if unavailable.
**Blind comparison:** Requires subagents. Skip.
**Updating an existing skill:**
  • Preserve the original name (directory name and `name` frontmatter field unchanged)
  • The installed skill path may be read-only -- copy to `/tmp/skill-name/`, edit there, and package from the copy
  • Direct writes may fail; stage in `/tmp/` first

Cowork-specific instructions

  • Subagents work. If severe timeout problems arise, run test prompts in series.
  • No browser/display: use `--static <output_path>` for the eval viewer, then provide the link for the user to open locally.
  • Generate the eval viewer before evaluating outputs yourself. See Step 4 above.
  • Feedback: "Submit All Reviews" downloads `feedback.json`. Read it from there.
  • Description optimization (`run_loop.py`) works fine -- it uses `claude -p`.
  • Save description optimization until the skill is fully done and the user agrees.
  • Updating an existing skill: same instructions as the Claude.ai section above.

Track these lifecycle steps in TodoWrite if available. In Cowork, specifically add "Create evals JSON and run `eval-viewer/generate_review.py` so human can review test cases" as a todo item. For a full list of bundled resources, see the Available Resources table near the top of this file.