skill-creator
Skill Creator
Create new skills and iteratively improve them. The lifecycle:
- Decide what the skill should do and how
- Draft the skill
- Run test prompts through claude-with-the-skill
- Evaluate results qualitatively and quantitatively with the user
- Rewrite based on feedback
- Repeat until satisfied, then expand the test set
- Package and deliver
Determine where the user is in this lifecycle and help them progress. Starting
from scratch ("I want a skill for X") -- help narrow scope, draft, test, iterate.
Already have a draft -- go straight to eval/iterate. Just want a better
description -- jump to Description Optimization.
Be flexible: if the user says "just vibe with me, no evals," do that.
Available Resources
Before starting, verify what is available in the environment. The full workflow
requires these bundled resources:
| Resource | Path | Used for |
|---|---|---|
| Grader instructions | | Assertion evaluation |
| Comparator instructions | | Blind A/B comparison |
| Analyzer instructions | | Benchmark pattern analysis |
| Schema reference | | evals.json, grading.json formats |
| Assertion patterns | | Writing discriminating assertions |
| Troubleshooting | | Common errors and fixes |
| Eval review template | | Description optimization UI |
| Benchmark aggregator | | |
| Description optimizer | | |
| Eval viewer | | Human review interface |
If resources are missing: The core loop (draft -> test -> review -> improve)
still works without them -- grade inline and present results in-conversation
instead of the browser viewer. Note what is unavailable to the user and adapt.
Communicating with the user
The skill creator serves people across a wide range of technical backgrounds.
Non-developers are opening terminals for the first time because Claude makes
things possible; experienced engineers are also common.
Pay attention to context cues:
- "evaluation" and "benchmark" are borderline but generally OK
- Avoid "JSON" or "assertion" without a brief explanation unless the user has already used those terms
Err toward clarity. A short inline definition never hurts.
Creating a skill
Capture Intent
The current conversation may already contain the workflow to capture. If the user
says "turn this into a skill" or "make this repeatable," mine the conversation
history first:
- What tools were used, in what sequence?
- What corrections did the user make?
- What were the input and output formats?
- What would a different user need to know to reproduce this?
Fill gaps with the user and confirm before drafting.
If starting from scratch, establish:
- What should this skill enable Claude to do?
- When should this skill trigger? (what user phrases, contexts, task types)
- What is the expected output -- files, structured data, prose, or a workflow?
- Should test cases be set up? Skills with verifiable outputs (file transforms, code generation, fixed workflows) benefit from them. Skills with subjective outputs (writing style, creative direction) often do not. Suggest the appropriate default based on skill type, but let the user decide.
Interview and Research
Ask about edge cases, input/output examples, success criteria, and dependencies
before writing test prompts. If useful MCPs are available (docs search, similar
skill lookup), research in parallel via subagents. Come prepared so the interview
is lightweight.
Write the SKILL.md
Based on the interview, compose the skill. Every skill has:
- name: Identifier (kebab-case)
- description: The primary triggering mechanism. Describe what the skill does AND when to use it. All "when to trigger" information goes here, not in the body. Apply the "pushy principle" -- err toward including trigger contexts rather than omitting them. Example: instead of "Builds REST APIs," write "Builds REST APIs. Invoke whenever users ask about endpoints, routes, HTTP methods, API design, or want to add a backend -- even if they don't say 'REST.'"
- compatibility: Required tools or dependencies (optional, rarely needed)
- skill body: Instructions, workflow, examples, references
Skill output template
Use this as the starting structure -- fill in or remove sections as needed:

```markdown
---
name: skill-name
description: >
  [What this skill does and when to trigger it. One to three sentences.
  Apply the pushy principle: include adjacent trigger contexts.]
---

# [Skill Name]

[One paragraph: what this skill enables and when to use it.]

## [Core workflow or main section]

[Instructions in imperative form. Explain the why behind each step.]

## Output format

[Concrete template for what this skill produces.]

## Examples

Example 1:
Input: [realistic user prompt]
Output: [expected result]

## Reference files

- references/advanced.md - Read when [specific situation]
- scripts/transform.py - Run for [repeatable operation]
```

Anatomy of a skill directory
skill-name/
+-- SKILL.md (required)
+-- Optional bundled resources:
    +-- scripts/ - Scripts for deterministic/repetitive tasks
    +-- references/ - Docs loaded into context as needed
    +-- assets/ - Templates, icons, fonts

Progressive loading:
- Metadata (name + description) -- always in context
- SKILL.md body -- in context when the skill triggers (<500 lines ideal)
- Bundled resources -- loaded as needed, scripts can run without loading

Keep SKILL.md under 500 lines. If approaching that limit, add a layer of
hierarchy with clear pointers to reference files. For reference files over 300
lines, add a table of contents.

Domain organization -- when a skill covers multiple frameworks/variants:

cloud-deploy/
+-- SKILL.md (workflow + selection logic)
+-- references/
    +-- aws.md
    +-- gcp.md
    +-- azure.md

Claude reads only the relevant reference file.
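To keep an eye on those line budgets, a small sketch may help (this assumes the directory layout above; the function name is mine, not part of the bundled tooling):

```python
from pathlib import Path

def line_budget_report(skill_dir: str) -> dict:
    """Count lines in SKILL.md and each reference file, for the 500/300-line guidance."""
    root = Path(skill_dir)
    report = {"SKILL.md": len((root / "SKILL.md").read_text().splitlines())}
    refs = root / "references"
    if refs.is_dir():
        for ref in sorted(refs.glob("*.md")):
            report[f"references/{ref.name}"] = len(ref.read_text().splitlines())
    return report
```

Run it whenever a draft grows, and add hierarchy or a table of contents once counts approach the limits.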
Writing patterns
Output templates -- be concrete:

```markdown
## Report structure

ALWAYS use this template:

# [Title]

## Executive summary
## Key findings
## Recommendations
```

**Examples** -- include realistic ones:

```markdown
## Commit message format

Example:
Input: Added user authentication with JWT tokens
Output: feat(auth): implement JWT-based authentication
```

**Writing style** -- use imperative form. Explain the *why* rather than issuing
rules. Avoid ALWAYS/NEVER in all caps where the reasoning can be explained
instead -- models respond better to understanding than mandates. Write a draft,
then read it fresh and improve it.

**Security** -- skills must not contain malware, exploit code, or anything that
would surprise the user if they read the description. Do not create skills
designed to facilitate unauthorized access or data exfiltration. Roleplay skills
are fine.

Test Cases
After drafting, write 2-3 realistic test prompts -- things a real user would
actually type. Share them: "Here are a few test cases I'd like to try. Do these
look right, or do you want to add more?"

Save to evals/evals.json. Do not write assertions yet -- just prompts.
Assertions get drafted while runs are in progress.

```json
{
  "skill_name": "example-skill",
  "evals": [
    {
      "id": 1,
      "prompt": "User's task prompt",
      "expected_output": "Description of expected result",
      "files": []
    }
  ]
}
```

See references/schemas.md for the full schema including the assertions field.
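A quick pre-flight check on that file can catch typos before runs are spawned. This is a sketch of mine, not bundled tooling, and it only covers the fields shown above; the authoritative schema lives in references/schemas.md:

```python
import json

# Minimal per-eval fields from the example above; the full schema has more.
REQUIRED = {"id", "prompt", "expected_output", "files"}

def validate_evals(path: str) -> list[str]:
    """Return a list of problems found in an evals.json file (empty list = OK)."""
    with open(path) as f:
        data = json.load(f)
    problems = []
    if "skill_name" not in data:
        problems.append("missing skill_name")
    for i, ev in enumerate(data.get("evals", [])):
        missing = REQUIRED - set(ev)
        if missing:
            problems.append(f"eval {i}: missing {sorted(missing)}")
    return problems
```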
Running and evaluating test cases
This is one continuous sequence -- do not stop partway through. Do NOT use
/skill-test or any other testing skill.

Put results in <skill-name>-workspace/ as a sibling to the skill directory.
Organize by iteration, then by eval name, then by configuration, then by run.
Create directories as you go, not all upfront.

Workspace layout
<skill-name>-workspace/
+-- iteration-1/
| +-- eval-<name>/ <-- one per test case (descriptive name)
| | +-- eval_metadata.json <-- eval ID, prompt, assertions
| | +-- with_skill/
| | | +-- run-1/
| | | +-- outputs/ <-- files the subagent produced
| | | | +-- metrics.json
| | | +-- timing.json <-- from task notification
| | | +-- grading.json <-- written by grader agent
| | +-- without_skill/ <-- or old_skill/ when improving
| | +-- run-1/
| | +-- outputs/
| | +-- timing.json
| | +-- grading.json
| +-- benchmark.json <-- written by aggregate_benchmark.py
| +-- benchmark.md
+-- iteration-2/
+-- ...

Step 1: Spawn all runs (with-skill AND baseline) in the same turn
For each test case, spawn two subagents simultaneously -- one with the skill, one
without. Do not do with-skill runs first and baseline runs later. Launch
everything at once so results arrive around the same time.
With-skill run:
Execute this task:
- Skill path: <path-to-skill>
- Task: <eval prompt>
- Input files: <eval files if any, or "none">
- Save outputs to: <workspace>/iteration-<N>/eval-<name>/with_skill/run-1/outputs/
- Outputs to save: <what the user cares about>

Baseline run -- depends on context:
- New skill: no skill at all. Same prompt, save to <workspace>/iteration-<N>/eval-<name>/without_skill/run-1/outputs/
- Improving existing skill: snapshot the old version first (cp -r <skill-path> <workspace>/skill-snapshot/), point baseline at snapshot, save to old_skill/run-1/outputs/

Write eval_metadata.json at the eval level (e.g.,
<workspace>/iteration-N/eval-<name>/eval_metadata.json). Give each eval a
descriptive name based on what it tests -- not just "eval-0":

```json
{
  "eval_id": 0,
  "eval_name": "descriptive-name-here",
  "prompt": "The user's task prompt",
  "assertions": []
}
```
Step 2: While runs are in progress, draft assertions
Do not just wait for runs to finish -- use this time productively. Draft
quantitative assertions for each test case and explain them to the user. If
assertions already exist in , review and explain them.
evals/evals.jsonGood assertions are objectively verifiable and have descriptive names -- someone
glancing at benchmark results should immediately understand what each one checks.
For subjective skills (writing style, design quality), skip assertions and rely
on qualitative review. See for assertion patterns
by task type and the discriminating assertion test.
references/eval-patterns.mdUpdate files and with assertions once
drafted. Preview for the user: explain what they will see in the viewer.
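One way to make an assertion objectively verifiable is to phrase it so a small script can check it. A sketch (the function name and file names are illustrative, not part of the bundled tooling):

```python
from pathlib import Path

def check_outputs_exist(outputs_dir: str, required_files: list[str]) -> list[dict]:
    """One checkable assertion per required output file: it exists and is non-empty."""
    results = []
    for name in required_files:
        path = Path(outputs_dir) / name
        ok = path.is_file() and path.stat().st_size > 0
        results.append({"text": f"{name} exists and is non-empty", "passed": ok})
    return results
```

Assertions like "metrics.json exists and is non-empty" discriminate far better than "output looks reasonable."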
Step 3: As runs complete, capture timing data
When each subagent task completes, save timing data immediately to timing.json
in the run directory (e.g., with_skill/run-1/timing.json):

```json
{
  "total_tokens": 84852,
  "duration_ms": 23332,
  "total_duration_seconds": 23.3
}
```

This data only comes through task completion notifications -- capture it then.
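Writing that file as each completion notification arrives can be sketched as follows (field names per the example above; the helper is hypothetical):

```python
import json
from pathlib import Path

def save_timing(run_dir: str, total_tokens: int, duration_ms: int) -> None:
    """Write timing.json into a run directory, deriving seconds from milliseconds."""
    payload = {
        "total_tokens": total_tokens,
        "duration_ms": duration_ms,
        "total_duration_seconds": round(duration_ms / 1000, 1),
    }
    Path(run_dir, "timing.json").write_text(json.dumps(payload, indent=2))
```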
Step 4: Grade, aggregate, and launch the viewer
GENERATE THE EVAL VIEWER BEFORE EVALUATING INPUTS YOURSELF. Get outputs in front of the human first.

Once all runs complete:

- Grade each run -- spawn a grader subagent for each run directory (both with_skill and without_skill/old_skill get their own grader). Use this prompt:

  You are a grader agent. Read agents/grader.md from <skill-creator-path> to load your full instructions, then grade this eval run:
  - Expectations: ["assertion 1", "assertion 2", ...]
  - Transcript path: <workspace>/iteration-N/eval-<name>/with_skill/run-1/transcript.md
  - Outputs dir: <workspace>/iteration-N/eval-<name>/with_skill/run-1/outputs/
  Save grading.json to <workspace>/iteration-N/eval-<name>/with_skill/run-1/grading.json.

  Required field names in grading.json: text, passed, details (not evidence/name/met -- the viewer depends on these exact names). For assertions that can be checked programmatically, write and run a script.
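For illustration, a single graded-assertion entry using those field names might look like the fragment below. The values are made up, and the surrounding top-level structure is an assumption on my part -- the authoritative shape is in references/schemas.md:

```json
{
  "assertions": [
    {
      "text": "Chart includes axis labels",
      "passed": false,
      "details": "outputs/chart.png has no y-axis label"
    }
  ]
}
```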
- Aggregate -- run:

  ```bash
  python -m scripts.aggregate_benchmark <workspace>/iteration-N --skill-name <name>
  ```

  Produces benchmark.json and benchmark.md. List with_skill before baseline. If generating benchmark.json manually, see references/schemas.md for the exact schema the viewer expects.
- Analyst pass -- read the benchmark data and surface patterns the aggregate stats might hide: non-discriminating assertions (always pass regardless of skill), high-variance evals (possibly flaky), time/token tradeoffs. See agents/analyzer.md ("Analyzing Benchmark Results" section).
- Launch the viewer:

  ```bash
  nohup python <skill-creator-path>/eval-viewer/generate_review.py \
    <workspace>/iteration-N \
    --skill-name "my-skill" \
    --benchmark <workspace>/iteration-N/benchmark.json \
    > /dev/null 2>&1 &
  VIEWER_PID=$!
  ```

  For iteration 2+, add --previous-workspace <workspace>/iteration-<N-1>. No display / headless: use --static <output_path> to write a standalone HTML file. The "Submit All Reviews" button downloads feedback.json -- copy it into the workspace directory when done.

- Tell the user: "I've opened the results in your browser. 'Outputs' tab lets you review each test case and leave feedback. 'Benchmark' tab shows the quantitative comparison. Come back when you're done."
What the user sees
Outputs tab: one test case at a time -- prompt, output, previous output
(collapsed, iteration 2+), formal grades (collapsed), feedback textbox, previous
feedback.
Benchmark tab: pass rates, timing, token usage per configuration, per-eval
breakdowns, analyst observations.
Navigation: prev/next buttons or arrow keys. "Submit All Reviews" saves
feedback.json.

Step 5: Read the feedback
```json
{
  "reviews": [
    {"run_id": "eval-0-with_skill", "feedback": "chart is missing axis labels"},
    {"run_id": "eval-1-with_skill", "feedback": ""},
    {"run_id": "eval-2-with_skill", "feedback": "perfect, love this"}
  ],
  "status": "complete"
}
```

Empty feedback means the user was satisfied. Focus improvements on cases with
specific complaints. Kill the viewer when done:

```bash
kill $VIEWER_PID 2>/dev/null
```
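Reading the file and pulling out the reviews that contain any feedback text can be sketched as (assuming the feedback.json shape shown above; the function name is mine):

```python
import json

def actionable_feedback(path: str) -> list[dict]:
    """Return only the reviews whose feedback text is non-empty."""
    with open(path) as f:
        data = json.load(f)
    return [r for r in data["reviews"] if r["feedback"].strip()]
```

Note this keeps praise as well as complaints; deciding which non-empty entries call for changes is still a judgment call.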
Improving the skill
This is the heart of the loop. Tests have been run, the user has reviewed -- now
make the skill better.
How to think about improvements
Generalize from feedback, do not overfit to examples. The skill will run
across countless different prompts. A skill that works only for the test examples
is useless. Rather than fiddly constraints, try different metaphors, different
working patterns, different framings -- cheap to try, potentially transformative.
Keep the prompt lean. Read the transcripts, not just final outputs. If the
skill makes the model waste time on unproductive steps, remove those
instructions. Deadweight is not neutral -- it adds noise and slows execution.
Explain the why, not just the what. Models have good theory of mind. They
perform better with reasoning than rules. If ALWAYS or NEVER in all caps appears,
try instead to explain why the thing matters. "Show the loading state before
fetching so users aren't left wondering if anything happened" outperforms "ALWAYS
show loading state."
Bundle repeated work. If all test case transcripts show the subagent writing
the same helper script, write it once, put it in scripts/, point the skill at
it. Every future invocation benefits.

Take time here. Read the draft revision with fresh eyes. Try to genuinely
understand what the user wants and what is blocking them, then transmit that
understanding into the instructions.
The iteration loop
- Apply improvements to the skill
- Rerun all test cases into iteration-<N+1>/, including baselines
- Launch the viewer with --previous-workspace pointing at the previous iteration
- Wait for the user to review
- Read feedback, improve again, repeat

Stop when: the user says they are happy, feedback is all empty, or meaningful
progress has stopped.
Advanced: Blind comparison
For rigorous comparison between two skill versions, use the blind comparison
system: read agents/comparator.md and agents/analyzer.md. An independent
agent judges quality without knowing which version is which. Most users do not
need this -- the human review loop is usually sufficient.

Description Optimization
The description field is the primary mechanism controlling whether Claude
invokes a skill. After creating or improving a skill, offer to optimize the
description for triggering accuracy.

Step 1: Generate trigger eval queries
Create 20 queries -- a mix of should-trigger and should-not-trigger. Save as
JSON:

```json
[
  {"query": "the user prompt", "should_trigger": true},
  {"query": "another prompt", "should_trigger": false}
]
```

Make queries realistic and specific. Include file paths, personal context,
column names, company names, casual speech, typos, abbreviations, varying
lengths. Focus on edge cases, not clear-cut cases.

Bad: "Format this data", "Create a chart"

Good: "ok so my boss just sent me this xlsx file (its in my downloads, called something like 'Q4 sales final FINAL v2.xlsx') and she wants me to add a column that shows the profit margin as a percentage. The revenue is in column C and costs are in column D i think"

Should-trigger queries (8-10): Cover different phrasings of the same intent.
Include cases where the user clearly needs the skill but does not name it.
Include uncommon use cases and cases where this skill competes with another but
should win.

Should-not-trigger queries (8-10): The most valuable are near-misses --
queries that share keywords but need something different. Adjacent domains,
ambiguous phrasing where keyword-matching would fire but should not. Do not use
obviously irrelevant negatives -- they test nothing.
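A quick sanity check that the set lands in the 8-10 range on each side can be sketched as (assumes the JSON array format above; the function name is mine):

```python
import json

def eval_set_balance(path: str) -> dict:
    """Count should-trigger vs should-not-trigger queries in a trigger eval set."""
    with open(path) as f:
        queries = json.load(f)
    positive = sum(1 for q in queries if q["should_trigger"])
    return {"should_trigger": positive, "should_not_trigger": len(queries) - positive}
```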
Step 2: Review with user
- Read assets/eval_review.html
- Replace __EVAL_DATA_PLACEHOLDER__ with the JSON array (no quotes -- JS variable assignment)
- Replace __SKILL_NAME_PLACEHOLDER__ and __SKILL_DESCRIPTION_PLACEHOLDER__
- Write to /tmp/eval_review_<skill-name>.html and open it
- User can edit queries, toggle should-trigger, add/remove entries, then "Export Eval Set"
- Check Downloads for the most recent eval_set.json
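The replacement steps above amount to plain string substitution. A sketch (the function name is mine; the placeholder strings are exactly those listed above):

```python
import json
from pathlib import Path

def build_review_page(template_path, out_path, queries, skill_name, skill_description):
    """Fill the eval review template's placeholders and write the result."""
    html = Path(template_path).read_text()
    # Raw JS array assignment, so no surrounding quotes around the JSON.
    html = html.replace("__EVAL_DATA_PLACEHOLDER__", json.dumps(queries))
    html = html.replace("__SKILL_NAME_PLACEHOLDER__", skill_name)
    html = html.replace("__SKILL_DESCRIPTION_PLACEHOLDER__", skill_description)
    Path(out_path).write_text(html)
```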
Step 3: Run the optimization loop
```bash
python -m scripts.run_loop \
  --eval-set <path-to-trigger-eval.json> \
  --skill-path <path-to-skill> \
  --model <model-id-powering-this-session> \
  --max-iterations 5 \
  --verbose
```

Use the model ID from the system prompt so the test matches what the user
actually experiences. Run in background and tail output periodically for updates.

The loop: 60/40 train/test split, evaluate current description (3 runs per query
for variance), propose improvements from failures, re-evaluate, pick best by
test score (not train, to avoid overfitting), repeat up to 5 iterations.

How triggering works: Claude sees skills as (name, description) pairs and
decides whether to consult a skill based on that. It only consults skills for
tasks it cannot easily handle alone -- eval queries should be substantive enough
that a skill would actually help. Simple one-step queries will not trigger skills
even with perfect description matching.
Step 4: Apply the result
Take best_description from the JSON output, update the skill's SKILL.md
frontmatter. Show before/after to the user and report scores.

Package and Present
If the present_files tool is available:

```bash
python -m scripts.package_skill <path/to/skill-folder>
```

Present the resulting .skill file to the user for installation.

Claude.ai-specific instructions
Same core workflow, different mechanics:
Running test cases: No subagents. Read SKILL.md, follow instructions to
complete each test prompt directly, one at a time. Skip baselines. Less rigorous
(the skill author is also running it), but useful as a sanity check -- human
review compensates.
Reviewing results: If no browser is available, present results directly in
conversation. For file outputs (.docx, .xlsx), save to filesystem and tell the
user where to find them. Ask for feedback inline.
Benchmarking: Skip quantitative benchmarking -- baseline comparisons are not
meaningful without subagents. Focus on qualitative user feedback.
Description optimization: Requires the claude -p CLI. Skip if unavailable.

Blind comparison: Requires subagents. Skip.

Updating an existing skill:
- Preserve the original name (directory name and frontmatter name field unchanged)
- The installed skill path may be read-only -- copy to /tmp/skill-name/, edit there, package from the copy
- Direct writes may fail; stage in /tmp/ first
Cowork-specific instructions
- Subagents work. If severe timeout problems arise, run test prompts in series.
- No browser/display: use --static <output_path> for the eval viewer, then provide the link for the user to open locally.
- Generate the eval viewer before evaluating inputs. See Step 4 above.
- Feedback: "Submit All Reviews" downloads feedback.json. Read from there.
- Description optimization (run_loop.py) works fine -- uses claude -p.
- Save description optimization until the skill is fully done and the user agrees.
- Updating an existing skill: same instructions as Claude.ai section above.

Track these lifecycle steps in TodoWrite if available. In Cowork, specifically
add "Create evals JSON and run eval-viewer/generate_review.py so human can
review test cases" as a todo item. For a full list of bundled resources, see the
Available Resources table near the top of this file.