skill-creator
Skill Creator
A skill for creating new skills and iteratively improving them.
At a high level, the process of creating a skill goes like this:
- Decide what you want the skill to do and roughly how it should do it
- Write a draft of the skill
- Create a few test prompts and run the agent with access to the skill on them
- Help the user evaluate the results both qualitatively and quantitatively
- While the runs happen in the background, draft some quantitative evals if there aren't any (if there are, you can use them as is or modify them if something needs to change). Then explain them to the user (or, if they already existed, explain the existing ones)
- Use the `eval-viewer/generate_review.py` script to show the user the results for them to look at, and also let them look at the quantitative metrics
- Rewrite the skill based on feedback from the user's evaluation of the results (and also if there are any glaring flaws that become apparent from the quantitative benchmarks)
- Repeat until you're satisfied
- Expand the test set and try again at larger scale
Your job when using this skill is to figure out where the user is in this process and then jump in and help them progress through these stages. So for instance, maybe they're like "I want to make a skill for X". You can help narrow down what they mean, write a draft, write the test cases, figure out how they want to evaluate, run all the prompts, and repeat.
On the other hand, maybe they already have a draft of the skill. In this case you can go straight to the eval/iterate part of the loop.
Of course, you should always be flexible and if the user is like "I don't need to run a bunch of evaluations, just vibe with me", you can do that instead.
Then after the skill is done (but again, the order is flexible), you can also run the skill description improver, which we have a whole separate script for, to optimize the triggering of the skill.
Cool? Cool.
Communicating with the user
The skill creator is liable to be used by people across a wide range of familiarity with coding jargon. If you haven't heard (and how could you have, since it started only very recently), there's a trend now where the power of Claude is inspiring plumbers to open up their terminals, and parents and grandparents to google "how to install npm". On the other hand, the bulk of users are probably fairly computer-literate.
So please pay attention to context cues to understand how to phrase your communication! In the default case, just to give you some idea:
- "evaluation" and "benchmark" are borderline, but OK
- for "JSON" and "assertion" you want to see serious cues from the user that they know what those things are before using them without explaining them
It's OK to briefly explain terms when you're in doubt; a short definition goes a long way if you're unsure the user will get it.
Creating a skill
Capture Intent
Start by understanding the user's intent. The current conversation might already contain a workflow the user wants to capture (e.g., they say "turn this into a skill"). If so, extract answers from the conversation history first — the tools used, the sequence of steps, corrections the user made, input/output formats observed. The user may need to fill in the gaps, and should confirm your understanding before you proceed to the next step.
- What should this skill enable the AI agent to do?
- When should this skill trigger? (what user phrases/contexts)
- What's the expected output format?
- Should we set up test cases to verify the skill works? Skills with objectively verifiable outputs (file transforms, data extraction, code generation, fixed workflow steps) benefit from test cases. Skills with subjective outputs (writing style, art) often don't need them. Suggest the appropriate default based on the skill type, but let the user decide.
- What tool is this skill for? (Claude Code, Cursor, or both?)
Target Platform
The SKILL.md format is identical for both Claude Code and Cursor — same YAML frontmatter, same directory structure. The key differences:
- Claude Code discovers skills from `.claude/skills/`
- Cursor discovers skills from `.cursor/skills/`, `.agents/skills/`, `~/.cursor/skills/` (legacy), and `.claude/skills/`
- Cross-platform: Place skills in `.agents/skills/` to be discovered by both tools
- Cursor-only field: `disable-model-invocation: true` in frontmatter makes the skill invokable only via `/skill-name` (no auto-triggering)

When creating a skill for both platforms, avoid referencing tool-specific features in the skill body (e.g., Claude's Skill tool vs Cursor's `/skill-name` invocation).
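For instance, a minimal cross-platform layout could be set up like this sketch (the skill name `my-skill` and its contents are placeholders, not a real skill):

```shell
# Hypothetical example: one skill visible to both Claude Code and Cursor
# by placing it in .agents/skills/, which both tools discover.
mkdir -p .agents/skills/my-skill
cat > .agents/skills/my-skill/SKILL.md <<'EOF'
---
name: my-skill
description: Example description. Use this skill whenever the user asks for X.
---

Instructions go here.
EOF
```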
Interview and Research
Proactively ask questions about edge cases, input/output formats, example files, success criteria, and dependencies. Wait to write test prompts until you've got this part ironed out.
Check available MCPs. If any are useful for research (searching docs, finding similar skills, looking up best practices), do the research in parallel via subagents if available, otherwise inline. Come prepared with context to reduce the burden on the user.
Write the SKILL.md
Based on the user interview, fill in these components:
- name: Skill identifier
- description: When to trigger, what it does. This is the primary triggering mechanism - include both what the skill does AND specific contexts for when to use it. All "when to use" info goes here, not in the body. Note: currently AI agents (both Claude and Cursor) have a tendency to "undertrigger" skills -- to not use them when they'd be useful. To combat this, please make the skill descriptions a little bit "pushy". So for instance, instead of "How to build a simple fast dashboard to display internal Anthropic data.", you might write "How to build a simple fast dashboard to display internal Anthropic data. Make sure to use this skill whenever the user mentions dashboards, data visualization, internal metrics, or wants to display any kind of company data, even if they don't explicitly ask for a 'dashboard.'"
- compatibility: Required tools, dependencies (optional, rarely needed)
- the rest of the skill :)
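To make the components concrete, here is a frontmatter sketch using the dashboard example from above (the skill name is invented for illustration):

```markdown
---
name: internal-dashboard
description: How to build a simple fast dashboard to display internal Anthropic data. Make sure to use this skill whenever the user mentions dashboards, data visualization, internal metrics, or wants to display any kind of company data, even if they don't explicitly ask for a "dashboard."
---

(The rest of the skill body goes here.)
```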
Skill Writing Guide
Anatomy of a Skill
skill-name/
├── SKILL.md (required)
│ ├── YAML frontmatter (name, description required)
│ └── Markdown instructions
└── Bundled Resources (optional)
├── scripts/ - Executable code for deterministic/repetitive tasks
├── references/ - Docs loaded into context as needed
└── assets/ - Files used in output (templates, icons, fonts)
Progressive Disclosure
Skills use a three-level loading system:
- Metadata (name + description) - Always in context (~100 words)
- SKILL.md body - In context whenever skill triggers (<500 lines ideal)
- Bundled resources - As needed (unlimited, scripts can execute without loading)
These word counts are approximate and you can feel free to go longer if needed.
Key patterns:
- Keep SKILL.md under 500 lines; if you're approaching this limit, add an additional layer of hierarchy along with clear pointers about where the model using the skill should go next to follow up.
- Reference files clearly from SKILL.md with guidance on when to read them
- For large reference files (>300 lines), include a table of contents
Domain organization: When a skill supports multiple domains/frameworks, organize by variant:
cloud-deploy/
├── SKILL.md (workflow + selection)
└── references/
├── aws.md
├── gcp.md
└── azure.md
Claude reads only the relevant reference file.
Principle of Lack of Surprise
This goes without saying, but skills must not contain malware, exploit code, or any content that could compromise system security. If described to the user, a skill's contents should not surprise them in intent. Don't go along with requests to create misleading skills or skills designed to facilitate unauthorized access, data exfiltration, or other malicious activities. Things like a "roleplay as an XYZ" are OK though.
Writing Patterns
Prefer using the imperative form in instructions.
**Defining output formats** - You can do it like this:

```markdown
## Report structure

ALWAYS use this exact template:

# [Title]
## Executive summary
## Key findings
## Recommendations
```
**Examples pattern** - It's useful to include examples. You can format them like this (but if "Input" and "Output" are in the examples you might want to deviate a little):
```markdown
## Commit message format

Example 1:
Input: Added user authentication with JWT tokens
Output: feat(auth): implement JWT-based authentication
```

Writing Style
Try to explain to the model why things are important in lieu of heavy-handed musty MUSTs. Use theory of mind and try to make the skill general and not super-narrow to specific examples. Start by writing a draft and then look at it with fresh eyes and improve it.
Test Cases
After writing the skill draft, come up with 2-3 realistic test prompts — the kind of thing a real user would actually say. Share them with the user: [you don't have to use this exact language] "Here are a few test cases I'd like to try. Do these look right, or do you want to add more?" Then run them.
Save test cases to `evals/evals.json`. Don't write assertions yet — just the prompts. You'll draft assertions in the next step while the runs are in progress.

```json
{
  "skill_name": "example-skill",
  "evals": [
    {
      "id": 1,
      "prompt": "User's task prompt",
      "expected_output": "Description of expected result",
      "files": []
    }
  ]
}
```

See `references/schemas.md` for the full schema (including the `assertions` field, which you'll add later).
Running and evaluating test cases
This section is one continuous sequence — don't stop partway through. Do NOT use `/skill-test` or any other testing skill.

Put results in `<skill-name>-workspace/` as a sibling to the skill directory. Within the workspace, organize results by iteration (`iteration-1/`, `iteration-2/`, etc.) and within that, each test case gets a directory (`eval-0/`, `eval-1/`, etc.). Don't create all of this upfront — just create directories as you go.

Step 1: Spawn all runs (with-skill AND baseline) in the same turn
For each test case, spawn two runs — one with the skill, one without.

In Claude Code (with subagents): spawn all runs in the same turn so everything finishes around the same time. Don't spawn with-skill runs first and baselines later.

In Cursor (no subagents): run test cases sequentially — read the skill, follow its instructions for each test prompt. This is less rigorous but the human review step compensates.

With-skill run:

```
Execute this task:
- Skill path: <path-to-skill>
- Task: <eval prompt>
- Input files: <eval files if any, or "none">
- Save outputs to: <workspace>/iteration-<N>/eval-<ID>/with_skill/outputs/
- Outputs to save: <what the user cares about — e.g., "the .docx file", "the final CSV">
```

Baseline run (same prompt, but the baseline depends on context):
- Creating a new skill: no skill at all. Same prompt, no skill path, save to `without_skill/outputs/`.
- Improving an existing skill: the old version. Before editing, snapshot the skill (`cp -r <skill-path> <workspace>/skill-snapshot/`), then point the baseline subagent at the snapshot. Save to `old_skill/outputs/`.

Write an `eval_metadata.json` for each test case (assertions can be empty for now). Give each eval a descriptive name based on what it's testing — not just "eval-0". Use this name for the directory too. If this iteration uses new or modified eval prompts, create these files for each new eval directory — don't assume they carry over from previous iterations.

```json
{
  "eval_id": 0,
  "eval_name": "descriptive-name-here",
  "prompt": "The user's task prompt",
  "assertions": []
}
```
Step 2: While runs are in progress, draft assertions
Don't just wait for the runs to finish — you can use this time productively. Draft quantitative assertions for each test case and explain them to the user. If assertions already exist in `evals/evals.json`, review them and explain what they check.

Good assertions are objectively verifiable and have descriptive names — they should read clearly in the benchmark viewer so someone glancing at the results immediately understands what each one checks. Subjective skills (writing style, design quality) are better evaluated qualitatively — don't force assertions onto things that need human judgment.

Update the `eval_metadata.json` files and `evals/evals.json` with the assertions once drafted. Also explain to the user what they'll see in the viewer — both the qualitative outputs and the quantitative benchmark.

Step 3: As runs complete, capture timing data
When each subagent task completes, you receive a notification containing `total_tokens` and `duration_ms`. Save this data immediately to `timing.json` in the run directory:

```json
{
  "total_tokens": 84852,
  "duration_ms": 23332,
  "total_duration_seconds": 23.3
}
```

This is the only opportunity to capture this data — it comes through the task notification and isn't persisted elsewhere. Process each notification as it arrives rather than trying to batch them.
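The capture step can be sketched in a few lines; the shape of the notification dict is assumed from the two fields named above:

```python
import json
from pathlib import Path

def save_timing(run_dir: str, notification: dict) -> None:
    """Persist token/timing data from a completed subagent task.

    `notification` is assumed to carry `total_tokens` and `duration_ms`,
    per the task-completion notification described above.
    """
    duration_ms = notification["duration_ms"]
    timing = {
        "total_tokens": notification["total_tokens"],
        "duration_ms": duration_ms,
        "total_duration_seconds": round(duration_ms / 1000, 1),
    }
    (Path(run_dir) / "timing.json").write_text(json.dumps(timing, indent=2))

# Example with the values shown above:
# save_timing("iteration-1/eval-0/with_skill", {"total_tokens": 84852, "duration_ms": 23332})
```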
Step 4: Grade — use agents/grader.md, not a custom script
Programmatic checks (file existence, line counts, string matching) are useful as a supplement, but they only catch structural compliance. The grader agent catches things scripts can't: content quality, claim verification, and weak assertions that create false confidence. A skill can produce every file in the right place and still have terrible content — the grader is what catches that. Do NOT generate the viewer or benchmark until `grading.json` exists for every run.

Once all runs are done:

- Grade each run — spawn a grader subagent (or grade inline) that reads `agents/grader.md` and evaluates each assertion against the outputs. Save results to `grading.json` in each run directory. The grading.json expectations array must use the fields `text`, `passed`, and `details` (not `evidence`/`name`/`met` or other variants) — the viewer depends on these exact field names. You can run programmatic checks as a supplement to the grader (scripts are faster and reusable for things like file existence or line counts), but they do not replace the grader agent — always run the grader first for qualitative assessment.

- Aggregate into benchmark — run the aggregation script from the skill-creator directory:

```bash
python -m scripts.aggregate_benchmark <workspace>/iteration-N --skill-name <name>
```

This produces `benchmark.json` and `benchmark.md` with pass_rate, time, and tokens for each configuration, with mean ± stddev and the delta. If generating benchmark.json manually, see `references/schemas.md` for the exact schema the viewer expects. Put each with_skill version before its baseline counterpart.

- Do an analyst pass — read the benchmark data and surface patterns the aggregate stats might hide. See `agents/analyzer.md` (the "Analyzing Benchmark Results" section) for what to look for — things like assertions that always pass regardless of skill (non-discriminating), high-variance evals (possibly flaky), and time/token tradeoffs.

- Launch the viewer with both qualitative outputs and quantitative data:

```bash
nohup python <skill-creator-path>/eval-viewer/generate_review.py \
  <workspace>/iteration-N \
  --skill-name "my-skill" \
  --benchmark <workspace>/iteration-N/benchmark.json \
  > /dev/null 2>&1 &
VIEWER_PID=$!
```

For iteration 2+, also pass `--previous-workspace <workspace>/iteration-<N-1>`.

Headless environments: If `webbrowser.open()` is not available or the environment has no display, use `--static <output_path>` to write a standalone HTML file instead of starting a server. Feedback will be downloaded as a `feedback.json` file when the user clicks "Submit All Reviews". After download, copy `feedback.json` into the workspace directory for the next iteration to pick up.
Note: please use generate_review.py to create the viewer; there's no need to write custom HTML.
- Tell the user something like: "I've opened the results in your browser. There are two tabs — 'Outputs' lets you click through each test case and leave feedback, 'Benchmark' shows the quantitative comparison. When you're done, come back here and let me know."
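Since the viewer depends on exact expectation field names, a small sanity check can catch drift before the viewer is generated. This is a hypothetical helper, not one of the skill-creator's bundled scripts; only the top-level `expectations` array and its three field names come from the spec above:

```python
REQUIRED_FIELDS = {"text", "passed", "details"}

def check_grading(grading: dict) -> list:
    """Return a list of problems with a grading.json dict.

    The viewer expects each entry in `expectations` to use exactly the
    fields `text`, `passed`, and `details` (not evidence/name/met).
    """
    problems = []
    for i, exp in enumerate(grading.get("expectations", [])):
        missing = REQUIRED_FIELDS - set(exp)
        if missing:
            problems.append(f"expectations[{i}] missing fields: {sorted(missing)}")
    return problems

good = {"expectations": [{"text": "chart has axis labels", "passed": True, "details": "found xlabel/ylabel"}]}
bad = {"expectations": [{"name": "axis labels", "met": True}]}
print(check_grading(good))  # []
print(check_grading(bad))   # one problem listing the missing fields
```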
What the user sees in the viewer
The "Outputs" tab shows one test case at a time:
- Prompt: the task that was given
- Output: the files the skill produced, rendered inline where possible
- Previous Output (iteration 2+): collapsed section showing last iteration's output
- Formal Grades (if grading was run): collapsed section showing assertion pass/fail
- Feedback: a textbox that auto-saves as they type
- Previous Feedback (iteration 2+): their comments from last time, shown below the textbox
The "Benchmark" tab shows the stats summary: pass rates, timing, and token usage for each configuration, with per-eval breakdowns and analyst observations.
Navigation is via prev/next buttons or arrow keys. When done, they click "Submit All Reviews" which saves all feedback to `feedback.json`.

Step 5: Read the feedback
When the user tells you they're done, read `feedback.json`:

```json
{
  "reviews": [
    {"run_id": "eval-0-with_skill", "feedback": "the chart is missing axis labels", "timestamp": "..."},
    {"run_id": "eval-1-with_skill", "feedback": "", "timestamp": "..."},
    {"run_id": "eval-2-with_skill", "feedback": "perfect, love this", "timestamp": "..."}
  ],
  "status": "complete"
}
```

Empty feedback means the user thought it was fine. Focus your improvements on the test cases where the user had specific complaints.

Kill the viewer server when you're done with it:

```bash
kill $VIEWER_PID 2>/dev/null
```
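Filtering the feedback down to the actionable runs can be as simple as this sketch (the JSON shape follows the example above):

```python
def actionable_feedback(feedback: dict) -> dict:
    """Map run_id -> feedback text, keeping only non-empty comments.

    Empty feedback means the user thought that output was fine, so only
    runs with specific complaints are returned.
    """
    return {
        r["run_id"]: r["feedback"]
        for r in feedback.get("reviews", [])
        if r["feedback"].strip()
    }

feedback = {
    "reviews": [
        {"run_id": "eval-0-with_skill", "feedback": "the chart is missing axis labels"},
        {"run_id": "eval-1-with_skill", "feedback": ""},
    ],
    "status": "complete",
}
print(actionable_feedback(feedback))  # {'eval-0-with_skill': 'the chart is missing axis labels'}
```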
Improving the skill
This is the heart of the loop. You've run the test cases, the user has reviewed the results, and now you need to make the skill better based on their feedback.
How to think about improvements
- Generalize from the feedback. The big picture thing that's happening here is that we're trying to create skills that can be used a million times (maybe literally, maybe even more, who knows) across many different prompts. Here you and the user are iterating on only a few examples over and over again because it helps move faster. The user knows these examples in and out and it's quick for them to assess new outputs. But if the skill you and the user are codeveloping works only for those examples, it's useless. Rather than put in fiddly overfitty changes, or oppressively constrictive MUSTs, if there's some stubborn issue, you might try branching out and using different metaphors, or recommending different patterns of working. It's relatively cheap to try and maybe you'll land on something great.
- Keep the prompt lean. Remove things that aren't pulling their weight. Make sure to read the transcripts, not just the final outputs — if it looks like the skill is making the model waste a bunch of time doing things that are unproductive, you can try getting rid of the parts of the skill that are making it do that and seeing what happens.
- Explain the why. Try hard to explain the why behind everything you're asking the model to do. Today's LLMs are smart. They have good theory of mind and when given a good harness can go beyond rote instructions and really make things happen. Even if the feedback from the user is terse or frustrated, try to actually understand the task and why the user is writing what they wrote, and what they actually wrote, and then transmit this understanding into the instructions. If you find yourself writing ALWAYS or NEVER in all caps, or using super rigid structures, that's a yellow flag — if possible, reframe and explain the reasoning so that the model understands why the thing you're asking for is important. That's a more humane, powerful, and effective approach.
- Look for repeated work across test cases. Read the transcripts from the test runs and notice if the subagents all independently wrote similar helper scripts or took the same multi-step approach to something. If all 3 test cases resulted in the subagent writing a `create_docx.py` or a `build_chart.py`, that's a strong signal the skill should bundle that script. Write it once, put it in `scripts/`, and tell the skill to use it. This saves every future invocation from reinventing the wheel.
This task is pretty important (we are trying to create billions a year in economic value here!) and your thinking time is not the blocker; take your time and really mull things over. I'd suggest writing a draft revision and then looking at it anew and making improvements. Really do your best to get into the head of the user and understand what they want and need.
The iteration loop
After improving the skill:
- Apply your improvements to the skill
- Rerun all test cases into a new `iteration-<N+1>/` directory, including baseline runs. If you're creating a new skill, the baseline is always `without_skill` (no skill) — that stays the same across iterations. If you're improving an existing skill, use your judgment on what makes sense as the baseline: the original version the user came in with, or the previous iteration.
- Launch the reviewer with `--previous-workspace` pointing at the previous iteration
- Wait for the user to review and tell you they're done
- Read the new feedback, improve again, repeat
Keep going until:
- The user says they're happy
- The feedback is all empty (everything looks good)
- You're not making meaningful progress
Advanced: Blind comparison
For situations where you want a more rigorous comparison between two versions of a skill (e.g., the user asks "is the new version actually better?"), there's a blind comparison system. Read `agents/comparator.md` and `agents/analyzer.md` for the details. The basic idea is: give two outputs to an independent agent without telling it which is which, and let it judge quality. Then analyze why the winner won.

This is optional, requires subagents, and most users won't need it. The human review loop is usually sufficient.
Description Optimization
The description field in SKILL.md frontmatter is the primary mechanism that determines whether Claude invokes a skill. After creating or improving a skill, offer to optimize the description for better triggering accuracy.
Step 1: Generate trigger eval queries
Create 20 eval queries — a mix of should-trigger and should-not-trigger. Save as JSON:

```json
[
  {"query": "the user prompt", "should_trigger": true},
  {"query": "another prompt", "should_trigger": false}
]
```

The queries must be realistic and something a Claude Code or Cursor user would actually type. Not abstract requests, but requests that are concrete and specific and have a good amount of detail. For instance, file paths, personal context about the user's job or situation, column names and values, company names, URLs. A little bit of backstory. Some might be in lowercase or contain abbreviations or typos or casual speech. Use a mix of different lengths, and focus on edge cases rather than making them clear-cut (the user will get a chance to sign off on them).

Bad: `"Format this data"`, `"Extract text from PDF"`, `"Create a chart"`

Good: `"ok so my boss just sent me this xlsx file (its in my downloads, called something like 'Q4 sales final FINAL v2.xlsx') and she wants me to add a column that shows the profit margin as a percentage. The revenue is in column C and costs are in column D i think"`

For the should-trigger queries (8-10), think about coverage. You want different phrasings of the same intent — some formal, some casual. Include cases where the user doesn't explicitly name the skill or file type but clearly needs it. Throw in some uncommon use cases and cases where this skill competes with another but should win.

For the should-not-trigger queries (8-10), the most valuable ones are the near-misses — queries that share keywords or concepts with the skill but actually need something different. Think adjacent domains, ambiguous phrasing where a naive keyword match would trigger but shouldn't, and cases where the query touches on something the skill does but in a context where another tool is more appropriate.
The key thing to avoid: don't make should-not-trigger queries obviously irrelevant. "Write a fibonacci function" as a negative test for a PDF skill is too easy — it doesn't test anything. The negative cases should be genuinely tricky.
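The balance rules above can be sanity-checked mechanically before showing the set to the user. A minimal sketch, assuming the eval set follows the 20-query, 8-10-per-side guidance (`check_eval_set` and its word-count threshold are illustrative helpers, not part of the skill's scripts):

```python
import json

def load_eval_set(path):
    """Read the eval set JSON written in Step 1."""
    with open(path) as f:
        return json.load(f)

def check_eval_set(items):
    """Flag deviations from the guidance above. The size targets are the
    20-query / 8-10-per-side numbers suggested in this document; the
    5-word minimum is a rough stand-in for 'realistic and detailed'."""
    problems = []
    n_pos = sum(1 for it in items if it["should_trigger"])
    n_neg = len(items) - n_pos
    if len(items) != 20:
        problems.append(f"expected 20 queries, got {len(items)}")
    if not 8 <= n_pos <= 10:
        problems.append(f"expected 8-10 should-trigger queries, got {n_pos}")
    if not 8 <= n_neg <= 10:
        problems.append(f"expected 8-10 should-not-trigger queries, got {n_neg}")
    for it in items:
        if len(it["query"].split()) < 5:
            problems.append(f"possibly too thin to be realistic: {it['query']!r}")
    return problems
```

A check like this only catches structural problems; whether the negatives are genuinely tricky near-misses still needs human judgment.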
Step 2: Review with user
Present the eval set to the user for review using the HTML template:
- Read the template from assets/eval_review.html
- Replace the placeholders:
  - __EVAL_DATA_PLACEHOLDER__ → the JSON array of eval items (no quotes around it — it's a JS variable assignment)
  - __SKILL_NAME_PLACEHOLDER__ → the skill's name
  - __SKILL_DESCRIPTION_PLACEHOLDER__ → the skill's current description
- Write to a temp file (e.g., /tmp/eval_review_<skill-name>.html) and open it: open /tmp/eval_review_<skill-name>.html
- The user can edit queries, toggle should-trigger, add/remove entries, then click "Export Eval Set"
- The file downloads to ~/Downloads/eval_set.json — check the Downloads folder for the most recent version in case there are multiple (e.g., eval_set (1).json)
This step matters — bad eval queries lead to bad descriptions.
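The placeholder substitution can be sketched as below. `build_review_page` is a hypothetical helper, not one of the skill's scripts; the three placeholder tokens come from the template itself:

```python
import json
from pathlib import Path

def build_review_page(template_path, eval_items, skill_name, skill_description, out_path):
    """Fill the eval_review.html placeholders. The eval data is injected as a
    raw JS array literal, so it must NOT be wrapped in quotes."""
    html = Path(template_path).read_text()
    html = html.replace("__EVAL_DATA_PLACEHOLDER__", json.dumps(eval_items))
    html = html.replace("__SKILL_NAME_PLACEHOLDER__", skill_name)
    html = html.replace("__SKILL_DESCRIPTION_PLACEHOLDER__", skill_description)
    Path(out_path).write_text(html)
    return out_path
```

After writing the file, open it for the user (e.g., with `open` on macOS) and wait for them to export the reviewed set.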
Step 3: Run the optimization loop
API Key Prerequisite: Ensure credentials are loaded before running. Follow the steps in the API Credentials section above — check env vars, source .env, or ask the user to create one if missing.

Tell the user: "This will take some time — I'll run the optimization loop in the background and check on it periodically."
Save the eval set to the workspace, then run in the background:
```bash
source .env && python -m scripts.run_loop \
  --eval-set <path-to-trigger-eval.json> \
  --skill-path <path-to-skill> \
  --model ${SKILL_MODEL} \
  --platform <claude|cursor> \
  --max-iterations 5 \
  --verbose
```

Use $SKILL_MODEL from .env. If it's not set, fall back to the model ID from your system prompt (the one powering the current session) so the triggering test matches what the user actually experiences. Set --platform cursor for Cursor skills (uses LLM simulation instead of the claude -p CLI).

While it runs, periodically tail the output to give the user updates on which iteration it's on and what the scores look like.

This handles the full optimization loop automatically. It splits the eval set into 60% train and 40% held-out test, evaluates the current description (running each query 3 times to get a reliable trigger rate), then calls Claude with extended thinking to propose improvements based on what failed. It re-evaluates each new description on both train and test, iterating up to 5 times. When it's done, it opens an HTML report in the browser showing the results per iteration and returns JSON with best_description — selected by test score rather than train score to avoid overfitting.
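The split-and-score behavior of the loop can be illustrated with a sketch. The real logic lives in scripts/run_loop.py; the seeded shuffle, the `run_query` stub, and all three helper names here are illustrative assumptions:

```python
import random

def split_eval_set(items, train_frac=0.6, seed=0):
    """Shuffle, then split into 60% train / 40% held-out test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

def trigger_accuracy(items, run_query, runs_per_query=3):
    """Score a description: run each query several times and count how often
    the observed trigger decision matches its should_trigger label."""
    correct = total = 0
    for it in items:
        for _ in range(runs_per_query):
            triggered = run_query(it["query"])  # stand-in for invoking the agent
            correct += triggered == it["should_trigger"]
            total += 1
    return correct / total

def pick_best(candidates):
    """Select by held-out test score, not train score, to avoid overfitting."""
    return max(candidates, key=lambda c: c["test_score"])
```

The held-out selection is the important design choice: a description tuned until it aces the train queries can still regress on queries it never saw, and the test split is what catches that.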
How skill triggering works
Understanding the triggering mechanism helps design better eval queries. Skills appear in the agent's available_skills list with their name + description, and the agent decides whether to consult a skill based on that description. The important thing to know is that AI agents only consult skills for tasks they can't easily handle on their own — simple, one-step queries like "read this PDF" may not trigger a skill even if the description matches perfectly, because the agent can handle them directly with basic tools. Complex, multi-step, or specialized queries reliably trigger skills when the description matches.

This means your eval queries should be substantive enough that Claude would actually benefit from consulting a skill. Simple queries like "read file X" are poor test cases — they won't trigger skills regardless of description quality.
Step 4: Apply the result
Take best_description from the JSON output and update the skill's SKILL.md frontmatter. Show the user before/after and report the scores.
Package and Present (only if present_files tool is available)

Check whether you have access to the present_files tool. If you don't, skip this step. If you do, package the skill and present the .skill file to the user:

```bash
python -m scripts.package_skill <path/to/skill-folder>
```

After packaging, direct the user to the resulting .skill file path so they can install it.
Cursor-Specific Instructions
When running inside Cursor:
- The core workflow is the same: draft, test, review, improve, repeat
- Cursor does not have subagents. Run test cases sequentially — read the skill's SKILL.md, then follow its instructions to accomplish the test prompt yourself, one at a time.
- Use --platform cursor for all Python scripts (run_eval, run_loop, improve_description, package_skill).
- Description optimization uses LLM simulation rather than CLI testing. The simulation asks a model "would you invoke this skill given this query?" — it's directionally accurate for comparing descriptions but not a perfect proxy for Cursor's actual runtime behavior.
- Cursor supports the disable-model-invocation: true frontmatter field — set this for skills that should only be invokable via /skill-name and never auto-triggered by the agent.
- Blind comparison requires subagents — skip it in Cursor.
- Packaging works identically. Place the resulting skill folder in .cursor/skills/ (or .agents/skills/ for cross-platform).
- When generating the eval viewer, use --static <output_path> if webbrowser.open() is not available in your Cursor environment.
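The LLM-simulation check mentioned above can be sketched roughly as follows. The prompt wording and both helper names are assumptions — the actual prompt lives in the skill's scripts:

```python
def simulate_trigger_prompt(skill_name, description, query):
    """Build the yes/no question posed to a model in simulation mode
    (illustrative wording, not the scripts' exact prompt)."""
    return (
        "You are an agent. One skill is available:\n"
        f"  name: {skill_name}\n"
        f"  description: {description}\n\n"
        f"User query: {query}\n\n"
        "Would you consult this skill to handle the query? Answer YES or NO."
    )

def parse_trigger_answer(reply):
    """Count the reply as a trigger iff it leads with YES."""
    return reply.strip().upper().startswith("YES")
```

Because this asks about a hypothetical invocation rather than observing a real agent run, it is useful for ranking candidate descriptions against each other, not for predicting absolute trigger rates.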
Reference files
The agents/ directory contains instructions for specialized subagents. Read them when you need to spawn the relevant subagent.
- agents/grader.md — How to evaluate assertions against outputs
- agents/comparator.md — How to do blind A/B comparison between two outputs
- agents/analyzer.md — How to analyze why one version beat another
The references/ directory has additional documentation:
- references/schemas.md — JSON structures for evals.json, grading.json, etc.
Repeating one more time the core loop here for emphasis:
- Figure out what the skill is about
- Draft or edit the skill
- Run the agent with access to the skill on test prompts
- With the user, evaluate the outputs:
  - Create benchmark.json and run eval-viewer/generate_review.py to help the user review them
  - Run quantitative evals
- Repeat until you and the user are satisfied
- Package the final skill and return it to the user.
Please add steps to your TodoList, if you have such a thing, to make sure you don't forget.
Good luck!