test-skill-quality
Test Skill Quality
Test any skill's instructions by following them literally, documenting every moment they fail to guide the next action, and fixing the root causes.
Trigger boundary
Use this skill when:
- testing whether a skill's instructions are complete and unambiguous
- auditing a skill for instructional quality before publishing
- dogfooding a skill by running its workflow on a real task
- improving a skill after receiving feedback that it's confusing or incomplete
- validating that fixes to a skill's instructions actually resolved the original friction
Do not use this skill for:
- building a new skill from scratch (use build-skills)
- evaluating the quality of a skill's output (use evaluation suites)
- reviewing code changes in a pull request (use review-pr)
- general documentation improvements not related to skill instructions
Non-negotiable rules
- Follow literally, not intelligently. Suppress domain knowledge. If the instructions don't specify it, record a friction point — even if you "know" the answer.
- Test on a real task. Toy examples don't exercise branching logic, error handling, or cross-references. The task must be genuine and non-trivial.
- Every derailment gets an ID. Friction points are numbered F-01, F-02, ... with P0/P1/P2 severity. No unnamed complaints.
- Fix the instructions, not the executor. The remedy is always a text edit to the skill files, never "use a smarter agent."
- The derail notes are the primary deliverable. Not a pass/fail verdict. A structured document showing what broke and why.
- Verify after fixing. Run grep-based consistency checks and confirm routing integrity after edits.
Required workflow
1. Select the test subject and task
Choose:
- The skill to test — any skill with a SKILL.md and optional references
- The test task — a real, representative task within the skill's trigger boundary
The test task must be:
- Genuinely within the skill's scope (not an edge case)
- Complex enough to exercise the full workflow (not a one-step operation)
- Executable in the current environment (required tools available)
Record the test metadata:
Skill under test: [name]
Test task: [one-line description]
Date: [YYYY-MM-DD]
Method: Follow SKILL.md steps N–M exactly as written
2. Pre-scan the skill
Before executing, read through the skill once:
- Read SKILL.md fully — note total steps, branching points, cross-references
- Tree the references/ directory
- List external dependencies (tools, MCP servers, APIs)
- Note the skill's declared trigger boundary
Do NOT execute anything during this step. This is orientation only.
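The orientation pass above can be sketched as a read-only helper. This is a minimal sketch, not part of the skill itself: the function name `prescan_skill` is hypothetical, and it assumes the skill sits in one directory with a SKILL.md, that reference files are Markdown files under references/, and that workflow steps are numbered lines such as `1. ...`.

```bash
# Read-only orientation pass over a skill directory; nothing is executed or modified.
# Assumption: the directory holds SKILL.md and an optional references/ subdirectory.
prescan_skill() {
  skill_dir=$1
  skill_md="$skill_dir/SKILL.md"

  if [ ! -f "$skill_md" ]; then
    echo "missing: $skill_md" >&2
    return 1
  fi

  # Total length of the main instruction file.
  echo "lines: $(wc -l < "$skill_md" | tr -d ' ')"

  # Numbered lines such as "1. Select the test subject" approximate the step count.
  echo "numbered_steps: $(grep -c -E '^[0-9]+\. ' "$skill_md")"

  # Cross-references: every references/...md path mentioned in SKILL.md.
  echo "cross_references:"
  grep -o -E 'references/[A-Za-z0-9_/-]+\.md' "$skill_md" | sort -u | sed 's/^/  /'

  # Files actually present under references/, if that directory exists.
  echo "reference_files:"
  if [ -d "$skill_dir/references" ]; then
    find "$skill_dir/references" -type f | sort | sed 's/^/  /'
  fi
}
```

Calling `prescan_skill skills/[skill-name]` prints a short summary to copy into the pre-scan notes. Branching points and external dependencies still have to be noted by reading the file.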
3. Execute literally (the core loop)
For each step in the skill's workflow:
- Read only the current step. Do not look ahead.
- Attempt to execute using only the information provided in the skill.
- Record the outcome:
- Clean pass — step was unambiguous and executable.
- Derailment — you could not determine the next action from the instructions alone. Record a friction point.
- Implicit knowledge used — you could execute, but only because you knew something not stated. Record a lower-severity friction point.
For each derailment, write:
markdown
**F-[NN] — [short title]** (P[0-2])
[What happened, what the instructions said, what was missing or ambiguous.]
Fix: [Specific text edit that would prevent this derailment.]
See references/friction-classification.md for severity assignment rules.
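Because every derailment needs an ID and a severity, a small lint over the notes can catch headers that drifted from the template. The sketch below is an illustration only: `lint_friction_headers` is a hypothetical name, and it assumes each friction point starts on its own line with the bold `**F-NN ...** (P0|P1|P2)` shape shown in the template.

```bash
# Report friction-point headers that break the "**F-NN <title>** (P0|P1|P2)" shape,
# plus any ID used more than once. Prints nothing when the notes are clean.
# Assumption: each friction point starts on its own line with "**F-".
lint_friction_headers() {
  notes=$1

  # Header lines that look like a friction point but lack a two-digit ID or a severity.
  grep -n '^\*\*F-' "$notes" \
    | grep -v -E '^[0-9]+:\*\*F-[0-9]{2} .*\*\* \(P[0-2]\)$' \
    | sed 's/^/MALFORMED: line /'

  # IDs that appear on more than one header.
  grep -o -E '^\*\*F-[0-9]{2}' "$notes" | sort | uniq -d | sed 's/^\*\*/DUPLICATE: /'
}
```

The check is purely syntactic. Whether the assigned severity is the right one remains a judgment call against the classification rules.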
4. Collect evidence
After completing all steps, calculate:
| Metric | Value |
|---|---|
| Total steps attempted | |
| Clean passes | |
| P0 (blocks progress) | |
| P1 (causes confusion) | |
| P2 (minor annoyance) | |
Build a derailment density map showing which workflow phases have the most friction.
Tag each friction point with a root cause code; see references/root-cause-taxonomy.md.
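One possible way to build the derailment density map is to count friction headers under each phase heading. The following is a sketch under stated assumptions: the helper name `density_map` is hypothetical, phase headings are assumed to start with `### `, and friction headers are assumed to follow the `**F-NN ...** (P0|P1|P2)` template. Neither convention is fixed by the skill.

```bash
# Print one line per workflow phase: "<phase>: P0=<n> P1=<n> P2=<n> total=<n>".
# Assumptions: phase headings start with "### " and friction headers follow the
# "**F-NN <title>** (P0|P1|P2)" template.
density_map() {
  awk '
    BEGIN { phase = "(before first phase)" }
    /^### / { phase = substr($0, 5); next }
    /^\*\*F-[0-9][0-9] / && match($0, /\(P[0-2]\)$/) {
      if (!(phase in total)) order[++n] = phase   # remember first-seen order
      sev = substr($0, RSTART + 1, 2)             # "P0", "P1" or "P2"
      count[phase, sev]++
      total[phase]++
    }
    END {
      for (i = 1; i <= n; i++) {
        p = order[i]
        printf "%s: P0=%d P1=%d P2=%d total=%d\n", p, count[p, "P0"], count[p, "P1"], count[p, "P2"], total[p]
      }
    }
  ' "$1"
}
```

Phases with no friction points are omitted, so the output lists only where the instructions derailed.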
5. Write the derail notes
Write the report to derail-notes/NN-dogfood-[topic].md in the project root.
Structure:
markdown
Derailment Test: [skill-name] on "[task]"
Date: ...
Skill under test: ...
Test task: ...
Method: ...
Friction points
[Phase/step name]
F-01 — [title] (P0)
...
What worked well
- ...
Priority summary
| Priority | Count | Friction points |
|---|---|---|
| P0 | N | F-xx, ... |
| P1 | N | F-xx, ... |
| P2 | N | F-xx, ... |
6. Apply fixes
Fix priority: all P0, then all P1, then P2 if time allows.
For each friction point, apply the fix directly to the skill's source files. Read references/fix-patterns.md to match the derailment type to a proven fix pattern.
Fixes must be:
- In-place — edit the existing instruction, don't create errata
- Self-contained — the fix works without consulting the derail notes
- Minimal — add only what was missing
7. Verify fixes
After all edits:
- Terminology consistency — grep for stale terms that should have been updated
- Routing integrity — confirm every reference file is still reachable from SKILL.md
- Cross-reference consistency — no contradictions between documents
- Size constraints — SKILL.md still under 500 lines
- No regressions — fixes didn't introduce new ambiguities
bash
# Example verification commands
grep -r "old_term" skills/[skill-name]/   # should print nothing
find skills/[skill-name]/references -type f -name "*.md" | while IFS= read -r f; do
  # Flag reference files whose base name is never mentioned in SKILL.md
  grep -qF -- "$(basename "$f" .md)" skills/[skill-name]/SKILL.md || echo "ORPHAN: $f"
done
wc -l skills/[skill-name]/SKILL.md   # should be under 500
8. Optional: Re-run the test
The gold standard is re-running the test on the fixed skill with a different task. New derailments go into a new derail-notes file (02-dogfood-[topic].md). Compare metrics across runs to verify improvement.
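Picking the file name for the next run can be automated. The sketch below assumes that NN is a two-digit, zero-padded run counter, which the 02-dogfood-[topic].md example suggests but the skill does not spell out. The helper name `next_notes_path` is hypothetical.

```bash
# Print the path for the next derail-notes file, e.g. derail-notes/02-dogfood-<topic>.md.
# Assumption: NN is a two-digit, zero-padded run counter taken from existing file names.
next_notes_path() {
  notes_dir=$1
  topic=$2

  # Highest existing NN; empty when the directory has no matching files or is missing.
  last=$(ls "$notes_dir" 2>/dev/null \
    | sed -n 's/^\([0-9][0-9]\)-dogfood-.*\.md$/\1/p' \
    | sort -n | tail -n 1)

  # awk does the arithmetic so that values such as "08" are read as decimal.
  awk -v last="${last:-0}" -v dir="$notes_dir" -v topic="$topic" \
    'BEGIN { printf "%s/%02d-dogfood-%s.md\n", dir, last + 1, topic }'
}
```

For example, `next_notes_path derail-notes routing` prints `derail-notes/01-dogfood-routing.md` when no earlier notes exist.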
Decision rules
- If the skill has no references, test only SKILL.md steps
- If a derailment is actually a bug in an external tool (not the instructions), document it but tag it as external; don't fix the skill for someone else's bug
- If 3+ P1 items cluster in one step, treat the cluster as compound P0
- If the skill references other skills, test only the current skill's instructions — not the referenced skill's workflow
- If you discover the skill's trigger boundary is wrong (fires on wrong queries), record it as a friction point but also flag it separately as a trigger issue
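The compound-P0 rule is easy to miss when friction points are spread across a long report, so it may help to check it mechanically. This is a sketch, assuming each step's friction points sit under a heading that starts with `### ` and that headers follow the `**F-NN ...** (P0|P1|P2)` template. The helper name `compound_p0_phases` is hypothetical.

```bash
# List steps where three or more P1 friction points cluster (treated as compound P0).
# Assumptions: step headings start with "### "; friction headers end in "(P1)" etc.
compound_p0_phases() {
  awk '
    /^### / { phase = substr($0, 5); next }
    /^\*\*F-[0-9][0-9] .*\(P1\)$/ { p1[phase]++ }
    END {
      for (p in p1)
        if (p1[p] >= 3) printf "COMPOUND P0: %s (%d P1 items)\n", p, p1[p]
    }
  ' "$1" | sort
}
```

Any step it prints should be fixed with the same urgency as a P0 item.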
Do this, not that
| Do this | Not that |
|---|---|
| Follow each step literally as written | Fill in gaps from personal knowledge |
| Record every uncertainty as a friction point | Skip ambiguities that seem "minor" |
| Fix the source files directly | Create a separate errata or known-issues file |
| Test on a real task within the skill's scope | Use a toy example or hypothetical scenario |
| Write structured derail notes with IDs and severities | Write prose complaints without classification |
| Verify fixes with grep and routing checks | Assume fixes are correct without verification |
| Report what worked well alongside what broke | Write a purely negative report |
Output contract
Deliver in this order:
- Test metadata (skill, task, date)
- Pre-scan summary (step count, branching, dependencies)
- Friction point registry (F-01 through F-NN with severity and root cause)
- Derailment density map
- What worked well section
- Priority summary table
- Fixes applied (which file, which friction point)
- Verification results
Reference routing
| File | Read when |
|---|---|
| references/friction-classification.md | Assigning severity (P0/P1/P2) to a friction point or choosing between severity levels |
| references/root-cause-taxonomy.md | Tagging friction points with root cause codes for pattern analysis |
| references/fix-patterns.md | Matching a derailment type to a proven fix pattern |
| | Tracking improvement across multiple test runs or building cross-run reports |
| | Applying Derailment Testing to non-skill instruction sets (runbooks, SOPs, API docs) |
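Routing can break in two directions: a reference file that SKILL.md never mentions, or a path in SKILL.md that points at a file that no longer exists. The sketch below covers the second case. It assumes references are Markdown files addressed as references/....md relative to the skill directory, and `dangling_references` is a hypothetical helper name.

```bash
# Print every references/...md path mentioned in SKILL.md that does not exist on disk.
# Assumption: paths in SKILL.md are relative to the skill directory.
dangling_references() {
  skill_dir=$1
  grep -o -E 'references/[A-Za-z0-9_/-]+\.md' "$skill_dir/SKILL.md" | sort -u \
    | while IFS= read -r ref; do
        [ -f "$skill_dir/$ref" ] || echo "DANGLING: $ref"
      done
}
```

An empty result means every routed file is present. It says nothing about whether the "Read when" condition for each file is still accurate.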
Guardrails
- Do not skip the pre-scan. It prevents misidentifying "working as designed" as a derailment.
- Do not fix friction points without reading the root cause taxonomy. Fixes without root cause analysis recur.
- Do not create an errata file. Fixes go directly into the source.
- Do not declare the test complete without the "What worked well" section.
- Do not re-test with the same task. Use a different representative task for each run.
- Do not test a skill you are currently building. Build it first (with build-skills), then test it.
Final checks
Before declaring the test complete:
- Derail notes file exists at derail-notes/NN-dogfood-[topic].md
- Every friction point has an ID (F-NN), severity (P0-P2), and root cause code
- All P0 fixes are applied
- All P1 fixes are applied (or deferred with justification)
- "What worked well" section is present
- Priority summary table is present
- Verification grep shows zero stale terms
- Routing integrity confirmed — no orphaned reference files in tested skill
- SKILL.md of tested skill is still under 500 lines after fixes
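Some of the checks above are mechanical and could be scripted. The following sketch covers only those. It assumes the section titles appear verbatim in the notes and that friction headers follow the `**F-NN ...** (P0|P1|P2)` template. The helper name `final_checks` is hypothetical, and the root cause code is not checked because its format is defined in the taxonomy reference rather than here.

```bash
# Run the mechanical subset of the final checks. Prints FAIL lines and returns
# non-zero if any check fails. Whether fixes were applied stays a manual check.
final_checks() {
  notes=$1        # path to the derail-notes file
  skill_md=$2     # path to the tested SKILL.md
  failed=0

  [ -f "$notes" ] || { echo "FAIL: derail notes not found: $notes"; return 1; }

  grep -q 'What worked well' "$notes" || { echo "FAIL: no What worked well section"; failed=1; }
  grep -q 'Priority summary' "$notes" || { echo "FAIL: no Priority summary table"; failed=1; }

  # Every friction header needs a severity suffix.
  if grep '^\*\*F-' "$notes" | grep -q -v -E '\(P[0-2]\)$'; then
    echo "FAIL: friction point without severity"; failed=1
  fi

  lines=$(wc -l < "$skill_md" | tr -d ' ')
  [ "$lines" -lt 500 ] || { echo "FAIL: SKILL.md has $lines lines (limit: under 500)"; failed=1; }

  return "$failed"
}
```

A zero exit status only means the scripted checks passed. The remaining items on the list still need a human or agent to confirm them.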