test-skill-quality

Test Skill Quality

Test any skill's instructions by following them literally, documenting every moment they fail to guide the next action, and fixing the root causes.

Trigger boundary

Use this skill when:
  • testing whether a skill's instructions are complete and unambiguous
  • auditing a skill for instructional quality before publishing
  • dogfooding a skill by running its workflow on a real task
  • improving a skill after receiving feedback that it's confusing or incomplete
  • validating that fixes to a skill's instructions actually resolved the original friction
Do not use this skill for:
  • building a new skill from scratch (use `build-skills`)
  • evaluating the quality of a skill's output (use evaluation suites)
  • reviewing code changes in a pull request (use `review-pr`)
  • general documentation improvements not related to skill instructions

Non-negotiable rules

  1. Follow literally, not intelligently. Suppress domain knowledge. If the instructions don't specify it, record a friction point — even if you "know" the answer.
  2. Test on a real task. Toy examples don't exercise branching logic, error handling, or cross-references. The task must be genuine and non-trivial.
  3. Every derailment gets an ID. Friction points are numbered F-01, F-02, ... with P0/P1/P2 severity. No unnamed complaints.
  4. Fix the instructions, not the executor. The remedy is always a text edit to the skill files, never "use a smarter agent."
  5. The derail notes are the primary deliverable. Not a pass/fail verdict. A structured document showing what broke and why.
  6. Verify after fixing. Run grep-based consistency checks and confirm routing integrity after edits.

Required workflow

1. Select the test subject and task

Choose:
  • The skill to test — any skill with a SKILL.md and optional references
  • The test task — a real, representative task within the skill's trigger boundary
The test task must be:
  • Genuinely within the skill's scope (not an edge case)
  • Complex enough to exercise the full workflow (not a one-step operation)
  • Executable in the current environment (required tools available)
Record the test metadata:
Skill under test: [name]
Test task: [one-line description]
Date: [YYYY-MM-DD]
Method: Follow SKILL.md steps N–M exactly as written

2. Pre-scan the skill

Before executing, read through the skill once:
  • Read SKILL.md fully — note total steps, branching points, cross-references
  • Tree the `references/` directory
  • List external dependencies (tools, MCP servers, APIs)
  • Note the skill's declared trigger boundary
Do NOT execute anything during this step. This is orientation only.
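
Because the pre-scan only reads, it can be sketched with read-only commands. This is a minimal sketch, not part of the skill: the function name `prescan_skill` is hypothetical, and it assumes that workflow steps appear in SKILL.md as Markdown headings that begin with a number (for example `### 1. Select the task`).

```bash
# Hypothetical helper: read-only orientation for a skill directory.
# Assumption: steps are Markdown headings such as "### 1. Select the task".
prescan_skill() {
  local skill_dir="$1"
  local skill_md="$skill_dir/SKILL.md"

  [ -f "$skill_md" ] || { echo "missing: $skill_md" >&2; return 1; }

  # Total lines and numbered step headings in SKILL.md.
  echo "lines: $(wc -l < "$skill_md" | tr -d ' ')"
  echo "steps: $(grep -cE '^#+ +[0-9]+\. ' "$skill_md")"

  # Reference files, if the skill has any.
  if [ -d "$skill_dir/references" ]; then
    find "$skill_dir/references" -type f | sort
  else
    echo "references: none"
  fi
}
```

Calling `prescan_skill skills/my-skill` (a placeholder path) prints the counts and the reference file list without executing any step of the skill.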

3. Execute literally (the core loop)

For each step in the skill's workflow:
  1. Read only the current step. Do not look ahead.
  2. Attempt to execute using only the information provided in the skill.
  3. Record the outcome:
    • Clean pass — step was unambiguous and executable.
    • Derailment — you could not determine the next action from the instructions alone. Record a friction point.
    • Implicit knowledge used — you could execute, but only because you knew something not stated. Record a lower-severity friction point.
For each derailment, write:
```markdown
**F-[NN] — [short title]** (P[0-2])
[What happened, what the instructions said, what was missing or ambiguous.]
Fix: [Specific text edit that would prevent this derailment.]
```
See `references/friction-classification.md` for severity assignment rules.

4. Collect evidence

After completing all steps, calculate:
| Metric | Value |
| --- | --- |
| Total steps attempted | |
| Clean passes | |
| P0 (blocks progress) | |
| P1 (causes confusion) | |
| P2 (minor annoyance) | |

Build a derailment density map showing which workflow phases have the most friction.
Tag each friction point with a root cause code; see `references/root-cause-taxonomy.md`.
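
The severity counts and the density map can be derived mechanically from a derail-notes file. The sketch below is one possible approach under stated assumptions: each friction point starts with a line of the form `**F-NN ...** (P0)`, and each workflow phase is a `### ` heading. The function names `count_severities` and `density_map` are hypothetical and not defined by the skill.

```bash
# Hypothetical helper: count friction points per severity.
# Assumption: entries start with a line like "**F-01 ...** (P0)".
count_severities() {
  local notes="$1" sev
  for sev in P0 P1 P2; do
    # grep -c prints 0 when nothing matches; "|| true" ignores its exit status.
    printf '%s: %s\n' "$sev" "$(grep -cE "^\*\*F-[0-9]+ .*\($sev\)" "$notes" || true)"
  done
}

# Hypothetical helper: friction points per "### " phase heading, in file order.
density_map() {
  awk '
    /^### / {
      phase = substr($0, 5)
      if (!(phase in n)) { n[phase] = 0; order[++k] = phase }
    }
    /^\*\*F-[0-9]+/ && phase != "" { n[phase]++ }
    END { for (i = 1; i <= k; i++) printf "%s: %d\n", order[i], n[order[i]] }
  ' "$1"
}
```

Phases with the highest counts in the `density_map` output are the first candidates for fixes.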

5. Write the derail notes

Write the report to `derail-notes/NN-dogfood-[topic].md` in the project root.
Structure:
```markdown
# Derailment Test: [skill-name] on "[task]"

Date: ...
Skill under test: ...
Test task: ...
Method: ...

## Friction points

### [Phase/step name]

**F-01 — [title]** (P0)
...

## What worked well

1. ...

## Priority summary

| Priority | Count | Friction points |
| --- | --- | --- |
| P0 | N | F-xx, ... |
| P1 | N | F-xx, ... |
| P2 | N | F-xx, ... |
```
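
Choosing the `NN` prefix for a new report can be automated. The sketch below assumes `NN` is a two-digit, sequential run number, which matches the later mention of `02-dogfood-[topic].md` but is not stated outright by the skill. The function name `next_derail_path` is hypothetical.

```bash
# Hypothetical helper: print the next NN-dogfood-[topic].md path in a directory.
# Assumption: NN is a two-digit sequential run number.
next_derail_path() {
  local dir="$1" topic="$2" last=0 f n
  for f in "$dir"/[0-9][0-9]-dogfood-*.md; do
    [ -e "$f" ] || continue   # no matches: the glob pattern stays literal
    n="$(basename "$f")"
    n="${n%%-*}"              # keep the two-digit prefix
    n="${n#0}"                # drop one leading zero so "08" is not read as octal
    if [ "$n" -gt "$last" ]; then last="$n"; fi
  done
  printf '%s/%02d-dogfood-%s.md\n' "$dir" "$((last + 1))" "$topic"
}
```

For example, `next_derail_path derail-notes api-docs` prints `derail-notes/01-dogfood-api-docs.md` when the directory has no earlier reports.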

6. Apply fixes

Fix priority: all P0, then all P1, then P2 if time allows.
For each friction point, apply the fix directly to the skill's source files. Read `references/fix-patterns.md` to match the derailment type to a proven fix pattern.
Fixes must be:
  • In-place — edit the existing instruction, don't create errata
  • Self-contained — the fix works without consulting the derail notes
  • Minimal — add only what was missing

7. Verify fixes

After all edits:
  1. Terminology consistency — grep for stale terms that should have been updated
  2. Routing integrity — confirm every reference file is still reachable from SKILL.md
  3. Cross-reference consistency — no contradictions between documents
  4. Size constraints — SKILL.md still under 500 lines
  5. No regressions — fixes didn't introduce new ambiguities
```bash
# Example verification commands
SKILL_DIR="skills/[skill-name]"   # replace [skill-name] with the skill's directory

# Stale terms: should print nothing
grep -r "old_term" "$SKILL_DIR/"

# Routing integrity: every reference file should be mentioned in SKILL.md
find "$SKILL_DIR/references" -type f -name "*.md" | while IFS= read -r f; do
  grep -q "$(basename "$f" .md)" "$SKILL_DIR/SKILL.md" || echo "ORPHAN: $f"
done

# Size constraint: should be under 500
wc -l "$SKILL_DIR/SKILL.md"
```

8. Optional: Re-run the test

The gold standard is re-running the test on the fixed skill with a different task. New derailments go into a new derail-notes file (`02-dogfood-[topic].md`). Compare metrics across runs to verify improvement.

Decision rules

  • If the skill has no references, test only SKILL.md steps
  • If a derailment is actually a bug in an external tool (not the instructions), document it but tag it as `external`; don't fix the skill for someone else's bug
  • If 3+ P1 items cluster in one step, treat the cluster as compound P0
  • If the skill references other skills, test only the current skill's instructions — not the referenced skill's workflow
  • If you discover the skill's trigger boundary is wrong (fires on wrong queries), record it as a friction point but also flag it separately as a trigger issue
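
The compound-P0 rule above can be checked against a derail-notes file. This is a rough sketch under the same layout assumptions as the report template: steps are `### ` headings and each friction point starts with a line like `**F-NN ...** (P1)`. The function name `compound_p0_steps` is hypothetical.

```bash
# Hypothetical helper: list steps where 3 or more P1 items cluster.
# Assumptions: "### " step headings; entries like "**F-01 ...** (P1)".
compound_p0_steps() {
  awk '
    /^### /                 { step = substr($0, 5) }
    /^\*\*F-[0-9]+.*\(P1\)/ { p1[step]++ }
    END {
      for (s in p1)
        if (p1[s] >= 3) printf "compound P0: %s (%d P1 items)\n", s, p1[s]
    }
  ' "$1" | sort
}
```

Each step it prints should be handled with P0 priority, even though its individual items are tagged P1.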

Do this, not that

| Do this | Not that |
| --- | --- |
| Follow each step literally as written | Fill in gaps from personal knowledge |
| Record every uncertainty as a friction point | Skip ambiguities that seem "minor" |
| Fix the source files directly | Create a separate errata or known-issues file |
| Test on a real task within the skill's scope | Use a toy example or hypothetical scenario |
| Write structured derail notes with IDs and severities | Write prose complaints without classification |
| Verify fixes with grep and routing checks | Assume fixes are correct without verification |
| Report what worked well alongside what broke | Write a purely negative report |

Output contract

Deliver in this order:
  1. Test metadata (skill, task, date)
  2. Pre-scan summary (step count, branching, dependencies)
  3. Friction point registry (F-01 through F-NN with severity and root cause)
  4. Derailment density map
  5. What worked well section
  6. Priority summary table
  7. Fixes applied (which file, which friction point)
  8. Verification results

Reference routing

| File | Read when |
| --- | --- |
| `references/friction-classification.md` | Assigning severity (P0/P1/P2) to a friction point or choosing between severity levels |
| `references/root-cause-taxonomy.md` | Tagging friction points with root cause codes for pattern analysis |
| `references/fix-patterns.md` | Matching a derailment type to a proven fix pattern |
| `references/metrics-and-iteration.md` | Tracking improvement across multiple test runs or building cross-run reports |
| `references/adaptation-domains.md` | Applying Derailment Testing to non-skill instruction sets (runbooks, SOPs, API docs) |

Guardrails

  • Do not skip the pre-scan. It prevents misidentifying "working as designed" as a derailment.
  • Do not fix friction points without reading the root cause taxonomy. Fixes without root cause analysis recur.
  • Do not create an errata file. Fixes go directly into the source.
  • Do not declare the test complete without the "What worked well" section.
  • Do not re-test with the same task. Use a different representative task for each run.
  • Do not test a skill you are currently building. Build it first (with `build-skills`), then test it.

Final checks

Before declaring the test complete:
  • Derail notes file exists at `derail-notes/NN-dogfood-[topic].md`
  • Every friction point has an ID (F-NN), severity (P0-P2), and root cause code
  • All P0 fixes are applied
  • All P1 fixes are applied (or deferred with justification)
  • "What worked well" section is present
  • Priority summary table is present
  • Verification grep shows zero stale terms
  • Routing integrity confirmed — no orphaned reference files in tested skill
  • SKILL.md of tested skill is still under 500 lines after fixes
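
Some of these checks can be scripted. The sketch below covers only the ones that depend on text already shown in this document: the two required sections, a severity tag on every friction point, and the 500-line limit. It does not check root cause codes, because their format is defined in a reference file. The function name `final_checks` and the entry layout `**F-NN ...** (P0)` are assumptions.

```bash
# Hypothetical helper: partial automation of the final checks.
# Assumption: friction points start with a line like "**F-01 ...** (P0)".
final_checks() {
  local notes="$1" skill_md="$2" failed=0 section

  for section in "What worked well" "Priority summary"; do
    grep -q "$section" "$notes" || { echo "FAIL: missing section: $section"; failed=1; }
  done

  # Any friction point line without a (P0)/(P1)/(P2) tag is a failure.
  if grep -E '^\*\*F-[0-9]+' "$notes" | grep -qvE '\(P[0-2]\)'; then
    echo "FAIL: friction point without severity"; failed=1
  fi

  if [ "$(wc -l < "$skill_md" | tr -d ' ')" -ge 500 ]; then
    echo "FAIL: SKILL.md is not under 500 lines"; failed=1
  fi

  return "$failed"
}
```

A zero exit status means only that these scripted checks passed; the remaining items in the list still need manual confirmation.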