skill-progressive-disclosure-design
Skill Progressive Disclosure Design
Each section that recommends a direction includes explicit pros and cons. The decisions in this skill are trade-offs, not rules. The model using this skill should reason from the trade-offs to the user's specific situation rather than apply rules blindly.
Triggering vs. disclosure: separate these first
Two problems get conflated and need separating before any splitting decision.
Triggering is whether Claude invokes the skill at all. It is driven entirely by the YAML `description`. File splitting does not affect triggering. If the question is "my skill doesn't trigger reliably", do not split files; fix the description (use `run_loop.py` from the `skill-creator` skill).
Progressive disclosure is what loads after the skill activates. The SKILL.md body always loads. `references/*` only loads when SKILL.md tells the model to read a specific file. `scripts/*` executes without loading into context at all. This is where context protection happens.
If the user is asking about splitting because of triggering issues, surface the confusion first and redirect.
Default: do not split
A monolithic SKILL.md beats a split one until proven otherwise.
Split only when at least one is true:
- SKILL.md exceeds ~400 lines and content has natural branches.
- Empirical evidence (eval transcripts) shows the model wasting context on irrelevant sections.
- Specific content is large and only needed in narrow conditions.
Pros of staying monolithic:
- Single context load, no router prose to maintain.
- No tool-call overhead from reading references.
- No risk of the model loading the wrong reference or skipping a needed one.
- Easier to maintain: one file, one source of truth.
- Better for highly interconnected content where context is global.
- Easier for human reviewers to read end-to-end.
Cons of staying monolithic:
- Every invocation pays the full token cost, even when only 10% of the content is relevant.
- Does not scale past ~500 lines without degrading the model's ability to find what matters.
- No mechanism to gate rare or niche content.
- All content must justify its always-loaded status.
- Maintenance gets harder as the file grows.
Three split axes that work
1. Variant branch
User intent selects exactly one path. SKILL.md holds the decision logic and shared workflow. Each `references/<variant>.md` holds path-specific detail.
my-skill/
├── SKILL.md # decision tree + shared steps
└── references/
    ├── variant-a.md
    ├── variant-b.md
    └── variant-c.md
Examples of clean variants: cloud provider, database engine, framework choice, output format, language.
Pros:
- Each invocation loads only the matching variant; large savings when variants are big.
- Variants evolve independently, simplifying maintenance.
- Adding a new variant does not bloat existing content.
- Mental model is easy: select one path based on input.
- Maps cleanly to user intent that already mentions the variant.
Cons:
- Requires routing logic in SKILL.md, eating back some of the line savings.
- Cross-cutting changes touch every variant file, multiplying effort.
- Risk of treatments diverging across variants over time.
- If user intent is ambiguous, the model may load multiple variants and lose the savings.
- If variants share more than ~60% of their content, the abstraction breaks down.
2. Workflow vs. reference data
SKILL.md holds the procedure (verbs, sequence, decisions). `references/` holds lookup material queried by key.
Good reference content: schemas, error code tables, API surface listings, example galleries, configuration option matrices, design tokens.
Pros:
- Highest leverage of all splits: lookups are narrow, the model reads one entry.
- Natural conceptual boundary (procedure vs. data).
- Reference can grow large without affecting per-invocation cost.
- Adding new reference entries does not touch the workflow.
- Reference data can often be machine-generated and regenerated.
Cons:
- The model must know what to look up before reading. Pointer must encode lookup keys explicitly.
- Fails when the workflow needs to weave reference data inline rather than at discrete points.
- Splits content that is conceptually unified, harder for human readers.
- The model may miss broader context that lives only in the reference.
- Lookup data that is small (under ~50 lines total) is rarely worth splitting.
3. Depth tier (common path vs. edge cases)
SKILL.md covers the 80% case. `references/edge-cases.md` covers the rest.
The pointer must read like:
If you see X, Y, or Z, stop and read `references/edge-cases.md` before continuing.
Pros:
- Common path stays minimal, fast, cheap.
- Edge cases can be exhaustive without polluting every invocation.
- Easy to extend edge-case coverage without touching the common path.
- Mirrors how experts work: defaults first, exceptions on demand.
Cons:
- The load condition must be sharp and observable from user input. Most edge cases do not satisfy this.
- Vague conditions cause either always-loading (waste) or never-loading (dead weight).
- Edge cases get less testing because evals naturally cluster on common queries.
- The model may follow the common path past a point where it should have escalated.
- The 80/20 estimate is often wrong; what looked like an edge case turns out to be common.
Splits that do not work
For each anti-pattern, "why it appears attractive" shows what makes designers reach for it; "why it fails" shows what goes wrong in practice.
Topic-based splits where invocations do not cluster by topic
A testing skill split into `unit.md`, `integration.md`, `mocks.md` is a typical example.
Why it appears attractive:
- Conceptually clean, mirrors how a human would organize documentation.
- Easy to navigate as a maintainer.
- Plausibly reduces context per invocation.
Why it fails:
- Real tasks span 2-3 topics, forcing multiple loads per invocation.
- Cross-topic concerns get duplicated or fragmented.
- The savings are theoretical, not empirical.
Splitting to hit a line target without a real branching condition
Why it appears attractive:
- A heuristic ("keep SKILL.md under 400 lines") feels like a clean rule to satisfy.
- Splitting feels like progress.
Why it fails:
- Without a branching condition, references load in parallel or always, providing no savings.
- Adds router prose to SKILL.md, often making the total content longer.
Rare-but-critical content in references/
Why it appears attractive:
- The content is large or specialized.
- Moving it out of SKILL.md feels like good hygiene.
Why it fails:
- References are optional by design; the model may skip them.
- If the content is critical, it must be loaded reliably, which means SKILL.md.
- "Rare" and "critical" together is usually a sign the skill is doing two jobs and should be two skills.
Cosmetic splits (Examples, Notes, Tips files)
Why it appears attractive:
- Reduces visual clutter in SKILL.md.
- Feels like good organization.
Why it fails:
- No load condition: either always loaded (wasted tool call) or never loaded (dead content).
- Implies an importance hierarchy that does not exist at runtime.
- Frequently hides content from the model that needs it.
Pointer hygiene
When SKILL.md points at a reference, the pointer is the entire load contract. Rules:
- Name the user-visible signal that triggers the load. "If the user mentions snapshot tests" not "for testing concerns".
- One sentence per pointer. Do not summarize the reference content in SKILL.md.
- Encode the load condition in the filename. `go126-simd.md` not `advanced.md`.
- Top-of-file table of contents for any reference over 300 lines.
- If two references are co-loaded in most runs, merge them.
Pros of strict pointer hygiene:
- Wrong-load rate drops sharply.
- Filename encodes load condition, self-documenting for future maintainers.
- Forces upfront clarity about when each reference is needed.
- Makes architecture evals easier to interpret.
Cons of strict pointer hygiene:
- Some content has no crisp trigger; rules force awkward formulations.
- Filenames become long and awkward.
- Requires discipline; easy to drift over time.
- Can over-constrain useful loads when the trigger condition is genuinely fuzzy.
Use scripts/ before references/
For anything deterministic (formatting, validation, schema generation, file transforms, regex-heavy parsing), a script in `scripts/` beats prose in `references/`.
Pros of scripts over reference prose:
- Zero context cost for execution.
- Deterministic, repeatable output.
- Reusable across invocations without re-reading.
- Can be unit tested independently.
- Often faster than prose-driven generation by the model.
Cons of scripts:
- Requires the runtime to support script execution; not all environments do.
- Less flexible than letting the model reason over prose.
- Harder to handle unanticipated edge cases without code changes.
- Adds a maintenance burden: code in the skill needs to keep working.
- Users cannot easily customize behavior without editing the script.
- Failure modes are sharper: script errors stop the workflow.
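As a minimal sketch of the kind of deterministic check that fits in `scripts/`, the validator below reports structural problems in a JSON file. The filename `validate_config.py`, the `REQUIRED_FIELDS` schema, and the function names are hypothetical and chosen only for illustration.

```python
import json
import sys

# Hypothetical schema: field name -> expected Python type.
REQUIRED_FIELDS = {"name": str, "version": str, "entries": list}


def validate(record: dict) -> list[str]:
    """Return human-readable problems; an empty list means the record is valid."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"{field}: expected {expected_type.__name__}")
    return problems


def main(argv: list[str]) -> int:
    """Validate the JSON file named in argv[1]; return a process exit code."""
    if len(argv) != 2:
        print("usage: validate_config.py <file.json>", file=sys.stderr)
        return 2
    with open(argv[1], encoding="utf-8") as handle:
        record = json.load(handle)
    problems = validate(record)
    for problem in problems:
        print(problem)
    return 1 if problems else 0
```

Wired up with `sys.exit(main(sys.argv))` under a `__main__` guard, only the printed problems reach the model's context; the validation logic itself never does.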
Decision checklist
Before splitting any content out of SKILL.md, answer:
1. Does this content have a sharp, observable load condition the model can detect from user input?
2. Will splitting actually reduce context, accounting for the router prose added to SKILL.md?
3. Is this reference data (lookup) or procedural (sequence)? Procedural content usually stays.
4. Could a script handle this deterministically instead?
5. Across realistic invocations, what fraction of runs would load this file? Below 20%, inline or delete: rarely loaded references rarely justify the routing overhead. 20–80% is the split sweet spot. Above 80%, promote into SKILL.md: the routing cost exceeds the load savings.
If the answer to question 1 is unclear, do not split.
Evaluating skill architecture
Architecture evaluation is different from output evaluation. Output evals ask "did the skill produce the right thing?". Architecture evals ask "did the skill load the right files for the right reasons, at acceptable cost?". Same harness, different metrics. Run both. Output quality is the floor; architecture is optimization above that floor.
Pros of running architecture evals:
- Catches dead references, dead SKILL.md sections, and mis-routed content.
- Quantifies whether a split actually saved tokens or just looked clean.
- Reveals real load patterns that intuition misses.
- Forces the eval set to cover all declared paths, surfacing dead paths.
- Compounds with output evals to catch regressions across both axes.
Cons of running architecture evals:
- Requires harness setup beyond standard output evals.
- Eval-set design for path coverage takes work.
- Metrics need calibration per-skill (thresholds vary with cost profile).
- Output evals are still required; this adds to total iteration cost.
- Easy to over-optimize for token cost at the expense of output quality.
Eval set design for architecture
Output evals optimize for output quality across realistic queries. Architecture evals optimize for path coverage. The eval set must exercise every code path the skill claims to have, otherwise the metrics are noise.
Construct, at minimum:
- One query per declared variant (if the skill uses variant-branch splits).
- One query per edge-case branch (if depth-tier splits exist).
- One query per major lookup category (if reference-data splits exist).
- One query that should hit the common path only and load zero references.
- 2-3 off-topic queries that should not trigger the skill at all (also tests the description).
If no realistic query triggers a given reference file, that file is dead. Inline it or delete it before running anything.
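One way to check path coverage before running anything is to record, per query, which reference files it is expected to load, then compare that against the declared files. The `EvalQuery` shape and the helper names below are assumptions for illustration, not part of any existing harness.

```python
from dataclasses import dataclass


@dataclass
class EvalQuery:
    prompt: str
    # Reference files this query is expected to load (empty = common path only).
    expected_loads: frozenset[str] = frozenset()
    # False for off-topic queries that should not trigger the skill at all.
    should_trigger: bool = True


def uncovered_references(queries: list[EvalQuery], declared: set[str]) -> set[str]:
    """Return declared reference files that no query is expected to load."""
    covered: set[str] = set()
    for query in queries:
        covered |= query.expected_loads
    return declared - covered


def has_common_path_query(queries: list[EvalQuery]) -> bool:
    """True if at least one triggering query expects zero reference loads."""
    return any(q.should_trigger and not q.expected_loads for q in queries)
```

Any file returned by `uncovered_references` is a candidate for inlining or deletion, or a sign that the eval set is missing a query.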
Instrumentation
Each eval run is executed by a subagent with the skill loaded. Capture per run:
- Full transcript including every tool call.
- Which `references/*` files were read (parse `view` calls on paths inside the skill directory).
- Whether `scripts/*` were invoked.
- Total tokens and wall time.
- The output (for the parallel output-quality eval).
Persist as `transcript.json` and `loads.json` per run, alongside the standard output. The harness from skill-creator already records tokens and time in `timing.json`; extend its grading step to extract reference loads from transcripts.
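The extraction step might look like the sketch below. It assumes each transcript is a list of event dictionaries with `type`, `tool`, and `path` keys; that shape is hypothetical, and the real transcript format may differ, so the key names would need adapting.

```python
from collections import Counter
from pathlib import PurePosixPath


def extract_loads(transcript: list[dict], skill_dir: str) -> dict:
    """Summarise which skill files a run read (references) or ran (scripts)."""
    root = PurePosixPath(skill_dir)
    reads: Counter = Counter()
    scripts: Counter = Counter()
    for event in transcript:
        if event.get("type") != "tool_call":
            continue
        path = PurePosixPath(event.get("path", ""))
        try:
            relative = path.relative_to(root)
        except ValueError:
            continue  # path is outside the skill directory
        top_level = relative.parts[:1]
        if event.get("tool") == "view" and top_level == ("references",):
            reads[str(relative)] += 1
        elif event.get("tool") != "view" and top_level == ("scripts",):
            scripts[str(relative)] += 1
    return {"references": dict(reads), "scripts": dict(scripts)}
```

The returned dictionary is what would be written to `loads.json`; a read count of 2 or more for one file marks a re-read within the run.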
Metrics per reference file
Across all eval runs, for each `references/*.md`:
- Load rate: fraction of runs that read it.
- Co-occurrence: for each other reference, fraction of runs that loaded both.
- Use rate when loaded: of the runs that loaded it, did the content visibly inform the output (cited content, applied procedure, used schema)? Inspect transcripts.
- Re-read rate: fraction of runs that loaded the same file twice.
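Three of these metrics can be computed mechanically from per-run load counts; use rate still requires reading transcripts. The sketch below assumes each run is summarised as a mapping from reference filename to read count, which is an illustrative input shape rather than an established format.

```python
from itertools import combinations


def reference_metrics(runs: list[dict[str, int]]) -> dict:
    """Compute load rate, re-read rate, and pairwise co-occurrence.

    runs[i] maps each reference file read in run i to its read count.
    """
    if not runs:
        raise ValueError("at least one run is required")
    total = len(runs)
    files = sorted({name for run in runs for name in run})
    load_rate = {f: sum(f in run for run in runs) / total for f in files}
    reread_rate = {f: sum(run.get(f, 0) >= 2 for run in runs) / total for f in files}
    co_occurrence = {
        (a, b): sum(a in run and b in run for run in runs) / total
        for a, b in combinations(files, 2)
    }
    return {
        "load_rate": load_rate,
        "reread_rate": reread_rate,
        "co_occurrence": co_occurrence,
    }
```

All rates here are fractions of the full eval set, so a file that is never loaded simply does not appear; compare the keys against the declared reference files to find those.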
Metrics for the skill overall
- Median and p95 tokens per invocation, with and without references.
- SKILL.md utilization: read transcripts and identify sections of SKILL.md the model never references in any run. Strong candidates for deletion.
- Path coverage: did every declared path get hit by at least one query?
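The token summary can be sketched as follows. The nearest-rank definition of p95 is one common choice and an assumption here; other percentile definitions give slightly different values on small eval sets.

```python
import math
import statistics


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def token_summary(tokens_per_run: list[float]) -> dict:
    """Median and p95 token counts across eval runs."""
    return {
        "median": statistics.median(tokens_per_run),
        "p95": percentile(tokens_per_run, 95),
    }
```

Calling `token_summary` once on runs that loaded references and once on runs that did not gives the "with and without references" comparison.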
Decision rules
| Observation | Action |
|---|---|
| Reference loaded in <20% of runs | Inline into SKILL.md or delete — routing overhead not justified |
| Reference loaded in 20–80% of runs | Leave split — the sweet spot; routing pays off |
| Reference loaded in >80% of runs | Promote into SKILL.md — always-load cost beats routing cost |
| Two references co-load in >70% of runs | Merge into one file |
| Reference loaded but not used in output | Fix or remove the pointer in SKILL.md |
| Reference re-read inside the same run | SKILL.md routing is unclear; clarify |
| No query triggers a reference | Delete the reference |
| SKILL.md section never referenced in any run | Delete that section |
These thresholds are starting points. Tune them based on the cost profile: small references with cheap loads tolerate lower load rates than large ones.
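The load-rate and co-load rows of the table can be expressed as a small function with tunable thresholds. This is a sketch only: the function names and return labels are hypothetical, and the defaults simply mirror the starting points in the table.

```python
def classify_reference(load_rate: float, low: float = 0.20, high: float = 0.80) -> str:
    """Map a reference file's load rate to the action suggested by the table."""
    if load_rate == 0:
        return "delete"  # no query triggers the reference
    if load_rate < low:
        return "inline-or-delete"  # routing overhead not justified
    if load_rate > high:
        return "promote"  # always-load cost beats routing cost
    return "keep-split"  # the sweet spot


def should_merge(co_load_rate: float, threshold: float = 0.70) -> bool:
    """True when two references co-load often enough to merge into one file."""
    return co_load_rate > threshold
```

Passing a lower `low` for small, cheap references is one way to apply the tuning advice above.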
Comparing two architectures
When choosing between architectures (monolithic vs. split, or split A vs. split B):
- Run the identical eval set against both versions.
- Run output-quality evals on both. Confirm no regression. If quality drops, the architecture change is a loss regardless of token savings.
- Compare median tokens, p95 tokens, and median time per run.
- Compare path coverage: does each version reliably reach the same outputs through the expected paths?
A split that saves 15% tokens but adds variance in output quality is worse than the monolith. Reliability beats efficiency.
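The "quality first, tokens second" ordering can be sketched as a comparison function. The `ArchResult` fields are assumptions: output quality is reduced here to a single pass rate, which simplifies the variance concern described above.

```python
from dataclasses import dataclass


@dataclass
class ArchResult:
    name: str
    pass_rate: float  # output-quality eval pass rate, between 0 and 1
    median_tokens: float


def pick_architecture(
    baseline: ArchResult, candidate: ArchResult, quality_tolerance: float = 0.0
) -> ArchResult:
    """Prefer the candidate only if quality holds and median tokens drop."""
    if candidate.pass_rate < baseline.pass_rate - quality_tolerance:
        return baseline  # quality regression: token savings do not matter
    if candidate.median_tokens < baseline.median_tokens:
        return candidate
    return baseline
```

Ties go to the baseline, which keeps the simpler existing structure unless the change shows a measurable gain.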
What the metrics will not tell you
- Whether the SKILL.md prose is clear. Read transcripts for confused tool calls and dead-end attempts.
- Whether the description triggers correctly. That is a separate eval (use `run_loop.py` from the `skill-creator` skill).
- Whether content placement matches user mental models. Subjective; review with a human.
The split that looked clean at design time rarely matches real load patterns. Trust the transcripts over your intuitions.
Output when advising
When asked to advise on a specific skill's organization:
- Diagnose first. Is this a triggering question or a disclosure question?
- Quote relevant content from the existing SKILL.md (or the user's description of it) before recommending.
- Propose the minimum viable split. Resist splitting into more files than necessary.
- For each proposed reference file, write the exact pointer sentence that would go in SKILL.md.
- Surface the trade-offs explicitly. Use the pros/cons in this skill as the model for how to present a recommendation.
- If unsure whether a split helps, recommend instrumentation (eval the skill, read transcripts) before committing.