# skill-creator
Build, validate, and iterate on agent skills in this monorepo. Bakes in the conventions every skill here follows: kebab-case naming, "Use when" trigger phrases, selective XML for example boundaries, and a RED→GREEN→REFACTOR evaluation cycle (see `references/tdd-for-skills.md`).
## When to use
Verbatim trigger phrases the user might say:
- "build a skill for X"
- "create a new skill"
- "scaffold a skill"
- "add a skill that does Y"
- "make me a skill"
- "audit this skill against our rules"
- "refactor this skill to match repo conventions"
## When NOT to use
- User is modifying source code, not skills
- User is debugging an existing skill (just edit it directly)
- User wants to install a third-party skill
- User is writing non-skill markdown (docs, READMEs, etc.)
## Workflow
### 1. Discover

Clarify what the skill should do. Answer these before scaffolding:
- What user request triggers this skill? Capture verbatim phrases.
- Does an existing skill in `skills/` already cover this? List the skills there and skim each description. If overlap is >70%, propose extending the existing skill instead.
### 2. Name

- Kebab-case only
- Matches `^[a-z][a-z0-9-]+[a-z0-9]$`
- ≤64 chars
- No abbreviations — spell words out
- Prefer specific, descriptive names over generic ones
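The naming rules above can be sketched as a quick check (a hypothetical helper for illustration — the real enforcement lives in `packages/skill-tools/src/lib/lint.ts`):

```typescript
// Hypothetical sketch of the naming rules; not the repo's actual lint code.
const NAME_RE = /^[a-z][a-z0-9-]+[a-z0-9]$/;

function isValidSkillName(name: string): boolean {
  return name.length <= 64 && NAME_RE.test(name);
}

console.log(isValidSkillName("toml-config-parser")); // true: kebab-case, starts with a letter
console.log(isValidSkillName("TOML_Parser"));        // false: uppercase and underscore
console.log(isValidSkillName("a".repeat(70)));       // false: over 64 chars
```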
### 3. RED phase — write and run baselines

Pressure-test the gap before writing the skill. See `references/pressure-scenarios.md` for what makes a good scenario per skill type.
- Ask the user for 3+ pressure scenarios: realistic prompts (not abstract "convert this PDF" — specific, messy, with personal context).
- For each scenario, decide assertions (regex / contains / file_exists).
- Write the eval definitions from `templates/evals.json.template`.
- Run the eval harness, which dispatches an Agent(general-purpose) for each scenario without the skill loaded and saves transcripts to `skills/<name>/.workspace/iteration-1/eval-K-name/without_skill/`.
- Inspect the without-skill transcripts (open one with `pnpm skill-tools view <name>`). If the baseline already passes the assertions cleanly, the skill is unnecessary — tell the user and stop. Better to skip than ship a no-op skill.

The transcripts are gitignored; the eval definitions file is committed.
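The scenario-plus-assertions shape described above can be sketched in TypeScript (hypothetical types for illustration — the authoritative schema is `templates/evals.json.template`, not this sketch):

```typescript
// Hypothetical sketch of an eval scenario; field names are illustrative.
type Assertion =
  | { type: "regex"; pattern: string }
  | { type: "contains"; text: string }
  | { type: "file_exists"; path: string };

interface EvalScenario {
  name: string;
  prompt: string; // realistic and messy, with personal context
  assertions: Assertion[];
}

const scenario: EvalScenario = {
  name: "parse-messy-config",
  prompt:
    "my deploy keeps failing and I think my config.toml is malformed — can you check it?",
  assertions: [
    { type: "contains", text: "toml" },
    { type: "file_exists", path: "config.toml" },
  ],
};

console.log(scenario.assertions.length); // 2
```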
### 4. Draft frontmatter

Skills here are agent-agnostic: `name` and `description` are universally required (the `skill-tools` CLI rejects skills missing either); the others are Claude Code extensions kept for cross-agent compatibility (other agents ignore them). Full schema in `references/frontmatter.md`.
```yaml
---
name: <skill-name>
description: >-
  This skill should be used when [trigger condition]. Common triggers
  include "verbatim phrase 1", "verbatim phrase 2", and "verbatim phrase 3".
  [What it bakes in / what's distinctive]. Skip when [anti-trigger].
# --- Claude Code extensions (ignored by other agents) ---
argument-hint: '[<optional-arg>]'
user-invocable: true
model-invocable: true
---
```

`name` must exactly match the skill's directory name (kebab-case).
Description rules (full list):
- 80–1024 characters
- Contains "Use when" or "This skill should be used when"
- Lists ≥3 verbatim trigger phrases in double quotes
- No anti-shortcut words (e.g., "first", "then", "process") — these get followed as instructions instead of treated as triggers
- Includes a clause naming what the skill does NOT do
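Those description rules can be sketched as a checker (a hypothetical helper for illustration — the enforced version lives in `packages/skill-tools/src/lib/lint.ts`):

```typescript
// Hypothetical sketch of the description rules; illustrative only.
function checkDescription(desc: string): string[] {
  const problems: string[] = [];
  if (desc.length < 80 || desc.length > 1024) {
    problems.push("must be 80-1024 characters");
  }
  if (!desc.includes("Use when") && !desc.includes("This skill should be used when")) {
    problems.push('missing "Use when" trigger phrase');
  }
  const quoted = desc.match(/"[^"]+"/g) ?? [];
  if (quoted.length < 3) {
    problems.push("needs >=3 verbatim trigger phrases in double quotes");
  }
  return problems;
}

const good =
  'This skill should be used when the user mentions TOML configs. Common triggers ' +
  'include "parse this toml", "validate my config", and "convert toml to json". ' +
  'Skip when working with YAML or JSON.';
console.log(checkDescription(good)); // []
```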
### 5. Draft body

Use Markdown headings (`##`, `###`) for structure. Use XML only inside these tags (see the XML-vs-Markdown reference for when to use which):
- `<example>` for full scenarios
- `<good>` / `<bad>` for contrast pairs
- `<input>` / `<output>` for tool-call boundaries

Typical body sections:
- `## When to use` — verbatim trigger phrases
- `## When NOT to use` — anti-triggers
- `## Workflow` — numbered actions the agent takes
- `## Examples` — at least one `<example>` block
- `## References` — links to companion docs
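Assembled, a minimal body skeleton following those sections (placeholder text is illustrative) might look like:

```markdown
## When to use
- "verbatim trigger phrase"

## When NOT to use
- anti-trigger condition

## Workflow
1. First action the agent takes
2. Second action

## Examples
<example>
<input>User says: "a realistic prompt"</input>
<output>What the agent does, step by step</output>
</example>

## References
- references/<topic>.md — deeper rules
```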
### 6. Self-lint

Run `pnpm skill-tools lint <name>`. All blocking-severity findings must clear; lower-severity findings are advisory. If any rule fails, fix the SKILL.md and re-run.

The full rule list lives in `references/lint-checklist.md`. The TS implementation in `packages/skill-tools/src/lib/lint.ts` is the enforcer.
### 7. GREEN phase — re-run with the skill loaded

Run the eval harness again — this dispatches an Agent(general-purpose) for each scenario with the new skill in context, saves transcripts to `skills/<name>/.workspace/iteration-1/eval-K-name/with_skill/`, then grades.
Acceptance: every eval that failed without the skill should now pass. If any still fail, the skill body is missing instructions — patch and rerun. If any regress (passed without, now fails with), the skill introduced a problem — also patch and rerun.
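The acceptance rule can be sketched as a classification over eval outcomes (a hypothetical helper, not part of the repo's tooling):

```typescript
// Hypothetical sketch of the GREEN-phase acceptance rule.
interface EvalOutcome {
  name: string;
  passedWithout: boolean; // RED-phase baseline result
  passedWith: boolean;    // GREEN-phase result with the skill loaded
}

type Verdict = "fixed" | "regressed" | "already-passing" | "still-failing";

function classify(r: EvalOutcome): Verdict {
  if (!r.passedWithout && r.passedWith) return "fixed";     // the skill closed the gap
  if (r.passedWithout && !r.passedWith) return "regressed"; // the skill introduced a problem
  return r.passedWithout ? "already-passing" : "still-failing";
}

console.log(classify({ name: "a", passedWithout: false, passedWith: true }));  // fixed
console.log(classify({ name: "b", passedWithout: true, passedWith: false }));  // regressed
```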
### 7.5. REFACTOR — capture rationalizations (discipline skills only)

This is the REFACTOR phase of the RED→GREEN→REFACTOR cycle. If this is a discipline skill (one that enforces rules the agent might rationalize skipping — e.g., "always run tests", "always use Result"), read the with-skill transcripts. When the subagent skipped a rule and explained why, capture the excuse verbatim into a `## Rationalization table` section at the bottom of SKILL.md.
Format:

```markdown
## Rationalization table

| Rule                        | Excuse                            | Counter                                         |
| --------------------------- | --------------------------------- | ----------------------------------------------- |
| Always run the test         | "the change is tiny so I'll skip" | Tiny changes still break behavior; run the test |
| Use Result instead of throw | "this is just a quick prototype"  | Prototypes leak into prod; use Result anyway    |
```
Capturing excuses verbatim — not sanitized — is the point. Future agents recognize their own pattern. Skip this step only when the skill has no rules an agent could rationalize skipping (most reference skills, some pattern skills). Technique and discipline skills almost always benefit from a rationalization table.
### 8. Package

- `SKILL.md` — the skill body
- `evals.json` — the test definitions (already created in step 3)
- `LICENSE` — MIT (matches repo root)
- `README.md` — human-facing summary

Optional companions for non-trivial skills:
- `references/<topic>.md` — deeper rules referenced from SKILL.md
- `templates/<thing>.template` — boilerplate the skill scaffolds from

The nested `.workspace/` directory (transcripts, grading, benchmarks) is gitignored — only the files above ship with the skill.
## Examples
<example>
<input>User says: "build me a skill for parsing TOML config files"</input>
<output>
1. Discover — confirm: "Should this trigger on `*.toml` files? Or any time the user mentions TOML?" Check `skills/` for overlap (none).
2. Name — propose `toml-config-parser` (kebab-case, descriptive).
3. RED — three scenarios: (a) "parse this config.toml", (b) "validate the toml schema", (c) "convert toml to json". Without the skill, agent uses ad-hoc string parsing.
4. Frontmatter — description includes "Use when", lists 3 trigger phrases, adds "Skip when working with YAML or JSON".
5. Body — markdown sections, one `<example>` showing parse-validate-output.
6. Self-lint — name matches regex; description = 412 chars, contains "Use when"; no anti-shortcut words; XML balances.
7. GREEN — rerun RED scenarios; agent now uses zod + smol-toml.
8. Package — write `skills/toml-config-parser/{SKILL.md, README.md, LICENSE}`.
</output>
</example>
<example>
<good>
description: >-
This skill should be used when the user wants to refactor TypeScript
code to follow functional patterns. Common triggers include "make this
functional", "remove the class", and "use Result instead of throw".
Bakes in factory functions over classes, Result<T,E> over exceptions,
and immutable state. Skip when working with framework-required classes
(PrismaClient, etc.).
</good>
<bad>
description: >-
This skill helps with TypeScript. First it analyzes the code, then it
refactors it. The process involves several steps.
</bad>
The `<bad>` example fails three rules: no "Use when" phrase, no verbatim trigger phrases in quotes, and it contains anti-shortcut words ("first", "then", "process") that cause the agent to follow them as instructions instead of treating them as triggers.
</example>
## References

- Schema and assertion types for evals
- `references/pressure-scenarios.md` — how to write good pressure scenarios per skill type
- `references/tdd-for-skills.md` — RED → GREEN → REFACTOR cycle
- `references/frontmatter.md` — frontmatter schema
- Naming rules
- `references/description.md` — description rules + anti-shortcut patterns
- When to use XML vs Markdown
- `references/lint-checklist.md` — full self-lint checklist
## Templates

- `templates/SKILL.md.template` — boilerplate with placeholders
- `templates/README.md.template` — readme boilerplate
- `templates/example-skill.md` — fully-worked example skill