Writing Skills
Overview
Writing skills means applying Test-Driven Development (TDD) to process documentation.
Personal skills are stored in agent-specific directories; Claude Code and Codex each use their own location.
You write test cases (stress scenarios with subagents), watch them fail (baseline behavior), write the skill (documentation), watch the tests pass (agent follows rules), then refactor (plug loopholes).
Core Principle: If you don't observe the agent failing without the skill, you don't know if the skill teaches the right thing.
Required Background: Before using this skill, you must understand superpowers:test-driven-development. That skill defines the basic red-green-refactor cycle. This skill adapts TDD to documentation writing.
Official Guidelines: For Anthropic's official best practices for skill writing, see anthropic-best-practices.md. This document provides additional patterns and guidelines that complement this skill's TDD-oriented approach.
What is a Skill?
A skill is a reference guide for proven techniques, patterns, or tools. Skills help future Claude instances find and apply effective methods.
Skills are: Reusable techniques, patterns, tools, reference guides
Skills are NOT: Narratives about how you solved a problem once
TDD Mapping to Skills
| TDD Concept | Skill Creation |
|---|---|
| Test Case | Stress scenario with subagents |
| Production Code | Skill documentation (SKILL.md) |
| Test Failure (Red) | Agent violates rules without the skill (baseline) |
| Test Pass (Green) | Agent follows rules with the skill |
| Refactor | Plug loopholes while maintaining compliance |
| Write Test First | Run baseline scenarios before writing the skill |
| Observe Failure | Record the exact rationalizations the agent uses |
| Minimal Code | Write the skill specifically for those violations |
| Observe Pass | Verify the agent now follows the rules |
| Refactor Cycle | Discover new rationalizations → plug → re-verify |
The entire skill creation process follows red-green-refactor.
When to Create a Skill
Create when:
- The technique isn't intuitively obvious to you
- You'll reference it repeatedly across different projects
- The pattern has broad applicability (not project-specific)
- Others will also benefit
Don't create:
- One-off solutions
- Standard practices that are well-documented elsewhere
- Project-specific conventions (put in CLAUDE.md)
- Mechanical constraints (automate if you can enforce with regex/validation – documentation is for scenarios requiring judgment)
Skill Types
Technical
Methods with specific steps (condition-based-waiting, root-cause-tracing)
Pattern
Ways to think about problems (flatten-with-flags, test-invariants)
Reference
API docs, syntax guides, tool documentation (office docs)
Directory Structure
skills/
  skill-name/
    SKILL.md              # Main reference document (required)
    supporting-file.*     # Only when needed
Flat Namespace - All skills live in a single searchable namespace
When to separate files:
- Large reference content (100+ lines) - API documentation, comprehensive syntax explanations
- Reusable tools - Scripts, utilities, templates
Keep inline:
- Principles and concepts
- Code patterns (< 50 lines)
- Everything else
SKILL.md Structure
Frontmatter (YAML):
- Two required fields: name and description (see agentskills.io/specification for full supported fields)
- Maximum 1024 characters total
- name: Use only letters, numbers, and hyphens (no parentheses or special characters)
- description: Third-person, only describes when to use (not what it does)
  - Start with "Use when...", focus on trigger conditions
  - Include specific symptoms, scenarios, and context
  - Never summarize the skill's process or workflow (see CSO section for why)
  - Try to keep it under 500 characters
```markdown
---
name: Skill-Name-With-Hyphens
description: Use when [specific trigger conditions and symptoms]
---

# Skill Name

## Overview
What is it? Explain the core principle in 1-2 sentences.

## When to Use
[Use small inline flowcharts if the decision isn't obvious]

Bullet list of symptoms and use cases
Scenarios where it doesn't apply

## Core Pattern (Technical/Pattern Skills)
Before/after code comparison

## Quick Reference
Table or bullets for quick scanning of common operations

## Implementation
Inline code for simple patterns
Links to files for large reference material or reusable tools

## Common Mistakes
Common issues + fixes

## Practical Outcomes (Optional)
Specific results
```
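The frontmatter rules above are mechanical enough to check automatically. The following is a minimal sketch, not an official validator: the function names are hypothetical, the parser handles only simple `key: value` lines rather than full YAML, and it assumes the 1024-character limit applies to the combined length of the two field values.

```python
import re

NAME_PATTERN = re.compile(r"[A-Za-z0-9-]+")
FIRST_PERSON = re.compile(r"\b(?:I|[Mm]e|[Mm]y|[Ww]e|[Oo]ur)\b")
MAX_TOTAL_CHARS = 1024        # assumption: limit covers name + description
SOFT_DESCRIPTION_CHARS = 500  # "try to keep it under" target


def parse_frontmatter(skill_text: str) -> dict[str, str]:
    """Extract simple `key: value` pairs from the leading --- block."""
    lines = skill_text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("SKILL.md must start with a '---' line")
    fields: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    raise ValueError("frontmatter is missing its closing '---' line")


def check_frontmatter(skill_text: str) -> list[str]:
    """Return a list of problems; an empty list means every check passed."""
    fields = parse_frontmatter(skill_text)
    name = fields.get("name", "")
    description = fields.get("description", "")
    problems: list[str] = []
    if not name:
        problems.append("missing required field: name")
    elif not NAME_PATTERN.fullmatch(name):
        problems.append("name may only contain letters, numbers, and hyphens")
    if not description:
        problems.append("missing required field: description")
    else:
        if not description.startswith("Use when"):
            problems.append('description should start with "Use when"')
        if FIRST_PERSON.search(description):
            problems.append("description should be written in third person")
        if len(description) > SOFT_DESCRIPTION_CHARS:
            problems.append("description is longer than 500 characters")
    if len(name) + len(description) > MAX_TOTAL_CHARS:
        problems.append("name + description exceed 1024 characters")
    return problems
```

A check like this covers only the mechanical constraints; whether the description names the right trigger conditions still has to be tested with real scenarios.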
Claude Search Optimization (CSO)
Discovery is Critical: Future Claude instances need to find your skills
1. Rich Description Field
Purpose: Claude reads descriptions to decide which skills to load for the current task. The description should answer: "Should I read this skill right now?"
Format: Start with "Use when...", focus on trigger conditions
Key: Description = When to use, not what the skill does
The description should only describe trigger conditions. Never summarize the skill's process or workflow in the description.
Why this matters: Testing shows that when descriptions summarize the skill's workflow, Claude may follow the description instead of reading the full skill content. A description stating "Perform code reviews between tasks" caused Claude to only do one review, even though the skill's flowchart clearly showed two reviews (first spec compliance, then code quality).
When the description was changed to only "Use when executing implementation plans with independent tasks in the current session" (no workflow summary), Claude correctly read the flowchart and followed the two-stage review process.
Pitfall: Descriptions that summarize workflows create shortcuts Claude will take. The skill body becomes documentation Claude skips.
```yaml
# Wrong: Summarizes workflow - Claude may follow description instead of reading skill
description: Use when executing plans - dispatches subagent per task with code review between tasks

# Wrong: Too much process detail
description: Use for TDD - write test first, watch it fail, write minimal code, refactor

# Correct: Only trigger conditions, no workflow summary
description: Use when executing implementation plans with independent tasks in the current session

# Correct: Only trigger conditions
description: Use when implementing any feature or bugfix, before writing implementation code
```
Content:
- Use specific trigger conditions, symptoms, and scenarios to indicate when this skill applies
- Describe the problem (race conditions, inconsistent behavior) rather than language-specific symptoms (setTimeout, sleep)
- Keep trigger conditions technology-agnostic unless the skill itself is technology-specific
- If the skill is technology-specific, clearly state that in the trigger conditions
- Write in third person (for injection into system prompts)
- Never summarize the skill's process or workflow
```yaml
# Wrong: Too abstract, vague, no when-to-use context
description: For async testing

# Wrong: First person
description: I can help you with async tests when they're flaky

# Wrong: Mentions technology but skill isn't technology-specific
description: Use when tests use setTimeout/sleep and are flaky

# Correct: Starts with "Use when", describes problem, no workflow
description: Use when tests have race conditions, timing dependencies, or pass/fail inconsistently

# Correct: Technology-specific skill with clear trigger conditions
description: Use when using React Router and handling authentication redirects
```
2. Keyword Coverage
Use words Claude will search for:
- Error messages: "Hook timed out", "ENOTEMPTY", "race condition"
- Symptoms: "flaky", "hanging", "zombie", "pollution"
- Synonyms: "timeout/hang/freeze", "cleanup/teardown/afterEach"
- Tools: Actual commands, library names, file types
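One way to check keyword coverage is to list the terms you expect a future search to use and report which ones the skill never mentions. This is a rough sketch with a hypothetical helper name and made-up sample text; plain substring matching will not catch synonyms or inflected forms.

```python
def missing_keywords(skill_text: str, expected_terms: list[str]) -> list[str]:
    """Return the expected search terms that never appear in the skill text."""
    haystack = skill_text.lower()
    return [term for term in expected_terms if term.lower() not in haystack]


# Illustrative skill excerpt, not taken from a real skill.
skill_text = """
Use when tests are flaky, hang, or fail with "Hook timed out".
Replace fixed sleeps with condition polling to remove the race condition.
"""
expected = ["flaky", "Hook timed out", "race condition", "ENOTEMPTY", "teardown"]

print(missing_keywords(skill_text, expected))  # ['ENOTEMPTY', 'teardown']
```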
3. Descriptive Naming
Use active voice, verb-first:
- ✅ instead of
- ✅ instead of
4. Token Efficiency (Critical)
Problem: getting-started and frequently referenced skills load into every conversation. Every token matters.
Target word counts:
- getting-started workflows: <150 words each
- Frequently loaded skills: <200 words total
- Other skills: <500 words (still be concise)
Techniques:
Move details to tool help:
```bash
# Wrong: List all parameters in SKILL.md
search-conversations supports --text, --both, --after DATE, --before DATE, --limit N

# Correct: Reference --help
search-conversations supports multiple modes and filters. Run --help for details.
```
Use cross-references:
```markdown
# Wrong: Repeat workflow details
When searching, dispatch subagents with templates...
[20 lines of repeated instructions]

# Correct: Reference other skills
Always use subagents (saves 50-100x context). Required: Use [other-skill-name] workflow.
```
Compress examples:
```markdown
# Wrong: Verbose example (42 words)
Your partner: "How did we handle authentication errors in React Router before?"
You: I'll search past conversations for React Router authentication patterns.
[Dispatch subagent with search query: "React Router authentication error handling 401"]

# Correct: Concise example (20 words)
Partner: "How did we handle authentication errors in React Router before?"
You: Searching...
[Dispatch subagent → integrate]
```
Eliminate redundancy:
- Don't repeat content already in cross-referenced skills
- Don't explain what's obvious from the command
- Don't provide multiple examples for the same pattern
Validation:
```bash
wc -w skills/path/SKILL.md
# getting-started workflows: target <150 each
# other frequently loaded: target <200 total
```
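The same budget check can be scripted so it runs over many skills at once. This is a small sketch under stated assumptions: the category labels are hypothetical names for the three targets listed above, and words are counted by splitting on whitespace, the same unit `wc -w` reports.

```python
# Word-count targets from this section; the category keys are illustrative labels.
WORD_BUDGETS = {
    "getting-started": 150,
    "frequently-loaded": 200,
    "other": 500,
}


def count_words(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())


def within_budget(text: str, category: str) -> bool:
    """True when the text is strictly under the target for its category."""
    return count_words(text) < WORD_BUDGETS[category]
```

For example, `within_budget(open("skills/path/SKILL.md").read(), "other")` reports whether that file stays under 500 words.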
Name after what you do or core insight:
- ✅ >
- ✅ instead of
- ✅ > data-structure-refactoring
- ✅ >
Gerunds (-ing) are good for processes:
- , ,
- Active, describes what you're doing
5. Cross-Reference Other Skills
When writing documentation that references other skills:
Only use the skill name with clear required markers:
- ✅ Good: **Required Subskill:** Use superpowers:test-driven-development
- ✅ Good: **Required Background:** You must understand superpowers:systematic-debugging
- ❌ Poor: See skills/testing/test-driven-development (unclear if required)
- ❌ Poor: @skills/testing/test-driven-development/SKILL.md (forces immediate loading, wastes context)
Why no @ links: The @ syntax forces the file to load immediately, consuming 200k+ context before you need it.
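Stray @ links are easy to miss in review, so a quick scan can help. The sketch below is a heuristic with a hypothetical function name: it treats an `@` that starts a whitespace-delimited token as a file link, which skips email addresses but may not match every syntax an agent actually expands.

```python
import re

# "@" at the start of a token, followed by path-like characters.
AT_LINK = re.compile(r"(?<!\S)@[\w./-]+")


def find_at_links(skill_text: str) -> list[str]:
    """Return @-style file references, without trailing sentence periods."""
    return [match.rstrip(".") for match in AT_LINK.findall(skill_text)]


text = (
    "**Required Subskill:** Use superpowers:test-driven-development\n"
    "See @skills/testing/test-driven-development/SKILL.md.\n"
    "Contact user@example.com for access.\n"
)
print(find_at_links(text))  # ['@skills/testing/test-driven-development/SKILL.md']
```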
Flowchart Usage
```dot
digraph when_flowchart {
    "Need to show information?" [shape=diamond];
    "Decision where I might go wrong?" [shape=diamond];
    "Use markdown" [shape=box];
    "Small inline flowchart" [shape=box];
    "Need to show information?" -> "Decision where I might go wrong?" [label="Yes"];
    "Decision where I might go wrong?" -> "Small inline flowchart" [label="Yes"];
    "Decision where I might go wrong?" -> "Use markdown" [label="No"];
}
```
Use flowcharts only for:
- Non-obvious decision points
- Process loops where you might stop early
- Deciding "when to use A vs B"
Never use flowcharts for:
- Reference material → tables, lists
- Code examples → Markdown code blocks
- Linear instructions → numbered lists
- Labels with no semantic meaning (step1, helper2)
See @graphviz-conventions.dot for graphviz style rules.
Visualize for your partner: Use render-graphs.js in this directory to render your skill's flowcharts as SVG:
```bash
./render-graphs.js ../some-skill           # Render each chart separately
./render-graphs.js ../some-skill --combine # Combine all charts into one SVG
```
Code Examples
One excellent example beats multiple mediocre ones
Choose the most relevant language:
- Testing techniques → TypeScript/JavaScript
- System debugging → Shell/Python
- Data processing → Python
Good examples:
- Fully runnable
- Well-commented, explains why
- From real scenarios
- Clearly demonstrates the pattern
- Can be adapted directly (not generic templates)
Don't:
- Implement in more than 5 languages
- Create fill-in-the-blank templates
- Write artificially constructed examples
You're good at language porting – one excellent example is enough.
File Organization
Self-Contained Skill
defense-in-depth/
  SKILL.md       # All content inline
Applicable when: All content fits, no large reference needed
Skill with Reusable Tools
condition-based-waiting/
  SKILL.md       # Overview + pattern
  example.ts     # Adaptable working code
Applicable when: Tools are reusable code, not just narrative
Skill with Large Reference
pptx/
  SKILL.md       # Overview + workflow
  pptxgenjs.md   # 600-line API reference
  ooxml.md       # 500-line XML structure
  scripts/       # Executable tools
Applicable when: Reference material is too large to inline
Iron Rule (Same as TDD)
No skill without a failing test
This applies to new skills and edits to existing skills.
Wrote the skill first then tested? Delete it. Start over.
Edited a skill without testing? Same violation.
No exceptions:
- Not for "simple additions"
- Not for "just adding a section"
- Not for "documentation updates"
- Don't keep untested changes as "reference"
- Don't "tweak" while running tests
- Delete means delete
Required Background: The superpowers:test-driven-development skill explains why this is important. The same principles apply to documentation.
Testing All Skill Types
Different skill types require different testing approaches:
Discipline-Enforcing Skills (Rules/Requirements)
Examples: TDD, verify before completion, design before coding
How to test:
- Academic questions: Do they understand the rules?
- Stress scenarios: Do they follow under pressure?
- Combined multiple stresses: Time + sunk cost + fatigue
- Identify rationalizations and add explicit rebuttals
Success criteria: Agent follows rules under maximum pressure
Technical Skills (How-To Guides)
Examples: condition-based-waiting, root-cause-tracing, defensive-programming
How to test:
- Application scenarios: Can they apply the technique correctly?
- Variant scenarios: Can they handle edge cases?
- Missing information test: Do they indicate if something is missing?
Success criteria: Agent successfully applies the technique to new scenarios
Pattern Skills (Mental Models)
Examples: reducing-complexity, information-hiding concepts
How to test:
- Recognition scenarios: Can they identify when the pattern applies?
- Application scenarios: Can they use the mental model?
- Counterexamples: Do they know when not to apply it?
Success criteria: Agent correctly identifies when/how to apply the pattern
Reference Skills (Documentation/API)
Examples: API docs, command references, library guides
How to test:
- Retrieval scenarios: Can they find the correct information?
- Application scenarios: Can they correctly use what they found?
- Coverage tests: Are common use cases covered?
Success criteria: Agent finds and correctly applies reference information
Common Rationalizations for Skipping Tests
| Rationalization | Reality |
|---|---|
| "The skill is obviously clear" | Clear to you ≠ clear to other agents. Test it. |
| "This is just reference material" | Reference material can have gaps and ambiguities. Test retrieval. |
| "Testing is overkill" | Untested skills always have issues. 15 minutes of testing saves hours. |
| "I'll test if there's a problem" | Problem = agent can't use the skill. Test before deployment. |
| "Testing is too tedious" | Testing is less tedious than debugging bad skills in production. |
| "I'm confident it's good" | Overconfidence guarantees problems. Test anyway. |
| "Academic review is enough" | Reading ≠ using. Test application scenarios. |
| "I don't have time to test" | Deploying an untested skill costs more time fixing it later. |
All of these mean: Test before deployment. No exceptions.
Making Skills Resistant to Rationalization
Discipline-enforcing skills (like TDD) need to resist rationalization. Agents are smart and will find loopholes under pressure.
Psychology Note: Understanding why persuasion techniques work helps you apply them systematically. See persuasion-principles.md for research foundations (Cialdini, 2021; Meincke et al., 2025), covering authority, commitment, scarcity, social proof, and belonging principles.
Explicitly Plug Each Loophole
Don't just state the rule – prohibit specific workarounds:
<Bad>
```markdown
Wrote code before tests? Delete it.
```
</Bad>
<Good>
```markdown
Wrote code before tests? Delete it. Start over.

No exceptions:
- Don't keep it as "reference"
- Don't "tweak" it while writing tests
- Don't look at it
- Delete means delete
```
</Good>
### Address "Spirit vs Letter" Debates
Add a foundational principle upfront:
```markdown
**Violating the letter of the rule is violating the spirit of the rule.**
```
This cuts off an entire category of "I followed the spirit" rationalizations.
Build a Rationalization Table
Capture rationalizations from baseline tests (see testing section below). Every excuse the agent uses goes into the table:
```markdown
| Rationalization | Reality |
|------|------|
| "Too simple to test" | Simple code still breaks. Testing takes 30 seconds. |
| "I'll test later" | Tests that pass immediately prove nothing. |
| "Writing tests later works just as well" | Writing tests later = "What does this do?" Writing tests first = "What should this do?" |
```
Create a Red Line List
Make it easy for agents to self-check if they're rationalizing:
```markdown
## Red Lines - Stop and Start Over
- Wrote code before tests
- "I've tested manually"
- "Writing tests later works just as well"
- "The spirit matters more than the ritual"
- "This case is different because..."

**All of these mean: Delete the code. Start over with TDD.**
```
Update CSO to Include Violation Symptoms
Add symptoms that signal you are about to violate the rule to the description:
```yaml
description: Use when implementing any feature or bugfix, before writing implementation code
```
Red-Green-Refactor for Skills
Follow the TDD cycle:
Red: Write Failing Tests (Baseline)
Run stress scenarios without the skill. Record behavior verbatim:
- What choices did they make?
- What rationalizations did they use (exact wording)?
- Which stresses triggered violations?
This is "observing test failure" – you must see what the agent naturally does before writing the skill.
Green: Write Minimal Skill
Write the skill specifically for those rationalizations. Don't add extra content for hypothetical cases.
Run the same scenarios with the skill. The agent should now comply.
Refactor: Plug Loopholes
Did the agent find new rationalizations? Add explicit rebuttals. Retest until it's bulletproof.
Testing Methodology: See @testing-skills-with-subagents.md for complete testing methods:
- How to write stress scenarios
- Types of stress (time, sunk cost, authority, fatigue)
- Systematically plugging loopholes
- Meta-testing techniques
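The bookkeeping behind this cycle can be sketched in a few lines: record each baseline run, then check whether the skill text quotes every rationalization that was observed. This is an illustrative sketch only. `BaselineRun` and `unaddressed_rationalizations` are hypothetical names, the sample data is invented, and running the scenarios with subagents is out of scope here.

```python
from dataclasses import dataclass, field


@dataclass
class BaselineRun:
    """What one stress scenario produced when run WITHOUT the skill."""
    scenario: str
    pressures: list[str]
    rationalizations: list[str] = field(default_factory=list)  # verbatim quotes


def unaddressed_rationalizations(runs: list[BaselineRun], skill_text: str) -> list[str]:
    """Return recorded rationalizations that the skill text never quotes."""
    haystack = skill_text.lower()
    missing: list[str] = []
    for run in runs:
        for quote in run.rationalizations:
            if quote.lower() not in haystack and quote not in missing:
                missing.append(quote)
    return missing


# Invented sample data for illustration.
runs = [
    BaselineRun("deadline in 10 minutes", ["time"],
                ["Too simple to test", "I'll test later"]),
    BaselineRun("200 lines already written", ["sunk cost"],
                ["I'll test later", "This case is different"]),
]
skill_text = (
    '| "Too simple to test" | Simple code still breaks. |\n'
    '| "I\'ll test later" | Tests that pass immediately prove nothing. |\n'
)
print(unaddressed_rationalizations(runs, skill_text))  # ['This case is different']
```

An empty result only shows that each quote appears somewhere in the skill; the Green and Refactor steps still require rerunning the scenarios to see whether the agent complies.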
Anti-Patterns
Narrative Examples
"In the session on 2025-10-03, we discovered empty projectDir caused..."
Why bad: Too specific, not reusable
Multi-Language Dilution
example-js.js, example-py.py, example-go.go
Why bad: Mediocre quality, maintenance burden
Code in Flowcharts
```dot
step1 [label="import fs"];
step2 [label="read file"];
```
Why bad: Can't copy-paste, hard to read
Generic Labels
helper1, helper2, step3, pattern4
Why bad: Labels should have semantic meaning
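Generic labels follow a recognizable shape, so a rough scan of a flowchart source can flag them. The pattern below is an assumption built from the examples above (the prefixes step, helper, and pattern followed by digits); it is a heuristic, not a Graphviz parser, and the function name is hypothetical.

```python
import re

# Prefixes taken from the examples above; extend the list as needed.
GENERIC_LABEL = re.compile(r"\b(?:step|helper|pattern)\d+\b", re.IGNORECASE)


def generic_labels(dot_source: str) -> list[str]:
    """Return the distinct generic-looking identifiers found, sorted."""
    return sorted(set(GENERIC_LABEL.findall(dot_source)))


dot_source = '''
step1 [label="import fs"];
step2 [label="read file"];
"Check condition" -> step2;
'''
print(generic_labels(dot_source))  # ['step1', 'step2']
```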
Stop: Before Moving to Next Skill
After writing any skill, you must stop and complete the deployment process.
Don't:
- Batch-create multiple skills without testing each one
- Move to the next skill before validating the current one
- Skip testing because "batch processing is more efficient"
The deployment checklist below is mandatory for every skill.
Deploying untested skills = deploying untested code. This is a violation of quality standards.
Skill Creation Checklist (TDD-Adapted)
Important: Use TodoWrite to create todos for each checklist item below.
Red Phase - Write Failing Tests:
Green Phase - Write Minimal Skill:
Refactor Phase - Plug Loopholes:
Quality Check:
Deployment:
Discovery Workflow
How future Claude instances find your skills:
- Encounter a problem ("Tests are flaky")
- Find skill (description matches)
- Scan overview (Is this relevant?)
- Read pattern (Quick reference table)
- Load example (Only when implementing)
Optimize for this flow - Put searchable terms upfront and throughout.
Summary
Creating skills is TDD for process documentation.
Same iron rule: No skill without a failing test.
Same cycle: Red (baseline) → Green (write skill) → Refactor (plug loopholes).
Same benefits: Higher quality, fewer surprises, bulletproof results.
If you follow TDD for code, you should follow it for skills. It's the same discipline applied to documentation.