Skill Tester
The agent validates skill packages for structure compliance, tests Python scripts for syntax and stdlib-only imports, and scores quality across four dimensions (documentation, code quality, completeness, usability) with letter grades and improvement recommendations. It supports BASIC, STANDARD, and POWERFUL tier classification.
Quick Start
```bash
# Validate skill structure and documentation
python skill_validator.py engineering/my-skill --tier POWERFUL --json
# Test all Python scripts in a skill
python script_tester.py engineering/my-skill --timeout 30
# Score quality with improvement roadmap
python quality_scorer.py engineering/my-skill --detailed --minimum-score 75
```
Core Workflows
Workflow 1: Validate a New Skill
- Run `skill_validator.py` with a target tier to check structure, frontmatter, required sections, and scripts
- Review errors (blocking) and warnings (non-blocking) in the report
- Fix all errors -- missing SKILL.md, invalid frontmatter, external imports
- Validation checkpoint: Score >= 60; zero errors; all scripts pass
```bash
python skill_validator.py engineering/my-skill --tier STANDARD --json
```
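The validation checkpoint above can be expressed as a small predicate. This is a minimal sketch, not the validator's own code: the function name and the shape of its inputs (a numeric score, a list of error messages, and a dict of per-script pass flags) are assumptions made for illustration.

```python
def passes_validation_checkpoint(score, errors, script_results):
    """Return True when the score is at least 60, no errors were reported,
    and every script passed.

    The argument shapes are hypothetical; adapt them to the actual report.
    """
    return score >= 60 and not errors and all(script_results.values())


print(passes_validation_checkpoint(72.5, [], {"analyze.py": True}))
print(passes_validation_checkpoint(72.5, ["Missing SKILL.md"], {"analyze.py": True}))
```

Keeping the checkpoint as one boolean makes it easy to reuse as a gate in scripts that wrap the validator.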
Workflow 2: Test Skill Scripts
- Run `script_tester.py` to execute syntax validation, import analysis, and runtime tests
- Review per-script results: argparse detection, output, sample data execution
- Fix failures: add `if __name__ == "__main__"` guards, replace external imports with stdlib
- Validation checkpoint: All scripts pass syntax; zero external imports; exits cleanly
```bash
python script_tester.py engineering/my-skill --timeout 60 --json
```
Workflow 3: Score and Improve Quality
- Run `quality_scorer.py` with `--detailed` for component-level breakdowns
- Review the prioritized improvement roadmap (up to 5 items)
- Address HIGH-priority items first (documentation gaps, missing error handling)
- Re-run to verify score improvement
- Validation checkpoint: Overall score >= 75; no dimension below 50%
```bash
python quality_scorer.py engineering/my-skill --detailed --minimum-score 75 --json
```
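The checkpoint for this workflow combines an overall threshold with a per-dimension floor. The sketch below assumes each dimension is scored on a 0-100 scale and uses the four equal 25% weights described in this document; the dictionary keys and function names are illustrative, not the scorer's actual fields.

```python
# Equal weights for the four dimensions; key names are hypothetical.
WEIGHTS = {
    "documentation": 0.25,
    "code_quality": 0.25,
    "completeness": 0.25,
    "usability": 0.25,
}


def overall_score(dimension_scores):
    """Weighted sum of per-dimension scores (each assumed to be 0-100)."""
    return sum(WEIGHTS[name] * dimension_scores[name] for name in WEIGHTS)


def passes_quality_checkpoint(dimension_scores, minimum=75.0):
    """Overall score >= minimum and no dimension below 50."""
    return (
        overall_score(dimension_scores) >= minimum
        and min(dimension_scores.values()) >= 50
    )


scores = {"documentation": 90, "code_quality": 80, "completeness": 85, "usability": 45}
print(overall_score(scores))              # 75.0
print(passes_quality_checkpoint(scores))  # False: usability is below 50
```

The example shows why both conditions matter: a skill can reach 75 overall while one dimension still falls under the floor.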
Tier Requirements
| Requirement | BASIC | STANDARD | POWERFUL |
|---|---|---|---|
| SKILL.md lines | 100+ | 200+ | 300+ |
| Python scripts | 1 (100-300 LOC) | 1-2 (300-500 LOC) | 2-3 (500-800 LOC) |
| Argparse | Basic | Subcommands | Multiple modes |
| Output formats | Single | JSON + text | JSON + text + validation |
| Error handling | Essential | Comprehensive | Advanced recovery |
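As a rough illustration of the SKILL.md line thresholds in the table, the following sketch counts non-blank lines and reports the highest tier whose minimum is met. It covers only the line-count row; the names `TIER_MIN_LINES` and `highest_tier_by_lines` are assumptions, and any boilerplate filtering the real validator applies is not modeled here.

```python
# Minimum SKILL.md line counts per tier, taken from the table above.
TIER_MIN_LINES = {"BASIC": 100, "STANDARD": 200, "POWERFUL": 300}


def count_substantive_lines(text):
    """Count lines that contain something other than whitespace."""
    return sum(1 for line in text.splitlines() if line.strip())


def highest_tier_by_lines(text):
    """Return the highest tier whose line threshold is met, or None."""
    count = count_substantive_lines(text)
    best = None
    for tier, minimum in TIER_MIN_LINES.items():  # ordered lowest to highest
        if count >= minimum:
            best = tier
    return best


doc = "content line\n\n" * 250  # 500 raw lines, 250 of them non-blank
print(count_substantive_lines(doc))  # 250
print(highest_tier_by_lines(doc))    # STANDARD
```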
Quality Scoring Dimensions
| Dimension | Weight | Measures |
|---|---|---|
| Documentation | 25% | SKILL.md depth, README clarity, reference quality |
| Code Quality | 25% | Complexity, error handling, output consistency |
| Completeness | 25% | Required files, sample data, expected outputs |
| Usability | 25% | Argparse help text, example clarity, ease of setup |
Grades: A+ (97+) through F (<40). Exit code 0 for A+ through C-, exit code 2 for D, exit code 1 for F.
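The grade-to-exit-code mapping can be sketched as follows. Only the rule stated above is taken from this document; treating D+ and D- like D, and the function name itself, are assumptions for illustration.

```python
def exit_code_for_grade(grade):
    """Map a letter grade to a process exit code.

    Stated rule: 0 for A+ through C-, 2 for D, 1 for F.
    Assumption: any grade starting with "D" (D+, D, D-) maps to 2.
    """
    if grade == "F":
        return 1
    if grade.startswith("D"):
        return 2
    return 0


for grade in ("A+", "C-", "D", "F"):
    print(grade, exit_code_for_grade(grade))
```

A CI job can then treat exit code 2 as a warning and exit code 1 as a hard failure, if that distinction is useful.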
CI/CD Integration
```yaml
# GitHub Actions example
- name: Validate Changed Skills
  run: |
    for skill in $(git diff --name-only | grep -E '^engineering/[^/]+/' | cut -d'/' -f1-2 | sort -u); do
      python engineering/skill-tester/scripts/skill_validator.py "$skill" --json
      python engineering/skill-tester/scripts/script_tester.py "$skill"
      python engineering/skill-tester/scripts/quality_scorer.py "$skill" --minimum-score 75
    done
```
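The shell pipeline in the example reduces a list of changed files to the set of affected skill directories. The same reduction is sketched here in Python to make the logic explicit; `changed_skills` is a hypothetical helper, and it takes the file list as input rather than calling git itself.

```python
import re

# Matches "engineering/<skill>/..." and captures "engineering/<skill>".
SKILL_PATH = re.compile(r"^(engineering/[^/]+)/")


def changed_skills(changed_files):
    """Return the sorted, de-duplicated skill directories touched by a diff."""
    skills = set()
    for path in changed_files:
        match = SKILL_PATH.match(path)
        if match:
            skills.add(match.group(1))
    return sorted(skills)


files = [
    "engineering/my-skill/SKILL.md",
    "engineering/my-skill/scripts/tool.py",
    "engineering/other-skill/README.md",
    "engineering/README.md",
    "docs/index.md",
]
print(changed_skills(files))  # ['engineering/my-skill', 'engineering/other-skill']
```

Files directly under `engineering/` are skipped because the pattern requires a path component after the skill name, matching the `grep -E` expression.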
Anti-Patterns
- Padding SKILL.md with filler -- line count thresholds measure substantive content; blank lines and boilerplate do not count
- External imports disguised as stdlib -- the import allowlist is manually maintained; if a legitimate stdlib module is flagged, add it to the allowlist
- Missing argparse help strings -- usability scoring requires `help=` parameters on every argument; empty help strings score zero
- No main guard -- scripts without `if __name__ == "__main__"` fail runtime tests when imported
- Relying on SKILL.md for usability -- usability is scored from scripts and README independently; a detailed SKILL.md does not compensate for missing help text
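One way to catch missing or empty argparse help strings before scoring is a small static check. The sketch below walks the syntax tree looking for `add_argument(...)` calls without a non-empty `help=` keyword; it is an illustration of the idea, not the scorer's implementation, and `arguments_missing_help` is a hypothetical name.

```python
import ast


def arguments_missing_help(source):
    """Return the first positional value of each add_argument(...) call
    that has no help= keyword or an empty literal help string."""
    missing = []
    for node in ast.walk(ast.parse(source)):
        is_add_argument = (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "add_argument"
        )
        if not is_add_argument:
            continue
        help_kw = next((kw for kw in node.keywords if kw.arg == "help"), None)
        empty_literal = (
            help_kw is not None
            and isinstance(help_kw.value, ast.Constant)
            and not help_kw.value.value
        )
        if help_kw is None or empty_literal:
            first = node.args[0] if node.args else None
            name = first.value if isinstance(first, ast.Constant) else "<unknown>"
            missing.append(name)
    return missing


sample = '''
import argparse
parser = argparse.ArgumentParser()
parser.add_argument("path", help="Skill directory")
parser.add_argument("--json", action="store_true")
parser.add_argument("--verbose", action="store_true", help="")
'''
print(arguments_missing_help(sample))  # ['--json', '--verbose']
```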
Troubleshooting
| Problem | Cause | Solution |
|---|---|---|
| Line-count error despite sufficient content | Validator counts only non-blank lines; blank lines inflate raw line count but are excluded from the tally | Remove excessive blank lines or add more substantive content sections to meet the tier threshold |
| YAML frontmatter parse failure | Frontmatter contains invalid YAML syntax (unquoted colons, tabs instead of spaces, missing closing `---`) | Validate the frontmatter locally; ensure the closing `---` marker is present on its own line |
| External import false positive | The stdlib module allowlist in the checking scripts is manually maintained and may not include every standard library module | Add the missing module name to the allowlist set in the relevant script, or restructure the import |
| Script execution timeout during testing | Script requires interactive input, enters an infinite loop, or performs long-running computation | Increase the `--timeout` value, add early-exit logic for missing arguments, or ensure scripts exit cleanly when no input is provided |
| Tier compliance check fails despite passing individual checks | `_validate_tier_compliance` only examines a fixed set of critical checks; other failures (e.g., missing directories) are reported separately | Fix the specific critical checks listed in the error message; review the requirements defined for the target tier |
| Quality scorer reports low usability despite good documentation | Usability dimension scores help text inside scripts, usage sections, and practical example files independently of SKILL.md content | Add help strings with `help=` parameters, include a usage section in README.md, and place sample/example files in the skill's sample data directory |
| `--json` flag produces no output | Script raised an unhandled exception before reaching the output formatter; errors are written to stderr | Run with `--verbose` to see the full traceback on stderr, then address the underlying exception |
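For the frontmatter row in particular, two of the listed causes (a missing closing `---` and tab indentation) can be checked with the standard library alone. This is a sketch under that narrow scope: it does not parse YAML, and both function names are hypothetical.

```python
def frontmatter_block(text):
    """Return the lines between the opening and closing '---' markers,
    or None when either marker is missing."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None
    for index, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            return lines[1:index]
    return None


def frontmatter_problems(text):
    """List simple delimiter and tab-indentation problems."""
    block = frontmatter_block(text)
    if block is None:
        return ["missing opening or closing '---' marker"]
    problems = []
    for number, line in enumerate(block, start=2):  # file line 1 is '---'
        indent = line[: len(line) - len(line.lstrip())]
        if "\t" in indent:
            problems.append(f"line {number}: tab used for indentation")
    return problems


print(frontmatter_problems("---\nname: my-skill\n---\n# Title\n"))  # []
print(frontmatter_problems("---\nname: my-skill\n# Title\n"))
```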
Success Criteria
- Structure pass rate above 95%: Validated skills pass all required-file and directory-structure checks on first run in at least 95% of cases.
- Script syntax zero-defect: Every Python script in a validated skill parses without syntax errors.
- Standard library compliance 100%: No external (non-stdlib) imports detected across all validated scripts.
- Quality score consistency within 5 points: Re-running the quality scorer on an unchanged skill produces scores that vary by no more than 5 points across runs.
- Execution time under 10 seconds per skill: Full validation, testing, and scoring pipeline completes in under 10 seconds for a single skill with up to 3 scripts.
- Actionable recommendation density: Every skill scoring below 75/100 receives at least 3 prioritized improvement suggestions in the roadmap.
- CI/CD gate reliability: When integrated as a GitHub Actions step, the tool exits with non-zero status for every skill that fails critical checks, blocking the merge.
Scope & Limitations
Covers:
- Structural validation of skill directories against tier-specific requirements (BASIC, STANDARD, POWERFUL)
- Static analysis of Python scripts including syntax checking, import validation, argparse detection, and main guard verification
- Multi-dimensional quality scoring across documentation, code quality, completeness, and usability
- Dual output formatting (JSON for CI/CD pipelines, human-readable for developer consumption)
Does NOT cover:
- Functional correctness of script logic or algorithm accuracy -- the tester verifies structure and conventions, not business logic
- Performance benchmarking or memory profiling of scripts -- see `engineering/performance-profiler` for runtime analysis
- Security vulnerability scanning of script code -- see `engineering/skill-security-auditor` for dependency and code security audits
- Cross-skill dependency resolution or integration testing -- skills are validated in isolation without verifying inter-skill compatibility
Integration Points
| Skill | Integration | Data Flow |
|---|---|---|
| `engineering/skill-security-auditor` | Run security audit after validation passes | Validation confirms structure compliance, then the auditor scans for vulnerabilities in the same skill path |
| `engineering/ci-cd-pipeline-builder` | Embed skill-tester as a quality gate stage | Pipeline builder generates workflow YAML that invokes the validator, script tester, and quality scorer sequentially |
| `engineering/changelog-generator` | Feed quality score deltas into changelog entries | Compare JSON output between releases to surface quality improvements or regressions |
| `engineering/pr-review-expert` | Attach validation report to pull request reviews | `skill_validator.py --json` output is posted as a PR comment for reviewer context |
| `engineering/performance-profiler` | Complement structural testing with runtime profiling | After the script tester confirms execution succeeds, the profiler measures execution time and resource usage |
| `engineering/tech-debt-tracker` | Track quality score trends over time | Periodic scoring output is ingested to detect score degradation and flag technical debt |
Tool Reference
skill_validator.py
Purpose: Validates a skill directory's structure, documentation, and Python scripts against the claude-skills ecosystem standards. Checks required files, YAML frontmatter, required SKILL.md sections, directory layout, script syntax, import compliance, and tier-specific requirements.
Usage:
```bash
python skill_validator.py <skill_path> [--tier TIER] [--json] [--verbose]
```
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `skill_path` | positional | Yes | n/a | Path to the skill directory to validate |
| `--tier` | option | No | None | Target tier for validation: `BASIC`, `STANDARD`, or `POWERFUL` |
| `--json` | flag | No | Off | Output results in JSON format instead of human-readable text |
| `--verbose` | flag | No | Off | Enable verbose logging to stderr |
Example:
```bash
python skill_validator.py engineering/my-skill --tier POWERFUL --json
```
Output Formats:
- Human-readable (default): Grouped report with STRUCTURE VALIDATION, SCRIPT VALIDATION, ERRORS, WARNINGS, and SUGGESTIONS sections. Displays overall score out of 100 with compliance level (EXCELLENT, GOOD, ACCEPTABLE, NEEDS_IMPROVEMENT, POOR).
- JSON (`--json`): Object whose fields include a dict mapping each check name to its pass/message/score.
Exit codes: zero on success (score >= 60 and no errors), non-zero on failure.
script_tester.py
Purpose: Tests all Python scripts within a skill's scripts directory. Performs syntax validation via AST parsing, import analysis for stdlib compliance, argparse implementation verification, main guard detection, runtime execution with timeout protection, functionality testing, sample data processing against the skill's sample files, and output format compliance checks.
Usage:
```bash
python script_tester.py <skill_path> [--timeout SECONDS] [--json] [--verbose]
```
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `skill_path` | positional | Yes | n/a | Path to the skill directory containing scripts to test |
| `--timeout` | option | No | | Timeout in seconds for each script execution test |
| `--json` | flag | No | Off | Output results in JSON format instead of human-readable text |
| `--verbose` | flag | No | Off | Enable verbose logging to stderr |
Example:
```bash
python script_tester.py engineering/my-skill --timeout 60 --json
```
Output Formats:
- Human-readable (default): Report with SUMMARY (total/passed/partial/failed counts), GLOBAL ERRORS, and per-script sections showing status, execution time, individual test results, errors, and warnings.
- JSON (`--json`): Object whose fields include summary counts with an overall status and a dict of per-script results.
Exit codes distinguish three outcomes: full success, failure or global errors, and partial success.
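Runtime execution with timeout protection can be sketched with `subprocess.run`. The helper below is a hypothetical illustration, not the tester's code: it runs the current interpreter with the given arguments, closes stdin so an interactive prompt cannot hang the run, and reports a timeout as its own status. The status labels are assumptions.

```python
import subprocess
import sys


def run_with_timeout(args, timeout):
    """Run `python <args>` and classify the result.

    Returns a dict with a hypothetical status label and the return code.
    """
    try:
        completed = subprocess.run(
            [sys.executable, *args],
            capture_output=True,
            text=True,
            stdin=subprocess.DEVNULL,  # scripts waiting for input get EOF
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return {"status": "timeout", "returncode": None}
    status = "passed" if completed.returncode == 0 else "failed"
    return {"status": status, "returncode": completed.returncode}


print(run_with_timeout(["-c", "print('ok')"], timeout=30)["status"])
print(run_with_timeout(["-c", "import time; time.sleep(5)"], timeout=0.5)["status"])
```

In practice `args` would be a script path followed by its arguments, with the timeout taken from `--timeout`.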
quality_scorer.py
Purpose: Provides a comprehensive multi-dimensional quality assessment for a skill. Evaluates four equally weighted dimensions — Documentation (25%), Code Quality (25%), Completeness (25%), and Usability (25%) — and produces an overall score, letter grade (A+ through F), tier recommendation, and a prioritized improvement roadmap.
Usage:
```bash
python quality_scorer.py <skill_path> [--detailed] [--minimum-score SCORE] [--json] [--verbose]
```
Parameters:
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| `skill_path` | positional | Yes | n/a | Path to the skill directory to assess |
| `--detailed` | flag | No | Off | Show detailed component scores within each dimension |
| `--minimum-score` | option | No | | Minimum acceptable overall score; exits with error code if the score falls below this threshold |
| `--json` | flag | No | Off | Output results in JSON format instead of human-readable text |
| `--verbose` | flag | No | Off | Enable verbose logging to stderr |
Example:
```bash
python quality_scorer.py engineering/my-skill --detailed --minimum-score 75 --json
```
Output Formats:
- Human-readable (default): Report with overall score and letter grade, per-dimension scores with weights, summary statistics (highest/lowest dimension, dimensions above 70%, dimensions below 50%), and a prioritized improvement roadmap (up to 5 items with HIGH/MEDIUM/LOW priority). When `--detailed` is used, component-level breakdowns appear under each dimension.
- JSON (`--json`): Object whose fields include per-dimension entries (name/weight/score/details/suggestions) and a list of priority/dimension/suggestion/current_score objects.
Exit codes: 0 for grades A+ through C-, 1 for grade F or when the score is below `--minimum-score`, 2 for grade D.