Loading...
Loading...
Found 2 Skills
Benchmark any agent skill to measure whether it actually improves performance. Use when the user wants to evaluate, test, or compare a skill against baseline, or when they mention "benchmark", "eval", "skill performance", or "does this skill help". Runs isolated eval sessions with and without the skill, grades outputs via layered grading (deterministic checks + LLM-as-judge), analyzes behavioral signals, and generates a comparison report with a USE / DON'T USE verdict.
Create new skills, modify and improve existing skills, and measure skill performance. Enhanced version with quick commands. Use when users want to create a skill from scratch, update or optimize an existing skill, run evals to test a skill, benchmark skill performance with variance analysis, or optimize a skill's description for better triggering accuracy. Triggers on phrases like "make a skill", "create a new skill", "build a skill for", "improve this skill", "optimize my skill", "test my skill", "turn this into a skill", "skill description optimization", or "help me create a skill".