This skill should be used when the user wants to run baseline evaluations on existing agent skills, regenerate transcripts after a model upgrade, or check whether a skill still solves the gap it was authored for. Common triggers include "rerun the baselines", "re-eval skill X", "test all the skills", "check for skill drift", and "run the evals". The workflow bakes in verbatim transcript capture (no paraphrasing), deterministic-only grading (regex / contains / file_exists, never LLM-as-judge), and the iteration-N workspace convention. Skip it when authoring a new skill (use skill-creator) or when modifying skill content directly.
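The three deterministic grader kinds named above (regex / contains / file_exists) suggest what an `evals.json` entry roughly looks like. The sketch below is illustrative only: every field name here is an assumption, and the authoritative schema is documented in `skills/skill-creator/references/evals-json.md`.

```json
{
  "evals": [
    {
      "id": "1",
      "name": "date-formatting",
      "prompt": "Write a function that formats a Date as YYYY-MM-DD.",
      "graders": [
        { "type": "contains", "value": "getUTCFullYear" },
        { "type": "regex", "pattern": "padStart\\(2, ['\"]0['\"]\\)" },
        { "type": "file_exists", "path": "src/format-date.ts" }
      ]
    }
  ]
}
```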
## Installation

```
npx skill4agent add zrosenbauer/skills skill-eval
```

## Prerequisites

Every skill to be evaluated needs an `evals.json`. If a skill has none, author one first (use `/skill-creator`). The grading tooling lives in `packages/skill-tools`.

## Arguments

`$ARGUMENTS` is either a single `<skill-name>` or `--all`:

- `<skill-name>`: resolve the skill by checking `skills/<name>/` and then `.agents/skills/<name>/`. If the resolved skill has no `evals.json`, stop and direct the user to `/skill-creator`.
- `--all`: discover every skill that has both a `SKILL.md` and an `evals.json`, then run each in turn.

## Workspace and iterations

All run artifacts go under the skill's workspace: `skills/<skill-name>/.workspace/` (or `.agents/skills/<skill-name>/.workspace/`). Each run gets its own iteration directory: if no iterations exist yet, start at `1`; otherwise scan the existing `iteration-N/` directories and use `max(N) + 1`. Write everything for the run under `skills/<skill-name>/.workspace/iteration-<N>/` and nothing outside `.workspace/`.

## Running the evals

For each eval in `evals.json`, dispatch two subagent runs (`subagent_type: general-purpose`): a baseline run without the skill and a run with it. Capture each subagent's report verbatim.

Baseline (without skill) prompt:

```
Execute this task exactly:

[eval.prompt]

No skill is loaded for this task. After attempting it, report what you did,
what decisions you made and why, and anything you found tricky. Report
verbatim — do not polish, do not summarize. Include any code you wrote
inline so it can be analyzed.
```

Save the verbatim report to `skills/<skill-name>/.workspace/iteration-<N>/eval-<id>-<eval_name>/without_skill/transcript.md`.

With-skill prompt:

```
Execute this task exactly:

[eval.prompt]

The skill `<skill-name>` is available — apply its rules and patterns.
After attempting it, report what you did, what decisions you made and why,
and anything you found tricky. Report verbatim — do not polish, do not
summarize. Include any code you wrote inline.
If you considered skipping any rule from the skill, capture the exact
reasoning verbatim — that's the kind of failure mode the skill needs to
catch.
```

Save the verbatim report to `skills/<skill-name>/.workspace/iteration-<N>/eval-<id>-<eval_name>/with_skill/transcript.md`.

## Grading

Grade each transcript deterministically with `skill-tools eval`:

```
node packages/skill-tools/dist/index.mjs eval <skill-name> <eval.id> \
  --variant <with_skill|without_skill> \
  --iteration <N> \
  --transcript <path-to-transcript.md>
```

This writes the grading result (`grading.json`). Once every eval has been graded in both variants, produce the benchmark:

```
node packages/skill-tools/dist/index.mjs benchmark <skill-name>
```

which writes `benchmark.json` and `benchmark.md` for the iteration. To inspect results, run `pnpm skill-tools view <skill-name>`.

## References

- `evals.json` format: `skills/skill-creator/references/evals-json.md`
- Grading tooling: `packages/skill-tools/`
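The iteration bookkeeping described above (start at `1` when the workspace is empty, otherwise `max(N) + 1` over the existing `iteration-N/` directories) can be sketched in plain POSIX shell. `next_iteration` is a hypothetical helper for illustration, not part of `skill-tools`:

```shell
# Compute the next iteration number for a skill's .workspace directory.
next_iteration() {
  ws="$1"   # e.g. skills/my-skill/.workspace
  max=0
  for d in "$ws"/iteration-*/; do
    [ -d "$d" ] || continue          # unmatched glob stays literal; skip it
    n="${d%/}"; n="${n##*iteration-}" # extract the numeric suffix
    case "$n" in (*[!0-9]*|'') continue ;; esac
    [ "$n" -gt "$max" ] && max="$n"
  done
  echo $((max + 1))                  # empty workspace yields 1
}

# Demo against a throwaway workspace with iterations 1 and 3
ws=$(mktemp -d)
mkdir -p "$ws/iteration-1" "$ws/iteration-3"
next_iteration "$ws"   # prints 4
```

Scanning for `max(N)` rather than counting directories keeps the numbering correct even if an old iteration was deleted.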
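To make the deterministic grader semantics concrete, here is a minimal shell sketch of the three kinds. This is not the `skill-tools` implementation (that lives in `packages/skill-tools/`); it only shows what regex / contains / file_exists checks amount to when applied to a transcript:

```shell
# A stand-in transcript to grade
transcript=$(mktemp)
printf 'used getUTCFullYear and padStart(2, "0")\n' > "$transcript"

# regex: extended regular expression match against the transcript
grade_regex()       { grep -Eq "$1" "$transcript" && echo pass || echo fail; }
# contains: fixed-string match against the transcript
grade_contains()    { grep -Fq "$1" "$transcript" && echo pass || echo fail; }
# file_exists: the named path exists on disk
grade_file_exists() { [ -e "$1" ] && echo pass || echo fail; }

grade_regex 'padStart\(2, "0"\)'   # prints pass
grade_contains 'getUTCFullYear'    # prints pass
grade_file_exists "$transcript"    # prints pass
```

Because every check is a plain string, regex, or filesystem test, a grading run is fully reproducible, which is the point of banning LLM-as-judge here.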