Run the trigger evaluation pipeline — classify, analyze, and optionally compare against a baseline. Only run when explicitly asked — evals are expensive.
Install the skill:

```shell
npx skill4agent add datocms/agent-skills eval-triggers
```

Arguments: `$ARGUMENTS`

1. Run the Claude trigger eval (`run_claude_trigger_eval.py`):

   ```shell
   python3 evals/scripts/run_claude_trigger_eval.py \
     --repo-root . \
     --output-dir evals/results/adHocRuns/<date>-<label>/raw \
     --source combined
   ```

2. Run the Codex trigger eval (`run_codex_trigger_eval.py`):

   ```shell
   python3 evals/scripts/run_codex_trigger_eval.py \
     --repo-root . \
     --output-dir evals/results/adHocRuns/<date>-<label>/raw
   ```

3. Analyze the raw results:

   ```shell
   python3 evals/scripts/analyze_trigger_results.py \
     --results-dir evals/results/adHocRuns/<date>-<label>/raw \
     --output-json evals/results/adHocRuns/<date>-<label>/analysis.json \
     --output-markdown evals/results/adHocRuns/<date>-<label>/analysis.md
   ```

4. Optionally, compare against a baseline run from `evals/results/adHocRuns/`:

   ```shell
   python3 evals/scripts/compare_trigger_runs.py \
     --baseline <baseline>/analysis.json \
     --candidate evals/results/adHocRuns/<date>-<label>/analysis.json \
     --output-markdown evals/results/adHocRuns/<date>-<label>/comparison.md \
     --output-json evals/results/adHocRuns/<date>-<label>/comparison.json
   ```
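The four steps above can be chained into one labeled run. The function below is a hypothetical wrapper (a sketch, not part of the repo): the `run_trigger_eval` name and the `YYYY-MM-DD-<label>` directory naming are my own assumptions, and it presumes the `evals/scripts/*.py` scripts exist at the paths this page uses.

```shell
# Hypothetical wrapper (sketch): runs the full trigger-eval pipeline for one
# labeled run. Usage: run_trigger_eval <label> [baseline-run-dir]
run_trigger_eval() {
  label="$1"
  baseline="${2:-}"
  # Assumed naming convention: <date>-<label> under adHocRuns
  run_dir="evals/results/adHocRuns/$(date +%Y-%m-%d)-${label}"
  mkdir -p "${run_dir}/raw"

  # Step 1: Claude trigger eval
  python3 evals/scripts/run_claude_trigger_eval.py \
    --repo-root . \
    --output-dir "${run_dir}/raw" \
    --source combined

  # Step 2: Codex trigger eval into the same raw directory
  python3 evals/scripts/run_codex_trigger_eval.py \
    --repo-root . \
    --output-dir "${run_dir}/raw"

  # Step 3: analyze the combined raw results
  python3 evals/scripts/analyze_trigger_results.py \
    --results-dir "${run_dir}/raw" \
    --output-json "${run_dir}/analysis.json" \
    --output-markdown "${run_dir}/analysis.md"

  # Step 4 (optional): compare against a baseline run directory, if given
  if [ -n "${baseline}" ]; then
    python3 evals/scripts/compare_trigger_runs.py \
      --baseline "${baseline}/analysis.json" \
      --candidate "${run_dir}/analysis.json" \
      --output-markdown "${run_dir}/comparison.md" \
      --output-json "${run_dir}/comparison.json"
  fi

  # Print the run directory so callers can find the artifacts
  printf '%s\n' "${run_dir}"
}
```

For example, `run_trigger_eval my-change evals/results/adHocRuns/<baseline>` would produce `analysis.md` and `comparison.md` under a dated directory for the `my-change` run.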