Benchmark any agent skill to measure whether it actually improves performance. Use when the user wants to evaluate, test, or compare a skill against a baseline, or when they mention "benchmark", "eval", "skill performance", or "does this skill help". Runs isolated eval sessions with and without the skill, grades outputs via layered grading (deterministic checks + LLM-as-judge), analyzes behavioral signals, and generates a comparison report with a USE / DON'T USE verdict.
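The layered-grading idea named above can be sketched in a few lines: a cheap deterministic check runs first, and only outputs that pass it reach the slower, costlier LLM judge. This is an illustrative sketch, not the skill's implementation; the function names, scoring scale, and 0.7 pass threshold are assumptions.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GradeResult:
    passed: bool
    score: float
    notes: str

def deterministic_check(output: str, required_substrings: list[str]) -> GradeResult:
    """Layer 1: fail fast if required content is missing from the output."""
    missing = [s for s in required_substrings if s not in output]
    if missing:
        return GradeResult(False, 0.0, f"missing: {missing}")
    return GradeResult(True, 1.0, "all required substrings present")

def layered_grade(
    output: str,
    required_substrings: list[str],
    llm_judge: Callable[[str], float],  # caller supplies a judge returning a 0-1 score
) -> GradeResult:
    """Layer 2: only consult the LLM judge when the deterministic layer passes."""
    first = deterministic_check(output, required_substrings)
    if not first.passed:
        return first
    score = llm_judge(output)
    return GradeResult(score >= 0.7, score, "LLM-as-judge score")
```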
Evaluate and rank agent results by metric or LLM judge for an AgentHub session.
Use when auditing Claude skills and commands for quality. Supports Quick Scan (changed skills only) and Full Stocktake modes with sequential subagent batch evaluation.
Use when measuring or improving agent quality and performance — set up evaluators, online monitoring, CI/CD quality gates, observability, or cost optimization. Triggers on: "evaluate my agent", "add evaluator", "measure quality", "quality gate", "run evals", "agent too slow", "why is it slow", "reduce latency", "set up observability", "CloudWatch dashboard", "how much does my agent cost", "cost optimization", "logs not showing up", "logs missing", "spans not found", "eval failing", "eval error", "dev traces", "local traces", "agentcore dev traces", "traces to CloudWatch". Not for debugging errors or crashes — use agents-debug. An agent that is slow but correct routes here; a broken agent routes to agents-debug.
Build automated evaluation suites for AI agents using golden datasets, rubrics, and regression gates.
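A regression gate of the kind this skill describes can be as small as a score comparison against a stored baseline that fails the CI job on any drop. The file layout, tolerance, and function below are illustrative assumptions, not part of any specific suite.

```python
import json
import sys
from pathlib import Path

def load_scores(path: str) -> dict[str, float]:
    # Expected shape: {"example_id": score, ...}
    return json.loads(Path(path).read_text())

def regression_gate(baseline_path: str, current_path: str, tolerance: float = 0.02) -> int:
    baseline = load_scores(baseline_path)
    current = load_scores(current_path)
    regressions = {
        ex_id: (old, current.get(ex_id, 0.0))
        for ex_id, old in baseline.items()
        if current.get(ex_id, 0.0) < old - tolerance
    }
    for ex_id, (old, new) in regressions.items():
        print(f"REGRESSION {ex_id}: {old:.2f} -> {new:.2f}")
    return 1 if regressions else 0  # nonzero exit code fails the CI job

if __name__ == "__main__":
    sys.exit(regression_gate("baseline_scores.json", "current_scores.json"))
```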
End-to-end GECX/CXAS/CES conversational agent lifecycle: build agents from requirements (PRD-to-agent), create and run evals (goldens, simulations, tool tests, callback tests), debug failures, and iterate to production quality. Use this skill whenever the user mentions GECX, CXAS, CES, SCRAPI, conversational agents, voice agents, audio agents, agent evals, pushing/pulling/linting agents, or agent instructions/callbacks/tools on the Google Customer Engagement Suite platform.
Use this skill for ANY question about creating test or evaluation datasets for LangChain agents. Covers generating datasets from traces (final_response, single_step, trajectory, RAG types), uploading to LangSmith, and managing evaluation data.
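For the upload step, a minimal sketch with the langsmith Python SDK might look like the following; it assumes LANGSMITH_API_KEY is set in the environment, the dataset name and example contents are invented for illustration, and exact keyword arguments can vary across SDK versions.

```python
from langsmith import Client

client = Client()

# Create a dataset to hold final_response-style examples.
dataset = client.create_dataset(
    dataset_name="agent-final-response-evals",  # hypothetical name
    description="Question/answer pairs distilled from production traces",
)

# Upload a couple of examples; inputs and outputs are parallel lists of dicts.
client.create_examples(
    inputs=[
        {"question": "What is the refund window?"},
        {"question": "How do I reset my password?"},
    ],
    outputs=[
        {"answer": "30 days from delivery."},
        {"answer": "Use the 'Forgot password' link on the sign-in page."},
    ],
    dataset_id=dataset.id,
)
```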