Search Results: ai-evaluation

Found 19 Skills

AI & Machine Learning | microsoft/agent-skills

azure-ai-evaluation-py

Azure AI Evaluation SDK for Python. Use for evaluating generative AI applications with quality, safety, agent, and custom evaluators. Triggers: "azure-ai-evaluation", "evaluators", "GroundednessEvaluator", "evaluate", "AI quality metrics", "RedTeam", "agent evaluation".

1 scripts/Checked

AI & Machine Learning | oldwinter/skills

ai-evaluation-evals

Create AI evaluation plans with benchmarks, rubrics, and error analysis workflows.

AI & Machine Learning | refoundai/lenny-skills

ai-evals

Help users create and run AI evaluations. Use when someone is building evals for LLM products, measuring model quality, creating test cases, designing rubrics, or trying to systematically measure AI output quality.

AI & Machine Learning | affaan-m/everything-claud...

eval-harness

Formal evaluation framework for Claude Code sessions that implements eval-driven development (EDD) principles.

AI & Machine Learning | refoundai/lenny-skills

building-with-llms

Help users build effective AI applications. Use when someone is building with LLMs, writing prompts, designing AI features, implementing RAG, creating agents, running evals, or trying to improve AI output quality.

AI & Machine Learning | lebsral/dspy-programming-...

ai-improving-accuracy

Measure and improve how well your AI works. Use when AI gives wrong answers, accuracy is bad, responses are unreliable, you need to test AI quality, evaluate your AI, write metrics, benchmark performance, optimize prompts, improve results, or systematically make your AI better. Covers DSPy evaluation, metrics, and optimization.

AI & Machine Learning | mlflow/skills

mlflow-agent

Master dispatcher for all MLflow workflows. Use this skill when the user wants to do anything with MLflow — tracing, evaluating, debugging, or improving an agent. Routes to the right MLflow sub-skill automatically. Triggers on: "use mlflow", "help with mlflow", "mlflow agent", "add mlflow to my project", "trace my agent", "evaluate my agent", or any MLflow task without a specific skill in mind.

AI & Machine Learning | yonatangross/orchestkit

golden-dataset-validation

Use when validating golden dataset quality. Runs schema checks, duplicate detection, and coverage analysis to ensure dataset integrity for AI evaluation.

AI & Machine Learning | yonatangross/orchestkit

golden-dataset-curation

Use when creating or improving golden datasets for AI evaluation. Defines quality criteria, curation workflows, and multi-agent analysis patterns for test data.

AI & Machine Learning | wandb/skills

wandb-primary

Comprehensive primary skill for agents working with Weights & Biases. Covers both the W&B SDK (training runs, metrics, artifacts, sweeps) and the Weave SDK (GenAI traces, evaluations, scorers). Includes helper libraries, gotcha tables, and data analysis patterns. Use this skill whenever the user asks about W&B runs, Weave traces, evaluations, training metrics, loss curves, model comparisons, or any Weights & Biases data — even if they don't say "W&B" explicitly.

2 scripts/Checked

AI & Machine Learning | sharpdeveye/maestro

iterate

Use when the workflow needs to self-correct, improve over time, or establish feedback loops and evaluation cycles.

AI & Machine Learning | neolabhq/context-engineer...

judge-with-debate

Evaluate solutions through multi-round debate between independent judges until they reach consensus.
