Search Results: ai-evaluation

Found 16 Skills

AI & Machine Learning · arize-ai/arize-skills

arize-dataset

INVOKE THIS SKILL when creating, managing, or querying Arize datasets and examples. Covers dataset CRUD, appending examples, exporting data, and file-based dataset creation using the ax CLI.

AI & Machine Learning · affaan-m/everything-claud...

eval-harness

A formal evaluation framework for Claude Code sessions that implements eval-driven development (EDD) principles.

AI & Machine Learning · yonatangross/orchestkit

golden-dataset-validation

Use when validating golden dataset quality. Runs schema checks, duplicate detection, and coverage analysis to ensure dataset integrity for AI evaluation.

AI & Machine Learning · mlflow/skills

mlflow-agent

Master dispatcher for all MLflow workflows. Use this skill when the user wants to do anything with MLflow — tracing, evaluating, debugging, or improving an agent. Routes to the right MLflow sub-skill automatically. Triggers on: "use mlflow", "help with mlflow", "mlflow agent", "add mlflow to my project", "trace my agent", "evaluate my agent", or any MLflow task without a specific skill in mind.

AI & Machine Learning · sharpdeveye/maestro

iterate

Use when the workflow needs to self-correct, improve over time, or establish feedback loops and evaluation cycles.

AI & Machine Learning · refoundai/lenny-skills

building-with-llms

Help users build effective AI applications. Use when someone is building with LLMs, writing prompts, designing AI features, implementing RAG, creating agents, running evals, or trying to improve AI output quality.

AI & Machine Learning · ruvnet/ruflo

gaia-submission

Walk through a complete GAIA benchmark→submit flow, from key resolution through HAL-compatible package generation.

AI & Machine Learning · rysweet/amplihack

eval-recipes-runner

Run Microsoft's eval-recipes benchmarks to validate amplihack improvements against baseline agents. Auto-activates when testing improvements, running evals, or benchmarking changes.

AI & Machine Learning · hamelsmu/evals-skills

write-judge-prompt

Design LLM-as-Judge evaluators for subjective criteria that code-based checks cannot handle. Use when a failure mode requires interpretation (tone, faithfulness, relevance, completeness). Do NOT use when the failure mode can be checked with code (regex, schema validation, execution tests). Do NOT use when you need to validate or calibrate the judge — use validate-evaluator instead.

AI & Machine Learning · coval-ai/coval-external-s...

get-results

Retrieve and analyze simulation results from a Coval run. Use when user wants to review evaluation outcomes or debug agent behavior.

AI & Machine Learning · yonatangross/orchestkit

golden-dataset-curation

Use when creating or improving golden datasets for AI evaluation. Defines quality criteria, curation workflows, and multi-agent analysis patterns for test data.

AI & Machine Learning · keyvaluesoftwaresystems/n...

netra-best-practices

Code-first Netra best-practices playbook covering setup, instrumentation, context tracking, custom spans/metrics, integration patterns, evaluation, simulation, and troubleshooting.
