Produces a concrete eval suite plan grounded in Microsoft's Eval Scenario Library and MS Learn agent evaluation guidance — scenario types, evaluation methods, quality signals, thresholds, and priority order — before any test cases are generated or evals are run.
Analyzes Copilot Studio evaluation CSV results using Microsoft's Triage & Improvement Playbook. Returns a SHIP / ITERATE / BLOCK verdict with root cause classification, diagnostic triage, prioritized remediation, and pattern analysis.
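The playbook's actual cutoffs and the Copilot Studio export schema are not reproduced here; purely as an illustration of the verdict gate, a sketch follows in which the thresholds and the `passed` column name are hypothetical assumptions, not the playbook's real criteria.

```python
import csv

# Hypothetical cutoffs for illustration only -- the real Triage &
# Improvement Playbook defines its own SHIP / ITERATE / BLOCK criteria.
SHIP_THRESHOLD = 0.90
BLOCK_THRESHOLD = 0.60

def triage_verdict(csv_path: str) -> str:
    """Return SHIP / ITERATE / BLOCK from an evaluation results CSV.

    Assumes a 'passed' column holding true/false strings; the actual
    Copilot Studio export schema may differ.
    """
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    if not rows:
        return "BLOCK"  # no results to judge
    pass_rate = sum(r["passed"].lower() == "true" for r in rows) / len(rows)
    if pass_rate >= SHIP_THRESHOLD:
        return "SHIP"
    if pass_rate < BLOCK_THRESHOLD:
        return "BLOCK"
    return "ITERATE"
```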
Generates eval test cases from an eval suite plan (output of /eval-suite-planner) or a plain-English agent description. Supports both single-response and conversation (multi-turn) evaluation modes. Outputs a Copilot Studio test set table, a CSV file for import (single-response only), and a DOCX report for human review.
Patterns and techniques for evaluating and improving AI agent outputs. Use this skill when:
- Implementing self-critique and reflection loops
- Building evaluator-optimizer pipelines for quality-critical generation (see the sketch after this list)
- Creating test-driven code refinement workflows
- Designing rubric-based or LLM-as-judge evaluation systems
- Adding iterative improvement to agent outputs (code, reports, analysis)
- Measuring and improving agent response quality
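As a minimal sketch of the evaluator-optimizer pattern named above: a generator drafts, an LLM judge scores the draft against a rubric, and the loop feeds the critique back until a score threshold or iteration cap is hit. The `llm` callable, the rubric wording, and the reply format are hypothetical stand-ins, not part of any specific skill.

```python
from typing import Callable

def evaluator_optimizer(
    task: str,
    llm: Callable[[str], str],  # hypothetical: any prompt -> completion function
    threshold: float = 0.8,
    max_iters: int = 3,
) -> str:
    draft = llm(f"Complete this task:\n{task}")
    for _ in range(max_iters):
        # LLM-as-judge step: score the draft against a rubric and list fixes.
        critique = llm(
            "Score this answer 0-1 for correctness, completeness, and clarity, "
            "then list concrete fixes.\n"
            f"Task: {task}\nAnswer: {draft}\n"
            "Reply with 'SCORE: <0-1>' on the first line, then 'FIXES: ...'"
        )
        # Brittle parsing, kept simple for illustration.
        score = float(critique.split("SCORE:")[1].split()[0])
        if score >= threshold:
            break
        # Optimizer step: revise the draft using the judge's feedback.
        draft = llm(
            f"Revise the answer using this feedback.\n{critique}\nAnswer: {draft}"
        )
    return draft
```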
Use this skill when the user's Copilot Studio agent evaluations have come back and they need to interpret scores, diagnose root causes of underperforming test cases, find remediation steps, or analyze patterns to improve their agent. Always use this skill when the user mentions: "eval failed", "why did this fail", "triage", "diagnose failure", "low pass rate", "fix evaluation results", "not passing", "failing test cases", "evaluation results", "improve my eval scores", or any situation where eval scores need interpretation and action.
Amazon Bedrock AgentCore Evaluations for testing and monitoring AI agent quality. 13 built-in evaluators plus custom LLM-as-Judge patterns. Use when testing agents, monitoring production quality, setting up alerts, or validating agent behavior.
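AgentCore's own evaluator API is deliberately not sketched here to avoid guessing at it; in framework-neutral form, the production-monitoring-with-alerts idea looks roughly like the following, where the sampling rate, alert threshold, and `judge` callable are all illustrative assumptions.

```python
import random
import statistics

def monitor_batch(traces, judge, sample_rate=0.1, alert_below=0.75):
    """Score a sample of production traces and flag quality regressions.

    traces: iterable of (input, output) pairs from the live agent.
    judge:  any scorer returning a float in [0, 1], e.g. an LLM-as-judge call.
    Names and thresholds are illustrative, not AgentCore's API.
    """
    sample = [t for t in traces if random.random() < sample_rate]
    scores = [judge(inp, out) for inp, out in sample]
    mean = statistics.mean(scores) if scores else 1.0  # nothing sampled -> no alert
    if mean < alert_below:
        print(f"ALERT: mean quality {mean:.2f} below {alert_below}")
    return mean
```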
ALWAYS ACTIVE — read at the start of any ADK agent development session. ADK development lifecycle and mandatory coding guidelines — spec-driven workflow, code preservation rules, model selection, and troubleshooting.
Execute tasks through systematic exploration, pruning, and expansion using the Tree of Thoughts methodology with multi-agent evaluation.
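A compact sketch of the Tree of Thoughts loop this skill describes: expand each frontier state into candidate thoughts, score them (the multi-agent variant would average several evaluators' votes), and prune to a beam. `propose` and `score` are hypothetical callables, not part of the skill itself.

```python
def tree_of_thoughts(root, propose, score, beam_width=3, depth=3):
    """Beam-style ToT: expand, evaluate, prune, repeat.

    propose(state) -> list of candidate next states (expansion)
    score(state)   -> float; with multi-agent evaluation this could
                      average the scores of several evaluator agents.
    """
    frontier = [root]
    for _ in range(depth):
        # Expansion: generate candidate thoughts from every frontier state.
        candidates = [s for state in frontier for s in propose(state)]
        if not candidates:
            break
        # Pruning: keep only the top-scoring states for the next round.
        frontier = sorted(candidates, key=score, reverse=True)[:beam_width]
    return max(frontier, key=score)
```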
Build evaluation frameworks for agent systems. Use when testing agent performance, validating context engineering choices, or measuring improvements over time.
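A minimal shape for such a framework, assuming nothing beyond the standard library: fixed test cases, a pluggable agent and scorer, and a stored baseline to measure improvement over time against. Every name here is illustrative.

```python
import json
from pathlib import Path

def run_suite(agent, scorer, cases, baseline_path="baseline.json"):
    """Score an agent on fixed cases and diff against the last stored run.

    cases:  list of dicts like {"id": ..., "input": ..., "expected": ...}
    agent:  callable input -> output
    scorer: callable (output, expected) -> float in [0, 1]
    """
    results = {c["id"]: scorer(agent(c["input"]), c["expected"]) for c in cases}
    mean = sum(results.values()) / len(results)
    baseline = Path(baseline_path)
    if baseline.exists():
        prev = json.loads(baseline.read_text())["mean"]
        print(f"mean score {mean:.3f} (baseline {prev:.3f}, delta {mean - prev:+.3f})")
    # Persist this run as the new baseline for the next comparison.
    baseline.write_text(json.dumps({"mean": mean, "per_case": results}))
    return results
```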
Use when discussing or working with DeepEval (the Python AI evaluation framework).
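A minimal DeepEval example using calls that exist in the library (`LLMTestCase`, `AnswerRelevancyMetric`, `evaluate`); it assumes an LLM judge is configured (DeepEval's default metrics use OpenAI, so `OPENAI_API_KEY` must be set), and the strings are placeholder data.

```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

# Placeholder test data -- substitute your agent's real input and output.
test_case = LLMTestCase(
    input="What is your return policy?",
    actual_output="Items can be returned within 30 days with a receipt.",
)

# LLM-as-judge metric; the case fails if the score drops below the threshold.
metric = AnswerRelevancyMetric(threshold=0.7)
evaluate(test_cases=[test_case], metrics=[metric])
```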
Use this skill to work with Microsoft Foundry (Azure AI Foundry): deploy AI models from catalog, build RAG applications with knowledge indexes, create and evaluate AI agents. USE FOR: Microsoft Foundry, AI Foundry, deploy model, model catalog, RAG, knowledge index, create agent, evaluate agent, agent monitoring. DO NOT USE FOR: Azure Functions (use azure-functions), App Service (use azure-create-app).
Head-to-head comparison of coding agents (Claude Code, Aider, Codex, etc.) on custom tasks, with pass rate, cost, time, and consistency metrics.
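The aggregation behind those four metrics is straightforward; here is a sketch in which each run record is a hypothetical dict with `passed`, `cost_usd`, and `seconds` fields, and "consistency" is given one illustrative definition (the complement of the pass/fail spread across repeated runs), not the skill's actual formula.

```python
from statistics import mean, pstdev

def summarize(runs):
    """Aggregate repeated runs of one agent over a task set.

    runs: list of dicts like {"passed": bool, "cost_usd": float, "seconds": float}
    """
    passes = [r["passed"] for r in runs]
    return {
        "pass_rate": mean(passes),                         # bools average as 0/1
        "avg_cost_usd": mean(r["cost_usd"] for r in runs),
        "avg_seconds": mean(r["seconds"] for r in runs),
        # 1.0 means every run agreed; lower means flaky pass/fail behavior.
        "consistency": 1.0 - pstdev([float(p) for p in passes]),
    }
```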