Found 16 Skills
This skill should be used when the user wants to "develop an agent", "build an agent using ADK", "run the agent locally", "debug agent code", "test an agent", "deploy an agent", "publish an agent", "monitor an agent", or needs guidance on the ADK (Agent Development Kit) development lifecycle and coding guidelines. Entrypoint for building ADK agents. Always active — provides the full workflow (scaffold, build, evaluate, deploy, publish, observe), code preservation rules, model selection guidance, and troubleshooting steps for ADK or any other agent development.
Azure AI Evaluation SDK for Python. Use for evaluating generative AI applications with quality, safety, agent, and custom evaluators. Triggers: "azure-ai-evaluation", "evaluators", "GroundednessEvaluator", "evaluate", "AI quality metrics", "RedTeam", "agent evaluation".
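For orientation, a minimal sketch of calling one of these evaluators with the public azure-ai-evaluation package; the endpoint, deployment, and key are placeholders for your own Azure OpenAI resource, and the score in the final comment is illustrative.

```python
# Minimal sketch using the azure-ai-evaluation package; the endpoint,
# deployment, and key below are placeholders, not real resources.
from azure.ai.evaluation import GroundednessEvaluator

model_config = {
    "azure_endpoint": "https://<your-resource>.openai.azure.com",
    "azure_deployment": "<your-deployment>",
    "api_key": "<your-api-key>",
}

# Quality evaluators are callables: configure once, then score responses.
groundedness = GroundednessEvaluator(model_config)
result = groundedness(
    query="What is the capital of France?",
    context="France's capital is Paris.",
    response="Paris is the capital of France.",
)
print(result)  # e.g. {"groundedness": 5.0, "groundedness_reason": "..."}
```

The same pattern extends to batch runs over a JSONL dataset via the SDK's evaluate(...) entry point, one of the triggers listed above.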
ALWAYS ACTIVE — read at the start of any ADK agent development session. ADK development lifecycle and mandatory coding guidelines — spec-driven workflow, code preservation rules, model selection, and troubleshooting.
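If the ADK here is Google's Agent Development Kit for Python (an assumption; the description does not name a vendor), the scaffold step produces an agent module along these lines. The tool function and model name below are illustrative.

```python
# Sketch of a minimal agent definition, assuming Google's ADK
# (the google-adk package). Tool and model name are illustrative.
from datetime import datetime, timezone

from google.adk.agents import Agent

def get_utc_time() -> dict:
    """Hypothetical example tool: return the current UTC time."""
    return {"utc": datetime.now(timezone.utc).isoformat()}

root_agent = Agent(
    name="time_agent",
    model="gemini-2.0-flash",
    description="Answers questions about the current time.",
    instruction="Call get_utc_time whenever the user asks for the time.",
    tools=[get_utc_time],
)
```

Under the same assumption, the "run the agent locally" and "debug agent code" steps map to the ADK CLI (`adk run`, `adk web`).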
Use when discussing or working with DeepEval (the Python AI evaluation framework).
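For reference, a minimal DeepEval sketch: one test case scored with a built-in metric. The inputs and threshold are arbitrary examples, and the metric needs an LLM backend configured (e.g. an OpenAI key).

```python
# Minimal DeepEval sketch: score one test case with a built-in metric.
# Requires an LLM backend for the metric (e.g. OPENAI_API_KEY set).
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="Paris is the capital of France.",
)

# Threshold is arbitrary here; the metric passes if score >= threshold.
metric = AnswerRelevancyMetric(threshold=0.7)
evaluate(test_cases=[test_case], metrics=[metric])
```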
Head-to-head comparison of coding agents (Claude Code, Aider, Codex, etc.) on custom tasks with pass rate, cost, time, and consistency metrics.
A method for iteratively improving text instructions for agents (skills / slash commands / task prompts / CLAUDE.md sections / code generation prompts) by having unbiased executors run them, then evaluating from both perspectives (executor self-report + instruction-side metrics). Repeat until improvement plateaus. Use immediately after creating or significantly revising a prompt or skill, or when you suspect ambiguous instructions are why an agent isn't behaving as expected.
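The loop itself is simple enough to sketch in Python. Every helper below (run_executor, self_report_score, instruction_metrics, revise) is a hypothetical stand-in for however your harness runs executors and collects the two evaluation perspectives.

```python
# Hypothetical sketch of the improve-until-plateau loop described above.
# run_executor, self_report_score, instruction_metrics, and revise are
# stand-ins for your actual harness calls, not a real API.

def improve(instructions: str, max_rounds: int = 5, min_gain: float = 0.02) -> str:
    best_score = 0.0
    for _ in range(max_rounds):
        # Unbiased executors run the instructions cold, with no extra context.
        transcripts = [run_executor(instructions) for _ in range(3)]
        # Combine both perspectives: self-reports + instruction-side metrics.
        score = (self_report_score(transcripts) + instruction_metrics(transcripts)) / 2
        if score - best_score < min_gain:  # improvement has plateaued
            break
        best_score = score
        instructions = revise(instructions, transcripts)  # rewrite ambiguous parts
    return instructions
```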
Set up and improve harness engineering (AGENTS.md, docs/, lint rules, eval systems, project-level prompt engineering) for AI-agent-friendly codebases. Triggers on: new/empty project setup for AI agents, AGENTS.md or CLAUDE.md creation, harness engineering questions, making agents work better on a codebase. ALSO triggers when users are frustrated or complaining about agent quality — e.g. 'the agent keeps ignoring conventions', 'it never follows instructions', 'why does it keep doing X', 'the agent is broken' — because poor agent output almost always signals harness gaps, not model problems. Covers: context engineering, architectural constraints, multi-agent coordination, evaluation, long-running agent harnesses, and diagnosis of agent quality issues.
Create code-based evaluators for LangSmith-traced agents with step-by-step collaborative guidance through inspection, evaluation logic, and testing.
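A sketch of the shape such an evaluator takes with the langsmith Python SDK (recent versions accept plain functions over inputs/outputs/reference_outputs; older versions use a (run, example) signature). The dataset name, agent function, and scoring rule are placeholders.

```python
# Sketch using the langsmith SDK; the dataset name and my_agent are
# placeholders, not real resources.
from langsmith.evaluation import evaluate

def exact_match(inputs: dict, outputs: dict, reference_outputs: dict) -> dict:
    """Code-based evaluator: full score when the answer matches the reference."""
    matched = outputs.get("answer") == reference_outputs.get("answer")
    return {"key": "exact_match", "score": 1.0 if matched else 0.0}

def target(inputs: dict) -> dict:
    return {"answer": my_agent(inputs["question"])}  # my_agent is hypothetical

evaluate(
    target,
    data="my-agent-dataset",  # placeholder: a dataset built from traced runs
    evaluators=[exact_match],
)
```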
Configures Lean environments, installs external proof skills, runs preflight checks, and guides the workflow for proving downloaded OpenMath Lean theorems locally.
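For orientation, the shape of the artifact this workflow operates on: a downloaded theorem typically arrives stubbed with `sorry`, and proving it locally means replacing the stub. The theorem below is illustrative, not an actual OpenMath problem.

```lean
-- Illustrative shape only; not an actual OpenMath theorem.
-- A downloaded statement typically arrives stubbed with `sorry`:
theorem add_comm_example (a b : Nat) : a + b = b + a := by
  sorry

-- A successful local proof replaces the stub, e.g.:
theorem add_comm_example' (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```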
Create a new Harbor task for evaluating agents. Use when the user wants to scaffold, build, or design a new task, benchmark problem, or eval. Guides through instruction writing, environment setup, verifier design (pytest vs Reward Kit vs custom), and solution scripting.
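As a concrete reference for the pytest branch of verifier design, a sketch of the kind of check a verifier might run against artifacts the agent was asked to produce; the path and expected values are placeholders, not Harbor conventions.

```python
# Sketch of a pytest-style verifier: assert on artifacts the agent was
# asked to produce. Path and expected values are placeholders.
import json
from pathlib import Path

SOLUTION = Path("/workspace/output.json")  # placeholder artifact path

def test_solution_file_exists():
    assert SOLUTION.exists(), "agent never wrote the required output file"

def test_solution_is_correct():
    data = json.loads(SOLUTION.read_text())
    assert data.get("answer") == 42  # placeholder expected value
    assert isinstance(data.get("steps"), list) and data["steps"]
```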
Launch a meta-judge, then a judge sub-agent, to evaluate the results produced in the current conversation.
Evaluate and improve Claude Code commands, skills, and agents. Use when testing prompt effectiveness, validating context engineering choices, or measuring improvement quality.