Found 33 Skills
TDD-style testing methodology for skills using fresh subagent instances to prevent priming bias and validate skill effectiveness. Use when validating skill improvements, testing skill effectiveness, preventing priming bias, measuring skill impact on behavior. Do not use when implementing skills (use skill-authoring instead), creating hooks (use hook-authoring instead).
Design and implement comprehensive evaluation systems for AI agents. Use when building evals for coding agents, conversational agents, research agents, or computer-use agents. Covers grader types, benchmarks, 8-step roadmap, and production integration.
Testing and benchmarking LLM agents, including behavioral testing, capability assessment, reliability metrics, and production monitoring. Even top agents achieve less than 50% on real-world benchmarks. Use when: agent testing, agent evaluation, benchmark agents, agent reliability, test agent.
This skill should be used when the user asks to "create an agent", "add an agent", "write a subagent", "agent frontmatter", "when to use description", "agent examples", "agent tools", "agent colors", "autonomous agent", or needs guidance on agent structure, system prompts, triggering conditions, or agent development best practices for Claude Code plugins.
Build AI agents with Pydantic AI — tools, capabilities, structured output, streaming, testing, and multi-agent patterns. Use when the user mentions Pydantic AI, imports pydantic_ai, or asks to build an AI agent, add tools/capabilities, stream output, define agents from YAML, or test agent behavior.
End-to-end interactive workflow — pick a product, then either run existing tasks and environments (Path A) or set up new ones from docs, suggested tasks, credentials, and templates (Path B). Builds the experiment, attaches signals, and optionally triggers the first iteration. Trigger when users say: "set up an experiment", "create an experiment", "I want to run an experiment", "run my tasks", "setup experiment", "new experiment", "configure an experiment", or "experiment setup".
Complete reference for writing, running, and iterating on evals (automated conversation tests) for ADK agents. Covers eval file format, all assertion types, CLI usage, and per-primitive testing patterns.
Use when facing two or more independent tasks that can be completed without shared state or sequential dependencies.
Deploy prompt-based Azure AI agents from YAML definitions to Azure AI Foundry projects. Use when users want to (1) create and deploy Azure AI agents, (2) set up Azure AI infrastructure, (3) deploy AI models to Azure, or (4) test deployed agents interactively. Handles authentication, RBAC, quotas, and deployment complexities automatically.
Test PydanticAI agents using TestModel, FunctionModel, VCR cassettes, and inline snapshots. Use when writing unit tests, mocking LLM responses, or recording API interactions.
Use when evaluating agent performance, building test frameworks, measuring quality, or asking about "agent evaluation", "LLM-as-judge", "agent testing", "quality metrics", "evaluation rubrics", or "agent benchmarks".
Use when an AI agent should run protocols or workflow tests against kairos-dev (KAIROS MCP in this repo's dev environment). Covers AI–MCP integration and workflow-test flows; MCP-only, reports/ output.