Found 28 Skills
Master context engineering for AI agent systems. Use when designing agent architectures, debugging context failures, optimizing token usage, implementing memory systems, building multi-agent coordination, evaluating agent performance, or developing LLM-powered pipelines. Covers context fundamentals, degradation patterns, optimization techniques (compaction, masking, caching), compression strategies, memory architectures, multi-agent patterns, LLM-as-Judge evaluation, tool design, and project development.
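For orientation, a minimal sketch of one optimization this skill covers, compaction: once the transcript exceeds a token budget, older turns are folded into a single summary turn while recent turns stay verbatim. The count_tokens and summarize helpers here are hypothetical stand-ins for a real tokenizer and an LLM summarization call.

```python
def count_tokens(text: str) -> int:
    # Stand-in for a real tokenizer; whitespace split is a rough proxy.
    return len(text.split())

def summarize(turns: list[str]) -> str:
    # Stand-in for an LLM summarization call over the dropped turns.
    return f"[SUMMARY OF {len(turns)} EARLIER TURNS] " + " | ".join(turns)[:200]

def compact(transcript: list[str], budget: int = 4000, keep_recent: int = 6) -> list[str]:
    # Leave the transcript alone while it fits the budget.
    if sum(count_tokens(t) for t in transcript) <= budget or len(transcript) <= keep_recent:
        return transcript
    old, recent = transcript[:-keep_recent], transcript[-keep_recent:]
    # One summary turn replaces all of the older turns.
    return [summarize(old)] + recent
```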
Evaluate LLM systems using automated metrics, LLM-as-judge, and benchmarks. Use when testing prompt quality, validating RAG pipelines, measuring safety (hallucinations, bias), or comparing models for production deployment.
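As a concrete reference for the LLM-as-judge pattern these entries rely on, a minimal sketch: prompt a judge model with a rubric, request a JSON verdict, and parse it into a score. call_model is a hypothetical stand-in for whatever chat-completion client is in use.

```python
import json

JUDGE_RUBRIC = (
    "You are an impartial judge. Rate the ANSWER to the QUESTION for factual "
    'correctness on a 1-5 scale. Reply with JSON: {"score": <int>, "reason": <str>}.'
)

def call_model(system: str, user: str) -> str:
    # Hypothetical stand-in for a chat-completion API call.
    raise NotImplementedError

def judge(question: str, answer: str) -> dict:
    raw = call_model(JUDGE_RUBRIC, f"QUESTION: {question}\nANSWER: {answer}")
    verdict = json.loads(raw)          # expect {"score": ..., "reason": ...}
    assert 1 <= verdict["score"] <= 5  # reject out-of-range scores early
    return verdict
```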
Use this skill when you need to test or evaluate LangGraph/LangChain agents: writing unit or integration tests, generating test scaffolds, mocking LLM/tool behavior, running trajectory evaluation (match or LLM-as-judge), running LangSmith dataset evaluations, and comparing two agent versions with A/B-style offline analysis. Use it for Python and JavaScript/TypeScript workflows, evaluator design, experiment setup, regression gates, and debugging flaky/incorrect evaluation results.
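For the LangSmith dataset-evaluation piece, a minimal Python sketch under stated assumptions: the dataset name and target function are hypothetical, and the custom evaluator uses the SDK's (run, example) signature.

```python
from langsmith.evaluation import evaluate

def target(inputs: dict) -> dict:
    # Hypothetical stand-in: invoke your LangGraph/LangChain agent here.
    return {"output": "agent answer for " + inputs["question"]}

def exact_match(run, example) -> dict:
    # Compare the agent's output to the dataset's reference output.
    predicted = run.outputs.get("output", "")
    expected = example.outputs.get("answer", "")
    return {"key": "exact_match", "score": int(predicted == expected)}

results = evaluate(
    target,
    data="agent-regression-set",   # hypothetical dataset name
    evaluators=[exact_match],
    experiment_prefix="agent-v2",  # groups runs for A/B-style comparison
)
```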
Build and run LLM-as-judge evaluation pipelines using Amazon Bedrock Evaluation Jobs with pre-computed inference datasets. Use when setting up automated model evaluation, designing test scenarios, collecting pre-computed responses, configuring custom metrics, creating AWS infrastructure, running evaluation jobs, parsing results, and iterating on findings.
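A minimal sketch of submitting such a job with boto3; the nested keys follow the bedrock create_evaluation_job request shape as understood here, and every ARN, bucket path, model identifier, and name is a placeholder.

```python
import boto3

bedrock = boto3.client("bedrock")

response = bedrock.create_evaluation_job(
    jobName="judge-eval-demo",                          # placeholder
    roleArn="arn:aws:iam::123456789012:role/EvalRole",  # placeholder
    evaluationConfig={
        "automated": {
            "datasetMetricConfigs": [{
                "taskType": "General",
                "dataset": {
                    "name": "precomputed-responses",
                    "datasetLocation": {"s3Uri": "s3://my-bucket/eval/dataset.jsonl"},
                },
                "metricNames": ["Builtin.Correctness"],
            }],
            # The judge model that scores the pre-computed responses.
            "evaluatorModelConfig": {
                "bedrockEvaluatorModels": [
                    {"modelIdentifier": "anthropic.claude-3-5-sonnet-20240620-v1:0"}
                ]
            },
        }
    },
    inferenceConfig={
        "models": [{
            # Points at responses generated offline rather than invoking a model.
            "precomputedInferenceSource": {"inferenceSourceIdentifier": "my-model-v1"}
        }]
    },
    outputDataConfig={"s3Uri": "s3://my-bucket/eval/results/"},
)
print(response["jobArn"])
```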
Implement comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking. Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks.
Execute complex tasks through sequential sub-agent orchestration with intelligent model selection and LLM-as-a-judge verification
Comprehensive multi-perspective review using specialized judges with debate and consensus building
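One way to read "debate and consensus building" concretely: collect independent verdicts from several judge personas, aggregate them, and escalate to a debate round only on disagreement. A minimal sketch with a hypothetical run_judge helper:

```python
from statistics import median

def run_judge(persona: str, artifact: str) -> int:
    # Hypothetical: prompt one judge persona, return a 1-5 score.
    raise NotImplementedError

def consensus(artifact: str, personas=("correctness", "security", "style")) -> dict:
    scores = {p: run_judge(p, artifact) for p in personas}
    spread = max(scores.values()) - min(scores.values())
    # Median is robust to one outlier judge; a wide spread signals that a
    # debate round (judges see each other's reasons, then re-score) is needed.
    return {"scores": scores, "consensus": median(scores.values()),
            "needs_debate": spread >= 2}
```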
Evaluate and improve Claude Code commands, skills, and agents. Use when testing prompt effectiveness, validating context engineering choices, or measuring improvement quality.
Launch a sub-agent judge to evaluate results produced in the current conversation
Execute a task with sub-agent implementation and LLM-as-a-judge verification, wrapped in an automatic retry loop
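A minimal sketch of that implement/judge/retry loop, with hypothetical run_subagent and judge_output helpers:

```python
def run_subagent(task: str, feedback: str | None = None) -> str:
    # Hypothetical: delegate the task (plus any prior critique) to a sub-agent.
    raise NotImplementedError

def judge_output(task: str, output: str) -> tuple[bool, str]:
    # Hypothetical: LLM-as-a-judge returns (passed, critique).
    raise NotImplementedError

def execute_with_verification(task: str, max_retries: int = 3) -> str:
    feedback = None
    for _ in range(max_retries):
        output = run_subagent(task, feedback)
        passed, critique = judge_output(task, output)
        if passed:
            return output
        feedback = critique  # feed the judge's critique into the next attempt
    raise RuntimeError(f"judge rejected all {max_retries} attempts")
```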
Build evaluation frameworks for agent systems. Use when testing agent performance, validating context engineering choices, or measuring improvements over time.