Found 40 Skills
Build evaluation frameworks for agent systems. Use when testing agent performance, validating context engineering choices, or measuring improvements over time.
Create validated LLM-as-a-Judge evaluators following best practices — binary Pass/Fail judges with TPR/TNR validation for measuring specific failure modes. Use when you need to automate quality checks, build guardrails, or measure a specific failure mode identified during trace analysis. Do NOT use when failures are fixable with prompt changes (use optimize-prompt) or when failure modes are unknown (use analyze-trace-failures first).
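The TPR/TNR validation step mentioned in the entry above can be made concrete with a small harness that runs the judge over human-labeled traces and checks how often it agrees with the labels. This is only a sketch: `run_judge` is a hypothetical wrapper around whatever LLM client and judge prompt you use, and the 0.9 threshold in the usage note is an arbitrary example, not a value any skill prescribes.

```python
# Sketch: validate a binary Pass/Fail LLM judge against human-labeled traces.
# `run_judge` is a hypothetical placeholder for your judge prompt + model call.

def run_judge(trace: str) -> bool:
    """Return True if the judge labels this trace as Pass (placeholder)."""
    raise NotImplementedError("call your LLM judge here")

def validate_judge(labeled: list[tuple[str, bool]]) -> dict[str, float]:
    """Compute TPR/TNR of the judge against human Pass/Fail labels."""
    tp = tn = fp = fn = 0
    for trace, human_pass in labeled:
        judge_pass = run_judge(trace)
        if human_pass and judge_pass:
            tp += 1
        elif human_pass and not judge_pass:
            fn += 1
        elif not human_pass and judge_pass:
            fp += 1
        else:
            tn += 1
    return {
        "tpr": tp / (tp + fn) if (tp + fn) else 0.0,  # agreement on true passes
        "tnr": tn / (tn + fp) if (tn + fp) else 0.0,  # agreement on true failures
    }

# Usage (illustrative): require both rates above a chosen bar before trusting the judge.
# metrics = validate_judge(labeled_traces)
# assert metrics["tpr"] >= 0.9 and metrics["tnr"] >= 0.9
```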
Execute complex tasks through sequential sub-agent orchestration with intelligent model selection and meta-judge → LLM-as-a-judge verification
Execute a task with sub-agent implementation and LLM-as-a-judge verification, with an automatic retry loop
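A rough sketch of the implement-then-verify retry loop this entry describes, assuming hypothetical `implement` and `judge` helpers that delegate to the respective sub-agents; the retry limit of three is an arbitrary example.

```python
# Sketch: sub-agent implementation with LLM-as-a-judge verification and retry.
# `implement` and `judge` are hypothetical stand-ins for your sub-agent calls.

def implement(task: str, feedback: str | None = None) -> str:
    raise NotImplementedError("delegate to the implementation sub-agent")

def judge(task: str, result: str) -> tuple[bool, str]:
    """Return (passed, critique) from the judge sub-agent."""
    raise NotImplementedError("delegate to the judge sub-agent")

def run_with_retries(task: str, max_attempts: int = 3) -> str:
    feedback = None
    for attempt in range(1, max_attempts + 1):
        result = implement(task, feedback)
        passed, critique = judge(task, result)
        if passed:
            return result
        feedback = critique  # feed the judge's critique into the next attempt
    raise RuntimeError(f"judge rejected all {max_attempts} attempts")
```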
Use when evaluating agent performance, building test frameworks, measuring quality, or asking about "agent evaluation", "LLM-as-judge", "agent testing", "quality metrics", "evaluation rubrics", "agent benchmarks"
Launch a sub-agent judge to evaluate results produced in the current conversation
Attach judges to AI Config variations for automatic LLM-as-a-judge evaluation. Create custom judges, configure sampling rates, and monitor quality scores.
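The sampling-rate idea in the entry above amounts to judging only a fraction of production generations to keep cost bounded. The sketch below is illustrative only: `score_with_judge` and `record_score` are hypothetical placeholders, not the AI Config API, and the 10% default is an example.

```python
import random

# Sketch: judge a sampled fraction of generations and record the quality scores.
# Both helpers are hypothetical; a real integration would route through your
# config/observability platform.

def score_with_judge(prompt: str, output: str) -> float:
    raise NotImplementedError("call the attached judge model here")

def record_score(variation: str, score: float) -> None:
    print(f"{variation}: {score:.2f}")  # stand-in for a metrics sink

def maybe_judge(variation: str, prompt: str, output: str, sampling_rate: float = 0.1) -> None:
    """Evaluate roughly `sampling_rate` of generations to bound judge cost."""
    if random.random() < sampling_rate:
        record_score(variation, score_with_judge(prompt, output))
```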
Launch a meta-judge, then a judge sub-agent, to evaluate results produced in the current conversation
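One common reading of the meta-judge → judge pattern, and it is only an assumption here, is that the meta-judge first drafts the evaluation criteria for the task and the judge then applies them to the result. Both helpers below are hypothetical sub-agent calls.

```python
# Sketch (assumed interpretation): meta-judge derives a rubric, judge applies it.

def meta_judge(task: str) -> list[str]:
    """Return rubric criteria tailored to this task (placeholder)."""
    raise NotImplementedError("ask the meta-judge sub-agent for criteria")

def judge_against_rubric(result: str, criteria: list[str]) -> dict[str, bool]:
    """Return a Pass/Fail verdict per criterion (placeholder)."""
    raise NotImplementedError("ask the judge sub-agent to score each criterion")

def evaluate(task: str, result: str) -> dict[str, bool]:
    criteria = meta_judge(task)                     # step 1: derive what to check
    return judge_against_rubric(result, criteria)   # step 2: check it
```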
Evaluate and improve Claude Code commands, skills, and agents. Use when testing prompt effectiveness, validating context engineering choices, or measuring improvement quality.
Comprehensive multi-perspective review using specialized judges with debate and consensus building
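For the consensus step of a multi-perspective review, a minimal aggregation is a majority vote over the specialized judges' verdicts; the sketch below assumes that shape (judge names and the voting rule are illustrative, and the debate round is omitted).

```python
from collections import Counter
from typing import Callable

# Sketch: aggregate Pass/Fail verdicts from several specialized judges by
# majority vote. The judge callables are hypothetical; a real skill might run
# a debate round between judges before tallying.

def consensus(result: str, judges: dict[str, Callable[[str], str]]) -> tuple[str, dict[str, str]]:
    """Return (majority verdict, per-judge verdicts)."""
    verdicts = {name: judge(result) for name, judge in judges.items()}  # e.g. "pass" / "fail"
    majority, _ = Counter(verdicts.values()).most_common(1)[0]
    return majority, verdicts

# Usage (illustrative):
# consensus(result, {"security": security_judge, "style": style_judge, "correctness": correctness_judge})
```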
Generates a Jupyter notebook that evaluates a fine-tuned SageMaker model using LLM-as-a-Judge. Use when the user says "evaluate my model", "how did my model perform", "compare models", or after a training job completes. Supports built-in and custom evaluation metrics, evaluation dataset setup, and judge model selection.
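For the "compare models" case, a common technique is a pairwise judge that sees both outputs and votes, run twice with the presentation order swapped to dampen position bias. This is a generic sketch, not the SageMaker notebook's workflow: `ask_judge` is a hypothetical wrapper returning "A", "B", or "tie".

```python
# Sketch: pairwise LLM-as-a-Judge comparison of a base model vs. a fine-tuned model.
# `ask_judge` is a hypothetical call to the judge model with a comparison prompt.

def ask_judge(prompt: str, answer_a: str, answer_b: str) -> str:
    raise NotImplementedError("return 'A', 'B', or 'tie' from the judge model")

def compare(prompt: str, out_base: str, out_tuned: str) -> str:
    """Judge twice with swapped order; only consistent verdicts count as a win."""
    first = ask_judge(prompt, out_base, out_tuned)   # base shown as A
    second = ask_judge(prompt, out_tuned, out_base)  # tuned shown as A
    if first == "B" and second == "A":
        return "tuned"
    if first == "A" and second == "B":
        return "base"
    return "tie"  # disagreement across orders is treated as a tie
```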