Search Results: llm-testing

Found 6 Skills

AI & Machine Learning · confident-ai/deepeval

deepeval

DeepEval evaluation workflow for AI agents and LLM applications. TRIGGER when the user wants to evaluate or improve an AI agent, tool-using workflow, multi-turn chatbot, RAG pipeline, or LLM app; add evals; generate datasets or goldens; use deepeval generate; use deepeval test run; add tracing or @observe; send results to Confident AI; monitor production; run online evals; inspect traces; or iterate on prompts, tools, retrieval, or agent behavior from eval failures. AI agents are the primary use case. Covers Python SDK, pytest eval suites, CLI generation, tracing, Confident AI reporting, and agent-driven improvement loops. DO NOT TRIGGER for unrelated generic pytest, non-AI test setup, or non-DeepEval observability work unless the user asks to compare or migrate to DeepEval.

4 scripts · Checked

AI & Machine Learning · affaan-m/everything-claud...

skill-comply

Visualizes whether skills, rules, and agent definitions are actually followed: auto-generates scenarios at three prompt-strictness levels, runs agents, classifies behavioral sequences, and reports compliance rates with full tool-call timelines.

12 scripts · Attention

AI & Machine Learning · alinaqi/claude-bootstrap

llm-patterns

AI-first application patterns, LLM testing, prompt management

AI & Machine Learning · orchestra-research/ai-res...

langsmith-observability

LLM observability platform for tracing, evaluation, and monitoring. Use when debugging LLM applications, evaluating model outputs against datasets, monitoring production systems, or building systematic testing pipelines for AI applications.

Testing & QA · yonatangross/orchestkit

testing-patterns

Redirect: testing-patterns was split into five focused sub-skills. Use when looking for testing-patterns, writing tests, or setting up test automation. Redirects to testing-unit, testing-e2e, testing-integration, testing-llm, or testing-perf.

AI & Machine Learning · nvidia/skills

nemo-rl-e2e-testing

External NeMo-RL end-to-end validation workflow for Megatron-Bridge model/provider changes, including downstream compatibility checks, external RL lifecycle behavior, Megatron policy setup, HF import/export, checkpoint/resume, non-colocated vLLM refit, delta weight transfer, optional LoRA/generation variants, and questions such as "does this model work in NeMo-RL", "run NeMo-RL e2e", or "external RL loop validation". Covers running NeMo-RL Megatron policy jobs from a Bridge checkout, choosing GRPO/SFT/checkpoint/non-colocated refit variants, setting PYTHONPATH so NeMo-RL imports the local Bridge tree, and reporting pass/fail evidence.