Found 1,906 Skills
This skill should be used when the user asks to "implement LLM-as-judge", "compare model outputs", "create evaluation rubrics", "mitigate evaluation bias", or mentions direct scoring, pairwise comparison, position bias, evaluation pipelines, or automated quality assessment. Part of the context engineering skill suite — also activates when the user mentions "context engineering" or "context-engineering" in the context of evaluating LLM output quality.
Structured 8-factor vendor evaluation framework for AI marketing tools, based on Venkatesan & Lecinski's The AI Marketing Canvas (2nd ed., Stanford Business Books, 2026). Scores each tool against EA market accessibility, data requirements, integration compatibility, team capability, and total cost in UGX, then produces a shortlist with 30-day experiment briefs. Invoke when a client has completed the ai-readiness-diagnostic, is at Canvas Step 2 (Experimentation), and is ready to select specific AI tools for structured trials. Also invoke when a client wants to compare 2–4 named tools before purchasing or committing budget.
Full browser UAT for web apps — Playwright testing with console/network error capture, accessibility checks, i18n validation, and bug triage. Use when running screen-by-screen UAT or testing specific features in any web or hybrid app (React, Vue, Angular, Ionic, Next.js, etc).
Evaluates ML models for performance, fairness, and reliability. Use for metric selection, cross-validation strategies, overfitting/underfitting diagnosis, hyperparameter tuning, LLM evaluation, A/B testing, and production monitoring for model drift.
Discounted cash flow valuation and intrinsic value analysis for public companies. Use when the brief asks for DCF, fair value, intrinsic value, price target, undervalued or overvalued analysis, or "what is this company worth?"
Provides situational playbooks for high-stakes edge cases that don't fit the standard management toolkit — produces step-by-step guidance for inappropriate team behavior, an engineer badmouthing your manager, letting someone go when circumstances are hard, manager quitting guilt, and handling layoffs (for both those leaving and those staying). Use when the user says "don't know how to handle this," "someone said something inappropriate," "engineer said something offensive," "developer talks badly about my manager," "letting someone go when their situation is hard," "I feel guilty about leaving my job," or "handling a layoff." Do NOT use for standard underperformance management (use performance-reviews) or giving direct feedback (use feedback).
Valuation methodology framework covering absolute (DCF / DDM / SOTP) and relative (PE-Band / PB-ROE / EV-EBITDA / PS) approaches — when to use each, pros/cons, common pitfalls, and practical application with Longbridge data. Triggers: "估值方法", "估值方法论", "DCF", "DDM", "SOTP", "PE估值", "EV/EBITDA", "绝对估值", "相对估值", "估值框架", "估值方法論", "絕對估值", "相對估值", "valuation methodology", "DCF model", "DDM", "SOTP", "PE band", "EV EBITDA", "absolute valuation", "relative valuation", "valuation framework".
Evaluates accuracy of quantized or unquantized LLMs using NeMo Evaluator Launcher (NEL). Triggers on "evaluate model", "benchmark accuracy", "run MMLU", "evaluate quantized model", "accuracy drop", "run nel". Handles deployment, config generation, and evaluation execution. Not for quantizing models (use ptq) or deploying/serving models (use deployment).
Applies multi-disciplinary cognitive frameworks (e.g., Inversion, First Principles, Second-Order Thinking) to user plans and system designs to stress-test decisions.
LLM prompt testing, evaluation, and CI/CD quality gates using Promptfoo. Invoke when: setting up prompt evaluation or regression testing; integrating LLM testing into CI/CD pipelines; configuring security testing (red teaming, jailbreaks); comparing prompt or model performance; building evaluation suites for RAG, factuality, or safety. Keywords: promptfoo, llm evaluation, prompt testing, red team, CI/CD, regression testing.
Evaluate and improve Claude Code commands, skills, and agents. Use when testing prompt effectiveness, validating context engineering choices, or measuring improvement quality.
Use this skill for ANY question about CREATING evaluators. Covers creating custom metrics, LLM as Judge evaluators, code-based evaluators, and uploading evaluation logic to LangSmith. Includes basic usage of evaluators to run evaluations.