Found 1,906 Skills
Evaluate LLM systems using automated metrics, LLM-as-judge, and benchmarks. Use when testing prompt quality, validating RAG pipelines, measuring safety (hallucinations, bias), or comparing models for production deployment.
Consult this skill when building evaluation or scoring systems. Use when implementing evaluation systems, creating quality gates, designing scoring rubrics, or building decision frameworks. Do not use when a simple pass/fail check without scoring is all that is needed.
Calibrate an LLM judge against human labels using data splits, TPR/TNR, and bias correction. Use after writing a judge prompt (write-judge-prompt) when you need to verify alignment before trusting its outputs. Do NOT use for code-based evaluators (those are deterministic; test with standard unit tests).
Comprehensive evaluation of potential stock investments combining valuation analysis, fundamental research, technical assessment, and clear buy/hold/sell recommendations. Use when the user asks about buying a stock, evaluating investment opportunities, analyzing watchlist candidates, or requests stock recommendations. Provides specific entry prices, position sizing, and conviction ratings.
Comprehensive technology stack evaluation and comparison tool with TCO analysis, security assessment, and intelligent recommendations for engineering teams.
Validates dataset formatting and quality for SageMaker model fine-tuning (SFT, DPO, or RLVR). Use when the user says "is my dataset okay", "evaluate my data", "check my training data", "I have my own data", or before starting any fine-tuning job. Detects file format, checks schema compliance against the selected model and technique, and reports whether the data is ready for training or evaluation.
Generates a Jupyter notebook that evaluates a fine-tuned SageMaker model using LLM-as-a-Judge. Use when the user says "evaluate my model", "how did my model perform", "compare models", or after a training job completes. Supports built-in and custom evaluation metrics, evaluation dataset setup, and judge model selection.
Investigate LLM analytics evaluations of both types — `hog` (deterministic code-based) and `llm_judge` (LLM-prompt-based). Find existing evaluations, inspect their configuration, run them against specific generations, query individual pass/fail results, and generate AI-powered summaries of patterns across many runs. Use when the user asks to debug why an evaluation is failing, surface common failure modes, compare results across filters, dry-run a Hog evaluator, prototype a new LLM-judge prompt, or manage the evaluation lifecycle (create, update, enable/disable, delete).
Structured UX evaluation that produces quantitative assessments, identifies specific issues, and routes to the right Intent skill for resolution. Part of the Intent design strategy system. Runs heuristic evaluations, cognitive walkthroughs, anti-pattern detection, and task success analysis. Scores, categorizes, and prioritizes findings — then maps every issue to the skill that fixes it. Trigger on: UX review, design audit, heuristic evaluation, usability assessment, "review this design", "what's wrong with this", "evaluate the experience", "is this accessible", "check for dark patterns", "how good is this UX", "rate this design", "find the problems", or any request to systematically assess the quality of a user experience. This is the diagnostic entry point of the Intent system — the UX doctor that diagnoses issues and refers to specialists.
Structured scholarly-work evaluation for papers, proposals, literature reviews, methods sections, evidence quality, citation support, and research-writing feedback.
Quickly assess an individual stock's current valuation level, historical quantile, position relative to peers, and conditions for revaluation. Suited to a first-pass check of "is it expensive", screening by valuation position, and discussing odds and expectation requirements.
Evaluate and improve Claude Code commands, skills, and agents. Use when testing prompt effectiveness, validating context engineering choices, or measuring improvement quality.