Found 1,153 Skills
Evaluates LLMs across 100+ benchmarks from 18+ harnesses (MMLU, HumanEval, GSM8K, safety, VLM) with multi-backend execution. Use when needing scalable evaluation on local Docker, Slurm HPC, or cloud platforms. NVIDIA's enterprise-grade platform with container-first architecture for reproducible benchmarking.
Use this skill for ANY question about CREATING evaluators. Covers creating custom metrics, LLM as Judge evaluators, code-based evaluators, and uploading evaluation logic to LangSmith. Includes basic usage of evaluators to run evaluations.
Comprehensive framework for evaluating AI vendors and solutions to avoid costly mistakes. Use this skill when assessing AI vendor proposals, conducting due diligence, evaluating contracts, comparing vendors, or making build-vs-buy decisions. Helps identify red flags, assess pricing models, evaluate technical capabilities, and conduct structured vendor comparisons.
Professionally evaluates story outlines, scoring them on market potential, innovation, and content highlights. Suitable for assessing story outline quality, judging IP adaptation potential, and making project approval decisions.
Defines evaluation criteria and scoring methodologies for deliverable assessment
Iterative refinement workflow for polishing code, documentation, or designs through systematic evaluation and improvement cycles. Use when refining drafts into production-grade quality.
Implement comprehensive evaluation strategies for LLM applications using automated metrics, human feedback, and benchmarking. Use when testing LLM performance, measuring AI application quality, or establishing evaluation frameworks.
Respond in Chinese to user requests; when the user's message is fully in English (ignoring punctuation, digits, emojis, and whitespace), append a brief Chinese evaluation plus a 1-10 score.
Evaluate design effectiveness from a UX perspective. Assesses visual hierarchy, information architecture, emotional resonance, and overall design quality with actionable feedback.
A-share value investment analysis tool that provides stock screening, in-depth analysis of individual stocks, industry comparison, and valuation calculation. Based on value investment theory, it uses akshare to obtain public financial data and is suited to ordinary investors who trade at low frequency.
Build production Spring Boot applications: REST APIs, Security, Data, Actuator.
Patterns and techniques for evaluating and improving AI agent outputs. Use this skill when implementing self-critique and reflection loops; building evaluator-optimizer pipelines for quality-critical generation; creating test-driven code refinement workflows; designing rubric-based or LLM-as-judge evaluation systems; adding iterative improvement to agent outputs (code, reports, analysis); or measuring and improving agent response quality.