Loading...
Loading...
Found 1,928 Skills
Evaluate and rank agent results by metric or LLM judge for an AgentHub session.
Evaluate a website, landing page, content, or any online asset through the eyes of pre-built synthetic ICP personas. Loads personas from icp-persona-builder output, then runs them against target URLs. Supports three modes: structured scorecard, freeform focus group, and head-to-head competitive comparison. Reusable — run against the same site after changes, or against new content anytime.
Technical solution evaluation and code review in the style of Linus Torvalds. Only use this when the user explicitly requests a Linus-style review or explicitly asks for a rigorous evaluation of code changes/technical solutions (e.g., "review changes/code", "evaluate if the solution is appropriate", "check submission standards", "linus-tech-review").
Conduct comprehensive literature research with target disambiguation, evidence grading, and structured theme extraction. Creates a detailed report with mandatory completeness checklist, biological model synthesis, and testable hypotheses. For biological targets, resolves official IDs (Ensembl/UniProt), synonyms, naming collisions, and gathers expression/pathway context before literature search. Default deliverable is a report file; for single factoid questions, uses a fast verification mode and may include an inline answer. Use when users need thorough literature reviews, target profiles, or to verify specific claims from the literature.
Design, evaluate, and document software architecture patterns
Evaluate Figma designs from operator persona perspectives through design critique and user experience evaluation. Use when reviewing UX for specific user roles (e.g., air-surveillance-tech, weapons-director), conducting design reviews, or evaluating operator interfaces. Analyzes cognitive load, communication patterns, pain points, and system visibility. Works with Figma MCP (desktop/URL) and Outline docs.
Use only when the user explicitly requests brainstorming, evaluating architecture choices, or comparing options where no single concern dominates
Evaluate acquisition channels using unit economics, customer quality, and scalability. Recommends scale/test/kill decisions.
A scoring scale for evaluating how well a CLI is designed for AI agents, based on the "Rewrite Your CLI for AI Agents" principles.
Instrument Python LLM apps, build golden datasets, write eval-based tests, run them, and root-cause failures — covering the full eval-driven development cycle. Make sure to use this skill whenever a user is developing, testing, QA-ing, evaluating, or benchmarking a Python project that calls an LLM, even if they don't say "evals" explicitly. Use for making sure an AI app works correctly, catching regressions after prompt changes, debugging why an agent started behaving differently, or validating output quality before shipping.
Attach judges to AI Config variations for automatic LLM-as-a-judge evaluation. Create custom judges, configure sampling rates, and monitor quality scores.
Quick situational awareness for the current git branch. Summarizes what a feature branch is about by analyzing commits and changes against trunk. On trunk, highlights recent interesting activity. Use when user says "wtf", "what's going on", "what is this branch", "what changed", or "catch me up".