Loading...
Loading...
Found 1,928 Skills
Guide for using MassGen to develop and improve itself. This skill should be used when agents need to run MassGen experiments programmatically (using automation mode) OR analyze terminal UI/UX quality (using visual evaluation tools). These are mutually exclusive workflows for different improvement goals.
Score assistant responses for clarity on a strict 1-5 scale, then return strict JSON only with score, rationale, and improvement suggestions. Use when the user asks to evaluate clarity, grade clarity, or critique clarity quality.
Assess research idea novelty through systematic literature search. Multi-round search-evaluate loops with harsh critic persona. Binary novel/not-novel decision with justification. Use before committing to a research direction.
Audit an LLM eval pipeline and surface problems: missing error analysis, unvalidated judges, vanity metrics, etc. Use when inheriting an eval system, when unsure whether evals are trustworthy, or as a starting point when no eval infrastructure exists. Do NOT use when the goal is to build a new evaluator from scratch (use error-analysis, write-judge-prompt, or validate-evaluator instead).
A scoring scale for evaluating how well a CLI is designed for AI agents, based on the "Rewrite Your CLI for AI Agents" principles.
Use this skill whenever the user asks for a security analysis, vulnerability assessment, security audit, or any form of Security Assessment Report (SAR) over a codebase, infrastructure, API, database, or system. Triggers include: "audit my code", "find security issues", "run a security check", "generate a SAR", "check for vulnerabilities", "is this code secure", or any request that involves evaluating the security posture of a project. Also triggers when the user uploads or references source code, config files, environment variables, or architecture diagrams and asks for a security opinion. Do NOT use for generic coding tasks, code reviews focused on quality rather than security, or performance optimization unless a security angle is explicitly present.
Set up a new autoresearch experiment interactively. Collects domain, target file, eval command, metric, direction, and evaluator.
Run isolated eval and grading calls using CC 2.1.81 --bare mode. Constructs claude -p --bare invocations for skill evaluation, trigger testing, and LLM grading without plugin/hook interference. Use when running eval pipelines, grading skill outputs, benchmarking prompt quality, or testing trigger accuracy in isolation.
Quick situational awareness for the current git branch. Summarizes what a feature branch is about by analyzing commits and changes against trunk. On trunk, highlights recent interesting activity. Use when user says "wtf", "what's going on", "what is this branch", "what changed", or "catch me up".
Instrument, trace, evaluate, and monitor LLM applications and AI agents with LangSmith. Use when setting up observability for LLM pipelines, running offline or online evaluations, managing prompts in the Prompt Hub, creating datasets for regression testing, or deploying agent servers. Triggers on: langsmith, langchain tracing, llm tracing, llm observability, llm evaluation, trace llm calls, @traceable, wrap_openai, langsmith evaluate, langsmith dataset, langsmith feedback, langsmith prompt hub, langsmith project, llm monitoring, llm debugging, llm quality, openevals, langsmith cli, langsmith experiment, annotate llm, llm judge.
Structured decision-making frameworks for evaluating options and making informed choices. Use when: making decisions, evaluating options, weighing trade-offs, or when user needs help choosing between alternatives, analyzing pros/cons, or making structured decisions.
Retrieve industry-specific P/E ratios using Octagon MCP. Use when comparing company valuations to specific industry peers, analyzing sub-sector valuations, and understanding niche market valuations beyond broad sector averages.