Found 803 Skills
Build, validate, and deploy LLM-as-Judge evaluators for automated quality assessment of LLM pipeline outputs. Use this skill whenever the user wants to: create an automated evaluator for subjective or nuanced failure modes, write a judge prompt for Pass/Fail assessment, split labeled data for judge development, measure judge alignment (TPR/TNR), estimate true success rates with bias correction, or set up CI evaluation pipelines. Also trigger when the user mentions "judge prompt", "automated eval", "LLM evaluator", "grading prompt", "alignment metrics", "true positive rate", or wants to move from manual trace review to automated evaluation. This skill covers the full lifecycle: prompt design → data splitting → iterative refinement → success rate estimation.
Design experiment plans with progressive stages — initial implementation, baseline tuning, creative research, and ablation studies. Plan baselines, datasets, hyperparameter sweeps, and evaluation metrics. Use when planning experiments for a research paper.
CI/CD pipelines, deployment strategy, and infrastructure. Use when setting up GitHub Actions workflows, choosing deployment platforms, configuring production environments, securing pipelines with OIDC, optimizing build performance, building container images, measuring DORA metrics, or setting up Docker multi-stage builds.
Parse raw text from an Instagram or TikTok Story insights screenshot and format it into a clean, spreadsheet-ready row with labeled fields. This skill should be used when parsing Story metrics from a screenshot, formatting Story insights for a spreadsheet, extracting metrics from a pasted Story screenshot, cleaning up Story analytics data, converting Story insights text into structured data, turning a Story performance screenshot into a row for the tracker, logging Story metrics into a spreadsheet, normalizing Story screenshot data, pulling numbers from a Story insights paste, organizing Story metrics from creator screenshots, processing a batch of Story screenshots into rows, building a Story metrics tracker from screenshots, or entering Story data from a screenshot into a sheet. For normalizing metrics from multiple sources into a unified table, see metrics-normalization-formatter. For calculating engagement rates and comparing to benchmarks, see engagement-rate-calculator-benchmarker.
Submits and manages FastFold protein folding jobs via the Jobs API. Covers authentication, creating jobs, polling for completion, and fetching CIF/PDB URLs, metrics, and viewer links. Use when folding protein sequences with FastFold, calling the FastFold API, or scripting fold-and-wait workflows.
Captures quality metrics baseline (tests, coverage, type errors, linting, dead code) by running quality gates and storing results in memory for regression detection. Use at feature start, before refactor work, or after major changes to establish baseline. Triggers on "capture baseline", "establish baseline", or PROACTIVELY at start of any feature/refactor work. Works with pytest output, pyright errors, ruff warnings, vulture results, and memory MCP server for baseline storage.
Detect Single Responsibility Principle (SRP) violations using multi-dimensional analysis. Use when reviewing code for "SRP", "single responsibility", "god class", "doing too much", "too many dependencies", before commits, during refactoring, or as quality gate. Analyzes Python, JavaScript, TypeScript files with AST-based detection, metrics (TCC, ATFD, WMC), and project-specific patterns. Provides actionable fix guidance with refactoring estimates.
Scrape public posts from X.com (Twitter) users. Extracts text content, timestamps, engagement metrics (views, likes, retweets, replies), and generates direct post links. Use when user asks to scrape/fetch/analyze X.com posts or Twitter data, or mentions "整理@某人的发言" ("compile @someone's posts") or "看看某人在X上说了什么" ("see what someone has said on X").
Self-improving agent architecture using ChromaDB for continuous learning, self-evaluation, and improvement storage. Agents maintain separate memory collections for learned patterns, performance metrics, and self-assessments without modifying their static .md configuration.
When the user wants to set up, interpret, or improve their app analytics and tracking. Also use when the user mentions "analytics", "tracking", "metrics", "KPIs", "App Store Connect analytics", "install tracking", "funnel", "attribution", or "how is my app performing". For A/B testing, see ab-test-store-listing. For retention metrics, see retention-optimization.
Comprehensive system health scanner that checks security risks, performance metrics, and optimization opportunities. Works on Windows, macOS, and Linux.
Support workflows, ticketing systems (Zendesk, Intercom), knowledge base design, chatbot design, and metrics (CSAT, NPS). Use when building support infrastructure, designing help centers, or optimizing customer experience.