Loading...
Loading...
Found 147 Skills
Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring—where even top agents achieve less than 50% on real-world benchmarks Use when: agent testing, agent evaluation, benchmark agents, agent reliability, test agent.
Evaluates code generation models across HumanEval, MBPP, MultiPL-E, and 15+ benchmarks with pass@k metrics. Use when benchmarking code models, comparing coding abilities, testing multi-language support, or measuring code generation quality. Industry standard from BigCode Project used by HuggingFace leaderboards.
Senior SaaS CFO / Financial Analyst (15+ years) specialized in financial modeling, projections, and exit strategy for bootstrapped and VC-backed SaaS companies. Activate when user needs: (1) Revenue projections (1-5 years), (2) Exit valuation and multiples, (3) Unit economics analysis (CAC, LTV, payback), (4) Scenario modeling (conservative/base/optimistic), (5) Fundraising narratives with financial backing, (6) M&A due diligence financials, (7) SaaS metrics benchmarking, (8) Cohort analysis and churn modeling. Triggers: "proyecciones", "projections", "exit", "valuation", "ARR", "MRR", "multiples", "revenue forecast", "financial model", "exit strategy", "CAC", "LTV", "unit economics", "churn", "fundraising", "M&A", "acquisition", "5 year plan".
Automated reproduction of comprehensive model evaluation benchmarks following the Benchmark Suite V3. Auto-activates for model benchmarking, comparison evaluation, or performance testing between AI models.
Spatial indexing and world streaming for Three.js building games with thousands of pieces. Use when optimizing building games, implementing spatial queries, chunk loading, or profiling performance. Includes spatial hash grids, octrees, chunk managers, and benchmarking tools.
Optimizes Python library performance through profiling (cProfile, PyInstrument), memory analysis (memray, tracemalloc), benchmarking (pytest-benchmark), and optimization strategies. Use when analyzing performance bottlenecks, finding memory leaks, or setting up performance regression testing.
Instrument Python LLM apps, build golden datasets, write eval-based tests, run them, and root-cause failures — covering the full eval-driven development cycle. Make sure to use this skill whenever a user is developing, testing, QA-ing, evaluating, or benchmarking a Python project that calls an LLM, even if they don't say "evals" explicitly. Use for making sure an AI app works correctly, catching regressions after prompt changes, debugging why an agent started behaving differently, or validating output quality before shipping.
Research-driven code review and validation at multiple levels of abstraction. Two modes: (1) Session review — after making changes, review and verify work using parallel reviewers that research-validate every assumption; (2) Full codebase audit — deep end-to-end evaluation using parallel teams of subagent-spawning reviewers. Use when reviewing changes, verifying work quality, auditing a codebase, validating correctness, checking assumptions, finding defects, reducing complexity. NOT for writing new code, explaining code, or benchmarking.
Live Google Search Console analytics — fetches real SEO data (clicks, impressions, CTR, rankings) and delivers actionable insights with CTR benchmarking and opportunity detection. Zero dependencies. Use when the user asks about GSC, Google Search Console, SEO performance, search performance, keywords, rankings, organic traffic, top pages, top queries, "how is my site performing in Google", "check rankings", or "search console report".
Go testing patterns for production-grade code: subtests, test helpers, fixtures, golden files, httptest, testcontainers, property-based testing, and fuzz testing. Covers mocking strategies, test isolation, coverage analysis, and test design philosophy. Use when writing tests, improving coverage, reviewing test quality, setting up test infrastructure, or choosing a testing approach. Trigger examples: "add tests", "improve coverage", "write tests for this", "test helpers", "mock this dependency", "integration test", "fuzz test". Do NOT use for performance benchmarking methodology (use go-performance-review), security testing (use go-security-audit), or table-driven test patterns specifically (use go-test-table-driven).
Performance review and testing: evaluate Core Web Vitals, page load times, bundle sizes, runtime performance, resource optimization, and rendering efficiency with browser-based measurement and benchmarking.
Optimize existing Triton kernels for NVIDIA TileIR backend on Blackwell GPUs (sm_100+). Adds TileIR-specific autotune configs: occupancy, num_ctas, TMA descriptors. Covers kernel classification (dot-related, norm-like, elementwise, reduction), type-specific transformations, and PTX-vs-TileIR benchmarking. Triggered by: "optimize for TileIR", "add TileIR configs", "Blackwell optimization", "TMA descriptors", "2CTA mode", "occupancy tuning". Kernels use standard `import triton`; TileIR activates via ENABLE_TILE=1 when nvtriton is installed.