Found 1,150 Skills
Use when testing Ralph's hat collection presets, validating preset configurations, or auditing the preset library for bugs and UX issues.
Configures and runs LLM evaluation using the Promptfoo framework. Use when setting up prompt testing, creating evaluation configs (promptfooconfig.yaml), writing Python custom assertions, implementing llm-rubric for LLM-as-judge, or managing few-shot examples in prompts. Triggers on keywords like "promptfoo", "eval", "LLM evaluation", "prompt testing", or "model comparison".
Build evaluation frameworks for agent systems. Use when testing agent performance, validating context engineering choices, or measuring improvements over time.
Evaluates and sharpens content hooks using The Hook Stack™ framework. Use when scoring headlines, refining hooks for video/social/newsletter, or when asked to "evaluate this hook", "run through hook stack", or "score my headline".
Create a Technology Evaluation Pack (problem framing, options matrix, build vs buy, pilot plan, risk review, decision memo). Use for evaluating new tech, emerging technology, AI tools, vendor selection, and tech stack decisions.
Assuming the US dollar or another currency loses its reserve status and gold becomes the only anchor, derive the implied gold price that the balance sheet can withstand by dividing central bank monetary liabilities by gold reserves, then output the leverage level, gap, and ranking for each country or currency.
Evaluate GitHub contributors for MLOps/engineering roles. Use when analyzing candidates, researching GitHub profiles, or updating CONTRIBUTORS.md with hiring assessments.
Use when you need explicit quality criteria and scoring scales to evaluate work consistently, compare alternatives objectively, set acceptance thresholds, or reduce subjective bias, or when the user mentions rubric, scoring criteria, quality standards, evaluation framework, inter-rater reliability, or grading/assessing work.
Use this skill when you need to test or evaluate LangGraph/LangChain agents: writing unit or integration tests, generating test scaffolds, mocking LLM/tool behavior, running trajectory evaluation (match or LLM-as-judge), running LangSmith dataset evaluations, and comparing two agent versions with A/B-style offline analysis. Use it for Python and JavaScript/TypeScript workflows, evaluator design, experiment setup, regression gates, and debugging flaky/incorrect evaluation results.
Use when evaluating agent performance, building test frameworks, measuring quality, or asking about "agent evaluation", "LLM-as-judge", "agent testing", "quality metrics", "evaluation rubrics", or "agent benchmarks".
Organize online information about IPs (intellectual properties) and conduct multi-dimensional evaluation and scoring. Suitable for assessing the adaptation value of IPs such as novels and scripts, and for analyzing their market potential and innovative attributes.
Graduate a workflow insight from learned/<topic>.md into AGENTS.md as a permanent constraint. Use when a lesson is stable enough to apply to every future session.