Use this skill when you need to test or evaluate LangGraph/LangChain agents: writing unit or integration tests, generating test scaffolds, mocking LLM/tool behavior, running trajectory evaluation (match or LLM-as-judge), running LangSmith dataset evaluations, and comparing two agent versions with A/B-style offline analysis. Use it for Python and JavaScript/TypeScript workflows, evaluator design, experiment setup, regression gates, and debugging flaky/incorrect evaluation results.
A method for iteratively improving text instructions for agents (skills / slash commands / task prompts / CLAUDE.md sections / code generation prompts) by having unbiased executors run them, then evaluating from both perspectives (executor self-report + instruction-side metrics). Repeat until improvement plateaus. Use immediately after creating or significantly revising a prompt or skill, or when you suspect an agent's unexpected behavior stems from ambiguity in its instructions.
Set up and improve harness engineering (AGENTS.md, docs/, lint rules, eval systems, project-level prompt engineering) for AI-agent-friendly codebases. Triggers on: new/empty project setup for AI agents, AGENTS.md or CLAUDE.md creation, harness engineering questions, making agents work better on a codebase. ALSO triggers when users are frustrated or complaining about agent quality — e.g. 'the agent keeps ignoring conventions', 'it never follows instructions', 'why does it keep doing X', 'the agent is broken' — because poor agent output almost always signals harness gaps, not model problems. Covers: context engineering, architectural constraints, multi-agent coordination, evaluation, long-running agent harness, and diagnosis of agent quality issues.
This skill should be used when the user asks to "evaluate agent performance", "build test framework", "measure agent quality", "create evaluation rubrics", or mentions LLM-as-judge, multi-dimensional evaluation, agent testing, or quality gates for agent pipelines. Part of the context engineering skill suite — also activates when the user mentions "context engineering" or "context-engineering" in the context of measuring agent effectiveness.
Evaluate the quality of the CAW (Cobo Agentic Wallet) Agent in local Claude Code, and generate scoring data and analysis reports. Use when users want to run a CAW evaluation, conduct an evaluation, test a Skill, assess Agent quality, generate evaluation reports, or say "run evaluation", "evaluate CAW", "eval", or "score". For weak-model / openclaw evaluation, use caw-eval-openclaw (only installed on openclaw servers).
Evaluates agent skills against Anthropic's best practices. Use when asked to review, evaluate, assess, or audit a skill for quality. Analyzes SKILL.md structure, naming conventions, description quality, content organization, and identifies anti-patterns. Produces actionable improvement recommendations.
Create code-based evaluators for LangSmith-traced agents with step-by-step collaborative guidance through inspection, evaluation logic, and testing.
Configures Lean environments, installs external proof skills, runs preflight checks, and guides the workflow for proving downloaded OpenMath Lean theorems locally.
Create a new Harbor task for evaluating agents. Use when the user wants to scaffold, build, or design a new task, benchmark problem, or eval. Guides through instruction writing, environment setup, verifier design (pytest vs Reward Kit vs custom), and solution scripting.
Expert skill for using Future AGI — the open-source end-to-end platform for evaluating, observing, and improving LLM and AI agent applications with tracing, evals, simulations, datasets, gateway, and guardrails.
Launch a meta-judge and then a judge sub-agent to evaluate results produced in the current conversation.
This skill should be used when the user asks to "build an agent with Google ADK", "use the Agent Development Kit", "create a Google ADK agent", "set up ADK tools", or needs guidance on Google's Agent Development Kit best practices, multi-agent systems, or agent evaluation.