Use this skill when the user asks to "evaluate MCP tools", "test tool selection", "improve tool descriptions", "check MCP schema quality", "eval my MCP server", or wants to measure whether Claude uses their MCP tools correctly. Tests tool selection accuracy, analyzes schema quality, and iteratively optimizes descriptions. Companion to build-mcp-server.
```bash
npx skill4agent add pproenca/dot-skills eval-mcp
```

`build-mcp-server`

```
Phase 1: Connect → Phase 2: Static Analysis → Phase 3: Selection Testing → Phase 4: Optimize
                                                         ↑__________________________|
```

- `npx`
- `tools/list`
- `build-mcp-server/scripts/test-server.sh`
- `http://localhost:3000/mcp`
- `node dist/server.js`

```bash
bash scripts/fetch-tools.sh <url-or-command> <transport> <workspace>/tools.json
```

`tools/list`

| # | Tool | Description (preview) | Params | Annotations |
|---|------|-----------------------|--------|-------------|
| 1 | search_issues | Search issues by keyword... | 3 | readOnlyHint |
| 2 | create_issue | Create a new issue... | 4 | — |

`{server-name}-eval/`

```
{server-name}-eval/
├── tools.json
├── evals/
│   └── evals.json
└── iteration-N/
```

```bash
bash scripts/analyze-schemas.sh <workspace>/tools.json <workspace>/iteration-N/static-analysis.json
```

`references/quality-checklist.md`

| Tool | Desc | Params | Schema | Annotations | Overall | Issues |
|------|------|--------|--------|-------------|---------|--------|
| search_issues | 3/3 | 3/3 | 2/3 | 2/3 | 2.5 | No negation |
| create_issue | 1/3 | 1/3 | 0/3 | 0/3 | 0.5 | 4 issues |

### Sibling Pairs (confusion risk)
| Tool A | Tool B | Overlap | Risk |
|--------|--------|---------|------|
| search_issues | list_issues | 52% | HIGH |

`references/eval-patterns.md`

`{workspace}/evals/evals.json`

```json
{
  "server_name": "my-server",
  "generated_from": "tools.json",
  "intents": [
    {
      "id": 1,
      "intent": "Are there any open bugs related to checkout?",
      "expected_tool": "search_issues",
      "type": "should_trigger",
      "target_tool": "search_issues",
      "notes": "Implicit intent — doesn't name the action"
    }
  ]
}
```

```
You have access to the following MCP tools:
{tool schemas as JSON}
A user sends this message:
"{intent text}"
Which tool would you call? Respond with JSON:
{
  "selected_tool": "tool_name" or null,
  "arguments": { ... } or {},
  "reasoning": "One sentence explaining your choice"
}
If no tool fits the user's request, set selected_tool to null.
Select exactly ONE tool. Do not suggest calling multiple tools.
```

`{workspace}/iteration-N/selection/intent-{ID}/result.json`

```bash
bash scripts/grade-selection.sh \
  <workspace>/iteration-N/selection \
  <workspace>/evals/evals.json \
  <workspace>/iteration-N/benchmark.json
```

## Selection Results — Iteration N
**Accuracy:** 82% (41/50 correct)
| Metric | Count |
|--------|-------|
| Correct | 41 |
| Wrong tool | 5 |
| False accept | 2 |
| False reject | 2 |
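
The four outcome counts above could be derived roughly as follows. This is a minimal sketch, not the logic of `scripts/grade-selection.sh`, which is not shown here. It assumes that an intent with no expected tool is represented as `None`, that "false accept" means a tool was selected when none was expected, and that "false reject" means no tool was selected when one was expected. The function names and sample pairs are hypothetical.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

Pair = Tuple[Optional[str], Optional[str]]  # (expected_tool, selected_tool)


def classify(expected: Optional[str], selected: Optional[str]) -> str:
    """Label one graded intent. These labels are assumed definitions."""
    if selected == expected:
        return "correct"        # same tool, or correctly no tool at all
    if expected is None:
        return "false_accept"   # a tool was picked although none should fire
    if selected is None:
        return "false_reject"   # no tool was picked although one was expected
    return "wrong_tool"         # a tool was picked, but not the expected one


def summarize(pairs: Iterable[Pair]):
    """Count outcomes and compute overall accuracy."""
    counts = Counter(classify(e, s) for e, s in pairs)
    total = sum(counts.values())
    accuracy = counts["correct"] / total if total else 0.0
    return counts, accuracy


# Hypothetical graded results, one pair per intent
pairs = [
    ("search_issues", "search_issues"),
    ("list_issues", "search_issues"),
    (None, "create_issue"),
    ("get_user", None),
    (None, None),
]
counts, accuracy = summarize(pairs)
print(dict(counts), f"{accuracy:.0%}")
```

Treating "no tool expected, no tool selected" as correct keeps should-not-trigger intents in the accuracy denominator, which matches the report's single accuracy figure over all intents.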
### Per-Tool Accuracy
| Tool | Precision | Recall |
|------|-----------|--------|
| search_issues | 0.90 | 0.85 |
| create_issue | 1.00 | 1.00 |
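
Per-tool precision and recall like those in the table above can be computed from the same expected/selected pairs. The sketch below assumes the standard definitions: precision is the share of a tool's selections that were correct, and recall is the share of intents expecting that tool for which it was actually selected. The helper name and sample data are hypothetical, and the real grading script may compute these differently.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

Pair = Tuple[Optional[str], Optional[str]]  # (expected_tool, selected_tool)


def per_tool_metrics(pairs: Iterable[Pair]) -> Dict[str, Tuple[Optional[float], Optional[float]]]:
    """Return {tool: (precision, recall)}; None where a value is undefined."""
    hits = defaultdict(int)        # tool was both expected and selected
    selected_n = defaultdict(int)  # times the tool was selected
    expected_n = defaultdict(int)  # times the tool was expected
    for expected, selected in pairs:
        if expected is not None:
            expected_n[expected] += 1
        if selected is not None:
            selected_n[selected] += 1
            if selected == expected:
                hits[selected] += 1

    metrics = {}
    for tool in sorted(set(expected_n) | set(selected_n)):
        precision = hits[tool] / selected_n[tool] if selected_n[tool] else None
        recall = hits[tool] / expected_n[tool] if expected_n[tool] else None
        metrics[tool] = (precision, recall)
    return metrics


# Hypothetical graded results, one pair per intent
pairs = [
    ("search_issues", "search_issues"),
    ("search_issues", "search_issues"),
    ("list_issues", "search_issues"),
    ("list_issues", "list_issues"),
    ("create_issue", "create_issue"),
    (None, None),
]
metrics = per_tool_metrics(pairs)
for tool, (precision, recall) in metrics.items():
    print(tool, precision, recall)
```

In this sample, `search_issues` absorbs one `list_issues` intent, so its precision drops while the recall of `list_issues` drops, which is the same pattern the confusion table is meant to expose.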
### Worst Confusions
| Expected | Selected Instead | Times |
|----------|-----------------|-------|
| list_issues | search_issues | 3 |
| get_user | find_user_by_email | 2 |

`references/optimization.md`

## Suggested Improvements
### search_issues ↔ list_issues (confused 3 times)
**search_issues — Before:**

> Search issues by keyword.

**search_issues — After:**

> Search issues by keyword across title and body. Returns up to `limit` results ranked by relevance. Does NOT filter by status, assignee, or date — use list_issues for structured filtering.

**Reason:** Adding scope boundary and cross-reference to disambiguate from list_issues.

`{workspace}/iteration-N/suggestions.json`

`iteration-N+1`

## Iteration Comparison
| Metric | Iteration 1 | Iteration 2 | Delta |
|--------|------------|------------|-------|
| Accuracy | 82% | 94% | +12 pp |
| search↔list confusion | 3 | 0 | -3 |

- `references/quality-checklist.md`
- `references/eval-patterns.md`
- `references/optimization.md`

- `build-mcp-server`
- `build-mcp-app`
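
As a rough illustration of the Iteration Comparison table, the deltas are plain differences between the two iterations' metrics. The sketch below is an assumption about how such a comparison could be produced: the dictionary layout and key names are hypothetical and do not reflect the actual format of `benchmark.json`. The numbers are copied from the comparison table.

```python
from typing import Dict, List, Tuple


def compare_iterations(before: Dict[str, int], after: Dict[str, int]) -> List[Tuple[str, int, int, int]]:
    """Return (metric, before, after, delta) for metrics present in both runs."""
    rows = []
    for metric, old in before.items():
        if metric in after:
            new = after[metric]
            rows.append((metric, old, new, new - old))
    return rows


# Values from the comparison table; the key names are hypothetical
iteration_1 = {"accuracy_pct": 82, "search_list_confusions": 3}
iteration_2 = {"accuracy_pct": 94, "search_list_confusions": 0}

rows = compare_iterations(iteration_1, iteration_2)
for metric, old, new, delta in rows:
    print(f"| {metric} | {old} | {new} | {delta:+d} |")
```

The accuracy delta here is a difference in percentage points (94 minus 82), not a relative change.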