skill_evaluator

Skill Evaluator (WIP)

Evaluates skills against Anthropic's official best practices for agent skill authoring. Produces structured evaluation reports with scores and actionable recommendations.

Quick Start

Read the skill's SKILL.md and understand its purpose
Run automated validation:
```
scripts/validate_skill.py <skill-path>
```
Perform manual evaluation against criteria below
Generate evaluation report with scores and recommendations

Evaluation Workflow

Step 1: Automated Validation

Run the validation script first:

scripts/validate_skill.py <path/to/skill>

This checks:

SKILL.md exists with valid YAML frontmatter
Name follows conventions (lowercase, hyphens, max 64 chars)
Description is present and under 1024 chars
Body is under 500 lines
File references are one-level deep
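
The checks listed above could be sketched roughly as follows. This is not the source of scripts/validate_skill.py, which is not shown here: the function names `split_frontmatter` and `check_skill_text` are hypothetical, the frontmatter parser is a simplification that reads only flat `key: value` pairs, and treating the 1024 and 500 limits as inclusive is an assumption. The name and file-reference checks are left out.

```python
from typing import Dict, List, Tuple

MAX_DESCRIPTION_CHARS = 1024  # limit from the list above (assumed inclusive)
MAX_BODY_LINES = 500          # limit from the list above (assumed inclusive)


def split_frontmatter(text: str) -> Tuple[Dict[str, str], List[str]]:
    """Split SKILL.md text into (frontmatter fields, body lines).

    Simplification: only flat `key: value` pairs are read; a real
    validator would more likely use a YAML parser.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("missing opening '---' frontmatter delimiter")
    end = None
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            end = i
            break
    if end is None:
        raise ValueError("missing closing '---' frontmatter delimiter")
    fields: Dict[str, str] = {}
    for line in lines[1:end]:
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields, lines[end + 1:]


def check_skill_text(text: str) -> List[str]:
    """Return the problems found; an empty list means every check passed."""
    try:
        fields, body = split_frontmatter(text)
    except ValueError as exc:
        return [str(exc)]
    problems: List[str] = []
    description = fields.get("description", "")
    if not description:
        problems.append("description is missing")
    elif len(description) > MAX_DESCRIPTION_CHARS:
        problems.append(
            f"description has {len(description)} characters "
            f"(limit {MAX_DESCRIPTION_CHARS})"
        )
    if len(body) > MAX_BODY_LINES:
        problems.append(f"body has {len(body)} lines (limit {MAX_BODY_LINES})")
    return problems
```

Returning a list of problems rather than raising on the first one lets a report show every failed check in a single pass.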

Step 2: Manual Evaluation

Evaluate each dimension and assign a score (1-5):

A. Naming (Weight: 10%)

Score	Criteria
5	Gerund form (-ing), clear purpose, memorable
4	Descriptive, follows conventions
3	Acceptable but could be clearer
2	Vague or misleading
1	Violates naming rules

Rules: Max 64 chars, lowercase + numbers + hyphens only, no reserved words (anthropic, claude), no XML tags.

Good: `processing-pdfs`, `analyzing-spreadsheets`, `building-dashboards`

Bad: `pdf`, `my-skill`, `ClaudeHelper`, `anthropic-tools`
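
As a rough illustration, the hard naming rules can be expressed as a small check. The function name `name_rule_violations` is hypothetical and this is only a sketch of the stated rules, not the project's validator. Note that `pdf` and `my-skill` pass these rules; they are bad because they are vague, which only the manual score captures.

```python
import re
from typing import List

MAX_NAME_CHARS = 64
RESERVED_WORDS = ("anthropic", "claude")
# Lowercase letters, digits, and hyphens only. This also rejects the
# angle brackets that an XML tag would need.
NAME_PATTERN = re.compile(r"[a-z0-9-]+")


def name_rule_violations(name: str) -> List[str]:
    """Return the naming rules that `name` breaks (empty list if none)."""
    violations: List[str] = []
    if len(name) > MAX_NAME_CHARS:
        violations.append("longer than 64 characters")
    if not NAME_PATTERN.fullmatch(name):
        violations.append("must use only lowercase letters, numbers, and hyphens")
    lowered = name.lower()
    for word in RESERVED_WORDS:
        if word in lowered:
            violations.append(f"contains reserved word '{word}'")
    return violations


for candidate in ("processing-pdfs", "pdf", "ClaudeHelper", "anthropic-tools"):
    print(candidate, name_rule_violations(candidate))
```

A name with any violation would score 1 on this dimension; a name with none still needs a human judgment for scores 2 through 5.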

B. Description (Weight: 20%)

Score	Criteria
5	Clear functionality + specific activation triggers + third person
4	Good description with some triggers
3	Adequate but missing triggers or vague
2	Too brief or unclear purpose
1	Missing or unhelpful

Must include: What the skill does AND when to use it.

Good: "Extracts text from PDFs. Use when working with PDF documents for text extraction, form parsing, or content analysis."

Bad: "A skill for PDFs." or "Helps with documents."
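
A reviewer could pre-screen descriptions with a heuristic like the one below before scoring them by hand. Everything here is an assumption made for illustration: the function name `description_flags`, the trigger phrases it looks for, and the pronoun list used to guess at first- or second-person wording. It cannot judge whether a description is actually clear.

```python
import re
from typing import List

MAX_DESCRIPTION_CHARS = 1024
# Assumed trigger phrasing such as "Use when ..." or "Use this skill for ...".
TRIGGER_PATTERN = re.compile(
    r"\buse (?:this skill |it )?(?:when|for|whenever)\b", re.IGNORECASE
)
# Pronouns that suggest the description is not written in the third person.
PERSON_PATTERN = re.compile(r"\b(?:I|we|you|your|my)\b", re.IGNORECASE)


def description_flags(description: str) -> List[str]:
    """Return heuristic warnings for a skill description (empty if none)."""
    flags: List[str] = []
    if not description.strip():
        return ["description is missing"]
    if len(description) > MAX_DESCRIPTION_CHARS:
        flags.append("longer than 1024 characters")
    if not TRIGGER_PATTERN.search(description):
        flags.append("no activation trigger found")
    if PERSON_PATTERN.search(description):
        flags.append("not written in third person")
    return flags


print(description_flags("A skill for PDFs."))
```

Both of the bad examples above are flagged for a missing trigger, while the good example produces no flags.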

C. Content Quality (Weight: 30%)

Score	Criteria
5	Concise, assumes Claude intelligence, actionable instructions
4	Generally good, minor verbosity
3	Some unnecessary explanations or redundancy
2	Overly verbose or confusing
1	Bloated, explains obvious concepts

Ask: "Does Claude really need this explanation?" Remove anything Claude already knows.

D. Structure & Organization (Weight: 25%)

Score	Criteria
5	Excellent progressive disclosure, clear navigation, optimal length
4	Good organization, appropriate file splits
3	Acceptable but could be better organized
2	Poor organization, missing references, or bloated SKILL.md
1	No structure, everything dumped in SKILL.md

Check:

SKILL.md under 500 lines
References are one-level deep (no nested chains)
Long reference files (>100 lines) have table of contents
Uses forward slashes in all paths
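
Two of these checks lend themselves to simple text heuristics, sketched below. The function names `backslash_paths` and `needs_table_of_contents` are hypothetical, the path regex will also match unrelated backslash sequences (for example escape codes quoted in prose), and recognizing a table of contents by a "Contents" heading is an assumption.

```python
import re
from typing import List

LONG_REFERENCE_LINES = 100
# Something that looks like a Windows-style path: name\name[\name...]
BACKSLASH_PATH = re.compile(r"[\w.-]+\\[\w.\\-]+")
# A markdown heading reading "Contents" or "Table of contents" (assumed form).
TOC_HEADING = re.compile(
    r"^#+\s*(?:table of contents|contents)\s*$", re.IGNORECASE | re.MULTILINE
)


def backslash_paths(text: str) -> List[str]:
    """Return substrings that look like paths written with backslashes."""
    return BACKSLASH_PATH.findall(text)


def needs_table_of_contents(reference_text: str) -> bool:
    """True if a reference file is over 100 lines and has no TOC heading."""
    line_count = len(reference_text.splitlines())
    return line_count > LONG_REFERENCE_LINES and not TOC_HEADING.search(reference_text)


print(backslash_paths("Run scripts\\helper.py, then read references/guide.md"))
```

Hits from either function are prompts for a manual look rather than automatic failures.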

E. Degrees of Freedom (Weight: 10%)

Score	Criteria
5	Perfect match: high freedom for flexible tasks, low for fragile operations
4	Generally appropriate freedom levels
3	Acceptable but could be better calibrated
2	Mismatched: too rigid or too loose
1	Completely wrong freedom level for the task type

Guideline:

High freedom (text): Multiple valid approaches, context-dependent
Medium freedom (parameterized): Preferred pattern exists, some variation OK
Low freedom (specific scripts): Fragile operations, exact sequence required

F. Anti-Pattern Check (Weight: 5%)

Step 3: Generate Report

Use this template:

Skill Evaluation Report: [skill-name]

Summary

Overall Score: X.X/5.0
Recommendation: [Ready for publication / Needs minor improvements / Needs major revision]

Dimension Scores

Dimension	Score	Weight	Weighted
Naming	X/5	10%	X.XX
Description	X/5	20%	X.XX
Content Quality	X/5	30%	X.XX
Structure	X/5	25%	X.XX
Degrees of Freedom	X/5	10%	X.XX
Anti-Patterns	X/5	5%	X.XX
Total		100%	X.XX

Strengths

[List 2-3 things done well]

Areas for Improvement

[List specific issues with actionable fixes]

Anti-Patterns Found

[List any anti-patterns detected]

Recommendations

[Priority 1 fix]
[Priority 2 fix]
[Priority 3 fix]

Pre-Publication Checklist

Description is specific with activation triggers
SKILL.md under 500 lines
One-level-deep file references
Forward slashes in all paths
No time-sensitive information
Consistent terminology
Concrete examples provided
Scripts handle errors explicitly
All configuration values justified
Required packages listed
Tested with Haiku, Sonnet, Opus

Score Interpretation

Score Range	Rating	Action
4.5 - 5.0	Excellent	Ready for publication
4.0 - 4.4	Good	Minor improvements recommended
3.0 - 3.9	Acceptable	Several improvements needed
2.0 - 2.9	Needs Work	Major revision required
1.0 - 1.9	Poor	Fundamental redesign needed
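
The weighted total and its rating can be computed as in the sketch below, using the dimension weights from Step 2 and the ranges in the table above. The dictionary keys and function names are hypothetical. Because the ranges are stated to one decimal place, the sketch rounds the total to tenths (half up) before looking up the rating; that rounding rule is an assumption, and integer arithmetic is used to keep it exact.

```python
from typing import Dict, Tuple

# Weights in percent, one per dimension A-F (they sum to 100).
WEIGHTS_PERCENT: Dict[str, int] = {
    "naming": 10,
    "description": 20,
    "content_quality": 30,
    "structure": 25,
    "degrees_of_freedom": 10,
    "anti_patterns": 5,
}

# (minimum score in tenths, rating, action), highest band first.
RATINGS = [
    (45, "Excellent", "Ready for publication"),
    (40, "Good", "Minor improvements recommended"),
    (30, "Acceptable", "Several improvements needed"),
    (20, "Needs Work", "Major revision required"),
    (10, "Poor", "Fundamental redesign needed"),
]


def overall_score_tenths(scores: Dict[str, int]) -> int:
    """Weighted total in tenths of a point (for example 36 means 3.6)."""
    if set(scores) != set(WEIGHTS_PERCENT):
        raise ValueError("scores must cover exactly the six dimensions")
    for dimension, score in scores.items():
        if not 1 <= score <= 5:
            raise ValueError(f"{dimension} score must be between 1 and 5")
    hundredths = sum(scores[d] * w for d, w in WEIGHTS_PERCENT.items())
    return (hundredths + 5) // 10  # round half up to one decimal place


def interpret(tenths: int) -> Tuple[str, str]:
    """Map a score in tenths to its (rating, action) pair."""
    for minimum, rating, action in RATINGS:
        if tenths >= minimum:
            return rating, action
    raise ValueError("score is below the 1.0 minimum")


example = {
    "naming": 3, "description": 3, "content_quality": 4,
    "structure": 4, "degrees_of_freedom": 3, "anti_patterns": 3,
}
tenths = overall_score_tenths(example)
print(f"{tenths / 10:.1f}/5.0", interpret(tenths))
```

In the example, the weighted sum is 0.30 + 0.60 + 1.20 + 1.00 + 0.30 + 0.15 = 3.55, which rounds to 3.6 and falls in the Acceptable band.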

References

references/evaluation-criteria.md - Detailed evaluation criteria with examples
references/scoring-rubric.md - Complete scoring rubric and edge cases

Skill Evaluator v1.1 - Enhanced

🔄 Workflow

Source: Google Engineering Practices - Code Review & Anthropic System Prompts

Phase 1: Structural Analysis

Compliance: Does the file structure (`scripts/`, `references/`) conform to the standard?
Metadata: Is the YAML frontmatter (`name`, `description`) complete and valid?
Modularity: Is the skill too large? Does it need to be split? (Single Responsibility Principle)

Phase 2: Content & Semantic Review

Clarity: Are the instructions written clearly and in the imperative mood? Is there any ambiguity?
Context Efficiency: Is there "unnecessary politeness" or "over-explanation"? Avoid wasting tokens.
Safety: Does the skill recommend a dangerous operation (file deletion, unauthorized access)?

Phase 3: Functionality Verification

Script Audit: Is the Python/Bash code in `scripts/` safe and in working order?
Reference Check: Are the files in `references/` really necessary, or should they be embedded in `SKILL.md`?
Usability: Can a user (or agent) read this skill and use it right away?

Checkpoints

Phase	Verification
1	Are the skill name and description consistent with each other?
2	Was an anti-pattern (e.g., a hardcoded path) detected?
3	Was an objective score (1-5) assigned according to the scoring rubric?
