skill_evaluator

Skill Evaluator (WIP)

Evaluates skills against Anthropic's official best practices for agent skill authoring. Produces structured evaluation reports with scores and actionable recommendations.

Quick Start

  1. Read the skill's SKILL.md and understand its purpose
  2. Run automated validation:
    scripts/validate_skill.py <skill-path>
  3. Perform manual evaluation against criteria below
  4. Generate evaluation report with scores and recommendations

Evaluation Workflow

Step 1: Automated Validation

Run the validation script first:
    scripts/validate_skill.py <path/to/skill>
This checks:
  • SKILL.md exists with valid YAML frontmatter
  • Name follows conventions (lowercase, hyphens, max 64 chars)
  • Description is present and under 1024 chars
  • Body is under 500 lines
  • File references are one-level deep
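The checks listed above can be approximated in a few lines. The following is a minimal sketch, not the contents of scripts/validate_skill.py: the function names are hypothetical, the frontmatter parser handles only flat `key: value` pairs (a real validator would likely use a YAML library), and treating the 1024-character and 500-line limits as inclusive is an assumption. The reference-depth check is omitted.

```python
import re

NAME_RE = re.compile(r"[a-z0-9-]+")  # lowercase letters, digits, hyphens
MAX_NAME_CHARS = 64
MAX_DESCRIPTION_CHARS = 1024  # assumption: exactly 1024 is still accepted
MAX_BODY_LINES = 500          # assumption: exactly 500 is still accepted


def split_frontmatter(text):
    """Return (metadata dict, body) or (None, text) if frontmatter is absent."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None, text
    end = next((i for i in range(1, len(lines)) if lines[i].strip() == "---"), None)
    if end is None:
        return None, text
    meta = {}
    for line in lines[1:end]:
        # Simplification: only top-level "key: value" pairs are read.
        if ":" in line and not line.startswith((" ", "#")):
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip().strip("\"'")
    return meta, "\n".join(lines[end + 1:])


def check_skill_md(text):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    meta, body = split_frontmatter(text)
    if meta is None:
        return ["missing or unterminated YAML frontmatter"]
    problems = []
    name = meta.get("name", "")
    if not NAME_RE.fullmatch(name) or len(name) > MAX_NAME_CHARS:
        problems.append("name must be lowercase letters, digits, hyphens, max 64 chars")
    description = meta.get("description", "")
    if not description:
        problems.append("description is missing")
    elif len(description) > MAX_DESCRIPTION_CHARS:
        problems.append("description exceeds 1024 chars")
    if len(body.splitlines()) > MAX_BODY_LINES:
        problems.append("body exceeds 500 lines")
    return problems
```

Returning a list of problems rather than raising on the first failure lets a report show every issue in one pass.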

Step 2: Manual Evaluation

Evaluate each dimension and assign a score (1-5):

A. Naming (Weight: 10%)

| Score | Criteria |
| --- | --- |
| 5 | Gerund form (-ing), clear purpose, memorable |
| 4 | Descriptive, follows conventions |
| 3 | Acceptable but could be clearer |
| 2 | Vague or misleading |
| 1 | Violates naming rules |

Rules: Max 64 chars, lowercase + numbers + hyphens only, no reserved words (anthropic, claude), no XML tags.
Good: processing-pdfs, analyzing-spreadsheets, building-dashboards
Bad: pdf, my-skill, ClaudeHelper, anthropic-tools
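The hard naming rules are mechanical enough to check in code, while the higher scores still need human judgment. The sketch below is illustrative only: `naming_violations` and `looks_like_gerund` are hypothetical helpers, and the gerund test is a crude heuristic that only inspects the first hyphen-separated word.

```python
import re

RESERVED_WORDS = ("anthropic", "claude")


def naming_violations(name):
    """Return the hard naming rules that `name` breaks (empty list if none)."""
    violations = []
    if len(name) > 64:
        violations.append("longer than 64 characters")
    if re.search(r"<[^>]*>", name):
        violations.append("contains an XML tag")
    if not re.fullmatch(r"[a-z0-9-]+", name):
        violations.append("uses characters other than lowercase letters, digits, hyphens")
    for word in RESERVED_WORDS:
        if word in name.lower():
            violations.append(f"contains reserved word '{word}'")
    return violations


def looks_like_gerund(name):
    """Heuristic: the first word ends in -ing (e.g. 'processing' in 'processing-pdfs')."""
    first_word = name.split("-")[0]
    return len(first_word) > 4 and first_word.endswith("ing")
```

Note that pdf and my-skill break none of the hard rules; they are listed as bad examples because they are vague, which only a manual review can score.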

B. Description (Weight: 20%)

| Score | Criteria |
| --- | --- |
| 5 | Clear functionality + specific activation triggers + third person |
| 4 | Good description with some triggers |
| 3 | Adequate but missing triggers or vague |
| 2 | Too brief or unclear purpose |
| 1 | Missing or unhelpful |

Must include: What the skill does AND when to use it.
Good: "Extracts text from PDFs. Use when working with PDF documents for text extraction, form parsing, or content analysis."
Bad: "A skill for PDFs." or "Helps with documents."

C. Content Quality (Weight: 30%)

| Score | Criteria |
| --- | --- |
| 5 | Concise, assumes Claude intelligence, actionable instructions |
| 4 | Generally good, minor verbosity |
| 3 | Some unnecessary explanations or redundancy |
| 2 | Overly verbose or confusing |
| 1 | Bloated, explains obvious concepts |

Ask: "Does Claude really need this explanation?" Remove anything Claude already knows.

D. Structure & Organization (Weight: 25%)

| Score | Criteria |
| --- | --- |
| 5 | Excellent progressive disclosure, clear navigation, optimal length |
| 4 | Good organization, appropriate file splits |
| 3 | Acceptable but could be better organized |
| 2 | Poor organization, missing references, or bloated SKILL.md |
| 1 | No structure, everything dumped in SKILL.md |

Check:
  • SKILL.md under 500 lines
  • References are one-level deep (no nested chains)
  • Long reference files (>100 lines) have table of contents
  • Uses forward slashes in all paths
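Two of the checks above lend themselves to quick text heuristics. This is a rough sketch under stated assumptions: both helper names are hypothetical, the path pattern will also match other backslash-joined tokens, and "has a table of contents" is approximated by looking for a contents heading in the first 20 lines.

```python
import re


def find_backslash_paths(text):
    """Return tokens that look like Windows-style paths (segments joined by backslashes)."""
    return re.findall(r"[\w.-]+(?:\\[\w.-]+)+", text)


def needs_table_of_contents(reference_text, limit=100):
    """True if a reference file is longer than `limit` lines and shows no contents heading."""
    lines = reference_text.splitlines()
    if len(lines) <= limit:
        return False
    # Assumption: a table of contents, if present, appears near the top of the file.
    head = "\n".join(lines[:20]).lower()
    return "table of contents" not in head and "## contents" not in head
```

Hits from either helper are candidates for manual review, not automatic failures.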

E. Degrees of Freedom (Weight: 10%)

| Score | Criteria |
| --- | --- |
| 5 | Perfect match: high freedom for flexible tasks, low for fragile operations |
| 4 | Generally appropriate freedom levels |
| 3 | Acceptable but could be better calibrated |
| 2 | Mismatched: too rigid or too loose |
| 1 | Completely wrong freedom level for the task type |

Guideline:
  • High freedom (text): Multiple valid approaches, context-dependent
  • Medium freedom (parameterized): Preferred pattern exists, some variation OK
  • Low freedom (specific scripts): Fragile operations, exact sequence required

F. Anti-Pattern Check (Weight: 5%)

Deduct points for each anti-pattern found:
  • Too many options without clear recommendation (-1)
  • Time-sensitive information with date conditionals (-1)
  • Inconsistent terminology (-1)
  • Windows-style paths (backslashes) (-1)
  • Deeply nested references (more than one level) (-2)
  • Scripts that punt error handling to Claude (-1)
  • Magic numbers without justification (-1)
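The deductions can be turned into a dimension score with simple arithmetic. The text does not state the starting value or a lower bound, so this sketch assumes the score starts at 5, cannot drop below 1, and counts each kind of anti-pattern once; the dictionary keys are hypothetical labels for the items above.

```python
# Points deducted per anti-pattern, taken from the list above.
DEDUCTIONS = {
    "too_many_options": 1,
    "time_sensitive_info": 1,
    "inconsistent_terminology": 1,
    "windows_paths": 1,
    "nested_references": 2,
    "scripts_punt_errors": 1,
    "magic_numbers": 1,
}


def anti_pattern_score(found, start=5, floor=1):
    """Subtract deductions for each distinct anti-pattern; clamp at `floor` (assumed)."""
    total = sum(DEDUCTIONS[name] for name in set(found))
    return max(floor, start - total)


print(anti_pattern_score(["nested_references", "windows_paths"]))  # prints 2
```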

Step 3: Generate Report

Use this template:

Skill Evaluation Report: [skill-name]

Summary

  • Overall Score: X.X/5.0
  • Recommendation: [Ready for publication / Needs minor improvements / Needs major revision]

Dimension Scores

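| Dimension | Score | Weight | Weighted |
| --- | --- | --- | --- |
| Naming | X/5 | 10% | X.XX |
| Description | X/5 | 20% | X.XX |
| Content Quality | X/5 | 30% | X.XX |
| Structure | X/5 | 25% | X.XX |
| Degrees of Freedom | X/5 | 10% | X.XX |
| Anti-Patterns | X/5 | 5% | X.XX |
| Total | | 100% | X.XX |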

Strengths

  • [List 2-3 things done well]

Areas for Improvement

  • [List specific issues with actionable fixes]

Anti-Patterns Found

  • [List any anti-patterns detected]

Recommendations

  1. [Priority 1 fix]
  2. [Priority 2 fix]
  3. [Priority 3 fix]

Pre-Publication Checklist

  • Description is specific with activation triggers
  • SKILL.md under 500 lines
  • One-level-deep file references
  • Forward slashes in all paths
  • No time-sensitive information
  • Consistent terminology
  • Concrete examples provided
  • Scripts handle errors explicitly
  • All configuration values justified
  • Required packages listed
  • Tested with Haiku, Sonnet, Opus

Score Interpretation

| Score Range | Rating | Action |
| --- | --- | --- |
| 4.5 - 5.0 | Excellent | Ready for publication |
| 4.0 - 4.4 | Good | Minor improvements recommended |
| 3.0 - 3.9 | Acceptable | Several improvements needed |
| 2.0 - 2.9 | Needs Work | Major revision required |
| 1.0 - 1.9 | Poor | Fundamental redesign needed |
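To make the arithmetic concrete, the overall score is the weighted sum of the six dimension scores, which is then looked up in the table above. A minimal sketch follows; the function names and dictionary keys are hypothetical, and rounding to one decimal before the lookup is an assumption made so that every result falls inside one of the listed ranges.

```python
# Weights in percent, taken from the dimension headings (they sum to 100).
WEIGHTS = {
    "naming": 10,
    "description": 20,
    "content_quality": 30,
    "structure": 25,
    "degrees_of_freedom": 10,
    "anti_patterns": 5,
}


def overall_score(scores):
    """Weighted average of 1-5 dimension scores, rounded to one decimal place."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise ValueError(f"missing dimension scores: {sorted(missing)}")
    return round(sum(scores[dim] * weight for dim, weight in WEIGHTS.items()) / 100, 1)


def rating(score):
    """Map an overall score to the rating column of the interpretation table."""
    if score >= 4.5:
        return "Excellent"
    if score >= 4.0:
        return "Good"
    if score >= 3.0:
        return "Acceptable"
    if score >= 2.0:
        return "Needs Work"
    return "Poor"


example = {
    "naming": 4,
    "description": 5,
    "content_quality": 4,
    "structure": 3,
    "degrees_of_freedom": 4,
    "anti_patterns": 5,
}
total = overall_score(example)  # (40 + 100 + 120 + 75 + 40 + 25) / 100 = 4.0
print(total, rating(total))     # prints: 4.0 Good
```

Integer percentage weights keep the sum exact until the final division, which avoids small floating-point drift near the range boundaries.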

References

  • references/evaluation-criteria.md - Detailed evaluation criteria with examples
  • references/scoring-rubric.md - Complete scoring rubric and edge cases
Skill Evaluator v1.1 - Enhanced

🔄 Workflow

Phase 1: Structural Analysis

  • Compliance: Does the file structure (scripts/, references/) conform to the standard?
  • Metadata: Is the YAML frontmatter (name, description) complete and valid?
  • Modularity: Is the skill too large? Does it need to be split? (Single Responsibility Principle)

Phase 2: Content & Semantic Review

  • Clarity: Are the instructions written clearly and in the imperative mood? Is there any ambiguity?
  • Context Efficiency: Is there any "unnecessary politeness" or "over-explanation"? Avoid wasting tokens.
  • Safety: Does the skill suggest a dangerous operation (file deletion, unauthorized access)?

Phase 3: Functionality Verification

  • Script Audit: Is the Python/Bash code in scripts/ safe and working?
  • Reference Check: Are the files in references/ really necessary, or should they be embedded in SKILL.md?
  • Usability: Can a user (or agent) read this skill and use it right away?

Checkpoints

| Phase | Verification |
| --- | --- |
| 1 | Are the skill name and description consistent with each other? |
| 2 | Was an anti-pattern (e.g. a hardcoded path) detected? |
| 3 | Was an objective score (1-5) assigned according to the scoring rubric? |