Skill Grader

Structured evaluation rubric for Claude Agent Skills. Produces letter grades (A+ through F) on 10 axes plus an overall grade, with specific improvement recommendations for each axis.
Designed for sub-agents and non-expert reviewers who need a mechanical, repeatable process for assessing skill quality without deep domain expertise.


When to Use


Use for:
  • Auditing a single skill's quality
  • Comparing skills against each other
  • Prioritizing which skills to improve first
  • Quality control sweeps across a skill library
  • Generating improvement roadmaps
NOT for:
  • Creating new skills (use skill-architect)
  • Grading code quality or non-skill documents
  • Evaluating agent performance (different from skill quality)


Grading Process


```mermaid
flowchart TD
  A[Read SKILL.md + all files] --> B[Score each of 10 axes]
  B --> C[Assign letter grade per axis]
  C --> D[Compute overall grade]
  D --> E[Write improvement recommendations]
  E --> F[Produce grading report]
```

Step-by-Step


  1. Read the entire skill folder — SKILL.md, all references, scripts, CHANGELOG, README
  2. Score each axis — Use the rubric below (0-100 per axis)
  3. Convert to letter grade — See grade scale
  4. Compute overall grade — Weighted average (Description and Scope are 2x weight)
  5. Write 1-3 specific improvements per axis scoring below B+
  6. Produce the grading report in the output format below


The 10 Evaluation Axes


Axis 1: Description Quality (Weight: 2x)


Does the description follow `[What] [When] [Keywords]. NOT for [Exclusions]`?

| Grade | Criteria |
|-------|----------|
| A | Specific verb+noun, domain keywords users would type, 2-5 explicit NOT exclusions, 25-50 words |
| B | Has keywords and NOT clause, but slightly vague or missing synonym coverage |
| C | Too generic, missing NOT clause, or >100 words of process detail |
| D | Single vague sentence ("helps with X") or name/description mismatch |
| F | Missing or empty description |

Axis 2: Scope Discipline (Weight: 2x)


Is the skill narrowly focused on one expertise type, or a catch-all?
| Grade | Criteria |
|-------|----------|
| A | One clear expertise domain, "When to Use" and "NOT for" sections both present and specific |
| B | Mostly focused, minor boundary ambiguity |
| C | Covers 2-3 related but distinct domains, should probably be split |
| D | Catch-all skill ("helps with anything related to X") |
| F | No scope boundaries defined at all |

Axis 3: Progressive Disclosure


Does the skill follow the three-layer architecture (metadata → SKILL.md → references)?
| Grade | Criteria |
|-------|----------|
| A | SKILL.md <300 lines, heavy content in references, reference index in SKILL.md with 1-line descriptions |
| B | SKILL.md <500 lines, some references used, index present |
| C | SKILL.md >500 lines, or all content inlined with no references |
| D | SKILL.md >800 lines, or references exist but aren't indexed in SKILL.md |
| F | Single massive file with no structure |

Axis 4: Anti-Pattern Coverage


Does the skill encode expert knowledge that prevents common mistakes?
| Grade | Criteria |
|-------|----------|
| A | 3+ anti-patterns with Novice/Expert/Timeline template, LLM-mistake notes |
| B | 1-2 anti-patterns with clear explanation |
| C | Anti-patterns mentioned but no structured template |
| D | No anti-patterns, just positive instructions |
| F | Contains advice that IS an anti-pattern (outdated, harmful) |

Axis 5: Self-Contained Tools


Does the skill include working tools (scripts, MCPs, subagents)?
| Grade | Criteria |
|-------|----------|
| A | Working scripts with CLI interface, error handling, dependency docs; OR valid "no tools needed" justification |
| B | Scripts exist and work but lack error handling or docs |
| C | Scripts referenced but are templates/pseudocode |
| D | Phantom tools (SKILL.md references files that don't exist) |
| F | References non-existent tools AND no acknowledgment |
Note: Not every skill needs tools. A pure decision-tree skill can score A if tools aren't applicable.

Axis 6: Activation Precision


Would the skill activate correctly on relevant queries and stay silent on irrelevant ones?
| Grade | Criteria |
|-------|----------|
| A | Description has specific keywords matching user language, clear NOT clause, no obvious false-positive vectors |
| B | Good keywords, minor false-positive risk |
| C | Generic keywords that overlap with other skills |
| D | No specific keywords, or NOT clause contradicts intended use |
| F | Description would cause constant false activation |

Axis 7: Visual Artifacts


Does the skill use Mermaid diagrams, code examples, and tables effectively?
| Grade | Criteria |
|-------|----------|
| A | Decision trees as Mermaid flowcharts, tables for comparisons, code examples for concrete patterns |
| B | Some diagrams or tables, but key decision trees still in prose |
| C | Tables used but no Mermaid diagrams for processes |
| D | Prose-only, no visual structure |
| F | Wall of text with no formatting aids |

Axis 8: Output Contracts


Does the skill define what it produces in a format consumable by other agents?
| Grade | Criteria |
|-------|----------|
| A | Explicit output format (JSON schema, markdown template, or structured sections), subagent-consumable |
| B | Output format implied but not explicitly documented |
| C | No output format, but content is structured enough to infer |
| D | Unstructured prose output expected |
| F | N/A (pure reference skill) — exempt from this axis |

Axis 9: Temporal Awareness


Does the skill track when knowledge was current and what has changed?
| Grade | Criteria |
|-------|----------|
| A | Timelines in anti-patterns, "as of [date]" markers, CHANGELOG with dates |
| B | Some temporal context, CHANGELOG exists |
| C | No dates on knowledge, but CHANGELOG exists |
| D | No temporal context anywhere, knowledge could be stale |
| F | Contains demonstrably outdated advice without warning |

Axis 10: Documentation Quality


README, CHANGELOG, and reference organization.
| Grade | Criteria |
|-------|----------|
| A | README with quick start, CHANGELOG with dated versions, references well-organized with clear filenames |
| B | README and CHANGELOG exist, references present |
| C | SKILL.md is the only file, but it's well-structured |
| D | No README, no CHANGELOG, disorganized references |
| F | SKILL.md is the only file and it's poorly structured |


Grade Scale


| Letter | Score Range | Meaning |
|--------|-------------|---------|
| A+ | 97-100 | Exemplary — sets the standard |
| A | 93-96 | Excellent — minor improvements possible |
| A- | 90-92 | Very good — a few small gaps |
| B+ | 87-89 | Good — notable room for improvement |
| B | 83-86 | Solid — several areas need work |
| B- | 80-82 | Above average — meaningful gaps |
| C+ | 77-79 | Average — significant improvements needed |
| C | 73-76 | Below average — major gaps |
| C- | 70-72 | Barely adequate |
| D+ | 67-69 | Poor — fundamental issues |
| D | 63-66 | Very poor — needs major rework |
| D- | 60-62 | Near-failing quality |
| F | <60 | Failing — start over |
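The scale maps mechanically from numbers to letters, so the conversion can be scripted. A minimal sketch (the function name is illustrative, not part of the skill):

```python
def score_to_letter(score: float) -> str:
    """Map a 0-100 axis or overall score to a letter grade per the scale above."""
    bands = [
        (97, "A+"), (93, "A"), (90, "A-"),
        (87, "B+"), (83, "B"), (80, "B-"),
        (77, "C+"), (73, "C"), (70, "C-"),
        (67, "D+"), (63, "D"), (60, "D-"),
    ]
    # Highest band whose lower cutoff the score reaches wins.
    for cutoff, letter in bands:
        if score >= cutoff:
            return letter
    return "F"  # anything below 60 is failing
```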


Overall Grade Computation


Axes 1 (Description) and 2 (Scope) carry 2x weight. All others carry 1x weight. If Axis 8 (Output Contracts) is marked exempt, remove it from the calculation.
Overall = (2×Axis1 + 2×Axis2 + Axis3 + Axis4 + Axis5 + Axis6 + Axis7 + Axis8 + Axis9 + Axis10) / 12
Convert the numeric average to a letter grade using the scale above.


Output Format


Produce this exact structure:
```markdown
# Skill Grading Report: [skill-name]

Graded: [date]
Overall Grade: [letter] ([score]/100)

## Axis Grades

| # | Axis | Grade | Score | Key Finding |
|---|------|-------|-------|-------------|
| 1 | Description Quality | [grade] | [score] | [1-line finding] |
| 2 | Scope Discipline | [grade] | [score] | [1-line finding] |
| 3 | Progressive Disclosure | [grade] | [score] | [1-line finding] |
| 4 | Anti-Pattern Coverage | [grade] | [score] | [1-line finding] |
| 5 | Self-Contained Tools | [grade] | [score] | [1-line finding] |
| 6 | Activation Precision | [grade] | [score] | [1-line finding] |
| 7 | Visual Artifacts | [grade] | [score] | [1-line finding] |
| 8 | Output Contracts | [grade] | [score] | [1-line finding] |
| 9 | Temporal Awareness | [grade] | [score] | [1-line finding] |
| 10 | Documentation Quality | [grade] | [score] | [1-line finding] |

## Top 3 Improvements (Highest Impact)

1. [Axis]: [Specific action] — [Why this matters, expected grade improvement]
2. [Axis]: [Specific action] — [Why this matters, expected grade improvement]
3. [Axis]: [Specific action] — [Why this matters, expected grade improvement]

## Detailed Notes

### [Axis name] ([grade])

[2-3 sentences of specific feedback with examples from the skill]

[Repeat for each axis scoring below B+]

---
```

Quick Grading (Abbreviated)


For rapid triage across many skills, produce only:
```markdown
| Skill | Overall | Desc | Scope | Disc | Anti | Tools | Activ | Visual | Output | Temp | Docs |
|-------|---------|------|-------|------|------|-------|-------|--------|--------|------|------|
| [name] | [grade] | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
```


Anti-Patterns in Grading


Grade Inflation


Wrong: Giving B+ because "it's pretty good" without checking criteria.
Right: Match observations to the rubric table literally. If the description lacks a NOT clause, it cannot score above C on Axis 1.

Missing Context


Wrong: Grading a pure decision-tree skill poorly on Axis 5 (tools) because it has no scripts.
Right: Mark Axis 5 as "A — tools not applicable for this skill type."

Ignoring Phantoms


Wrong: Scoring Axis 5 as B because scripts are "referenced."
Right: Actually check if every referenced file exists. If `scripts/validate.py` is mentioned but doesn't exist, that's D.
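The existence check is easy to automate. A sketch, assuming the common convention that a skill references support files by relative paths such as `scripts/...` or `references/...` (the regex and directory names are assumptions, not part of the rubric):

```python
import re
from pathlib import Path

def find_phantom_references(skill_dir: str) -> list[str]:
    """Return relative paths mentioned in SKILL.md that do not exist on disk."""
    root = Path(skill_dir)
    text = (root / "SKILL.md").read_text(encoding="utf-8")
    # Assumed convention: support files live under scripts/ or references/.
    candidates = re.findall(r"\b(?:scripts|references)/[\w./-]+\.\w+", text)
    return sorted({c for c in candidates if not (root / c).exists()})
```

Any path this returns is a phantom tool: cap Axis 5 at D until the file exists or the reference is removed.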