Skill Grader

Structured evaluation rubric for Claude Agent Skills. Produces letter grades (A+ through F) on 10 axes plus an overall grade, with specific improvement recommendations for each axis.
Designed for sub-agents and non-expert reviewers who need a mechanical, repeatable process for assessing skill quality without deep domain expertise.


When to Use


Use for:
  • Auditing a single skill's quality
  • Comparing skills against each other
  • Prioritizing which skills to improve first
  • Quality control sweeps across a skill library
  • Generating improvement roadmaps
NOT for:
  • Creating new skills (use skill-architect)
  • Grading code quality or non-skill documents
  • Evaluating agent performance (different from skill quality)


Grading Process


```mermaid
flowchart TD
  A[Read SKILL.md + all files] --> B[Score each of 10 axes]
  B --> C[Assign letter grade per axis]
  C --> D[Compute overall grade]
  D --> E[Write improvement recommendations]
  E --> F[Produce grading report]
```

Step-by-Step


  1. Read the entire skill folder — SKILL.md, all references, scripts, CHANGELOG, README
  2. Score each axis — Use the rubric below (0-100 per axis)
  3. Convert to letter grade — See grade scale
  4. Compute overall grade — Weighted average (Description and Scope are 2x weight)
  5. Write 1-3 specific improvements per axis scoring below B+
  6. Produce the grading report in the output format below


The 10 Evaluation Axes


Axis 1: Description Quality (Weight: 2x)


Does the description follow `[What] [When] [Keywords]. NOT for [Exclusions]`?

| Grade | Criteria |
|-------|----------|
| A | Specific verb+noun, domain keywords users would type, 2-5 explicit NOT exclusions, 25-50 words |
| B | Has keywords and NOT clause, but slightly vague or missing synonym coverage |
| C | Too generic, missing NOT clause, or >100 words of process detail |
| D | Single vague sentence ("helps with X") or name/description mismatch |
| F | Missing or empty description |

Axis 2: Scope Discipline (Weight: 2x)


Is the skill narrowly focused on one expertise type, or a catch-all?
| Grade | Criteria |
|-------|----------|
| A | One clear expertise domain, "When to Use" and "NOT for" sections both present and specific |
| B | Mostly focused, minor boundary ambiguity |
| C | Covers 2-3 related but distinct domains, should probably be split |
| D | Catch-all skill ("helps with anything related to X") |
| F | No scope boundaries defined at all |

Axis 3: Progressive Disclosure


Does the skill follow the three-layer architecture (metadata → SKILL.md → references)?
| Grade | Criteria |
|-------|----------|
| A | SKILL.md <300 lines, heavy content in references, reference index in SKILL.md with 1-line descriptions |
| B | SKILL.md <500 lines, some references used, index present |
| C | SKILL.md >500 lines, or all content inlined with no references |
| D | SKILL.md >800 lines, or references exist but aren't indexed in SKILL.md |
| F | Single massive file with no structure |

Axis 4: Anti-Pattern Coverage


Does the skill encode expert knowledge that prevents common mistakes?
| Grade | Criteria |
|-------|----------|
| A | 3+ anti-patterns with Novice/Expert/Timeline template, LLM-mistake notes |
| B | 1-2 anti-patterns with clear explanation |
| C | Anti-patterns mentioned but no structured template |
| D | No anti-patterns, just positive instructions |
| F | Contains advice that IS an anti-pattern (outdated, harmful) |

Axis 5: Self-Contained Tools


Does the skill include working tools (scripts, MCPs, subagents)?
| Grade | Criteria |
|-------|----------|
| A | Working scripts with CLI interface, error handling, dependency docs; OR valid "no tools needed" justification |
| B | Scripts exist and work but lack error handling or docs |
| C | Scripts referenced but are templates/pseudocode |
| D | Phantom tools (SKILL.md references files that don't exist) |
| F | References non-existent tools AND no acknowledgment |
Note: Not every skill needs tools. A pure decision-tree skill can score A if tools aren't applicable.

Axis 6: Activation Precision


Would the skill activate correctly on relevant queries and stay silent on irrelevant ones?
| Grade | Criteria |
|-------|----------|
| A | Description has specific keywords matching user language, clear NOT clause, no obvious false-positive vectors |
| B | Good keywords, minor false-positive risk |
| C | Generic keywords that overlap with other skills |
| D | No specific keywords, or NOT clause contradicts intended use |
| F | Description would cause constant false activation |

Axis 7: Visual Artifacts


Does the skill use Mermaid diagrams, code examples, and tables effectively?
| Grade | Criteria |
|-------|----------|
| A | Decision trees as Mermaid flowcharts, tables for comparisons, code examples for concrete patterns |
| B | Some diagrams or tables, but key decision trees still in prose |
| C | Tables used but no Mermaid diagrams for processes |
| D | Prose-only, no visual structure |
| F | Wall of text with no formatting aids |

Axis 8: Output Contracts


Does the skill define what it produces in a format consumable by other agents?
| Grade | Criteria |
|-------|----------|
| A | Explicit output format (JSON schema, markdown template, or structured sections), subagent-consumable |
| B | Output format implied but not explicitly documented |
| C | No output format, but content is structured enough to infer |
| D | Unstructured prose output expected |
| F | N/A (pure reference skill) — exempt from this axis |

Axis 9: Temporal Awareness


Does the skill track when knowledge was current and what has changed?
| Grade | Criteria |
|-------|----------|
| A | Timelines in anti-patterns, "as of [date]" markers, CHANGELOG with dates |
| B | Some temporal context, CHANGELOG exists |
| C | No dates on knowledge, but CHANGELOG exists |
| D | No temporal context anywhere, knowledge could be stale |
| F | Contains demonstrably outdated advice without warning |

Axis 10: Documentation Quality


README, CHANGELOG, and reference organization.
| Grade | Criteria |
|-------|----------|
| A | README with quick start, CHANGELOG with dated versions, references well-organized with clear filenames |
| B | README and CHANGELOG exist, references present |
| C | SKILL.md is the only file, but it's well-structured |
| D | No README, no CHANGELOG, disorganized references |
| F | SKILL.md is the only file and it's poorly structured |


Grade Scale


| Letter | Score Range | Meaning |
|--------|-------------|---------|
| A+ | 97-100 | Exemplary — sets the standard |
| A | 93-96 | Excellent — minor improvements possible |
| A- | 90-92 | Very good — a few small gaps |
| B+ | 87-89 | Good — notable room for improvement |
| B | 83-86 | Solid — several areas need work |
| B- | 80-82 | Above average — meaningful gaps |
| C+ | 77-79 | Average — significant improvements needed |
| C | 73-76 | Below average — major gaps |
| C- | 70-72 | Barely adequate |
| D+ | 67-69 | Poor — fundamental issues |
| D | 63-66 | Very poor — needs major rework |
| D- | 60-62 | Near-failing quality |
| F | <60 | Failing — start over |
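The scale maps mechanically from numbers to letters, so the conversion can be scripted. A minimal sketch (the function name is illustrative, not part of the skill):

```python
def score_to_letter(score: float) -> str:
    """Map a 0-100 axis or overall score to a letter grade per the scale above."""
    bands = [
        (97, "A+"), (93, "A"), (90, "A-"),
        (87, "B+"), (83, "B"), (80, "B-"),
        (77, "C+"), (73, "C"), (70, "C-"),
        (67, "D+"), (63, "D"), (60, "D-"),
    ]
    # Highest band whose lower cutoff the score reaches wins.
    for cutoff, letter in bands:
        if score >= cutoff:
            return letter
    return "F"  # anything below 60 is failing
```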


Overall Grade Computation


Axes 1 (Description) and 2 (Scope) carry 2x weight. All others carry 1x weight. If Axis 8 (Output Contracts) is marked exempt, remove it from the calculation.
Overall = (2×Axis1 + 2×Axis2 + Axis3 + Axis4 + Axis5 + Axis6 + Axis7 + Axis8 + Axis9 + Axis10) / 12
Convert the numeric average to a letter grade using the scale above.


Output Format


Produce this exact structure:
```markdown
# Skill Grading Report: [skill-name]

Graded: [date]
Overall Grade: [letter] ([score]/100)

## Axis Grades

| # | Axis | Grade | Score | Key Finding |
|---|------|-------|-------|-------------|
| 1 | Description Quality | [grade] | [score] | [1-line finding] |
| 2 | Scope Discipline | [grade] | [score] | [1-line finding] |
| 3 | Progressive Disclosure | [grade] | [score] | [1-line finding] |
| 4 | Anti-Pattern Coverage | [grade] | [score] | [1-line finding] |
| 5 | Self-Contained Tools | [grade] | [score] | [1-line finding] |
| 6 | Activation Precision | [grade] | [score] | [1-line finding] |
| 7 | Visual Artifacts | [grade] | [score] | [1-line finding] |
| 8 | Output Contracts | [grade] | [score] | [1-line finding] |
| 9 | Temporal Awareness | [grade] | [score] | [1-line finding] |
| 10 | Documentation Quality | [grade] | [score] | [1-line finding] |

## Top 3 Improvements (Highest Impact)

1. [Axis]: [Specific action] — [Why this matters, expected grade improvement]
2. [Axis]: [Specific action] — [Why this matters, expected grade improvement]
3. [Axis]: [Specific action] — [Why this matters, expected grade improvement]

## Detailed Notes

### [Axis name] ([grade])

[2-3 sentences of specific feedback with examples from the skill]

[Repeat for each axis scoring below B+]

---
```

Quick Grading (Abbreviated)


For rapid triage across many skills, produce only:
```markdown
| Skill | Overall | Desc | Scope | Disc | Anti | Tools | Activ | Visual | Output | Temp | Docs |
|-------|---------|------|-------|------|------|-------|-------|--------|--------|------|------|
| [name] | [grade] | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
```


Anti-Patterns in Grading


Grade Inflation


Wrong: Giving B+ because "it's pretty good" without checking criteria.
Right: Match observations to the rubric table literally. If the description lacks a NOT clause, it cannot score above C on Axis 1.

Missing Context


Wrong: Grading a pure decision-tree skill poorly on Axis 5 (tools) because it has no scripts.
Right: Mark Axis 5 as "A — tools not applicable for this skill type."

Ignoring Phantoms


Wrong: Scoring Axis 5 as B because scripts are "referenced."
Right: Actually check if every referenced file exists. If `scripts/validate.py` is mentioned but doesn't exist, that's D.
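The existence check is easy to automate. A sketch, assuming the common convention that a skill references support files by relative paths such as `scripts/...` or `references/...` (the regex and directory names are assumptions, not part of the rubric):

```python
import re
from pathlib import Path

def find_phantom_references(skill_dir: str) -> list[str]:
    """Return relative paths mentioned in SKILL.md that do not exist on disk."""
    root = Path(skill_dir)
    text = (root / "SKILL.md").read_text(encoding="utf-8")
    # Assumed convention: support files live under scripts/ or references/.
    candidates = re.findall(r"\b(?:scripts|references)/[\w./-]+\.\w+", text)
    return sorted({c for c in candidates if not (root / c).exists()})
```

Any path this returns is a phantom tool: cap Axis 5 at D until the file exists or the reference is removed.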