skill-grader
Skill Grader
Structured evaluation rubric for Claude Agent Skills. Produces letter grades (A+ through F) on 10 axes plus an overall grade, with specific improvement recommendations for each axis.
Designed for sub-agents and non-expert reviewers who need a mechanical, repeatable process for assessing skill quality without deep domain expertise.
When to Use
✅ Use for:
- Auditing a single skill's quality
- Comparing skills against each other
- Prioritizing which skills to improve first
- Quality control sweeps across a skill library
- Generating improvement roadmaps
❌ NOT for:
- Creating new skills (use `skill-architect`)
- Grading code quality or non-skill documents
- Evaluating agent performance (different from skill quality)
Grading Process
```mermaid
flowchart TD
    A[Read SKILL.md + all files] --> B[Score each of 10 axes]
    B --> C[Assign letter grade per axis]
    C --> D[Compute overall grade]
    D --> E[Write improvement recommendations]
    E --> F[Produce grading report]
```
Step-by-Step
- Read the entire skill folder — SKILL.md, all references, scripts, CHANGELOG, README
- Score each axis — Use the rubric below (0-100 per axis)
- Convert to letter grade — See grade scale
- Compute overall grade — Weighted average (Description and Scope are 2x weight)
- Write 1-3 specific improvements per axis scoring below B+
- Produce the grading report in the output format below
The 10 Evaluation Axes
Axis 1: Description Quality (Weight: 2x)
Does the description follow `[What] [When] [Keywords]. NOT for [Exclusions]`?
| Grade | Criteria |
|---|---|
| A | Specific verb+noun, domain keywords users would type, 2-5 explicit NOT exclusions, 25-50 words |
| B | Has keywords and NOT clause, but slightly vague or missing synonym coverage |
| C | Too generic, missing NOT clause, or >100 words of process detail |
| D | Single vague sentence ("helps with X") or name/description mismatch |
| F | Missing or empty description |
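The countable parts of this axis (word count, presence of a NOT clause, number of exclusions) can be collected mechanically before applying judgment to the rest. The sketch below is one possible way to do that. The function name `check_description` and the assumption that exclusions are comma-separated after "NOT for" are illustrative and not part of the rubric.

```python
import re


def check_description(description: str) -> dict:
    """Collect mechanical facts about a skill description (hypothetical helper)."""
    words = description.split()
    # Look for the "NOT for" clause required by the description formula.
    not_match = re.search(r"\bNOT for\b(.*)", description, flags=re.DOTALL)
    exclusions = []
    if not_match:
        # Assumption: exclusions are written as a comma-separated list.
        parts = not_match.group(1).split(",")
        exclusions = [p.strip(" .\n") for p in parts if p.strip(" .\n")]
    return {
        "word_count": len(words),
        "in_a_word_range": 25 <= len(words) <= 50,  # A-grade range from the table
        "has_not_clause": not_match is not None,
        "exclusion_count": len(exclusions),
    }


facts = check_description(
    "Grades Claude Agent Skills on ten axes. "
    "NOT for creating skills, grading code, evaluating agents."
)
print(facts)
```

These facts only narrow the grade. Whether the keywords match what users would type still needs a human or model reading.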
Axis 2: Scope Discipline (Weight: 2x)
Is the skill narrowly focused on one expertise type, or a catch-all?
| Grade | Criteria |
|---|---|
| A | One clear expertise domain, "When to Use" and "NOT for" sections both present and specific |
| B | Mostly focused, minor boundary ambiguity |
| C | Covers 2-3 related but distinct domains, should probably be split |
| D | Catch-all skill ("helps with anything related to X") |
| F | No scope boundaries defined at all |
Axis 3: Progressive Disclosure
Does the skill follow the three-layer architecture (metadata → SKILL.md → references)?
| Grade | Criteria |
|---|---|
| A | SKILL.md <300 lines, heavy content in references, reference index in SKILL.md with 1-line descriptions |
| B | SKILL.md <500 lines, some references used, index present |
| C | SKILL.md >500 lines, or all content inlined with no references |
| D | SKILL.md >800 lines, or references exist but aren't indexed in SKILL.md |
| F | Single massive file with no structure |
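The line-count thresholds in this table set a ceiling on the grade before the reference structure is considered. A minimal sketch of that ceiling follows. The name `line_count_ceiling` is hypothetical, and because the table leaves exactly 500 lines unspecified, treating it as C is an assumption.

```python
def line_count_ceiling(skill_md_text: str) -> str:
    """Return the best Axis 3 band the SKILL.md line count allows.

    This only reflects the line-count thresholds. The reference index and
    reference usage criteria still have to be checked separately.
    """
    n = len(skill_md_text.splitlines())
    if n < 300:
        return "A"
    if n < 500:
        return "B"
    if n <= 800:
        # Assumption: exactly 500 lines falls in C, since the table
        # only names "<500" for B and ">500" for C.
        return "C"
    return "D"


print(line_count_ceiling("line\n" * 420))
```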
Axis 4: Anti-Pattern Coverage
Does the skill encode expert knowledge that prevents common mistakes?
| Grade | Criteria |
|---|---|
| A | 3+ anti-patterns with Novice/Expert/Timeline template, LLM-mistake notes |
| B | 1-2 anti-patterns with clear explanation |
| C | Anti-patterns mentioned but no structured template |
| D | No anti-patterns, just positive instructions |
| F | Contains advice that IS an anti-pattern (outdated, harmful) |
Axis 5: Self-Contained Tools
Does the skill include working tools (scripts, MCPs, subagents)?
| Grade | Criteria |
|---|---|
| A | Working scripts with CLI interface, error handling, dependency docs; OR valid "no tools needed" justification |
| B | Scripts exist and work but lack error handling or docs |
| C | Scripts referenced but are templates/pseudocode |
| D | Phantom tools (SKILL.md references files that don't exist) |
| F | References non-existent tools AND no acknowledgment |
Note: Not every skill needs tools. A pure decision-tree skill can score A if tools aren't applicable.
Axis 6: Activation Precision
Would the skill activate correctly on relevant queries and stay silent on irrelevant ones?
| Grade | Criteria |
|---|---|
| A | Description has specific keywords matching user language, clear NOT clause, no obvious false-positive vectors |
| B | Good keywords, minor false-positive risk |
| C | Generic keywords that overlap with other skills |
| D | No specific keywords, or NOT clause contradicts intended use |
| F | Description would cause constant false activation |
Axis 7: Visual Artifacts
Does the skill use Mermaid diagrams, code examples, and tables effectively?
| Grade | Criteria |
|---|---|
| A | Decision trees as Mermaid flowcharts, tables for comparisons, code examples for concrete patterns |
| B | Some diagrams or tables, but key decision trees still in prose |
| C | Tables used but no Mermaid diagrams for processes |
| D | Prose-only, no visual structure |
| F | Wall of text with no formatting aids |
Axis 8: Output Contracts
Does the skill define what it produces in a format consumable by other agents?
| Grade | Criteria |
|---|---|
| A | Explicit output format (JSON schema, markdown template, or structured sections), subagent-consumable |
| B | Output format implied but not explicitly documented |
| C | No output format, but content is structured enough to infer |
| D | Unstructured prose output expected |
| F | N/A (pure reference skill) — exempt from this axis |
Axis 9: Temporal Awareness
Does the skill track when knowledge was current and what has changed?
| Grade | Criteria |
|---|---|
| A | Timelines in anti-patterns, "as of [date]" markers, CHANGELOG with dates |
| B | Some temporal context, CHANGELOG exists |
| C | No dates on knowledge, but CHANGELOG exists |
| D | No temporal context anywhere, knowledge could be stale |
| F | Contains demonstrably outdated advice without warning |
Axis 10: Documentation Quality
README, CHANGELOG, and reference organization.
| Grade | Criteria |
|---|---|
| A | README with quick start, CHANGELOG with dated versions, references well-organized with clear filenames |
| B | README and CHANGELOG exist, references present |
| C | SKILL.md is the only file, but it's well-structured |
| D | No README, no CHANGELOG, disorganized references |
| F | SKILL.md is the only file and it's poorly structured |
Grade Scale
| Letter | Score Range | Meaning |
|---|---|---|
| A+ | 97-100 | Exemplary — sets the standard |
| A | 93-96 | Excellent — minor improvements possible |
| A- | 90-92 | Very good — a few small gaps |
| B+ | 87-89 | Good — notable room for improvement |
| B | 83-86 | Solid — several areas need work |
| B- | 80-82 | Above average — meaningful gaps |
| C+ | 77-79 | Average — significant improvements needed |
| C | 73-76 | Below average — major gaps |
| C- | 70-72 | Barely adequate |
| D+ | 67-69 | Poor — fundamental issues |
| D | 63-66 | Very poor — needs major rework |
| D- | 60-62 | Near-failing quality |
| F | <60 | Failing — start over |
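The scale above amounts to a lookup by lower bound. The sketch below shows one way to encode it, together with the "below B+" check used when deciding which axes need written improvements. The names `score_to_letter` and `needs_improvements` are illustrative, and mapping fractional scores by lower bound (so 96.5 counts as A) is an assumption, since the table lists whole-number ranges only.

```python
# Lower bound of each band, highest first, taken from the grade scale table.
GRADE_SCALE = [
    (97, "A+"), (93, "A"), (90, "A-"),
    (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"),
    (67, "D+"), (63, "D"), (60, "D-"),
]


def score_to_letter(score: float) -> str:
    """Map a 0-100 score to a letter grade using the band lower bounds."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    for floor, letter in GRADE_SCALE:
        if score >= floor:
            return letter
    return "F"


def needs_improvements(score: float) -> bool:
    """True when an axis scores below B+ (under 87)."""
    return score < 87


print(score_to_letter(88), needs_improvements(88))
print(score_to_letter(72), needs_improvements(72))
```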
Overall Grade Computation
Axes 1 (Description) and 2 (Scope) carry 2x weight. All others carry 1x weight. If Axis 8 (Output Contracts) is marked exempt, remove it from the calculation and divide by 11 instead of 12.
Overall = (2×Axis1 + 2×Axis2 + Axis3 + Axis4 + Axis5 + Axis6 + Axis7 + Axis8 + Axis9 + Axis10) / 12
Convert the numeric average to a letter grade using the scale above.
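As a worked illustration of the weighting, the sketch below computes the overall score and handles the Axis 8 exemption by dropping both its score and its weight from the average. The function name `overall_score` and the convention of passing `None` for an exempt axis are assumptions made for this example.

```python
from typing import Dict, Optional

# Weights from the rubric: Axes 1 and 2 count double, all others once.
WEIGHTS = {1: 2, 2: 2, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1, 10: 1}


def overall_score(axis_scores: Dict[int, Optional[float]]) -> float:
    """Weighted average of the 10 axis scores.

    Assumption: an exempt Axis 8 is passed as None, which removes its
    weight from the divisor (12 becomes 11).
    """
    total = 0.0
    weight_sum = 0
    for axis, weight in WEIGHTS.items():
        score = axis_scores[axis]
        if score is None:
            if axis != 8:
                raise ValueError(f"only Axis 8 may be exempt, got Axis {axis}")
            continue
        total += weight * score
        weight_sum += weight
    return total / weight_sum


scores = {axis: 70.0 for axis in WEIGHTS}
scores[1] = 100.0
scores[2] = 100.0
print(overall_score(scores))   # (200 + 200 + 8 * 70) / 12 = 80.0

scores[8] = None               # pure reference skill, Axis 8 exempt
print(round(overall_score(scores), 2))
```

Because the two double-weighted axes make up 4 of the 12 weight units, a weak description or loose scope pulls the overall grade down faster than a weak score on any other axis.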
Output Format
Produce this exact structure:
Skill Grading Report: [skill-name]
Graded: [date]
Overall Grade: [letter] ([score]/100)
Axis Grades
| # | Axis | Grade | Score | Key Finding |
|---|---|---|---|---|
| 1 | Description Quality | [grade] | [score] | [1-line finding] |
| 2 | Scope Discipline | [grade] | [score] | [1-line finding] |
| 3 | Progressive Disclosure | [grade] | [score] | [1-line finding] |
| 4 | Anti-Pattern Coverage | [grade] | [score] | [1-line finding] |
| 5 | Self-Contained Tools | [grade] | [score] | [1-line finding] |
| 6 | Activation Precision | [grade] | [score] | [1-line finding] |
| 7 | Visual Artifacts | [grade] | [score] | [1-line finding] |
| 8 | Output Contracts | [grade] | [score] | [1-line finding] |
| 9 | Temporal Awareness | [grade] | [score] | [1-line finding] |
| 10 | Documentation Quality | [grade] | [score] | [1-line finding] |
Top 3 Improvements (Highest Impact)
- [Axis]: [Specific action] — [Why this matters, expected grade improvement]
- [Axis]: [Specific action] — [Why this matters, expected grade improvement]
- [Axis]: [Specific action] — [Why this matters, expected grade improvement]
Detailed Notes
[Axis name] ([grade])
[2-3 sentences of specific feedback with examples from the skill]
[Repeat for each axis scoring below B+]
Quick Grading (Abbreviated)
For rapid triage across many skills, produce only:
| Skill | Overall | Desc | Scope | Disc | Anti | Tools | Activ | Visual | Output | Temp | Docs |
|-------|---------|------|-------|------|------|-------|-------|--------|--------|------|------|
| [name] | [grade] | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Anti-Patterns in Grading
Grade Inflation
Wrong: Giving B+ because "it's pretty good" without checking criteria.
Right: Match observations to the rubric table literally. If the description lacks a NOT clause, it cannot score above C on Axis 1.
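A hard rule like the NOT-clause cap can be applied as a clamp after the initial score is assigned. This is a minimal sketch under the assumption that "cannot score above C" means at most 76, the top of the C band in the grade scale. The name `cap_description_score` is hypothetical.

```python
C_MAX = 76  # top of the C band (73-76) in the grade scale


def cap_description_score(raw_score: float, has_not_clause: bool) -> float:
    """Clamp the Axis 1 score to the C band when the NOT clause is missing."""
    if has_not_clause:
        return raw_score
    return min(raw_score, C_MAX)


print(cap_description_score(88, has_not_clause=False))
print(cap_description_score(88, has_not_clause=True))
```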
Missing Context
Wrong: Grading a pure decision-tree skill poorly on Axis 5 (tools) because it has no scripts.
Right: Mark Axis 5 as "A — tools not applicable for this skill type."
Ignoring Phantoms
Wrong: Scoring Axis 5 as B because scripts are "referenced."
Right: Actually check if every referenced file exists. If `scripts/validate.py` is mentioned but doesn't exist, that's D.
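The existence check can be automated rather than done by eye. The sketch below scans SKILL.md for relative paths and reports the ones missing from the skill folder. The function name `find_phantoms` and the assumption that referenced files live under `scripts/` or `references/` are illustrative, so adjust the pattern to the folder layout of the skill being graded.

```python
import re
from pathlib import Path
from typing import List

# Assumption: referenced files appear as relative paths under these folders.
PATH_PATTERN = re.compile(r"\b(?:scripts|references)/[\w./-]+\.\w+")


def find_phantoms(skill_dir: Path) -> List[str]:
    """Return paths mentioned in SKILL.md that do not exist on disk."""
    text = (skill_dir / "SKILL.md").read_text(encoding="utf-8")
    referenced = sorted(set(PATH_PATTERN.findall(text)))
    return [p for p in referenced if not (skill_dir / p).is_file()]


if __name__ == "__main__":
    # Hypothetical usage: grade the skill in the current directory.
    here = Path(".")
    if (here / "SKILL.md").is_file():
        for missing in find_phantoms(here):
            print(f"phantom: {missing}")
```

Any non-empty result supports a D on Axis 5. An empty result only shows the files exist, not that the scripts work.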