
Agent Evaluation

Agent评估

Overview

概述

LLM-as-judge evaluation framework that scores AI-generated content on 5 dimensions using a 1-5 rubric. Agents evaluate outputs, compute a weighted composite score, and emit a structured verdict with evidence citations.
Core principle: Systematic quality verification before claiming completion. Agent-studio currently has no way to verify agent output quality — this skill fills that gap.

When to Use


Always:
  • Before marking a task complete (pair with verification-before-completion)
  • After a plan is generated (evaluate plan quality)
  • After code review outputs (evaluate review quality)
  • During reflection cycles (evaluate agent responses)
  • When comparing multiple agent outputs

Don't Use:
  • For binary pass/fail checks (use verification-before-completion instead)
  • For security audits (use the security-architect skill)
  • For syntax/lint checking (use pnpm lint:fix)

The 5-Dimension Rubric


Every evaluation scores all 5 dimensions on a 1-5 scale:

| Dimension | Weight | What It Measures |
| --- | --- | --- |
| Accuracy | 30% | Factual correctness; no hallucinations; claims are verifiable |
| Groundedness | 25% | Claims are supported by citations, file references, or evidence from the codebase |
| Coherence | 15% | Logical flow; internally consistent; no contradictions |
| Completeness | 20% | All required aspects addressed; no critical gaps |
| Helpfulness | 10% | Actionable; provides concrete next steps; reduces ambiguity |
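The rubric can be captured as a plain weight map with a sanity check that the weights sum to 1.0. This is an illustrative sketch, not part of any shipped API; the constant name is an assumption:

```javascript
// Rubric dimensions and their weights (must sum to 1.0).
const RUBRIC_WEIGHTS = {
  accuracy: 0.30,
  groundedness: 0.25,
  completeness: 0.20,
  coherence: 0.15,
  helpfulness: 0.10,
};

// Sanity check: a weight map that does not sum to 1.0 would skew every composite.
const total = Object.values(RUBRIC_WEIGHTS).reduce((a, b) => a + b, 0);
if (Math.abs(total - 1.0) > 1e-9) {
  throw new Error(`Rubric weights sum to ${total}, expected 1.0`);
}
```

Keeping the weights in one place makes it harder for an evaluation to silently drift from the published rubric.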

Scoring Scale (1-5)


| Score | Meaning |
| --- | --- |
| 5 | Excellent — fully meets the dimension's criteria with no gaps |
| 4 | Good — meets criteria with minor gaps |
| 3 | Adequate — partially meets criteria; some gaps present |
| 2 | Poor — significant gaps or errors in this dimension |
| 1 | Failing — does not meet the dimension's criteria |

Execution Process


Step 1: Load the Output to Evaluate


Identify what is being evaluated:
- Agent response (text)
- Plan document (file path)
- Code review output (text/file)
- Skill invocation result (text)
- Task completion claim (TaskGet metadata)

Step 2: Score Each Dimension


For each of the 5 dimensions, provide:
  1. Score (1-5): The numeric score
  2. Evidence: Direct quote or file reference from the evaluated output
  3. Rationale: Why this score was given (1-2 sentences)
Dimension 1: Accuracy
Checklist:
- [ ] Claims are factually correct (verify against codebase if possible)
- [ ] No hallucinated file paths, function names, or API calls
- [ ] Numbers and counts are accurate
- [ ] No contradictions with existing documentation
Dimension 2: Groundedness
Checklist:
- [ ] Claims cite specific files, line numbers, or task IDs
- [ ] Recommendations reference observable evidence
- [ ] No unsupported assertions ("this is probably X")
- [ ] Code examples use actual project patterns
Dimension 3: Coherence
Checklist:
- [ ] Logical flow from problem → analysis → recommendation
- [ ] No internal contradictions
- [ ] Terminology is consistent throughout
- [ ] Steps are in a rational order
Dimension 4: Completeness
Checklist:
- [ ] All required aspects of the task are addressed
- [ ] Edge cases are mentioned (if relevant)
- [ ] No critical gaps that would block action
- [ ] Follow-up steps are included
Dimension 5: Helpfulness
Checklist:
- [ ] Provides actionable next steps (not just observations)
- [ ] Concrete enough to act on without further clarification
- [ ] Reduces ambiguity rather than adding it
- [ ] Appropriate for the intended audience
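A per-dimension result can be modeled as a small record that refuses to exist without all three required fields. The function name, the example file path, and the sample evidence below are hypothetical, shown only to illustrate the score/evidence/rationale contract:

```javascript
// One record per dimension: score, evidence, and rationale are all mandatory.
function makeDimensionScore(dimension, score, evidence, rationale) {
  if (!Number.isInteger(score) || score < 1 || score > 5) {
    throw new Error(`Score for ${dimension} must be an integer 1-5, got ${score}`);
  }
  if (!evidence || !rationale) {
    throw new Error(`${dimension}: evidence and rationale are mandatory, not optional`);
  }
  return { dimension, score, evidence, rationale };
}

// Example (hypothetical file reference): a Groundedness score backed by a citation.
const groundedness = makeDimensionScore(
  'groundedness',
  4,
  'src/auth/session.ts:42 (cited directly in the review)',
  'Most claims cite file:line references; one recommendation lacks a citation.'
);
```

Validating at construction time enforces Iron Law 3 mechanically: a score with no evidence never makes it into the verdict.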

Step 3: Compute Weighted Composite Score


composite = (accuracy × 0.30) + (groundedness × 0.25) + (completeness × 0.20) + (coherence × 0.15) + (helpfulness × 0.10)
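The formula above translates directly into code (the function name is illustrative):

```javascript
// Weighted composite per the rubric; expects integer scores 1-5 for each dimension.
function computeComposite(scores) {
  return (
    scores.accuracy * 0.30 +
    scores.groundedness * 0.25 +
    scores.completeness * 0.20 +
    scores.coherence * 0.15 +
    scores.helpfulness * 0.10
  );
}

// Example: strong accuracy and groundedness outweigh a weak helpfulness score.
const composite = computeComposite({
  accuracy: 5, groundedness: 4, completeness: 4, coherence: 3, helpfulness: 2,
});
// 5×0.30 + 4×0.25 + 4×0.20 + 3×0.15 + 2×0.10 = 1.5 + 1.0 + 0.8 + 0.45 + 0.2 = 3.95
```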

Step 4: Determine Verdict


| Composite Score | Verdict | Action |
| --- | --- | --- |
| 4.5 – 5.0 | EXCELLENT | Approve; proceed |
| 3.5 – 4.4 | GOOD | Approve with minor notes |
| 2.5 – 3.4 | ADEQUATE | Request targeted improvements |
| 1.5 – 2.4 | POOR | Reject; requires significant rework |
| 1.0 – 1.4 | FAILING | Reject; restart task |
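The tier boundaries reduce to a simple threshold check. A sketch, with the tier names taken from the table above and the function name assumed:

```javascript
// Map a weighted composite (1.0-5.0) to its verdict tier.
function verdictFor(composite) {
  if (composite >= 4.5) return 'EXCELLENT';
  if (composite >= 3.5) return 'GOOD';
  if (composite >= 2.5) return 'ADEQUATE';
  if (composite >= 1.5) return 'POOR';
  return 'FAILING';
}
```

Note the boundaries are inclusive at the bottom of each band, so a composite of exactly 3.5 is GOOD, not ADEQUATE.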

Step 5: Emit Structured Verdict

步骤5:生成结构化评估结论

Output the verdict in this format:

```markdown
## Evaluation Verdict

**Output Evaluated:** [Brief description of what was evaluated]
**Evaluator:** [Agent name / task ID]
**Date:** [ISO 8601 date]

### Dimension Scores

| Dimension | Score | Weight | Weighted Score |
| --- | --- | --- | --- |
| Accuracy | X/5 | 30% | X.X |
| Groundedness | X/5 | 25% | X.X |
| Completeness | X/5 | 20% | X.X |
| Coherence | X/5 | 15% | X.X |
| Helpfulness | X/5 | 10% | X.X |
| **Composite** | | | **X.X / 5.0** |

### Evidence Citations

**Accuracy (X/5):**
[Direct quote or file:line reference]
Rationale: [Why this score]

**Groundedness (X/5):**
[Direct quote or file:line reference]
Rationale: [Why this score]

**Completeness (X/5):**
[Direct quote or file:line reference]
Rationale: [Why this score]

**Coherence (X/5):**
[Direct quote or file:line reference]
Rationale: [Why this score]

**Helpfulness (X/5):**
[Direct quote or file:line reference]
Rationale: [Why this score]

### Verdict: [EXCELLENT | GOOD | ADEQUATE | POOR | FAILING]

**Summary:** [1-2 sentence overall assessment]

**Required Actions** (if verdict is ADEQUATE or worse):
1. [Specific improvement needed]
2. [Specific improvement needed]
```

Usage Examples


Evaluate a Plan Document


```javascript
// Load plan document
Read({ file_path: '.claude/context/plans/auth-design-plan-2026-02-21.md' });

// Evaluate against the 5-dimension rubric
Skill({ skill: 'agent-evaluation' });
// Provide the plan content as the output to evaluate
```

Evaluate Agent Response Before Completion


```javascript
// Agent generates implementation summary
// Before marking the task complete, evaluate the summary quality
Skill({ skill: 'agent-evaluation' });
// If composite < 3.5, request improvements before TaskUpdate(completed)
```

Evaluate Code Review Output


```javascript
// After code-reviewer runs, evaluate the review quality
Skill({ skill: 'agent-evaluation' });
// Ensures the review is grounded in actual code evidence, not assertions
```

Batch Evaluation (comparing two outputs)


```javascript
// Evaluate output A
// Save verdict A
// Evaluate output B
// Save verdict B
// Compare composites → choose the higher-scoring output
```
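The comparison step can be sketched concretely, assuming each evaluation has already produced a composite score. The verdict objects and function name here are illustrative:

```javascript
// Pick the higher-scoring of two evaluated outputs; ties favor the first argument.
function pickBetter(verdictA, verdictB) {
  return verdictB.composite > verdictA.composite ? verdictB : verdictA;
}

// Hypothetical verdicts from two evaluation passes.
const a = { label: 'output A', composite: 3.6 };
const b = { label: 'output B', composite: 4.1 };
const winner = pickBetter(a, b);
// winner.label === 'output B'
```

When composites tie, fall back to the per-dimension breakdown (e.g., prefer the output with the higher accuracy score) rather than choosing arbitrarily.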

Integration with Verification-Before-Completion


The recommended quality gate pattern:
```javascript
// Step 1: Do the work
// Step 2: Evaluate with agent-evaluation
Skill({ skill: 'agent-evaluation' });
// If verdict is POOR or FAILING → rework before proceeding
// If verdict is ADEQUATE or better → proceed to verification
// Step 3: Final gate
Skill({ skill: 'verification-before-completion' });
// Step 4: Mark complete
TaskUpdate({ taskId: 'X', status: 'completed' });
```

Iron Laws


  1. NO COMPLETION CLAIM WITHOUT EVALUATION EVIDENCE — if the composite score is below 2.5 (POOR or FAILING), rework the output before marking any task complete.
  2. ALWAYS score all 5 dimensions — never skip dimensions to save time; each dimension catches different failure modes (accuracy ≠ completeness ≠ groundedness).
  3. ALWAYS cite specific evidence for every dimension score — "Evidence: [file:line or direct quote]" is mandatory, not optional. Assertions without grounding are invalid.
  4. ALWAYS use the weighted composite: accuracy×0.30 + groundedness×0.25 + completeness×0.20 + coherence×0.15 + helpfulness×0.10. Never use a simple average.
  5. NEVER evaluate before the work is complete — evaluating incomplete outputs produces falsely low scores and wastes context budget.

Anti-Patterns


| Anti-Pattern | Why It Fails | Correct Approach |
| --- | --- | --- |
| Skipping dimensions to save time | Each dimension catches different failures | Always score all 5 dimensions |
| No evidence citation per dimension | Assertions without grounding are invalid | Quote specific text or file:line for every score |
| Using simple average for composite | Accuracy (30%) matters more than helpfulness (10%) | Use the weighted composite formula |
| Only checking EXCELLENT vs FAILING | ADEQUATE outputs need targeted improvements, not full rework | Use all 5 verdict tiers with the appropriate action per tier |
| Evaluating before work is done | Incomplete outputs score falsely low | Evaluate completed outputs only |
| Treating evaluation as a binary gate | Quality is a spectrum; binary pass/fail loses nuance | Use the composite score and per-dimension breakdown together |
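To see why the simple-average anti-pattern matters, the same scores can land in different verdict tiers depending on the aggregate used (an illustrative calculation):

```javascript
// Hypothetical scores: weak on the heavily weighted dimensions, strong on the rest.
const scores = { accuracy: 2, groundedness: 2, completeness: 4, coherence: 5, helpfulness: 5 };

// Simple average ignores that accuracy carries 3x the weight of helpfulness.
const values = Object.values(scores);
const simpleAverage = values.reduce((a, b) => a + b, 0) / values.length;
// (2 + 2 + 4 + 5 + 5) / 5 = 3.6 → GOOD tier

// Weighted composite correctly penalizes the weak accuracy and groundedness scores.
const weighted =
  scores.accuracy * 0.30 + scores.groundedness * 0.25 +
  scores.completeness * 0.20 + scores.coherence * 0.15 +
  scores.helpfulness * 0.10;
// 0.60 + 0.50 + 0.80 + 0.75 + 0.50 = 3.15 → ADEQUATE tier
```

A simple average would approve this output with minor notes; the weighted composite correctly sends it back for targeted improvements.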

Assigned Agents


This skill is used by:
  • qa — Primary: validates test outputs and QA reports before completion
  • code-reviewer — Supporting: evaluates code review quality
  • reflection-agent — Supporting: evaluates agent responses during reflection cycles

Memory Protocol (MANDATORY)


Before starting:

```bash
cat .claude/context/memory/learnings.md
```

Check for:
  • Previous evaluation scores for similar outputs
  • Known quality patterns in this codebase
  • Common failure modes for this task type

After completing:
  • Evaluation pattern found -> .claude/context/memory/learnings.md
  • Quality issue identified -> .claude/context/memory/issues.md
  • Decision about rubric weights -> .claude/context/memory/decisions.md

ASSUME INTERRUPTION: Your context may reset. If it's not in memory, it didn't happen.