
Agent Evaluation

Agent评估

Overview

概述

LLM-as-judge evaluation framework that scores AI-generated content on 5 dimensions using a 1-5 rubric. Agents evaluate outputs, compute a weighted composite score, and emit a structured verdict with evidence citations.
Core principle: Systematic quality verification before claiming completion. Agent-studio currently has no way to verify agent output quality — this skill fills that gap.

When to Use


Always:
  • Before marking a task complete (pair with verification-before-completion)
  • After a plan is generated (evaluate plan quality)
  • After code review outputs (evaluate review quality)
  • During reflection cycles (evaluate agent responses)
  • When comparing multiple agent outputs

Don't Use:
  • For binary pass/fail checks (use verification-before-completion instead)
  • For security audits (use the security-architect skill)
  • For syntax/lint checking (use pnpm lint:fix)

The 5-Dimension Rubric


Every evaluation scores all 5 dimensions on a 1-5 scale:

| Dimension | Weight | What It Measures |
| --- | --- | --- |
| Accuracy | 30% | Factual correctness; no hallucinations; claims are verifiable |
| Groundedness | 25% | Claims are supported by citations, file references, or evidence from the codebase |
| Coherence | 15% | Logical flow; internally consistent; no contradictions |
| Completeness | 20% | All required aspects addressed; no critical gaps |
| Helpfulness | 10% | Actionable; provides concrete next steps; reduces ambiguity |
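The rubric can be captured as a plain weight map with a sanity check that the weights sum to 1.0. This is an illustrative sketch, not part of any shipped API; the constant name is an assumption:

```javascript
// Rubric dimensions and their weights (must sum to 1.0).
const RUBRIC_WEIGHTS = {
  accuracy: 0.30,
  groundedness: 0.25,
  completeness: 0.20,
  coherence: 0.15,
  helpfulness: 0.10,
};

// Sanity check: a weight map that does not sum to 1.0 would skew every composite.
const total = Object.values(RUBRIC_WEIGHTS).reduce((a, b) => a + b, 0);
if (Math.abs(total - 1.0) > 1e-9) {
  throw new Error(`Rubric weights sum to ${total}, expected 1.0`);
}
```

Keeping the weights in one place makes it harder for an evaluation to silently drift from the published rubric.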

Scoring Scale (1-5)


| Score | Meaning |
| --- | --- |
| 5 | Excellent — fully meets the dimension's criteria with no gaps |
| 4 | Good — meets criteria with minor gaps |
| 3 | Adequate — partially meets criteria; some gaps present |
| 2 | Poor — significant gaps or errors in this dimension |
| 1 | Failing — does not meet the dimension's criteria |

Execution Process


Step 1: Load the Output to Evaluate


Identify what is being evaluated:
- Agent response (text)
- Plan document (file path)
- Code review output (text/file)
- Skill invocation result (text)
- Task completion claim (TaskGet metadata)

Step 2: Score Each Dimension


For each of the 5 dimensions, provide:
  1. Score (1-5): The numeric score
  2. Evidence: Direct quote or file reference from the evaluated output
  3. Rationale: Why this score was given (1-2 sentences)
Dimension 1: Accuracy
Checklist:
- [ ] Claims are factually correct (verify against codebase if possible)
- [ ] No hallucinated file paths, function names, or API calls
- [ ] Numbers and counts are accurate
- [ ] No contradictions with existing documentation
Dimension 2: Groundedness
Checklist:
- [ ] Claims cite specific files, line numbers, or task IDs
- [ ] Recommendations reference observable evidence
- [ ] No unsupported assertions ("this is probably X")
- [ ] Code examples use actual project patterns
Dimension 3: Coherence
Checklist:
- [ ] Logical flow from problem → analysis → recommendation
- [ ] No internal contradictions
- [ ] Terminology is consistent throughout
- [ ] Steps are in a rational order
Dimension 4: Completeness
Checklist:
- [ ] All required aspects of the task are addressed
- [ ] Edge cases are mentioned (if relevant)
- [ ] No critical gaps that would block action
- [ ] Follow-up steps are included
Dimension 5: Helpfulness
Checklist:
- [ ] Provides actionable next steps (not just observations)
- [ ] Concrete enough to act on without further clarification
- [ ] Reduces ambiguity rather than adding it
- [ ] Appropriate for the intended audience
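A per-dimension result can be modeled as a small record that refuses to exist without all three required fields. The function name, the example file path, and the sample evidence below are hypothetical, shown only to illustrate the score/evidence/rationale contract:

```javascript
// One record per dimension: score, evidence, and rationale are all mandatory.
function makeDimensionScore(dimension, score, evidence, rationale) {
  if (!Number.isInteger(score) || score < 1 || score > 5) {
    throw new Error(`Score for ${dimension} must be an integer 1-5, got ${score}`);
  }
  if (!evidence || !rationale) {
    throw new Error(`${dimension}: evidence and rationale are mandatory, not optional`);
  }
  return { dimension, score, evidence, rationale };
}

// Example (hypothetical file reference): a Groundedness score backed by a citation.
const groundedness = makeDimensionScore(
  'groundedness',
  4,
  'src/auth/session.ts:42 (cited directly in the review)',
  'Most claims cite file:line references; one recommendation lacks a citation.'
);
```

Validating at construction time enforces Iron Law 3 mechanically: a score with no evidence never makes it into the verdict.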

Step 3: Compute Weighted Composite Score


composite = (accuracy × 0.30) + (groundedness × 0.25) + (completeness × 0.20) + (coherence × 0.15) + (helpfulness × 0.10)
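The formula above translates directly into code (the function name is illustrative):

```javascript
// Weighted composite per the rubric; expects integer scores 1-5 for each dimension.
function computeComposite(scores) {
  return (
    scores.accuracy * 0.30 +
    scores.groundedness * 0.25 +
    scores.completeness * 0.20 +
    scores.coherence * 0.15 +
    scores.helpfulness * 0.10
  );
}

// Example: strong accuracy and groundedness outweigh a weak helpfulness score.
const composite = computeComposite({
  accuracy: 5, groundedness: 4, completeness: 4, coherence: 3, helpfulness: 2,
});
// 5×0.30 + 4×0.25 + 4×0.20 + 3×0.15 + 2×0.10 = 1.5 + 1.0 + 0.8 + 0.45 + 0.2 = 3.95
```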

Step 4: Determine Verdict


| Composite Score | Verdict | Action |
| --- | --- | --- |
| 4.5 – 5.0 | EXCELLENT | Approve; proceed |
| 3.5 – 4.4 | GOOD | Approve with minor notes |
| 2.5 – 3.4 | ADEQUATE | Request targeted improvements |
| 1.5 – 2.4 | POOR | Reject; requires significant rework |
| 1.0 – 1.4 | FAILING | Reject; restart task |
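The tier boundaries reduce to a simple threshold check. A sketch, with the tier names taken from the table above and the function name assumed:

```javascript
// Map a weighted composite (1.0-5.0) to its verdict tier.
function verdictFor(composite) {
  if (composite >= 4.5) return 'EXCELLENT';
  if (composite >= 3.5) return 'GOOD';
  if (composite >= 2.5) return 'ADEQUATE';
  if (composite >= 1.5) return 'POOR';
  return 'FAILING';
}
```

Note the boundaries are inclusive at the bottom of each band, so a composite of exactly 3.5 is GOOD, not ADEQUATE.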

Step 5: Emit Structured Verdict

步骤5:生成结构化评估结论

Output the verdict in this format:

```markdown
## Evaluation Verdict

**Output Evaluated:** [Brief description of what was evaluated]
**Evaluator:** [Agent name / task ID]
**Date:** [ISO 8601 date]

### Dimension Scores

| Dimension | Score | Weight | Weighted Score |
| --- | --- | --- | --- |
| Accuracy | X/5 | 30% | X.X |
| Groundedness | X/5 | 25% | X.X |
| Completeness | X/5 | 20% | X.X |
| Coherence | X/5 | 15% | X.X |
| Helpfulness | X/5 | 10% | X.X |
| **Composite** | | | **X.X / 5.0** |

### Evidence Citations

**Accuracy (X/5):**
[Direct quote or file:line reference]
Rationale: [Why this score]

**Groundedness (X/5):**
[Direct quote or file:line reference]
Rationale: [Why this score]

**Completeness (X/5):**
[Direct quote or file:line reference]
Rationale: [Why this score]

**Coherence (X/5):**
[Direct quote or file:line reference]
Rationale: [Why this score]

**Helpfulness (X/5):**
[Direct quote or file:line reference]
Rationale: [Why this score]

### Verdict: [EXCELLENT | GOOD | ADEQUATE | POOR | FAILING]

**Summary:** [1-2 sentence overall assessment]

**Required Actions** (if verdict is ADEQUATE or worse):
1. [Specific improvement needed]
2. [Specific improvement needed]
```

Usage Examples


Evaluate a Plan Document


```javascript
// Load plan document
Read({ file_path: '.claude/context/plans/auth-design-plan-2026-02-21.md' });

// Evaluate against the 5-dimension rubric
Skill({ skill: 'agent-evaluation' });
// Provide the plan content as the output to evaluate
```

Evaluate Agent Response Before Completion


```javascript
// Agent generates implementation summary
// Before marking the task complete, evaluate the summary quality
Skill({ skill: 'agent-evaluation' });
// If composite < 3.5, request improvements before TaskUpdate(completed)
```

Evaluate Code Review Output


```javascript
// After code-reviewer runs, evaluate the review quality
Skill({ skill: 'agent-evaluation' });
// Ensures the review is grounded in actual code evidence, not assertions
```

Batch Evaluation (comparing two outputs)


```javascript
// Evaluate output A
// Save verdict A
// Evaluate output B
// Save verdict B
// Compare composites → choose the higher-scoring output
```
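The comparison step can be sketched concretely, assuming each evaluation has already produced a composite score. The verdict objects and function name here are illustrative:

```javascript
// Pick the higher-scoring of two evaluated outputs; ties favor the first argument.
function pickBetter(verdictA, verdictB) {
  return verdictB.composite > verdictA.composite ? verdictB : verdictA;
}

// Hypothetical verdicts from two evaluation passes.
const a = { label: 'output A', composite: 3.6 };
const b = { label: 'output B', composite: 4.1 };
const winner = pickBetter(a, b);
// winner.label === 'output B'
```

When composites tie, fall back to the per-dimension breakdown (e.g., prefer the output with the higher accuracy score) rather than choosing arbitrarily.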

Integration with Verification-Before-Completion


The recommended quality gate pattern:
```javascript
// Step 1: Do the work
// Step 2: Evaluate with agent-evaluation
Skill({ skill: 'agent-evaluation' });
// If verdict is POOR or FAILING → rework before proceeding
// If verdict is ADEQUATE or better → proceed to verification
// Step 3: Final gate
Skill({ skill: 'verification-before-completion' });
// Step 4: Mark complete
TaskUpdate({ taskId: 'X', status: 'completed' });
```

Iron Laws


  1. NO COMPLETION CLAIM WITHOUT EVALUATION EVIDENCE — if the composite score is below 2.5 (POOR or FAILING), rework the output before marking any task complete.
  2. ALWAYS score all 5 dimensions — never skip dimensions to save time; each dimension catches different failure modes (accuracy ≠ completeness ≠ groundedness).
  3. ALWAYS cite specific evidence for every dimension score — "Evidence: [file:line or direct quote]" is mandatory, not optional. Assertions without grounding are invalid.
  4. ALWAYS use the weighted composite: accuracy×0.30 + groundedness×0.25 + completeness×0.20 + coherence×0.15 + helpfulness×0.10. Never use a simple average.
  5. NEVER evaluate before the work is complete — evaluating incomplete outputs produces falsely low scores and wastes context budget.

Anti-Patterns


| Anti-Pattern | Why It Fails | Correct Approach |
| --- | --- | --- |
| Skipping dimensions to save time | Each dimension catches different failures | Always score all 5 dimensions |
| No evidence citation per dimension | Assertions without grounding are invalid | Quote specific text or file:line for every score |
| Using simple average for composite | Accuracy (30%) matters more than helpfulness (10%) | Use the weighted composite formula |
| Only checking EXCELLENT vs FAILING | ADEQUATE outputs need targeted improvements, not full rework | Use all 5 verdict tiers with the appropriate action per tier |
| Evaluating before work is done | Incomplete outputs score falsely low | Evaluate completed outputs only |
| Treating evaluation as a binary gate | Quality is a spectrum; binary pass/fail loses nuance | Use the composite score and per-dimension breakdown together |
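To see why the simple-average anti-pattern matters, the same scores can land in different verdict tiers depending on the aggregate used (an illustrative calculation):

```javascript
// Hypothetical scores: weak on the heavily weighted dimensions, strong on the rest.
const scores = { accuracy: 2, groundedness: 2, completeness: 4, coherence: 5, helpfulness: 5 };

// Simple average ignores that accuracy carries 3x the weight of helpfulness.
const values = Object.values(scores);
const simpleAverage = values.reduce((a, b) => a + b, 0) / values.length;
// (2 + 2 + 4 + 5 + 5) / 5 = 3.6 → GOOD tier

// Weighted composite correctly penalizes the weak accuracy and groundedness scores.
const weighted =
  scores.accuracy * 0.30 + scores.groundedness * 0.25 +
  scores.completeness * 0.20 + scores.coherence * 0.15 +
  scores.helpfulness * 0.10;
// 0.60 + 0.50 + 0.80 + 0.75 + 0.50 = 3.15 → ADEQUATE tier
```

A simple average would approve this output with minor notes; the weighted composite correctly sends it back for targeted improvements.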

Assigned Agents


This skill is used by:
  • qa — Primary: validates test outputs and QA reports before completion
  • code-reviewer — Supporting: evaluates code review quality
  • reflection-agent — Supporting: evaluates agent responses during reflection cycles

Memory Protocol (MANDATORY)


Before starting:

```bash
cat .claude/context/memory/learnings.md
```

Check for:
  • Previous evaluation scores for similar outputs
  • Known quality patterns in this codebase
  • Common failure modes for this task type

After completing:
  • Evaluation pattern found -> .claude/context/memory/learnings.md
  • Quality issue identified -> .claude/context/memory/issues.md
  • Decision about rubric weights -> .claude/context/memory/decisions.md

ASSUME INTERRUPTION: Your context may reset. If it's not in memory, it didn't happen.