# Agent Evaluation
## Overview
LLM-as-judge evaluation framework that scores AI-generated content on 5 dimensions using a 1-5 rubric. Agents evaluate outputs, compute a weighted composite score, and emit a structured verdict with evidence citations.
Core principle: Systematic quality verification before claiming completion. Agent-studio currently has no way to verify agent output quality — this skill fills that gap.
## When to Use
**Always:**

- Before marking a task complete (pair with `verification-before-completion`)
- After a plan is generated (evaluate plan quality)
- After code review outputs (evaluate review quality)
- During reflection cycles (evaluate agent responses)
- When comparing multiple agent outputs

**Don't Use:**

- For binary pass/fail checks (use `verification-before-completion` instead)
- For security audits (use the `security-architect` skill)
- For syntax/lint checking (use `pnpm lint:fix`)
## The 5-Dimension Rubric
Every evaluation scores all 5 dimensions on a 1-5 scale:
| Dimension | Weight | What It Measures |
|---|---|---|
| Accuracy | 30% | Factual correctness; no hallucinations; claims are verifiable |
| Groundedness | 25% | Claims are supported by citations, file references, or evidence from the codebase |
| Coherence | 15% | Logical flow; internally consistent; no contradictions |
| Completeness | 20% | All required aspects addressed; no critical gaps |
| Helpfulness | 10% | Actionable; provides concrete next steps; reduces ambiguity |
### Scoring Scale (1-5)
| Score | Meaning |
|---|---|
| 5 | Excellent — fully meets the dimension's criteria with no gaps |
| 4 | Good — meets criteria with minor gaps |
| 3 | Adequate — partially meets criteria; some gaps present |
| 2 | Poor — significant gaps or errors in this dimension |
| 1 | Failing — does not meet the dimension's criteria |
## Execution Process
### Step 1: Load the Output to Evaluate
Identify what is being evaluated:
- Agent response (text)
- Plan document (file path)
- Code review output (text/file)
- Skill invocation result (text)
- Task completion claim (TaskGet metadata)

### Step 2: Score Each Dimension
For each of the 5 dimensions, provide:
- Score (1-5): The numeric score
- Evidence: Direct quote or file reference from the evaluated output
- Rationale: Why this score was given (1-2 sentences)
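Each dimension's result can be captured as a small record. The following sketch is illustrative (the record shape and function name are assumptions, not a defined API of this skill); it enforces the score range and the mandatory evidence/rationale fields:

```javascript
// One entry per dimension: a 1-5 integer score plus its supporting evidence.
// Illustrative helper; not part of the agent-evaluation skill's API.
function makeDimensionScore(dimension, score, evidence, rationale) {
  if (!Number.isInteger(score) || score < 1 || score > 5) {
    throw new RangeError(`score for ${dimension} must be an integer 1-5`);
  }
  if (!evidence || !rationale) {
    throw new Error(`${dimension}: evidence and rationale are mandatory`);
  }
  return { dimension, score, evidence, rationale };
}

// Example usage (the file path and wording are hypothetical):
const accuracy = makeDimensionScore(
  'accuracy',
  4,
  'src/auth/session.ts:42 (claim matches the actual implementation)',
  'All cited paths exist; one minor count was off by one.',
);
```

Keeping evidence and rationale as required fields makes the "no assertions without grounding" rule structurally unavoidable rather than a convention.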
#### Dimension 1: Accuracy

Checklist:
- [ ] Claims are factually correct (verify against codebase if possible)
- [ ] No hallucinated file paths, function names, or API calls
- [ ] Numbers and counts are accurate
- [ ] No contradictions with existing documentation

#### Dimension 2: Groundedness

Checklist:
- [ ] Claims cite specific files, line numbers, or task IDs
- [ ] Recommendations reference observable evidence
- [ ] No unsupported assertions ("this is probably X")
- [ ] Code examples use actual project patterns

#### Dimension 3: Coherence

Checklist:
- [ ] Logical flow from problem → analysis → recommendation
- [ ] No internal contradictions
- [ ] Terminology is consistent throughout
- [ ] Steps are in a rational order

#### Dimension 4: Completeness

Checklist:
- [ ] All required aspects of the task are addressed
- [ ] Edge cases are mentioned (if relevant)
- [ ] No critical gaps that would block action
- [ ] Follow-up steps are included

#### Dimension 5: Helpfulness

Checklist:
- [ ] Provides actionable next steps (not just observations)
- [ ] Concrete enough to act on without further clarification
- [ ] Reduces ambiguity rather than adding it
- [ ] Appropriate for the intended audience

### Step 3: Compute Weighted Composite Score
```
composite = (accuracy × 0.30) + (groundedness × 0.25) + (completeness × 0.20) + (coherence × 0.15) + (helpfulness × 0.10)
```

### Step 4: Determine Verdict
| Composite Score | Verdict | Action |
|---|---|---|
| 4.5 – 5.0 | EXCELLENT | Approve; proceed |
| 3.5 – 4.4 | GOOD | Approve with minor notes |
| 2.5 – 3.4 | ADEQUATE | Request targeted improvements |
| 1.5 – 2.4 | POOR | Reject; requires significant rework |
| 1.0 – 1.4 | FAILING | Reject; restart task |
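As a concrete illustration of Steps 3 and 4, the composite formula and verdict tiers can be sketched as a small helper (a sketch only; the function names are illustrative, not part of the skill's API):

```javascript
// Weights from the 5-dimension rubric (must sum to 1.0).
const WEIGHTS = {
  accuracy: 0.30,
  groundedness: 0.25,
  completeness: 0.20,
  coherence: 0.15,
  helpfulness: 0.10,
};

// Compute the weighted composite from per-dimension 1-5 scores.
function compositeScore(scores) {
  return Object.entries(WEIGHTS).reduce(
    (sum, [dim, weight]) => sum + scores[dim] * weight,
    0,
  );
}

// Map a composite score onto the five verdict tiers.
function verdictFor(composite) {
  if (composite >= 4.5) return 'EXCELLENT';
  if (composite >= 3.5) return 'GOOD';
  if (composite >= 2.5) return 'ADEQUATE';
  if (composite >= 1.5) return 'POOR';
  return 'FAILING';
}

// Example: 4×0.30 + 5×0.25 + 3×0.20 + 4×0.15 + 4×0.10 = 4.05
const scores = { accuracy: 4, groundedness: 5, completeness: 3, coherence: 4, helpfulness: 4 };
const composite = compositeScore(scores);
console.log(`${composite.toFixed(2)} ${verdictFor(composite)}`); // "4.05 GOOD"
```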
### Step 5: Emit Structured Verdict
Output the verdict in this format:
```markdown
## Evaluation Verdict

**Output Evaluated:** [Brief description of what was evaluated]
**Evaluator:** [Agent name / task ID]
**Date:** [ISO 8601 date]

### Dimension Scores

| Dimension | Score | Weight | Weighted Score |
|---|---|---|---|
| Accuracy | X/5 | 30% | X.X |
| Groundedness | X/5 | 25% | X.X |
| Completeness | X/5 | 20% | X.X |
| Coherence | X/5 | 15% | X.X |
| Helpfulness | X/5 | 10% | X.X |
| **Composite** | **X.X / 5.0** | | |

### Evidence Citations

**Accuracy (X/5):**
[Direct quote or file:line reference]
Rationale: [Why this score]

**Groundedness (X/5):**
[Direct quote or file:line reference]
Rationale: [Why this score]

**Completeness (X/5):**
[Direct quote or file:line reference]
Rationale: [Why this score]

**Coherence (X/5):**
[Direct quote or file:line reference]
Rationale: [Why this score]

**Helpfulness (X/5):**
[Direct quote or file:line reference]
Rationale: [Why this score]

**Verdict:** [EXCELLENT | GOOD | ADEQUATE | POOR | FAILING]

**Summary:** [1-2 sentence overall assessment]

**Required Actions** (if verdict is ADEQUATE or worse):
- [Specific improvement needed]
- [Specific improvement needed]
```
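The "Dimension Scores" table in the verdict can also be generated mechanically. This is a minimal sketch (function and constant names are illustrative) that renders the weighted rows in the format above:

```javascript
// Rubric dimensions in display order, with their weights.
const WEIGHTS = [
  ['Accuracy', 0.30],
  ['Groundedness', 0.25],
  ['Completeness', 0.20],
  ['Coherence', 0.15],
  ['Helpfulness', 0.10],
];

// Render the "Dimension Scores" markdown table from per-dimension 1-5 scores.
function renderScoreTable(scores) {
  const rows = WEIGHTS.map(([name, w]) => {
    const s = scores[name.toLowerCase()];
    return `| ${name} | ${s}/5 | ${Math.round(w * 100)}% | ${(s * w).toFixed(2)} |`;
  });
  const composite = WEIGHTS.reduce(
    (sum, [name, w]) => sum + scores[name.toLowerCase()] * w,
    0,
  );
  return [
    '| Dimension | Score | Weight | Weighted Score |',
    '|---|---|---|---|',
    ...rows,
    `| **Composite** | **${composite.toFixed(2)} / 5.0** | | |`,
  ].join('\n');
}

console.log(renderScoreTable({ accuracy: 4, groundedness: 5, completeness: 3, coherence: 4, helpfulness: 4 }));
```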
undefinedUsage Examples
使用示例
Evaluate a Plan Document
评估计划文档
javascript
// Load plan document
Read({ file_path: '.claude/context/plans/auth-design-plan-2026-02-21.md' });
// Evaluate against 5-dimension rubric
Skill({ skill: 'agent-evaluation' });
// Provide the plan content as the output to evaluatejavascript
// 加载计划文档
Read({ file_path: '.claude/context/plans/auth-design-plan-2026-02-21.md' });
// 采用5维度评分标准进行评估
Skill({ skill: 'agent-evaluation' });
// 提供计划内容作为待评估输出Evaluate Agent Response Before Completion
在任务完成前评估Agent的响应内容
```javascript
// Agent generates implementation summary
// Before marking task complete, evaluate the summary quality
Skill({ skill: 'agent-evaluation' });
// If composite < 3.5, request improvements before TaskUpdate(completed)
```

### Evaluate Code Review Output
```javascript
// After code-reviewer runs, evaluate the review quality
Skill({ skill: 'agent-evaluation' });
// Ensures review is grounded in actual code evidence, not assertions
```

### Batch Evaluation (comparing two outputs)
```javascript
// Evaluate output A
// Save verdict A
// Evaluate output B
// Save verdict B
// Compare composites → choose higher scoring output
```

## Integration with Verification-Before-Completion
The recommended quality gate pattern:
```javascript
// Step 1: Do the work

// Step 2: Evaluate with agent-evaluation
Skill({ skill: 'agent-evaluation' });
// If verdict is POOR or FAILING → rework before proceeding
// If verdict is ADEQUATE or better → proceed to verification

// Step 3: Final gate
Skill({ skill: 'verification-before-completion' });

// Step 4: Mark complete
TaskUpdate({ taskId: 'X', status: 'completed' });
```

## Iron Laws
- **NO COMPLETION CLAIM WITHOUT EVALUATION EVIDENCE** — If composite score < 2.5 (POOR or FAILING), rework the output before marking any task complete.
- **ALWAYS score all 5 dimensions** — never skip dimensions to save time; each dimension catches different failure modes (accuracy ≠ completeness ≠ groundedness).
- **ALWAYS cite specific evidence for every dimension score** — "Evidence: [file:line or direct quote]" is mandatory, not optional. Assertions without grounding are invalid.
- **ALWAYS use the weighted composite** — `accuracy×0.30 + groundedness×0.25 + completeness×0.20 + coherence×0.15 + helpfulness×0.10`. Never use a simple average.
- **NEVER evaluate before the work is complete** — evaluating incomplete outputs produces falsely low scores and wastes context budget.
## Anti-Patterns
| Anti-Pattern | Why It Fails | Correct Approach |
|---|---|---|
| Skipping dimensions to save time | Each dimension catches different failures | Always score all 5 dimensions |
| No evidence citation per dimension | Assertions without grounding are invalid | Quote specific text or file:line for every score |
| Using simple average for composite | Accuracy (30%) matters more than helpfulness (10%) | Use the weighted composite formula |
| Only checking EXCELLENT vs FAILING | ADEQUATE outputs need targeted improvements, not full rework | Use all 5 verdict tiers with appropriate action per tier |
| Evaluating before work is done | Incomplete outputs score falsely low | Evaluate completed outputs only |
| Treating evaluation as binary gate | Quality is a spectrum; binary pass/fail loses nuance | Use composite score + per-dimension breakdown together |
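To see why the third anti-pattern matters, compare a naive mean with the weighted composite for an output that is fluent and helpful but weakly grounded (the scores here are hypothetical):

```javascript
// Same hypothetical scores, two aggregation strategies.
const scores = { accuracy: 2, groundedness: 2, completeness: 4, coherence: 5, helpfulness: 5 };
const weights = { accuracy: 0.30, groundedness: 0.25, completeness: 0.20, coherence: 0.15, helpfulness: 0.10 };

// Naive mean: every dimension counts equally.
const simpleAverage =
  Object.values(scores).reduce((a, b) => a + b, 0) / Object.values(scores).length;

// Weighted composite: accuracy and groundedness dominate.
const weighted = Object.keys(weights).reduce(
  (sum, dim) => sum + scores[dim] * weights[dim],
  0,
);

// Simple average 3.6 lands in the GOOD tier, but the weighted
// composite 3.15 correctly drops this output to ADEQUATE because
// its accuracy and groundedness are weak.
console.log(simpleAverage, weighted.toFixed(2)); // 3.6 "3.15"
```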
## Assigned Agents
This skill is used by:

- `qa` — Primary: validates test outputs and QA reports before completion
- `code-reviewer` — Supporting: evaluates code review quality
- `reflection-agent` — Supporting: evaluates agent responses during reflection cycles
## Memory Protocol (MANDATORY)
Before starting:

```bash
cat .claude/context/memory/learnings.md
```

Check for:
- Previous evaluation scores for similar outputs
- Known quality patterns in this codebase
- Common failure modes for this task type

After completing:
- Evaluation pattern found -> `.claude/context/memory/learnings.md`
- Quality issue identified -> `.claude/context/memory/issues.md`
- Decision about rubric weights -> `.claude/context/memory/decisions.md`

**ASSUME INTERRUPTION:** Your context may reset. If it's not in memory, it didn't happen.