
# Agent Comparison Skill


## Operator Context


This skill operates as an operator for agent A/B testing workflows, configuring Claude's behavior for rigorous, evidence-based variant comparison. It implements the Benchmark Pipeline architectural pattern — prepare variants, run identical tasks, measure outcomes, report findings — with Domain Intelligence embedded in the comparison methodology.

### Hardcoded Behaviors (Always Apply)


- **CLAUDE.md Compliance**: Read and follow repository CLAUDE.md before execution
- **Over-Engineering Prevention**: Keep benchmark scripts simple. No speculative features or configurable frameworks that were not requested
- **Identical Task Prompts**: Both agents MUST receive the exact same task description, character-for-character
- **Isolated Execution**: Each agent runs in a separate session to avoid contamination
- **Test-Based Validation**: All generated code MUST pass the same test suite with the `-race` flag
- **Evidence-Based Reporting**: Every claim backed by measurable data (tokens, test counts, quality scores)
- **Total Session Cost**: Measure total tokens to working solution, not just prompt size

### Default Behaviors (ON unless disabled)


- **Communication Style**: Report facts without self-congratulation. Show command output rather than describing it
- **Temporary File Cleanup**: Remove temporary benchmark files and debug outputs at completion. Keep only the comparison report and generated code
- **Two-Tier Benchmarking**: Run both simple (algorithmic) and complex (production) tasks
- **Token Tracking**: Record input/output token counts per turn where visible
- **Quality Grading**: Score code on correctness, error handling, idioms, documentation, testing
- **Comparative Summary**: Generate side-by-side comparison report with clear verdict

### Optional Behaviors (OFF unless enabled)


- **Multiple Runs**: Run each benchmark 3x to account for variance
- **Blind Evaluation**: Hide agent identity during quality grading
- **Extended Benchmark Suite**: Run additional domain-specific tests
- **Historical Tracking**: Compare against previous benchmark runs

## What This Skill CAN Do


- Systematically compare agent variants through controlled benchmarks
- Measure total session token cost (prompt + reasoning + tools + retries)
- Grade code quality using domain-specific checklists
- Reveal quality differences invisible to simple metrics (prompt size, line count)
- Generate comparison reports with evidence-backed verdicts

## What This Skill CANNOT Do


- Compare agents without running identical tasks on both
- Declare a winner based on prompt size alone
- Skip quality grading and rely only on test pass rates
- Evaluate single agents in isolation (use the quality-grading skill instead)
- Compare skills or prompts (this is for agent variants only)


## Instructions


### Phase 1: PREPARE


**Goal**: Create benchmark environment and validate both agent variants exist.

**Step 1: Analyze original agent**

```bash
# Count original agent size
wc -l agents/{original-agent}.md

# Identify major sections
grep "^## " agents/{original-agent}.md

# Count code examples (candidates for removal in compact version)
grep -c '```' agents/{original-agent}.md
```

**Step 2: Create or validate compact variant**

If creating a compact variant, preserve:
- YAML frontmatter (name, description, routing)
- Operator Context (Hardcoded/Default/Optional)
- Core patterns and principles
- Error handling philosophy

Remove or condense:
- Lengthy code examples (keep 1-2 representative per pattern)
- Verbose explanations (condense to bullet points)
- Redundant instructions and changelogs

Target: 10-15% of original size while keeping essential knowledge. Removing capability (error handling patterns, concurrency patterns) invalidates the comparison. Remove redundancy, not knowledge.
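The size target above can be checked mechanically. A minimal Python sketch, assuming line count as the size metric (the helper names and the 420-line figure are illustrative; 3,529 and 57 are the line counts cited later in this document):

```python
# Illustrative helper for the 10-15% compaction target; not part of the skill.

def compaction_ratio(original_lines: int, compact_lines: int) -> float:
    """Compact size as a fraction of the original (by line count)."""
    if original_lines <= 0:
        raise ValueError("original agent must be non-empty")
    return compact_lines / original_lines

def within_target(original_lines: int, compact_lines: int,
                  low: float = 0.10, high: float = 0.15) -> bool:
    """True when the compact variant lands inside the 10-15% window."""
    return low <= compaction_ratio(original_lines, compact_lines) <= high

print(within_target(3529, 420))  # roughly 12%: inside the window
print(within_target(3529, 57))   # under 2%: almost certainly lost capability
```

Passing this check proves nothing about preserved capability; Step 3's structural validation still applies.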

**Step 3: Validate compact variant structure**

```bash
# Verify YAML frontmatter
head -20 agents/{compact-agent}.md | grep -E "^(name|description):"

# Verify Operator Context preserved
grep -c "### Hardcoded Behaviors" agents/{compact-agent}.md

# Compare sizes
echo "Original: $(wc -l < agents/{original-agent}.md) lines"
echo "Compact: $(wc -l < agents/{compact-agent}.md) lines"
```

**Step 4: Create benchmark directory and prepare prompts**

```bash
mkdir -p benchmark/{task-name}/{full,compact}
```

Write the task prompt ONCE, then copy it for both agents. NEVER customize prompts per agent.

**Gate**: Both agent variants exist with valid YAML frontmatter. Benchmark directories created. Identical task prompts written. Proceed only when gate passes.

### Phase 2: BENCHMARK


**Goal**: Run identical tasks on both agents, capturing all metrics.

**Step 1: Run simple task benchmark (2-3 tasks)**

Use algorithmic problems with clear specifications (e.g., Advent of Code Day 1-6). Both agents should perform identically on well-defined problems. Simple tasks establish a baseline — if an agent fails here, it has fundamental issues.

Spawn both agents in parallel using the Task tool for fair timing:

```
Task(
  prompt="[exact task prompt]\nSave to: benchmark/{task}/full/",
  subagent_type="{full-agent}"
)

Task(
  prompt="[exact task prompt]\nSave to: benchmark/{task}/compact/",
  subagent_type="{compact-agent}"
)
```

Run in parallel to avoid caching effects or system load variance skewing results.
**Step 2: Run complex task benchmark (1-2 tasks)**

Use production-style problems that require concurrency, error handling, and edge case anticipation. These are where quality differences emerge. See `references/benchmark-tasks.md` for standard tasks.

Recommended complex tasks:
- **Worker Pool**: Rate limiting, graceful shutdown, panic recovery
- **LRU Cache with TTL**: Generics, background goroutines, zero-value semantics
- **HTTP Service**: Middleware chains, structured errors, health checks

**Step 3: Capture metrics for each run**

Record immediately after each agent completes. Do NOT wait until all runs finish.
| Metric | Full Agent | Compact Agent |
|--------|------------|---------------|
| Tests pass | X/X | X/X |
| Race conditions | X | X |
| Code lines (main) | X | X |
| Test lines | X | X |
| Session tokens | X | X |
| Wall-clock time | Xm Xs | Xm Xs |
| Retry cycles | X | X |
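The table above can be captured as one record per run. A sketch of such a record in Python; the field names mirror the table, but the schema itself is an assumption, not part of the skill:

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class RunMetrics:
    """One benchmark run; fields mirror the metrics table (schema assumed)."""
    agent: str            # "full" or "compact"
    task: str
    tests_passed: int
    tests_total: int
    race_conditions: int
    code_lines: int       # main code only
    test_lines: int
    session_tokens: int
    wall_clock_s: float
    retry_cycles: int

# Record immediately after the agent completes, e.g. as one JSON line per run.
run = RunMetrics("full", "worker-pool", 12, 12, 0, 310, 280, 194_000, 95.0, 0)
print(json.dumps(asdict(run)))
```

Serializing each run as it finishes (rather than batching at the end) enforces the "record immediately" rule above.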
**Step 4: Run tests with race detector**

```bash
cd benchmark/{task-name}/full && go test -race -v -count=1
cd benchmark/{task-name}/compact && go test -race -v -count=1
```

Use `-count=1` to disable test caching. Race conditions are automatic quality failures — record them but do NOT fix them for the agent being tested.

**Gate**: Both agents completed all tasks. Metrics captured for every run. Test output saved. Proceed only when gate passes.

### Phase 3: GRADE


**Goal**: Score code quality beyond pass/fail using domain-specific checklists.

**Step 1: Create quality checklist BEFORE reviewing code**

Define criteria before seeing results to prevent bias. Do NOT invent criteria after seeing one agent's output. See `references/grading-rubric.md` for standard rubrics.
| Criterion | 5/5 | 3/5 | 1/5 |
|-----------|-----|-----|-----|
| Correctness | All tests pass, no race conditions | Some failures | Broken |
| Error Handling | Comprehensive, production-ready | Adequate | None |
| Idioms | Exemplary for the language | Acceptable | Anti-patterns |
| Documentation | Thorough | Adequate | None |
| Testing | Comprehensive coverage | Basic | Minimal |
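Scores from this rubric can be validated and totalled mechanically. A small sketch; the criterion keys follow the rubric, but the function itself is an assumption, not part of the skill:

```python
CRITERIA = ("correctness", "error_handling", "idioms", "documentation", "testing")

def total_score(scores: dict) -> int:
    """Validate a five-criterion scorecard (each 1-5) and return the /25 total."""
    missing = [c for c in CRITERIA if c not in scores]
    if missing:
        raise ValueError(f"unscored criteria: {missing}")
    for name, value in scores.items():
        if not 1 <= value <= 5:
            raise ValueError(f"{name} must be scored 1-5, got {value}")
    return sum(scores[c] for c in CRITERIA)

print(total_score({"correctness": 5, "error_handling": 4, "idioms": 5,
                   "documentation": 3, "testing": 4}))  # 21
```

Rejecting partial scorecards keeps the "score one agent completely before starting the other" rule honest.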
**Step 2: Score each solution independently**

Grade each agent's code on all five criteria. Score one agent completely before starting the other.

```markdown
## {Agent} Solution - {Task}

| Criterion | Score | Notes |
|-----------|-------|-------|
| Correctness | X/5 | |
| Error Handling | X/5 | |
| Idioms | X/5 | |
| Documentation | X/5 | |
| Testing | X/5 | |
| **Total** | X/25 | |
```

**Step 3: Document specific bugs with production impact**

For each bug found, record:

```markdown
## Bug: {description}

- Agent: {which agent}
- What happened: {behavior}
- Correct behavior: {expected}
- Production impact: {consequence}
- Test coverage: {did tests catch it? why not?}
```

"Tests pass" is necessary but not sufficient. Production bugs often pass tests — Clear() returning nothing passes if no test checks the return value. TTL=0 bugs pass if no test uses zero TTL.

**Step 4: Calculate effective cost**

```
effective_cost = total_tokens * (1 + bug_count * 0.25)
```

An agent using 194k tokens with 0 bugs has better economics than one using 119k tokens with 5 bugs requiring fixes. The metric that matters is total cost to a working, production-quality solution.
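The formula and the 194k/119k comparison above transcribe directly into code:

```python
def effective_cost(total_tokens: int, bug_count: int,
                   penalty: float = 0.25) -> float:
    """Session tokens inflated by a 25% penalty per bug that needs fixing."""
    return total_tokens * (1 + bug_count * penalty)

# The comparison from the text: zero bugs beats fewer raw tokens.
print(effective_cost(194_000, 0))  # 194000.0
print(effective_cost(119_000, 5))  # 267750.0
```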

**Gate**: Both solutions graded with evidence. Specific bugs documented with production impact. Effective cost calculated. Proceed only when gate passes.

### Phase 4: REPORT


**Goal**: Generate comparison report with evidence-backed verdict.

**Step 1: Generate comparison report**

Use the report template from `references/report-template.md`. Include:
- Executive summary with clear winner per metric
- Per-task results with metrics tables
- Token economics analysis (one-time prompt cost vs session cost)
- Specific bugs found and their production impact
- Verdict based on total evidence

**Step 2: Run comparison analysis**
```bash
# TODO: scripts/compare.py not yet implemented
# Manual alternative: compare benchmark outputs side-by-side
diff benchmark/{task-name}/full/ benchmark/{task-name}/compact/
```

**Step 3: Analyze token economics**

The key economic insight: agent prompts are a one-time cost per session. Everything after — reasoning, code generation, debugging, retries — costs tokens on every turn.

| Pattern | Description |
|---------|-------------|
| Large agent, low churn | High initial cost, fewer retries, less debugging |
| Small agent, high churn | Low initial cost, more retries, more debugging |

When a micro agent produces correct code, it uses approximately the same total tokens. The savings appear only when it cuts corners.
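The two churn patterns reduce to a toy cost model. All numbers below are illustrative, not measurements:

```python
def session_cost(prompt_tokens: int, tokens_per_turn: int, turns: int) -> int:
    """One-time prompt cost plus per-turn reasoning/tool/retry cost."""
    return prompt_tokens + tokens_per_turn * turns

# Large agent, low churn vs small agent, high churn (illustrative numbers):
print(session_cost(prompt_tokens=15_000, tokens_per_turn=8_000, turns=6))   # 63000
print(session_cost(prompt_tokens=1_000, tokens_per_turn=8_000, turns=12))   # 97000
```

With per-turn cost held constant, the extra retries quickly swamp the prompt savings, which is the point of the table above.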

**Step 4: State verdict with evidence**

The verdict MUST be backed by data. Include:
- Which agent won on simple tasks (expected: equivalent)
- Which agent won on complex tasks (expected: full agent)
- Total session cost comparison
- Effective cost comparison (with bug penalty)
- Clear recommendation for when to use each variant

See `references/methodology.md` for the complete testing methodology with December 2024 data.

**Gate**: Report generated with all metrics. Verdict stated with evidence. Report saved to benchmark directory.

---

## Examples


### Example 1: Creating a Compact Agent


**User says**: "Create a compact version of golang-general-engineer and test it"

**Actions**:
1. Analyze original, create compact variant at 10-15% size (PREPARE)
2. Run simple task (Advent of Code) + complex task (Worker Pool) on both (BENCHMARK)
3. Score both with domain-specific checklist, calculate effective cost (GRADE)
4. Generate comparison report with verdict (REPORT)

**Result**: Data-driven recommendation on whether the compact version is viable

### Example 2: Comparing Internal vs External Agent


**User says**: "Compare our Go agent against go-expert-0xfurai"

**Actions**:
1. Validate both agents exist, prepare identical task prompts (PREPARE)
2. Run two-tier benchmarks with token tracking (BENCHMARK)
3. Grade with production quality checklist, document all bugs (GRADE)
4. Report with token economics showing prompt cost vs session cost (REPORT)

**Result**: Evidence-based comparison showing the true cost of each variant

## Error Handling


### Error: "Agent Type Not Found"


**Cause**: Agent not registered or name misspelled
**Solution**: Verify the agent file exists in the agents/ directory. Restart the Claude Code client to pick up new definitions.

### Error: "Tests Fail with Race Condition"


**Cause**: Concurrent code has data races
**Solution**: This is a real quality difference. Record it as a finding in the grade. Do NOT fix it for the agent being tested.

### Error: "Different Test Counts Between Agents"


**Cause**: Agents wrote different test suites
**Solution**: Valid data point. Grade on test coverage and quality, not raw count. More tests is not always better.

### Error: "Timeout During Agent Execution"


**Cause**: Complex task taking too long or agent stuck in a retry loop
**Solution**: Note the timeout and number of retries attempted. Record as incomplete with partial metrics. Increase the timeout limit if warranted, but excessive retries are a quality signal — an agent that needs many retries is less efficient regardless of final outcome.

## Anti-Patterns


### Anti-Pattern 1: Comparing Only Prompt Size


**What it looks like**: "Compact agent is 90% smaller, therefore 90% more efficient"
**Why wrong**: Prompt is a one-time cost. Session reasoning, retries, and debugging dominate total tokens. Our data showed a 57-line agent used 69.5k tokens vs 69.6k for a 3,529-line agent on the same correct solution.
**Do instead**: Measure total session tokens to working solution.

### Anti-Pattern 2: Different Task Prompts


**What it looks like**: Giving the full agent harder requirements than the compact agent
**Why wrong**: Creates an unfair comparison. Different requirements produce different solutions, invalidating all measurements.
**Do instead**: Copy-paste identical prompts character-for-character. Verify before running.

### Anti-Pattern 3: Treating Test Failures as Equal Quality


**What it looks like**: "Both agents completed the task" when one has 12/12 tests passing and the other has 8/12
**Why wrong**: Bugs have real cost. False equivalence between producing code and producing working code.
**Do instead**: Grade quality rigorously. Calculate effective cost with the bug penalty multiplier.

### Anti-Pattern 4: Single Benchmark Declaration


**What it looks like**: "Tested on one puzzle. Compact agent wins!"
**Why wrong**: A single data point is sensitive to task selection bias. Simple tasks mask differences in edge case handling. Cannot distinguish luck from systematic quality.
**Do instead**: Run two-tier benchmarking with 2-3 simple tasks and 1-2 complex tasks.

### Anti-Pattern 5: Removing Core Patterns to Create Compact Agent


**What it looks like**: Compact version removes error handling patterns, concurrency guidance, and testing requirements to reduce size
**Why wrong**: Creates an unfair comparison. The compact agent is missing essential knowledge, guaranteeing quality degradation rather than testing whether brevity is possible.
**Do instead**: Remove verbose examples and redundant explanations, not capability. Keep one representative example per pattern. Condense explanations to bullet points but retain key insights.

## References


This skill uses these shared patterns:
- Anti-Rationalization - Prevents shortcut rationalizations
- Verification Checklist - Pre-completion checks

### Domain-Specific Anti-Rationalization


| Rationalization | Why It's Wrong | Required Action |
|-----------------|----------------|-----------------|
| "Compact agent saved 50% tokens" | Savings may come from cutting corners, not efficiency | Check quality scores before claiming savings |
| "Tests pass, agents are equal" | Tests can miss production bugs (goroutine leaks, wrong semantics) | Apply domain-specific quality checklist |
| "One benchmark is enough" | Single task is sensitive to selection bias | Run two-tier benchmarks (simple + complex) |
| "Prompt size determines cost" | Prompt is one-time; reasoning tokens dominate sessions | Measure total session cost to working solution |

### Reference Files


- `${CLAUDE_SKILL_DIR}/references/methodology.md`: Complete testing methodology with December 2024 data
- `${CLAUDE_SKILL_DIR}/references/grading-rubric.md`: Detailed grading criteria and quality checklists
- `${CLAUDE_SKILL_DIR}/references/benchmark-tasks.md`: Standard benchmark task descriptions and prompts
- `${CLAUDE_SKILL_DIR}/references/report-template.md`: Comparison report template with all required sections