benchmark-runner

Benchmark Runner

Standardizes performance comparison methodology: metric selection, test case design, environment capture, result formatting, and tradeoff analysis. Produces reproducible benchmark reports that support informed decisions — not just "A is faster than B" but "A is faster for small inputs while B scales better."

Reference Files

| File | Contents | Load When |
| --- | --- | --- |
| references/metric-selection.md | Metric catalog (latency percentiles, throughput, memory, accuracy), selection criteria per task type | Always |
| references/test-case-design.md | Representative input selection, scale variation, edge case coverage, warmup strategies | Always |
| references/environment-capture.md | Hardware/software context recording, reproducibility requirements, variance control | Always |
| references/statistical-rigor.md | Sample sizing, variance measurement, significance testing, outlier handling | Results need statistical validation |

Prerequisites

  • Clear candidates to compare (at least 2)
  • Access to run or observe the candidates (code, API, or existing results)
  • Representative workload definition

Workflow

Phase 1: Define Scope

  1. What are the candidates? — Name each candidate precisely, including version. "Python dict vs Redis" is too vague. "Python 3.12 dict (in-process) vs Redis 7.2 (localhost, TCP)" is testable.
  2. What claims need validation? — "A is faster" → faster at what? For what input size? Under what load? Benchmark design flows from the specific claim.
  3. What is the decision context? — Why does this comparison matter? This determines which metrics are most important.

Phase 2: Select Metrics

Choose metrics that match the decision context:
| Metric Category | Specific Metrics | When Important |
| --- | --- | --- |
| Latency | P50, P95, P99, mean, std dev | User-facing operations, API calls |
| Throughput | ops/sec, tokens/sec, MB/sec | Batch processing, streaming |
| Memory | Peak RSS, avg RSS, allocation rate | Resource-constrained environments |
| Accuracy | F1, BLEU, exact match, precision/recall | ML models, algorithms with quality tradeoffs |
| Cost | $/1K operations, $/hour, $/GB | Cloud services, API comparisons |
| Startup | Time to first operation, cold start | Serverless, CLI tools |
Select 2-4 metrics. More than 4 makes comparison tables unreadable.
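For latency, the percentile columns above can be computed directly from raw samples. A minimal sketch using only the Python standard library (the `latency_summary` helper is illustrative, not part of this skill):

```python
import statistics

def latency_summary(samples_ms):
    """Summarize raw latency samples (ms) into the metrics listed above."""
    # quantiles(n=100) returns the 1st..99th percentile cut points
    q = statistics.quantiles(samples_ms, n=100)
    return {
        "p50": q[49],
        "p95": q[94],
        "p99": q[98],
        "mean": statistics.mean(samples_ms),
        "stdev": statistics.stdev(samples_ms),
    }

print(latency_summary([12.0, 15.0, 11.0, 14.0, 90.0, 13.0, 12.5, 14.5]))
```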

Phase 3: Design Test Cases

Create a matrix of inputs that reveal performance characteristics:
  1. Scale variation — Small, medium, large inputs. Performance often changes non-linearly with scale.
  2. Representative data — Use realistic inputs, not synthetic best-case data.
  3. Edge cases — Empty input, maximum size, adversarial input.
  4. Warmup — Exclude JIT compilation, cache warming, and connection establishment from measurements. Run N warmup iterations before recording.
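The warmup rule can be sketched as a small harness that runs and discards N warmup iterations before recording anything. Names here (`measure`, the dict-lookup workload) are illustrative:

```python
import time

def measure(fn, *, warmup=5, iterations=50):
    """Run `fn`, excluding warmup iterations; return per-call times in ms."""
    for _ in range(warmup):  # discard JIT/cache/connection-setup effects
        fn()
    timings = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000)
    return timings

# Example workload: repeated dict lookups (an illustrative candidate)
data = {i: i * i for i in range(10_000)}
timings = measure(lambda: [data[k] for k in range(0, 10_000, 7)])
print(f"best: {min(timings):.3f} ms over {len(timings)} iterations")
```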

Phase 4: Specify Environment

Record everything needed to reproduce the results:
  1. Hardware — CPU model, core count, RAM size, GPU model (if applicable)
  2. Software — OS version, language runtime version, dependency versions
  3. Configuration — Thread count, batch size, connection pool size, cache settings
  4. Isolation — What else was running? Background processes affect results.
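A minimal sketch of environment capture using only the Python standard library. Hardware specifics such as the exact CPU model or RAM size are not exposed portably by the stdlib and usually require platform-specific tools; this records only what is readily available:

```python
import datetime
import os
import platform
import sys

def capture_environment():
    """Record readily available context for reproducing a benchmark run."""
    return {
        "date": datetime.date.today().isoformat(),
        "os": f"{platform.system()} {platform.release()}",
        "machine": platform.machine(),
        "cpu_count": os.cpu_count(),
        "python": sys.version.split()[0],
    }

for key, value in capture_environment().items():
    print(f"{key}: {value}")
```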

Phase 5: Structure Results

Produce comparison tables with clear winners per metric, followed by tradeoff analysis.
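As an illustrative sketch, the per-metric winner column can be derived mechanically once each metric's direction is known (`winner_row` is a hypothetical helper, not part of this skill):

```python
def winner_row(test_case, a_value, b_value, lower_is_better=True):
    """Format one comparison-table row with a per-metric winner."""
    if a_value == b_value:
        win = "tie"
    elif (a_value < b_value) == lower_is_better:
        win = "A"
    else:
        win = "B"
    return f"| {test_case} | {a_value} | {b_value} | {win} |"

print(winner_row("Small", 1.2, 3.4))                          # latency: lower wins
print(winner_row("Small", 900, 1200, lower_is_better=False))  # throughput: higher wins
```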

Output Format

Benchmark: {Descriptive Title}

Date: {YYYY-MM-DD}
Hardware: {CPU}, {RAM}, {GPU if applicable}
Software: {runtime versions}
Configuration: {key settings that affect results}

Candidates

| # | Candidate | Version | Configuration |
| --- | --- | --- | --- |
| A | {name} | {version} | {relevant config} |
| B | {name} | {version} | {relevant config} |

Test Cases

| # | Name | Input Size | Description | Warmup | Iterations |
| --- | --- | --- | --- | --- | --- |
| 1 | Small | {size} | {what it represents} | {N} | {N} |
| 2 | Medium | {size} | {what it represents} | {N} | {N} |
| 3 | Large | {size} | {what it represents} | {N} | {N} |

Results

Latency (ms, lower is better)

| Test Case | A (P50 / P95 / P99) | B (P50 / P95 / P99) | Winner |
| --- | --- | --- | --- |
| Small | {values} | {values} | {A or B} |
| Medium | {values} | {values} | {A or B} |
| Large | {values} | {values} | {A or B} |

Memory (MB, lower is better)

| Test Case | A (Peak) | B (Peak) | Winner |
| --- | --- | --- | --- |
| Small | {value} | {value} | {A or B} |
| Medium | {value} | {value} | {A or B} |
| Large | {value} | {value} | {A or B} |

Analysis

Overall Winner

{Candidate} wins on {N} of {M} metrics across all test cases.

Tradeoff Summary

  • Choose A when: {conditions where A is the better choice}
  • Choose B when: {conditions where B is the better choice}

Caveats

  • {Limitation of this benchmark}
  • {Condition under which results may differ}

Reproduction

```bash
# Environment setup
{commands to recreate the environment}

# Run benchmark
{commands to execute the benchmark}
```

Configuring Scope

| Mode | Candidates | Depth | When to Use |
| --- | --- | --- | --- |
| quick | 2 candidates, 1-2 metrics | Single test case, no statistics | Rough comparison, sanity check |
| standard | 2-3 candidates, 2-4 metrics | 3 test cases, mean + std dev | Default for most comparisons |
| rigorous | Any count, full metric suite | Multiple test cases, percentiles, significance tests | Publication, critical decisions |

Calibration Rules

  1. Measure, don't guess. Intuition about performance is unreliable. "Obviously faster" is not a benchmark result.
  2. Apples to apples. Candidates must be compared under identical conditions. Different hardware, configuration, or input data invalidates the comparison.
  3. Report variance, not just means. A mean of 50ms with std dev of 100ms is not the same as a mean of 50ms with std dev of 2ms. Always report spread.
  4. Warm up before measuring. First-run performance includes JIT, cache warming, and connection setup. Exclude warmup iterations from results.
  5. Representative inputs only. Benchmarking with synthetic best-case input is misleading. Use data that resembles production workloads.
  6. State the winner per metric, not overall. "A is better" is lazy. "A has lower latency; B uses less memory" is useful.
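Rule 3 can be made concrete with synthetic data: two candidates with identical mean latency but very different spread, which a means-only table would report as a tie:

```python
import statistics

# Two candidates with the same mean latency but very different spread
a = [50.0] * 98 + [48.0, 52.0]   # tight: ~50 ms on every call
b = [10.0] * 50 + [90.0] * 50    # bimodal: mean 50 ms, wildly variable

for name, samples in (("A", a), ("B", b)):
    print(f"{name}: mean={statistics.mean(samples):.1f} ms "
          f"stdev={statistics.stdev(samples):.1f} ms")
```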

Error Handling

| Problem | Resolution |
| --- | --- |
| Cannot run candidates locally | Design the benchmark specification. Document what to measure and how. The user executes separately. |
| Results are noisy (high variance) | Increase iteration count. Check for background processes. Use dedicated hardware or containers for isolation. |
| Candidates serve different purposes | Acknowledge that the comparison is partial. Benchmark only the overlapping functionality. |
| No baseline exists | Establish one candidate as the baseline. Report relative performance (e.g., "B is 1.3x faster than A"). |
| Hardware context unavailable | Document what is known. Note that results may not be reproducible without full context. |

When NOT to Benchmark

Push back if:
  • The comparison is not performance-related (feature comparison → use a decision matrix or ADR instead)
  • The candidates are fundamentally different tools (comparing a database to a message queue)
  • The user wants to benchmark trivial operations (comparing two string concatenation methods in Python)
  • Results from others already exist and conditions match — link to existing benchmarks instead