benchmark-runner

Benchmark Runner

Standardizes performance comparison methodology: metric selection, test case design, environment capture, result formatting, and tradeoff analysis. Produces reproducible benchmark reports that support informed decisions — not just "A is faster than B" but "A is faster for small inputs while B scales better."

Reference Files

| File | Contents | Load When |
| --- | --- | --- |
| references/metric-selection.md | Metric catalog (latency percentiles, throughput, memory, accuracy), selection criteria per task type | Always |
| references/test-case-design.md | Representative input selection, scale variation, edge case coverage, warmup strategies | Always |
| references/environment-capture.md | Hardware/software context recording, reproducibility requirements, variance control | Always |
| references/statistical-rigor.md | Sample sizing, variance measurement, significance testing, outlier handling | Results need statistical validation |

Prerequisites

  • Clear candidates to compare (at least 2)
  • Access to run or observe the candidates (code, API, or existing results)
  • Representative workload definition

Workflow

Phase 1: Define Scope

  1. What are the candidates? — Name each candidate precisely, including version. "Python dict vs Redis" is too vague. "Python 3.12 dict (in-process) vs Redis 7.2 (localhost, TCP)" is testable.
  2. What claims need validation? — "A is faster" → faster at what? For what input size? Under what load? Benchmark design flows from the specific claim.
  3. What is the decision context? — Why does this comparison matter? This determines which metrics are most important.

Phase 2: Select Metrics

Choose metrics that match the decision context:
| Metric Category | Specific Metrics | When Important |
| --- | --- | --- |
| Latency | P50, P95, P99, mean, std dev | User-facing operations, API calls |
| Throughput | ops/sec, tokens/sec, MB/sec | Batch processing, streaming |
| Memory | Peak RSS, avg RSS, allocation rate | Resource-constrained environments |
| Accuracy | F1, BLEU, exact match, precision/recall | ML models, algorithms with quality tradeoffs |
| Cost | $/1K operations, $/hour, $/GB | Cloud services, API comparisons |
| Startup | Time to first operation, cold start | Serverless, CLI tools |
Select 2-4 metrics. More than 4 makes comparison tables unreadable.
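For latency, the percentile columns above can be computed directly from raw samples. A minimal sketch using only the Python standard library (the `latency_summary` helper is illustrative, not part of this skill):

```python
import statistics

def latency_summary(samples_ms):
    """Summarize raw latency samples (ms) into the metrics listed above."""
    # quantiles(n=100) returns the 1st..99th percentile cut points
    q = statistics.quantiles(samples_ms, n=100)
    return {
        "p50": q[49],
        "p95": q[94],
        "p99": q[98],
        "mean": statistics.mean(samples_ms),
        "stdev": statistics.stdev(samples_ms),
    }

print(latency_summary([12.0, 15.0, 11.0, 14.0, 90.0, 13.0, 12.5, 14.5]))
```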

Phase 3: Design Test Cases

Create a matrix of inputs that reveal performance characteristics:
  1. Scale variation — Small, medium, large inputs. Performance often changes non-linearly with scale.
  2. Representative data — Use realistic inputs, not synthetic best-case data.
  3. Edge cases — Empty input, maximum size, adversarial input.
  4. Warmup — Exclude JIT compilation, cache warming, and connection establishment from measurements. Run N warmup iterations before recording.
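The warmup rule can be sketched as a small harness that runs and discards N warmup iterations before recording anything. Names here (`measure`, the dict-lookup workload) are illustrative:

```python
import time

def measure(fn, *, warmup=5, iterations=50):
    """Run `fn`, excluding warmup iterations; return per-call times in ms."""
    for _ in range(warmup):  # discard JIT/cache/connection-setup effects
        fn()
    timings = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000)
    return timings

# Example workload: repeated dict lookups (an illustrative candidate)
data = {i: i * i for i in range(10_000)}
timings = measure(lambda: [data[k] for k in range(0, 10_000, 7)])
print(f"best: {min(timings):.3f} ms over {len(timings)} iterations")
```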

Phase 4: Specify Environment

Record everything needed to reproduce the results:
  1. Hardware — CPU model, core count, RAM size, GPU model (if applicable)
  2. Software — OS version, language runtime version, dependency versions
  3. Configuration — Thread count, batch size, connection pool size, cache settings
  4. Isolation — What else was running? Background processes affect results.
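A minimal sketch of environment capture using only the Python standard library. Hardware specifics such as the exact CPU model or RAM size are not exposed portably by the stdlib and usually require platform-specific tools; this records only what is readily available:

```python
import datetime
import os
import platform
import sys

def capture_environment():
    """Record readily available context for reproducing a benchmark run."""
    return {
        "date": datetime.date.today().isoformat(),
        "os": f"{platform.system()} {platform.release()}",
        "machine": platform.machine(),
        "cpu_count": os.cpu_count(),
        "python": sys.version.split()[0],
    }

for key, value in capture_environment().items():
    print(f"{key}: {value}")
```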

Phase 5: Structure Results

Produce comparison tables with clear winners per metric, followed by tradeoff analysis.
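As an illustrative sketch, the per-metric winner column can be derived mechanically once each metric's direction is known (`winner_row` is a hypothetical helper, not part of this skill):

```python
def winner_row(test_case, a_value, b_value, lower_is_better=True):
    """Format one comparison-table row with a per-metric winner."""
    if a_value == b_value:
        win = "tie"
    elif (a_value < b_value) == lower_is_better:
        win = "A"
    else:
        win = "B"
    return f"| {test_case} | {a_value} | {b_value} | {win} |"

print(winner_row("Small", 1.2, 3.4))                          # latency: lower wins
print(winner_row("Small", 900, 1200, lower_is_better=False))  # throughput: higher wins
```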

Output Format

Benchmark: {Descriptive Title}

Date: {YYYY-MM-DD}
Hardware: {CPU}, {RAM}, {GPU if applicable}
Software: {runtime versions}
Configuration: {key settings that affect results}

Candidates

| # | Candidate | Version | Configuration |
| --- | --- | --- | --- |
| A | {name} | {version} | {relevant config} |
| B | {name} | {version} | {relevant config} |

Test Cases

| # | Name | Input Size | Description | Warmup | Iterations |
| --- | --- | --- | --- | --- | --- |
| 1 | Small | {size} | {what it represents} | {N} | {N} |
| 2 | Medium | {size} | {what it represents} | {N} | {N} |
| 3 | Large | {size} | {what it represents} | {N} | {N} |

Results

Latency (ms, lower is better)

| Test Case | A (P50 / P95 / P99) | B (P50 / P95 / P99) | Winner |
| --- | --- | --- | --- |
| Small | {values} | {values} | {A or B} |
| Medium | {values} | {values} | {A or B} |
| Large | {values} | {values} | {A or B} |

Memory (MB, lower is better)

| Test Case | A (Peak) | B (Peak) | Winner |
| --- | --- | --- | --- |
| Small | {value} | {value} | {A or B} |
| Medium | {value} | {value} | {A or B} |
| Large | {value} | {value} | {A or B} |

Analysis

Overall Winner

{Candidate} wins on {N} of {M} metrics across all test cases.

Tradeoff Summary

  • Choose A when: {conditions where A is the better choice}
  • Choose B when: {conditions where B is the better choice}

Caveats

  • {Limitation of this benchmark}
  • {Condition under which results may differ}

Reproduction

```bash
# Environment setup
{commands to recreate the environment}

# Run benchmark
{commands to execute the benchmark}
```

Configuring Scope

| Mode | Candidates | Depth | When to Use |
| --- | --- | --- | --- |
| quick | 2 candidates, 1-2 metrics | Single test case, no statistics | Rough comparison, sanity check |
| standard | 2-3 candidates, 2-4 metrics | 3 test cases, mean + std dev | Default for most comparisons |
| rigorous | Any count, full metric suite | Multiple test cases, percentiles, significance tests | Publication, critical decisions |

Calibration Rules

  1. Measure, don't guess. Intuition about performance is unreliable. "Obviously faster" is not a benchmark result.
  2. Apples to apples. Candidates must be compared under identical conditions. Different hardware, configuration, or input data invalidates the comparison.
  3. Report variance, not just means. A mean of 50ms with std dev of 100ms is not the same as a mean of 50ms with std dev of 2ms. Always report spread.
  4. Warm up before measuring. First-run performance includes JIT, cache warming, and connection setup. Exclude warmup iterations from results.
  5. Representative inputs only. Benchmarking with synthetic best-case input is misleading. Use data that resembles production workloads.
  6. State the winner per metric, not overall. "A is better" is lazy. "A has lower latency; B uses less memory" is useful.
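Rule 3 can be made concrete with synthetic data: two candidates with identical mean latency but very different spread, which a means-only table would report as a tie:

```python
import statistics

# Two candidates with the same mean latency but very different spread
a = [50.0] * 98 + [48.0, 52.0]   # tight: ~50 ms on every call
b = [10.0] * 50 + [90.0] * 50    # bimodal: mean 50 ms, wildly variable

for name, samples in (("A", a), ("B", b)):
    print(f"{name}: mean={statistics.mean(samples):.1f} ms "
          f"stdev={statistics.stdev(samples):.1f} ms")
```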

Error Handling

| Problem | Resolution |
| --- | --- |
| Cannot run candidates locally | Design the benchmark specification. Document what to measure and how. The user executes separately. |
| Results are noisy (high variance) | Increase iteration count. Check for background processes. Use dedicated hardware or containers for isolation. |
| Candidates serve different purposes | Acknowledge that the comparison is partial. Benchmark only the overlapping functionality. |
| No baseline exists | Establish one candidate as the baseline. Report relative performance (e.g., "B is 1.3x faster than A"). |
| Hardware context unavailable | Document what is known. Note that results may not be reproducible without full context. |

When NOT to Benchmark

Push back if:
  • The comparison is not performance-related (feature comparison → use a decision matrix or ADR instead)
  • The candidates are fundamentally different tools (comparing a database to a message queue)
  • The user wants to benchmark trivial operations (comparing two string concatenation methods in Python)
  • Results from others already exist and conditions match — link to existing benchmarks instead