# Benchmark Runner
Standardizes performance comparison methodology: metric selection, test case design,
environment capture, result formatting, and tradeoff analysis. Produces reproducible
benchmark reports that support informed decisions — not just "A is faster than B" but
"A is faster for small inputs while B scales better."
## Reference Files
| File | Contents | Load When |
|---|---|---|
| | Metric catalog (latency percentiles, throughput, memory, accuracy), selection criteria per task type | Always |
| | Representative input selection, scale variation, edge case coverage, warmup strategies | Always |
| | Hardware/software context recording, reproducibility requirements, variance control | Always |
| | Sample sizing, variance measurement, significance testing, outlier handling | Results need statistical validation |
## Prerequisites
- Clear candidates to compare (at least 2)
- Access to run or observe the candidates (code, API, or existing results)
- Representative workload definition
## Workflow
### Phase 1: Define Scope
- What are the candidates? — Name each candidate precisely, including version. "Python dict vs Redis" is too vague. "Python 3.12 dict (in-process) vs Redis 7.2 (localhost, TCP)" is testable.
- What claims need validation? — "A is faster" → faster at what? For what input size? Under what load? Benchmark design flows from the specific claim.
- What is the decision context? — Why does this comparison matter? This determines which metrics are most important.
### Phase 2: Select Metrics
Choose metrics that match the decision context:
| Metric Category | Specific Metrics | When Important |
|---|---|---|
| Latency | P50, P95, P99, mean, std dev | User-facing operations, API calls |
| Throughput | ops/sec, tokens/sec, MB/sec | Batch processing, streaming |
| Memory | Peak RSS, avg RSS, allocation rate | Resource-constrained environments |
| Accuracy | F1, BLEU, exact match, precision/recall | ML models, algorithms with quality tradeoffs |
| Cost | $/1K operations, $/hour, $/GB | Cloud services, API comparisons |
| Startup | Time to first operation, cold start | Serverless, CLI tools |
Select 2-4 metrics. More than 4 makes comparison tables unreadable.
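The latency metrics in the table above can be computed from raw samples with the standard library alone. A minimal sketch; `latency_summary` is an illustrative name, not part of this skill's reference files:

```python
import statistics

def latency_summary(samples_ms: list[float]) -> dict[str, float]:
    """Summarize raw latency samples (in ms) into the table's latency metrics."""
    # quantiles(n=100) returns 99 cut points: index k is the (k+1)-th percentile
    pct = statistics.quantiles(samples_ms, n=100)
    return {
        "p50": pct[49],
        "p95": pct[94],
        "p99": pct[98],
        "mean": statistics.fmean(samples_ms),
        "std_dev": statistics.stdev(samples_ms),
    }
```

Reporting P50 alongside P95/P99 shows both typical and tail behavior, which a mean alone hides.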
### Phase 3: Design Test Cases
Create a matrix of inputs that reveal performance characteristics:
- Scale variation — Small, medium, large inputs. Performance often changes non-linearly with scale.
- Representative data — Use realistic inputs, not synthetic best-case data.
- Edge cases — Empty input, maximum size, adversarial input.
- Warmup — Exclude JIT compilation, cache warming, and connection establishment from measurements. Run N warmup iterations before recording.
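The warmup rule above can be sketched as a small harness that discards the first N runs before recording; this assumes a synchronous callable, and the function and parameter names are illustrative:

```python
import time

def measure(fn, *, warmup: int = 5, iterations: int = 30) -> list[float]:
    """Time `fn` over `iterations` runs (in ms), after discarded warmup runs."""
    for _ in range(warmup):
        fn()  # absorb JIT compilation, cache warming, connection setup
    timings_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        timings_ms.append((time.perf_counter() - start) * 1000)
    return timings_ms
```

`time.perf_counter()` is used because it is monotonic and high-resolution, unlike wall-clock `time.time()`.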
### Phase 4: Specify Environment
Record everything needed to reproduce the results:
- Hardware — CPU model, core count, RAM size, GPU model (if applicable)
- Software — OS version, language runtime version, dependency versions
- Configuration — Thread count, batch size, connection pool size, cache settings
- Isolation — What else was running? Background processes affect results.
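Much of this capture can be automated with the standard library. A sketch under that assumption; dependency versions and configuration still need to be recorded per project:

```python
import os
import platform
import sys

def capture_environment() -> dict[str, object]:
    """Record hardware/software context needed to reproduce a benchmark run."""
    return {
        "os": f"{platform.system()} {platform.release()}",
        "machine": platform.machine(),
        "cpu": platform.processor() or "unknown",  # may be empty on some OSes
        "logical_cores": os.cpu_count(),
        "python": sys.version.split()[0],
    }
```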
### Phase 5: Structure Results
Produce comparison tables with clear winners per metric, followed by tradeoff analysis.
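The per-metric winner column can be generated mechanically; `winner_row` is a hypothetical helper, and the better direction must be stated per metric (lower for latency, higher for throughput):

```python
def winner_row(test_case: str, a: float, b: float, *, lower_is_better: bool = True) -> str:
    """Format one comparison-table row, naming the winner for this metric."""
    if a == b:
        winner = "tie"
    elif (a < b) == lower_is_better:
        winner = "A"
    else:
        winner = "B"
    return f"| {test_case} | {a} | {b} | {winner} |"
```

For example, `winner_row("Small", 1.2, 3.4)` produces a row whose Winner cell is `A` for a lower-is-better metric.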
## Output Format

````markdown
# Benchmark: {Descriptive Title}

Date: {YYYY-MM-DD}
Hardware: {CPU}, {RAM}, {GPU if applicable}
Software: {runtime versions}
Configuration: {key settings that affect results}

## Candidates

| # | Candidate | Version | Configuration |
|---|---|---|---|
| A | {name} | {version} | {relevant config} |
| B | {name} | {version} | {relevant config} |

## Test Cases

| # | Name | Input Size | Description | Warmup | Iterations |
|---|---|---|---|---|---|
| 1 | Small | {size} | {what it represents} | {N} | {N} |
| 2 | Medium | {size} | {what it represents} | {N} | {N} |
| 3 | Large | {size} | {what it represents} | {N} | {N} |

## Results

### Latency (ms, lower is better)

| Test Case | A (P50 / P95 / P99) | B (P50 / P95 / P99) | Winner |
|---|---|---|---|
| Small | {values} | {values} | {A or B} |
| Medium | {values} | {values} | {A or B} |
| Large | {values} | {values} | {A or B} |

### Memory (MB, lower is better)

| Test Case | A (Peak) | B (Peak) | Winner |
|---|---|---|---|
| Small | {value} | {value} | {A or B} |
| Medium | {value} | {value} | {A or B} |
| Large | {value} | {value} | {A or B} |

## Analysis

### Overall Winner

{Candidate} wins on {N} of {M} metrics across all test cases.

### Tradeoff Summary

- Choose A when: {conditions where A is the better choice}
- Choose B when: {conditions where B is the better choice}

### Caveats

- {Limitation of this benchmark}
- {Condition under which results may differ}

## Reproduction

```bash
# Environment setup
{commands to recreate the environment}

# Run benchmark
{commands to execute the benchmark}
```
````
## Configuring Scope
| Mode | Candidates | Depth | When to Use |
|---|---|---|---|
| | 2 candidates, 1-2 metrics | Single test case, no statistics | Rough comparison, sanity check |
| | 2-3 candidates, 2-4 metrics | 3 test cases, mean + std dev | Default for most comparisons |
| | Any count, full metric suite | Multiple test cases, percentiles, significance tests | Publication, critical decisions |
## Calibration Rules
- Measure, don't guess. Intuition about performance is unreliable. "Obviously faster" is not a benchmark result.
- Apples to apples. Candidates must be compared under identical conditions. Different hardware, configuration, or input data invalidates the comparison.
- Report variance, not just means. A mean of 50ms with std dev of 100ms is not the same as a mean of 50ms with std dev of 2ms. Always report spread.
- Warm up before measuring. First-run performance includes JIT, cache warming, and connection setup. Exclude warmup iterations from results.
- Representative inputs only. Benchmarking with synthetic best-case input is misleading. Use data that resembles production workloads.
- State the winner per metric, not overall. "A is better" is lazy. "A has lower latency; B uses less memory" is useful.
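The "report variance, not just means" rule can be sketched as a formatter that flags noisy results via the coefficient of variation; the 10% threshold here is an assumed convention, not a standard:

```python
import statistics

def report_spread(label: str, samples_ms: list[float], cv_threshold: float = 0.10) -> str:
    """Report mean and std dev together, flagging high relative variance."""
    mean = statistics.fmean(samples_ms)
    sd = statistics.stdev(samples_ms)
    noisy = " [noisy: raise iteration count]" if sd / mean > cv_threshold else ""
    return f"{label}: {mean:.1f} ms ± {sd:.1f} ms{noisy}"
```

A 50ms ± 100ms result would be flagged; a 50ms ± 2ms result would not.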
## Error Handling
| Problem | Resolution |
|---|---|
| Cannot run candidates locally | Design the benchmark specification. Document what to measure and how. The user executes separately. |
| Results are noisy (high variance) | Increase iteration count. Check for background processes. Use dedicated hardware or containers for isolation. |
| Candidates serve different purposes | Acknowledge that the comparison is partial. Benchmark only the overlapping functionality. |
| No baseline exists | Establish one candidate as the baseline. Report relative performance (e.g., "B is 1.3x faster than A"). |
| Hardware context unavailable | Document what is known. Note that results may not be reproducible without full context. |
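The no-baseline fallback in the table above can be phrased mechanically; `relative_speed` is an illustrative helper, assuming a latency-style (lower-is-better) metric:

```python
def relative_speed(baseline_ms: float, candidate_ms: float) -> str:
    """Phrase candidate latency relative to a baseline, e.g. '1.3x faster'."""
    ratio = baseline_ms / candidate_ms
    if ratio >= 1:
        return f"{ratio:.1f}x faster than baseline"
    return f"{1 / ratio:.1f}x slower than baseline"
```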
## When NOT to Benchmark
Push back if:
- The comparison is not performance-related (feature comparison → use a decision matrix or ADR instead)
- The candidates are fundamentally different tools (comparing a database to a message queue)
- The user wants to benchmark trivial operations (comparing two string concatenation methods in Python)
- Results from others already exist and conditions match — link to existing benchmarks instead