benchmark-runner

Original：🇺🇸 English

Translated

Designs structured benchmarks for comparing algorithms, models, or implementations. Selects appropriate metrics (latency, throughput, memory, accuracy), designs representative test cases, captures hardware/software context, produces comparison tables with tradeoff analysis, and includes reproduction instructions. Triggers on: "benchmark", "compare performance", "which is faster", "latency comparison", "memory comparison", "run benchmark", "design benchmark", "compare implementations", "evaluate algorithms", "performance comparison", "throughput test", "speed test". Use this skill when comparing two or more implementations, algorithms, or models.

10installs

Sourcemathews-tom/armory

Added on2026-04-10

NPX Install

npx skill4agent add mathews-tom/armory benchmark-runner

SKILL.md Content

View Translation Comparison →

Benchmark Runner

Standardizes performance comparison methodology: metric selection, test case design, environment capture, result formatting, and tradeoff analysis. Produces reproducible benchmark reports that support informed decisions — not just "A is faster than B" but "A is faster for small inputs while B scales better."

Reference Files

File	Contents	Load When
`references/metric-selection.md`	Metric catalog (latency percentiles, throughput, memory, accuracy), selection criteria per task type	Always
`references/test-case-design.md`	Representative input selection, scale variation, edge case coverage, warmup strategies	Always
`references/environment-capture.md`	Hardware/software context recording, reproducibility requirements, variance control	Always
`references/statistical-rigor.md`	Sample sizing, variance measurement, significance testing, outlier handling	Results need statistical validation

Prerequisites

Clear candidates to compare (at least 2)
Access to run or observe the candidates (code, API, or existing results)
Representative workload definition

Workflow

Phase 1: Define Scope

What are the candidates? — Name each candidate precisely, including version. "Python dict vs Redis" is too vague. "Python 3.12 dict (in-process) vs Redis 7.2 (localhost, TCP)" is testable.
What claims need validation? — "A is faster" → faster at what? For what input size? Under what load? Benchmark design flows from the specific claim.
What is the decision context? — Why does this comparison matter? This determines which metrics are most important.

Phase 2: Select Metrics

Choose metrics that match the decision context:

Metric Category	Specific Metrics	When Important
Latency	P50, P95, P99, mean, std dev	User-facing operations, API calls
Throughput	ops/sec, tokens/sec, MB/sec	Batch processing, streaming
Memory	Peak RSS, avg RSS, allocation rate	Resource-constrained environments
Accuracy	F1, BLEU, exact match, precision/recall	ML models, algorithms with quality tradeoffs
Cost	$/1K operations, $/hour, $/GB	Cloud services, API comparisons
Startup	Time to first operation, cold start	Serverless, CLI tools

Select 2-4 metrics. More than 4 makes comparison tables unreadable.

Phase 3: Design Test Cases

Create a matrix of inputs that reveal performance characteristics:

Scale variation — Small, medium, large inputs. Performance often changes non-linearly with scale.
Representative data — Use realistic inputs, not synthetic best-case data.
Edge cases — Empty input, maximum size, adversarial input.
Warmup — Exclude JIT compilation, cache warming, and connection establishment from measurements. Run N warmup iterations before recording.

Phase 4: Specify Environment

Record everything needed to reproduce the results:

Hardware — CPU model, core count, RAM size, GPU model (if applicable)
Software — OS version, language runtime version, dependency versions
Configuration — Thread count, batch size, connection pool size, cache settings
Isolation — What else was running? Background processes affect results.

Phase 5: Structure Results

Produce comparison tables with clear winners per metric, followed by tradeoff analysis.

Output Format

text

# Benchmark: {Descriptive Title}

**Date:** {YYYY-MM-DD}
**Hardware:** {CPU}, {RAM}, {GPU if applicable}
**Software:** {runtime versions}
**Configuration:** {key settings that affect results}

## Candidates

| # | Candidate | Version | Configuration |
|---|-----------|---------|---------------|
| A | {name} | {version} | {relevant config} |
| B | {name} | {version} | {relevant config} |

## Test Cases

| # | Name | Input Size | Description | Warmup | Iterations |
|---|------|------------|-------------|--------|------------|
| 1 | Small | {size} | {what it represents} | {N} | {N} |
| 2 | Medium | {size} | {what it represents} | {N} | {N} |
| 3 | Large | {size} | {what it represents} | {N} | {N} |

## Results

### Latency (ms, lower is better)

| Test Case | A (P50 / P95 / P99) | B (P50 / P95 / P99) | Winner |
|-----------|---------------------|---------------------|--------|
| Small | {values} | {values} | {A or B} |
| Medium | {values} | {values} | {A or B} |
| Large | {values} | {values} | {A or B} |

### Memory (MB, lower is better)

| Test Case | A (Peak) | B (Peak) | Winner |
|-----------|----------|----------|--------|
| Small | {value} | {value} | {A or B} |
| Medium | {value} | {value} | {A or B} |
| Large | {value} | {value} | {A or B} |

## Analysis

### Overall Winner
**{Candidate}** wins on {N} of {M} metrics across all test cases.

### Tradeoff Summary
- **Choose A when:** {conditions where A is the better choice}
- **Choose B when:** {conditions where B is the better choice}

### Caveats
- {Limitation of this benchmark}
- {Condition under which results may differ}

## Reproduction

```bash
# Environment setup
{commands to recreate the environment}

# Run benchmark
{commands to execute the benchmark}