cost-benchmark

Cost Benchmark

Runs scripts/bench.mjs against the structural+adversarial corpus and writes per-case + summary results to docs/benchmarks/runs/. This is the verification gate that backs every measurable claim in cost-booster-edit / cost-booster-route.

When to use

  • Before publishing a release: verify booster win rate didn't regress.
  • After expanding bench/booster-corpus.json: confirm new cases route correctly.
  • When auditing a "claimed upstream" tag: flip it to "verified" once the bench supports it.
  • On a cost question ("is Sonnet 4.6 cheaper than Opus 4.7 for these tasks?"): re-run with BENCH_ANTHROPIC=1.

Steps

  1. Run the bench from v3/ (where agent-booster resolves):
    ( cd v3 && node ../plugins/ruflo-cost-tracker/scripts/bench.mjs )                  # booster only — free, ~85 ms
    ( cd v3 && BENCH_LLM_BASELINE=1 node ../plugins/ruflo-cost-tracker/scripts/bench.mjs ) # + Gemini 2.0 Flash (cheap)
    ( cd v3 && BENCH_LLM_BASELINE=1 BENCH_ANTHROPIC=1 \
         node ../plugins/ruflo-cost-tracker/scripts/bench.mjs )                          # + Sonnet 4.6 + Opus 4.7
  2. Inspect the markdown summary printed to stdout. The gate metric is winRate (Tier 1 cases). Adversarial cases are tracked separately as escalationRate.
  3. Persisted output lands at:
    • docs/benchmarks/runs/latest.json: pointer to the most recent run
    • docs/benchmarks/runs/<ISO-timestamp>.json: historical record
  4. Read it back in subsequent skills (e.g. cost-report step 2 reads latest.json for live tier-spend numbers).
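
Step 4 can be illustrated with a small reader for a persisted run. This is only a sketch under stated assumptions: the function name extractGateMetrics is hypothetical, and it assumes the run JSON exposes top-level numeric winRate and escalationRate fields, which this document does not confirm. Check the actual file written by bench.mjs before relying on it.

```javascript
// Hypothetical reader for a persisted bench run.
// ASSUMPTION: the JSON has top-level numeric `winRate` and `escalationRate`
// fields; the real schema written by bench.mjs is not documented here.
function extractGateMetrics(jsonText) {
  const run = JSON.parse(jsonText); // throws SyntaxError on malformed JSON
  const metrics = {};
  for (const key of ["winRate", "escalationRate"]) {
    const value = run?.[key];
    if (typeof value !== "number" || !Number.isFinite(value)) {
      throw new TypeError(`run is missing a numeric ${key}`);
    }
    metrics[key] = value;
  }
  return metrics;
}
```

A consuming skill could pass in the text of docs/benchmarks/runs/latest.json, for example the result of Node's fs.readFileSync(path, "utf8"). Failing loudly on a missing field keeps a schema change from silently producing empty numbers downstream.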

Smoke gates

  • winRate ≥ 0.80 on Tier 1 cases (smoke step 23). Lower the threshold by editing scripts/smoke.sh.
  • escalationRate is reported but ungated: adversarial cases are diagnostic.
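
The gate logic described above amounts to a single threshold comparison. The sketch below is illustrative only and is not the contents of scripts/smoke.sh: the names checkSmokeGate and DEFAULT_WIN_RATE_THRESHOLD are hypothetical, and only the 0.80 value comes from this document.

```javascript
// Default mirrors the documented smoke threshold (winRate >= 0.80 on Tier 1).
const DEFAULT_WIN_RATE_THRESHOLD = 0.8;

// Hypothetical gate check. escalationRate is deliberately not an input:
// it is reported but ungated, so it never affects pass/fail.
function checkSmokeGate(winRate, threshold = DEFAULT_WIN_RATE_THRESHOLD) {
  if (typeof winRate !== "number" || Number.isNaN(winRate)) {
    throw new TypeError("winRate must be a number");
  }
  const pass = winRate >= threshold;
  return {
    pass,
    message: `winRate ${winRate} ${pass ? ">=" : "<"} threshold ${threshold}`,
  };
}
```

Keeping the threshold as a parameter with a default matches the documented workflow, where the threshold lives in one place and can be lowered there.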

Env overrides

| Env var | Default | Purpose |
|---|---|---|
| BENCH_LLM_BASELINE | unset | =1 runs the OpenAI-compat baseline |
| BENCH_LLM_MODEL | models/gemini-2.0-flash | Override the OpenAI-compat model |
| BENCH_LLM_BASE_URL | Gemini OpenAI shim | Override endpoint |
| BENCH_ANTHROPIC | unset | =1 runs Anthropic baseline (Sonnet 4.6 + Opus 4.7) |
| BENCH_ANTHROPIC_MODELS | claude-sonnet-4-6,claude-opus-4-7 | Comma-separated Claude IDs |
| BENCH_OUT | timestamped file | Override output path |
| BENCH_QUIET=1 | unset | Suppress markdown summary |
API keys are auto-pulled from gcloud secrets (GOOGLE_AI_API_KEY, ANTHROPIC_API_KEY); override with BENCH_LLM_API_KEY / BENCH_ANTHROPIC_API_KEY.
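
The defaults and override precedence in this section can be sketched as a pure resolver over an env-like object. This is a minimal sketch and not how bench.mjs necessarily parses its environment. The function name resolveBenchConfig and the returned field names are hypothetical. It also assumes the auto-pulled keys are visible as GOOGLE_AI_API_KEY and ANTHROPIC_API_KEY, and it leaves the default base URL as null because the document names it only as "Gemini OpenAI shim".

```javascript
// Hypothetical resolver: turns an env-like object into bench settings.
// Defaults are copied from the Env overrides table in this document.
function resolveBenchConfig(env) {
  const flag = (name) => env[name] === "1"; // only the literal "1" enables a flag
  const models = env.BENCH_ANTHROPIC_MODELS || "claude-sonnet-4-6,claude-opus-4-7";
  return {
    llmBaseline: flag("BENCH_LLM_BASELINE"),
    llmModel: env.BENCH_LLM_MODEL || "models/gemini-2.0-flash",
    llmBaseUrl: env.BENCH_LLM_BASE_URL || null, // null = script default (URL not given here)
    anthropic: flag("BENCH_ANTHROPIC"),
    anthropicModels: models.split(",").map((id) => id.trim()).filter(Boolean),
    outPath: env.BENCH_OUT || null, // null = script picks a timestamped file
    quiet: flag("BENCH_QUIET"),
    // Explicit BENCH_* keys take precedence over the auto-pulled ones.
    llmApiKey: env.BENCH_LLM_API_KEY || env.GOOGLE_AI_API_KEY || null,
    anthropicApiKey: env.BENCH_ANTHROPIC_API_KEY || env.ANTHROPIC_API_KEY || null,
  };
}
```

Calling resolveBenchConfig(process.env) in Node would apply it to the real environment. Taking the env object as a parameter keeps the sketch testable without mutating process state.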

Cross-references

ADR-0002 §"Decision 1" / §"Riskiest assumption" · cost-booster-edit/SKILL.md (verification table consumes this skill's output) · cost-report/SKILL.md step 2 (reads runs/latest.json).