update-golden-values

Update golden values + relative-diff summary

End-to-end workflow for refreshing golden values from a GitHub Actions workflow run, scoring the update with a per-metric average normalized relative difference, and writing a PR-ready summary.
The skill orchestrates two scripts that already live in the repo:
  • `tests/test_utils/python_scripts/download_golden_values.py`: pulls artifacts from a workflow run and overwrites `tests/functional_tests/test_cases/**/golden_values_*.json`.
  • `tests/test_utils/python_scripts/compare_golden_values_kl.py`: diffs the working-tree goldens against `git HEAD` and reports per-metric `avg_rel_diff = mean((old − new) / old)`. (The filename keeps the legacy `_kl` suffix; the script no longer computes KL divergence.)

Inputs to gather from the user

  1. GitHub Actions workflow run ID (e.g. `25341543542`). It's the numeric ID in the run URL.
  2. Source: should be `github` for this workflow. (`gitlab` is supported by the download script but uses a different env path.)
  3. Scope: accept one of:
    • `only-failing` → run with `--only-failing` (download from failing/cancelled jobs only). Use this for "fix the broken tests" workflows.
    • `all` → run without `--only-failing` (download from every job that produced golden values). Use this when the user wants a full refresh.
    If the user doesn't specify, ask. Don't silently default.
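
The input checks above could be sketched in Python roughly as follows. This is an illustration only: `parse_run_id` and `download_args` are hypothetical helper names that do not exist in the repo, and the URL pattern assumes the usual `https://github.com/<owner>/<repo>/actions/runs/<id>` form of a run URL.

```python
import re
from typing import List, Optional

# Hypothetical helpers for illustration; they are not part of the repo scripts.
_RUN_URL = re.compile(r"/actions/runs/(\d+)")


def parse_run_id(text: str) -> str:
    """Accept either a bare numeric run ID or a run URL and return the ID."""
    text = text.strip()
    if re.fullmatch(r"\d+", text):
        return text
    match = _RUN_URL.search(text)
    if match is None:
        raise ValueError(f"no workflow run ID found in {text!r}")
    return match.group(1)


def download_args(run_id: str, scope: Optional[str]) -> List[str]:
    """Build the argument list for download_golden_values.py.

    A missing or unknown scope is an error rather than a silent default.
    """
    if scope not in ("only-failing", "all"):
        raise ValueError("scope must be 'only-failing' or 'all'; ask the user")
    args = ["--source", "github", "--pipeline-id", run_id]
    if scope == "only-failing":
        args.append("--only-failing")
    return args
```

Raising on a missing scope mirrors the "ask, don't default" rule: the caller has to go back to the user before any download starts.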

Workflow

- [ ] Step 1: Set up env (token + venv with deps)
- [ ] Step 2: Reset prior golden-value edits
- [ ] Step 3: Download goldens (scope = only-failing | all)
- [ ] Step 4: Run relative-diff comparison + capture CSV
- [ ] Step 5: Produce summary blurb

Step 1 — Environment

The download script needs `GITHUB_TOKEN`. If the user has the `gh` CLI authenticated, derive the token from it; do NOT export the token into a long-lived shell or commit it.

```bash
# token (one-shot, scoped to the command)
export GITHUB_TOKEN="$(gh auth token)"

# python deps (the script imports click, gitlab, requests)
python3 -m venv /tmp/gv_venv
/tmp/gv_venv/bin/pip install --quiet click python-gitlab requests
```

Reuse `/tmp/gv_venv` if it already exists. The comparison script only depends on `click` (also in the venv).
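
One way to keep the token out of a long-lived shell is to hand it only to the child process that needs it. The sketch below assumes the `gh` CLI is installed and authenticated; `token_scoped_env` and `run_with_token` are hypothetical names used for illustration, not repo code.

```python
import os
import subprocess


def token_scoped_env(token: str) -> dict:
    """Return a copy of the current environment with GITHUB_TOKEN added.

    os.environ itself is not modified, so the token does not outlive the
    child process that receives this mapping.
    """
    env = dict(os.environ)
    env["GITHUB_TOKEN"] = token
    return env


def run_with_token(cmd: list) -> int:
    """Run cmd with a token obtained from `gh auth token` (requires gh)."""
    token = subprocess.run(
        ["gh", "auth", "token"], check=True, capture_output=True, text=True
    ).stdout.strip()
    return subprocess.run(cmd, env=token_scoped_env(token)).returncode
```

Because the token lives only in the mapping passed to `subprocess.run`, nothing needs to be unset afterwards.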

Step 2 — Reset prior edits (only if user re-runs)

If the working tree already has golden-value modifications that you want to discard before re-downloading:

```bash
git checkout -- tests/functional_tests/test_cases/
git ls-files --others --exclude-standard tests/functional_tests/test_cases/ \
  | while IFS= read -r f; do rm -f "$f"; done
```

Skip this step when the user explicitly wants to layer a new download on top of an in-progress branch.
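
Before discarding anything, it can help to preview what the reset would touch. The following sketch parses `git status --porcelain` (version 1 format) for the test-case directory. `split_porcelain` and `preview_reset` are hypothetical names, and the parser is deliberately simple: it assumes no renames and no paths that git would quote.

```python
import subprocess

TEST_CASES = "tests/functional_tests/test_cases/"


def split_porcelain(output: str):
    """Split `git status --porcelain` lines into (tracked_changes, untracked).

    Each line is a two-character status, a space, then the path.
    Renames and quoted paths are not handled in this sketch.
    """
    tracked, untracked = [], []
    for line in output.splitlines():
        if len(line) < 4:
            continue
        status, path = line[:2], line[3:]
        (untracked if status == "??" else tracked).append(path)
    return tracked, untracked


def preview_reset():
    """List the files the reset commands would revert or delete."""
    out = subprocess.run(
        ["git", "status", "--porcelain", "--untracked-files=all", "--", TEST_CASES],
        check=True, capture_output=True, text=True,
    ).stdout
    return split_porcelain(out)
```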

Step 3 — Download

Build the command from the user-provided scope:

```bash
# scope = only-failing (default for "fix broken tests")
/tmp/gv_venv/bin/python tests/test_utils/python_scripts/download_golden_values.py \
  --source github --pipeline-id <WORKFLOW_RUN_ID> --only-failing

# scope = all (full refresh; omit the flag)
/tmp/gv_venv/bin/python tests/test_utils/python_scripts/download_golden_values.py \
  --source github --pipeline-id <WORKFLOW_RUN_ID>
```

When `--only-failing` is set, the GitHub path filters in `_fetch_and_filter_artifacts`, skipping jobs where `matched_job["conclusion"] == "success"`, so only failing/cancelled jobs contribute artifacts. Without the flag, every job's golden-value artifact is pulled.
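
The effect of the flag can be pictured with a small stand-alone filter. This is not the code in `_fetch_and_filter_artifacts`; `select_jobs` is a hypothetical function, and the job dictionaries only assume a `conclusion` field like the one the description above refers to.

```python
def select_jobs(jobs, only_failing):
    """Return the jobs whose artifacts would be downloaded.

    With only_failing set, jobs that concluded with "success" are skipped,
    so anything else (for example "failure" or "cancelled") is kept.
    """
    if not only_failing:
        return list(jobs)
    return [job for job in jobs if job.get("conclusion") != "success"]


jobs = [
    {"name": "test-a", "conclusion": "success"},
    {"name": "test-b", "conclusion": "failure"},
    {"name": "test-c", "conclusion": "cancelled"},
]
failing_names = [job["name"] for job in select_jobs(jobs, only_failing=True)]
```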

Capture the final two log lines for the summary; they look like:

```
INFO:main:Total tests with golden values: <N>
INFO:main:Total golden values found: <M>
```
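
If the download output is captured as text, the two totals can be pulled out with a regular expression. The sketch below assumes only the "Total …: <number>" wording of the two log lines quoted in this step; `parse_totals` is a hypothetical helper.

```python
import re

# Matches the two summary lines regardless of the logger prefix in front.
_TOTAL = re.compile(r"Total (tests with golden values|golden values found): (\d+)")


def parse_totals(log_text: str) -> dict:
    """Return {label: count} for the summary lines found in log_text."""
    return {label: int(count) for label, count in _TOTAL.findall(log_text)}
```

A missing key in the result means the corresponding line never appeared, which is worth checking before writing the summary.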

Step 4 — Relative-diff comparison

```bash
/tmp/gv_venv/bin/python tests/test_utils/python_scripts/compare_golden_values_kl.py \
  --top 20 --csv /tmp/reldiff_summary.csv
```

The CSV holds one row per `(file, metric)` with four columns: `file, metric, n_steps, avg_rel_diff`.
  • `n_steps`: count of shared steps that contributed (steps where `|old| < 1e-12` are skipped to avoid division by zero; NaN/inf are dropped).
  • `avg_rel_diff`: `mean((old − new) / old)`. Signed: positive = the new run is smaller than the old run on average across steps (e.g. loss decreased), negative = larger.
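
To make the column definitions concrete, here is a minimal re-implementation of the metric as described above. It is not the code in `compare_golden_values_kl.py`. It assumes each series is a plain mapping from step to value, which may differ from the real golden-value JSON layout.

```python
import math

EPS = 1e-12  # threshold below which |old| is treated as zero


def avg_rel_diff(old: dict, new: dict):
    """Return (n_steps, mean((old - new) / old)) over shared steps, or None.

    Steps are skipped when either value is non-numeric, NaN or inf, or when
    |old| < EPS. None means every shared step was excluded.
    """
    diffs = []
    for step in old.keys() & new.keys():
        try:
            o, n = float(old[step]), float(new[step])
        except (TypeError, ValueError):
            continue
        if not (math.isfinite(o) and math.isfinite(n)):
            continue  # also drops the literal string "nan"
        if abs(o) < EPS:
            continue
        diffs.append((o - n) / o)
    if not diffs:
        return None
    return len(diffs), sum(diffs) / len(diffs)
```

For example, with `old = {1: 2.0, 2: 4.0}` and `new = {1: 1.0, 2: 5.0}` the per-step terms are `0.5` and `-0.25`, so the result is `(2, 0.125)`: positive, meaning the new values are smaller on average.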
Then derive aggregates from the CSV (do this in Python; do not paste raw CSV into the summary):

```python
import collections
import csv

# Load the per-(file, metric) rows written by the comparison script.
with open('/tmp/reldiff_summary.csv', newline='') as fh:
    rows = list(csv.DictReader(fh))

for r in rows:
    r['n_steps']      = int(r['n_steps'])
    r['avg_rel_diff'] = float(r['avg_rel_diff'])
    r['abs']          = abs(r['avg_rel_diff'])

by_metric = collections.defaultdict(list)
for r in rows:
    by_metric[r['metric']].append(r['abs'])

# headline numbers per metric (using |avg_rel_diff|)
for m, vs in sorted(by_metric.items()):
    vs.sort()
    # vs[len(vs)//2] is the upper median when the count is even
    print(m, len(vs), 'median', vs[len(vs)//2], 'max', vs[-1])

# bucket counts across all rows, on |avg_rel_diff|
buckets = [
    ('==0',         lambda x: x == 0),
    ('(0,1e-6)',    lambda x: 0 < x < 1e-6),
    ('[1e-6,1e-4)', lambda x: 1e-6 <= x < 1e-4),
    ('[1e-4,1e-3)', lambda x: 1e-4 <= x < 1e-3),
    ('[1e-3,1e-2)', lambda x: 1e-3 <= x < 1e-2),
    ('[1e-2,1e-1)', lambda x: 1e-2 <= x < 1e-1),
    ('>=1e-1',      lambda x: x >= 1e-1),
]
abs_all = [r['abs'] for r in rows]
for label, pred in buckets:
    print(label, sum(1 for v in abs_all if pred(v)))
```

Step 5 — Summary blurb

Use this template verbatim, filling in `<…>` from steps 3–4. Drop sections that don't apply to the run.
Pick the wording for the first line based on the scope used:
  • `only-failing` → "Refresh of golden values for failing functional tests from GitHub workflow run …"
  • `all` → "Full refresh of golden values from GitHub workflow run …"
Match the `download_golden_values.py` command in the bullet list to the scope used (with or without `--only-failing`).
```markdown
## Summary

<scope-appropriate sentence> from GitHub workflow run `<WORKFLOW_RUN_ID>`.

**Golden value updates**

- Re-ran `tests/test_utils/python_scripts/download_golden_values.py --source github --pipeline-id <WORKFLOW_RUN_ID> <--only-failing if scope=only-failing>`.
- Updated **<N> golden-value files** under `tests/functional_tests/test_cases/`.

### Relative-difference summary

Comparison covers <FILES_WITH_BASELINE> files × <NUM_METRICS> metrics = <TOTAL_ROWS> `(file, metric)` pairs. Per row: `avg_rel_diff = mean((old − new) / old)` over shared steps.

**Per-metric headline numbers (over `|avg_rel_diff|`)**

| metric | n | median \|avg_rel_diff\| | max \|avg_rel_diff\| |
|---|---|---|---|
| `lm loss` | <…> | <…> | <…> |
| `num-zeros` | <…> | <…> | <…> |
| `iteration-time` | <…> | <…> | <…> |
| `mem-allocated-bytes` | <…> | <…> | <…> |
| `mem-max-allocated-bytes` | <…> | <…> | <…> |

**Distribution of `|avg_rel_diff|` across all <TOTAL_ROWS> rows**

| \|avg_rel_diff\| bucket | count |
|---|---|
| == 0 | <…> |
| (0, 1e-6) | <…> |
| [1e-6, 1e-4) | <…> |
| [1e-4, 1e-3) | <…> |
| [1e-3, 1e-2) | <…> |
| [1e-2, 1e-1) | <…> |
| >= 1e-1 | <…> |

**Interpretation (apply only the bullets that match the data)**

- `lm loss` max `|avg_rel_diff|` <X> / median <Y>: loss trajectories match old goldens to numerical noise (sub-1e-4 is within run-to-run variance).
- `mem-*` metrics typically sit at `== 0` or `(0, 1e-6)`; flag any row that lands above `[1e-4, 1e-3)`.
- `iteration-time` movement is dominated by warmup/scheduler noise; a signed average near zero means the run simply had more jitter, not that it was slower or faster on average.
- `num-zeros` shifts cluster on <list of test patterns>; within historical run-to-run variance.
```
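
The per-metric table in the template can be filled from the `by_metric` mapping built in Step 4. The sketch below is one possible way to render it: `metric_table` is a hypothetical helper, the three-significant-digit formatting is an arbitrary choice, and every metric is assumed to have at least one value.

```python
def metric_table(by_metric: dict) -> str:
    """Render metric -> list of |avg_rel_diff| as a markdown table.

    Uses the same upper-median convention as the Step 4 snippet.
    """
    lines = [
        "| metric | n | median \\|avg_rel_diff\\| | max \\|avg_rel_diff\\| |",
        "|---|---|---|---|",
    ]
    for metric, values in sorted(by_metric.items()):
        vs = sorted(values)
        median = vs[len(vs) // 2]
        lines.append(f"| {metric} | {len(vs)} | {median:.3g} | {vs[-1]:.3g} |")
    return "\n".join(lines)
```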

Reading the columns

| column | meaning |
|---|---|
| `n_steps` | number of shared step indices used in the average (NaN/inf and steps with `\|old\| < 1e-12` are dropped). |
| `avg_rel_diff` | `mean((old − new) / old)` over `n_steps`. Signed: positive = new < old, negative = new > old. |

When sorting / filtering, the script ranks by `|avg_rel_diff|`. Keep the sign in the printed table so reviewers can see direction.
Triage rules of thumb:
  • `lm loss` / `num-zeros` rows with `|avg_rel_diff|` ≲ 1e-4 are run-to-run noise.
  • `iteration-time` divergences are usually warmup/scheduler noise; a small signed mean near zero says the run had more jitter, not that it was systematically faster or slower.
  • Focus reviewer attention on `lm loss` and `num-zeros` rows with `|avg_rel_diff|` ≥ ~1e-3.
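
These rules of thumb could be encoded as a small classifier. The thresholds come from the bullets above but are approximate there ("≲", "~"), so the hard cut-offs below are a simplification. The labels, including the "borderline" band between the two thresholds and the "informational" bucket for other metrics, are assumptions made for this sketch.

```python
NOISE = 1e-4   # at or below this, treat as run-to-run noise
REVIEW = 1e-3  # at or above this, ask for reviewer attention
CORRECTNESS_METRICS = {"lm loss", "num-zeros"}


def triage(metric: str, avg_rel_diff: float) -> str:
    """Classify one CSV row by metric and |avg_rel_diff|."""
    if metric not in CORRECTNESS_METRICS:
        return "informational"  # e.g. iteration-time, mem-*
    size = abs(avg_rel_diff)
    if size >= REVIEW:
        return "review"
    if size <= NOISE:
        return "noise"
    return "borderline"
```

The sign is ignored for classification only; it should still be shown in any printed table.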

Notes & gotchas

  • The download script's `_fetch_and_filter_artifacts` honors `--only-failing` only on the GitHub path. The GitLab path applies it per-job inside `download_from_gitlab`.
  • A brand-new golden file (no `git HEAD` baseline) is skipped by the comparison script with a warning. Subtract these from the file count when reporting "files with baseline".
  • Steps where `|old|` is below `1e-12` are excluded from the average because the division blows up there (think `num-zeros` step 0 on a dense model, or `mem-*` before allocation). If every shared step is excluded for a metric, that `(file, metric)` row is omitted entirely.
  • Some artifacts have a literal string `"nan"` in step 1 of `iteration-time`; the comparison script filters those out, so other steps for that metric still contribute. Don't flag `iteration-time` as a correctness problem unless something else also moved.
  • The script's filename is `compare_golden_values_kl.py` for legacy reasons; it no longer computes KL divergence. The function and CSV column names reflect what it actually does (`avg_rel_diff`).
  • Never commit `GITHUB_TOKEN`, `RO_API_TOKEN`, or any value derived from `gh auth token`. If the user wants you to commit, only stage golden-value files and the optional CSV, not the env or the venv.
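
The "only stage golden-value files" rule can be enforced with an allowlist over the changed paths. This is a sketch under assumptions: `stageable` is a hypothetical helper, and the pattern is an `fnmatch`-style approximation of `tests/functional_tests/test_cases/**/golden_values_*.json` (in `fnmatch`, `*` also matches `/`).

```python
from fnmatch import fnmatchcase

GOLDEN_PATTERN = "tests/functional_tests/test_cases/*/golden_values_*.json"


def stageable(paths, csv_path=None):
    """Return only the paths that are safe to stage.

    Golden-value JSON files are kept; the optional CSV is kept only when it
    is named explicitly. Everything else (env files, venv, configs) is dropped.
    """
    keep = [p for p in paths if fnmatchcase(p, GOLDEN_PATTERN)]
    if csv_path is not None and csv_path in paths:
        keep.append(csv_path)
    return keep
```

The filtered list can then be passed to `git add --` so that nothing outside the allowlist is staged by accident.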