update-golden-values

Update golden values + relative-diff summary

End-to-end workflow for refreshing golden values from a GitHub Actions workflow run, scoring the update with a per-metric average normalized relative difference, and writing a PR-ready summary.

The skill orchestrates two scripts that already live in the repo:

- `tests/test_utils/python_scripts/download_golden_values.py`: pulls artifacts from a workflow run and overwrites `tests/functional_tests/test_cases/**/golden_values_*.json`.
- `tests/test_utils/python_scripts/compare_golden_values_kl.py`: diffs the working-tree goldens against `git HEAD` and reports per-metric `avg_rel_diff = mean((old − new) / old)`. (The filename keeps the legacy `_kl` suffix; the script no longer computes KL divergence.)

Inputs to gather from the user

- GitHub Actions workflow run ID (e.g. `25341543542`). It's the numeric ID in the run URL.
- Source: should be `github` for this workflow. (`gitlab` is supported by the download script but uses a different env path.)
- Scope: accept one of:
  - `only-failing` → run with `--only-failing` (download from failing/cancelled jobs only). Use this for "fix the broken tests" workflows.
  - `all` → run without `--only-failing` (download from every job that produced golden values). Use this when the user wants a full refresh.

  If the user doesn't specify, ask. Don't silently default.
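
The scope-to-flag mapping above can be sketched as a small helper. This is only an illustration: `build_download_command` is a hypothetical name, not something in the repo, and it assumes the `/tmp/gv_venv` interpreter path used later in this skill. It refuses to pick a scope on its own, mirroring the "ask, don't default" rule.

```python
from typing import List, Optional

DOWNLOAD_SCRIPT = "tests/test_utils/python_scripts/download_golden_values.py"


def build_download_command(run_id: str, scope: Optional[str]) -> List[str]:
    """Return the argv for the download script (hypothetical helper)."""
    if scope not in ("only-failing", "all"):
        # No silent default: the caller has to go back and ask the user.
        raise ValueError("scope must be 'only-failing' or 'all'; ask the user")
    if not run_id.isdigit():
        raise ValueError("workflow run ID should be the numeric ID from the run URL")
    cmd = [
        "/tmp/gv_venv/bin/python", DOWNLOAD_SCRIPT,
        "--source", "github",
        "--pipeline-id", run_id,
    ]
    if scope == "only-failing":
        cmd.append("--only-failing")
    return cmd
```

Returning a list rather than a shell string keeps the run ID from being re-parsed by a shell if the list is later passed to `subprocess.run`.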

Workflow

- [ ] Step 1: Set up env (token + venv with deps)
- [ ] Step 2: Reset prior golden-value edits
- [ ] Step 3: Download goldens (scope = only-failing | all)
- [ ] Step 4: Run relative-diff comparison + capture CSV
- [ ] Step 5: Produce summary blurb

Step 1 — Environment

The download script needs `GITHUB_TOKEN`. If the user has the `gh` CLI authenticated, derive it; do NOT export the token into a long-lived shell or commit it.

```bash
# token (one-shot, scoped to the command)
export GITHUB_TOKEN="$(gh auth token)"

# python deps (the script imports click, gitlab, requests)
python3 -m venv /tmp/gv_venv
/tmp/gv_venv/bin/pip install --quiet click python-gitlab requests
```


Reuse `/tmp/gv_venv` if it already exists. The comparison script only depends on `click` (also in the venv).
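
When the download is driven from Python rather than a shell, one way to keep the token scoped to a single command is to hand it to the child process only. The sketch below assumes the `gh` CLI is installed and authenticated; `env_with_token` and `run_with_token` are hypothetical helper names, not part of the repo.

```python
import os
import subprocess
from typing import Dict, List, Mapping


def env_with_token(token: str, base: Mapping[str, str]) -> Dict[str, str]:
    """Copy `base` and add GITHUB_TOKEN; `base` itself is not modified."""
    env = dict(base)
    env["GITHUB_TOKEN"] = token
    return env


def run_with_token(cmd: List[str]) -> int:
    """Run `cmd` with a token from `gh auth token` visible only to the child."""
    token = subprocess.run(
        ["gh", "auth", "token"], check=True, capture_output=True, text=True
    ).stdout.strip()
    # The parent's os.environ is never mutated, so nothing outlives the call.
    return subprocess.run(cmd, env=env_with_token(token, os.environ)).returncode
```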

Step 2 — Reset prior edits (only if user re-runs)

If the working tree already has prior golden-value modifications you want to discard before re-downloading:

```bash
git checkout -- tests/functional_tests/test_cases/
git ls-files --others --exclude-standard tests/functional_tests/test_cases/ \
  | while IFS= read -r f; do rm -f "$f"; done
```

Skip this step when the user explicitly wants to layer a new download on top of an in-progress branch.

Step 3 — Download

Build the command from the user-provided scope:

```bash
# scope = only-failing (default for "fix broken tests")
/tmp/gv_venv/bin/python tests/test_utils/python_scripts/download_golden_values.py \
  --source github --pipeline-id <WORKFLOW_RUN_ID> --only-failing

# scope = all (full refresh; omit the flag)
/tmp/gv_venv/bin/python tests/test_utils/python_scripts/download_golden_values.py \
  --source github --pipeline-id <WORKFLOW_RUN_ID>
```


When `--only-failing` is set, the GitHub path filters in `_fetch_and_filter_artifacts`, skipping jobs where `matched_job["conclusion"] == "success"`, so only failing/cancelled jobs contribute artifacts. Without the flag, every job's golden-value artifact is pulled.
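
As a rough illustration of that filter (not the script's actual code), the selection can be pictured as below. The job shape, a dict with a `conclusion` key, and the name `jobs_to_download` are assumptions made for the example. A plain "not success" test would also let `skipped` jobs through; whether the real script handles those separately is not covered here.

```python
from typing import Dict, Iterable, List


def jobs_to_download(
    jobs: Iterable[Dict[str, str]], only_failing: bool
) -> List[Dict[str, str]]:
    """Pick the jobs whose artifacts should be pulled (hypothetical sketch)."""
    if not only_failing:
        # Full refresh: every job is a candidate.
        return list(jobs)
    # Drop successful jobs so only failing/cancelled ones contribute.
    return [job for job in jobs if job.get("conclusion") != "success"]
```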

Capture the final two log lines for the summary; they look like:

```
INFO:__main__:Total tests with golden values: <N>
INFO:__main__:Total golden values found: <M>
```
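
If the download output is captured to a string, the two totals can be pulled out with a couple of regular expressions. This is a minimal sketch: `parse_download_totals` is a hypothetical helper, and it matches only the message text quoted above, so it does not depend on the exact logger prefix.

```python
import re
from typing import Dict

_PATTERNS = {
    "tests": re.compile(r"Total tests with golden values:\s*(\d+)"),
    "values": re.compile(r"Total golden values found:\s*(\d+)"),
}


def parse_download_totals(log_text: str) -> Dict[str, int]:
    """Extract the two summary counts from captured download output."""
    totals: Dict[str, int] = {}
    for key, pattern in _PATTERNS.items():
        matches = pattern.findall(log_text)
        if matches:
            # If a line appears more than once, keep the last occurrence.
            totals[key] = int(matches[-1])
    return totals
```

A missing key in the result means the corresponding line never appeared, which is worth surfacing instead of reporting a zero.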

Step 4 — Relative-diff comparison

```bash
/tmp/gv_venv/bin/python tests/test_utils/python_scripts/compare_golden_values_kl.py \
  --top 20 --csv /tmp/reldiff_summary.csv
```

The CSV holds one row per `(file, metric)` with four columns: `file, metric, n_steps, avg_rel_diff`.

- `n_steps`: count of shared steps that contributed (steps where `|old| < 1e-12` are skipped to avoid division by zero; NaN/inf are dropped).
- `avg_rel_diff`: `mean((old − new) / old)`. Signed: positive = the new run is smaller than the old run at the typical step (e.g. loss decreased), negative = larger.
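
To make the two columns concrete, here is a minimal sketch of how one `(file, metric)` row could be computed under the rules just described. It is not the comparison script's code: the input shape (step key mapped to value) and the function name are assumptions for illustration.

```python
import math
from typing import Dict, Optional, Tuple

EPS = 1e-12  # threshold below which |old| is treated as zero


def _as_float(value: object) -> Optional[float]:
    """Return a finite float, or None for NaN/inf/unparseable values."""
    try:
        x = float(value)  # also parses the literal string "nan"
    except (TypeError, ValueError):
        return None
    return x if math.isfinite(x) else None


def avg_rel_diff(
    old: Dict[str, object], new: Dict[str, object]
) -> Optional[Tuple[int, float]]:
    """Return (n_steps, mean((old - new) / old)) over usable shared steps."""
    terms = []
    for step in old.keys() & new.keys():  # shared steps only
        o, n = _as_float(old[step]), _as_float(new[step])
        if o is None or n is None or abs(o) < EPS:
            continue  # dropped: non-finite value or near-zero denominator
        terms.append((o - n) / o)
    if not terms:
        return None  # every shared step was excluded, so no row is emitted
    return len(terms), sum(terms) / len(terms)
```

For example, `avg_rel_diff({"1": 2.0, "2": 4.0}, {"1": 1.0, "2": 4.0})` averages `0.5` and `0.0`, giving `(2, 0.25)`: positive because the new value is smaller at step 1.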

Then derive aggregates from the CSV (do this in Python; do not paste raw CSV into the summary):

```python
import csv, collections

with open('/tmp/reldiff_summary.csv', newline='') as fh:
    rows = list(csv.DictReader(fh))
for r in rows:
    r['n_steps']      = int(r['n_steps'])
    r['avg_rel_diff'] = float(r['avg_rel_diff'])
    r['abs']          = abs(r['avg_rel_diff'])

by_metric = collections.defaultdict(list)
for r in rows:
    by_metric[r['metric']].append(r['abs'])

# headline numbers per metric (using |avg_rel_diff|)
for m, vs in sorted(by_metric.items()):
    vs.sort()
    print(m, len(vs), 'median', vs[len(vs)//2], 'max', vs[-1])

# bucket counts across all rows, on |avg_rel_diff|
buckets = [
    ('==0',         lambda x: x == 0),
    ('(0,1e-6)',    lambda x: 0 < x < 1e-6),
    ('[1e-6,1e-4)', lambda x: 1e-6 <= x < 1e-4),
    ('[1e-4,1e-3)', lambda x: 1e-4 <= x < 1e-3),
    ('[1e-3,1e-2)', lambda x: 1e-3 <= x < 1e-2),
    ('[1e-2,1e-1)', lambda x: 1e-2 <= x < 1e-1),
    ('>=1e-1',      lambda x: x >= 1e-1),
]
abs_all = [r['abs'] for r in rows]
for label, pred in buckets:
    print(label, sum(1 for v in abs_all if pred(v)))
```

Step 5 — Summary blurb

Use this template verbatim, filling in `<…>` from steps 3–4. Drop sections that don't apply to the run.

Pick the wording for the first line based on the scope used:

- `only-failing` → "Refresh of golden values for failing functional tests from GitHub workflow run …"
- `all` → "Full refresh of golden values from GitHub workflow run …"

Match the `download_golden_values.py` command in the bullet list to the scope used (with or without `--only-failing`).

```markdown
## Summary

<scope-appropriate sentence> from GitHub workflow run `<WORKFLOW_RUN_ID>`.

### Golden value updates

- Re-ran `tests/test_utils/python_scripts/download_golden_values.py --source github --pipeline-id <WORKFLOW_RUN_ID> <--only-failing if scope=only-failing>`.
- Updated <N> golden-value files under `tests/functional_tests/test_cases/`.

### Relative-difference summary

Comparison covers <FILES_WITH_BASELINE> files × <NUM_METRICS> metrics = <TOTAL_ROWS> `(file, metric)` pairs. Per row: `avg_rel_diff = mean((old − new) / old)` over shared steps.

#### Per-metric headline numbers (over `|avg_rel_diff|`)

| metric | n | median \|avg_rel_diff\| | max \|avg_rel_diff\| |
|---|---|---|---|
| `lm loss` | <…> | <…> | <…> |
| `num-zeros` | <…> | <…> | <…> |
| `iteration-time` | <…> | <…> | <…> |
| `mem-allocated-bytes` | <…> | <…> | <…> |
| `mem-max-allocated-bytes` | <…> | <…> | <…> |

#### Distribution of `|avg_rel_diff|` across all <TOTAL_ROWS> rows

| \|avg_rel_diff\| bucket | count |
|---|---|
| `== 0` | <…> |
| `(0, 1e-6)` | <…> |
| `[1e-6, 1e-4)` | <…> |
| `[1e-4, 1e-3)` | <…> |
| `[1e-3, 1e-2)` | <…> |
| `[1e-2, 1e-1)` | <…> |
| `>= 1e-1` | <…> |

#### Interpretation (apply only the bullets that match the data)

- `lm loss` max `|avg_rel_diff|` <X> / median <Y>: loss trajectories match old goldens to numerical noise (sub-1e-4 is within run-to-run variance).
- `mem-*` metrics typically sit at `== 0` or `(0, 1e-6)`; flag any row that lands above `[1e-4, 1e-3)`.
- `iteration-time` movement is dominated by warmup/scheduler noise; signed avg near zero means the run was simply jitterier, not slower or faster on average.
- `num-zeros` shifts cluster on `<list of test patterns>`; within historical run-to-run variance.
```
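
The per-metric table in the template can be filled mechanically from the CSV rows. The following is a sketch under assumptions: `per_metric_table` is a hypothetical helper, the `.3g` number formatting is an arbitrary choice, and the median is the upper median, the same `vs[len(vs)//2]` convention as the Step 4 snippet.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def per_metric_table(rows: Iterable[Dict[str, object]]) -> str:
    """Render the per-metric headline table as markdown text."""
    by_metric: Dict[str, List[float]] = defaultdict(list)
    for row in rows:
        by_metric[str(row["metric"])].append(abs(float(row["avg_rel_diff"])))

    lines = [
        "| metric | n | median \\|avg_rel_diff\\| | max \\|avg_rel_diff\\| |",
        "|---|---|---|---|",
    ]
    for metric, values in sorted(by_metric.items()):
        values.sort()
        median = values[len(values) // 2]  # upper median for even counts
        lines.append(
            f"| `{metric}` | {len(values)} | {median:.3g} | {values[-1]:.3g} |"
        )
    return "\n".join(lines)
```

Rows read with `csv.DictReader` can be passed in directly, since the helper converts `avg_rel_diff` from string itself.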

Reading the columns

| column | meaning |
|---|---|
| `n_steps` | number of shared step indices used in the average (NaN/inf and steps with `\|old\| < 1e-12` are dropped). |
| `avg_rel_diff` | `mean((old − new) / old)` over `n_steps`. Signed: positive = new < old, negative = new > old. |

When sorting / filtering, the script ranks by `|avg_rel_diff|`. Keep the sign in the printed table so reviewers can see direction.

Triage rules of thumb:

- `lm loss` / `num-zeros` rows with `|avg_rel_diff|` ≲ 1e-4 are run-to-run noise.
- `iteration-time` divergences are usually warmup/scheduler noise; a small signed mean near zero says the run was jitterier, not systematically faster or slower.
- Focus reviewer attention on `lm loss` and `num-zeros` rows with `|avg_rel_diff|` ≥ ~1e-3.
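
These rules of thumb can be turned into a rough first-pass label per row. The sketch below is illustrative only: the label strings and the `triage` name are made up here, the thresholds are the approximate ones quoted above treated as hard cut-offs, and metrics the rules do not mention (such as `mem-*`) fall through to `"other"`.

```python
def triage(metric: str, avg_rel_diff: float) -> str:
    """Assign a coarse review label to one (file, metric) row."""
    magnitude = abs(avg_rel_diff)
    if metric == "iteration-time":
        # Usually warmup/scheduler noise; not a correctness signal by itself.
        return "timing-noise"
    if metric in ("lm loss", "num-zeros"):
        if magnitude >= 1e-3:
            return "review"
        if magnitude <= 1e-4:
            return "noise"
        return "borderline"  # between the two thresholds: use judgment
    return "other"
```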

Notes & gotchas

- The download script's `_fetch_and_filter_artifacts` honors `--only-failing` only on the GitHub path. The GitLab path applies it per-job inside `download_from_gitlab`.
- A brand-new golden file (no `git HEAD` baseline) is skipped by the comparison script with a warning. Subtract these from the file count when reporting "files with baseline".
- Steps where `|old|` is below `1e-12` are excluded from the average because division blows up there (think `num-zeros` step 0 on a dense model, or `mem-*` before allocation). If every shared step is excluded for a metric, that `(file, metric)` row is omitted entirely.
- Some artifacts have a literal string `"nan"` in step 1 of `iteration-time`; the comparison script filters those out, so other steps for that metric still contribute. Don't flag `iteration-time` as a correctness problem unless something else also moved.
- The script's filename is `compare_golden_values_kl.py` for legacy reasons; it no longer computes KL divergence. The function and CSV column names reflect what it actually does (`avg_rel_diff`).
- Never commit `GITHUB_TOKEN`, `RO_API_TOKEN`, or any value derived from `gh auth token`. If the user wants you to commit, only stage golden-value files and the optional CSV, not the env or the venv.
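
One way to enforce the "only stage golden-value files and the optional CSV" rule is to filter the candidate paths before handing them to `git add`. This is a sketch with a hypothetical helper name, `stageable`; it assumes repo-relative POSIX paths and the `golden_values_*.json` naming described at the top of this document.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath
from typing import Iterable, List

GOLDEN_ROOT = ("tests", "functional_tests", "test_cases")


def stageable(paths: Iterable[str], csv_path: str = "") -> List[str]:
    """Keep golden-value JSON files and, optionally, one named CSV."""
    keep = []
    for raw in paths:
        p = PurePosixPath(raw)
        is_golden = (
            p.parts[:3] == GOLDEN_ROOT
            and fnmatch(p.name, "golden_values_*.json")
        )
        if is_golden or (csv_path and raw == csv_path):
            keep.append(raw)
    return keep
```

Anything else in the working tree, including env files and the venv, is simply never passed along.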
