update-golden-values
Update golden values + relative-diff summary
End-to-end workflow for refreshing golden values from a GitHub Actions workflow run, scoring the update with a per-metric average normalized relative difference, and writing a PR-ready summary.
The skill orchestrates two scripts that already live in the repo:
- `tests/test_utils/python_scripts/download_golden_values.py`: pulls artifacts from a workflow run and overwrites `tests/functional_tests/test_cases/**/golden_values_*.json`.
- `tests/test_utils/python_scripts/compare_golden_values_kl.py`: diffs the working-tree goldens against `git HEAD` and reports per-metric `avg_rel_diff = mean((old − new) / old)`. (Filename keeps the legacy `_kl` suffix; the script no longer computes KL divergence.)
Inputs to gather from the user
- GitHub Actions workflow run ID (e.g. `25341543542`). It's the numeric ID in the run URL.
- Source: should be `github` for this workflow. (`gitlab` is supported by the download script but uses a different env path.)
- Scope: accept one of:
  - `only-failing` → run with `--only-failing` (download from failing/cancelled jobs only). Use this for "fix the broken tests" workflows.
  - `all` → run without `--only-failing` (download from every job that produced golden values). Use this when the user wants a full refresh.

  If the user doesn't specify, ask. Don't silently default.
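
The scope rule can be sketched as a small helper that maps the user's answer to the extra flag and refuses to guess. This is a minimal sketch: `scope_flags` is a hypothetical name, not part of either script; only the `--only-failing` flag comes from the download script.

```python
def scope_flags(scope):
    """Map the user-provided scope to extra CLI flags (hypothetical helper)."""
    if scope == "only-failing":
        return ["--only-failing"]
    if scope == "all":
        return []
    # No silent default: an unset or unknown scope means "ask the user".
    raise ValueError(f"scope must be 'only-failing' or 'all', got {scope!r}")
```

Raising on a missing scope keeps the "ask, don't default" rule enforceable in code rather than by convention.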
Workflow
- [ ] Step 1: Set up env (token + venv with deps)
- [ ] Step 2: Reset prior golden-value edits
- [ ] Step 3: Download goldens (scope = only-failing | all)
- [ ] Step 4: Run relative-diff comparison + capture CSV
- [ ] Step 5: Produce summary blurb

Step 1: Environment
The download script needs `GITHUB_TOKEN`. If the user has the `gh` CLI authenticated, derive it; do NOT export the token into a long-lived shell or commit it.
```bash
# token (one-shot, scoped to the command)
export GITHUB_TOKEN="$(gh auth token)"

# python deps (the script imports click, gitlab, requests)
python3 -m venv /tmp/gv_venv
/tmp/gv_venv/bin/pip install --quiet click python-gitlab requests
```
Reuse `/tmp/gv_venv` if it already exists. The comparison script only depends on `click` (also in the venv).
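
One way to keep the token scoped to a single command is to pass it through the child process environment instead of exporting it. A minimal sketch, assuming the `gh` CLI is installed and authenticated; `env_with_token` and `run_with_token` are hypothetical helpers, not part of the repo.

```python
import os
import subprocess


def env_with_token(token, base=None):
    """Return a copy of the environment with GITHUB_TOKEN set (hypothetical helper)."""
    env = dict(os.environ if base is None else base)
    env["GITHUB_TOKEN"] = token
    return env


def run_with_token(cmd):
    """Run cmd with a token from `gh auth token`, visible only to that child process."""
    token = subprocess.run(
        ["gh", "auth", "token"], check=True, capture_output=True, text=True
    ).stdout.strip()
    return subprocess.run(cmd, env=env_with_token(token), check=True)
```

Because the token only lives in the copied environment, nothing is left behind in the parent shell once the command exits.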

Step 2: Reset prior edits (only if user re-runs)
If the working tree already has prior golden-value modifications you want to discard before re-downloading:
```bash
# Restore tracked golden files, then delete untracked ones under the test-case tree.
git checkout -- tests/functional_tests/test_cases/
git ls-files --others --exclude-standard tests/functional_tests/test_cases/ \
  | while IFS= read -r f; do rm -f "$f"; done
```
Skip this step when the user explicitly wants to layer a new download on top of an in-progress branch.

Step 3: Download
Build the command from the user-provided scope:
```bash
# scope = only-failing (default for "fix broken tests")
/tmp/gv_venv/bin/python tests/test_utils/python_scripts/download_golden_values.py \
    --source github --pipeline-id <WORKFLOW_RUN_ID> --only-failing

# scope = all (full refresh; omit the flag)
/tmp/gv_venv/bin/python tests/test_utils/python_scripts/download_golden_values.py \
    --source github --pipeline-id <WORKFLOW_RUN_ID>
```
When `--only-failing` is set, the GitHub path filters in `_fetch_and_filter_artifacts`, dropping jobs where `matched_job["conclusion"] == "success"`, so only failing/cancelled jobs contribute artifacts. Without the flag, every job's golden-value artifact is pulled.
Capture the final two log lines for the summary; they look like:
```
INFO:__main__:Total tests with golden values: <N>
INFO:__main__:Total golden values found: <M>
```
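
Pulling the two counts out of the captured output can be sketched as follows. The regexes match only the message text of the two summary lines and ignore the logger prefix; `parse_download_totals` is a hypothetical helper, not something the download script provides.

```python
import re


def parse_download_totals(log_text):
    """Extract (tests_with_goldens, golden_values_found) from the download log."""
    tests = re.findall(r"Total tests with golden values: (\d+)", log_text)
    found = re.findall(r"Total golden values found: (\d+)", log_text)
    if not tests or not found:
        raise ValueError("summary lines not found in log output")
    # Use the last occurrence in case the text contains more than one run.
    return int(tests[-1]), int(found[-1])
```

Failing loudly when the lines are missing avoids writing a summary with made-up counts after a download that did not finish.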

Step 4: Relative-diff comparison
```bash
/tmp/gv_venv/bin/python tests/test_utils/python_scripts/compare_golden_values_kl.py \
    --top 20 --csv /tmp/reldiff_summary.csv
```
The CSV holds one row per `(file, metric)` with four columns: `file, metric, n_steps, avg_rel_diff`.
- `n_steps`: count of shared steps that contributed (steps where `|old| < 1e-12` are skipped to avoid div-by-zero; NaN/inf are dropped).
- `avg_rel_diff`: `mean((old − new) / old)`. Signed: positive = the new run is smaller than the old run at the typical step (e.g. loss decreased), negative = larger.
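
The per-row score can be illustrated with a short sketch. This illustrates the formula and skip rules described here, not the code in `compare_golden_values_kl.py`; the function name and the step-to-value dict input shape are assumptions.

```python
import math


def avg_rel_diff(old, new, eps=1e-12):
    """Return (n_steps, mean((old - new) / old)) over shared steps, or None.

    old and new map step index -> value (assumed input shape).
    """
    diffs = []
    for step in sorted(old.keys() & new.keys()):
        try:
            o, n = float(old[step]), float(new[step])
        except (TypeError, ValueError):
            continue  # non-numeric entry
        if not (math.isfinite(o) and math.isfinite(n)):
            continue  # drop NaN/inf, including the string "nan"
        if abs(o) < eps:
            continue  # skip near-zero baselines to avoid div-by-zero
        diffs.append((o - n) / o)
    if not diffs:
        return None  # no usable shared steps
    return len(diffs), sum(diffs) / len(diffs)
```

With `old = {1: 2.0, 2: 4.0}` and `new = {1: 1.0, 2: 4.0}` the result is `(2, 0.25)`: positive, because the new values are smaller on average.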
Then derive aggregates from the CSV (do this in Python; do not paste raw CSV into the summary):
```python
import collections
import csv
import statistics

# Load the comparison CSV and convert the numeric columns.
with open('/tmp/reldiff_summary.csv', newline='') as f:
    rows = list(csv.DictReader(f))
for r in rows:
    r['n_steps'] = int(r['n_steps'])
    r['avg_rel_diff'] = float(r['avg_rel_diff'])
    r['abs'] = abs(r['avg_rel_diff'])

by_metric = collections.defaultdict(list)
for r in rows:
    by_metric[r['metric']].append(r['abs'])

# headline numbers per metric (using |avg_rel_diff|)
for m, vs in sorted(by_metric.items()):
    print(m, len(vs), 'median', statistics.median(vs), 'max', max(vs))

# bucket counts across all rows, on |avg_rel_diff|
buckets = [('==0',         lambda x: x == 0),
           ('(0,1e-6)',    lambda x: 0 < x < 1e-6),
           ('[1e-6,1e-4)', lambda x: 1e-6 <= x < 1e-4),
           ('[1e-4,1e-3)', lambda x: 1e-4 <= x < 1e-3),
           ('[1e-3,1e-2)', lambda x: 1e-3 <= x < 1e-2),
           ('[1e-2,1e-1)', lambda x: 1e-2 <= x < 1e-1),
           ('>=1e-1',      lambda x: x >= 1e-1)]
abs_all = [r['abs'] for r in rows]
for label, pred in buckets:
    print(label, sum(1 for v in abs_all if pred(v)))
```

Step 5: Summary blurb
Use this template verbatim, filling in `<…>` from steps 3–4. Drop sections that don't apply to the run.
Pick the wording for the first line based on the scope used:
- `only-failing` → "Refresh of golden values for failing functional tests from GitHub workflow run …"
- `all` → "Full refresh of golden values from GitHub workflow run …"

Match the `download_golden_values.py` command in the bullet list to the scope used (with or without `--only-failing`).
```markdown
Summary

<scope-appropriate sentence> from GitHub workflow run `<WORKFLOW_RUN_ID>`.

Golden value updates

- Re-ran `tests/test_utils/python_scripts/download_golden_values.py --source github --pipeline-id <WORKFLOW_RUN_ID> <--only-failing if scope=only-failing>`.
- Updated **<N> golden-value files** under `tests/functional_tests/test_cases/`.

Relative-difference summary

Comparison covers <FILES_WITH_BASELINE> files × <NUM_METRICS> metrics = <TOTAL_ROWS> `(file, metric)` pairs. Per row: `avg_rel_diff = mean((old − new) / old)` over shared steps.

Per-metric headline numbers (over `|avg_rel_diff|`)

| metric | n | median |avg_rel_diff| | max |avg_rel_diff| |
|---|---|---|---|
| `<metric>` | <…> | <…> | <…> |
| `<metric>` | <…> | <…> | <…> |
| `<metric>` | <…> | <…> | <…> |
| `<metric>` | <…> | <…> | <…> |
| `<metric>` | <…> | <…> | <…> |

Distribution of `|avg_rel_diff|` across all <TOTAL_ROWS> rows

| |avg_rel_diff| bucket | count |
|---|---|
| `== 0` | <…> |
| `(0, 1e-6)` | <…> |
| `[1e-6, 1e-4)` | <…> |
| `[1e-4, 1e-3)` | <…> |
| `[1e-3, 1e-2)` | <…> |
| `[1e-2, 1e-1)` | <…> |
| `>= 1e-1` | <…> |

Interpretation (apply only the bullets that match the data)

- `lm loss` max `|avg_rel_diff|` <X> / median <Y>: loss trajectories match old goldens to numerical noise (sub-1e-4 is within run-to-run variance).
- `mem-*` metrics typically sit at `== 0` or `(0, 1e-6)`; flag any row that lands above `[1e-4, 1e-3)`.
- `iteration-time` movement is dominated by warmup/scheduler noise; signed avg near zero means the run was simply jitterier, not slower or faster on average.
- `num-zeros` shifts cluster on `<list of test patterns>`; within historical run-to-run variance.
```
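
Filling the per-metric table from the aggregates can be sketched like this. `metric_table` is a hypothetical helper; it assumes a mapping from metric name to the list of `|avg_rel_diff|` values for that metric, and the `.3g` number format is an arbitrary choice.

```python
import statistics


def metric_table(by_metric):
    """Render the per-metric headline table as markdown lines."""
    lines = [
        "| metric | n | median |avg_rel_diff| | max |avg_rel_diff| |",
        "|---|---|---|---|",
    ]
    for metric, values in sorted(by_metric.items()):
        lines.append(
            f"| `{metric}` | {len(values)} "
            f"| {statistics.median(values):.3g} | {max(values):.3g} |"
        )
    return "\n".join(lines)
```

Generating the rows from the parsed data, rather than typing them, keeps the blurb consistent with the CSV it summarizes.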

Reading the columns
| column | meaning |
|---|---|
| `n_steps` | shared step indices used in the average (NaN/inf and steps with `\|old\| < 1e-12` are excluded) |
| `avg_rel_diff` | `mean((old − new) / old)` over those steps; signed |

When sorting / filtering, the script ranks by `|avg_rel_diff|`. Keep the sign in the printed table so reviewers can see direction.
Triage rules of thumb:
- `lm loss` / `num-zeros` rows with `|avg_rel_diff|` ≲ 1e-4 are run-to-run noise.
- `iteration-time` divergences are usually warmup/scheduler noise; a small signed mean near zero says the run was jitterier, not systematically faster or slower.
- Focus reviewer attention on `lm loss` and `num-zeros` rows with `|avg_rel_diff|` ≥ ~1e-3.
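
These rules of thumb can be encoded as a first-pass filter. A minimal sketch: `triage` and its label strings are hypothetical, and the thresholds are the approximate ones from the list, so treat the output as a hint for reviewers rather than a verdict.

```python
def triage(metric, avg_rel_diff):
    """Classify one (file, metric) row using the rules of thumb (thresholds approximate)."""
    size = abs(avg_rel_diff)
    if metric in ("lm loss", "num-zeros"):
        if size >= 1e-3:
            return "review"
        if size <= 1e-4:
            return "noise"
        return "borderline"  # between the two thresholds; the rules do not say
    if metric == "iteration-time":
        return "likely-noise"
    return "unclassified"
```

The sign is dropped only for classification; the printed table should still carry the signed value.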
Notes & gotchas
- The download script's `_fetch_and_filter_artifacts` honors `--only-failing` only on the GitHub path. The Gitlab path applies it per-job inside `download_from_gitlab`.
- A brand-new golden file (no `git HEAD` baseline) is skipped by the comparison script with a warning. Subtract these from the file count when reporting "files with baseline".
- Steps where `|old|` is below `1e-12` are excluded from the average because division blows up there (think `num-zeros` step 0 on a dense model, or `mem-*` before allocation). If every shared step is excluded for a metric, that `(file, metric)` row is omitted entirely.
- Some artifacts have a literal string `"nan"` in step 1 of `iteration-time`; the comparison script filters those out, so other steps for that metric still contribute. Don't flag `iteration-time` as a correctness problem unless something else also moved.
- The script's filename is `compare_golden_values_kl.py` for legacy reasons; it no longer computes KL divergence. The function and CSV column names reflect what it actually does (`avg_rel_diff`).
- Never commit `GITHUB_TOKEN`, `RO_API_TOKEN`, or any value derived from `gh auth token`. If the user wants you to commit, only stage golden-value files and the optional CSV, not the env or the venv.
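
The staging rule in the last bullet can be sketched as a path filter. `stageable` and `GOLDEN_ROOT` are hypothetical names; the sketch assumes repo-relative POSIX paths and covers only the golden-value JSON files (the optional CSV would be added by hand).

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

GOLDEN_ROOT = "tests/functional_tests/test_cases/"


def stageable(paths):
    """Keep only golden-value JSON files under the test-case tree."""
    keep = []
    for p in paths:
        name = PurePosixPath(p).name
        if p.startswith(GOLDEN_ROOT) and fnmatch(name, "golden_values_*.json"):
            keep.append(p)
    return keep
```

An allow-list like this is safer than excluding known-bad paths, since anything unexpected (env files, the venv) is left out by default.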