Refresh golden values from a GitHub Actions workflow run (failing-only or all jobs), score the change with average normalized relative differences, and produce a PR-ready summary. Use when the user asks to update goldens for a CI run, refresh golden values from a workflow ID, or generate a golden-value diff summary for a PR description.
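The score named above is, per the formula this skill uses, `avg_rel_diff = mean((old − new) / old)` over the steps that both versions of a golden file share. A minimal sketch of that calculation follows. It is not the code of `compare_golden_values_kl.py`: the function name, the step-to-value dict layout, and the choice to return NaN when no step qualifies are assumptions made for illustration. The `1e-12` threshold and the NaN/inf filtering follow the column description in this document.

```python
import math

EPS = 1e-12  # steps with |old| below this are skipped (threshold quoted in this document)


def avg_rel_diff(old, new):
    """Return (n_steps, mean((old - new) / old)) over steps present in both inputs.

    `old` and `new` map step -> value; this layout is an illustrative assumption.
    """
    diffs = []
    for step in sorted(set(old) & set(new)):
        o, n = old[step], new[step]
        if not (math.isfinite(o) and math.isfinite(n)):
            continue  # drop NaN/inf values
        if abs(o) < EPS:
            continue  # avoid dividing by (near) zero
        diffs.append((o - n) / o)
    if not diffs:
        return 0, float("nan")  # assumption: no usable step yields NaN
    return len(diffs), sum(diffs) / len(diffs)


# Made-up values: step 3 is skipped (old == 0), step 4 is skipped (NaN),
# and step 5 exists only in `new`.
old = {1: 10.0, 2: 8.0, 3: 0.0, 4: float("nan")}
new = {1: 9.0, 2: 8.0, 3: 1.0, 4: 5.0, 5: 4.0}
n_steps, score = avg_rel_diff(old, new)
print(n_steps, score)
```

Because the differences are signed, positive and negative steps cancel, so a score near zero does not by itself show that every step matched.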
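npx skill4agent add nvidia/skills update-golden-values

Scripts, files, and terms this skill refers to:
- Download script: `tests/test_utils/python_scripts/download_golden_values.py`
- Golden-value files: `tests/functional_tests/test_cases/**/golden_values_*.json`
- Comparison script: `tests/test_utils/python_scripts/compare_golden_values_kl.py` (note the `_kl` suffix in the file name), which scores the refreshed files against `git HEAD` using `avg_rel_diff = mean((old − new) / old)`
- Workflow run ID: a numeric ID such as `25341543542`
- Source: `github` or `gitlab`
- Scope: `only-failing` (pass `--only-failing`) or `all` (omit `--only-failing`)

- [ ] Step 1: Set up env (token + venv with deps)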
- [ ] Step 2: Reset prior golden-value edits
- [ ] Step 3: Download goldens (scope = only-failing | all)
- [ ] Step 4: Run relative-diff comparison + capture CSV
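- [ ] Step 5: Produce summary blurb

To set up the environment (Step 1), export `GITHUB_TOKEN` from the `gh` CLI and create a venv with the script dependencies:

# token (one-shot, scoped to the command)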
export GITHUB_TOKEN="$(gh auth token)"
# python deps (the script imports click, gitlab, requests)
python3 -m venv /tmp/gv_venv
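/tmp/gv_venv/bin/pip install --quiet click python-gitlab requests

The later steps run the scripts with the interpreter in `/tmp/gv_venv`, which has `click` and the other dependencies installed.

To reset prior golden-value edits (Step 2), discard tracked changes and delete untracked files under the test-case directory:

git checkout -- tests/functional_tests/test_cases/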
git ls-files --others --exclude-standard tests/functional_tests/test_cases/ \
| while IFS= read -r f; do rm -f "$f"; done

To download goldens (Step 3), run the download script with the chosen scope:

# scope = only-failing (default for "fix broken tests")
/tmp/gv_venv/bin/python tests/test_utils/python_scripts/download_golden_values.py \
--source github --pipeline-id <WORKFLOW_RUN_ID> --only-failing
# scope = all (full refresh; omit the flag)
/tmp/gv_venv/bin/python tests/test_utils/python_scripts/download_golden_values.py \
--source github --pipeline-id <WORKFLOW_RUN_ID>

With `--only-failing`, `_fetch_and_filter_artifacts` filters out jobs for which `matched_job["conclusion"] == "success"`. The script log reports the totals:

INFO:__main__:Total tests with golden values: <N>
INFO:__main__:Total golden values found: <M>

To run the relative-diff comparison and capture the CSV (Step 4):

/tmp/gv_venv/bin/python tests/test_utils/python_scripts/compare_golden_values_kl.py \
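--top 20 --csv /tmp/reldiff_summary.csv

The CSV has one row per `(file, metric)` pair, with columns `file, metric, n_steps, avg_rel_diff`. `n_steps` leaves out steps with `|old| < 1e-12`, and `avg_rel_diff` is `mean((old − new) / old)`.

The headline numbers and bucket counts for the summary can be computed from the CSV:

import csv, collections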
# read the comparison CSV written by the previous step
with open('/tmp/reldiff_summary.csv', newline='') as f:
    rows = list(csv.DictReader(f))
# convert the numeric columns and precompute |avg_rel_diff|
for r in rows:
    r['n_steps'] = int(r['n_steps'])
    r['avg_rel_diff'] = float(r['avg_rel_diff'])
    r['abs'] = abs(r['avg_rel_diff'])
by_metric = collections.defaultdict(list)
for r in rows:
    by_metric[r['metric']].append(r['abs'])
# headline numbers per metric (using |avg_rel_diff|)
for m, vs in sorted(by_metric.items()):
    vs.sort()
    # vs[len(vs)//2] is the upper median when the count is even
    print(m, len(vs), 'median', vs[len(vs)//2], 'max', vs[-1])
# bucket counts across all rows, on |avg_rel_diff|
buckets = [('==0', lambda x: x == 0),
           ('(0,1e-6)', lambda x: 0 < x < 1e-6),
           ('[1e-6,1e-4)', lambda x: 1e-6 <= x < 1e-4),
           ('[1e-4,1e-3)', lambda x: 1e-4 <= x < 1e-3),
           ('[1e-3,1e-2)', lambda x: 1e-3 <= x < 1e-2),
           ('[1e-2,1e-1)', lambda x: 1e-2 <= x < 1e-1),
           ('>=1e-1', lambda x: x >= 1e-1)]
abs_all = [r['abs'] for r in rows]
for label, pred in buckets:
    print(label, sum(1 for v in abs_all if pred(v)))

To produce the summary blurb (Step 5), fill the `<…>` placeholders in the template below. Choose the opening sentence by scope (`only-failing` or `all`), and keep `--only-failing` in the `download_golden_values.py` command only when the scope is `only-failing`.

### Summary
<scope-appropriate sentence> from GitHub workflow run `<WORKFLOW_RUN_ID>`.
**Golden value updates**
- Re-ran `tests/test_utils/python_scripts/download_golden_values.py --source github --pipeline-id <WORKFLOW_RUN_ID> <--only-failing if scope=only-failing>`.
- Updated **<N> golden-value files** under `tests/functional_tests/test_cases/`.
### Relative-difference summary
Comparison covers <FILES_WITH_BASELINE> files × <NUM_METRICS> metrics = **<TOTAL_ROWS> `(file, metric)` pairs**. Per row: `avg_rel_diff = mean((old − new) / old)` over shared steps.
**Per-metric headline numbers** (over `|avg_rel_diff|`)
| metric | n | median \|avg_rel_diff\| | max \|avg_rel_diff\| |
| ------------------------- | --: | -----------------------: | -------------------: |
| `lm loss` | <…> | <…> | <…> |
| `num-zeros` | <…> | <…> | <…> |
| `iteration-time` | <…> | <…> | <…> |
| `mem-allocated-bytes` | <…> | <…> | <…> |
| `mem-max-allocated-bytes` | <…> | <…> | <…> |
**Distribution of `|avg_rel_diff|` across all <TOTAL_ROWS> rows**
| \|avg_rel_diff\| bucket | count |
| ----------------------- | ----: |
| `== 0` | <…> |
| `(0, 1e-6)` | <…> |
| `[1e-6, 1e-4)` | <…> |
| `[1e-4, 1e-3)` | <…> |
| `[1e-3, 1e-2)` | <…> |
| `[1e-2, 1e-1)` | <…> |
| `>= 1e-1` | <…> |
**Interpretation** (apply only the bullets that match the data)
- `lm loss` max `|avg_rel_diff|` <X> / median <Y> — loss trajectories match old goldens to numerical noise (sub-1e-4 is within run-to-run variance).
- `mem-*` metrics typically sit at `== 0` or `(0, 1e-6)`; flag any row that lands above `[1e-4, 1e-3)`.
- `iteration-time` movement is dominated by warmup/scheduler noise; signed avg near zero means the run was simply jitterier, not slower or faster on average.
- `num-zeros` shifts cluster on `<list of test patterns>`; within historical run-to-run variance.

Columns of the comparison CSV:

| column | meaning |
|---|---|
| `n_steps` | shared step indices used in the average (NaN/inf and steps with `\|old\| < 1e-12` are excluded) |
| `avg_rel_diff` | `mean((old − new) / old)` |

Other identifiers this skill refers to: `|avg_rel_diff|`, `lm loss`, `num-zeros`, `iteration-time`, `mem-*`, `_fetch_and_filter_artifacts`, `--only-failing`, `download_from_gitlab`, `git HEAD`, `|old|`, `1e-12`, `(file, metric)`, `"nan"`, `compare_golden_values_kl.py`, `avg_rel_diff`, `GITHUB_TOKEN`, `RO_API_TOKEN`, `gh auth token`.
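The `file, metric, n_steps, avg_rel_diff` CSV columns map directly onto the per-metric headline table in the summary template. As a rough sketch, that table could be rendered from the CSV rows instead of being filled in by hand. The function name `per_metric_table`, the `.3g` number format, and the sample rows are illustrative assumptions and are not part of the skill's scripts.

```python
import collections
import csv
import io


def per_metric_table(rows):
    """Render the per-metric headline table (markdown) from CSV dict rows."""
    by_metric = collections.defaultdict(list)
    for r in rows:
        by_metric[r["metric"]].append(abs(float(r["avg_rel_diff"])))
    lines = [
        "| metric | n | median \\|avg_rel_diff\\| | max \\|avg_rel_diff\\| |",
        "| --- | --: | --: | --: |",
    ]
    for metric, vals in sorted(by_metric.items()):
        vals.sort()
        median = vals[len(vals) // 2]  # upper median when the count is even
        lines.append(f"| `{metric}` | {len(vals)} | {median:.3g} | {vals[-1]:.3g} |")
    return "\n".join(lines)


# Made-up sample rows in the CSV layout described in this document.
sample = io.StringIO(
    "file,metric,n_steps,avg_rel_diff\n"
    "a.json,lm loss,50,-0.00002\n"
    "b.json,lm loss,50,0.00004\n"
    "a.json,iteration-time,50,0.03\n"
)
table = per_metric_table(list(csv.DictReader(sample)))
print(table)
```

In practice the rows would come from `/tmp/reldiff_summary.csv`, and the printed lines can be pasted over the `<…>` cells of the template.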