`pyproject.toml`, `override-dependencies`, `transformer-engine @ git+...@<ref>`, `3rdparty/Megatron-LM`, `uv.lock`, `build-and-dependency`

`uv lock`, `copy-pr-bot`, `/ok to test`, `active/`, `flaky/`, `git mv`

`main`, `uv lock`, `git submodule update --init 3rdparty/Megatron-LM`, `uv lock`, `pyproject.toml`

    override-dependencies = [
        ...
        "transformer-engine @ git+https://github.com/NVIDIA/TransformerEngine.git@<new-ref>",
        ...
    ]

`release_v2.15`, `release_vX.Y`, `release/vX.Y`, `git ls-remote https://github.com/NVIDIA/TransformerEngine.git`

`uv lock`, `git diff --stat pyproject.toml uv.lock`, `override-dependencies`

    git add pyproject.toml uv.lock
    git commit -S -s -m "[build] chore: bump <package> to <ref>"
    git push -u origin <branch-name>

`-S`, `copy-pr-bot`, `/ok to test`, `/ok to test`

`needs-more-tests`, `full-test-suite`, `<details><summary>Claude summary</summary>`

`uv.lock`

`Updated <package> <old> -> <new>`

`needs-more-tests`

`gh api -X PATCH "repos/NVIDIA-NeMo/Megatron-Bridge/pulls/<N>" -F "body=@/tmp/pr-body.md"`, `gh pr edit`

`/ok to test $(git rev-parse HEAD)`

`CICD NeMo`, `TaskStop`

`/tmp/watchdog-<PR>.sh`, `#!/usr/bin/env bash`

    Monitor(
        description="CICD NeMo run state changes on PR <N>",
        command="bash /tmp/watchdog-<N>.sh",
        persistent=true,
        timeout_ms=3600000
    )

`persistent: true`, `TaskStop(<task-id>)`

`JOB <name> -> failure`

    RUN_ID=<from "RUN ... STARTED" event>
    gh run view "$RUN_ID" --repo NVIDIA-NeMo/Megatron-Bridge --log-failed > /tmp/run.log
    wc -l /tmp/run.log
    tail -200 /tmp/run.log

`main`, `flaky/`, `gb200_`, `gb200/active/`, `h100/active/`, `.sh`

    git commit -S -s -m "[ci] chore: quarantine flaky <test> for <package> bump"
    git push
    gh pr comment <N> --repo NVIDIA-NeMo/Megatron-Bridge \
        --body "/ok to test $(git rev-parse HEAD)"

`gh api PATCH`, `RUN <id> STARTED`

`RUN <id> COMPLETED conclusion=success`

    gh pr checks <N> --repo NVIDIA-NeMo/Megatron-Bridge | awk '{print $2}' | sort | uniq -c
    TaskStop(<watchdog-task-id>)
    gh api -X PATCH "repos/NVIDIA-NeMo/Megatron-Bridge/pulls/<N>" -F "body=@/tmp/pr-body.md"

| Symptom | Cause | Fix |
|---|---|---|
| Wrong TE branch ref (`release/vX.Y`) | TE uses underscore-separated `release_vX.Y` | Verify with `git ls-remote https://github.com/NVIDIA/TransformerEngine.git` before locking |
| Lockfile diff includes unrelated CVE-pinned packages | | Re-run lock and accept; don't try to revert those |
| Signed first push triggers CI but later pushes don't | | Always re-post `/ok to test` after every push (step 5) |
| Watchdog goes silent for 30+ min | | Bump poll interval; |
| Job name doesn't map to a script in | | Strip |
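
The branch-ref pitfall can be caught before running `uv lock`. The following is a minimal sketch, assuming the ref is a branch name rather than a tag or commit SHA; the function names `normalize_te_ref` and `te_branch_exists` are hypothetical and do not come from this document.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical.

TE_REPO="https://github.com/NVIDIA/TransformerEngine.git"

# Convert a slash-style ref (release/vX.Y) to the underscore style
# (release_vX.Y). Any other ref is printed unchanged.
normalize_te_ref() {
  local ref="$1"
  case "$ref" in
    release/v*) printf 'release_%s\n' "${ref#release/}" ;;
    *)          printf '%s\n' "$ref" ;;
  esac
}

# Return 0 if the branch exists on the remote (needs network access).
# `git ls-remote --exit-code` exits with status 2 when no ref matches.
te_branch_exists() {
  git ls-remote --exit-code "$TE_REPO" "refs/heads/$1" >/dev/null
}
```

A possible use is `ref="$(normalize_te_ref release/v2.15)"; te_branch_exists "$ref" && uv lock`, so the lock only runs once the ref is known to exist upstream.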
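
The body of the watchdog script is not shown in this document, so the following is only a sketch of the state-change logic such a script might contain. It assumes run states come from something like `gh run list --json databaseId,status,conclusion`, with `status` values such as `in_progress` and `completed`. The function name `emit_run_event` and the `status[:conclusion]` state format are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: not the original /tmp/watchdog-<PR>.sh.

# Print an event line only when a run's state differs from the last one seen.
# Args: run id, previous state ("" if never seen, else "status[:conclusion]"),
#       current status, current conclusion (may be empty).
emit_run_event() {
  local id="$1" prev="$2" status="$3" conclusion="$4"
  local cur="$status${conclusion:+:$conclusion}"
  [ "$cur" = "$prev" ] && return 0   # nothing changed, stay quiet
  case "$status" in
    completed) echo "RUN $id COMPLETED conclusion=$conclusion" ;;
    *)         [ -z "$prev" ] && echo "RUN $id STARTED" ;;
  esac
  return 0
}
```

Printing only on transitions keeps the monitor output short. The polling loop around it would store the last state per run ID and sleep between polls.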
`CICD NeMo`, `main`, `gh pr edit`, `gh api PATCH`, `gh pr create --body`, `--body-file`
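
When writing the PR body to a file such as `/tmp/pr-body.md`, the `Updated <package> <old> -> <new>` lines from the lock step can be collected mechanically. This is a sketch, assuming the lock output contains lines in exactly that five-field form; the function name `summarize_lock_updates` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: assumes lines of the form "Updated <package> <old> -> <new>".

# Read a log on stdin and print one Markdown bullet per "Updated ..." line.
# All other lines are ignored.
summarize_lock_updates() {
  awk '$1 == "Updated" && $4 == "->" { printf "- `%s`: %s -> %s\n", $2, $3, $5 }'
}
```

One way to use it is `uv lock 2>&1 | summarize_lock_updates >> /tmp/pr-body.md`, followed by the `gh api -X PATCH ... -F "body=@/tmp/pr-body.md"` call, which reads the body from the file and so avoids shell-quoting problems.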