nemo-mbridge-resiliency

Compare original and translation side by side

🇺🇸

Original

English

🇨🇳

Translation

Chinese

Resiliency

弹性功能

Stable docs: @docs/training/resiliency.md, @docs/training/checkpointing.md Card: @skills/nemo-mbridge-resiliency/card.yaml

稳定文档：@docs/training/resiliency.md、@docs/training/checkpointing.md 卡片：@skills/nemo-mbridge-resiliency/card.yaml

Enablement

功能启用

Fault tolerance (Slurm only)

容错机制（仅支持Slurm）

Option 1: NeMo Run plugin (recommended)

选项1：NeMo Run插件（推荐）

python

from megatron.bridge.recipes.run_plugins import FaultTolerancePlugin
import nemo_run as run

task = run.Script(...)
run_plugins = [
    FaultTolerancePlugin(
        enable_ft_package=True,
        calc_ft_timeouts=True,
        num_in_job_restarts=3,
        num_job_retries_on_failure=2,
        initial_rank_heartbeat_timeout=1800,
        rank_heartbeat_timeout=300,
    )
]
run.run(task, plugins=run_plugins, executor=executor)

Plugin parameter	Default	Description
`num_in_job_restarts`	3	Max restarts within same job
`num_job_retries_on_failure`	2	Max new job launches on failure
`initial_rank_heartbeat_timeout`	1800	First heartbeat timeout (seconds)
`rank_heartbeat_timeout`	300	Subsequent heartbeat timeout (seconds)

python

from megatron.bridge.recipes.run_plugins import FaultTolerancePlugin
import nemo_run as run

task = run.Script(...)
run_plugins = [
    FaultTolerancePlugin(
        enable_ft_package=True,
        calc_ft_timeouts=True,
        num_in_job_restarts=3,
        num_job_retries_on_failure=2,
        initial_rank_heartbeat_timeout=1800,
        rank_heartbeat_timeout=300,
    )
]
run.run(task, plugins=run_plugins, executor=executor)

插件参数	默认值	说明
`num_in_job_restarts`	3	同任务内最大重启次数
`num_job_retries_on_failure`	2	任务失败后最大重新启动次数
`initial_rank_heartbeat_timeout`	1800	首次心跳超时时间（秒）
`rank_heartbeat_timeout`	300	后续心跳超时时间（秒）

Option 2: Direct config + ft_launcher

选项2：直接配置 + ft_launcher

python

from megatron.bridge.training.config import FaultToleranceConfig

cfg.ft = FaultToleranceConfig(
    enable_ft_package=True,
    calc_ft_timeouts=True,
    simulate_fault=False,
    simulated_fault_type="random",
)

Launch with

ft_launcher

(not

torchrun

bash

export GROUP_RANK=0  # required for non-Slurm
ft_launcher \
    --rdzv_backend=c10d --rdzv_endpoint=${MASTER_ADDR}:${MASTER_PORT} \
    --nnodes=${NUM_NODES} --nproc-per-node=${NUM_GPUS_PER_NODE} \
    --ft-rank_section_timeouts=setup:600,step:180,checkpointing:420 \
    --ft-rank_out_of_section_timeout=300 \
    your_training_script.py

Config parameter	Default	Description
`enable_ft_package`	False	Enable fault tolerance
`calc_ft_timeouts`	False	Auto-compute optimal timeouts
`simulate_fault`	False	Enable fault simulation for testing
`simulated_fault_type`	`"random"`	`"rank_hung"` , `"rank_killed"` , or `"random"`
`simulated_fault_rank`	None	Specific rank to fault (random if None)
`simulated_fault_base_delay`	0	Base delay before simulating fault

Section-based timeout monitoring covers setup, training steps, checkpointing, and out-of-section time independently. Timeouts are saved to

ft_state.json

for subsequent runs when

calc_ft_timeouts=True

python

from megatron.bridge.training.config import FaultToleranceConfig

cfg.ft = FaultToleranceConfig(
    enable_ft_package=True,
    calc_ft_timeouts=True,
    simulate_fault=False,
    simulated_fault_type="random",
)

使用

ft_launcher

启动（而非

torchrun

）：

bash

export GROUP_RANK=0  # 非Slurm环境必填
ft_launcher \
    --rdzv_backend=c10d --rdzv_endpoint=${MASTER_ADDR}:${MASTER_PORT} \
    --nnodes=${NUM_NODES} --nproc-per-node=${NUM_GPUS_PER_NODE} \
    --ft-rank_section_timeouts=setup:600,step:180,checkpointing:420 \
    --ft-rank_out_of_section_timeout=300 \
    your_training_script.py

配置参数	默认值	说明
`enable_ft_package`	False	启用容错机制
`calc_ft_timeouts`	False	自动计算最优超时时间
`simulate_fault`	False	启用故障模拟以用于测试
`simulated_fault_type`	`"random"`	可选值： `"rank_hung"` 、 `"rank_killed"` 或 `"random"`
`simulated_fault_rank`	None	指定要模拟故障的rank（为None时随机选择）
`simulated_fault_base_delay`	0	模拟故障前的基础延迟时间

基于阶段的超时监控独立覆盖初始化、训练步骤、Checkpoint存储以及阶段外时间。当

calc_ft_timeouts=True

时，超时时间会保存到

ft_state.json

供后续运行使用。

NVRx straggler detection

NVRx掉队者检测

python

from megatron.bridge.training.config import NVRxStragglerDetectionConfig

cfg.nvrx_straggler = NVRxStragglerDetectionConfig(
    enabled=True,
    report_time_interval=300.0,
    calc_relative_gpu_perf=True,
    calc_individual_gpu_perf=True,
    num_gpu_perf_scores_to_print=5,
    gpu_relative_perf_threshold=0.7,
    gpu_individual_perf_threshold=0.7,
    stop_if_detected=False,
    enable_logging=True,
)

Parameter	Default	Description
`enabled`	False	Enable straggler detection
`report_time_interval`	300.0	Seconds between straggler checks
`calc_relative_gpu_perf`	True	Compare ranks against each other
`calc_individual_gpu_perf`	True	Track per-rank degradation over time
`gpu_relative_perf_threshold`	0.7	Threshold for relative performance (0-1)
`gpu_individual_perf_threshold`	0.7	Threshold for individual performance (0-1)
`stop_if_detected`	False	Terminate training on straggler
`num_gpu_perf_scores_to_print`	5	Number of best/worst scores to print
`profiling_interval`	1	Profiling interval for detector

python

from megatron.bridge.training.config import NVRxStragglerDetectionConfig

cfg.nvrx_straggler = NVRxStragglerDetectionConfig(
    enabled=True,
    report_time_interval=300.0,
    calc_relative_gpu_perf=True,
    calc_individual_gpu_perf=True,
    num_gpu_perf_scores_to_print=5,
    gpu_relative_perf_threshold=0.7,
    gpu_individual_perf_threshold=0.7,
    stop_if_detected=False,
    enable_logging=True,
)

参数	默认值	说明
`enabled`	False	启用掉队者检测
`report_time_interval`	300.0	掉队者检查的时间间隔（秒）
`calc_relative_gpu_perf`	True	对比不同rank的性能
`calc_individual_gpu_perf`	True	跟踪单个rank的性能随时间的退化情况
`gpu_relative_perf_threshold`	0.7	相对性能阈值（0-1）
`gpu_individual_perf_threshold`	0.7	单个性能阈值（0-1）
`stop_if_detected`	False	检测到掉队者时终止训练
`num_gpu_perf_scores_to_print`	5	要打印的最优/最差性能分数数量
`profiling_interval`	1	检测器的性能分析间隔

Preemption

抢占机制

Plugin (Slurm)

插件（Slurm）

python

from megatron.bridge.recipes.run_plugins import PreemptionPlugin

plugins = [
    PreemptionPlugin(
        preempt_time=60,
        enable_exit_handler=True,
        enable_exit_handler_for_data_loader=False,
    )
]

Plugin parameter	Default	Description
`preempt_time`	60	Seconds before job limit to send signal
`enable_exit_handler`	True	Enable signal handler in training
`enable_exit_handler_for_data_loader`	False	Enable for dataloader workers

python

from megatron.bridge.recipes.run_plugins import PreemptionPlugin

plugins = [
    PreemptionPlugin(
        preempt_time=60,
        enable_exit_handler=True,
        enable_exit_handler_for_data_loader=False,
    )
]

插件参数	默认值	说明
`preempt_time`	60	到达任务限制前发送信号的提前时间（秒）
`enable_exit_handler`	True	在训练中启用信号处理器
`enable_exit_handler_for_data_loader`	False	为数据加载器工作进程启用信号处理器

Direct config

直接配置

python

import signal
cfg.train.exit_signal_handler = True
cfg.train.exit_signal = signal.SIGTERM
cfg.train.exit_signal_handler_for_dataloader = False

python

import signal
cfg.train.exit_signal_handler = True
cfg.train.exit_signal = signal.SIGTERM
cfg.train.exit_signal_handler_for_dataloader = False

Re-run state machine (experimental)

重跑状态机（实验性）

python

from megatron.bridge.training.config import RerunStateMachineConfig

cfg.rerun_state_machine = RerunStateMachineConfig(
    rerun_mode="validate_results",
    check_for_nan_in_loss=True,
    check_for_spiky_loss=False,
    spiky_loss_factor=10.0,
)

Parameter	Default	Description
`rerun_mode`	`"disabled"`	`"disabled"` , `"validate_results"` , `"report_determinism_stats"`
`check_for_nan_in_loss`	True	Check for NaN in loss
`check_for_spiky_loss`	False	Check for unexpectedly large loss
`spiky_loss_factor`	10.0	Loss flagged if > factor * max observed (increase for large models)

Exit codes: 16 = resume to disambiguate, 17 = failed validation.

python

from megatron.bridge.training.config import RerunStateMachineConfig

cfg.rerun_state_machine = RerunStateMachineConfig(
    rerun_mode="validate_results",
    check_for_nan_in_loss=True,
    check_for_spiky_loss=False,
    spiky_loss_factor=10.0,
)

参数	默认值	说明
`rerun_mode`	`"disabled"`	可选值： `"disabled"` 、 `"validate_results"` 、 `"report_determinism_stats"`
`check_for_nan_in_loss`	True	检查损失值中是否存在NaN
`check_for_spiky_loss`	False	检查是否存在异常激增的损失值
`spiky_loss_factor`	10.0	当损失值大于此系数乘以历史最大值时标记为异常（大模型需增大该值）

退出码：16 = 需要恢复以消除歧义，17 = 验证失败。

In-process restart (experimental)

进程内重启（实验性）

python

from megatron.bridge.training.config import InProcessRestartConfig

cfg.inprocess_restart = InProcessRestartConfig(
    enabled=True,
    granularity="node",
    soft_timeout=60.0,
    hard_timeout=90.0,
)

Parameter	Default	Description
`enabled`	False	Enable in-process restart
`active_world_size`	None	Ranks executing workload (rest are warm reserves)
`granularity`	`"node"`	`"node"` or `"rank"` restart granularity
`max_iterations`	None	Max restart attempts (None = unlimited)
`soft_timeout`	60.0	Detect GIL-released hangs (seconds)
`hard_timeout`	90.0	Force-terminate hung ranks (seconds)
`heartbeat_interval`	30.0	Heartbeat interval (seconds)
`heartbeat_timeout`	60.0	Missing heartbeat timeout (seconds)
`barrier_timeout`	120.0	Distributed barrier timeout (seconds)
`completion_timeout`	120.0	Completion barrier timeout (seconds)
`empty_cuda_cache`	True	Clear CUDA cache during restart
`max_rank_faults`	None	Max rank faults before terminating
`monitor_process_logdir`	None	Directory for monitor logs

Required environment variables:

bash

export TORCH_CPP_LOG_LEVEL=error
export TORCH_NCCL_RETHROW_CUDA_ERRORS=0
export NCCL_NVLS_ENABLE=0

The PyTorch NCCL watchdog timeout must exceed

hard_timeout

. NeMo-Run's Slurm Executor is not supported; launch directly with

srun --kill-on-bad-exit=0

python

from megatron.bridge.training.config import InProcessRestartConfig

cfg.inprocess_restart = InProcessRestartConfig(
    enabled=True,
    granularity="node",
    soft_timeout=60.0,
    hard_timeout=90.0,
)

参数	默认值	说明
`enabled`	False	启用进程内重启
`active_world_size`	None	执行工作负载的rank数量（其余为热备）
`granularity`	`"node"`	重启粒度： `"node"` 或 `"rank"`
`max_iterations`	None	最大重启尝试次数（None表示无限制）
`soft_timeout`	60.0	检测GIL释放后的挂起情况（秒）
`hard_timeout`	90.0	强制终止挂起的rank（秒）
`heartbeat_interval`	30.0	心跳间隔（秒）
`heartbeat_timeout`	60.0	心跳丢失超时时间（秒）
`barrier_timeout`	120.0	分布式屏障超时时间（秒）
`completion_timeout`	120.0	完成屏障超时时间（秒）
`empty_cuda_cache`	True	重启时清理CUDA缓存
`max_rank_faults`	None	终止训练前允许的最大rank故障次数
`monitor_process_logdir`	None	监控日志的存储目录

必填环境变量：

bash

export TORCH_CPP_LOG_LEVEL=error
export TORCH_NCCL_RETHROW_CUDA_ERRORS=0
export NCCL_NVLS_ENABLE=0

PyTorch NCCL看门狗超时时间必须大于

hard_timeout

。不支持NeMo-Run的Slurm执行器；需直接使用

srun --kill-on-bad-exit=0

启动。

Async checkpoint save

异步Checkpoint保存

python

cfg.checkpoint.async_save = True
cfg.checkpoint.ckpt_format = "torch_dist"

python

cfg.checkpoint.async_save = True
cfg.checkpoint.ckpt_format = "torch_dist"

Local checkpointing (NVRx)

本地Checkpoint存储（NVRx）

python

cfg.checkpoint.non_persistent_local_ckpt_dir = "/local/scratch/ckpt"
cfg.checkpoint.non_persistent_local_ckpt_algo = "fully_parallel"

python

cfg.checkpoint.non_persistent_local_ckpt_dir = "/local/scratch/ckpt"
cfg.checkpoint.non_persistent_local_ckpt_algo = "fully_parallel"

Code Anchors

代码锚点

Fault tolerance

容错机制

Config:

src/megatron/bridge/training/config.py

—

FaultToleranceConfig

Runtime:

src/megatron/bridge/training/fault_tolerance.py

Plugin:

src/megatron/bridge/recipes/run_plugins.py

—

FaultTolerancePlugin

Perf plugin:

scripts/performance/nemo-mbridge-resiliency_plugins.py

Tests:

tests/unit_tests/training/test_fault_tolerance.py

Example:

examples/training_features/nemo-mbridge-resiliency/fault_tolerance/

配置：

src/megatron/bridge/training/config.py

—

FaultToleranceConfig

运行时：

src/megatron/bridge/training/fault_tolerance.py

插件：

src/megatron/bridge/recipes/run_plugins.py

—

FaultTolerancePlugin

性能插件：

scripts/performance/nemo-mbridge-resiliency_plugins.py

测试：

tests/unit_tests/training/test_fault_tolerance.py

示例：

examples/training_features/nemo-mbridge-resiliency/fault_tolerance/

Straggler detection

掉队者检测

Config:

src/megatron/bridge/training/config.py

—

NVRxStragglerDetectionConfig

Runtime:

src/megatron/bridge/training/nvrx_straggler.py

Train loop:

src/megatron/bridge/training/train.py

—

check_nvrx_straggler_detection

Tests:

tests/unit_tests/training/test_nvrx_straggler.py

tests/functional_tests/training/test_nvrx_straggler.py

Example:

examples/training_features/nemo-mbridge-resiliency/straggler_detection/

配置：

src/megatron/bridge/training/config.py

—

NVRxStragglerDetectionConfig

运行时：

src/megatron/bridge/training/nvrx_straggler.py

训练循环：

src/megatron/bridge/training/train.py

—

check_nvrx_straggler_detection

测试：

tests/unit_tests/training/test_nvrx_straggler.py

、

tests/functional_tests/training/test_nvrx_straggler.py

示例：

examples/training_features/nemo-mbridge-resiliency/straggler_detection/

In-process restart

进程内重启

Config:

src/megatron/bridge/training/config.py

—

InProcessRestartConfig

Runtime:

src/megatron/bridge/training/inprocess_restart.py

Entry point:

src/megatron/bridge/training/pretrain.py

—

maybe_wrap_for_inprocess_restart

Tests:

tests/unit_tests/training/test_inprocess_restart.py

tests/functional_tests/training/test_inprocess_restart.py

配置：

src/megatron/bridge/training/config.py

—

InProcessRestartConfig

运行时：

src/megatron/bridge/training/inprocess_restart.py

入口：

src/megatron/bridge/training/pretrain.py

—

maybe_wrap_for_inprocess_restart

测试：

tests/unit_tests/training/test_inprocess_restart.py

、

tests/functional_tests/training/test_inprocess_restart.py

Preemption

抢占机制

Plugin:

src/megatron/bridge/recipes/run_plugins.py

—

PreemptionPlugin

Signal handler:

src/megatron/bridge/training/utils/sig_utils.py

Tests:

tests/unit_tests/recipes/test_run_plugins.py

插件：

src/megatron/bridge/recipes/run_plugins.py

—

PreemptionPlugin

信号处理器：

src/megatron/bridge/training/utils/sig_utils.py

测试：

tests/unit_tests/recipes/test_run_plugins.py

Re-run state machine

重跑状态机

Config:

src/megatron/bridge/training/config.py

—

RerunStateMachineConfig

Init:

src/megatron/bridge/training/initialize.py

—

init_rerun_state

配置：

src/megatron/bridge/training/config.py

—

RerunStateMachineConfig

初始化：

src/megatron/bridge/training/initialize.py

—

init_rerun_state

Checkpointing

Checkpoint管理

Async save:

src/megatron/bridge/training/checkpointing.py

—

schedule_async_save

Local ckpt:

src/megatron/bridge/training/checkpointing.py

—

LocalCheckpointManager

Tests:

tests/functional_tests/training/test_local_checkpointing.py

异步保存：

src/megatron/bridge/training/checkpointing.py

—

schedule_async_save

本地Checkpoint：

src/megatron/bridge/training/checkpointing.py

—

LocalCheckpointManager

测试：

tests/functional_tests/training/test_local_checkpointing.py

Pitfalls

注意事项

ft_launcher, not torchrun: Direct
```
FaultToleranceConfig
```
requires
```
ft_launcher
```
. Using
```
torchrun
```
silently disables FT. For non-Slurm, set
```
GROUP_RANK=0
```
.
Async save requires torch_dist:
```
async_save=True
```
only works with
```
ckpt_format="torch_dist"
```
. Other formats silently fail or error.
IPR + NeMo-Run: In-process restart is not compatible with NeMo-Run or Slurm preemption plugins. Requires specific PyTorch/NCCL versions and env vars.
NVRx vs legacy straggler: Two detectors exist. Use NVRx (
```
nvrx_straggler
```
); do not enable both.
stop_if_detected default: NVRx logs but does not stop training by default. Set
```
stop_if_detected=True
```
for automatic termination.
NCCL watchdog vs hard_timeout: For IPR, NCCL watchdog timeout must exceed
```
hard_timeout
```
or PyTorch kills the process before recovery.
Rerun state machine is alpha: Use
```
check_for_nan_in_loss=True
```
for NaN detection, but don't rely on full rerun workflows yet.

使用ft_launcher而非torchrun：直接配置
```
FaultToleranceConfig
```
需要使用
```
ft_launcher
```
。使用
```
torchrun
```
会静默禁用容错机制。非Slurm环境需设置
```
GROUP_RANK=0
```
。
异步保存仅支持torch_dist格式：
```
async_save=True
```
仅在
```
ckpt_format="torch_dist"
```
时生效。其他格式会静默失败或报错。
进程内重启与NeMo-Run不兼容：进程内重启无法与NeMo-Run或Slurm抢占插件配合使用。需要特定版本的PyTorch/NCCL以及环境变量。
NVRx与旧版掉队者检测器：存在两种检测器。请使用NVRx版本（
```
nvrx_straggler
```
）；不要同时启用两者。
stop_if_detected默认行为：NVRx默认仅记录日志但不终止训练。需设置
```
stop_if_detected=True
```
以实现自动终止。
NCCL看门狗与hard_timeout：对于进程内重启，NCCL看门狗超时时间必须大于
```
hard_timeout
```
，否则PyTorch会在恢复前终止进程。
重跑状态机处于alpha阶段：可使用
```
check_for_nan_in_loss=True
```
检测NaN，但暂不要依赖完整的重跑工作流。

Verification

验证方法

Fault tolerance

容错机制

bash

./examples/training_features/nemo-mbridge-resiliency/fault_tolerance/run_fault_tolerance.sh
./examples/training_features/nemo-mbridge-resiliency/fault_tolerance/run_fault_tolerance.sh --simulate-fault

Look for

[FaultTolerance]

[RankMonitorServer]

log lines with section timeouts. Simulated fault should trigger restart from checkpoint.

bash

./examples/training_features/nemo-mbridge-resiliency/fault_tolerance/run_fault_tolerance.sh
./examples/training_features/nemo-mbridge-resiliency/fault_tolerance/run_fault_tolerance.sh --simulate-fault

查找包含

[FaultTolerance]

[RankMonitorServer]

的日志行，其中包含阶段超时信息。模拟故障应触发从Checkpoint重启。

Straggler detection

掉队者检测

bash

uv run python -m torch.distributed.run --nproc_per_node=2 \
    examples/training_features/nemo-mbridge-resiliency/straggler_detection/straggler_detection_example.py

Look for

GPU relative performance

and

GPU individual performance

reports with per-rank scores.

bash

uv run python -m torch.distributed.run --nproc_per_node=2 \
    examples/training_features/nemo-mbridge-resiliency/straggler_detection/straggler_detection_example.py

查找包含

GPU relative performance

和

GPU individual performance

的报告，其中包含各rank的性能分数。

Async checkpoint

异步Checkpoint

Look for

Scheduling async checkpoint save

in logs. Training iterations should continue while checkpoint files are being written.

在日志中查找

Scheduling async checkpoint save

信息。训练迭代应在Checkpoint文件写入时持续进行。

In-process restart

进程内重启

bash

pytest tests/functional_tests/training/test_inprocess_restart.py -v

Requires compatible PyTorch/NCCL versions.

bash

pytest tests/functional_tests/training/test_inprocess_restart.py -v

需要兼容版本的PyTorch/NCCL。