nemo-mbridge-resiliency

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

Resiliency

弹性功能

Stable docs: @docs/training/resiliency.md, @docs/training/checkpointing.md Card: @skills/nemo-mbridge-resiliency/card.yaml
稳定文档:@docs/training/resiliency.md、@docs/training/checkpointing.md 卡片:@skills/nemo-mbridge-resiliency/card.yaml

Enablement

功能启用

Fault tolerance (Slurm only)

容错机制(仅支持Slurm)

Option 1: NeMo Run plugin (recommended)

选项1:NeMo Run插件(推荐)

python
from megatron.bridge.recipes.run_plugins import FaultTolerancePlugin
import nemo_run as run

task = run.Script(...)
run_plugins = [
    FaultTolerancePlugin(
        enable_ft_package=True,
        calc_ft_timeouts=True,
        num_in_job_restarts=3,
        num_job_retries_on_failure=2,
        initial_rank_heartbeat_timeout=1800,
        rank_heartbeat_timeout=300,
    )
]
run.run(task, plugins=run_plugins, executor=executor)
Plugin parameterDefaultDescription
num_in_job_restarts
3Max restarts within same job
num_job_retries_on_failure
2Max new job launches on failure
initial_rank_heartbeat_timeout
1800First heartbeat timeout (seconds)
rank_heartbeat_timeout
300Subsequent heartbeat timeout (seconds)
python
from megatron.bridge.recipes.run_plugins import FaultTolerancePlugin
import nemo_run as run

task = run.Script(...)
run_plugins = [
    FaultTolerancePlugin(
        enable_ft_package=True,
        calc_ft_timeouts=True,
        num_in_job_restarts=3,
        num_job_retries_on_failure=2,
        initial_rank_heartbeat_timeout=1800,
        rank_heartbeat_timeout=300,
    )
]
run.run(task, plugins=run_plugins, executor=executor)
插件参数默认值说明
num_in_job_restarts
3同任务内最大重启次数
num_job_retries_on_failure
2任务失败后最大重新启动次数
initial_rank_heartbeat_timeout
1800首次心跳超时时间(秒)
rank_heartbeat_timeout
300后续心跳超时时间(秒)

Option 2: Direct config + ft_launcher

选项2:直接配置 + ft_launcher

python
from megatron.bridge.training.config import FaultToleranceConfig

cfg.ft = FaultToleranceConfig(
    enable_ft_package=True,
    calc_ft_timeouts=True,
    simulate_fault=False,
    simulated_fault_type="random",
)
Launch with
ft_launcher
(not
torchrun
):
bash
export GROUP_RANK=0  # required for non-Slurm
ft_launcher \
    --rdzv_backend=c10d --rdzv_endpoint=${MASTER_ADDR}:${MASTER_PORT} \
    --nnodes=${NUM_NODES} --nproc-per-node=${NUM_GPUS_PER_NODE} \
    --ft-rank_section_timeouts=setup:600,step:180,checkpointing:420 \
    --ft-rank_out_of_section_timeout=300 \
    your_training_script.py
Config parameterDefaultDescription
enable_ft_package
FalseEnable fault tolerance
calc_ft_timeouts
FalseAuto-compute optimal timeouts
simulate_fault
FalseEnable fault simulation for testing
simulated_fault_type
"random"
"rank_hung"
,
"rank_killed"
, or
"random"
simulated_fault_rank
NoneSpecific rank to fault (random if None)
simulated_fault_base_delay
0Base delay before simulating fault
Section-based timeout monitoring covers setup, training steps, checkpointing, and out-of-section time independently. Timeouts are saved to
ft_state.json
for subsequent runs when
calc_ft_timeouts=True
.
python
from megatron.bridge.training.config import FaultToleranceConfig

cfg.ft = FaultToleranceConfig(
    enable_ft_package=True,
    calc_ft_timeouts=True,
    simulate_fault=False,
    simulated_fault_type="random",
)
使用
ft_launcher
启动(而非
torchrun
):
bash
export GROUP_RANK=0  # 非Slurm环境必填
ft_launcher \
    --rdzv_backend=c10d --rdzv_endpoint=${MASTER_ADDR}:${MASTER_PORT} \
    --nnodes=${NUM_NODES} --nproc-per-node=${NUM_GPUS_PER_NODE} \
    --ft-rank_section_timeouts=setup:600,step:180,checkpointing:420 \
    --ft-rank_out_of_section_timeout=300 \
    your_training_script.py
配置参数默认值说明
enable_ft_package
False启用容错机制
calc_ft_timeouts
False自动计算最优超时时间
simulate_fault
False启用故障模拟以用于测试
simulated_fault_type
"random"
可选值:
"rank_hung"
"rank_killed"
"random"
simulated_fault_rank
None指定要模拟故障的rank(为None时随机选择)
simulated_fault_base_delay
0模拟故障前的基础延迟时间
基于阶段的超时监控独立覆盖初始化、训练步骤、Checkpoint存储以及阶段外时间。当
calc_ft_timeouts=True
时,超时时间会保存到
ft_state.json
供后续运行使用。

NVRx straggler detection

NVRx掉队者检测

python
from megatron.bridge.training.config import NVRxStragglerDetectionConfig

cfg.nvrx_straggler = NVRxStragglerDetectionConfig(
    enabled=True,
    report_time_interval=300.0,
    calc_relative_gpu_perf=True,
    calc_individual_gpu_perf=True,
    num_gpu_perf_scores_to_print=5,
    gpu_relative_perf_threshold=0.7,
    gpu_individual_perf_threshold=0.7,
    stop_if_detected=False,
    enable_logging=True,
)
ParameterDefaultDescription
enabled
FalseEnable straggler detection
report_time_interval
300.0Seconds between straggler checks
calc_relative_gpu_perf
TrueCompare ranks against each other
calc_individual_gpu_perf
TrueTrack per-rank degradation over time
gpu_relative_perf_threshold
0.7Threshold for relative performance (0-1)
gpu_individual_perf_threshold
0.7Threshold for individual performance (0-1)
stop_if_detected
FalseTerminate training on straggler
num_gpu_perf_scores_to_print
5Number of best/worst scores to print
profiling_interval
1Profiling interval for detector
python
from megatron.bridge.training.config import NVRxStragglerDetectionConfig

cfg.nvrx_straggler = NVRxStragglerDetectionConfig(
    enabled=True,
    report_time_interval=300.0,
    calc_relative_gpu_perf=True,
    calc_individual_gpu_perf=True,
    num_gpu_perf_scores_to_print=5,
    gpu_relative_perf_threshold=0.7,
    gpu_individual_perf_threshold=0.7,
    stop_if_detected=False,
    enable_logging=True,
)
参数默认值说明
enabled
False启用掉队者检测
report_time_interval
300.0掉队者检查的时间间隔(秒)
calc_relative_gpu_perf
True对比不同rank的性能
calc_individual_gpu_perf
True跟踪单个rank的性能随时间的退化情况
gpu_relative_perf_threshold
0.7相对性能阈值(0-1)
gpu_individual_perf_threshold
0.7单个性能阈值(0-1)
stop_if_detected
False检测到掉队者时终止训练
num_gpu_perf_scores_to_print
5要打印的最优/最差性能分数数量
profiling_interval
1检测器的性能分析间隔

Preemption

抢占机制

Plugin (Slurm)

插件(Slurm)

python
from megatron.bridge.recipes.run_plugins import PreemptionPlugin

plugins = [
    PreemptionPlugin(
        preempt_time=60,
        enable_exit_handler=True,
        enable_exit_handler_for_data_loader=False,
    )
]
Plugin parameterDefaultDescription
preempt_time
60Seconds before job limit to send signal
enable_exit_handler
TrueEnable signal handler in training
enable_exit_handler_for_data_loader
FalseEnable for dataloader workers
python
from megatron.bridge.recipes.run_plugins import PreemptionPlugin

plugins = [
    PreemptionPlugin(
        preempt_time=60,
        enable_exit_handler=True,
        enable_exit_handler_for_data_loader=False,
    )
]
插件参数默认值说明
preempt_time
60到达任务限制前发送信号的提前时间(秒)
enable_exit_handler
True在训练中启用信号处理器
enable_exit_handler_for_data_loader
False为数据加载器工作进程启用信号处理器

Direct config

直接配置

python
import signal
cfg.train.exit_signal_handler = True
cfg.train.exit_signal = signal.SIGTERM
cfg.train.exit_signal_handler_for_dataloader = False
python
import signal
cfg.train.exit_signal_handler = True
cfg.train.exit_signal = signal.SIGTERM
cfg.train.exit_signal_handler_for_dataloader = False

Re-run state machine (experimental)

重跑状态机(实验性)

python
from megatron.bridge.training.config import RerunStateMachineConfig

cfg.rerun_state_machine = RerunStateMachineConfig(
    rerun_mode="validate_results",
    check_for_nan_in_loss=True,
    check_for_spiky_loss=False,
    spiky_loss_factor=10.0,
)
ParameterDefaultDescription
rerun_mode
"disabled"
"disabled"
,
"validate_results"
,
"report_determinism_stats"
check_for_nan_in_loss
TrueCheck for NaN in loss
check_for_spiky_loss
FalseCheck for unexpectedly large loss
spiky_loss_factor
10.0Loss flagged if > factor * max observed (increase for large models)
Exit codes: 16 = resume to disambiguate, 17 = failed validation.
python
from megatron.bridge.training.config import RerunStateMachineConfig

cfg.rerun_state_machine = RerunStateMachineConfig(
    rerun_mode="validate_results",
    check_for_nan_in_loss=True,
    check_for_spiky_loss=False,
    spiky_loss_factor=10.0,
)
参数默认值说明
rerun_mode
"disabled"
可选值:
"disabled"
"validate_results"
"report_determinism_stats"
check_for_nan_in_loss
True检查损失值中是否存在NaN
check_for_spiky_loss
False检查是否存在异常激增的损失值
spiky_loss_factor
10.0当损失值大于此系数乘以历史最大值时标记为异常(大模型需增大该值)
退出码:16 = 需要恢复以消除歧义,17 = 验证失败。

In-process restart (experimental)

进程内重启(实验性)

python
from megatron.bridge.training.config import InProcessRestartConfig

cfg.inprocess_restart = InProcessRestartConfig(
    enabled=True,
    granularity="node",
    soft_timeout=60.0,
    hard_timeout=90.0,
)
ParameterDefaultDescription
enabled
FalseEnable in-process restart
active_world_size
NoneRanks executing workload (rest are warm reserves)
granularity
"node"
"node"
or
"rank"
restart granularity
max_iterations
NoneMax restart attempts (None = unlimited)
soft_timeout
60.0Detect GIL-released hangs (seconds)
hard_timeout
90.0Force-terminate hung ranks (seconds)
heartbeat_interval
30.0Heartbeat interval (seconds)
heartbeat_timeout
60.0Missing heartbeat timeout (seconds)
barrier_timeout
120.0Distributed barrier timeout (seconds)
completion_timeout
120.0Completion barrier timeout (seconds)
empty_cuda_cache
TrueClear CUDA cache during restart
max_rank_faults
NoneMax rank faults before terminating
monitor_process_logdir
NoneDirectory for monitor logs
Required environment variables:
bash
export TORCH_CPP_LOG_LEVEL=error
export TORCH_NCCL_RETHROW_CUDA_ERRORS=0
export NCCL_NVLS_ENABLE=0
The PyTorch NCCL watchdog timeout must exceed
hard_timeout
. NeMo-Run's Slurm Executor is not supported; launch directly with
srun --kill-on-bad-exit=0
.
python
from megatron.bridge.training.config import InProcessRestartConfig

cfg.inprocess_restart = InProcessRestartConfig(
    enabled=True,
    granularity="node",
    soft_timeout=60.0,
    hard_timeout=90.0,
)
参数默认值说明
enabled
False启用进程内重启
active_world_size
None执行工作负载的rank数量(其余为热备)
granularity
"node"
重启粒度:
"node"
"rank"
max_iterations
None最大重启尝试次数(None表示无限制)
soft_timeout
60.0检测GIL释放后的挂起情况(秒)
hard_timeout
90.0强制终止挂起的rank(秒)
heartbeat_interval
30.0心跳间隔(秒)
heartbeat_timeout
60.0心跳丢失超时时间(秒)
barrier_timeout
120.0分布式屏障超时时间(秒)
completion_timeout
120.0完成屏障超时时间(秒)
empty_cuda_cache
True重启时清理CUDA缓存
max_rank_faults
None终止训练前允许的最大rank故障次数
monitor_process_logdir
None监控日志的存储目录
必填环境变量:
bash
export TORCH_CPP_LOG_LEVEL=error
export TORCH_NCCL_RETHROW_CUDA_ERRORS=0
export NCCL_NVLS_ENABLE=0
PyTorch NCCL看门狗超时时间必须大于
hard_timeout
。不支持NeMo-Run的Slurm执行器;需直接使用
srun --kill-on-bad-exit=0
启动。

Async checkpoint save

异步Checkpoint保存

python
cfg.checkpoint.async_save = True
cfg.checkpoint.ckpt_format = "torch_dist"
python
cfg.checkpoint.async_save = True
cfg.checkpoint.ckpt_format = "torch_dist"

Local checkpointing (NVRx)

本地Checkpoint存储(NVRx)

python
cfg.checkpoint.non_persistent_local_ckpt_dir = "/local/scratch/ckpt"
cfg.checkpoint.non_persistent_local_ckpt_algo = "fully_parallel"
python
cfg.checkpoint.non_persistent_local_ckpt_dir = "/local/scratch/ckpt"
cfg.checkpoint.non_persistent_local_ckpt_algo = "fully_parallel"

Code Anchors

代码锚点

Fault tolerance

容错机制

  • Config:
    src/megatron/bridge/training/config.py
    FaultToleranceConfig
  • Runtime:
    src/megatron/bridge/training/fault_tolerance.py
  • Plugin:
    src/megatron/bridge/recipes/run_plugins.py
    FaultTolerancePlugin
  • Perf plugin:
    scripts/performance/nemo-mbridge-resiliency_plugins.py
  • Tests:
    tests/unit_tests/training/test_fault_tolerance.py
  • Example:
    examples/training_features/nemo-mbridge-resiliency/fault_tolerance/
  • 配置:
    src/megatron/bridge/training/config.py
    FaultToleranceConfig
  • 运行时:
    src/megatron/bridge/training/fault_tolerance.py
  • 插件:
    src/megatron/bridge/recipes/run_plugins.py
    FaultTolerancePlugin
  • 性能插件:
    scripts/performance/nemo-mbridge-resiliency_plugins.py
  • 测试:
    tests/unit_tests/training/test_fault_tolerance.py
  • 示例:
    examples/training_features/nemo-mbridge-resiliency/fault_tolerance/

Straggler detection

掉队者检测

  • Config:
    src/megatron/bridge/training/config.py
    NVRxStragglerDetectionConfig
  • Runtime:
    src/megatron/bridge/training/nvrx_straggler.py
  • Train loop:
    src/megatron/bridge/training/train.py
    check_nvrx_straggler_detection
  • Tests:
    tests/unit_tests/training/test_nvrx_straggler.py
    ,
    tests/functional_tests/training/test_nvrx_straggler.py
  • Example:
    examples/training_features/nemo-mbridge-resiliency/straggler_detection/
  • 配置:
    src/megatron/bridge/training/config.py
    NVRxStragglerDetectionConfig
  • 运行时:
    src/megatron/bridge/training/nvrx_straggler.py
  • 训练循环:
    src/megatron/bridge/training/train.py
    check_nvrx_straggler_detection
  • 测试:
    tests/unit_tests/training/test_nvrx_straggler.py
    tests/functional_tests/training/test_nvrx_straggler.py
  • 示例:
    examples/training_features/nemo-mbridge-resiliency/straggler_detection/

In-process restart

进程内重启

  • Config:
    src/megatron/bridge/training/config.py
    InProcessRestartConfig
  • Runtime:
    src/megatron/bridge/training/inprocess_restart.py
  • Entry point:
    src/megatron/bridge/training/pretrain.py
    maybe_wrap_for_inprocess_restart
  • Tests:
    tests/unit_tests/training/test_inprocess_restart.py
    ,
    tests/functional_tests/training/test_inprocess_restart.py
  • 配置:
    src/megatron/bridge/training/config.py
    InProcessRestartConfig
  • 运行时:
    src/megatron/bridge/training/inprocess_restart.py
  • 入口:
    src/megatron/bridge/training/pretrain.py
    maybe_wrap_for_inprocess_restart
  • 测试:
    tests/unit_tests/training/test_inprocess_restart.py
    tests/functional_tests/training/test_inprocess_restart.py

Preemption

抢占机制

  • Plugin:
    src/megatron/bridge/recipes/run_plugins.py
    PreemptionPlugin
  • Signal handler:
    src/megatron/bridge/training/utils/sig_utils.py
  • Tests:
    tests/unit_tests/recipes/test_run_plugins.py
  • 插件:
    src/megatron/bridge/recipes/run_plugins.py
    PreemptionPlugin
  • 信号处理器:
    src/megatron/bridge/training/utils/sig_utils.py
  • 测试:
    tests/unit_tests/recipes/test_run_plugins.py

Re-run state machine

重跑状态机

  • Config:
    src/megatron/bridge/training/config.py
    RerunStateMachineConfig
  • Init:
    src/megatron/bridge/training/initialize.py
    init_rerun_state
  • 配置:
    src/megatron/bridge/training/config.py
    RerunStateMachineConfig
  • 初始化:
    src/megatron/bridge/training/initialize.py
    init_rerun_state

Checkpointing

Checkpoint管理

  • Async save:
    src/megatron/bridge/training/checkpointing.py
    schedule_async_save
  • Local ckpt:
    src/megatron/bridge/training/checkpointing.py
    LocalCheckpointManager
  • Tests:
    tests/functional_tests/training/test_local_checkpointing.py
  • 异步保存:
    src/megatron/bridge/training/checkpointing.py
    schedule_async_save
  • 本地Checkpoint:
    src/megatron/bridge/training/checkpointing.py
    LocalCheckpointManager
  • 测试:
    tests/functional_tests/training/test_local_checkpointing.py

Pitfalls

注意事项

  1. ft_launcher, not torchrun: Direct
    FaultToleranceConfig
    requires
    ft_launcher
    . Using
    torchrun
    silently disables FT. For non-Slurm, set
    GROUP_RANK=0
    .
  2. Async save requires torch_dist:
    async_save=True
    only works with
    ckpt_format="torch_dist"
    . Other formats silently fail or error.
  3. IPR + NeMo-Run: In-process restart is not compatible with NeMo-Run or Slurm preemption plugins. Requires specific PyTorch/NCCL versions and env vars.
  4. NVRx vs legacy straggler: Two detectors exist. Use NVRx (
    nvrx_straggler
    ); do not enable both.
  5. stop_if_detected default: NVRx logs but does not stop training by default. Set
    stop_if_detected=True
    for automatic termination.
  6. NCCL watchdog vs hard_timeout: For IPR, NCCL watchdog timeout must exceed
    hard_timeout
    or PyTorch kills the process before recovery.
  7. Rerun state machine is alpha: Use
    check_for_nan_in_loss=True
    for NaN detection, but don't rely on full rerun workflows yet.
  1. 使用ft_launcher而非torchrun:直接配置
    FaultToleranceConfig
    需要使用
    ft_launcher
    。使用
    torchrun
    会静默禁用容错机制。非Slurm环境需设置
    GROUP_RANK=0
  2. 异步保存仅支持torch_dist格式
    async_save=True
    仅在
    ckpt_format="torch_dist"
    时生效。其他格式会静默失败或报错。
  3. 进程内重启与NeMo-Run不兼容:进程内重启无法与NeMo-Run或Slurm抢占插件配合使用。需要特定版本的PyTorch/NCCL以及环境变量。
  4. NVRx与旧版掉队者检测器:存在两种检测器。请使用NVRx版本(
    nvrx_straggler
    );不要同时启用两者。
  5. stop_if_detected默认行为:NVRx默认仅记录日志但不终止训练。需设置
    stop_if_detected=True
    以实现自动终止。
  6. NCCL看门狗与hard_timeout:对于进程内重启,NCCL看门狗超时时间必须大于
    hard_timeout
    ,否则PyTorch会在恢复前终止进程。
  7. 重跑状态机处于alpha阶段:可使用
    check_for_nan_in_loss=True
    检测NaN,但暂不要依赖完整的重跑工作流。

Verification

验证方法

Fault tolerance

容错机制

bash
./examples/training_features/nemo-mbridge-resiliency/fault_tolerance/run_fault_tolerance.sh
./examples/training_features/nemo-mbridge-resiliency/fault_tolerance/run_fault_tolerance.sh --simulate-fault
Look for
[FaultTolerance]
/
[RankMonitorServer]
log lines with section timeouts. Simulated fault should trigger restart from checkpoint.
bash
./examples/training_features/nemo-mbridge-resiliency/fault_tolerance/run_fault_tolerance.sh
./examples/training_features/nemo-mbridge-resiliency/fault_tolerance/run_fault_tolerance.sh --simulate-fault
查找包含
[FaultTolerance]
/
[RankMonitorServer]
的日志行,其中包含阶段超时信息。模拟故障应触发从Checkpoint重启。

Straggler detection

掉队者检测

bash
uv run python -m torch.distributed.run --nproc_per_node=2 \
    examples/training_features/nemo-mbridge-resiliency/straggler_detection/straggler_detection_example.py
Look for
GPU relative performance
and
GPU individual performance
reports with per-rank scores.
bash
uv run python -m torch.distributed.run --nproc_per_node=2 \
    examples/training_features/nemo-mbridge-resiliency/straggler_detection/straggler_detection_example.py
查找包含
GPU relative performance
GPU individual performance
的报告,其中包含各rank的性能分数。

Async checkpoint

异步Checkpoint

Look for
Scheduling async checkpoint save
in logs. Training iterations should continue while checkpoint files are being written.
在日志中查找
Scheduling async checkpoint save
信息。训练迭代应在Checkpoint文件写入时持续进行。

In-process restart

进程内重启

bash
pytest tests/functional_tests/training/test_inprocess_restart.py -v
Requires compatible PyTorch/NCCL versions.
bash
pytest tests/functional_tests/training/test_inprocess_restart.py -v
需要兼容版本的PyTorch/NCCL。