distributed-llm-pretraining-torchtitan

TorchTitan - PyTorch Native Distributed LLM Pretraining

Quick start

TorchTitan is PyTorch's official platform for large-scale LLM pretraining with composable 4D parallelism (FSDP2, TP, PP, CP), achieving 65%+ speedups over baselines on H100 GPUs.

**Installation**:
```bash
# From PyPI (stable)
pip install torchtitan

# From source (latest features, requires PyTorch nightly)
git clone https://github.com/pytorch/torchtitan
cd torchtitan
pip install -r requirements.txt
```

**Download tokenizer**:
```bash
python scripts/download_hf_assets.py --repo_id meta-llama/Llama-3.1-8B --assets tokenizer --hf_token=...
```

**Start training on 8 GPUs**:
```bash
CONFIG_FILE="./torchtitan/models/llama3/train_configs/llama3_8b.toml" ./run_train.sh
```

Common workflows

Workflow 1: Pretrain Llama 3.1 8B on single node

Copy this checklist:

Single Node Pretraining:
- [ ] Step 1: Download tokenizer
- [ ] Step 2: Configure training
- [ ] Step 3: Launch training
- [ ] Step 4: Monitor and checkpoint

**Step 1: Download tokenizer**

```bash
python scripts/download_hf_assets.py \
  --repo_id meta-llama/Llama-3.1-8B \
  --assets tokenizer \
  --hf_token=YOUR_HF_TOKEN
```

**Step 2: Configure training**

Edit or create a TOML config file:

```toml
# llama3_8b_custom.toml

[job]
dump_folder = "./outputs"
description = "Llama 3.1 8B training"

[model]
name = "llama3"
flavor = "8B"
hf_assets_path = "./assets/hf/Llama-3.1-8B"

[optimizer]
name = "AdamW"
lr = 3e-4

[lr_scheduler]
warmup_steps = 200

[training]
local_batch_size = 2
seq_len = 8192
max_norm = 1.0
steps = 1000
dataset = "c4"

[parallelism]
data_parallel_shard_degree = -1  # Use all GPUs for FSDP

[activation_checkpoint]
mode = "selective"
selective_ac_option = "op"

[checkpoint]
enable = true
folder = "checkpoint"
interval = 500
```

**Step 3: Launch training**

```bash
# 8 GPUs on single node
CONFIG_FILE="./llama3_8b_custom.toml" ./run_train.sh

# Or explicitly with torchrun
torchrun --nproc_per_node=8 \
  -m torchtitan.train \
  --job.config_file ./llama3_8b_custom.toml
```

**Step 4: Monitor and checkpoint**

TensorBoard logs are saved to `./outputs/tb/`:
```bash
tensorboard --logdir ./outputs/tb
```
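Checkpoints are written every `interval` steps into the `[checkpoint]` folder inside `dump_folder`. A quick way to confirm they are landing where expected, with paths assumed from the config above:

```bash
# With dump_folder = "./outputs" and folder = "checkpoint", saves appear as step-<N> directories
ls ./outputs/checkpoint/
# expected after 1000 steps with interval = 500: step-500  step-1000
```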

Workflow 2: Multi-node training with SLURM

Multi-Node Training:
- [ ] Step 1: Configure parallelism for scale
- [ ] Step 2: Set up SLURM script
- [ ] Step 3: Submit job
- [ ] Step 4: Resume from checkpoint

**Step 1: Configure parallelism for scale**

For a 70B model on 256 GPUs (32 nodes), the parallelism degrees multiply to the total GPU count: 32 (FSDP) × 8 (TP) × 1 (PP) × 1 (CP) = 256.

```toml
[parallelism]
data_parallel_shard_degree = 32  # FSDP across 32 ranks
tensor_parallel_degree = 8       # TP within node
pipeline_parallel_degree = 1     # No PP for 70B
context_parallel_degree = 1      # Increase for long sequences
```
**Step 2: Set up SLURM script**

```bash
#!/bin/bash
#SBATCH --job-name=llama70b
#SBATCH --nodes=32
#SBATCH --ntasks-per-node=1   # one torchrun launcher per node; torchrun spawns the 8 workers
#SBATCH --gpus-per-node=8

# Rendezvous endpoint for torchrun (first node in the allocation)
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500

srun torchrun \
  --nnodes=32 \
  --nproc_per_node=8 \
  --rdzv_backend=c10d \
  --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
  -m torchtitan.train \
  --job.config_file ./llama3_70b.toml
```

**Step 3: Submit job**

```bash
sbatch multinode_trainer.slurm
```

**Step 4: Resume from checkpoint**

Training auto-resumes if a checkpoint exists in the configured folder.
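Because resume is automatic, restarting after a failure or a time limit is just a matter of resubmitting the same job. A minimal sketch, assuming the SLURM script above is saved as `multinode_trainer.slurm` and `[checkpoint]` is enabled as in Workflow 1:

```bash
# The trainer picks up the existing step-<N> checkpoint under <dump_folder>/<checkpoint folder>
# and continues from it; no extra flags are needed.
sbatch multinode_trainer.slurm
```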

Workflow 3: Enable Float8 training for H100s

Float8 provides 30-50% speedup on H100 GPUs.

Float8 Training:
- [ ] Step 1: Install torchao
- [ ] Step 2: Configure Float8
- [ ] Step 3: Launch with compile

**Step 1: Install torchao**

```bash
USE_CPP=0 pip install git+https://github.com/pytorch/ao.git
```

**Step 2: Configure Float8**

Add to your TOML config:

```toml
[model]
converters = ["quantize.linear.float8"]

[quantize.linear.float8]
enable_fsdp_float8_all_gather = true
precompute_float8_dynamic_scale_for_fsdp = true
filter_fqns = ["output"]  # Exclude output layer

[compile]
enable = true
components = ["model", "loss"]
```

**Step 3: Launch with compile**

```bash
CONFIG_FILE="./llama3_8b.toml" ./run_train.sh \
  --model.converters="quantize.linear.float8" \
  --quantize.linear.float8.enable_fsdp_float8_all_gather \
  --compile.enable
```
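To verify the gain on your own hardware, run the same config with and without the Float8 converter and compare throughput. A sketch using only the flags shown above; it assumes both runs log to the same `./outputs/tb` directory, and the exact tokens-per-second tag in TensorBoard may differ:

```bash
# Baseline: bf16 + torch.compile
CONFIG_FILE="./llama3_8b.toml" ./run_train.sh --compile.enable

# Float8 run with the same config
CONFIG_FILE="./llama3_8b.toml" ./run_train.sh \
  --model.converters="quantize.linear.float8" \
  --quantize.linear.float8.enable_fsdp_float8_all_gather \
  --compile.enable

# Compare the throughput curves of the two runs
tensorboard --logdir ./outputs/tb
```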

Workflow 4: 4D parallelism for 405B models

4D Parallelism (FSDP + TP + PP + CP):
- [ ] Step 1: Create seed checkpoint
- [ ] Step 2: Configure 4D parallelism
- [ ] Step 3: Launch on 512 GPUs

**Step 1: Create seed checkpoint**

Required for consistent initialization across PP stages:

```bash
NGPU=1 CONFIG_FILE=./llama3_405b.toml ./run_train.sh \
  --checkpoint.enable \
  --checkpoint.create_seed_checkpoint \
  --parallelism.data_parallel_shard_degree 1 \
  --parallelism.tensor_parallel_degree 1 \
  --parallelism.pipeline_parallel_degree 1
```

**Step 2: Configure 4D parallelism**

The degrees multiply to the world size: 8 (FSDP) × 8 (TP) × 8 (PP) × 1 (CP) = 512 GPUs.

```toml
[parallelism]
data_parallel_shard_degree = 8   # FSDP
tensor_parallel_degree = 8       # TP within node
pipeline_parallel_degree = 8     # PP across nodes
context_parallel_degree = 1      # CP for long sequences

[training]
local_batch_size = 32
seq_len = 8192
```

**Step 3: Launch on 512 GPUs**

```bash
# 64 nodes x 8 GPUs = 512 GPUs
srun torchrun --nnodes=64 --nproc_per_node=8 \
  -m torchtitan.train \
  --job.config_file ./llama3_405b.toml
```
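On a SLURM cluster the launch above is typically wrapped in a batch script. A minimal sketch adapted from the Workflow 2 script; the job name, port, and rendezvous setup are assumptions for this example:

```bash
#!/bin/bash
#SBATCH --job-name=llama405b
#SBATCH --nodes=64
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8

# Rendezvous endpoint (first node in the allocation)
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500

srun torchrun --nnodes=64 --nproc_per_node=8 \
  --rdzv_backend=c10d \
  --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
  -m torchtitan.train \
  --job.config_file ./llama3_405b.toml
```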

When to use vs alternatives

**Use TorchTitan when**:
- Pretraining LLMs from scratch (8B to 405B+)
- You need a PyTorch-native solution without third-party dependencies
- You require composable 4D parallelism (FSDP2, TP, PP, CP)
- You are training on H100s with Float8 support
- You want checkpoints interoperable with torchtune/HuggingFace

**Use alternatives instead**:
- Megatron-LM: maximum performance for NVIDIA-only deployments
- DeepSpeed: broader ZeRO optimization ecosystem, inference support
- Axolotl/TRL: fine-tuning rather than pretraining
- LitGPT: educational, smaller-scale training

Common issues

**Issue: Out of memory on large models**

Enable activation checkpointing and reduce batch size:

```toml
[activation_checkpoint]
mode = "full"  # Instead of "selective"

[training]
local_batch_size = 1
```

Or use gradient accumulation:

```toml
[training]
local_batch_size = 1
global_batch_size = 32  # Accumulates gradients
```

Gradients are accumulated until the global batch size is reached; for example, with `local_batch_size = 1` on 8 data-parallel ranks, `global_batch_size = 32` corresponds to 4 micro-batches per optimizer step.

**Issue: TP causes high memory with async collectives**

Set environment variable:

```bash
export TORCH_NCCL_AVOID_RECORD_STREAMS=1
```

**Issue: Float8 training not faster**

Float8 only benefits large GEMMs. Filter small layers:

```toml
[quantize.linear.float8]
filter_fqns = ["attention.wk", "attention.wv", "output", "auto_filter_small_kn"]
```

**Issue: Checkpoint loading fails after parallelism change**

Use DCP's resharding capability:

```bash
# Convert sharded checkpoint to single file
python -m torch.distributed.checkpoint.format_utils \
  dcp_to_torch checkpoint/step-1000 checkpoint.pt
```

**Issue: Pipeline parallelism initialization fails**

Create a seed checkpoint first (see Workflow 4, Step 1).

Supported models

| Model | Sizes | Status |
|-------|-------|--------|
| Llama 3.1 | 8B, 70B, 405B | Production |
| Llama 4 | Various | Experimental |
| DeepSeek V3 | 16B, 236B, 671B (MoE) | Experimental |
| GPT-OSS | 20B, 120B (MoE) | Experimental |
| Qwen 3 | Various | Experimental |
| Flux | Diffusion | Experimental |

Performance benchmarks (H100)

| Model | GPUs | Parallelism | TPS/GPU | Techniques |
|-------|------|-------------|---------|------------|
| Llama 8B | 8 | FSDP | 5,762 | Baseline |
| Llama 8B | 8 | FSDP+compile+FP8 | 8,532 | +48% |
| Llama 70B | 256 | FSDP+TP+AsyncTP | 876 | 2D parallel |
| Llama 405B | 512 | FSDP+TP+PP | 128 | 3D parallel |

Advanced topics

- **FSDP2 configuration**: See references/fsdp.md for a detailed FSDP2 vs FSDP1 comparison and ZeRO equivalents.
- **Float8 training**: See references/float8.md for tensorwise vs rowwise scaling recipes.
- **Checkpointing**: See references/checkpoint.md for HuggingFace conversion and async checkpointing.
- **Adding custom models**: See references/custom-models.md for the TrainSpec protocol.

Resources
