# PyTorch Lightning - High-Level Training Framework
## Quick start
PyTorch Lightning organizes PyTorch code to eliminate boilerplate while maintaining flexibility.
Installation:

```bash
pip install lightning
```

Convert PyTorch to Lightning (3 steps):

```python
import lightning as L
import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset

# Step 1: Define LightningModule (organize your PyTorch code)
class LitModel(L.LightningModule):
    def __init__(self, hidden_size=128):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(28 * 28, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 10)
        )

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.model(x)
        loss = nn.functional.cross_entropy(y_hat, y)
        self.log('train_loss', loss)  # Auto-logged to TensorBoard
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

# Step 2: Create data
train_loader = DataLoader(train_dataset, batch_size=32)

# Step 3: Train with Trainer (handles everything else!)
trainer = L.Trainer(max_epochs=10, accelerator='gpu', devices=2)
model = LitModel()
trainer.fit(model, train_loader)
```
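
The snippet above assumes a `train_dataset` already exists. For a quick smoke test, a stand-in with MNIST-like shapes might look like this (the random tensors are purely illustrative):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Illustrative stand-in dataset: 1,000 random "images" flattened to 28*28 features
X = torch.randn(1000, 28 * 28)
y = torch.randint(0, 10, (1000,))
train_dataset = TensorDataset(X, y)
train_loader = DataLoader(train_dataset, batch_size=32)
```
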
**That's it!** Trainer handles:
- GPU/TPU/CPU switching
- Distributed training (DDP, FSDP, DeepSpeed)
- Mixed precision (FP16, BF16)
- Gradient accumulation
- Checkpointing
- Logging
- Progress bars

## Common workflows

### Workflow 1: From PyTorch to Lightning
Original PyTorch code:
```python
model = MyModel()
optimizer = torch.optim.Adam(model.parameters())
model.to('cuda')

for epoch in range(max_epochs):
    for batch in train_loader:
        batch = batch.to('cuda')
        optimizer.zero_grad()
        loss = model(batch)
        loss.backward()
        optimizer.step()
```

Lightning version:
```python
class LitModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = MyModel()

    def training_step(self, batch, batch_idx):
        loss = self.model(batch)  # No .to('cuda') needed!
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())

# Train
trainer = L.Trainer(max_epochs=10, accelerator='gpu')
trainer.fit(LitModel(), train_loader)
```

**Benefits**: 40+ lines → 15 lines, no device management, automatic distributed training

### Workflow 2: Validation and testing
```python
class LitModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = MyModel()

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.model(x)
        loss = nn.functional.cross_entropy(y_hat, y)
        self.log('train_loss', loss)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.model(x)
        val_loss = nn.functional.cross_entropy(y_hat, y)
        acc = (y_hat.argmax(dim=1) == y).float().mean()
        self.log('val_loss', val_loss)
        self.log('val_acc', acc)

    def test_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.model(x)
        test_loss = nn.functional.cross_entropy(y_hat, y)
        self.log('test_loss', test_loss)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

# Train with validation
trainer = L.Trainer(max_epochs=10)
trainer.fit(model, train_loader, val_loader)

# Test
trainer.test(model, test_loader)
```

**Automatic features**:
- Validation runs every epoch by default (the cadence is configurable; see the sketch after this list)
- Metrics logged to TensorBoard
- Checkpointing of the last epoch by default (add ModelCheckpoint to keep the best model by val_loss; see Workflow 4)
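
The validation cadence is a Trainer setting; both arguments below are part of the stable Trainer API:

```python
# Validate every 2 epochs instead of every epoch
trainer = L.Trainer(max_epochs=10, check_val_every_n_epoch=2)

# Or validate 4 times per training epoch (useful for very large datasets)
trainer = L.Trainer(max_epochs=10, val_check_interval=0.25)
```
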
### Workflow 3: Distributed training (DDP)
```python
# Same code as single GPU!
model = LitModel()

# 8 GPUs with DDP (automatic!)
trainer = L.Trainer(
    accelerator='gpu',
    devices=8,
    strategy='ddp'  # Or 'fsdp', 'deepspeed'
)
trainer.fit(model, train_loader)
```

**Launch**:

```bash
# Single command, Lightning handles the rest
python train.py
```

**No changes needed**:
- Automatic data distribution
- Gradient synchronization
- Multi-node support (just set `num_nodes=2`; a minimal sketch follows this list)
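
A sketch of the multi-node case, assuming 2 nodes with 8 GPUs each; the cluster launcher (e.g. SLURM or torchrun) is expected to set the rank environment variables on each node:

```python
trainer = L.Trainer(
    accelerator='gpu',
    devices=8,      # GPUs per node
    num_nodes=2,    # total number of machines
    strategy='ddp',
)
trainer.fit(model, train_loader)
```
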
### Workflow 4: Callbacks for monitoring
```python
from lightning.pytorch.callbacks import ModelCheckpoint, EarlyStopping, LearningRateMonitor

# Create callbacks
checkpoint = ModelCheckpoint(
    monitor='val_loss',
    mode='min',
    save_top_k=3,
    filename='model-{epoch:02d}-{val_loss:.2f}'
)
early_stop = EarlyStopping(
    monitor='val_loss',
    patience=5,
    mode='min'
)
lr_monitor = LearningRateMonitor(logging_interval='epoch')

# Add to Trainer
trainer = L.Trainer(
    max_epochs=100,
    callbacks=[checkpoint, early_stop, lr_monitor]
)
trainer.fit(model, train_loader, val_loader)
```
**Result**:
- Auto-saves best 3 models
- Stops early if no improvement for 5 epochs
- Logs learning rate to TensorBoard
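
To reload the best of the saved checkpoints after training, the `best_model_path` attribute on ModelCheckpoint points at the top-ranked file:

```python
# Path of the best checkpoint (lowest val_loss) seen during training
best_path = checkpoint.best_model_path
model = LitModel.load_from_checkpoint(best_path)
```
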
### Workflow 5: Learning rate scheduling
```python
class LitModel(L.LightningModule):
    # ... (training_step, etc.)

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
        # Cosine annealing
        scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
            optimizer,
            T_max=100,
            eta_min=1e-5
        )
        return {
            'optimizer': optimizer,
            'lr_scheduler': {
                'scheduler': scheduler,
                'interval': 'epoch',  # Update once per epoch
                'frequency': 1
            }
        }

# Learning rate auto-logged!
trainer = L.Trainer(max_epochs=100)
trainer.fit(model, train_loader)
```
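
For schedulers that should step per optimizer update rather than per epoch (linear warmup, for example), the same dict accepts `'interval': 'step'`. A sketch; the 500-step warmup length is an illustrative value, not from the original:

```python
def configure_optimizers(self):
    optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)
    # Linear warmup over the first 500 steps, then constant LR
    warmup = torch.optim.lr_scheduler.LambdaLR(
        optimizer, lr_lambda=lambda step: min(1.0, (step + 1) / 500)
    )
    return {
        'optimizer': optimizer,
        'lr_scheduler': {'scheduler': warmup, 'interval': 'step'},
    }
```
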
## When to use vs alternatives
Use PyTorch Lightning when:
- Want clean, organized code
- Need production-ready training loops
- Switching between single GPU, multi-GPU, TPU
- Want built-in callbacks and logging
- Team collaboration (standardized structure)
Key advantages:
- Organized: Separates research code from engineering
- Automatic: DDP, FSDP, DeepSpeed with 1 line
- Callbacks: Modular training extensions
- Reproducible: Less boilerplate = fewer bugs
- Tested: 1M+ downloads/month, battle-tested
Use alternatives instead:
- Accelerate: Minimal changes to existing code, more flexibility
- Ray Train: Multi-node orchestration, hyperparameter tuning
- Raw PyTorch: Maximum control, learning purposes
- Keras: TensorFlow ecosystem
## Common issues
**Issue: Loss not decreasing**
Check data and model setup:
```python
# Add to training_step
def training_step(self, batch, batch_idx):
    if batch_idx == 0:
        print(f"Batch shape: {batch[0].shape}")
        print(f"Labels: {batch[1]}")
    loss = ...
    return loss
```
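
A complementary sanity check is built into the Trainer: try to overfit a single batch. If the loss cannot be driven toward zero, the model or loss wiring is suspect. `overfit_batches` is a documented Trainer flag:

```python
# Overfit 1 batch; loss should approach zero if the pipeline is sound
trainer = L.Trainer(max_epochs=100, overfit_batches=1)
trainer.fit(model, train_loader)
```
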
**Issue: Out of memory**
Reduce batch size or use gradient accumulation:
```python
trainer = L.Trainer(
    accumulate_grad_batches=4,  # Effective batch = batch_size × 4
    precision='bf16'  # Or 'fp16'; mixed precision roughly halves activation memory
)
```

**Issue: Validation not running**
Ensure you pass val_loader:
```python
# WRONG
trainer.fit(model, train_loader)

# CORRECT
trainer.fit(model, train_loader, val_loader)
```

**Issue: DDP spawns multiple processes unexpectedly**
Lightning auto-detects GPUs. Explicitly set devices:
```python
# Test on CPU first
trainer = L.Trainer(accelerator='cpu', devices=1)

# Then GPU
trainer = L.Trainer(accelerator='gpu', devices=1)
```

## Advanced topics
Callbacks: See references/callbacks.md for EarlyStopping, ModelCheckpoint, custom callbacks, and callback hooks; a minimal custom-callback sketch follows below.
Distributed strategies: See references/distributed.md for DDP, FSDP, DeepSpeed ZeRO integration, multi-node setup.
Hyperparameter tuning: See references/hyperparameter-tuning.md for integration with Optuna, Ray Tune, and WandB sweeps.
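
As a taste of what the callbacks reference covers, a minimal custom callback might look like this (the hook signature follows the public Callback API; the printed metrics dict is whatever `self.log` has recorded):

```python
from lightning.pytorch.callbacks import Callback

class PrintEpochMetrics(Callback):
    """Illustrative custom callback: print logged metrics at each epoch end."""
    def on_train_epoch_end(self, trainer, pl_module):
        print(f"epoch {trainer.current_epoch}: {dict(trainer.callback_metrics)}")

trainer = L.Trainer(max_epochs=10, callbacks=[PrintEpochMetrics()])
```
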
## Hardware requirements
- CPU: Works (good for debugging)
- Single GPU: Works
- Multi-GPU: DDP (default), FSDP, or DeepSpeed
- Multi-node: DDP, FSDP, DeepSpeed
- TPU: Supported (8 cores)
- Apple MPS: Supported
Precision options (a config sketch follows this list):
- FP32 (default)
- FP16 (V100, older GPUs)
- BF16 (A100/H100, recommended)
- FP8 (H100)
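
Precision is a single Trainer argument; the string values below are the Lightning 2.x spellings (older aliases like 'bf16' still work). FP8 on H100 goes through a separate Transformer Engine integration; check the precision docs for the current flag.

```python
trainer = L.Trainer(precision='32-true')     # FP32 (default)
trainer = L.Trainer(precision='16-mixed')    # FP16 mixed precision (V100-era GPUs)
trainer = L.Trainer(precision='bf16-mixed')  # BF16 mixed precision (A100/H100, recommended)
```
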
## Resources
- Docs: https://lightning.ai/docs/pytorch/stable/
- GitHub: https://github.com/Lightning-AI/pytorch-lightning ⭐ 29,000+
- Version: 2.5.5+
- Examples: https://github.com/Lightning-AI/pytorch-lightning/tree/master/examples
- Discord: https://discord.gg/lightning-ai
- Used by: Kaggle winners, research labs, production teams