pytorch-lightning

PyTorch Lightning - High-Level Training Framework


Quick start


PyTorch Lightning organizes PyTorch code to eliminate boilerplate while maintaining flexibility.

Installation:
```bash
pip install lightning
```

Convert PyTorch to Lightning (3 steps):
```python
import lightning as L
import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset
```

Step 1: Define LightningModule (organize your PyTorch code)


```python
class LitModel(L.LightningModule):
    def __init__(self, hidden_size=128):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(28 * 28, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 10)
        )

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.model(x)
        loss = nn.functional.cross_entropy(y_hat, y)
        self.log('train_loss', loss)  # Auto-logged to TensorBoard
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```
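The loss in `training_step` is standard cross-entropy. For intuition, the per-example math in plain Python (a sketch of the formula only, not torch's implementation; `cross_entropy` here is a hypothetical helper):

```python
import math

def cross_entropy(logits, target):
    """Negative log softmax probability of the target class (one example)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]
```

With two equal logits the model is maximally unsure between two classes, so the loss is log 2.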

Step 2: Create data


```python
train_loader = DataLoader(train_dataset, batch_size=32)
```

Step 3: Train with Trainer (handles everything else!)


```python
trainer = L.Trainer(max_epochs=10, accelerator='gpu', devices=2)
model = LitModel()
trainer.fit(model, train_loader)
```

**That's it!** Trainer handles:
- GPU/TPU/CPU switching
- Distributed training (DDP, FSDP, DeepSpeed)
- Mixed precision (FP16, BF16)
- Gradient accumulation
- Checkpointing
- Logging
- Progress bars

Common workflows


Workflow 1: From PyTorch to Lightning


Original PyTorch code:
```python
model = MyModel()
optimizer = torch.optim.Adam(model.parameters())
model.to('cuda')

for epoch in range(max_epochs):
    for batch in train_loader:
        batch = batch.to('cuda')
        optimizer.zero_grad()
        loss = model(batch)
        loss.backward()
        optimizer.step()
```

Lightning version:
```python
class LitModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = MyModel()

    def training_step(self, batch, batch_idx):
        loss = self.model(batch)  # No .to('cuda') needed!
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())
```

Train


```python
trainer = L.Trainer(max_epochs=10, accelerator='gpu')
trainer.fit(LitModel(), train_loader)
```

**Benefits**: 40+ lines → 15 lines, no device management, automatic distributed training

Workflow 2: Validation and testing


```python
class LitModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = MyModel()

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.model(x)
        loss = nn.functional.cross_entropy(y_hat, y)
        self.log('train_loss', loss)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.model(x)
        val_loss = nn.functional.cross_entropy(y_hat, y)
        acc = (y_hat.argmax(dim=1) == y).float().mean()
        self.log('val_loss', val_loss)
        self.log('val_acc', acc)

    def test_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.model(x)
        test_loss = nn.functional.cross_entropy(y_hat, y)
        self.log('test_loss', test_loss)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```
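The accuracy metric in `validation_step` is an argmax followed by an elementwise compare and a mean. The same computation on plain Python lists (hypothetical helpers for illustration, with lists standing in for tensors):

```python
def argmax(row):
    """Index of the largest logit, i.e. y_hat.argmax(dim=1) for one row."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy(logits, labels):
    """Fraction of rows whose argmax matches the label."""
    preds = [argmax(r) for r in logits]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)
```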

Train with validation


```python
trainer = L.Trainer(max_epochs=10)
trainer.fit(model, train_loader, val_loader)
```

Test


```python
trainer.test(model, test_loader)
```

**Automatic features**:
- Validation runs every epoch by default
- Metrics logged to TensorBoard
- Best model checkpointing based on val_loss

Workflow 3: Distributed training (DDP)


```python
# Same code as single GPU!
model = LitModel()

# 8 GPUs with DDP (automatic!)
trainer = L.Trainer(
    accelerator='gpu',
    devices=8,
    strategy='ddp'  # Or 'fsdp', 'deepspeed'
)
trainer.fit(model, train_loader)
```

**Launch**:
```bash
# Single command, Lightning handles the rest
python train.py
```

**No changes needed**:
- Automatic data distribution
- Gradient synchronization
- Multi-node support (just set `num_nodes=2`)
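Scaling the DDP setup to multiple machines is the `num_nodes` argument. A hedged configuration sketch (how each node launches the script depends on your cluster, e.g. torchrun or SLURM):

```python
import lightning as L

# Two nodes with 8 GPUs each = 16 processes; run the same script on every node
trainer = L.Trainer(
    accelerator='gpu',
    devices=8,       # GPUs per node
    num_nodes=2,     # total processes = devices * num_nodes
    strategy='ddp'
)
```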

Workflow 4: Callbacks for monitoring


```python
from lightning.pytorch.callbacks import ModelCheckpoint, EarlyStopping, LearningRateMonitor

# Create callbacks
checkpoint = ModelCheckpoint(
    monitor='val_loss',
    mode='min',
    save_top_k=3,
    filename='model-{epoch:02d}-{val_loss:.2f}'
)
early_stop = EarlyStopping(
    monitor='val_loss',
    patience=5,
    mode='min'
)
lr_monitor = LearningRateMonitor(logging_interval='epoch')

# Add to Trainer
trainer = L.Trainer(
    max_epochs=100,
    callbacks=[checkpoint, early_stop, lr_monitor]
)
trainer.fit(model, train_loader, val_loader)
```

**Result**:
- Auto-saves best 3 models
- Stops early if no improvement for 5 epochs
- Logs learning rate to TensorBoard
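EarlyStopping stops training once `val_loss` has not improved for `patience` epochs. A plain-Python sketch of that patience rule (hypothetical helper, not Lightning's implementation):

```python
def should_stop(val_losses, patience=5, min_delta=0.0):
    """True once the last `patience` epochs show no improvement over
    the best loss recorded before them (the patience rule, mode='min')."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    recent = val_losses[-patience:]
    return all(loss >= best_before - min_delta for loss in recent)
```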

Workflow 5: Learning rate scheduling


```python
class LitModel(L.LightningModule):
    # ... (training_step, etc.)

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)

        # Cosine annealing
        scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
            optimizer,
            T_max=100,
            eta_min=1e-5
        )

        return {
            'optimizer': optimizer,
            'lr_scheduler': {
                'scheduler': scheduler,
                'interval': 'epoch',  # Update per epoch
                'frequency': 1
            }
        }
```

```python
# Learning rate auto-logged!
trainer = L.Trainer(max_epochs=100)
trainer.fit(model, train_loader)
```
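CosineAnnealingLR follows a closed-form curve. A plain-Python sketch of the schedule it produces (the standard cosine annealing formula; `cosine_annealing_lr` is a hypothetical helper for illustration):

```python
import math

def cosine_annealing_lr(t, T_max=100, eta_max=1e-3, eta_min=1e-5):
    """Learning rate at epoch t: starts at eta_max, decays along a
    half-cosine, and reaches eta_min at epoch T_max."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / T_max))
```

At t=0 this gives the initial lr (1e-3), at t=T_max the floor (1e-5), with the midpoint halfway between.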

When to use vs alternatives


Use PyTorch Lightning when:
  • Want clean, organized code
  • Need production-ready training loops
  • Switching between single GPU, multi-GPU, TPU
  • Want built-in callbacks and logging
  • Team collaboration (standardized structure)
Key advantages:
  • Organized: Separates research code from engineering
  • Automatic: DDP, FSDP, DeepSpeed with 1 line
  • Callbacks: Modular training extensions
  • Reproducible: Less boilerplate = fewer bugs
  • Battle-tested: 1M+ downloads/month
Use alternatives instead:
  • Accelerate: Minimal changes to existing code, more flexibility
  • Ray Train: Multi-node orchestration, hyperparameter tuning
  • Raw PyTorch: Maximum control, learning purposes
  • Keras: TensorFlow ecosystem

Common issues


**Issue: Loss not decreasing**

Check data and model setup:
```python
# Add to training_step
def training_step(self, batch, batch_idx):
    if batch_idx == 0:
        print(f"Batch shape: {batch[0].shape}")
        print(f"Labels: {batch[1]}")
    loss = ...
    return loss
```

**Issue: Out of memory**

Reduce batch size or use gradient accumulation:
```python
trainer = L.Trainer(
    accumulate_grad_batches=4,  # Effective batch = batch_size × 4
    precision='bf16'  # Or 'fp16', reduces memory 50%
)
```
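What `accumulate_grad_batches=4` does, sketched in plain Python (a hypothetical toy function with gradients reduced to numbers; it only counts optimizer steps, not Lightning's internals):

```python
def train_with_accumulation(batch_grads, accumulate=4):
    """Count optimizer steps when gradients are summed over `accumulate`
    mini-batches before each update, giving an effective batch of
    accumulate * batch_size."""
    grad_sum = 0.0
    steps = 0
    for i, g in enumerate(batch_grads, start=1):
        grad_sum += g          # gradients accumulate across mini-batches
        if i % accumulate == 0:
            steps += 1         # one optimizer step per `accumulate` batches
            grad_sum = 0.0     # gradients are zeroed only after the step
    return steps
```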
**Issue: Validation not running**

Ensure you pass val_loader:
```python
# WRONG
trainer.fit(model, train_loader)

# CORRECT
trainer.fit(model, train_loader, val_loader)
```

**Issue: DDP spawns multiple processes unexpectedly**

Lightning auto-detects GPUs. Explicitly set devices:
```python
# Test on CPU first
trainer = L.Trainer(accelerator='cpu', devices=1)

# Then GPU
trainer = L.Trainer(accelerator='gpu', devices=1)
```

Advanced topics


Callbacks: See references/callbacks.md for EarlyStopping, ModelCheckpoint, custom callbacks, and callback hooks.
Distributed strategies: See references/distributed.md for DDP, FSDP, DeepSpeed ZeRO integration, multi-node setup.
Hyperparameter tuning: See references/hyperparameter-tuning.md for integration with Optuna, Ray Tune, and WandB sweeps.

Hardware requirements


  • CPU: Works (good for debugging)
  • Single GPU: Works
  • Multi-GPU: DDP (default), FSDP, or DeepSpeed
  • Multi-node: DDP, FSDP, DeepSpeed
  • TPU: Supported (8 cores)
  • Apple MPS: Supported
Precision options:
  • FP32 (default)
  • FP16 (V100, older GPUs)
  • BF16 (A100/H100, recommended)
  • FP8 (H100)
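Precision is selected via a Trainer argument. A hedged configuration sketch (exact precision strings vary by Lightning version; recent 2.x releases accept mixed-precision forms like '16-mixed' and 'bf16-mixed'):

```python
import lightning as L

# BF16 mixed precision (A100/H100, recommended)
trainer = L.Trainer(accelerator='gpu', devices=1, precision='bf16-mixed')

# FP16 mixed precision (V100 and older GPUs)
trainer = L.Trainer(accelerator='gpu', devices=1, precision='16-mixed')
```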

Resources
