pytorch-lightning

PyTorch Lightning - High-Level Training Framework


Quick start


PyTorch Lightning organizes PyTorch code to eliminate boilerplate while maintaining flexibility.

Installation:
```bash
pip install lightning
```

Convert PyTorch to Lightning (3 steps):
```python
import lightning as L
import torch
from torch import nn
from torch.utils.data import DataLoader, Dataset
```

Step 1: Define LightningModule (organize your PyTorch code)


```python
class LitModel(L.LightningModule):
    def __init__(self, hidden_size=128):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(28 * 28, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 10)
        )

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.model(x)
        loss = nn.functional.cross_entropy(y_hat, y)
        self.log('train_loss', loss)  # Auto-logged to TensorBoard
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```
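The loss in `training_step` is standard cross-entropy. For intuition, the per-example math in plain Python (a sketch of the formula only, not torch's implementation; `cross_entropy` here is a hypothetical helper):

```python
import math

def cross_entropy(logits, target):
    """Negative log softmax probability of the target class (one example)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]
```

With two equal logits the model is maximally unsure between two classes, so the loss is log 2.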

Step 2: Create data


```python
train_loader = DataLoader(train_dataset, batch_size=32)
```

Step 3: Train with Trainer (handles everything else!)


```python
trainer = L.Trainer(max_epochs=10, accelerator='gpu', devices=2)
model = LitModel()
trainer.fit(model, train_loader)
```

**That's it!** Trainer handles:
- GPU/TPU/CPU switching
- Distributed training (DDP, FSDP, DeepSpeed)
- Mixed precision (FP16, BF16)
- Gradient accumulation
- Checkpointing
- Logging
- Progress bars

Common workflows


Workflow 1: From PyTorch to Lightning


Original PyTorch code:
```python
model = MyModel()
optimizer = torch.optim.Adam(model.parameters())
model.to('cuda')

for epoch in range(max_epochs):
    for batch in train_loader:
        batch = batch.to('cuda')
        optimizer.zero_grad()
        loss = model(batch)
        loss.backward()
        optimizer.step()
```

Lightning version:
```python
class LitModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = MyModel()

    def training_step(self, batch, batch_idx):
        loss = self.model(batch)  # No .to('cuda') needed!
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())
```

Train


```python
trainer = L.Trainer(max_epochs=10, accelerator='gpu')
trainer.fit(LitModel(), train_loader)
```

**Benefits**: 40+ lines → 15 lines, no device management, automatic distributed training

Workflow 2: Validation and testing


```python
class LitModel(L.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = MyModel()

    def training_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.model(x)
        loss = nn.functional.cross_entropy(y_hat, y)
        self.log('train_loss', loss)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.model(x)
        val_loss = nn.functional.cross_entropy(y_hat, y)
        acc = (y_hat.argmax(dim=1) == y).float().mean()
        self.log('val_loss', val_loss)
        self.log('val_acc', acc)

    def test_step(self, batch, batch_idx):
        x, y = batch
        y_hat = self.model(x)
        test_loss = nn.functional.cross_entropy(y_hat, y)
        self.log('test_loss', test_loss)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```
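The accuracy metric in `validation_step` is an argmax followed by an elementwise compare and a mean. The same computation on plain Python lists (hypothetical helpers for illustration, with lists standing in for tensors):

```python
def argmax(row):
    """Index of the largest logit, i.e. y_hat.argmax(dim=1) for one row."""
    return max(range(len(row)), key=row.__getitem__)

def accuracy(logits, labels):
    """Fraction of rows whose argmax matches the label."""
    preds = [argmax(r) for r in logits]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)
```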

Train with validation


```python
trainer = L.Trainer(max_epochs=10)
trainer.fit(model, train_loader, val_loader)
```

Test


```python
trainer.test(model, test_loader)
```

**Automatic features**:
- Validation runs every epoch by default
- Metrics logged to TensorBoard
- Best model checkpointing based on val_loss

Workflow 3: Distributed training (DDP)


```python
# Same code as single GPU!
model = LitModel()

# 8 GPUs with DDP (automatic!)
trainer = L.Trainer(
    accelerator='gpu',
    devices=8,
    strategy='ddp'  # Or 'fsdp', 'deepspeed'
)
trainer.fit(model, train_loader)
```

**Launch**:
```bash
# Single command, Lightning handles the rest
python train.py
```

**No changes needed**:
- Automatic data distribution
- Gradient synchronization
- Multi-node support (just set `num_nodes=2`)
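Scaling the DDP setup to multiple machines is the `num_nodes` argument. A hedged configuration sketch (how each node launches the script depends on your cluster, e.g. torchrun or SLURM):

```python
import lightning as L

# Two nodes with 8 GPUs each = 16 processes; run the same script on every node
trainer = L.Trainer(
    accelerator='gpu',
    devices=8,       # GPUs per node
    num_nodes=2,     # total processes = devices * num_nodes
    strategy='ddp'
)
```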

Workflow 4: Callbacks for monitoring


```python
from lightning.pytorch.callbacks import ModelCheckpoint, EarlyStopping, LearningRateMonitor

# Create callbacks
checkpoint = ModelCheckpoint(
    monitor='val_loss',
    mode='min',
    save_top_k=3,
    filename='model-{epoch:02d}-{val_loss:.2f}'
)
early_stop = EarlyStopping(
    monitor='val_loss',
    patience=5,
    mode='min'
)
lr_monitor = LearningRateMonitor(logging_interval='epoch')

# Add to Trainer
trainer = L.Trainer(
    max_epochs=100,
    callbacks=[checkpoint, early_stop, lr_monitor]
)
trainer.fit(model, train_loader, val_loader)
```

**Result**:
- Auto-saves best 3 models
- Stops early if no improvement for 5 epochs
- Logs learning rate to TensorBoard
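EarlyStopping stops training once `val_loss` has not improved for `patience` epochs. A plain-Python sketch of that patience rule (hypothetical helper, not Lightning's implementation):

```python
def should_stop(val_losses, patience=5, min_delta=0.0):
    """True once the last `patience` epochs show no improvement over
    the best loss recorded before them (the patience rule, mode='min')."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    recent = val_losses[-patience:]
    return all(loss >= best_before - min_delta for loss in recent)
```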

Workflow 5: Learning rate scheduling


```python
class LitModel(L.LightningModule):
    # ... (training_step, etc.)

    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.parameters(), lr=1e-3)

        # Cosine annealing
        scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
            optimizer,
            T_max=100,
            eta_min=1e-5
        )

        return {
            'optimizer': optimizer,
            'lr_scheduler': {
                'scheduler': scheduler,
                'interval': 'epoch',  # Update per epoch
                'frequency': 1
            }
        }
```

```python
# Learning rate auto-logged!
trainer = L.Trainer(max_epochs=100)
trainer.fit(model, train_loader)
```
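CosineAnnealingLR follows a closed-form curve. A plain-Python sketch of the schedule it produces (the standard cosine annealing formula; `cosine_annealing_lr` is a hypothetical helper for illustration):

```python
import math

def cosine_annealing_lr(t, T_max=100, eta_max=1e-3, eta_min=1e-5):
    """Learning rate at epoch t: starts at eta_max, decays along a
    half-cosine, and reaches eta_min at epoch T_max."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / T_max))
```

At t=0 this gives the initial lr (1e-3), at t=T_max the floor (1e-5), with the midpoint halfway between.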

When to use vs alternatives


Use PyTorch Lightning when:
  • Want clean, organized code
  • Need production-ready training loops
  • Switching between single GPU, multi-GPU, TPU
  • Want built-in callbacks and logging
  • Team collaboration (standardized structure)
Key advantages:
  • Organized: Separates research code from engineering
  • Automatic: DDP, FSDP, DeepSpeed with 1 line
  • Callbacks: Modular training extensions
  • Reproducible: Less boilerplate = fewer bugs
  • Battle-tested: 1M+ downloads/month
Use alternatives instead:
  • Accelerate: Minimal changes to existing code, more flexibility
  • Ray Train: Multi-node orchestration, hyperparameter tuning
  • Raw PyTorch: Maximum control, learning purposes
  • Keras: TensorFlow ecosystem

Common issues


**Issue: Loss not decreasing**

Check data and model setup:
```python
# Add to training_step
def training_step(self, batch, batch_idx):
    if batch_idx == 0:
        print(f"Batch shape: {batch[0].shape}")
        print(f"Labels: {batch[1]}")
    loss = ...
    return loss
```

**Issue: Out of memory**

Reduce batch size or use gradient accumulation:
```python
trainer = L.Trainer(
    accumulate_grad_batches=4,  # Effective batch = batch_size × 4
    precision='bf16'  # Or 'fp16', reduces memory 50%
)
```
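What `accumulate_grad_batches=4` does, sketched in plain Python (a hypothetical toy function with gradients reduced to numbers; it only counts optimizer steps, not Lightning's internals):

```python
def train_with_accumulation(batch_grads, accumulate=4):
    """Count optimizer steps when gradients are summed over `accumulate`
    mini-batches before each update, giving an effective batch of
    accumulate * batch_size."""
    grad_sum = 0.0
    steps = 0
    for i, g in enumerate(batch_grads, start=1):
        grad_sum += g          # gradients accumulate across mini-batches
        if i % accumulate == 0:
            steps += 1         # one optimizer step per `accumulate` batches
            grad_sum = 0.0     # gradients are zeroed only after the step
    return steps
```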
**Issue: Validation not running**

Ensure you pass val_loader:
```python
# WRONG
trainer.fit(model, train_loader)

# CORRECT
trainer.fit(model, train_loader, val_loader)
```

**Issue: DDP spawns multiple processes unexpectedly**

Lightning auto-detects GPUs. Explicitly set devices:
```python
# Test on CPU first
trainer = L.Trainer(accelerator='cpu', devices=1)

# Then GPU
trainer = L.Trainer(accelerator='gpu', devices=1)
```

Advanced topics


Callbacks: See references/callbacks.md for EarlyStopping, ModelCheckpoint, custom callbacks, and callback hooks.
Distributed strategies: See references/distributed.md for DDP, FSDP, DeepSpeed ZeRO integration, multi-node setup.
Hyperparameter tuning: See references/hyperparameter-tuning.md for integration with Optuna, Ray Tune, and WandB sweeps.

Hardware requirements


  • CPU: Works (good for debugging)
  • Single GPU: Works
  • Multi-GPU: DDP (default), FSDP, or DeepSpeed
  • Multi-node: DDP, FSDP, DeepSpeed
  • TPU: Supported (8 cores)
  • Apple MPS: Supported
Precision options:
  • FP32 (default)
  • FP16 (V100, older GPUs)
  • BF16 (A100/H100, recommended)
  • FP8 (H100)
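Precision is selected via a Trainer argument. A hedged configuration sketch (exact precision strings vary by Lightning version; recent 2.x releases accept mixed-precision forms like '16-mixed' and 'bf16-mixed'):

```python
import lightning as L

# BF16 mixed precision (A100/H100, recommended)
trainer = L.Trainer(accelerator='gpu', devices=1, precision='bf16-mixed')

# FP16 mixed precision (V100 and older GPUs)
trainer = L.Trainer(accelerator='gpu', devices=1, precision='16-mixed')
```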

Resources
