fine-tuning-with-trl


TRL - Transformer Reinforcement Learning

Quick start

TRL provides post-training methods for aligning language models with human preferences.
Installation:
```bash
pip install trl transformers datasets peft accelerate
```

Supervised Fine-Tuning (instruction tuning):

```python
from trl import SFTTrainer

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset,  # Prompt-completion pairs
)
trainer.train()
```

DPO (align with preferences):

```python
from trl import DPOTrainer, DPOConfig

config = DPOConfig(output_dir="model-dpo", beta=0.1)
trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=preference_dataset,  # chosen/rejected pairs
    processing_class=tokenizer
)
trainer.train()
```

Common workflows

Workflow 1: Full RLHF pipeline (SFT → Reward Model → PPO)

Complete pipeline from base model to human-aligned model.
Copy this checklist:
RLHF Training:
- [ ] Step 1: Supervised fine-tuning (SFT)
- [ ] Step 2: Train reward model
- [ ] Step 3: PPO reinforcement learning
- [ ] Step 4: Evaluate aligned model
**Step 1: Supervised fine-tuning**
Train base model on instruction-following data:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

# Load model
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B")

# Load instruction dataset
dataset = load_dataset("trl-lib/Capybara", split="train")

# Configure training
training_args = SFTConfig(
    output_dir="Qwen2.5-0.5B-SFT",
    per_device_train_batch_size=4,
    num_train_epochs=1,
    learning_rate=2e-5,
    logging_steps=10,
    save_strategy="epoch"
)

# Train
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer
)
trainer.train()
trainer.save_model()
```

**Step 2: Train reward model**

Train model to predict human preferences:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardTrainer, RewardConfig
from datasets import load_dataset

# Load SFT model as base
model = AutoModelForSequenceClassification.from_pretrained(
    "Qwen2.5-0.5B-SFT",
    num_labels=1  # Single reward score
)
tokenizer = AutoTokenizer.from_pretrained("Qwen2.5-0.5B-SFT")

# Load preference data (chosen/rejected pairs)
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

# Configure training
training_args = RewardConfig(
    output_dir="Qwen2.5-0.5B-Reward",
    per_device_train_batch_size=2,
    num_train_epochs=1,
    learning_rate=1e-5
)

# Train reward model
trainer = RewardTrainer(
    model=model,
    args=training_args,
    processing_class=tokenizer,
    train_dataset=dataset
)
trainer.train()
trainer.save_model()
```
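
Under the hood, the reward model is trained with a pairwise preference loss: it should score the chosen response above the rejected one. Here is a minimal sketch of that Bradley-Terry-style objective (an illustration, not TRL's internal implementation):

```python
import torch
import torch.nn.functional as F

def pairwise_reward_loss(chosen_scores, rejected_scores):
    # The reward model should assign the chosen response a higher score
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy example with scalar reward scores for two preference pairs
print(pairwise_reward_loss(torch.tensor([1.2, 0.3]), torch.tensor([0.4, 0.5])))
```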

**Step 3: PPO reinforcement learning**

Optimize policy using reward model:

```bash
python -m trl.scripts.ppo \
    --model_name_or_path Qwen2.5-0.5B-SFT \
    --reward_model_path Qwen2.5-0.5B-Reward \
    --dataset_name trl-internal-testing/descriptiveness-sentiment-trl-style \
    --output_dir Qwen2.5-0.5B-PPO \
    --learning_rate 3e-6 \
    --per_device_train_batch_size 64 \
    --total_episodes 10000
```

**Step 4: Evaluate**

```python
from transformers import pipeline

# Load aligned model
generator = pipeline("text-generation", model="Qwen2.5-0.5B-PPO")

# Test
prompt = "Explain quantum computing to a 10-year-old"
output = generator(prompt, max_length=200)[0]["generated_text"]
print(output)
```

Workflow 2: Simple preference alignment with DPO

Align model with preferences without reward model.
Copy this checklist:
DPO Training:
- [ ] Step 1: Prepare preference dataset
- [ ] Step 2: Configure DPO
- [ ] Step 3: Train with DPOTrainer
- [ ] Step 4: Evaluate alignment
**Step 1: Prepare preference dataset**
Dataset format:
```json
{
  "prompt": "What is the capital of France?",
  "chosen": "The capital of France is Paris.",
  "rejected": "I don't know."
}
```
Load dataset:
```python
from datasets import load_dataset

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

# Or load your own
dataset = load_dataset("json", data_files="preferences.json")
```

**Step 2: Configure DPO**

```python
from trl import DPOConfig

config = DPOConfig(
    output_dir="Qwen2.5-0.5B-DPO",
    per_device_train_batch_size=4,
    num_train_epochs=1,
    learning_rate=5e-7,
    beta=0.1,  # KL penalty strength
    max_prompt_length=512,
    max_length=1024,
    logging_steps=10
)
```

**Step 3: Train with DPOTrainer**

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOTrainer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    processing_class=tokenizer
)

trainer.train()
trainer.save_model()
```

CLI alternative:

```bash
trl dpo \
    --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct \
    --dataset_name argilla/Capybara-Preferences \
    --output_dir Qwen2.5-0.5B-DPO \
    --per_device_train_batch_size 4 \
    --learning_rate 5e-7 \
    --beta 0.1
```

Workflow 3: Memory-efficient online RL with GRPO

Train with reinforcement learning using minimal memory.
Copy this checklist:
GRPO Training:
- [ ] Step 1: Define reward function
- [ ] Step 2: Configure GRPO
- [ ] Step 3: Train with GRPOTrainer
**Step 1: Define reward function**
```python
def reward_function(completions, **kwargs):
    """
    Compute rewards for completions.

    Args:
        completions: List of generated texts

    Returns:
        List of reward scores (floats)
    """
    rewards = []
    for completion in completions:
        # Example: reward based on length and unique words
        score = len(completion.split())  # Favor longer responses
        score += len(set(completion.lower().split()))  # Reward unique words
        rewards.append(score)
    return rewards
```

Or use a reward model:

```python
from transformers import pipeline

reward_model = pipeline("text-classification", model="reward-model-path")

def reward_from_model(completions, prompts, **kwargs):
    # Combine prompt + completion
    full_texts = [p + c for p, c in zip(prompts, completions)]
    # Get reward scores
    results = reward_model(full_texts)
    return [r["score"] for r in results]
```

**Step 2: Configure GRPO**

```python
from trl import GRPOConfig

config = GRPOConfig(
    output_dir="Qwen2-GRPO",
    per_device_train_batch_size=4,
    num_train_epochs=1,
    learning_rate=1e-5,
    num_generations=4,  # Generate 4 completions per prompt
    max_completion_length=128
)
```
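
For intuition, GRPO scores the `num_generations` completions sampled for each prompt and normalizes every reward against its own group, so no separate value model has to be kept in memory. The snippet below is a simplified illustration of that group-relative advantage idea, not TRL's internal implementation:

```python
import statistics

def group_relative_advantages(rewards, num_generations=4):
    # Rewards arrive as a flat list: num_generations completions per prompt
    advantages = []
    for start in range(0, len(rewards), num_generations):
        group = rewards[start:start + num_generations]
        mean = statistics.mean(group)
        std = statistics.pstdev(group) or 1.0  # guard against a zero-variance group
        advantages.extend((r - mean) / std for r in group)
    return advantages

# Example: two prompts, four completions each
print(group_relative_advantages([1.0, 2.0, 3.0, 4.0, 0.0, 0.0, 1.0, 1.0]))
```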

**Step 3: Train with GRPOTrainer**

```python
from datasets import load_dataset
from trl import GRPOTrainer

# Load prompt-only dataset
dataset = load_dataset("trl-lib/tldr", split="train")

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_function,  # Your reward function
    args=config,
    train_dataset=dataset
)
trainer.train()
```

**CLI**:
```bash
trl grpo \
    --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
    --dataset_name trl-lib/tldr \
    --output_dir Qwen2-GRPO \
    --num_generations 4
```

When to use vs alternatives

Use TRL when:
  • Need to align model with human preferences
  • Have preference data (chosen/rejected pairs)
  • Want to use reinforcement learning (PPO, GRPO)
  • Need reward model training
  • Doing RLHF (full pipeline)
Method selection:
  • SFT: Have prompt-completion pairs, want basic instruction following
  • DPO: Have preferences, want simple alignment (no reward model needed)
  • PPO: Have reward model, need maximum control over RL
  • GRPO: Memory-constrained, want online RL
  • Reward Model: Building RLHF pipeline, need to score generations
Use alternatives instead:
  • HuggingFace Trainer: Basic fine-tuning without RL
  • Axolotl: YAML-based training configuration
  • LitGPT: Educational, minimal fine-tuning
  • Unsloth: Fast LoRA training

Common issues

**Issue: OOM during DPO training**
Reduce batch size and sequence length:
```python
config = DPOConfig(
    per_device_train_batch_size=1,  # Reduce from 4
    max_length=512,  # Reduce from 1024
    gradient_accumulation_steps=8  # Maintain effective batch
)
```

Or use gradient checkpointing:

```python
model.gradient_checkpointing_enable()
```
**Issue: Poor alignment quality**

Tune beta parameter:

```python
# Higher beta = more conservative (stays closer to reference)
config = DPOConfig(beta=0.5)  # Default 0.1

# Lower beta = more aggressive alignment
config = DPOConfig(beta=0.01)
```
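
For intuition about why beta behaves this way, here is a minimal sketch of the standard DPO objective (a simplified illustration, not TRL's implementation): beta scales the implicit rewards computed against the frozen reference model, so larger values penalize drifting away from it more strongly.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards: beta-scaled log-prob ratios against the reference model
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the chosen completion's implicit reward above the rejected one's
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy example: summed log-probs for one chosen/rejected pair
print(dpo_loss(torch.tensor([-10.0]), torch.tensor([-12.0]),
               torch.tensor([-11.0]), torch.tensor([-11.5])))
```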

**Issue: Reward model not learning**

Check loss type and learning rate:
```python
config = RewardConfig(
    learning_rate=1e-5,  # Try different LR
    num_train_epochs=3  # Train longer
)
```

Ensure preference dataset has clear winners:

```python
# Verify dataset
print(dataset[0])
# Should have clear chosen > rejected
```


**Issue: PPO training unstable**

Adjust KL coefficient:
```python
config = PPOConfig(
    kl_coef=0.1,  # Increase from 0.05
    cliprange=0.1  # Reduce from 0.2
)
```

Advanced topics

SFT training guide: See references/sft-training.md for dataset formats, chat templates, packing strategies, and multi-GPU training.
DPO variants: See references/dpo-variants.md for IPO, cDPO, RPO, and other DPO loss functions with recommended hyperparameters.
Reward modeling: See references/reward-modeling.md for outcome vs process rewards, Bradley-Terry loss, and reward model evaluation.
Online RL methods: See references/online-rl.md for PPO, GRPO, RLOO, and OnlineDPO with detailed configurations.

Hardware requirements

  • GPU: NVIDIA (CUDA required)
  • VRAM: Depends on model and method
    • SFT 7B: 16GB (with LoRA)
    • DPO 7B: 24GB (stores reference model)
    • PPO 7B: 40GB (policy + reward model)
    • GRPO 7B: 24GB (more memory efficient)
  • Multi-GPU: Supported via accelerate
  • Mixed precision: BF16 recommended (A100/H100)
Memory optimization (see the sketch after this list):
  • Use LoRA/QLoRA for all methods
  • Enable gradient checkpointing
  • Use smaller batch sizes with gradient accumulation
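
A minimal sketch of how these optimizations combine for SFT, assuming recent TRL and PEFT versions (`dataset` is whatever instruction data you loaded earlier):

```python
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# LoRA: train a small set of adapter weights instead of the full model
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    output_dir="Qwen2.5-0.5B-SFT-LoRA",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,   # keeps the effective batch size at 8
    gradient_checkpointing=True,     # trade compute for activation memory
    bf16=True,                       # mixed precision on A100/H100
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    args=training_args,
    train_dataset=dataset,  # assumes a dataset loaded as in the workflows above
    peft_config=peft_config,
)
trainer.train()
```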

Resources
