grpo-rl-training
GRPO/RL Training with TRL
Expert-level guidance for implementing Group Relative Policy Optimization (GRPO) using the Transformer Reinforcement Learning (TRL) library. This skill provides battle-tested patterns, critical insights, and production-ready workflows for fine-tuning language models with custom reward functions.
When to Use This Skill
Use GRPO training when you need to:
- Enforce specific output formats (e.g., XML tags, JSON, structured reasoning)
- Teach verifiable tasks with objective correctness metrics (math, coding, fact-checking)
- Improve reasoning capabilities by rewarding chain-of-thought patterns
- Align models to domain-specific behaviors without labeled preference data
- Optimize for multiple objectives simultaneously (format + correctness + style)
Do NOT use GRPO for:
- Simple supervised fine-tuning tasks (use SFT instead)
- Tasks without clear reward signals
- When you already have high-quality preference pairs (use DPO/PPO instead)
Core Concepts
1. GRPO Algorithm Fundamentals
Key Mechanism:
- Generates multiple completions for each prompt (group size: 4-16)
- Compares completions within each group using reward functions
- Updates policy to favor higher-rewarded responses relative to the group
Critical Difference from PPO:
- No separate reward model needed
- More sample-efficient (learns from within-group comparisons)
- Simpler to implement and debug
Mathematical Intuition:
For each prompt p:
1. Generate N completions: {c₁, c₂, ..., cₙ}
2. Compute rewards: {r₁, r₂, ..., rₙ}
3. Learn to increase probability of high-reward completions
   relative to low-reward ones in the same group
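To make the group-relative update concrete, here is a minimal sketch of the within-group reward normalization (illustrative only; TRL's internal computation may differ in details):

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Normalize rewards within one prompt's group of completions.

    Completions scoring above the group mean get positive advantages
    (their tokens are made more likely); below-mean completions get
    negative advantages.
    """
    mean_r = statistics.mean(rewards)
    std_r = statistics.pstdev(rewards) or 1e-6  # avoid division by zero when all rewards are equal
    return [(r - mean_r) / std_r for r in rewards]

# Example: one prompt, 4 sampled completions
print(group_relative_advantages([2.0, 0.0, 1.0, 1.0]))  # [~1.41, ~-1.41, 0.0, 0.0]
```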
2. Reward Function Design Philosophy
Golden Rules:
- Compose multiple reward functions - Each handles one aspect (format, correctness, style)
- Scale rewards appropriately - Higher weight = stronger signal
- Use incremental rewards - Partial credit for partial compliance
- Test rewards independently - Debug each reward function in isolation
Reward Function Types:
| Type | Use Case | Example Weight |
|---|---|---|
| Correctness | Verifiable tasks (math, code) | 2.0 (highest) |
| Format | Strict structure enforcement | 0.5-1.0 |
| Length | Encourage verbosity/conciseness | 0.1-0.5 |
| Style | Penalize unwanted patterns | -0.5 to 0.5 |
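One simple way to apply weights like those above is to wrap each reward function so its scores are scaled before the trainer combines them; a minimal sketch (`correctness_reward` and `format_reward` are defined in Step 2 below):

```python
def weighted(reward_func, weight):
    """Scale a reward function's outputs by a fixed weight."""
    def wrapper(*args, **kwargs):
        return [weight * r for r in reward_func(*args, **kwargs)]
    wrapper.__name__ = f"{reward_func.__name__}_w{weight}"  # keep a readable name for logs
    return wrapper

# Example composition matching the table:
reward_funcs = [
    weighted(correctness_reward, 2.0),   # correctness dominates
    weighted(format_reward, 1.0),
]
```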
Implementation Workflow
Step 1: Dataset Preparation
Critical Requirements:
- Prompts in chat format (list of dicts with 'role' and 'content')
- Include system prompts to set expectations
- For verifiable tasks, include ground truth answers as additional columns
Example Structure:
```python
from datasets import load_dataset, Dataset

SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
[Your step-by-step thinking]
</reasoning>
<answer>
[Final answer]
</answer>
"""

def prepare_dataset(raw_data):
    """
    Transform raw data into GRPO-compatible format.

    Returns: Dataset with columns:
    - 'prompt': List[Dict] with role/content (system + user messages)
    - 'answer': str (ground truth, optional but recommended)
    """
    return raw_data.map(lambda x: {
        'prompt': [
            {'role': 'system', 'content': SYSTEM_PROMPT},
            {'role': 'user', 'content': x['question']}
        ],
        'answer': extract_answer(x['raw_answer'])  # extract_answer: your dataset-specific parsing helper
    })
```
Pro Tips:
- Use one-shot or few-shot examples in system prompt for complex formats
- Keep prompts concise (max_prompt_length: 256-512 tokens)
- Validate data quality before training (garbage in = garbage out)
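The `extract_answer` helper above is dataset-specific and not defined here; as an illustration, a minimal sketch for GSM8K-style raw answers (which end in `#### <number>`) might look like this:

```python
def extract_answer(raw_answer: str) -> str:
    """Illustrative parser for GSM8K-style answers ('... #### 42')."""
    if "####" in raw_answer:
        return raw_answer.split("####")[-1].strip()
    return raw_answer.strip()
```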
Step 2: Reward Function Implementation
Template Structure:
```python
def reward_function_name(
    prompts,          # List[List[Dict]]: Original prompts
    completions,      # List[List[Dict]]: Model generations
    answer=None,      # Optional: Ground truth from dataset
    **kwargs          # Additional dataset columns
) -> list[float]:
    """
    Evaluate completions and return rewards.

    Returns: List of floats (one per completion)
    """
    # Extract completion text
    responses = [comp[0]['content'] for comp in completions]
    # Compute rewards
    rewards = []
    for response in responses:
        score = compute_score(response)  # compute_score: your task-specific scorer
        rewards.append(score)
    return rewards
```
Example 1: Correctness Reward (Math/Coding)
```python
def correctness_reward(prompts, completions, answer, **kwargs):
    """Reward correct answers with high score."""
    responses = [comp[0]['content'] for comp in completions]
    extracted = [extract_final_answer(r) for r in responses]  # extract_final_answer: your parsing helper
    return [2.0 if ans == gt else 0.0
            for ans, gt in zip(extracted, answer)]
```
Example 2: Format Reward (Structured Output)
```python
import re

def format_reward(completions, **kwargs):
    """Reward XML-like structured format."""
    pattern = r'<reasoning>.*?</reasoning>\s*<answer>.*?</answer>'
    responses = [comp[0]['content'] for comp in completions]
    return [1.0 if re.search(pattern, r, re.DOTALL) else 0.0
            for r in responses]
```
Example 3: Incremental Format Reward (Partial Credit)
```python
def incremental_format_reward(completions, **kwargs):
    """Award partial credit for format compliance."""
    responses = [comp[0]['content'] for comp in completions]
    rewards = []
    for r in responses:
        score = 0.0
        if '<reasoning>' in r:
            score += 0.25
        if '</reasoning>' in r:
            score += 0.25
        if '<answer>' in r:
            score += 0.25
        if '</answer>' in r:
            score += 0.25
        # Penalize extra text after closing tag
        if r.count('</answer>') == 1:
            extra_text = r.split('</answer>')[-1].strip()
            score -= len(extra_text) * 0.001
        rewards.append(score)
    return rewards
```
Critical Insight:
Combine 3-5 reward functions for robust training. Order matters less than diversity of signals.
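In line with the "test rewards independently" rule, it is worth exercising each reward function on hand-written completions before training; a minimal sketch (the correctness check assumes you have defined your `extract_final_answer` helper):

```python
# Completions in the same nested format the trainer passes to reward functions
good = [[{'role': 'assistant',
          'content': '<reasoning>15 + 27 = 42</reasoning>\n<answer>42</answer>'}]]
bad = [[{'role': 'assistant', 'content': 'The answer is 42.'}]]

assert format_reward(completions=good) == [1.0]
assert format_reward(completions=bad) == [0.0]
assert incremental_format_reward(completions=bad) == [0.0]
print(correctness_reward(prompts=None, completions=good, answer=['42']))  # expect [2.0] if extraction works
```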
Step 3: Training Configuration
Memory-Optimized Config (Small GPU)
```python
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="outputs/grpo-model",
    # Learning rate
    learning_rate=5e-6,              # Lower = more stable
    adam_beta1=0.9,
    adam_beta2=0.99,
    weight_decay=0.1,
    warmup_ratio=0.1,
    lr_scheduler_type='cosine',
    # Batch settings
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,   # Effective batch = 4
    # GRPO-specific
    num_generations=8,               # Group size: 8-16 recommended
    max_prompt_length=256,
    max_completion_length=512,
    # Training duration
    num_train_epochs=1,
    max_steps=None,                  # Or set fixed steps (e.g., 500)
    # Optimization
    bf16=True,                       # Faster on A100/H100
    optim="adamw_8bit",              # Memory-efficient optimizer
    max_grad_norm=0.1,
    # Logging
    logging_steps=1,
    save_steps=100,
    report_to="wandb",               # Or "none" for no logging
)
```
High-Performance Config (Large GPU)
```python
training_args = GRPOConfig(
    output_dir="outputs/grpo-model",
    learning_rate=1e-5,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    num_generations=16,              # Larger groups = better signal
    max_prompt_length=512,
    max_completion_length=1024,
    num_train_epochs=1,
    bf16=True,
    use_vllm=True,                   # Fast generation with vLLM
    logging_steps=10,
)
```
Critical Hyperparameters:
| Parameter | Impact | Tuning Advice |
|---|---|---|
| num_generations | Group size for comparison | Start with 8, increase to 16 if GPU allows |
| learning_rate | Convergence speed/stability | 5e-6 (safe), 1e-5 (faster, riskier) |
| max_completion_length | Output verbosity | Match your task (512 for reasoning, 256 for short answers) |
| gradient_accumulation_steps | Effective batch size | Increase if GPU memory limited |
Step 4: Model Setup and Training
Standard Setup (Transformers)
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig
from trl import GRPOTrainer

# Load model
model_name = "Qwen/Qwen2.5-1.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # 2-3x faster
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Optional: LoRA for parameter-efficient training
peft_config = LoraConfig(
    r=16,                # Rank (higher = more capacity)
    lora_alpha=32,       # Scaling factor (typically 2*r)
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    task_type="CAUSAL_LM",
    lora_dropout=0.05,
)

# Initialize trainer
trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[
        incremental_format_reward,
        format_reward,
        correctness_reward,
    ],
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config,  # Remove for full fine-tuning
)

# Train
trainer.train()

# Save
trainer.save_model("final_model")
```
**Unsloth Setup (2-3x Faster)**
```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="google/gemma-3-1b-it",
    max_seq_length=1024,
    load_in_4bit=True,
    fast_inference=True,
    max_lora_rank=32,
)
model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=32,
    use_gradient_checkpointing="unsloth",
)

# Rest is identical to standard setup
trainer = GRPOTrainer(model=model, ...)
trainer.train()
```

---
Critical Training Insights
1. Loss Behavior (EXPECTED PATTERN)
- Loss starts near 0 and INCREASES during training
- This is CORRECT - loss measures KL divergence from initial policy
- Model is learning (diverging from original behavior to optimize rewards)
- Monitor reward metrics instead of loss for progress
2. Reward Tracking
Key metrics to watch:
- `reward`: Average across all completions
- `reward_std`: Diversity within groups (should remain > 0)
- `kl`: KL divergence from reference (should grow moderately)
Healthy Training Pattern:
Step Reward Reward_Std KL
100 0.5 0.3 0.02
200 0.8 0.25 0.05
300 1.2 0.2 0.08 ← Good progression
400 1.5 0.15 0.12
Warning Signs:
- Reward std → 0 (model collapsing to single response)
- KL exploding (> 0.5) (diverging too much, reduce LR)
- Reward stuck (reward functions too harsh or model capacity issue)
3. Common Pitfalls and Solutions
| Problem | Symptom | Solution |
|---|---|---|
| Mode collapse | All completions identical | Increase sampling temperature / num_generations |
| No learning | Flat rewards | Check reward function logic, increase LR |
| OOM errors | GPU memory exceeded | Reduce num_generations, batch size, or max_completion_length |
| Slow training | < 1 it/s | Enable use_vllm / flash_attention_2 (or use Unsloth) |
| Format ignored | Model doesn't follow structure | Increase format reward weight, add incremental rewards |
Advanced Patterns
1. Multi-Stage Training
For complex tasks, train in stages:
```python
# Stage 1: Format compliance (epochs=1)
trainer_stage1 = GRPOTrainer(
    model=model,
    reward_funcs=[incremental_format_reward, format_reward],
    ...
)
trainer_stage1.train()

# Stage 2: Correctness (epochs=1)
trainer_stage2 = GRPOTrainer(
    model=model,
    reward_funcs=[format_reward, correctness_reward],
    ...
)
trainer_stage2.train()
```
2. Adaptive Reward Scaling
```python
class AdaptiveReward:
    def __init__(self, base_reward_func, initial_weight=1.0):
        self.func = base_reward_func
        self.weight = initial_weight

    def __call__(self, *args, **kwargs):
        rewards = self.func(*args, **kwargs)
        return [r * self.weight for r in rewards]

    def adjust_weight(self, success_rate):
        """Increase weight if model struggling, decrease if succeeding."""
        if success_rate < 0.3:
            self.weight *= 1.2
        elif success_rate > 0.8:
            self.weight *= 0.9
```
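A possible usage sketch; the `__name__` line is only a precaution in case your TRL version derives reward-function log names from `__name__`:

```python
adaptive_correctness = AdaptiveReward(correctness_reward, initial_weight=2.0)
adaptive_correctness.__name__ = "adaptive_correctness"  # precaution for name-based logging; may not be required

# e.g. after observing that only 25% of sampled completions were correct:
adaptive_correctness.adjust_weight(success_rate=0.25)  # 0.25 < 0.3, so weight grows: 2.0 -> 2.4
```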
3. Custom Dataset Integration
```python
def load_custom_knowledge_base(csv_path):
    """Example: School communication platform docs."""
    import pandas as pd
    df = pd.read_csv(csv_path)
    dataset = Dataset.from_pandas(df).map(lambda x: {
        'prompt': [
            {'role': 'system', 'content': CUSTOM_SYSTEM_PROMPT},
            {'role': 'user', 'content': x['question']}
        ],
        'answer': x['expert_answer']
    })
    return dataset
```
Deployment and Inference
Save and Merge LoRA
```python
# Merge LoRA adapters into base model
if hasattr(trainer.model, 'merge_and_unload'):
    merged_model = trainer.model.merge_and_unload()
    merged_model.save_pretrained("production_model")
    tokenizer.save_pretrained("production_model")
```
Inference Example
```python
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="production_model",
    tokenizer=tokenizer
)
result = generator(
    [
        {'role': 'system', 'content': SYSTEM_PROMPT},
        {'role': 'user', 'content': "What is 15 + 27?"}
    ],
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9
)
print(result[0]['generated_text'])
```
Best Practices Checklist
Before Training:
- Validate dataset format (prompts as List[Dict])
- Test reward functions on sample data
- Calculate expected max_prompt_length from data
- Choose appropriate num_generations based on GPU memory
- Set up logging (wandb recommended)
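For the "calculate expected max_prompt_length" item, a quick sketch that measures tokenized prompt lengths on the prepared dataset (assumes the `dataset` and `tokenizer` from the steps above, and that the tokenizer has a chat template, as instruct models do):

```python
# Measure tokenized prompt lengths to choose max_prompt_length
sample = dataset.select(range(min(500, len(dataset))))
lengths = sorted(
    len(tokenizer.apply_chat_template(ex['prompt'], tokenize=True))
    for ex in sample
)
print(f"max={lengths[-1]}, p95={lengths[int(0.95 * len(lengths))]}")
# Set max_prompt_length comfortably above the p95 value
```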
During Training:
- Monitor reward progression (should increase)
- Check reward_std (should stay > 0.1)
- Watch for OOM errors (reduce batch size if needed)
- Sample generations every 50-100 steps
- Validate format compliance on holdout set
After Training:
- Merge LoRA weights if using PEFT
- Test on diverse prompts
- Compare to baseline model
- Document reward weights and hyperparameters
- Save reproducibility config
Troubleshooting Guide
Debugging Workflow
- Isolate reward functions - Test each independently
- Check data distribution - Ensure diversity in prompts
- Reduce complexity - Start with single reward, add gradually
- Monitor generations - Print samples every N steps
- Validate extraction logic - Ensure answer parsing works
Quick Fixes
```python
# Debug reward function
def debug_reward(completions, **kwargs):
    responses = [comp[0]['content'] for comp in completions]
    for i, r in enumerate(responses[:2]):  # Print first 2
        print(f"Response {i}: {r[:200]}...")
    return [1.0] * len(responses)  # Dummy rewards

# Test without training
trainer = GRPOTrainer(..., reward_funcs=[debug_reward])
trainer.generate_completions(dataset[:1])  # Generate without updating
```

---
References and Resources
Official Documentation:
- TRL GRPO Trainer: https://huggingface.co/docs/trl/grpo_trainer
- DeepSeek R1 Paper: https://arxiv.org/abs/2501.12948
- Unsloth Docs: https://docs.unsloth.ai/
Example Repositories:
- Open R1 Implementation: https://github.com/huggingface/open-r1
- TRL Examples: https://github.com/huggingface/trl/tree/main/examples
Recommended Reading:
- Progressive Disclosure Pattern for agent instructions
- Reward shaping in RL (Ng et al.)
- LoRA paper (Hu et al., 2021)
Usage Instructions for Agents
When this skill is loaded:
- Read this entire file before implementing GRPO training
- Start with the simplest reward function (e.g., length-based) to validate setup
- Use the templates in the templates/ directory as starting points
- Reference examples in examples/ for task-specific implementations
- Follow the workflow sequentially (don't skip steps)
- Debug incrementally - add one reward function at a time
Critical Reminders:
- Always use multiple reward functions (3-5 is optimal)
- Monitor reward metrics, not loss
- Test reward functions before training
- Start small (num_generations=4), scale up gradually
- Save checkpoints frequently (every 100 steps)
This skill is designed for expert-level implementation. Beginners should start with supervised fine-tuning before attempting GRPO.