grpo-rl-training


GRPO/RL Training with TRL


Expert-level guidance for implementing Group Relative Policy Optimization (GRPO) using the Transformer Reinforcement Learning (TRL) library. This skill provides battle-tested patterns, critical insights, and production-ready workflows for fine-tuning language models with custom reward functions.

When to Use This Skill


Use GRPO training when you need to:
  • Enforce specific output formats (e.g., XML tags, JSON, structured reasoning)
  • Teach verifiable tasks with objective correctness metrics (math, coding, fact-checking)
  • Improve reasoning capabilities by rewarding chain-of-thought patterns
  • Align models to domain-specific behaviors without labeled preference data
  • Optimize for multiple objectives simultaneously (format + correctness + style)
Do NOT use GRPO for:
  • Simple supervised fine-tuning tasks (use SFT instead)
  • Tasks without clear reward signals
  • When you already have high-quality preference pairs (use DPO/PPO instead)


Core Concepts


1. GRPO Algorithm Fundamentals


Key Mechanism:
  • Generates multiple completions for each prompt (group size: 4-16)
  • Compares completions within each group using reward functions
  • Updates policy to favor higher-rewarded responses relative to the group
Critical Difference from PPO:
  • No separate reward model needed
  • More sample-efficient (learns from within-group comparisons)
  • Simpler to implement and debug
Mathematical Intuition:
For each prompt p:
  1. Generate N completions: {c₁, c₂, ..., cₙ}
  2. Compute rewards: {r₁, r₂, ..., rₙ}
  3. Learn to increase probability of high-reward completions
     relative to low-reward ones in the same group
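The loop above can be made concrete with a small numeric sketch. This mirrors the group-relative idea (mean-centered, std-normalized rewards within one group); it illustrates the intuition, not TRL's exact internals:

```python
import statistics

def group_relative_advantages(rewards):
    """Center each reward on the group mean and scale by the group std.

    Completions scoring above the group average get positive advantages
    (their probability is pushed up); below-average ones get negative.
    Illustrative sketch, not TRL's implementation.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

# One group of N=4 completions for the same prompt
advantages = group_relative_advantages([2.0, 0.0, 1.0, 1.0])
```

Advantages in a group always sum to zero: the policy is rewarded only for being better than its own siblings, not against an absolute scale.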

2. Reward Function Design Philosophy


Golden Rules:
  1. Compose multiple reward functions - Each handles one aspect (format, correctness, style)
  2. Scale rewards appropriately - Higher weight = stronger signal
  3. Use incremental rewards - Partial credit for partial compliance
  4. Test rewards independently - Debug each reward function in isolation
Reward Function Types:
| Type | Use Case | Example Weight |
|---|---|---|
| Correctness | Verifiable tasks (math, code) | 2.0 (highest) |
| Format | Strict structure enforcement | 0.5-1.0 |
| Length | Encourage verbosity/conciseness | 0.1-0.5 |
| Style | Penalize unwanted patterns | -0.5 to 0.5 |
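One way to apply the example weights above is a thin wrapper that scales a reward function's outputs. `scaled` and `length_reward` below are illustrative helpers, not TRL APIs:

```python
def scaled(reward_func, weight):
    """Wrap a reward function so its scores are multiplied by `weight`.

    Keeping the original name in the wrapper keeps training logs readable.
    Illustrative helper, not part of TRL.
    """
    def wrapper(*args, **kwargs):
        return [weight * r for r in reward_func(*args, **kwargs)]
    wrapper.__name__ = f"{reward_func.__name__}_x{weight}"
    return wrapper

# Toy length-based reward, weighted at 0.1 per the table above
def length_reward(completions, **kwargs):
    return [min(len(c[0]['content']) / 100, 1.0) for c in completions]

weighted = scaled(length_reward, 0.1)
```

Pass `weighted` to the trainer's `reward_funcs` list like any other reward function.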


Implementation Workflow


Step 1: Dataset Preparation


Critical Requirements:
  • Prompts in chat format (list of dicts with 'role' and 'content')
  • Include system prompts to set expectations
  • For verifiable tasks, include ground truth answers as additional columns
Example Structure:
```python
from datasets import load_dataset, Dataset

SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
[Your step-by-step thinking]
</reasoning>
<answer>
[Final answer]
</answer>
"""

def prepare_dataset(raw_data):
    """
    Transform raw data into GRPO-compatible format.

    Returns: Dataset with columns:
    - 'prompt': List[Dict] with role/content (system + user messages)
    - 'answer': str (ground truth, optional but recommended)
    """
    return raw_data.map(lambda x: {
        'prompt': [
            {'role': 'system', 'content': SYSTEM_PROMPT},
            {'role': 'user', 'content': x['question']}
        ],
        'answer': extract_answer(x['raw_answer'])
    })
```
Pro Tips:
  • Use one-shot or few-shot examples in system prompt for complex formats
  • Keep prompts concise (max_prompt_length: 256-512 tokens)
  • Validate data quality before training (garbage in = garbage out)
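Following the "validate data quality" tip, a minimal sanity check on the prompt structure can run before training; `validate_example` is an illustrative helper, not part of TRL:

```python
def validate_example(example):
    """Check that one dataset row matches the GRPO chat-prompt format.

    Illustrative pre-flight check; extend with length and content
    checks for your own data.
    """
    prompt = example.get('prompt')
    assert isinstance(prompt, list) and prompt, "prompt must be a non-empty list"
    for msg in prompt:
        assert isinstance(msg, dict), "each message must be a dict"
        assert msg.get('role') in {'system', 'user', 'assistant'}, "bad role"
        assert isinstance(msg.get('content'), str), "content must be a string"
    return True

example = {
    'prompt': [
        {'role': 'system', 'content': 'Respond in the required format.'},
        {'role': 'user', 'content': 'What is 2 + 2?'},
    ],
    'answer': '4',
}
```

Run it over the whole dataset (e.g. `dataset.map(validate_example)`) before the first training step; a format bug caught here is far cheaper than a wasted run.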

Step 2: Reward Function Implementation


Template Structure:
```python
def reward_function_name(
    prompts,        # List[List[Dict]]: Original prompts
    completions,    # List[List[Dict]]: Model generations
    answer=None,    # Optional: Ground truth from dataset
    **kwargs        # Additional dataset columns
) -> list[float]:
    """
    Evaluate completions and return rewards.

    Returns: List of floats (one per completion)
    """
    # Extract completion text
    responses = [comp[0]['content'] for comp in completions]

    # Compute rewards
    rewards = []
    for response in responses:
        score = compute_score(response)
        rewards.append(score)

    return rewards
```
Example 1: Correctness Reward (Math/Coding)
```python
def correctness_reward(prompts, completions, answer, **kwargs):
    """Reward correct answers with high score."""
    responses = [comp[0]['content'] for comp in completions]
    extracted = [extract_final_answer(r) for r in responses]
    return [2.0 if ans == gt else 0.0
            for ans, gt in zip(extracted, answer)]
```
Example 2: Format Reward (Structured Output)
```python
import re

def format_reward(completions, **kwargs):
    """Reward XML-like structured format."""
    pattern = r'<reasoning>.*?</reasoning>\s*<answer>.*?</answer>'
    responses = [comp[0]['content'] for comp in completions]
    return [1.0 if re.search(pattern, r, re.DOTALL) else 0.0
            for r in responses]
```
Example 3: Incremental Format Reward (Partial Credit)
```python
def incremental_format_reward(completions, **kwargs):
    """Award partial credit for format compliance."""
    responses = [comp[0]['content'] for comp in completions]
    rewards = []

    for r in responses:
        score = 0.0
        if '<reasoning>' in r:
            score += 0.25
        if '</reasoning>' in r:
            score += 0.25
        if '<answer>' in r:
            score += 0.25
        if '</answer>' in r:
            score += 0.25
        # Penalize extra text after closing tag
        if r.count('</answer>') == 1:
            extra_text = r.split('</answer>')[-1].strip()
            score -= len(extra_text) * 0.001
        rewards.append(score)

    return rewards
```
Critical Insight: Combine 3-5 reward functions for robust training. Order matters less than diversity of signals.
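Per Golden Rule 4 (test rewards independently), each function can be exercised on handcrafted completions before any training. A minimal check of the format reward from Example 2, with toy inputs:

```python
import re

def format_reward(completions, **kwargs):
    """Reward XML-like structured format (Example 2 above, restated
    so this check is self-contained)."""
    pattern = r'<reasoning>.*?</reasoning>\s*<answer>.*?</answer>'
    responses = [comp[0]['content'] for comp in completions]
    return [1.0 if re.search(pattern, r, re.DOTALL) else 0.0
            for r in responses]

# Handcrafted completions in the same structure TRL passes to rewards
good = [[{'content': '<reasoning>2+2=4</reasoning>\n<answer>4</answer>'}]]
bad = [[{'content': 'The answer is 4.'}]]
```

A compliant completion should score 1.0 and a free-form one 0.0; running a check like this for every reward function catches extraction and regex bugs before they silently flatten your training signal.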

Step 3: Training Configuration


Memory-Optimized Config (Small GPU)
```python
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="outputs/grpo-model",

    # Learning rate
    learning_rate=5e-6,          # Lower = more stable
    adam_beta1=0.9,
    adam_beta2=0.99,
    weight_decay=0.1,
    warmup_ratio=0.1,
    lr_scheduler_type='cosine',

    # Batch settings
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,  # Effective batch = 4

    # GRPO-specific
    num_generations=8,            # Group size: 8-16 recommended
    max_prompt_length=256,
    max_completion_length=512,

    # Training duration
    num_train_epochs=1,
    max_steps=None,               # Or set fixed steps (e.g., 500)

    # Optimization
    bf16=True,                    # Faster on A100/H100
    optim="adamw_8bit",           # Memory-efficient optimizer
    max_grad_norm=0.1,

    # Logging
    logging_steps=1,
    save_steps=100,
    report_to="wandb",            # Or "none" for no logging
)
```
High-Performance Config (Large GPU)
```python
training_args = GRPOConfig(
    output_dir="outputs/grpo-model",
    learning_rate=1e-5,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    num_generations=16,           # Larger groups = better signal
    max_prompt_length=512,
    max_completion_length=1024,
    num_train_epochs=1,
    bf16=True,
    use_vllm=True,                # Fast generation with vLLM
    logging_steps=10,
)
```
Critical Hyperparameters:
| Parameter | Impact | Tuning Advice |
|---|---|---|
| `num_generations` | Group size for comparison | Start with 8, increase to 16 if GPU allows |
| `learning_rate` | Convergence speed/stability | 5e-6 (safe), 1e-5 (faster, riskier) |
| `max_completion_length` | Output verbosity | Match your task (512 for reasoning, 256 for short answers) |
| `gradient_accumulation_steps` | Effective batch size | Increase if GPU memory limited |

Step 4: Model Setup and Training


Standard Setup (Transformers)
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig
from trl import GRPOTrainer

# Load model
model_name = "Qwen/Qwen2.5-1.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # 2-3x faster
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Optional: LoRA for parameter-efficient training
peft_config = LoraConfig(
    r=16,             # Rank (higher = more capacity)
    lora_alpha=32,    # Scaling factor (typically 2*r)
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    task_type="CAUSAL_LM",
    lora_dropout=0.05,
)

# Initialize trainer
trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[
        incremental_format_reward,
        format_reward,
        correctness_reward,
    ],
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config,  # Remove for full fine-tuning
)

# Train
trainer.train()

# Save
trainer.save_model("final_model")
```

**Unsloth Setup (2-3x Faster)**
```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="google/gemma-3-1b-it",
    max_seq_length=1024,
    load_in_4bit=True,
    fast_inference=True,
    max_lora_rank=32,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=32,
    use_gradient_checkpointing="unsloth",
)
```

Rest is identical to standard setup:

```python
trainer = GRPOTrainer(model=model, ...)
trainer.train()
```

---

Critical Training Insights


1. Loss Behavior (EXPECTED PATTERN)


  • Loss starts near 0 and INCREASES during training
  • This is CORRECT - loss measures KL divergence from initial policy
  • Model is learning (diverging from original behavior to optimize rewards)
  • Monitor reward metrics instead of loss for progress

2. Reward Tracking


Key metrics to watch:
  • `reward`: Average across all completions
  • `reward_std`: Diversity within groups (should remain > 0)
  • `kl`: KL divergence from reference (should grow moderately)
Healthy Training Pattern:
Step   Reward    Reward_Std   KL
100    0.5       0.3          0.02
200    0.8       0.25         0.05
300    1.2       0.2          0.08  ← Good progression
400    1.5       0.15         0.12
Warning Signs:
  • Reward std → 0 (model collapsing to single response)
  • KL exploding (> 0.5) (diverging too much, reduce LR)
  • Reward stuck (reward functions too harsh or model capacity issue)
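These warning signs can be checked mechanically on each logging step. `check_training_health` is an illustrative helper, not a TRL API; the exact metric key names depend on your logging setup:

```python
def check_training_health(logs, std_floor=0.1, kl_ceiling=0.5):
    """Return warning strings for one step's logged metrics.

    Flags the two failure modes described above: reward-std collapse
    (mode collapse) and KL blow-up (diverging too far from the
    reference policy). Illustrative helper; adapt key names to your logs.
    """
    warnings = []
    std = logs.get('reward_std')
    if std is not None and std < std_floor:
        warnings.append(
            f"reward_std={std:.3f} < {std_floor}: possible mode collapse")
    kl = logs.get('kl')
    if kl is not None and kl > kl_ceiling:
        warnings.append(
            f"kl={kl:.3f} > {kl_ceiling}: reduce learning rate")
    return warnings

# A step matching the "healthy" table except for collapsed reward_std
step_logs = {'reward': 1.2, 'reward_std': 0.02, 'kl': 0.08}
```

Wiring this into a `transformers` `TrainerCallback.on_log` hook (or a wandb alert) turns the table above into an automatic tripwire.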

3. Common Pitfalls and Solutions


| Problem | Symptom | Solution |
|---|---|---|
| Mode collapse | All completions identical | Increase `num_generations`, add diversity penalty |
| No learning | Flat rewards | Check reward function logic, increase LR |
| OOM errors | GPU memory exceeded | Reduce `num_generations`, enable gradient checkpointing |
| Slow training | < 1 it/s | Enable `use_vllm=True`, use Unsloth, reduce seq length |
| Format ignored | Model doesn't follow structure | Increase format reward weight, add incremental rewards |
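The "add diversity penalty" fix for mode collapse can be sketched as one more reward function that penalizes exact duplicates within a group; an illustrative example, not a TRL built-in:

```python
def diversity_penalty(completions, **kwargs):
    """Penalize completions that exactly duplicate others in the group.

    The first occurrence of a response is free; each repeat gets a
    negative reward, so identical groups score worse than varied ones.
    Illustrative sketch; near-duplicate detection (e.g. n-gram overlap)
    is a natural refinement.
    """
    responses = [comp[0]['content'] for comp in completions]
    seen = {}
    rewards = []
    for r in responses:
        count = seen.get(r, 0)
        rewards.append(0.0 if count == 0 else -0.5)
        seen[r] = count + 1
    return rewards

# Group where the first two completions are identical
group = [[{'content': 'A'}], [{'content': 'A'}], [{'content': 'B'}]]
```

Add it alongside the other reward functions with a small weight so it discourages collapse without dominating the correctness signal.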


Advanced Patterns


1. Multi-Stage Training


For complex tasks, train in stages:

```python
# Stage 1: Format compliance (epochs=1)
trainer_stage1 = GRPOTrainer(
    model=model,
    reward_funcs=[incremental_format_reward, format_reward],
    ...
)
trainer_stage1.train()

# Stage 2: Correctness (epochs=1)
trainer_stage2 = GRPOTrainer(
    model=model,
    reward_funcs=[format_reward, correctness_reward],
    ...
)
trainer_stage2.train()
```

2. Adaptive Reward Scaling


```python
class AdaptiveReward:
    def __init__(self, base_reward_func, initial_weight=1.0):
        self.func = base_reward_func
        self.weight = initial_weight

    def __call__(self, *args, **kwargs):
        rewards = self.func(*args, **kwargs)
        return [r * self.weight for r in rewards]

    def adjust_weight(self, success_rate):
        """Increase weight if model struggling, decrease if succeeding."""
        if success_rate < 0.3:
            self.weight *= 1.2
        elif success_rate > 0.8:
            self.weight *= 0.9
```

3. Custom Dataset Integration


```python
def load_custom_knowledge_base(csv_path):
    """Example: School communication platform docs."""
    import pandas as pd
    df = pd.read_csv(csv_path)

    dataset = Dataset.from_pandas(df).map(lambda x: {
        'prompt': [
            {'role': 'system', 'content': CUSTOM_SYSTEM_PROMPT},
            {'role': 'user', 'content': x['question']}
        ],
        'answer': x['expert_answer']
    })
    return dataset
```


Deployment and Inference


Save and Merge LoRA


```python
# Merge LoRA adapters into base model
if hasattr(trainer.model, 'merge_and_unload'):
    merged_model = trainer.model.merge_and_unload()
    merged_model.save_pretrained("production_model")
    tokenizer.save_pretrained("production_model")
```

Inference Example


```python
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="production_model",
    tokenizer=tokenizer
)

result = generator(
    [
        {'role': 'system', 'content': SYSTEM_PROMPT},
        {'role': 'user', 'content': "What is 15 + 27?"}
    ],
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9
)
print(result[0]['generated_text'])
```


Best Practices Checklist


Before Training:
  • Validate dataset format (prompts as List[Dict])
  • Test reward functions on sample data
  • Calculate expected max_prompt_length from data
  • Choose appropriate num_generations based on GPU memory
  • Set up logging (wandb recommended)
During Training:
  • Monitor reward progression (should increase)
  • Check reward_std (should stay > 0.1)
  • Watch for OOM errors (reduce batch size if needed)
  • Sample generations every 50-100 steps
  • Validate format compliance on holdout set
After Training:
  • Merge LoRA weights if using PEFT
  • Test on diverse prompts
  • Compare to baseline model
  • Document reward weights and hyperparameters
  • Save reproducibility config
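The "calculate expected max_prompt_length" item can be derived from tokenized prompt lengths; `suggest_max_prompt_length` below is an illustrative helper (the 95th-percentile cutoff and rounding up to a multiple of 64 are conveniences, not requirements):

```python
def suggest_max_prompt_length(token_counts, percentile=0.95):
    """Pick a max_prompt_length that covers `percentile` of prompts.

    `token_counts` is a list of tokenized prompt lengths (e.g. from
    len(tokenizer.apply_chat_template(p)) over the dataset). Rounding
    up to a multiple of 64 is a common padding convenience.
    Illustrative helper, not part of TRL.
    """
    counts = sorted(token_counts)
    idx = min(int(len(counts) * percentile), len(counts) - 1)
    raw = counts[idx]
    return ((raw + 63) // 64) * 64

# Example: five measured prompt lengths, one long outlier
length = suggest_max_prompt_length([120, 180, 210, 240, 500])
```

Prompts longer than the chosen value get truncated during training, so err on the generous side when the distribution has a long tail.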


Troubleshooting Guide


Debugging Workflow


  1. Isolate reward functions - Test each independently
  2. Check data distribution - Ensure diversity in prompts
  3. Reduce complexity - Start with single reward, add gradually
  4. Monitor generations - Print samples every N steps
  5. Validate extraction logic - Ensure answer parsing works

Quick Fixes


```python
# Debug reward function
def debug_reward(completions, **kwargs):
    responses = [comp[0]['content'] for comp in completions]
    for i, r in enumerate(responses[:2]):  # Print first 2
        print(f"Response {i}: {r[:200]}...")
    return [1.0] * len(responses)  # Dummy rewards

# Test without training
trainer = GRPOTrainer(..., reward_funcs=[debug_reward])
trainer.generate_completions(dataset[:1])  # Generate without updating
```

---

References and Resources


Official Documentation:
Example Repositories:
Recommended Reading:
  • Progressive Disclosure Pattern for agent instructions
  • Reward shaping in RL (Ng et al.)
  • LoRA paper (Hu et al., 2021)


Usage Instructions for Agents


When this skill is loaded:
  1. Read this entire file before implementing GRPO training
  2. Start with the simplest reward function (e.g., length-based) to validate setup
  3. Use the templates in the `templates/` directory as starting points
  4. Reference the examples in `examples/` for task-specific implementations
  5. Follow the workflow sequentially (don't skip steps)
  6. Debug incrementally - add one reward function at a time
Critical Reminders:
  • Always use multiple reward functions (3-5 is optimal)
  • Monitor reward metrics, not loss
  • Test reward functions before training
  • Start small (num_generations=4), scale up gradually
  • Save checkpoints frequently (every 100 steps)
This skill is designed for expert-level implementation. Beginners should start with supervised fine-tuning before attempting GRPO.