grpo-rl-training


GRPO/RL Training with TRL


Expert-level guidance for implementing Group Relative Policy Optimization (GRPO) using the Transformer Reinforcement Learning (TRL) library. This skill provides battle-tested patterns, critical insights, and production-ready workflows for fine-tuning language models with custom reward functions.

When to Use This Skill


Use GRPO training when you need to:
  • Enforce specific output formats (e.g., XML tags, JSON, structured reasoning)
  • Teach verifiable tasks with objective correctness metrics (math, coding, fact-checking)
  • Improve reasoning capabilities by rewarding chain-of-thought patterns
  • Align models to domain-specific behaviors without labeled preference data
  • Optimize for multiple objectives simultaneously (format + correctness + style)
Do NOT use GRPO for:
  • Simple supervised fine-tuning tasks (use SFT instead)
  • Tasks without clear reward signals
  • When you already have high-quality preference pairs (use DPO/PPO instead)


Core Concepts


1. GRPO Algorithm Fundamentals


Key Mechanism:
  • Generates multiple completions for each prompt (group size: 4-16)
  • Compares completions within each group using reward functions
  • Updates policy to favor higher-rewarded responses relative to the group
Critical Difference from PPO:
  • No separate reward model needed
  • More sample-efficient (learns from within-group comparisons)
  • Simpler to implement and debug
Mathematical Intuition:
For each prompt p:
  1. Generate N completions: {c₁, c₂, ..., cₙ}
  2. Compute rewards: {r₁, r₂, ..., rₙ}
  3. Learn to increase probability of high-reward completions
     relative to low-reward ones in the same group
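The loop above can be made concrete with a small numeric sketch. This mirrors the group-relative idea (mean-centered, std-normalized rewards within one group); it illustrates the intuition, not TRL's exact internals:

```python
import statistics

def group_relative_advantages(rewards):
    """Center each reward on the group mean and scale by the group std.

    Completions scoring above the group average get positive advantages
    (their probability is pushed up); below-average ones get negative.
    Illustrative sketch, not TRL's implementation.
    """
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

# One group of N=4 completions for the same prompt
advantages = group_relative_advantages([2.0, 0.0, 1.0, 1.0])
```

Advantages in a group always sum to zero: the policy is rewarded only for being better than its own siblings, not against an absolute scale.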

2. Reward Function Design Philosophy


Golden Rules:
  1. Compose multiple reward functions - Each handles one aspect (format, correctness, style)
  2. Scale rewards appropriately - Higher weight = stronger signal
  3. Use incremental rewards - Partial credit for partial compliance
  4. Test rewards independently - Debug each reward function in isolation
Reward Function Types:
| Type | Use Case | Example Weight |
|---|---|---|
| Correctness | Verifiable tasks (math, code) | 2.0 (highest) |
| Format | Strict structure enforcement | 0.5-1.0 |
| Length | Encourage verbosity/conciseness | 0.1-0.5 |
| Style | Penalize unwanted patterns | -0.5 to 0.5 |
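One way to apply the example weights above is a thin wrapper that scales a reward function's outputs. `scaled` and `length_reward` below are illustrative helpers, not TRL APIs:

```python
def scaled(reward_func, weight):
    """Wrap a reward function so its scores are multiplied by `weight`.

    Keeping the original name in the wrapper keeps training logs readable.
    Illustrative helper, not part of TRL.
    """
    def wrapper(*args, **kwargs):
        return [weight * r for r in reward_func(*args, **kwargs)]
    wrapper.__name__ = f"{reward_func.__name__}_x{weight}"
    return wrapper

# Toy length-based reward, weighted at 0.1 per the table above
def length_reward(completions, **kwargs):
    return [min(len(c[0]['content']) / 100, 1.0) for c in completions]

weighted = scaled(length_reward, 0.1)
```

Pass `weighted` to the trainer's `reward_funcs` list like any other reward function.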


Implementation Workflow


Step 1: Dataset Preparation


Critical Requirements:
  • Prompts in chat format (list of dicts with 'role' and 'content')
  • Include system prompts to set expectations
  • For verifiable tasks, include ground truth answers as additional columns
Example Structure:
```python
from datasets import load_dataset, Dataset

SYSTEM_PROMPT = """
Respond in the following format:
<reasoning>
[Your step-by-step thinking]
</reasoning>
<answer>
[Final answer]
</answer>
"""

def prepare_dataset(raw_data):
    """
    Transform raw data into GRPO-compatible format.

    Returns: Dataset with columns:
    - 'prompt': List[Dict] with role/content (system + user messages)
    - 'answer': str (ground truth, optional but recommended)
    """
    return raw_data.map(lambda x: {
        'prompt': [
            {'role': 'system', 'content': SYSTEM_PROMPT},
            {'role': 'user', 'content': x['question']}
        ],
        'answer': extract_answer(x['raw_answer'])
    })
```
Pro Tips:
  • Use one-shot or few-shot examples in system prompt for complex formats
  • Keep prompts concise (max_prompt_length: 256-512 tokens)
  • Validate data quality before training (garbage in = garbage out)
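Following the "validate data quality" tip, a minimal sanity check on the prompt structure can run before training; `validate_example` is an illustrative helper, not part of TRL:

```python
def validate_example(example):
    """Check that one dataset row matches the GRPO chat-prompt format.

    Illustrative pre-flight check; extend with length and content
    checks for your own data.
    """
    prompt = example.get('prompt')
    assert isinstance(prompt, list) and prompt, "prompt must be a non-empty list"
    for msg in prompt:
        assert isinstance(msg, dict), "each message must be a dict"
        assert msg.get('role') in {'system', 'user', 'assistant'}, "bad role"
        assert isinstance(msg.get('content'), str), "content must be a string"
    return True

example = {
    'prompt': [
        {'role': 'system', 'content': 'Respond in the required format.'},
        {'role': 'user', 'content': 'What is 2 + 2?'},
    ],
    'answer': '4',
}
```

Run it over the whole dataset (e.g. `dataset.map(validate_example)`) before the first training step; a format bug caught here is far cheaper than a wasted run.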

Step 2: Reward Function Implementation


Template Structure:
```python
def reward_function_name(
    prompts,        # List[List[Dict]]: Original prompts
    completions,    # List[List[Dict]]: Model generations
    answer=None,    # Optional: Ground truth from dataset
    **kwargs        # Additional dataset columns
) -> list[float]:
    """
    Evaluate completions and return rewards.

    Returns: List of floats (one per completion)
    """
    # Extract completion text
    responses = [comp[0]['content'] for comp in completions]

    # Compute rewards
    rewards = []
    for response in responses:
        score = compute_score(response)
        rewards.append(score)

    return rewards
```
Example 1: Correctness Reward (Math/Coding)
```python
def correctness_reward(prompts, completions, answer, **kwargs):
    """Reward correct answers with high score."""
    responses = [comp[0]['content'] for comp in completions]
    extracted = [extract_final_answer(r) for r in responses]
    return [2.0 if ans == gt else 0.0
            for ans, gt in zip(extracted, answer)]
```
Example 2: Format Reward (Structured Output)
```python
import re

def format_reward(completions, **kwargs):
    """Reward XML-like structured format."""
    pattern = r'<reasoning>.*?</reasoning>\s*<answer>.*?</answer>'
    responses = [comp[0]['content'] for comp in completions]
    return [1.0 if re.search(pattern, r, re.DOTALL) else 0.0
            for r in responses]
```
Example 3: Incremental Format Reward (Partial Credit)
```python
def incremental_format_reward(completions, **kwargs):
    """Award partial credit for format compliance."""
    responses = [comp[0]['content'] for comp in completions]
    rewards = []

    for r in responses:
        score = 0.0
        if '<reasoning>' in r:
            score += 0.25
        if '</reasoning>' in r:
            score += 0.25
        if '<answer>' in r:
            score += 0.25
        if '</answer>' in r:
            score += 0.25
        # Penalize extra text after closing tag
        if r.count('</answer>') == 1:
            extra_text = r.split('</answer>')[-1].strip()
            score -= len(extra_text) * 0.001
        rewards.append(score)

    return rewards
```
Critical Insight: Combine 3-5 reward functions for robust training. Order matters less than diversity of signals.
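Per Golden Rule 4 (test rewards independently), each function can be exercised on handcrafted completions before any training. A minimal check of the format reward from Example 2, with toy inputs:

```python
import re

def format_reward(completions, **kwargs):
    """Reward XML-like structured format (Example 2 above, restated
    so this check is self-contained)."""
    pattern = r'<reasoning>.*?</reasoning>\s*<answer>.*?</answer>'
    responses = [comp[0]['content'] for comp in completions]
    return [1.0 if re.search(pattern, r, re.DOTALL) else 0.0
            for r in responses]

# Handcrafted completions in the same structure TRL passes to rewards
good = [[{'content': '<reasoning>2+2=4</reasoning>\n<answer>4</answer>'}]]
bad = [[{'content': 'The answer is 4.'}]]
```

A compliant completion should score 1.0 and a free-form one 0.0; running a check like this for every reward function catches extraction and regex bugs before they silently flatten your training signal.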

Step 3: Training Configuration


Memory-Optimized Config (Small GPU)
```python
from trl import GRPOConfig

training_args = GRPOConfig(
    output_dir="outputs/grpo-model",

    # Learning rate
    learning_rate=5e-6,          # Lower = more stable
    adam_beta1=0.9,
    adam_beta2=0.99,
    weight_decay=0.1,
    warmup_ratio=0.1,
    lr_scheduler_type='cosine',

    # Batch settings
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,  # Effective batch = 4

    # GRPO-specific
    num_generations=8,            # Group size: 8-16 recommended
    max_prompt_length=256,
    max_completion_length=512,

    # Training duration
    num_train_epochs=1,
    max_steps=None,               # Or set fixed steps (e.g., 500)

    # Optimization
    bf16=True,                    # Faster on A100/H100
    optim="adamw_8bit",           # Memory-efficient optimizer
    max_grad_norm=0.1,

    # Logging
    logging_steps=1,
    save_steps=100,
    report_to="wandb",            # Or "none" for no logging
)
```
High-Performance Config (Large GPU)
```python
training_args = GRPOConfig(
    output_dir="outputs/grpo-model",
    learning_rate=1e-5,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    num_generations=16,           # Larger groups = better signal
    max_prompt_length=512,
    max_completion_length=1024,
    num_train_epochs=1,
    bf16=True,
    use_vllm=True,                # Fast generation with vLLM
    logging_steps=10,
)
```
Critical Hyperparameters:
| Parameter | Impact | Tuning Advice |
|---|---|---|
| `num_generations` | Group size for comparison | Start with 8, increase to 16 if GPU allows |
| `learning_rate` | Convergence speed/stability | 5e-6 (safe), 1e-5 (faster, riskier) |
| `max_completion_length` | Output verbosity | Match your task (512 for reasoning, 256 for short answers) |
| `gradient_accumulation_steps` | Effective batch size | Increase if GPU memory limited |

Step 4: Model Setup and Training


Standard Setup (Transformers)
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig
from trl import GRPOTrainer

# Load model
model_name = "Qwen/Qwen2.5-1.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # 2-3x faster
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Optional: LoRA for parameter-efficient training
peft_config = LoraConfig(
    r=16,             # Rank (higher = more capacity)
    lora_alpha=32,    # Scaling factor (typically 2*r)
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    task_type="CAUSAL_LM",
    lora_dropout=0.05,
)

# Initialize trainer
trainer = GRPOTrainer(
    model=model,
    processing_class=tokenizer,
    reward_funcs=[
        incremental_format_reward,
        format_reward,
        correctness_reward,
    ],
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config,  # Remove for full fine-tuning
)

# Train
trainer.train()

# Save
trainer.save_model("final_model")
```

**Unsloth Setup (2-3x Faster)**
```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="google/gemma-3-1b-it",
    max_seq_length=1024,
    load_in_4bit=True,
    fast_inference=True,
    max_lora_rank=32,
)

model = FastLanguageModel.get_peft_model(
    model,
    r=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=32,
    use_gradient_checkpointing="unsloth",
)
```

Rest is identical to standard setup:

```python
trainer = GRPOTrainer(model=model, ...)
trainer.train()
```

---

Critical Training Insights


1. Loss Behavior (EXPECTED PATTERN)


  • Loss starts near 0 and INCREASES during training
  • This is CORRECT - loss measures KL divergence from initial policy
  • Model is learning (diverging from original behavior to optimize rewards)
  • Monitor reward metrics instead of loss for progress

2. Reward Tracking


Key metrics to watch:
  • `reward`: Average across all completions
  • `reward_std`: Diversity within groups (should remain > 0)
  • `kl`: KL divergence from reference (should grow moderately)
Healthy Training Pattern:
Step   Reward    Reward_Std   KL
100    0.5       0.3          0.02
200    0.8       0.25         0.05
300    1.2       0.2          0.08  ← Good progression
400    1.5       0.15         0.12
Warning Signs:
  • Reward std → 0 (model collapsing to single response)
  • KL exploding (> 0.5) (diverging too much, reduce LR)
  • Reward stuck (reward functions too harsh or model capacity issue)
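These warning signs can be checked mechanically on each logging step. `check_training_health` is an illustrative helper, not a TRL API; the exact metric key names depend on your logging setup:

```python
def check_training_health(logs, std_floor=0.1, kl_ceiling=0.5):
    """Return warning strings for one step's logged metrics.

    Flags the two failure modes described above: reward-std collapse
    (mode collapse) and KL blow-up (diverging too far from the
    reference policy). Illustrative helper; adapt key names to your logs.
    """
    warnings = []
    std = logs.get('reward_std')
    if std is not None and std < std_floor:
        warnings.append(
            f"reward_std={std:.3f} < {std_floor}: possible mode collapse")
    kl = logs.get('kl')
    if kl is not None and kl > kl_ceiling:
        warnings.append(
            f"kl={kl:.3f} > {kl_ceiling}: reduce learning rate")
    return warnings

# A step matching the "healthy" table except for collapsed reward_std
step_logs = {'reward': 1.2, 'reward_std': 0.02, 'kl': 0.08}
```

Wiring this into a `transformers` `TrainerCallback.on_log` hook (or a wandb alert) turns the table above into an automatic tripwire.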

3. Common Pitfalls and Solutions


| Problem | Symptom | Solution |
|---|---|---|
| Mode collapse | All completions identical | Increase `num_generations`, add diversity penalty |
| No learning | Flat rewards | Check reward function logic, increase LR |
| OOM errors | GPU memory exceeded | Reduce `num_generations`, enable gradient checkpointing |
| Slow training | < 1 it/s | Enable `use_vllm=True`, use Unsloth, reduce seq length |
| Format ignored | Model doesn't follow structure | Increase format reward weight, add incremental rewards |
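The "add diversity penalty" fix for mode collapse can be sketched as one more reward function that penalizes exact duplicates within a group; an illustrative example, not a TRL built-in:

```python
def diversity_penalty(completions, **kwargs):
    """Penalize completions that exactly duplicate others in the group.

    The first occurrence of a response is free; each repeat gets a
    negative reward, so identical groups score worse than varied ones.
    Illustrative sketch; near-duplicate detection (e.g. n-gram overlap)
    is a natural refinement.
    """
    responses = [comp[0]['content'] for comp in completions]
    seen = {}
    rewards = []
    for r in responses:
        count = seen.get(r, 0)
        rewards.append(0.0 if count == 0 else -0.5)
        seen[r] = count + 1
    return rewards

# Group where the first two completions are identical
group = [[{'content': 'A'}], [{'content': 'A'}], [{'content': 'B'}]]
```

Add it alongside the other reward functions with a small weight so it discourages collapse without dominating the correctness signal.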


Advanced Patterns


1. Multi-Stage Training


For complex tasks, train in stages:

```python
# Stage 1: Format compliance (epochs=1)
trainer_stage1 = GRPOTrainer(
    model=model,
    reward_funcs=[incremental_format_reward, format_reward],
    ...
)
trainer_stage1.train()

# Stage 2: Correctness (epochs=1)
trainer_stage2 = GRPOTrainer(
    model=model,
    reward_funcs=[format_reward, correctness_reward],
    ...
)
trainer_stage2.train()
```

2. Adaptive Reward Scaling


```python
class AdaptiveReward:
    def __init__(self, base_reward_func, initial_weight=1.0):
        self.func = base_reward_func
        self.weight = initial_weight

    def __call__(self, *args, **kwargs):
        rewards = self.func(*args, **kwargs)
        return [r * self.weight for r in rewards]

    def adjust_weight(self, success_rate):
        """Increase weight if model struggling, decrease if succeeding."""
        if success_rate < 0.3:
            self.weight *= 1.2
        elif success_rate > 0.8:
            self.weight *= 0.9
```

3. Custom Dataset Integration


```python
def load_custom_knowledge_base(csv_path):
    """Example: School communication platform docs."""
    import pandas as pd
    df = pd.read_csv(csv_path)

    dataset = Dataset.from_pandas(df).map(lambda x: {
        'prompt': [
            {'role': 'system', 'content': CUSTOM_SYSTEM_PROMPT},
            {'role': 'user', 'content': x['question']}
        ],
        'answer': x['expert_answer']
    })
    return dataset
```


Deployment and Inference


Save and Merge LoRA


```python
# Merge LoRA adapters into base model
if hasattr(trainer.model, 'merge_and_unload'):
    merged_model = trainer.model.merge_and_unload()
    merged_model.save_pretrained("production_model")
    tokenizer.save_pretrained("production_model")
```

Inference Example


```python
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="production_model",
    tokenizer=tokenizer
)

result = generator(
    [
        {'role': 'system', 'content': SYSTEM_PROMPT},
        {'role': 'user', 'content': "What is 15 + 27?"}
    ],
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_p=0.9
)
print(result[0]['generated_text'])
```


Best Practices Checklist


Before Training:
  • Validate dataset format (prompts as List[Dict])
  • Test reward functions on sample data
  • Calculate expected max_prompt_length from data
  • Choose appropriate num_generations based on GPU memory
  • Set up logging (wandb recommended)
During Training:
  • Monitor reward progression (should increase)
  • Check reward_std (should stay > 0.1)
  • Watch for OOM errors (reduce batch size if needed)
  • Sample generations every 50-100 steps
  • Validate format compliance on holdout set
After Training:
  • Merge LoRA weights if using PEFT
  • Test on diverse prompts
  • Compare to baseline model
  • Document reward weights and hyperparameters
  • Save reproducibility config
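The "calculate expected max_prompt_length" item can be derived from tokenized prompt lengths; `suggest_max_prompt_length` below is an illustrative helper (the 95th-percentile cutoff and rounding up to a multiple of 64 are conveniences, not requirements):

```python
def suggest_max_prompt_length(token_counts, percentile=0.95):
    """Pick a max_prompt_length that covers `percentile` of prompts.

    `token_counts` is a list of tokenized prompt lengths (e.g. from
    len(tokenizer.apply_chat_template(p)) over the dataset). Rounding
    up to a multiple of 64 is a common padding convenience.
    Illustrative helper, not part of TRL.
    """
    counts = sorted(token_counts)
    idx = min(int(len(counts) * percentile), len(counts) - 1)
    raw = counts[idx]
    return ((raw + 63) // 64) * 64

# Example: five measured prompt lengths, one long outlier
length = suggest_max_prompt_length([120, 180, 210, 240, 500])
```

Prompts longer than the chosen value get truncated during training, so err on the generous side when the distribution has a long tail.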


Troubleshooting Guide


Debugging Workflow


  1. Isolate reward functions - Test each independently
  2. Check data distribution - Ensure diversity in prompts
  3. Reduce complexity - Start with single reward, add gradually
  4. Monitor generations - Print samples every N steps
  5. Validate extraction logic - Ensure answer parsing works

Quick Fixes


```python
# Debug reward function
def debug_reward(completions, **kwargs):
    responses = [comp[0]['content'] for comp in completions]
    for i, r in enumerate(responses[:2]):  # Print first 2
        print(f"Response {i}: {r[:200]}...")
    return [1.0] * len(responses)  # Dummy rewards

# Test without training
trainer = GRPOTrainer(..., reward_funcs=[debug_reward])
trainer.generate_completions(dataset[:1])  # Generate without updating
```

---

References and Resources


Official Documentation:
Example Repositories:
Recommended Reading:
  • Progressive Disclosure Pattern for agent instructions
  • Reward shaping in RL (Ng et al.)
  • LoRA paper (Hu et al., 2021)


Usage Instructions for Agents


When this skill is loaded:
  1. Read this entire file before implementing GRPO training
  2. Start with the simplest reward function (e.g., length-based) to validate setup
  3. Use the templates in the `templates/` directory as starting points
  4. Reference the examples in `examples/` for task-specific implementations
  5. Follow the workflow sequentially (don't skip steps)
  6. Debug incrementally - add one reward function at a time
Critical Reminders:
  • Always use multiple reward functions (3-5 is optimal)
  • Monitor reward metrics, not loss
  • Test reward functions before training
  • Start small (num_generations=4), scale up gradually
  • Save checkpoints frequently (every 100 steps)
This skill is designed for expert-level implementation. Beginners should start with supervised fine-tuning before attempting GRPO.