openclaw-rl-training

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

OpenClaw-RL Training Skill

OpenClaw-RL 训练技能

Skill by ara.so — Hermes Skills collection.
ara.so开发的技能 — 属于Hermes技能合集。

Overview

概述

OpenClaw-RL is a fully asynchronous reinforcement learning framework that trains personalized AI agents from natural conversation feedback. It wraps self-hosted models in an OpenClaw-compatible API, intercepts live multi-turn conversations, and continuously optimizes the policy in the background without interrupting usage.
Key capabilities:
  • Fully async 4-component architecture (serving, rollout, evaluation, training)
  • Three learning paradigms: Binary RL (GRPO), On-Policy Distillation (OPD), Hybrid Combine
  • Self-hosted and private — runs entirely on your infrastructure
  • Supports personal agent optimization and general agentic RL (terminal, GUI, SWE, tool-call)
  • Zero manual labeling — automatic trajectory creation from conversations
OpenClaw-RL是一个全异步的reinforcement learning框架,可通过自然对话反馈训练个性化AI Agent。它将自托管模型封装为OpenClaw兼容的API,拦截实时多轮对话,并在后台持续优化策略,且不会中断使用。
核心功能:
  • 全异步四组件架构(服务、轨迹生成、评估、训练)
  • 三种学习范式:Binary RL(GRPO)、On-Policy Distillation(OPD)、混合组合法
  • 自托管且私有 — 完全在您的基础设施上运行
  • 支持个人Agent优化和通用Agentic RL(终端、GUI、软件工程、工具调用)
  • 无需手动标注 — 从对话中自动生成训练轨迹

Installation

安装

Prerequisites

前置要求

bash
undefined
bash
undefined

Python 3.8+ required

Python 3.8+ required

CUDA-capable GPU(s) for training

CUDA-capable GPU(s) for training

Docker (optional, for containerized deployment)

Docker (optional, for containerized deployment)

undefined
undefined

Clone and Setup

克隆与设置

bash
git clone https://github.com/Gen-Verse/OpenClaw-RL.git
cd OpenClaw-RL
bash
git clone https://github.com/Gen-Verse/OpenClaw-RL.git
cd OpenClaw-RL

Install dependencies for your chosen method

Install dependencies for your chosen method

cd openclaw-combine # or openclaw-rl, openclaw-opd, etc. pip install -r requirements.txt
cd openclaw-combine # or openclaw-rl, openclaw-opd, etc. pip install -r requirements.txt

Install slime framework

Install slime framework

cd ../slime pip install -e .
cd ../slime pip install -e .

Install Megatron-LM

Install Megatron-LM

cd ../Megatron-LM pip install -e .
undefined
cd ../Megatron-LM pip install -e .
undefined

Environment Variables

环境变量

bash
export OPENCLAW_API_KEY=your_api_key_here
export WANDB_API_KEY=$YOUR_WANDB_KEY  # For experiment tracking
export HF_TOKEN=$YOUR_HF_TOKEN  # For model downloads
bash
export OPENCLAW_API_KEY=your_api_key_here
export WANDB_API_KEY=$YOUR_WANDB_KEY  # For experiment tracking
export HF_TOKEN=$YOUR_HF_TOKEN  # For model downloads

Architecture Components

架构组件

OpenClaw-RL has 4 decoupled async components:
  1. Agent Server - Serves the model via OpenClaw-compatible API
  2. Rollout Collector - Intercepts conversations, creates training trajectories
  3. Judge/PRM Evaluator - Scores interactions asynchronously with majority voting
  4. Policy Trainer - Optimizes the model using collected feedback
OpenClaw-RL包含4个解耦的异步组件:
  1. Agent服务器 - 通过OpenClaw兼容API提供模型服务
  2. 轨迹生成收集器 - 拦截对话,生成训练轨迹
  3. 评估器/PRM评判器 - 通过多数投票异步为交互打分
  4. 策略训练器 - 利用收集到的反馈优化模型

Training Methods

训练方法

1. Binary RL (GRPO)

1. Binary RL(GRPO)

Uses Process Reward Model to score each turn, then applies GRPO advantage estimation with PPO-style clipped loss.
bash
cd openclaw-rl
使用Process Reward Model为每一轮对话打分,然后应用带PPO风格裁剪损失的GRPO优势估计。
bash
cd openclaw-rl

Configure training script

Configure training script

export MASTER_ADDR=localhost export MASTER_PORT=6000 export NNODES=1 export NODE_RANK=0 export GPUS_PER_NODE=8
export MASTER_ADDR=localhost export MASTER_PORT=6000 export NNODES=1 export NODE_RANK=0 export GPUS_PER_NODE=8

Launch training

Launch training

bash run_binary_rl.sh

**Key configuration in script:**

```bash
#!/bin/bash
bash run_binary_rl.sh

**脚本中的核心配置:**

```bash
#!/bin/bash

Model paths

Model paths

CKPT_PATH=/path/to/your/model/checkpoint TOKENIZER_PATH=/path/to/tokenizer
CKPT_PATH=/path/to/your/model/checkpoint TOKENIZER_PATH=/path/to/tokenizer

Rollout configuration

Rollout configuration

ROLLOUT_ARGS=" --rollout-function-path rollout_binary.py
--num-rollout-workers 4
--rollout-batch-size 32
--max-turns 10 "
ROLLOUT_ARGS=" --rollout-function-path rollout_binary.py
--num-rollout-workers 4
--rollout-batch-size 32
--max-turns 10 "

Reward model configuration

Reward model configuration

REWARD_ARGS=" --custom-rm-path process_reward_model.py
--rm-checkpoint /path/to/prm/checkpoint
--reward-aggregation majority "
REWARD_ARGS=" --custom-rm-path process_reward_model.py
--rm-checkpoint /path/to/prm/checkpoint
--reward-aggregation majority "

Training hyperparameters

Training hyperparameters

OPTIMIZER_ARGS=" --lr 1e-6
--lr-warmup-samples 100
--clip-grad 1.0
--ppo-clip-ratio 0.2
--num-epochs 1 "
OPTIMIZER_ARGS=" --lr 1e-6
--lr-warmup-samples 100
--clip-grad 1.0
--ppo-clip-ratio 0.2
--num-epochs 1 "

Launch distributed training

Launch distributed training

torchrun --nproc_per_node=$GPUS_PER_NODE
--nnodes=$NNODES
--node_rank=$NODE_RANK
--master_addr=$MASTER_ADDR
--master_port=$MASTER_PORT
slime/train_grpo.py
$ROLLOUT_ARGS
$REWARD_ARGS
$OPTIMIZER_ARGS
undefined
torchrun --nproc_per_node=$GPUS_PER_NODE
--nnodes=$NNODES
--node_rank=$NODE_RANK
--master_addr=$MASTER_ADDR
--master_port=$MASTER_PORT
slime/train_grpo.py
$ROLLOUT_ARGS
$REWARD_ARGS
$OPTIMIZER_ARGS
undefined

2. On-Policy Distillation (OPD)

2. On-Policy Distillation(OPD)

Extracts textual hints from next-state feedback, creates enhanced teacher trajectories, uses token-level log-prob gaps as directional advantages.
bash
cd openclaw-opd
从下一状态反馈中提取文本提示,生成增强型教师轨迹,将token级别的对数概率差用作定向优势。
bash
cd openclaw-opd

Launch OPD training

Launch OPD training

bash run_opd_training.sh

**OPD configuration example:**

```python
bash run_opd_training.sh

**OPD配置示例:**

```python

custom_opd_loss.py

custom_opd_loss.py

import torch import torch.nn.functional as F
def compute_opd_loss( student_logprobs, teacher_logprobs, advantage_mask, clip_ratio=0.2 ): """ OPD loss: token-level advantage from teacher-student log-prob gap """ # Compute log-probability ratio logratio = student_logprobs - teacher_logprobs ratio = torch.exp(logratio)
# Apply clipping
clipped_ratio = torch.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio)

# Compute advantages (negative gap = student should improve)
advantages = teacher_logprobs - student_logprobs

# Masked loss (only on relevant tokens)
loss_unclipped = -advantages * ratio
loss_clipped = -advantages * clipped_ratio
loss = torch.max(loss_unclipped, loss_clipped)

# Apply mask and return mean
masked_loss = loss * advantage_mask
return masked_loss.sum() / advantage_mask.sum()

**OPD rollout script:**

```python
import torch import torch.nn.functional as F
def compute_opd_loss( student_logprobs, teacher_logprobs, advantage_mask, clip_ratio=0.2 ): """ OPD loss: token-level advantage from teacher-student log-prob gap """ # Compute log-probability ratio logratio = student_logprobs - teacher_logprobs ratio = torch.exp(logratio)
# Apply clipping
clipped_ratio = torch.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio)

# Compute advantages (negative gap = student should improve)
advantages = teacher_logprobs - student_logprobs

# Masked loss (only on relevant tokens)
loss_unclipped = -advantages * ratio
loss_clipped = -advantages * clipped_ratio
loss = torch.max(loss_unclipped, loss_clipped)

# Apply mask and return mean
masked_loss = loss * advantage_mask
return masked_loss.sum() / advantage_mask.sum()

**OPD轨迹生成脚本:**

```python

rollout_opd.py

rollout_opd.py

import asyncio from typing import List, Dict
async def collect_opd_trajectory( prompt: str, student_model, teacher_augmentation_fn, max_turns: int = 10 ) -> Dict: """ Collect trajectory with teacher augmentation """ trajectory = { "student_responses": [], "teacher_responses": [], "rewards": [], "advantages": [] }
current_prompt = prompt

for turn in range(max_turns):
    # Student generation
    student_response = await student_model.generate(current_prompt)
    
    # Get next-state feedback (from user/env)
    feedback = await get_next_feedback(student_response)
    
    # Extract hint and create augmented teacher prompt
    hint = await extract_hint_from_feedback(feedback)
    teacher_prompt = augment_prompt_with_hint(current_prompt, hint)
    
    # Teacher generation
    teacher_response = await student_model.generate(teacher_prompt)
    
    # Store trajectory data
    trajectory["student_responses"].append(student_response)
    trajectory["teacher_responses"].append(teacher_response)
    
    # Update for next turn
    current_prompt = create_next_prompt(student_response, feedback)

return trajectory
undefined
import asyncio from typing import List, Dict
async def collect_opd_trajectory( prompt: str, student_model, teacher_augmentation_fn, max_turns: int = 10 ) -> Dict: """ Collect trajectory with teacher augmentation """ trajectory = { "student_responses": [], "teacher_responses": [], "rewards": [], "advantages": [] }
current_prompt = prompt

for turn in range(max_turns):
    # Student generation
    student_response = await student_model.generate(current_prompt)
    
    # Get next-state feedback (from user/env)
    feedback = await get_next_feedback(student_response)
    
    # Extract hint and create augmented teacher prompt
    hint = await extract_hint_from_feedback(feedback)
    teacher_prompt = augment_prompt_with_hint(current_prompt, hint)
    
    # Teacher generation
    teacher_response = await student_model.generate(teacher_prompt)
    
    # Store trajectory data
    trajectory["student_responses"].append(student_response)
    trajectory["teacher_responses"].append(teacher_response)
    
    # Update for next turn
    current_prompt = create_next_prompt(student_response, feedback)

return trajectory
undefined

3. Hybrid Combine Method

3. 混合组合法

Combines Binary RL scalar rewards with OPD token-level signals for stronger optimization.
bash
cd openclaw-combine
结合Binary RL的标量奖励与OPD的token级信号,实现更强的优化效果。
bash
cd openclaw-combine

Launch hybrid training (one-line deployment)

Launch hybrid training (one-line deployment)

bash run_combine_training.sh

**Hybrid loss implementation:**

```python
bash run_combine_training.sh

**混合损失实现:**

```python

hybrid_loss.py

hybrid_loss.py

import torch
def compute_hybrid_loss( student_logprobs, teacher_logprobs, scalar_rewards, opd_weight=0.5, binary_weight=0.5, clip_ratio=0.2 ): """ Hybrid loss combining Binary RL and OPD """ # Binary RL component (GRPO) advantages_binary = compute_gae(scalar_rewards) logratio = student_logprobs - student_logprobs.detach() ratio = torch.exp(logratio)
pg_loss1 = -advantages_binary * ratio
pg_loss2 = -advantages_binary * torch.clamp(
    ratio, 1 - clip_ratio, 1 + clip_ratio
)
binary_loss = torch.max(pg_loss1, pg_loss2).mean()

# OPD component (token-level)
advantages_opd = teacher_logprobs - student_logprobs
opd_loss = -advantages_opd.mean()

# Combine with weights
total_loss = (
    binary_weight * binary_loss +
    opd_weight * opd_loss
)

return total_loss, {
    "binary_loss": binary_loss.item(),
    "opd_loss": opd_loss.item(),
    "total_loss": total_loss.item()
}
undefined
import torch
def compute_hybrid_loss( student_logprobs, teacher_logprobs, scalar_rewards, opd_weight=0.5, binary_weight=0.5, clip_ratio=0.2 ): """ Hybrid loss combining Binary RL and OPD """ # Binary RL component (GRPO) advantages_binary = compute_gae(scalar_rewards) logratio = student_logprobs - student_logprobs.detach() ratio = torch.exp(logratio)
pg_loss1 = -advantages_binary * ratio
pg_loss2 = -advantages_binary * torch.clamp(
    ratio, 1 - clip_ratio, 1 + clip_ratio
)
binary_loss = torch.max(pg_loss1, pg_loss2).mean()

# OPD component (token-level)
advantages_opd = teacher_logprobs - student_logprobs
opd_loss = -advantages_opd.mean()

# Combine with weights
total_loss = (
    binary_weight * binary_loss +
    opd_weight * opd_loss
)

return total_loss, {
    "binary_loss": binary_loss.item(),
    "opd_loss": opd_loss.item(),
    "total_loss": total_loss.item()
}
undefined

Personal Agent Optimization

个人Agent优化

Setup OpenClaw Extension

设置OpenClaw扩展

bash
undefined
bash
undefined

Install the RL training headers extension

Install the RL training headers extension

cd extensions/rl-training-headers npm install npm run build
cd extensions/rl-training-headers npm install npm run build

Configure in your OpenClaw instance

Configure in your OpenClaw instance

Add to openclaw config.json:

Add to openclaw config.json:


```json
{
  "extensions": [
    {
      "name": "rl-training-headers",
      "enabled": true,
      "config": {
        "rollout_endpoint": "http://localhost:8000/rollout",
        "training_mode": "async",
        "session_tracking": true
      }
    }
  ]
}

```json
{
  "extensions": [
    {
      "name": "rl-training-headers",
      "enabled": true,
      "config": {
        "rollout_endpoint": "http://localhost:8000/rollout",
        "training_mode": "async",
        "session_tracking": true
      }
    }
  ]
}

Launch Personal Agent Training

启动个人Agent训练

bash
undefined
bash
undefined

Start the model server

Start the model server

cd openclaw-combine python serve_model.py
--model-path /path/to/your/model
--port 8000
--gpu-ids 0,1
cd openclaw-combine python serve_model.py
--model-path /path/to/your/model
--port 8000
--gpu-ids 0,1

Start rollout collector

Start rollout collector

python collect_rollouts.py
--api-endpoint http://localhost:8000
--output-dir ./rollouts
--session-aware
python collect_rollouts.py
--api-endpoint http://localhost:8000
--output-dir ./rollouts
--session-aware

Start async trainer

Start async trainer

python train_async.py
--rollout-dir ./rollouts
--checkpoint-dir ./checkpoints
--method combine
--gpus 2,3,4,5
undefined
python train_async.py
--rollout-dir ./rollouts
--checkpoint-dir ./checkpoints
--method combine
--gpus 2,3,4,5
undefined

General Agentic RL

通用Agentic RL

Terminal Agent

终端Agent

bash
cd terminal-rl
bash
cd terminal-rl

Configure environment

Configure environment

export TASK_TYPE=bash_commands export MAX_STEPS=50
export TASK_TYPE=bash_commands export MAX_STEPS=50

Launch training

Launch training

bash run_terminal_agent.sh

**Terminal rollout example:**

```python
bash run_terminal_agent.sh

**终端轨迹生成示例:**

```python

terminal_rollout.py

terminal_rollout.py

import asyncio import subprocess
async def terminal_rollout(agent_model, task_description: str): """ Collect terminal interaction trajectory """ trajectory = [] terminal_state = initialize_terminal()
for step in range(MAX_STEPS):
    # Agent generates command
    command = await agent_model.generate(
        f"Task: {task_description}\nCurrent state: {terminal_state}\nCommand:"
    )
    
    # Execute in terminal
    result = subprocess.run(
        command,
        shell=True,
        capture_output=True,
        text=True,
        timeout=10
    )
    
    # Compute reward based on output
    reward = compute_terminal_reward(result, task_description)
    
    trajectory.append({
        "command": command,
        "output": result.stdout,
        "error": result.stderr,
        "reward": reward
    })
    
    # Update state
    terminal_state = get_terminal_state()
    
    if task_completed(result, task_description):
        break

return trajectory
undefined
import asyncio import subprocess
async def terminal_rollout(agent_model, task_description: str): """ Collect terminal interaction trajectory """ trajectory = [] terminal_state = initialize_terminal()
for step in range(MAX_STEPS):
    # Agent generates command
    command = await agent_model.generate(
        f"Task: {task_description}\nCurrent state: {terminal_state}\nCommand:"
    )
    
    # Execute in terminal
    result = subprocess.run(
        command,
        shell=True,
        capture_output=True,
        text=True,
        timeout=10
    )
    
    # Compute reward based on output
    reward = compute_terminal_reward(result, task_description)
    
    trajectory.append({
        "command": command,
        "output": result.stdout,
        "error": result.stderr,
        "reward": reward
    })
    
    # Update state
    terminal_state = get_terminal_state()
    
    if task_completed(result, task_description):
        break

return trajectory
undefined

GUI Agent

GUI Agent

bash
cd gui-rl
bash
cd gui-rl

Launch GUI agent training with vision model

Launch GUI agent training with vision model

bash run_gui_agent.sh --model qwen3.5-vl

**GUI interaction example:**

```python
bash run_gui_agent.sh --model qwen3.5-vl

**GUI交互示例:**

```python

gui_rollout.py

gui_rollout.py

from PIL import Image import pyautogui
async def gui_rollout(vision_model, task: str): """ Collect GUI interaction trajectory with screenshots """ trajectory = []
for step in range(MAX_GUI_STEPS):
    # Capture screen
    screenshot = pyautogui.screenshot()
    
    # Agent decides action based on visual input
    action = await vision_model.generate(
        prompt=f"Task: {task}\nWhat action should I take?",
        image=screenshot
    )
    
    # Parse and execute action
    parsed_action = parse_gui_action(action)
    execute_gui_action(parsed_action)
    
    # Get reward from environment/user feedback
    reward = await get_gui_reward(task, screenshot, parsed_action)
    
    trajectory.append({
        "screenshot": screenshot,
        "action": action,
        "reward": reward
    })

return trajectory
undefined
from PIL import Image import pyautogui
async def gui_rollout(vision_model, task: str): """ Collect GUI interaction trajectory with screenshots """ trajectory = []
for step in range(MAX_GUI_STEPS):
    # Capture screen
    screenshot = pyautogui.screenshot()
    
    # Agent decides action based on visual input
    action = await vision_model.generate(
        prompt=f"Task: {task}\nWhat action should I take?",
        image=screenshot
    )
    
    # Parse and execute action
    parsed_action = parse_gui_action(action)
    execute_gui_action(parsed_action)
    
    # Get reward from environment/user feedback
    reward = await get_gui_reward(task, screenshot, parsed_action)
    
    trajectory.append({
        "screenshot": screenshot,
        "action": action,
        "reward": reward
    })

return trajectory
undefined

SWE Agent

软件工程Agent

bash
cd swe-rl
bash
cd swe-rl

Launch software engineering agent training

Launch software engineering agent training

bash run_swe_agent.sh --benchmark swe-bench-lite
undefined
bash run_swe_agent.sh --benchmark swe-bench-lite
undefined

Tool-Call Agent

工具调用Agent

bash
cd toolcall-rl
bash
cd toolcall-rl

Configure available tools

Configure available tools

export TOOLS_CONFIG=./tools_config.json
export TOOLS_CONFIG=./tools_config.json

Launch tool-call agent training

Launch tool-call agent training

bash run_toolcall_agent.sh

**Tool-call training example:**

```python
bash run_toolcall_agent.sh

**工具调用训练示例:**

```python

toolcall_trainer.py

toolcall_trainer.py

import json
def train_toolcall_agent(model, tools_config_path: str): """ Train agent to use tools effectively """ with open(tools_config_path) as f: tools = json.load(f)
# Create tool-augmented prompts
tool_descriptions = format_tool_descriptions(tools)

# Training loop
for batch in dataloader:
    tasks = batch["tasks"]
    
    # Collect trajectories with tool usage
    trajectories = []
    for task in tasks:
        trajectory = collect_toolcall_trajectory(
            model=model,
            task=task,
            available_tools=tools
        )
        trajectories.append(trajectory)
    
    # Compute loss and update
    loss = compute_toolcall_loss(trajectories)
    loss.backward()
    optimizer.step()
undefined
import json
def train_toolcall_agent(model, tools_config_path: str): """ Train agent to use tools effectively """ with open(tools_config_path) as f: tools = json.load(f)
# Create tool-augmented prompts
tool_descriptions = format_tool_descriptions(tools)

# Training loop
for batch in dataloader:
    tasks = batch["tasks"]
    
    # Collect trajectories with tool usage
    trajectories = []
    for task in tasks:
        trajectory = collect_toolcall_trajectory(
            model=model,
            task=task,
            available_tools=tools
        )
        trajectories.append(trajectory)
    
    # Compute loss and update
    loss = compute_toolcall_loss(trajectories)
    loss.backward()
    optimizer.step()
undefined

LoRA Training Support

LoRA训练支持

bash
undefined
bash
undefined

Configure LoRA parameters

Configure LoRA parameters

export USE_LORA=true export LORA_RANK=16 export LORA_ALPHA=32 export LORA_DROPOUT=0.1
export USE_LORA=true export LORA_RANK=16 export LORA_ALPHA=32 export LORA_DROPOUT=0.1

Launch with LoRA

Launch with LoRA

bash run_combine_training.sh --lora

**LoRA configuration:**

```python
bash run_combine_training.sh --lora

**LoRA配置:**

```python

lora_config.py

lora_config.py

from peft import LoraConfig, get_peft_model
def setup_lora_model(base_model, lora_rank=16, lora_alpha=32): """ Configure model with LoRA adapters """ lora_config = LoraConfig( r=lora_rank, lora_alpha=lora_alpha, target_modules=["q_proj", "k_proj", "v_proj", "o_proj"], lora_dropout=0.1, bias="none", task_type="CAUSAL_LM" )
peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()

return peft_model
undefined
from peft import LoraConfig, get_peft_model
def setup_lora_model(base_model, lora_rank=16, lora_alpha=32): """ Configure model with LoRA adapters """ lora_config = LoraConfig( r=lora_rank, lora_alpha=lora_alpha, target_modules=["q_proj", "k_proj", "v_proj", "o_proj"], lora_dropout=0.1, bias="none", task_type="CAUSAL_LM" )
peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()

return peft_model
undefined

Cloud Deployment

云部署

Tinker Deployment

Tinker部署

bash
undefined
bash
undefined

Configure Tinker credentials

Configure Tinker credentials

export TINKER_API_KEY=$YOUR_TINKER_KEY export TINKER_PROJECT_ID=your_project_id
export TINKER_API_KEY=$YOUR_TINKER_KEY export TINKER_PROJECT_ID=your_project_id

Deploy to Tinker

Deploy to Tinker

bash deploy_to_tinker.sh
undefined
bash deploy_to_tinker.sh
undefined

Fireworks AI Deployment

Fireworks AI部署

bash
undefined
bash
undefined

Configure Fireworks AI

Configure Fireworks AI

export FIREWORKS_API_KEY=$YOUR_FIREWORKS_KEY
export FIREWORKS_API_KEY=$YOUR_FIREWORKS_KEY

Deploy training job

Deploy training job

bash deploy_to_fireworks.sh --gpus 8 --method combine
undefined
bash deploy_to_fireworks.sh --gpus 8 --method combine
undefined

Configuration Files

配置文件

Training Configuration

训练配置

yaml
undefined
yaml
undefined

config/training_config.yaml

config/training_config.yaml

model: name: qwen3.5-4b checkpoint_path: /path/to/checkpoint tokenizer_path: /path/to/tokenizer
training: method: combine # binary, opd, or combine batch_size: 32 gradient_accumulation_steps: 4 learning_rate: 1e-6 warmup_steps: 100 max_steps: 10000

Binary RL params

ppo_clip_ratio: 0.2 value_clip_ratio: 0.2 gae_lambda: 0.95

OPD params

teacher_temperature: 1.0 hint_extraction_model: gpt-4

Hybrid params

binary_weight: 0.5 opd_weight: 0.5
rollout: num_workers: 4 max_turns: 10 session_aware: true parallel_envs: 16
evaluation: judge_model: gpt-4 majority_voting: true num_judges: 3 eval_frequency: 100
undefined
model: name: qwen3.5-4b checkpoint_path: /path/to/checkpoint tokenizer_path: /path/to/tokenizer
training: method: combine # binary, opd, or combine batch_size: 32 gradient_accumulation_steps: 4 learning_rate: 1e-6 warmup_steps: 100 max_steps: 10000

Binary RL params

ppo_clip_ratio: 0.2 value_clip_ratio: 0.2 gae_lambda: 0.95

OPD params

teacher_temperature: 1.0 hint_extraction_model: gpt-4

Hybrid params

binary_weight: 0.5 opd_weight: 0.5
rollout: num_workers: 4 max_turns: 10 session_aware: true parallel_envs: 16
evaluation: judge_model: gpt-4 majority_voting: true num_judges: 3 eval_frequency: 100
undefined

Rollout Configuration

轨迹生成配置

json
{
  "rollout_config": {
    "collection_mode": "async",
    "max_concurrent_sessions": 100,
    "session_timeout": 3600,
    "trajectory_format": "multi_turn",
    "message_classification": {
      "main_line": ["user", "assistant"],
      "side": ["system", "tool"]
    },
    "reward_computation": {
      "type": "next_state_feedback",
      "aggregation": "majority",
      "num_samples": 3
    }
  }
}
json
{
  "rollout_config": {
    "collection_mode": "async",
    "max_concurrent_sessions": 100,
    "session_timeout": 3600,
    "trajectory_format": "multi_turn",
    "message_classification": {
      "main_line": ["user", "assistant"],
      "side": ["system", "tool"]
    },
    "reward_computation": {
      "type": "next_state_feedback",
      "aggregation": "majority",
      "num_samples": 3
    }
  }
}

Common Patterns

常见模式

Custom Reward Function

自定义奖励函数

python
undefined
python
undefined

custom_reward.py

custom_reward.py

import torch
class CustomRewardModel: def init(self, checkpoint_path: str): self.model = load_reward_model(checkpoint_path)
def compute_reward(
    self,
    prompt: str,
    response: str,
    next_feedback: str
) -> float:
    """
    Compute reward based on response quality and next feedback
    """
    # Encode inputs
    inputs = self.tokenize(
        f"Prompt: {prompt}\nResponse: {response}\nFeedback: {next_feedback}"
    )
    
    # Get reward score
    with torch.no_grad():
        reward = self.model(inputs).item()
    
    return reward

def batch_compute_rewards(self, batch_data):
    """
    Efficiently compute rewards for batch
    """
    rewards = []
    for item in batch_data:
        reward = self.compute_reward(
            item["prompt"],
            item["response"],
            item["feedback"]
        )
        rewards.append(reward)
    return torch.tensor(rewards)
undefined
import torch
class CustomRewardModel: def init(self, checkpoint_path: str): self.model = load_reward_model(checkpoint_path)
def compute_reward(
    self,
    prompt: str,
    response: str,
    next_feedback: str
) -> float:
    """
    Compute reward based on response quality and next feedback
    """
    # Encode inputs
    inputs = self.tokenize(
        f"Prompt: {prompt}\nResponse: {response}\nFeedback: {next_feedback}"
    )
    
    # Get reward score
    with torch.no_grad():
        reward = self.model(inputs).item()
    
    return reward

def batch_compute_rewards(self, batch_data):
    """
    Efficiently compute rewards for batch
    """
    rewards = []
    for item in batch_data:
        reward = self.compute_reward(
            item["prompt"],
            item["response"],
            item["feedback"]
        )
        rewards.append(reward)
    return torch.tensor(rewards)
undefined

Session-Aware Trajectory Processing

会话感知轨迹处理

python
undefined
python
undefined

session_processor.py

session_processor.py

from collections import defaultdict
class SessionAwareProcessor: def init(self): self.sessions = defaultdict(list)
def add_interaction(self, session_id: str, interaction: dict):
    """
    Add interaction to session trajectory
    """
    self.sessions[session_id].append(interaction)

def get_training_trajectories(self, min_turns: int = 3):
    """
    Extract complete trajectories for training
    """
    trajectories = []
    
    for session_id, interactions in self.sessions.items():
        if len(interactions) >= min_turns:
            # Classify messages
            main_line = [
                i for i in interactions
                if i["role"] in ["user", "assistant"]
            ]
            
            # Create trajectory with advantages
            trajectory = self.compute_trajectory_advantages(main_line)
            trajectories.append(trajectory)
    
    return trajectories

def compute_trajectory_advantages(self, interactions: list):
    """
    Compute GAE advantages for trajectory
    """
    rewards = [i["reward"] for i in interactions]
    values = [i.get("value", 0) for i in interactions]
    
    advantages = compute_gae(
        rewards=rewards,
        values=values,
        gamma=0.99,
        lambda_=0.95
    )
    
    return {
        "interactions": interactions,
        "advantages": advantages
    }
undefined
from collections import defaultdict
class SessionAwareProcessor: def init(self): self.sessions = defaultdict(list)
def add_interaction(self, session_id: str, interaction: dict):
    """
    Add interaction to session trajectory
    """
    self.sessions[session_id].append(interaction)

def get_training_trajectories(self, min_turns: int = 3):
    """
    Extract complete trajectories for training
    """
    trajectories = []
    
    for session_id, interactions in self.sessions.items():
        if len(interactions) >= min_turns:
            # Classify messages
            main_line = [
                i for i in interactions
                if i["role"] in ["user", "assistant"]
            ]
            
            # Create trajectory with advantages
            trajectory = self.compute_trajectory_advantages(main_line)
            trajectories.append(trajectory)
    
    return trajectories

def compute_trajectory_advantages(self, interactions: list):
    """
    Compute GAE advantages for trajectory
    """
    rewards = [i["reward"] for i in interactions]
    values = [i.get("value", 0) for i in interactions]
    
    advantages = compute_gae(
        rewards=rewards,
        values=values,
        gamma=0.99,
        lambda_=0.95
    )
    
    return {
        "interactions": interactions,
        "advantages": advantages
    }
undefined

Monitoring and Debugging

监控与调试

Weights & Biases Integration

Weights & Biases集成

python
undefined
python
undefined

wandb_logging.py

wandb_logging.py

import wandb
def setup_wandb_logging(project_name: str, config: dict): """ Initialize W&B tracking """ wandb.init( project=project_name, config=config, name=f"openclaw-rl-{config['method']}" )
def log_training_metrics(step: int, metrics: dict): """ Log metrics to W&B """ wandb.log({ "step": step, "loss/total": metrics["total_loss"], "loss/binary": metrics.get("binary_loss", 0), "loss/opd": metrics.get("opd_loss", 0), "reward/mean": metrics["mean_reward"], "reward/std": metrics["std_reward"], "gradient/norm": metrics["grad_norm"], "learning_rate": metrics["lr"] })
undefined
import wandb
def setup_wandb_logging(project_name: str, config: dict): """ Initialize W&B tracking """ wandb.init( project=project_name, config=config, name=f"openclaw-rl-{config['method']}" )
def log_training_metrics(step: int, metrics: dict): """ Log metrics to W&B """ wandb.log({ "step": step, "loss/total": metrics["total_loss"], "loss/binary": metrics.get("binary_loss", 0), "loss/opd": metrics.get("opd_loss", 0), "reward/mean": metrics["mean_reward"], "reward/std": metrics["std_reward"], "gradient/norm": metrics["grad_norm"], "learning_rate": metrics["lr"] })
undefined

Debug Rollout Collection

轨迹生成收集调试

bash
undefined
bash
undefined

Enable debug logging

Enable debug logging

export OPENCLAW_DEBUG=true export ROLLOUT_LOG_LEVEL=DEBUG
export OPENCLAW_DEBUG=true export ROLLOUT_LOG_LEVEL=DEBUG

Test rollout collection

Test rollout collection

python -m openclaw_combine.test_rollout
--num-samples 10
--output-dir ./debug_rollouts
undefined
python -m openclaw_combine.test_rollout
--num-samples 10
--output-dir ./debug_rollouts
undefined

Troubleshooting

故障排除

Out of Memory During Training

训练期间内存不足

bash
undefined
bash
undefined

Reduce batch size and use gradient accumulation

Reduce batch size and use gradient accumulation

export BATCH_SIZE=8 export GRAD_ACCUM_STEPS=8
export BATCH_SIZE=8 export GRAD_ACCUM_STEPS=8

Enable gradient checkpointing

Enable gradient checkpointing

export USE_GRADIENT_CHECKPOINTING=true
export USE_GRADIENT_CHECKPOINTING=true

Use LoRA instead of full fine-tuning

Use LoRA instead of full fine-tuning

export USE_LORA=true export LORA_RANK=8
undefined
export USE_LORA=true export LORA_RANK=8
undefined

Slow Rollout Collection

轨迹生成收集缓慢

python
undefined
python
undefined

Increase parallel workers

Increase parallel workers

ROLLOUT_ARGS=" --num-rollout-workers 16
--parallel-envs 32
--async-collection "
undefined
ROLLOUT_ARGS=" --num-rollout-workers 16
--parallel-envs 32
--async-collection "
undefined

Reward Model Disagreement

奖励模型意见不一致

yaml
undefined
yaml
undefined

Use majority voting with more judges

Use majority voting with more judges

evaluation: judge_model: gpt-4 majority_voting: true num_judges: 5 # Increase from 3 consensus_threshold: 0.6
undefined
evaluation: judge_model: gpt-4 majority_voting: true num_judges: 5 # Increase from 3 consensus_threshold: 0.6
undefined

Training Instability

训练不稳定

bash
undefined
bash
undefined

Reduce learning rate and clip gradients

Reduce learning rate and clip gradients

export LEARNING_RATE=5e-7 export CLIP_GRAD_NORM=0.5
export LEARNING_RATE=5e-7 export CLIP_GRAD_NORM=0.5

Adjust PPO clipping

Adjust PPO clipping

export PPO_CLIP_RATIO=0.1
export PPO_CLIP_RATIO=0.1

Enable value function clipping

Enable value function clipping

export VALUE_CLIP=true
undefined
export VALUE_CLIP=true
undefined

Session Tracking Issues

会话跟踪问题

python
undefined
python
undefined

Check session classification

Check session classification

from openclaw_combine.utils import inspect_sessions
sessions = inspect_sessions("./rollouts") for session_id, data in sessions.items(): print(f"Session {session_id}:") print(f" Total turns: {len(data)}") print(f" Main-line turns: {sum(1 for i in data if i['type'] == 'main')}") print(f" Side turns: {sum(1 for i in data if i['type'] == 'side')}")
undefined
from openclaw_combine.utils import inspect_sessions
sessions = inspect_sessions("./rollouts") for session_id, data in sessions.items(): print(f"Session {session_id}:") print(f" Total turns: {len(data)}") print(f" Main-line turns: {sum(1 for i in data if i['type'] == 'main')}") print(f" Side turns: {sum(1 for i in data if i['type'] == 'side')}")
undefined

Best Practices

最佳实践

  1. Start with small scale: Test with 1-2 GPUs and small batch sizes before scaling
  2. Monitor gradients: Watch for gradient explosion/vanishing in early steps
  3. Use wandb: Track experiments systematically with Weights & Biases
  4. Checkpoint frequently: Save checkpoints every 100-500 steps for recovery
  5. Validate rollouts: Inspect collected trajectories before full training runs
  6. Combine methods gradually: Start with Binary RL, then OPD, then Hybrid
  7. Keep framework unmodified: Use extension points instead of modifying core code
  1. 从小规模开始:在扩容前先用1-2个GPU和小批量大小进行测试
  2. 监控梯度:在训练初期留意梯度爆炸或消失的情况
  3. 使用wandb:通过Weights & Biases系统性地跟踪实验
  4. 频繁保存检查点:每100-500步保存一次检查点,以便恢复
  5. 验证轨迹生成:在全面训练前检查收集到的轨迹
  6. 逐步组合方法:先从Binary RL开始,再引入OPD,最后使用混合法
  7. 不修改框架核心:使用扩展点而非修改核心代码