fine-tuning-serving-openpi


# OpenPI Fine-Tuning and Serving

End-to-end workflows for fine-tuning and serving Physical Intelligence's OpenPI models (pi0, pi0-fast, pi0.5) on robot manipulation tasks from the public `openpi` repository. Covers blank-machine setup, JAX training, PyTorch training, checkpoint conversion, and policy inference serving.

## Quick start

Clone the public repo, install the workspace, then serve a pretrained policy:

```bash
git clone --recurse-submodules https://github.com/Physical-Intelligence/openpi.git
cd openpi
GIT_LFS_SKIP_SMUDGE=1 uv sync
GIT_LFS_SKIP_SMUDGE=1 uv pip install -e .
uv run scripts/serve_policy.py --env DROID
```

```python
from openpi_client import websocket_client_policy

client = websocket_client_policy.WebsocketClientPolicy(host="localhost", port=8000)
result = client.infer(observation)
actions = result["actions"]  # numpy array of shape (chunk_size, action_dim)
```
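The returned chunk is meant to be consumed receding-horizon style: execute the first few actions, then re-infer from a fresh observation. A minimal sketch of that loop, with a `DummyPolicy` stand-in for the real WebSocket client (the class and the chunk sizes here are illustrative assumptions, not openpi APIs):

```python
import numpy as np

class DummyPolicy:
    """Stand-in for WebsocketClientPolicy: returns a fixed-size action chunk."""
    def __init__(self, chunk_size=10, action_dim=7):
        self.chunk_size = chunk_size
        self.action_dim = action_dim

    def infer(self, observation):
        # A real server predicts from the observation; here we return zeros.
        return {"actions": np.zeros((self.chunk_size, self.action_dim))}

def run_episode(policy, get_obs, execute, steps=30, execute_horizon=5):
    """Receding-horizon loop: execute the first `execute_horizon` actions
    of each chunk, then re-infer from a fresh observation."""
    executed = 0
    while executed < steps:
        chunk = policy.infer(get_obs())["actions"]
        for action in chunk[:execute_horizon]:
            execute(action)
            executed += 1
            if executed >= steps:
                break
    return executed

policy = DummyPolicy()
log = []
n = run_episode(policy, get_obs=lambda: {"state": np.zeros(7)},
                execute=log.append, steps=30, execute_horizon=5)
```

Shorter execute horizons react faster to disturbances at the cost of more inference calls.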

## Core concepts

**Model family**: OpenPI implements three model variants from Physical Intelligence:

| Model | Architecture | Speed | Quality | Typical use |
| --- | --- | --- | --- | --- |
| pi0 | Flow-matching VLA | Baseline | Highest | Research, complex tasks |
| pi0-fast | Autoregressive action tokens | 2-5x faster | Good | Real-time control |
| pi0.5 | pi0 + improved vision encoder | Baseline | Best | Latest default |

**Key design choices**:

- Dual backend: JAX (primary, official training) and PyTorch (community, deployment-friendly)
- Config-driven: all training/serving parameters are defined in `src/openpi/training/config.py`
- Norm stats: every config requires precomputed normalization statistics before training
- WebSocket serving: policy servers expose a WebSocket API for low-latency inference

**Training loop invariant**: after every config or dataset change, always re-run this cycle:

1. Compute norm stats → 2. Train → 3. Serve checkpoint → 4. Validate inference
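The invariant above can be enforced mechanically. A hypothetical guard, assuming norm stats land in an `<assets_dir>/<config_name>/norm_stats.json` layout (the exact filename and layout are assumptions, not a documented openpi contract):

```python
import tempfile
from pathlib import Path

def assert_norm_stats_present(assets_dir: str, config_name: str) -> Path:
    """Refuse to start training when normalization statistics are missing."""
    stats_path = Path(assets_dir) / config_name / "norm_stats.json"
    if not stats_path.exists():
        raise FileNotFoundError(
            f"Missing {stats_path}; run "
            f"`uv run scripts/compute_norm_stats.py --config-name {config_name}` first."
        )
    return stats_path

# Demonstrate both outcomes in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    config_dir = Path(tmp) / "my_custom_config"
    config_dir.mkdir()
    (config_dir / "norm_stats.json").write_text("{}")
    found = assert_norm_stats_present(tmp, "my_custom_config")
    try:
        assert_norm_stats_present(tmp, "other_config")
        missing_raised = False
    except FileNotFoundError:
        missing_raised = True
```

Calling a guard like this at the top of a training launcher turns a silent stale-stats bug into an immediate, actionable error.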

## Compute requirements

| Task | GPU | VRAM | Notes |
| --- | --- | --- | --- |
| Serve pi0.5 (inference) | 1x A100/H100 | ~24 GB | Single GPU sufficient |
| Fine-tune pi0.5 (JAX) | 1x A100 80GB | ~60 GB | Use `fsdp_devices` for multi-GPU |
| Fine-tune pi0 (JAX) | 1x A100 80GB | ~40 GB | Smaller model footprint |
| Fine-tune (PyTorch DDP) | 1-8x A100 | ~40 GB/GPU | `torchrun` launcher |
| Compute norm stats | CPU or 1x GPU | ~8 GB | Fast, can run on login node |

## Workflow 0: Blank-machine setup

Copy this checklist and track progress:

```text
Setup Progress:
- [ ] Step 1: Clone the public openpi repo with submodules
- [ ] Step 2: Install uv and sync the workspace
- [ ] Step 3: Install the editable package
- [ ] Step 4: Verify core imports and serving entrypoint
```

**Step 1: Clone repo**

```bash
git clone --recurse-submodules https://github.com/Physical-Intelligence/openpi.git
cd openpi
```

If you already cloned without submodules:

```bash
git submodule update --init --recursive
```

**Step 2: Sync dependencies**

```bash
GIT_LFS_SKIP_SMUDGE=1 uv sync
```

**Step 3: Install editable package**

```bash
GIT_LFS_SKIP_SMUDGE=1 uv pip install -e .
```

**Step 4: Verify installation**

```bash
uv run python -c "from openpi.training import config as _config; print(_config.get_config('pi05_droid').name)"
uv run scripts/serve_policy.py --help
```

## When to use vs alternatives

Use this skill when:

- Fine-tuning pi0, pi0-fast, or pi0.5 on LeRobot or RLDS datasets
- Serving OpenPI policies for ALOHA, DROID, or LIBERO evaluation
- Converting JAX checkpoints to PyTorch format
- Debugging OpenPI training issues (norm stats, memory, config)

Use `fine-tuning-openvla-oft` instead when:

- Fine-tuning OpenVLA with continuous action heads and LoRA
- Reproducing OpenVLA-OFT paper results on LIBERO or ALOHA

Use `evaluating-cosmos-policy` instead when:

- Evaluating NVIDIA Cosmos Policy on simulation benchmarks

## Workflow 1: JAX fine-tuning on LeRobot data

Copy this checklist and track progress:

```text
JAX Fine-Tuning Progress:
- [ ] Step 1: Select and copy closest training config
- [ ] Step 2: Update dataset mapping and base checkpoint
- [ ] Step 3: Compute normalization statistics
- [ ] Step 4: Launch JAX training
- [ ] Step 5: Serve checkpoint and run inference sanity check
```

**Step 1: Select config**

Copy the closest config from `src/openpi/training/config.py`:

| Config | Use case |
| --- | --- |
| `pi05_libero` | pi0.5 LIBERO fine-tuning |
| `pi0_libero` | pi0 full fine-tuning on LIBERO |
| `pi0_fast_libero` | pi0-fast on LIBERO |
| `pi0_aloha_pen_uncap` | ALOHA custom data |
| `pi05_droid_finetune` | Small custom DROID dataset (LeRobot format) |
| `pi05_full_droid_finetune` | Full DROID RLDS large-scale training |

**Step 2: Update dataset and transforms**

In `src/openpi/training/config.py`, modify your config:

```python
TrainConfig(
    name="my_custom_config",
    model_type="pi05",
    data=LeRobotDataConfig(
        repo_id="your-org/your-dataset",
        # Adjust transforms to match your data format
    ),
    weight_loader=Pi05WeightLoader(),  # Match model type
)
```

Set `repo_id` for your dataset and ensure `weight_loader` matches the model type (pi0 vs pi0.5).
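Conceptually, these configs live in a name-keyed registry and `get_config` is an exact-name lookup. A simplified stand-in (the field set beyond `name` and the repo ids here are made up for illustration; this is not openpi's actual code):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    """Minimal stand-in for openpi's TrainConfig (the real one has many more fields)."""
    name: str
    model_type: str
    repo_id: str = ""

# Name-keyed registry, mirroring the exact-match lookup semantics.
_CONFIGS = {c.name: c for c in [
    TrainConfig("pi05_libero", "pi05", "your-org/libero-demos"),
    TrainConfig("pi0_libero", "pi0", "your-org/libero-demos"),
]}

def get_config(name: str) -> TrainConfig:
    """Exact-name lookup; unknown names fail loudly with the valid choices."""
    if name not in _CONFIGS:
        raise ValueError(f"Config '{name}' not found; choose from {sorted(_CONFIGS)}")
    return _CONFIGS[name]

cfg = get_config("pi05_libero")
try:
    get_config("pi5_libero")  # typo: exact match required
    unknown_raised = False
except ValueError:
    unknown_raised = True
```

This is why the "Config not found" error under Common issues is always an exact-spelling problem.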

**Step 3: Compute normalization statistics**

```bash
uv run scripts/compute_norm_stats.py --config-name <config_name>
```

This must run before every training launch when config, dataset, or transforms change.
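Conceptually, the script computes per-key statistics over every timestep and saves them for the data transforms to apply. A simplified numpy sketch on synthetic episodes (the real script streams the dataset and records more than mean/std):

```python
import numpy as np

def compute_norm_stats(episodes, keys=("state", "actions")):
    """Per-key mean/std over all timesteps (simplified; openpi's script
    also handles dataset transforms and additional statistics)."""
    stats = {}
    for key in keys:
        data = np.concatenate([ep[key] for ep in episodes], axis=0)
        stats[key] = {"mean": data.mean(axis=0), "std": data.std(axis=0) + 1e-6}
    return stats

def normalize(x, s):
    return (x - s["mean"]) / s["std"]

# Synthetic episodes: 4 episodes x 50 timesteps of 7-dim state/actions.
rng = np.random.default_rng(0)
episodes = [{"state": rng.normal(2.0, 3.0, size=(50, 7)),
             "actions": rng.normal(-1.0, 0.5, size=(50, 7))} for _ in range(4)]
stats = compute_norm_stats(episodes)
normed = normalize(np.concatenate([ep["state"] for ep in episodes]), stats["state"])
```

Stale stats silently shift the normalized distribution the model sees, which is why recomputation after any data change is non-negotiable.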
**Step 4: Launch JAX training**

```bash
XLA_PYTHON_CLIENT_MEM_FRACTION=0.9 uv run scripts/train.py <config_name> \
  --exp-name=<run_name> \
  --overwrite
```

For full DROID RLDS training, add the `rlds` dependency group:

```bash
uv run --group rlds scripts/compute_norm_stats.py \
  --config-name pi05_full_droid_finetune \
  --max-frames 10000000

XLA_PYTHON_CLIENT_MEM_FRACTION=0.9 uv run --group rlds scripts/train.py \
  pi05_full_droid_finetune \
  --exp-name=<run_name> --overwrite
```

**Step 5: Serve and validate**

```bash
uv run scripts/serve_policy.py policy:checkpoint \
  --policy.config=<config_name> \
  --policy.dir=checkpoints/<config_name>/<run_name>/<step>
```
Verify with a test client:

```python
from openpi_client import websocket_client_policy

client = websocket_client_policy.WebsocketClientPolicy(host="localhost", port=8000)

# Build observation matching your config's expected keys
obs = {"image": img_array, "state": state_array, "prompt": "pick up the cup"}
result = client.infer(obs)
print(f"Action shape: {result['actions'].shape}")  # (chunk_size, action_dim)
```

---

## Workflow 2: PyTorch training and checkpoint conversion

Copy this checklist and track progress:

```text
PyTorch Setup Progress:
- [ ] Step 1: Sync dependencies and verify transformers version
- [ ] Step 2: Apply OpenPI transformers patches
- [ ] Step 3: Convert JAX checkpoint to PyTorch format
- [ ] Step 4: Launch PyTorch training or serve converted checkpoint
```

**Step 1: Sync dependencies**

```bash
uv sync
uv pip show transformers
```

**Step 2: Apply required patches**

OpenPI's PyTorch path requires custom modifications to the installed `transformers` package:

```bash
cp -r ./src/openpi/models_pytorch/transformers_replace/* \
  .venv/lib/python3.11/site-packages/transformers/
```

**Step 3: Convert JAX checkpoint**

```bash
uv run examples/convert_jax_model_to_pytorch.py \
  --checkpoint_dir <jax_checkpoint_dir> \
  --config_name <config_name> \
  --output_path <pytorch_checkpoint_dir>
```

**Step 4: Train or serve**

Single-GPU training:

```bash
uv run scripts/train_pytorch.py <config_name> --exp_name <run_name>
```

Multi-GPU distributed training:

```bash
uv run torchrun --standalone --nnodes=1 --nproc_per_node=<num_gpus> \
  scripts/train_pytorch.py <config_name> --exp_name <run_name>
```

Programmatic inference with a converted checkpoint:

```python
from openpi.training import config as _config
from openpi.policies import policy_config

config = _config.get_config("pi05_droid")
policy = policy_config.create_trained_policy(config, "<pytorch_checkpoint_dir>")
result = policy.infer(example)
actions = result["actions"]  # numpy array
```

Checkpoints follow the convention `checkpoints/<config_name>/<exp_name>/<step>/`.
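Given that convention, a small helper can resolve the newest checkpoint for serving. This resolver is hypothetical, not part of openpi:

```python
import tempfile
from pathlib import Path

def latest_checkpoint(root, config_name, exp_name):
    """Return the highest-numbered step directory under
    root/config_name/exp_name, following the checkpoint convention."""
    run_dir = Path(root) / config_name / exp_name
    steps = [p for p in run_dir.iterdir() if p.is_dir() and p.name.isdigit()]
    if not steps:
        raise FileNotFoundError(f"No step directories in {run_dir}")
    return max(steps, key=lambda p: int(p.name))

# Demonstrate on a throwaway tree mimicking the convention.
with tempfile.TemporaryDirectory() as tmp:
    for step in ("1000", "5000", "20000"):
        (Path(tmp) / "pi05_libero" / "my_run" / step).mkdir(parents=True)
    latest = latest_checkpoint(tmp, "pi05_libero", "my_run")
    latest_step = latest.name
```

Note the numeric comparison: a lexicographic `max` would wrongly rank `5000` above `20000`.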

## Workflow 3: Policy inference serving

Copy this checklist and track progress:

```text
Inference Server Progress:
- [ ] Step 1: Choose target environment and checkpoint
- [ ] Step 2: Start policy server
- [ ] Step 3: Confirm server is reachable
- [ ] Step 4: Integrate client into robot or simulation code
```

**Step 1: Choose environment**

Default environment presets:

| Environment | Config | Default checkpoint |
| --- | --- | --- |
| ALOHA | `pi05_aloha` | `gs://openpi-assets/checkpoints/pi05_base` |
| ALOHA_SIM | `pi0_aloha_sim` | `gs://openpi-assets/checkpoints/pi0_aloha_sim` |
| DROID | `pi05_droid` | `gs://openpi-assets/checkpoints/pi05_droid` |
| LIBERO | `pi05_libero` | `gs://openpi-assets/checkpoints/pi05_libero` |

**Step 2: Start server**

Default mode (uses preset checkpoint):

```bash
uv run scripts/serve_policy.py --env ALOHA
```

Explicit checkpoint mode (custom or local model):

```bash
uv run scripts/serve_policy.py policy:checkpoint \
  --policy.config=pi05_libero \
  --policy.dir=checkpoints/pi05_libero/my_run/20000
```

Add `--default_prompt "task description"` when runtime observations omit a prompt.

**Step 3: Verify connectivity**

```bash
uv run examples/simple_client/main.py --env DROID
```

**Step 4: Embed remote client in robot code**

Install the lightweight client in your robot environment:

```bash
pip install "openpi-client @ git+https://github.com/Physical-Intelligence/openpi.git#subdirectory=packages/openpi-client"
```
Full integration example:

```python
from openpi_client import websocket_client_policy
import numpy as np

# Connect to remote policy server
client = websocket_client_policy.WebsocketClientPolicy(
    host="gpu-server.local", port=8000
)

# Build observation (keys must match policy transforms)
observation = {
    "image": np.random.rand(224, 224, 3),  # RGB image
    "state": np.zeros(7),                  # Joint positions
    "prompt": "pick up the red block",
}

# Get actions
result = client.infer(observation)
actions = result["actions"]  # shape: (action_chunk_size, action_dim)

# Execute first action on robot
robot.step(actions[0])
```

---
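The `--default_prompt` behavior can also be approximated on the client side. A hypothetical wrapper that injects a prompt when the observation omits one (`DummyPolicy` stands in for the real WebSocket client; neither class is an openpi API):

```python
class DummyPolicy:
    """Stand-in for WebsocketClientPolicy; echoes back the observation it saw."""
    def infer(self, observation):
        return {"actions": [[0.0]], "observation": observation}

class DefaultPromptPolicy:
    """Wraps any policy client and injects a default prompt when missing."""
    def __init__(self, policy, default_prompt):
        self.policy = policy
        self.default_prompt = default_prompt

    def infer(self, observation):
        obs = dict(observation)  # don't mutate the caller's dict
        obs.setdefault("prompt", self.default_prompt)
        return self.policy.infer(obs)

client = DefaultPromptPolicy(DummyPolicy(), "pick up the cup")
with_prompt = client.infer({"state": [0.0], "prompt": "open the drawer"})
without_prompt = client.infer({"state": [0.0]})
```

An explicit prompt always wins; the default only fills the gap, matching the server-side flag's intent.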

## Common issues

**Issue: Missing norm stats error**

Fix: run `scripts/compute_norm_stats.py --config-name <config_name>` before training.

**Issue: Out of memory during JAX training**

Fix: set `XLA_PYTHON_CLIENT_MEM_FRACTION=0.9`, lower the batch size, or configure `fsdp_devices`:

```python
# In config: use model-parallel sharding
TrainConfig(
    ...
    fsdp_devices=4,  # Shard across 4 GPUs
)
```
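`fsdp_devices` shards work across GPUs, so the global batch size must divide evenly by it. An illustrative numpy sketch of the per-device split (JAX does this internally via its sharding APIs; this helper is not openpi code):

```python
import numpy as np

def shard_batch(batch, fsdp_devices):
    """Split the leading batch axis across devices; fail loudly on a
    batch size that does not divide evenly (a common pitfall when
    shrinking the batch to fix OOM)."""
    if batch.shape[0] % fsdp_devices != 0:
        raise ValueError(
            f"batch size {batch.shape[0]} not divisible by fsdp_devices={fsdp_devices}")
    per_device = batch.shape[0] // fsdp_devices
    return batch.reshape(fsdp_devices, per_device, *batch.shape[1:])

global_batch = np.zeros((32, 7))
sharded = shard_batch(global_batch, 4)  # one leading axis per device
try:
    shard_batch(np.zeros((30, 7)), 4)
    bad_split_raised = False
except ValueError:
    bad_split_raised = True
```

When lowering the batch size to fix OOM, keep it a multiple of `fsdp_devices`.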

**Issue: OOM while loading PyTorch checkpoints**

Fix: `export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`

**Issue: Config not found**

Fix: ensure config name exists in `src/openpi/training/config.py` (exact match from `_CONFIGS` dict).

**Issue: PyTorch training diverges after library changes**

Fix: reapply the transformer patch. Run `uv cache clean transformers` to reset, then reapply.

**Issue: `serve_policy.py` crashes with `ModuleNotFoundError`**

Fix: resync the public workspace first:

```bash
GIT_LFS_SKIP_SMUDGE=1 uv sync
GIT_LFS_SKIP_SMUDGE=1 uv pip install -e .
```

If the missing module is simulator-related, install the extra runtime dependencies called for by that example:

```bash
uv pip install pytest robosuite==1.4.0 gym bddl easydict matplotlib
```

**Issue: `uv sync` fails with `rerun-sdk` wheel mismatch**

Fix:

```bash
uv sync --no-dev
# or
uv sync --no-dev --no-install-package rerun-sdk
```

**Issue: Checkpoint download times out**

Fix: install `gsutil` and prefetch manually:

```bash
pip install gsutil
gsutil -m cp -r gs://openpi-assets/checkpoints/pi05_libero /local/cache/
```

Remove stale `.lock` files if a previous download was interrupted.

**Issue: Policy server exits with code `137`**

Fix: OOM kill. Set the JAX memory variables:

```bash
export XLA_PYTHON_CLIENT_PREALLOCATE=false
export XLA_PYTHON_CLIENT_ALLOCATOR=platform
```

## For HPC/cluster users

On Slurm-managed clusters, wrap commands with a resource allocation:

```bash
srun --partition=gpu --gpus-per-node=1 --mem=64G --cpus-per-task=8 --pty bash
```

Route caches to scratch storage to avoid filling `/home`:

```bash
export HF_HOME=/scratch/$USER/.cache/huggingface
export XDG_CACHE_HOME=/scratch/$USER/.cache
export PIP_CACHE_DIR=/scratch/$USER/.cache/pip
export UV_CACHE_DIR=/scratch/$USER/.cache/uv
```

Avoid stacking cluster Python modules on top of uv-managed environments; typically `module load cuda` is sufficient.

## Advanced topics

- Config recipes and baselines: see `references/config-recipes.md`
- Training debugging guide: see `references/training-debugging.md`
- Checkpoint and environment mapping: see `references/checkpoints-and-env-map.md`
- Remote client integration: see `references/remote-client-pattern.md`
- PyTorch precision and patching gotchas: see `references/pytorch-gotchas.md`

## Resources