fine-tuning-serving-openpi
# OpenPI Fine-Tuning and Serving
End-to-end workflows for fine-tuning and serving Physical Intelligence's OpenPI models (pi0, pi0-fast, pi0.5) on robot manipulation tasks, based on the public repository. Covers blank-machine setup, JAX training, PyTorch training, checkpoint conversion, and policy inference serving.
## Quick start
Clone the public repo, install the workspace, then serve a pretrained policy:
```bash
git clone --recurse-submodules https://github.com/Physical-Intelligence/openpi.git
cd openpi
GIT_LFS_SKIP_SMUDGE=1 uv sync
GIT_LFS_SKIP_SMUDGE=1 uv pip install -e .
uv run scripts/serve_policy.py --env DROID
```

```python
from openpi_client import websocket_client_policy

client = websocket_client_policy.WebsocketClientPolicy(host="localhost", port=8000)
result = client.infer(observation)
actions = result["actions"]  # numpy array of shape (chunk_size, action_dim)
```
## Core concepts
Model family: OpenPI implements three model variants from Physical Intelligence:
| Model | Architecture | Speed | Quality | Typical use |
|---|---|---|---|---|
| pi0 | Flow-matching VLA | Baseline | Highest | Research, complex tasks |
| pi0-fast | Autoregressive action tokens | 2-5x faster | Good | Real-time control |
| pi0.5 | pi0 + improved vision encoder | Baseline | Best | Latest default |
Key design choices:
- Dual backend: JAX (primary, official training) and PyTorch (community-maintained, deployment-friendly)
- Config-driven: all training/serving parameters are defined in `src/openpi/training/config.py`
- Norm stats: every config requires precomputed normalization statistics before training
- WebSocket serving: policy servers expose a WebSocket API for low-latency inference

Training loop invariant: after every config or dataset change, always re-run this cycle:
1. Compute norm stats → 2. Train → 3. Serve checkpoint → 4. Validate inference
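The norm stats themselves are just per-dimension statistics over the dataset's state and action vectors. A toy sketch of z-score normalization, for intuition only (not the openpi implementation or its on-disk format):

```python
import numpy as np

def compute_norm_stats(actions: np.ndarray) -> dict:
    """Per-dimension mean/std over a dataset of action vectors."""
    return {"mean": actions.mean(axis=0), "std": actions.std(axis=0) + 1e-6}

def normalize(action: np.ndarray, stats: dict) -> np.ndarray:
    """Z-score normalize an action using precomputed stats."""
    return (action - stats["mean"]) / stats["std"]

# Fake dataset standing in for real robot trajectories
actions = np.random.default_rng(0).normal(loc=3.0, size=(1000, 7))
stats = compute_norm_stats(actions)
normed = normalize(actions, stats)
print(normed.mean(axis=0).round(2))  # per-dimension means are approximately zero
```

This is why stale stats silently hurt training: a model trained against one set of means and stds sees shifted inputs if the dataset changes underneath them.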
## Compute requirements
| Task | GPU | VRAM | Notes |
|---|---|---|---|
| Serve pi0.5 (inference) | 1x A100/H100 | ~24 GB | Single GPU sufficient |
| Fine-tune pi0.5 (JAX) | 1x A100 80GB | ~60 GB | Use `fsdp_devices` for multi-GPU |
| Fine-tune pi0 (JAX) | 1x A100 80GB | ~40 GB | Smaller model footprint |
| Fine-tune (PyTorch DDP) | 1-8x A100 | ~40 GB/GPU | torchrun launcher |
| Compute norm stats | CPU or 1x GPU | ~8 GB | Fast, can run on login node |
## Workflow 0: Blank-machine setup
Copy this checklist and track progress:
```text
Setup Progress:
- [ ] Step 1: Clone the public openpi repo with submodules
- [ ] Step 2: Install uv and sync the workspace
- [ ] Step 3: Install the editable package
- [ ] Step 4: Verify core imports and serving entrypoint
```

**Step 1: Clone repo**

```bash
git clone --recurse-submodules https://github.com/Physical-Intelligence/openpi.git
cd openpi
```

If you already cloned without submodules:

```bash
git submodule update --init --recursive
```

**Step 2: Sync dependencies**

```bash
GIT_LFS_SKIP_SMUDGE=1 uv sync
```

**Step 3: Install editable package**

```bash
GIT_LFS_SKIP_SMUDGE=1 uv pip install -e .
```

**Step 4: Verify installation**

```bash
uv run python -c "from openpi.training import config as _config; print(_config.get_config('pi05_droid').name)"
uv run scripts/serve_policy.py --help
```
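Beyond the commands above, a quick way to check that the workspace resolves imports without loading heavyweight dependencies is `importlib.util.find_spec`. A minimal sketch (run it inside the synced workspace, e.g. via `uv run python`; the module names are the ones used in the openpi repo):

```python
import importlib.util

def module_available(name: str) -> bool:
    """True if `name` resolves on the current sys.path without importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except ModuleNotFoundError:
        # Raised when a parent package of a dotted name is missing
        return False

for mod in ("openpi.training.config", "openpi_client"):
    print(mod, "OK" if module_available(mod) else "MISSING")
```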
## When to use vs alternatives
Use this skill when:
- Fine-tuning pi0, pi0-fast, or pi0.5 on LeRobot or RLDS datasets
- Serving OpenPI policies for ALOHA, DROID, or LIBERO evaluation
- Converting JAX checkpoints to PyTorch format
- Debugging OpenPI training issues (norm stats, memory, config)

Use `fine-tuning-openvla-oft` instead when:
- Fine-tuning OpenVLA with continuous action heads and LoRA
- Reproducing OpenVLA-OFT paper results on LIBERO or ALOHA

Use `evaluating-cosmos-policy` instead when:
- Evaluating NVIDIA Cosmos Policy on simulation benchmarks
## Workflow 1: JAX fine-tuning on LeRobot data
Copy this checklist and track progress:
```text
JAX Fine-Tuning Progress:
- [ ] Step 1: Select and copy closest training config
- [ ] Step 2: Update dataset mapping and base checkpoint
- [ ] Step 3: Compute normalization statistics
- [ ] Step 4: Launch JAX training
- [ ] Step 5: Serve checkpoint and run inference sanity check
```

**Step 1: Select config**

Copy the closest config from `src/openpi/training/config.py`:

| Config | Use case |
|---|---|
| | pi0.5 LIBERO fine-tuning |
| | pi0 full fine-tuning on LIBERO |
| | pi0-fast on LIBERO |
| | ALOHA custom data |
| | Small custom DROID dataset (LeRobot format) |
| | Full DROID RLDS large-scale training |

**Step 2: Update dataset and transforms**

In `src/openpi/training/config.py`, modify your config:

```python
TrainConfig(
    name="my_custom_config",
    model_type="pi05",
    data=LeRobotDataConfig(
        repo_id="your-org/your-dataset",
        # Adjust transforms to match your data format
    ),
    weight_loader=Pi05WeightLoader(),  # Match model type
)
```

Set `repo_id` for your dataset and ensure `weight_loader` matches the model type (pi0 vs pi0.5).

**Step 3: Compute normalization statistics**

```bash
uv run scripts/compute_norm_stats.py --config-name <config_name>
```

This must run before every training launch when the config, dataset, or transforms change.
**Step 4: Launch JAX training**

```bash
XLA_PYTHON_CLIENT_MEM_FRACTION=0.9 uv run scripts/train.py <config_name> \
  --exp-name=<run_name> \
  --overwrite
```

For full DROID RLDS training, add the `rlds` dependency group:

```bash
uv run --group rlds scripts/compute_norm_stats.py \
  --config-name pi05_full_droid_finetune \
  --max-frames 10000000
XLA_PYTHON_CLIENT_MEM_FRACTION=0.9 uv run --group rlds scripts/train.py \
  pi05_full_droid_finetune \
  --exp-name=<run_name> --overwrite
```

**Step 5: Serve and validate**

```bash
uv run scripts/serve_policy.py policy:checkpoint \
  --policy.config=<config_name> \
  --policy.dir=checkpoints/<config_name>/<run_name>/<step>
```

Verify with a test client:

```python
from openpi_client import websocket_client_policy

client = websocket_client_policy.WebsocketClientPolicy(host="localhost", port=8000)

# Build observation matching your config's expected keys
obs = {"image": img_array, "state": state_array, "prompt": "pick up the cup"}
result = client.infer(obs)
print(f"Action shape: {result['actions'].shape}")  # (chunk_size, action_dim)
```
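Beyond printing the shape, it is worth asserting basic invariants on the returned chunk before trusting a checkpoint. A minimal sketch with a fake chunk standing in for a server response (the shapes are assumptions; substitute your config's values):

```python
import numpy as np

def check_action_chunk(actions: np.ndarray, action_dim: int) -> None:
    """Basic sanity checks on a (chunk_size, action_dim) action chunk."""
    assert actions.ndim == 2, f"expected 2D chunk, got shape {actions.shape}"
    assert actions.shape[1] == action_dim, f"unexpected action_dim {actions.shape[1]}"
    assert np.isfinite(actions).all(), "non-finite values in actions"

# Fake response standing in for result["actions"]
fake_actions = np.zeros((10, 7))
check_action_chunk(fake_actions, action_dim=7)
print("chunk OK:", fake_actions.shape)
```

Non-finite actions after training usually point back at bad norm stats or a diverged run, so this catches problems before the robot does.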
---

## Workflow 2: PyTorch training and checkpoint conversion
Copy this checklist and track progress:
```text
PyTorch Setup Progress:
- [ ] Step 1: Sync dependencies and verify transformers version
- [ ] Step 2: Apply OpenPI transformers patches
- [ ] Step 3: Convert JAX checkpoint to PyTorch format
- [ ] Step 4: Launch PyTorch training or serve converted checkpoint
```

**Step 1: Sync dependencies**

```bash
uv sync
uv pip show transformers
```

**Step 2: Apply required patches**

OpenPI's PyTorch path requires custom modifications to the installed `transformers` package:

```bash
cp -r ./src/openpi/models_pytorch/transformers_replace/* \
  .venv/lib/python3.11/site-packages/transformers/
```

**Step 3: Convert JAX checkpoint**

```bash
uv run examples/convert_jax_model_to_pytorch.py \
  --checkpoint_dir <jax_checkpoint_dir> \
  --config_name <config_name> \
  --output_path <pytorch_checkpoint_dir>
```

**Step 4: Train or serve**

Single-GPU training:

```bash
uv run scripts/train_pytorch.py <config_name> --exp_name <run_name>
```

Multi-GPU distributed training:

```bash
uv run torchrun --standalone --nnodes=1 --nproc_per_node=<num_gpus> \
  scripts/train_pytorch.py <config_name> --exp_name <run_name>
```

Programmatic inference with a converted checkpoint:

```python
from openpi.training import config as _config
from openpi.policies import policy_config

config = _config.get_config("pi05_droid")
policy = policy_config.create_trained_policy(config, "<pytorch_checkpoint_dir>")
result = policy.infer(example)
actions = result["actions"]  # numpy array
```

Checkpoints follow the convention: `checkpoints/<config_name>/<exp_name>/<step>/`.
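Given that convention, picking the most recent step directory for serving can be automated. A small helper, assuming numeric step directory names (this helper is illustrative, not part of openpi):

```python
from pathlib import Path

def latest_checkpoint(run_dir: str) -> Path:
    """Return the highest-numbered step directory under checkpoints/<config>/<exp>/."""
    steps = [p for p in Path(run_dir).iterdir() if p.is_dir() and p.name.isdigit()]
    if not steps:
        raise FileNotFoundError(f"no step directories found in {run_dir}")
    return max(steps, key=lambda p: int(p.name))

# Usage (hypothetical run): latest_checkpoint("checkpoints/pi05_libero/my_run")
```

Sorting numerically rather than lexically matters here: as strings, `"9000"` sorts after `"20000"`.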
## Workflow 3: Policy inference serving
Copy this checklist and track progress:
```text
Inference Server Progress:
- [ ] Step 1: Choose target environment and checkpoint
- [ ] Step 2: Start policy server
- [ ] Step 3: Confirm server is reachable
- [ ] Step 4: Integrate client into robot or simulation code
```

**Step 1: Choose environment**

Default environment presets:

| Environment | Config | Default checkpoint |
|---|---|---|
| | | |
| | | |
| | | |
| | | |

**Step 2: Start server**

Default mode (uses the preset checkpoint):

```bash
uv run scripts/serve_policy.py --env ALOHA
```

Explicit checkpoint mode (custom or local model):

```bash
uv run scripts/serve_policy.py policy:checkpoint \
  --policy.config=pi05_libero \
  --policy.dir=checkpoints/pi05_libero/my_run/20000
```

Add `--default_prompt "task description"` when runtime observations omit a prompt.

**Step 3: Verify connectivity**

```bash
uv run examples/simple_client/main.py --env DROID
```

**Step 4: Embed remote client in robot code**

Install the lightweight client in your robot environment:

```bash
pip install "openpi-client @ git+https://github.com/Physical-Intelligence/openpi.git#subdirectory=packages/openpi-client"
```

Full integration example:

```python
from openpi_client import websocket_client_policy
import numpy as np

# Connect to remote policy server
client = websocket_client_policy.WebsocketClientPolicy(
    host="gpu-server.local", port=8000
)

# Build observation (keys must match policy transforms)
observation = {
    "image": np.random.rand(224, 224, 3),  # RGB image
    "state": np.zeros(7),  # Joint positions
    "prompt": "pick up the red block",
}

# Get actions
result = client.infer(observation)
actions = result["actions"]  # shape: (action_chunk_size, action_dim)

# Execute first action on robot
robot.step(actions[0])
```
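On the robot side, a common pattern is to execute only a prefix of each chunk and then re-query the server (a receding horizon), trading latency against reactivity. A minimal sketch with a stubbed client and robot; `StubClient`, `control_loop`, and the chunk shape are hypothetical stand-ins, not openpi APIs:

```python
import numpy as np

class StubClient:
    """Stands in for WebsocketClientPolicy; returns a fixed-size action chunk."""
    def infer(self, observation):
        return {"actions": np.zeros((10, 7))}  # (chunk_size, action_dim)

def control_loop(client, get_observation, execute, steps=30, execute_horizon=5):
    """Execute the first `execute_horizon` actions of each chunk, then re-query."""
    executed = 0
    while executed < steps:
        chunk = client.infer(get_observation())["actions"]
        for action in chunk[:execute_horizon]:
            execute(action)
            executed += 1
            if executed >= steps:
                break

log = []
control_loop(StubClient(), lambda: {}, log.append)
print(len(log))  # 30
```

A smaller `execute_horizon` reacts faster to new observations but issues more inference calls per second, so tune it against your server's round-trip latency.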
---

## Common issues
**Issue: Missing norm stats error**
Fix: run `scripts/compute_norm_stats.py --config-name <config_name>` before training.

**Issue: Out of memory during JAX training**
Fix: set `XLA_PYTHON_CLIENT_MEM_FRACTION=0.9`, lower the batch size, or configure `fsdp_devices`:

```python
# In config: use model-parallel sharding
TrainConfig(
    ...
    fsdp_devices=4,  # Shard across 4 GPUs
)
```

**Issue: OOM while loading PyTorch checkpoints**
Fix: `export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`

**Issue: Config not found**
Fix: ensure the config name exists in `src/openpi/training/config.py` (exact match from the `_CONFIGS` dict).

**Issue: PyTorch training diverges after library changes**
Fix: reapply the transformers patch. Run `uv cache clean transformers` to reset, then reapply.

**Issue: `serve_policy.py` crashes with `ModuleNotFoundError`**
Fix: resync the public workspace first:

```bash
GIT_LFS_SKIP_SMUDGE=1 uv sync
GIT_LFS_SKIP_SMUDGE=1 uv pip install -e .
```

If the missing module is simulator-related, install the extra runtime dependencies called for by that example:

```bash
uv pip install pytest robosuite==1.4.0 gym bddl easydict matplotlib
```

**Issue: `uv sync` fails with a `rerun-sdk` wheel mismatch**
Fix:

```bash
uv sync --no-dev
```

or

```bash
uv sync --no-dev --no-install-package rerun-sdk
```

**Issue: Checkpoint download times out**
Fix: install `gsutil` and prefetch manually:

```bash
pip install gsutil
gsutil -m cp -r gs://openpi-assets/checkpoints/pi05_libero /local/cache/
```

Remove stale `.lock` files if a previous download was interrupted.
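Clearing leftover lock files by hand is error-prone; a small helper can sweep them recursively. A sketch (the cache path is hypothetical; point it at wherever your checkpoints download):

```python
from pathlib import Path

def clear_stale_locks(cache_dir: str) -> int:
    """Delete leftover .lock files from an interrupted checkpoint download."""
    removed = 0
    for lock in Path(cache_dir).rglob("*.lock"):
        lock.unlink()
        removed += 1
    return removed

# Usage (path hypothetical):
# clear_stale_locks("/local/cache/openpi-assets/checkpoints")
```

Only run this when no download is in progress, since an active downloader may legitimately hold a lock.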
**Issue: Policy server exits with code 137**
Fix: OOM kill. Set JAX memory variables:

```bash
export XLA_PYTHON_CLIENT_PREALLOCATE=false
export XLA_PYTHON_CLIENT_ALLOCATOR=platform
```
## For HPC/cluster users
On Slurm-managed clusters, wrap commands with resource allocation:
```bash
srun --partition=gpu --gpus-per-node=1 --mem=64G --cpus-per-task=8 --pty bash
```

Route caches to scratch to avoid filling `/home`:

```bash
export HF_HOME=/scratch/$USER/.cache/huggingface
export XDG_CACHE_HOME=/scratch/$USER/.cache
export PIP_CACHE_DIR=/scratch/$USER/.cache/pip
export UV_CACHE_DIR=/scratch/$USER/.cache/uv
```

Avoid stacking cluster Python modules when using uv-managed environments. Typically `module load cuda` is sufficient.
## Advanced topics
- Config recipes and baselines: see `references/config-recipes.md`
- Training debugging guide: see `references/training-debugging.md`
- Checkpoint and environment mapping: see `references/checkpoints-and-env-map.md`
- Remote client integration: see `references/remote-client-pattern.md`
- PyTorch precision and patching gotchas: see `references/pytorch-gotchas.md`
## Resources
- OpenPI repository: https://github.com/Physical-Intelligence/openpi
- OpenPI client package: https://github.com/Physical-Intelligence/openpi/tree/main/packages/openpi-client
- pi0 paper: https://www.physicalintelligence.company/blog/pi0
- LeRobot dataset format: https://huggingface.co/docs/lerobot