fine-tuning-serving-openpi


# OpenPI Fine-Tuning and Serving

End-to-end workflows for fine-tuning and serving Physical Intelligence's OpenPI models (pi0, pi0-fast, pi0.5) on robot manipulation tasks from the public `openpi` repository. Covers blank-machine setup, JAX training, PyTorch training, checkpoint conversion, and policy inference serving.

## Quick start

Clone the public repo, install the workspace, then serve a pretrained policy:

```bash
git clone --recurse-submodules https://github.com/Physical-Intelligence/openpi.git
cd openpi
GIT_LFS_SKIP_SMUDGE=1 uv sync
GIT_LFS_SKIP_SMUDGE=1 uv pip install -e .
uv run scripts/serve_policy.py --env DROID
```

```python
from openpi_client import websocket_client_policy

client = websocket_client_policy.WebsocketClientPolicy(host="localhost", port=8000)
result = client.infer(observation)
actions = result["actions"]  # numpy array of shape (chunk_size, action_dim)
```
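The returned chunk is meant to be consumed receding-horizon style: execute the first few actions, then re-infer from a fresh observation. A minimal sketch of that loop, with a `DummyPolicy` stand-in for the real WebSocket client (the class and the chunk sizes here are illustrative assumptions, not openpi APIs):

```python
import numpy as np

class DummyPolicy:
    """Stand-in for WebsocketClientPolicy: returns a fixed-size action chunk."""
    def __init__(self, chunk_size=10, action_dim=7):
        self.chunk_size = chunk_size
        self.action_dim = action_dim

    def infer(self, observation):
        # A real server predicts from the observation; here we return zeros.
        return {"actions": np.zeros((self.chunk_size, self.action_dim))}

def run_episode(policy, get_obs, execute, steps=30, execute_horizon=5):
    """Receding-horizon loop: execute the first `execute_horizon` actions
    of each chunk, then re-infer from a fresh observation."""
    executed = 0
    while executed < steps:
        chunk = policy.infer(get_obs())["actions"]
        for action in chunk[:execute_horizon]:
            execute(action)
            executed += 1
            if executed >= steps:
                break
    return executed

policy = DummyPolicy()
log = []
n = run_episode(policy, get_obs=lambda: {"state": np.zeros(7)},
                execute=log.append, steps=30, execute_horizon=5)
```

Shorter execute horizons react faster to disturbances at the cost of more inference calls.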

## Core concepts

**Model family**: OpenPI implements three model variants from Physical Intelligence:

| Model | Architecture | Speed | Quality | Typical use |
| --- | --- | --- | --- | --- |
| pi0 | Flow-matching VLA | Baseline | Highest | Research, complex tasks |
| pi0-fast | Autoregressive action tokens | 2-5x faster | Good | Real-time control |
| pi0.5 | pi0 + improved vision encoder | Baseline | Best | Latest default |

**Key design choices**:

- Dual backend: JAX (primary, official training) and PyTorch (community, deployment-friendly)
- Config-driven: all training/serving parameters are defined in `src/openpi/training/config.py`
- Norm stats: every config requires precomputed normalization statistics before training
- WebSocket serving: policy servers expose a WebSocket API for low-latency inference

**Training loop invariant**: after every config or dataset change, always re-run this cycle:

1. Compute norm stats → 2. Train → 3. Serve checkpoint → 4. Validate inference
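The invariant above can be enforced mechanically. A hypothetical guard, assuming norm stats land in an `<assets_dir>/<config_name>/norm_stats.json` layout (the exact filename and layout are assumptions, not a documented openpi contract):

```python
import tempfile
from pathlib import Path

def assert_norm_stats_present(assets_dir: str, config_name: str) -> Path:
    """Refuse to start training when normalization statistics are missing."""
    stats_path = Path(assets_dir) / config_name / "norm_stats.json"
    if not stats_path.exists():
        raise FileNotFoundError(
            f"Missing {stats_path}; run "
            f"`uv run scripts/compute_norm_stats.py --config-name {config_name}` first."
        )
    return stats_path

# Demonstrate both outcomes in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    config_dir = Path(tmp) / "my_custom_config"
    config_dir.mkdir()
    (config_dir / "norm_stats.json").write_text("{}")
    found = assert_norm_stats_present(tmp, "my_custom_config")
    try:
        assert_norm_stats_present(tmp, "other_config")
        missing_raised = False
    except FileNotFoundError:
        missing_raised = True
```

Calling a guard like this at the top of a training launcher turns a silent stale-stats bug into an immediate, actionable error.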

## Compute requirements

| Task | GPU | VRAM | Notes |
| --- | --- | --- | --- |
| Serve pi0.5 (inference) | 1x A100/H100 | ~24 GB | Single GPU sufficient |
| Fine-tune pi0.5 (JAX) | 1x A100 80GB | ~60 GB | Use `fsdp_devices` for multi-GPU |
| Fine-tune pi0 (JAX) | 1x A100 80GB | ~40 GB | Smaller model footprint |
| Fine-tune (PyTorch DDP) | 1-8x A100 | ~40 GB/GPU | `torchrun` launcher |
| Compute norm stats | CPU or 1x GPU | ~8 GB | Fast, can run on login node |

## Workflow 0: Blank-machine setup

Copy this checklist and track progress:

```text
Setup Progress:
- [ ] Step 1: Clone the public openpi repo with submodules
- [ ] Step 2: Install uv and sync the workspace
- [ ] Step 3: Install the editable package
- [ ] Step 4: Verify core imports and serving entrypoint
```

**Step 1: Clone repo**

```bash
git clone --recurse-submodules https://github.com/Physical-Intelligence/openpi.git
cd openpi
```

If you already cloned without submodules:

```bash
git submodule update --init --recursive
```

**Step 2: Sync dependencies**

```bash
GIT_LFS_SKIP_SMUDGE=1 uv sync
```

**Step 3: Install editable package**

```bash
GIT_LFS_SKIP_SMUDGE=1 uv pip install -e .
```

**Step 4: Verify installation**

```bash
uv run python -c "from openpi.training import config as _config; print(_config.get_config('pi05_droid').name)"
uv run scripts/serve_policy.py --help
```

## When to use vs alternatives

Use this skill when:

- Fine-tuning pi0, pi0-fast, or pi0.5 on LeRobot or RLDS datasets
- Serving OpenPI policies for ALOHA, DROID, or LIBERO evaluation
- Converting JAX checkpoints to PyTorch format
- Debugging OpenPI training issues (norm stats, memory, config)

Use `fine-tuning-openvla-oft` instead when:

- Fine-tuning OpenVLA with continuous action heads and LoRA
- Reproducing OpenVLA-OFT paper results on LIBERO or ALOHA

Use `evaluating-cosmos-policy` instead when:

- Evaluating NVIDIA Cosmos Policy on simulation benchmarks

## Workflow 1: JAX fine-tuning on LeRobot data

Copy this checklist and track progress:

```text
JAX Fine-Tuning Progress:
- [ ] Step 1: Select and copy closest training config
- [ ] Step 2: Update dataset mapping and base checkpoint
- [ ] Step 3: Compute normalization statistics
- [ ] Step 4: Launch JAX training
- [ ] Step 5: Serve checkpoint and run inference sanity check
```

**Step 1: Select config**

Copy the closest config from `src/openpi/training/config.py`:

| Config | Use case |
| --- | --- |
| `pi05_libero` | pi0.5 LIBERO fine-tuning |
| `pi0_libero` | pi0 full fine-tuning on LIBERO |
| `pi0_fast_libero` | pi0-fast on LIBERO |
| `pi0_aloha_pen_uncap` | ALOHA custom data |
| `pi05_droid_finetune` | Small custom DROID dataset (LeRobot format) |
| `pi05_full_droid_finetune` | Full DROID RLDS large-scale training |

**Step 2: Update dataset and transforms**

In `src/openpi/training/config.py`, modify your config:

```python
TrainConfig(
    name="my_custom_config",
    model_type="pi05",
    data=LeRobotDataConfig(
        repo_id="your-org/your-dataset",
        # Adjust transforms to match your data format
    ),
    weight_loader=Pi05WeightLoader(),  # Match model type
)
```

Set `repo_id` for your dataset and ensure `weight_loader` matches the model type (pi0 vs pi0.5).
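Conceptually, these configs live in a name-keyed registry and `get_config` is an exact-name lookup. A simplified stand-in (the field set beyond `name` and the repo ids here are made up for illustration; this is not openpi's actual code):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    """Minimal stand-in for openpi's TrainConfig (the real one has many more fields)."""
    name: str
    model_type: str
    repo_id: str = ""

# Name-keyed registry, mirroring the exact-match lookup semantics.
_CONFIGS = {c.name: c for c in [
    TrainConfig("pi05_libero", "pi05", "your-org/libero-demos"),
    TrainConfig("pi0_libero", "pi0", "your-org/libero-demos"),
]}

def get_config(name: str) -> TrainConfig:
    """Exact-name lookup; unknown names fail loudly with the valid choices."""
    if name not in _CONFIGS:
        raise ValueError(f"Config '{name}' not found; choose from {sorted(_CONFIGS)}")
    return _CONFIGS[name]

cfg = get_config("pi05_libero")
try:
    get_config("pi5_libero")  # typo: exact match required
    unknown_raised = False
except ValueError:
    unknown_raised = True
```

This is why the "Config not found" error under Common issues is always an exact-spelling problem.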

**Step 3: Compute normalization statistics**

```bash
uv run scripts/compute_norm_stats.py --config-name <config_name>
```

This must run before every training launch when config, dataset, or transforms change.
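Conceptually, the script computes per-key statistics over every timestep and saves them for the data transforms to apply. A simplified numpy sketch on synthetic episodes (the real script streams the dataset and records more than mean/std):

```python
import numpy as np

def compute_norm_stats(episodes, keys=("state", "actions")):
    """Per-key mean/std over all timesteps (simplified; openpi's script
    also handles dataset transforms and additional statistics)."""
    stats = {}
    for key in keys:
        data = np.concatenate([ep[key] for ep in episodes], axis=0)
        stats[key] = {"mean": data.mean(axis=0), "std": data.std(axis=0) + 1e-6}
    return stats

def normalize(x, s):
    return (x - s["mean"]) / s["std"]

# Synthetic episodes: 4 episodes x 50 timesteps of 7-dim state/actions.
rng = np.random.default_rng(0)
episodes = [{"state": rng.normal(2.0, 3.0, size=(50, 7)),
             "actions": rng.normal(-1.0, 0.5, size=(50, 7))} for _ in range(4)]
stats = compute_norm_stats(episodes)
normed = normalize(np.concatenate([ep["state"] for ep in episodes]), stats["state"])
```

Stale stats silently shift the normalized distribution the model sees, which is why recomputation after any data change is non-negotiable.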
**Step 4: Launch JAX training**

```bash
XLA_PYTHON_CLIENT_MEM_FRACTION=0.9 uv run scripts/train.py <config_name> \
  --exp-name=<run_name> \
  --overwrite
```

For full DROID RLDS training, add the `rlds` dependency group:

```bash
uv run --group rlds scripts/compute_norm_stats.py \
  --config-name pi05_full_droid_finetune \
  --max-frames 10000000

XLA_PYTHON_CLIENT_MEM_FRACTION=0.9 uv run --group rlds scripts/train.py \
  pi05_full_droid_finetune \
  --exp-name=<run_name> --overwrite
```

**Step 5: Serve and validate**

```bash
uv run scripts/serve_policy.py policy:checkpoint \
  --policy.config=<config_name> \
  --policy.dir=checkpoints/<config_name>/<run_name>/<step>
```
Verify with a test client:

```python
from openpi_client import websocket_client_policy

client = websocket_client_policy.WebsocketClientPolicy(host="localhost", port=8000)

# Build observation matching your config's expected keys
obs = {"image": img_array, "state": state_array, "prompt": "pick up the cup"}
result = client.infer(obs)
print(f"Action shape: {result['actions'].shape}")  # (chunk_size, action_dim)
```

---

## Workflow 2: PyTorch training and checkpoint conversion

Copy this checklist and track progress:

```text
PyTorch Setup Progress:
- [ ] Step 1: Sync dependencies and verify transformers version
- [ ] Step 2: Apply OpenPI transformers patches
- [ ] Step 3: Convert JAX checkpoint to PyTorch format
- [ ] Step 4: Launch PyTorch training or serve converted checkpoint
```

**Step 1: Sync dependencies**

```bash
uv sync
uv pip show transformers
```

**Step 2: Apply required patches**

OpenPI's PyTorch path requires custom modifications to the installed `transformers` package:

```bash
cp -r ./src/openpi/models_pytorch/transformers_replace/* \
  .venv/lib/python3.11/site-packages/transformers/
```

**Step 3: Convert JAX checkpoint**

```bash
uv run examples/convert_jax_model_to_pytorch.py \
  --checkpoint_dir <jax_checkpoint_dir> \
  --config_name <config_name> \
  --output_path <pytorch_checkpoint_dir>
```

**Step 4: Train or serve**

Single-GPU training:

```bash
uv run scripts/train_pytorch.py <config_name> --exp_name <run_name>
```

Multi-GPU distributed training:

```bash
uv run torchrun --standalone --nnodes=1 --nproc_per_node=<num_gpus> \
  scripts/train_pytorch.py <config_name> --exp_name <run_name>
```

Programmatic inference with a converted checkpoint:

```python
from openpi.training import config as _config
from openpi.policies import policy_config

config = _config.get_config("pi05_droid")
policy = policy_config.create_trained_policy(config, "<pytorch_checkpoint_dir>")
result = policy.infer(example)
actions = result["actions"]  # numpy array
```

Checkpoints follow the convention `checkpoints/<config_name>/<exp_name>/<step>/`.
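Given that convention, a small helper can resolve the newest checkpoint for serving. This resolver is hypothetical, not part of openpi:

```python
import tempfile
from pathlib import Path

def latest_checkpoint(root, config_name, exp_name):
    """Return the highest-numbered step directory under
    root/config_name/exp_name, following the checkpoint convention."""
    run_dir = Path(root) / config_name / exp_name
    steps = [p for p in run_dir.iterdir() if p.is_dir() and p.name.isdigit()]
    if not steps:
        raise FileNotFoundError(f"No step directories in {run_dir}")
    return max(steps, key=lambda p: int(p.name))

# Demonstrate on a throwaway tree mimicking the convention.
with tempfile.TemporaryDirectory() as tmp:
    for step in ("1000", "5000", "20000"):
        (Path(tmp) / "pi05_libero" / "my_run" / step).mkdir(parents=True)
    latest = latest_checkpoint(tmp, "pi05_libero", "my_run")
    latest_step = latest.name
```

Note the numeric comparison: a lexicographic `max` would wrongly rank `5000` above `20000`.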

## Workflow 3: Policy inference serving

Copy this checklist and track progress:

```text
Inference Server Progress:
- [ ] Step 1: Choose target environment and checkpoint
- [ ] Step 2: Start policy server
- [ ] Step 3: Confirm server is reachable
- [ ] Step 4: Integrate client into robot or simulation code
```

**Step 1: Choose environment**

Default environment presets:

| Environment | Config | Default checkpoint |
| --- | --- | --- |
| ALOHA | `pi05_aloha` | `gs://openpi-assets/checkpoints/pi05_base` |
| ALOHA_SIM | `pi0_aloha_sim` | `gs://openpi-assets/checkpoints/pi0_aloha_sim` |
| DROID | `pi05_droid` | `gs://openpi-assets/checkpoints/pi05_droid` |
| LIBERO | `pi05_libero` | `gs://openpi-assets/checkpoints/pi05_libero` |

**Step 2: Start server**

Default mode (uses preset checkpoint):

```bash
uv run scripts/serve_policy.py --env ALOHA
```

Explicit checkpoint mode (custom or local model):

```bash
uv run scripts/serve_policy.py policy:checkpoint \
  --policy.config=pi05_libero \
  --policy.dir=checkpoints/pi05_libero/my_run/20000
```

Add `--default_prompt "task description"` when runtime observations omit a prompt.

**Step 3: Verify connectivity**

```bash
uv run examples/simple_client/main.py --env DROID
```

**Step 4: Embed remote client in robot code**

Install the lightweight client in your robot environment:

```bash
pip install "openpi-client @ git+https://github.com/Physical-Intelligence/openpi.git#subdirectory=packages/openpi-client"
```
Full integration example:

```python
from openpi_client import websocket_client_policy
import numpy as np

# Connect to remote policy server
client = websocket_client_policy.WebsocketClientPolicy(
    host="gpu-server.local", port=8000
)

# Build observation (keys must match policy transforms)
observation = {
    "image": np.random.rand(224, 224, 3),  # RGB image
    "state": np.zeros(7),                  # Joint positions
    "prompt": "pick up the red block",
}

# Get actions
result = client.infer(observation)
actions = result["actions"]  # shape: (action_chunk_size, action_dim)

# Execute first action on robot
robot.step(actions[0])
```

---
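The `--default_prompt` behavior can also be approximated on the client side. A hypothetical wrapper that injects a prompt when the observation omits one (`DummyPolicy` stands in for the real WebSocket client; neither class is an openpi API):

```python
class DummyPolicy:
    """Stand-in for WebsocketClientPolicy; echoes back the observation it saw."""
    def infer(self, observation):
        return {"actions": [[0.0]], "observation": observation}

class DefaultPromptPolicy:
    """Wraps any policy client and injects a default prompt when missing."""
    def __init__(self, policy, default_prompt):
        self.policy = policy
        self.default_prompt = default_prompt

    def infer(self, observation):
        obs = dict(observation)  # don't mutate the caller's dict
        obs.setdefault("prompt", self.default_prompt)
        return self.policy.infer(obs)

client = DefaultPromptPolicy(DummyPolicy(), "pick up the cup")
with_prompt = client.infer({"state": [0.0], "prompt": "open the drawer"})
without_prompt = client.infer({"state": [0.0]})
```

An explicit prompt always wins; the default only fills the gap, matching the server-side flag's intent.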

## Common issues

**Issue: Missing norm stats error**

Fix: run `scripts/compute_norm_stats.py --config-name <config_name>` before training.

**Issue: Out of memory during JAX training**

Fix: set `XLA_PYTHON_CLIENT_MEM_FRACTION=0.9`, lower the batch size, or configure `fsdp_devices`:

```python
# In config: use model-parallel sharding
TrainConfig(
    ...
    fsdp_devices=4,  # Shard across 4 GPUs
)
```
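`fsdp_devices` shards work across GPUs, so the global batch size must divide evenly by it. An illustrative numpy sketch of the per-device split (JAX does this internally via its sharding APIs; this helper is not openpi code):

```python
import numpy as np

def shard_batch(batch, fsdp_devices):
    """Split the leading batch axis across devices; fail loudly on a
    batch size that does not divide evenly (a common pitfall when
    shrinking the batch to fix OOM)."""
    if batch.shape[0] % fsdp_devices != 0:
        raise ValueError(
            f"batch size {batch.shape[0]} not divisible by fsdp_devices={fsdp_devices}")
    per_device = batch.shape[0] // fsdp_devices
    return batch.reshape(fsdp_devices, per_device, *batch.shape[1:])

global_batch = np.zeros((32, 7))
sharded = shard_batch(global_batch, 4)  # one leading axis per device
try:
    shard_batch(np.zeros((30, 7)), 4)
    bad_split_raised = False
except ValueError:
    bad_split_raised = True
```

When lowering the batch size to fix OOM, keep it a multiple of `fsdp_devices`.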

**Issue: OOM while loading PyTorch checkpoints**

Fix: `export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`

**Issue: Config not found**

Fix: ensure config name exists in `src/openpi/training/config.py` (exact match from `_CONFIGS` dict).

**Issue: PyTorch training diverges after library changes**

Fix: reapply the transformer patch. Run `uv cache clean transformers` to reset, then reapply.

**Issue: `serve_policy.py` crashes with `ModuleNotFoundError`**

Fix: resync the public workspace first:

```bash
GIT_LFS_SKIP_SMUDGE=1 uv sync
GIT_LFS_SKIP_SMUDGE=1 uv pip install -e .
```

If the missing module is simulator-related, install the extra runtime dependencies called for by that example:

```bash
uv pip install pytest robosuite==1.4.0 gym bddl easydict matplotlib
```

**Issue: `uv sync` fails with `rerun-sdk` wheel mismatch**

Fix:

```bash
uv sync --no-dev
# or
uv sync --no-dev --no-install-package rerun-sdk
```

**Issue: Checkpoint download times out**

Fix: install `gsutil` and prefetch manually:

```bash
pip install gsutil
gsutil -m cp -r gs://openpi-assets/checkpoints/pi05_libero /local/cache/
```

Remove stale `.lock` files if a previous download was interrupted.

**Issue: Policy server exits with code `137`**

Fix: OOM kill. Set the JAX memory variables:

```bash
export XLA_PYTHON_CLIENT_PREALLOCATE=false
export XLA_PYTHON_CLIENT_ALLOCATOR=platform
```

## For HPC/cluster users

On Slurm-managed clusters, wrap commands with a resource allocation:

```bash
srun --partition=gpu --gpus-per-node=1 --mem=64G --cpus-per-task=8 --pty bash
```

Route caches to scratch storage to avoid filling `/home`:

```bash
export HF_HOME=/scratch/$USER/.cache/huggingface
export XDG_CACHE_HOME=/scratch/$USER/.cache
export PIP_CACHE_DIR=/scratch/$USER/.cache/pip
export UV_CACHE_DIR=/scratch/$USER/.cache/uv
```

Avoid stacking cluster Python modules on top of uv-managed environments; typically `module load cuda` is sufficient.

## Advanced topics

- Config recipes and baselines: see `references/config-recipes.md`
- Training debugging guide: see `references/training-debugging.md`
- Checkpoint and environment mapping: see `references/checkpoints-and-env-map.md`
- Remote client integration: see `references/remote-client-pattern.md`
- PyTorch precision and patching gotchas: see `references/pytorch-gotchas.md`

## Resources