h100-sglang-diffusion
H100 — SGLang Diffusion
Overview
Use this skill to do SGLang diffusion development on the H100 box through `h100_sglang`.
The default container is `sglang_bbuf` and the repo lives at `/data/bbuf/repos/sglang`.

Prefer this skill when:
- Validating diffusion Triton / CUDA JIT kernels
- Running diffusion model smoke tests (`DiffGenerator`, flux, etc.)
- Comparing eager vs `torch.compile` diffusion performance
- Verifying `python[diffusion]` editable install changes

This environment is already prepared:
- `sglang_bbuf` is running on `lmsysorg/sglang:dev`
- the repo is cloned at `/data/bbuf/repos/sglang`
- editable installs for `python[all]` and `python[diffusion]` are already done
- `/data/.cache` is mounted to `/root/.cache`
- Infiniband paths are mounted for RDMA-aware workflows: `/sys/class/infiniband`, `/dev/infiniband`, and `/usr/sbin/show_gids`
Quick Start
- Check the host, container, and GPU state.

  ```bash
  ssh h100_sglang 'hostname && whoami'
  ssh h100_sglang 'docker ps --format "table {{.Names}}\t{{.Status}}" | sed -n "1,20p"'
  ssh h100_sglang 'nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total --format=csv,noheader,nounits'
  ```

- Enter the container and confirm HF token visibility.

  ```bash
  ssh h100_sglang 'docker exec -it sglang_bbuf /bin/zsh'
  cd /data/bbuf/repos/sglang
  echo ${HF_TOKEN:+set}
  ```

  If `HF_TOKEN` is missing, export it before any Hub-backed diffusion run:

  ```bash
  export HF_TOKEN=<your-hf-token>
  export HUGGINGFACE_HUB_TOKEN="$HF_TOKEN"
  ```

  For non-interactive `docker exec ... bash -lc "<cmd>"` runs, export both variables inline instead of relying on shell startup:

  ```bash
  ssh h100_sglang 'docker exec sglang_bbuf env HF_TOKEN=<your-hf-token> HUGGINGFACE_HUB_TOKEN=<your-hf-token> zsh -lc "..."'
  ```

- Pick a free GPU.

  Use a GPU with 0% utilization and only a few MiB allocated.
  Always set `CUDA_VISIBLE_DEVICES=<gpu_id>` for diffusion validation commands.

- If the container is not running, start it.

  ```bash
  ssh h100_sglang 'docker start sglang_bbuf'
  ```
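The free-GPU criterion above can be scripted instead of eyeballed. A minimal sketch, assuming the exact CSV field order of the `nvidia-smi` query in step 1; `pick_free_gpu` is a hypothetical helper, not part of the repo:

```python
def pick_free_gpu(nvidia_smi_csv: str, max_util: int = 0, max_mem_mib: int = 100):
    """Return the index of the first GPU whose utilization and used memory
    fall under the thresholds, or None if every GPU is busy.

    Expects the output of:
    nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total \
        --format=csv,noheader,nounits
    """
    for line in nvidia_smi_csv.strip().splitlines():
        index, _name, util, mem_used, _mem_total = [f.strip() for f in line.split(",")]
        if int(util) <= max_util and int(mem_used) <= max_mem_mib:
            return int(index)
    return None
```

Feed it the captured `nvidia-smi` output, then export `CUDA_VISIBLE_DEVICES` with the returned index.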
Safe Remote Workflow
- Inspect the repo state before editing.

  ```bash
  ssh h100_sglang 'docker exec sglang_bbuf zsh -lc "cd /data/bbuf/repos/sglang && git branch --show-current && git status --short"'
  ```

- Fast-forward to the latest clean `main` before creating a validation worktree.

  ```bash
  ssh h100_sglang 'docker exec sglang_bbuf zsh -lc "cd /data/bbuf/repos/sglang && git fetch origin && git checkout main && git pull --ff-only origin main"'
  ```

- Never write directly into `/data/bbuf/repos/sglang` when it is dirty.

- Use one of these isolation strategies.

  Create a detached worktree for remote-only experiments:

  ```bash
  ssh h100_sglang 'docker exec sglang_bbuf zsh -lc "cd /data/bbuf/repos/sglang && git worktree add --detach /tmp/sglang_validate_h100 HEAD"'
  ```

  Stream the local working tree into the container (validates exactly what is local right now):

  ```bash
  COPYFILE_DISABLE=1 tar --exclude=.git -cf - . | \
    ssh h100_sglang 'docker exec -i sglang_bbuf sh -lc "rm -rf /tmp/sglang_local_validate && mkdir -p /tmp/sglang_local_validate && tar -xf - -C /tmp/sglang_local_validate"'
  ssh h100_sglang 'docker exec sglang_bbuf zsh -lc "find /tmp/sglang_local_validate -name '\''._*'\'' -delete"'
  ```

  For patch-oriented validation:
  - fast-forward remote `main`
  - create a detached worktree from that commit
  - stream or `git apply` only the focused local diff into the worktree

  This keeps `/data/bbuf/repos/sglang` clean while still validating the exact local delta.
Diffusion Validation Workflow
1. Syntax / Import Check
Always start here before running any GPU kernel or model test.

```bash
ssh h100_sglang 'docker exec sglang_bbuf zsh -lc "cd /tmp/sglang_local_validate && python -m compileall python/sglang/jit_kernel/diffusion/triton python/sglang/multimodal_gen/runtime/layers"'
```

For broader coverage:

```bash
ssh h100_sglang 'docker exec sglang_bbuf zsh -lc "cd /tmp/sglang_local_validate && python -m compileall python/sglang"'
```
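What this gate does and does not catch can be seen in miniature with `py_compile`, which `compileall` calls per file. A torch-free illustration, not tied to the sglang tree:

```python
import py_compile
import tempfile

def compiles_cleanly(source: str) -> bool:
    """Byte-compile a snippet the way `python -m compileall` does for each
    file: syntax errors fail, but imports are NOT executed, so a broken
    import path still needs the kernel smoke in step 2 to surface."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(source)
        path = f.name
    try:
        py_compile.compile(path, doraise=True)
        return True
    except py_compile.PyCompileError:
        return False
```

Because only parsing happens, a green `compileall` run is necessary but not sufficient; it simply keeps obvious syntax breakage from wasting a GPU.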
2. JIT Kernel Smoke
Run a targeted smoke script covering the changed primitives before any model-level test.
Cover at least these when relevant:
- `rms_norm_fn`
- `RMSNorm` under `torch.compile`
- `norm_infer`
- `apply_rotary_embedding`

Pipe the smoke script through `docker exec -i`:

```bash
ssh h100_sglang 'docker exec -i sglang_bbuf env CUDA_VISIBLE_DEVICES=0 PYTHONPATH=python python' < /path/to/local_smoke.py
```
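A smoke script for `rms_norm_fn` typically compares the kernel against a slow reference. A torch-free sketch of that reference math (the real script would run the same check with torch tensors against sglang's kernel, eagerly and under `torch.compile`):

```python
import math

def rms_norm_ref(x, weight, eps=1e-6):
    """Reference RMSNorm over a 1-D row:
    y_i = x_i / sqrt(mean(x^2) + eps) * w_i."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]
```

The smoke script's assertion is then `max |kernel - reference|` below a dtype-appropriate tolerance.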
3. Fused Modulation Regression
Run this after any change to `jit_kernel/diffusion/triton`:

```bash
ssh h100_sglang 'docker exec sglang_bbuf env CUDA_VISIBLE_DEVICES=0 PYTHONPATH=python zsh -lc "cd /tmp/sglang_local_validate && pytest -q python/sglang/jit_kernel/tests/test_qwen_image_modulation.py -q"'
```
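For orientation only: in DiT-style diffusion models, "modulation" usually means an elementwise adaLN scale-and-shift of the normalized hidden states. The exact semantics of the Qwen-Image kernel live in the test file above; the formula below is an assumption about the common pattern, not the kernel's actual signature:

```python
def modulate_ref(x, shift, scale):
    """Assumed adaLN-style modulation: y_i = x_i * (1 + scale_i) + shift_i.
    A fused Triton kernel computes the same thing in one memory pass; the
    regression test pins the kernel output to a reference like this."""
    return [v * (1.0 + sc) + sh for v, sh, sc in zip(x, shift, scale)]
```

This is why the regression belongs right after any Triton change: fusion must change performance, never the numerics.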
4. General Diffusion Tests
Run any affected diffusion test files from the validation tree:

```bash
ssh h100_sglang 'docker exec sglang_bbuf env CUDA_VISIBLE_DEVICES=0 PYTHONPATH=python zsh -lc "cd /tmp/sglang_local_validate && pytest -q path/to/diffusion_test.py -q"'
```
5. Model-Level Smoke (DiffGenerator)
Only after steps 1–4 pass.
Use a real `.py` file with an `if __name__ == "__main__":` guard; `multiprocessing.spawn` will fail if the entry point is stdin or unguarded top-level code.

```bash
# stream the script file to the container
scp /path/to/local_smoke_model.py h100_sglang:/tmp/smoke_model.py
ssh h100_sglang 'docker exec sglang_bbuf env CUDA_VISIBLE_DEVICES=0 HF_TOKEN=<your-hf-token> HUGGINGFACE_HUB_TOKEN=<your-hf-token> PYTHONPATH=/tmp/sglang_local_validate/python zsh -lc "python /tmp/smoke_model.py"'
```

Treat checkpoint, dependency, and environment failures separately from code regressions.
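A minimal shape for `local_smoke_model.py` that satisfies the spawn constraint. The `DiffGenerator` usage is sketched from this section's description and left commented out; treat the import path and argument names as placeholders, not the real API:

```python
def main() -> int:
    # Heavy imports belong inside main() so that spawn workers, which
    # re-import this module, do not re-trigger them at import time.
    # from sglang... import DiffGenerator          # real import path here
    # gen = DiffGenerator(model_path="<model>")    # placeholder arguments
    # gen.generate("a photo of an astronaut riding a horse")
    return 0

# multiprocessing.spawn re-imports this file in every worker process;
# without this guard the top-level code would run again in each worker.
if __name__ == "__main__":
    main()
```

Stdin piping (as in step 2) cannot be used here: spawn needs a real module file on disk to re-import, which is why the script is `scp`'d first.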
6. Server-Level Smoke
Only attempt after model-level smoke passes.

```bash
ssh h100_sglang 'docker exec sglang_bbuf env CUDA_VISIBLE_DEVICES=0 PYTHONPATH=python zsh -lc "cd /tmp/sglang_local_validate && python -m sglang.launch_server --model-path <model> --port 30000 &"'
```
Torch Compile Attribution
When a benchmark compares eager vs `torch.compile`, do not stop at the speedup number.
Capture matching eager and compile traces or perf dumps, then run:

```bash
ssh h100_sglang 'docker exec sglang_bbuf zsh -lc "cd /tmp/sglang_local_validate && python scripts/analyze_diffusion_torch_compile.py"'
```
Cleanup
```bash
ssh h100_sglang 'docker exec sglang_bbuf rm -rf /tmp/sglang_local_validate /tmp/sglang_validate_h100 /tmp/smoke_model.py'
```