dynamo-interconnect-check

Dynamo Interconnect Check

Purpose

Confirm that the transport that disaggregated serving depends on actually works. A deployment can pass an endpoint smoke test while disagg is silently wrong: if NIXL/UCX cannot reach the peer worker over RDMA or NVLink, KV transfer falls back to a slow or broken path. Catch that with read-only checks before trusting a disagg deployment or its benchmark numbers.

This skill is read-only. It never mutates the cluster and never prints secrets.

Prerequisites

Python 3.10+ on the operator machine.
`kubectl exec` access to a worker pod in the target Dynamo deployment.
Read access to the recipe directory (`recipes/<model>/<framework>/<mode>`).
For node-capability checks: tools like `ibstat`, `nvidia-smi`, `lsmod` available in the worker pod image (missing tools are reported as `skipped`, not failures).

When To Use

After `dynamo-recipe-runner` deploys a disagg or multi-node recipe.
Before reporting disagg throughput/latency, so numbers reflect the real transport.
When agg works but disagg is slow, hangs, or returns wrong output and you suspect the fabric rather than the model.

For diagnosing pods that are already crashing or unschedulable, use `dynamo-troubleshoot` first.

Instructions

1. Check Transport Env Vars On The Recipe

```bash
python3 scripts/check_interconnect.py env recipes/<model>/<framework>/<mode>
```

Reports which NIXL/UCX/NCCL transport variables are set and flags disagg-critical ones (e.g. `UCX_TLS`, `UCX_NET_DEVICES`, `NCCL_IB_HCA`) that are absent. Missing here is only a warning, since they may be baked into the image, so confirm with the node and NIXL checks. See `references/interconnect-env-vars.md` for what each variable does.
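
The recipe scan described above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of `scripts/check_interconnect.py`: the function names, the regex-over-text approach, and the assumption that recipes declare variables as Kubernetes-style `- name: VAR` entries in YAML files are all hypothetical.

```python
import re
from pathlib import Path

# Disagg-critical variables named in this section; the full catalog lives in
# references/interconnect-env-vars.md.
CRITICAL_VARS = ("UCX_TLS", "UCX_NET_DEVICES", "NCCL_IB_HCA")

# Assumption: variables appear as Kubernetes-style entries such as `- name: UCX_TLS`.
ENV_NAME_RE = re.compile(
    r"^\s*-?\s*name:\s*[\"']?((?:NIXL|UCX|NCCL)_[A-Z0-9_]+)[\"']?\s*$",
    re.MULTILINE,
)


def scan_recipe_text(text: str) -> dict:
    """Return the transport vars found in the text and the critical ones missing."""
    present = sorted(set(ENV_NAME_RE.findall(text)))
    missing = [name for name in CRITICAL_VARS if name not in present]
    # Missing critical vars are a warning only: they may be baked into the image.
    status = "warn" if missing else "ok"
    return {"status": status, "present": present, "missing": missing}


def scan_recipe_dir(recipe_dir: str) -> dict:
    """Scan every YAML file under a recipe directory (text only, no cluster access)."""
    root = Path(recipe_dir)
    paths = sorted(root.rglob("*.yaml")) + sorted(root.rglob("*.yml"))
    text = "\n".join(path.read_text(encoding="utf-8") for path in paths)
    return scan_recipe_text(text)
```

Because a scan like this only reads recipe text, it cannot see values injected at runtime, which is why a `warn` result needs confirmation from the node and NIXL checks.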

2. Check Node Capabilities

Locally on a GPU node, or inside a running worker pod:

```bash
python3 scripts/check_interconnect.py node \
  --namespace "${NAMESPACE}" --pod <worker-pod>
```

Probes (read-only) for: InfiniBand devices and Active links, GPUDirect RDMA (`nvidia_peermem`), GDRCopy, and NVLink in the GPU topology. Missing tools are reported as `skipped`, not failures.
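
As an illustration of the InfiniBand part of this probe, the sketch below classifies `ibstat` output and reports a missing tool as `skipped`. The function names and the result dictionary shape are hypothetical, and the parsing assumes the usual `CA '<name>'` and `State: Active` lines that `ibstat` prints; the real script may inspect the fabric differently.

```python
import re
import shutil
import subprocess


def parse_ibstat(output: str) -> dict:
    """Classify ibstat text: ok if any port is Active, fail otherwise."""
    devices = re.findall(r"^CA '([^']+)'", output, flags=re.MULTILINE)
    # Matches "State: Active" but not "Physical state: LinkUp".
    states = re.findall(r"^\s*State:\s*(\S+)", output, flags=re.MULTILINE)
    active = sum(1 for state in states if state == "Active")
    if not devices:
        return {"status": "fail", "detail": "no InfiniBand devices found"}
    if active == 0:
        return {"status": "fail", "detail": f"{len(devices)} device(s), no Active port"}
    return {"status": "ok", "detail": f"{len(devices)} device(s), {active} Active port(s)"}


def probe_infiniband() -> dict:
    """Run ibstat if present; a missing tool is inconclusive, so report skipped."""
    if shutil.which("ibstat") is None:
        return {"status": "skipped", "detail": "ibstat not found on PATH"}
    result = subprocess.run(["ibstat"], capture_output=True, text=True, timeout=30)
    if result.returncode != 0:
        return {"status": "fail", "detail": result.stderr.strip() or "ibstat failed"}
    return parse_ibstat(result.stdout)
```

Keeping the parser separate from the subprocess call makes it possible to test the classification against captured output without access to a GPU node.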

3. Validate NIXL Reachability

```bash
python3 scripts/check_interconnect.py nixl \
  --namespace "${NAMESPACE}" --pod <worker-pod>
```

Looks for NIXL test tooling in the pod and surfaces the exact next step to run a pairwise prefill↔decode transfer test. A full cross-pod transfer test requires two scheduled GPU pods on the fabric.
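
One way to look for tooling inside a pod without changing anything is to ask the pod's shell whether a binary is on `PATH`. The sketch below only builds the `kubectl exec` argument list and interprets the exit code; the function names are hypothetical, and the tool name is left as a parameter because this document does not say which NIXL test binaries the real check searches for.

```python
import shlex


def build_tool_probe(namespace: str, pod: str, tool: str) -> list[str]:
    """Build a read-only kubectl exec argv asking the pod whether `tool` is on PATH."""
    return [
        "kubectl", "exec", "-n", namespace, pod, "--",
        "sh", "-c", f"command -v {shlex.quote(tool)}",
    ]


def classify_probe(returncode: int, tool: str) -> dict:
    """Interpret the exit code of `command -v`: zero means the tool was found."""
    if returncode == 0:
        return {"status": "ok", "detail": f"{tool} found in pod"}
    # Assumed policy: absent tooling is a warning, since reachability is simply unvalidated.
    return {"status": "warn", "detail": f"{tool} not found; transfer test not validated"}
```

The argv can be passed to `subprocess.run` by the caller. Finding the tool only establishes readiness; the pairwise transfer itself still needs two scheduled GPU pods on the fabric.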

Available Scripts

| Script | Purpose | Arguments |
| --- | --- | --- |
| `scripts/check_interconnect.py env` | Inspect NIXL/UCX/NCCL env vars on a recipe | positional recipe path |
| `scripts/check_interconnect.py node` | Probe InfiniBand, GPUDirect RDMA, GDRCopy, NVLink on a node or pod | `--namespace`, `--pod` |
| `scripts/check_interconnect.py nixl` | Surface NIXL transfer-test readiness for a pod | `--namespace`, `--pod` |
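
The subcommand layout in the table could be expressed with `argparse` along these lines. This is a sketch of the interface only, assuming the three subcommands and arguments listed above; `build_parser` is a hypothetical name and the real script may define additional options.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the env, node, and nixl subcommands from the table."""
    parser = argparse.ArgumentParser(prog="check_interconnect.py")
    sub = parser.add_subparsers(dest="command", required=True)

    env = sub.add_parser("env", help="inspect transport env vars on a recipe")
    env.add_argument("recipe", help="path like recipes/<model>/<framework>/<mode>")

    # node and nixl take the same pod-targeting options; both are left optional
    # here so that the node check can also run locally on a GPU node.
    for name in ("node", "nixl"):
        cmd = sub.add_parser(name)
        cmd.add_argument("--namespace")
        cmd.add_argument("--pod")
    return parser
```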

Invoke via the agentskills.io `run_script()` protocol:

```python
run_script("scripts/check_interconnect.py", args=["env", "recipes/qwen3-coder-480b/sglang/disagg"])
run_script("scripts/check_interconnect.py", args=["node", "--namespace", "dynamo-demo", "--pod", "qwen-worker-0"])
```

Examples

Verify a disagg recipe's transport env shape before deploy:

```bash
python3 scripts/check_interconnect.py env recipes/qwen3-coder-480b/sglang/disagg
```

After deploy, validate a worker pod's fabric:

```bash
python3 scripts/check_interconnect.py node \
  --namespace dynamo-demo --pod qwen-worker-0
python3 scripts/check_interconnect.py nixl \
  --namespace dynamo-demo --pod qwen-worker-0
```

Equivalent through the agent protocol:

```python
run_script("scripts/check_interconnect.py", args=["nixl", "--namespace", "dynamo-demo", "--pod", "qwen-worker-0"])
```

Output Contract

Each check returns `ok`, `warn`, `fail`, or `skipped` with a one-line detail, plus a rolled-up verdict on disagg transport readiness. Report:

transport env vars present vs. disagg-critical ones missing
RDMA / GPUDirect / NVLink capability status
whether NIXL reachability was validated, and the next command if not
a clear statement of whether disagg can be trusted, or what to fix first
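
The contract above could be modeled as in the sketch below. The `CheckResult` type, the function names, and the roll-up policy are assumptions for illustration: here any `fail` blocks trust, while `warn` or `skipped` leaves the transport unproven, which matches the note that skipped results are inconclusive rather than a pass. The real script may weigh results differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    name: str
    status: str  # one of: ok, warn, fail, skipped
    detail: str  # one-line explanation


def roll_up(results: list[CheckResult]) -> str:
    """Assumed policy: any fail blocks trust; warn or skipped leaves it unproven."""
    statuses = {result.status for result in results}
    if "fail" in statuses:
        return "not ready: fix failing checks first"
    if not results or statuses & {"warn", "skipped"}:
        return "inconclusive: transport not fully validated"
    return "ready: disagg transport checks passed"


def render(results: list[CheckResult]) -> str:
    """Format one line per check followed by the rolled-up verdict."""
    lines = [f"[{r.status}] {r.name}: {r.detail}" for r in results]
    lines.append(f"verdict: {roll_up(results)}")
    return "\n".join(lines)
```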

Limitations

Read-only fabric probe; does not run a full pairwise NIXL transfer (requires two scheduled GPU pods and the in-pod NIXL test tools).
`skipped` results for missing tools (`ibstat`, `nvidia-smi`, `lsmod`) are inconclusive, not a pass.
Env-var check inspects the recipe text; values injected at runtime via initContainers or operator-applied envs are not detected.
Single-node agg deployments do not exercise the transport — this skill is for disagg / multi-node validation.

Troubleshooting

| Symptom | Likely cause | Next step |
| --- | --- | --- |
| `env` reports all critical vars missing | Vars baked into image or injected by operator | Run the `node` check inside the worker pod to verify actual env |
| `node` reports no Active IB link | Fabric down or HCA not provisioned to the node | Contact cluster admin; verify `kubectl describe node` shows `nvidia.com/gpu` and IB labels |
| `nvidia_peermem` missing | GPUDirect RDMA module not loaded | Ask cluster admin to load `nvidia-peermem`; without it, NIXL falls back to staged copies |
| `nixl` finds no test tools | Worker image lacks NIXL test harness | Use a NIXL-enabled image or run the standalone transfer test from a debug pod |

Benchmark

See `BENCHMARK.md` for the NVCARPS-EVAL performance report (auto-generated by the NVSkills CI pipeline). To refresh, re-run `/nvskills-ci` on an upstream PR touching this skill.

References

`references/interconnect-env-vars.md`: NIXL/UCX/NCCL env var catalog and IB capability checklist.
Use `scripts/check_interconnect.py` for all read-only checks.
