dynamo-interconnect-check

Dynamo Interconnect Check

<!-- SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. SPDX-License-Identifier: CC-BY-4.0 -->

Purpose

Confirm that the transport underlying disaggregated (disagg) serving actually works. A deployment can pass an endpoint smoke test while disagg is silently broken: if NIXL/UCX cannot reach the peer worker over RDMA or NVLink, KV transfer falls back to a slow or broken path. Catch that with read-only checks before trusting a disagg deployment or its benchmark numbers.
This skill is read-only. It never mutates the cluster and never prints secrets.

Prerequisites

  • Python 3.10+ on the operator machine.
  • `kubectl exec` access to a worker pod in the target Dynamo deployment.
  • Read access to the recipe directory (`recipes/<model>/<framework>/<mode>`).
  • For node-capability checks: tools like `ibstat`, `nvidia-smi`, and `lsmod` available in the worker pod image (missing tools are reported as `skipped`, not failures).

When To Use

  • After `dynamo-recipe-runner` deploys a disagg or multi-node recipe.
  • Before reporting disagg throughput/latency, so numbers reflect the real transport.
  • When agg works but disagg is slow, hangs, or returns wrong output and you suspect the fabric rather than the model.

For diagnosing pods that are already crashing or unschedulable, use `dynamo-troubleshoot` first.

Instructions

1. Check Transport Env Vars On The Recipe

```bash
python3 scripts/check_interconnect.py env recipes/<model>/<framework>/<mode>
```

Reports which NIXL/UCX/NCCL transport variables are set and flags disagg-critical ones (e.g. `UCX_TLS`, `UCX_NET_DEVICES`, `NCCL_IB_HCA`) that are absent. A variable missing here is only a warning, because it may be baked into the image, so confirm with the node and NIXL checks. See `references/interconnect-env-vars.md` for what each variable does.
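
The recipe-side check can be pictured as a plain text scan. The sketch below is not the actual `check_interconnect.py` logic. It assumes the recipe manifests declare variables as Kubernetes-style `name: VAR` entries, and the helper name `scan_transport_env` and the critical-variable tuple (just the three examples named above) are illustrative.

```python
import re

# Illustrative only: the three examples named in this section, not an official list.
CRITICAL_VARS = ("UCX_TLS", "UCX_NET_DEVICES", "NCCL_IB_HCA")

# Assumption: env vars appear as Kubernetes-style entries such as `- name: UCX_TLS`.
ENV_NAME_RE = re.compile(
    r"^\s*-?\s*name:\s*[\"']?((?:NIXL|UCX|NCCL)_[A-Z0-9_]+)[\"']?\s*$",
    re.MULTILINE,
)


def scan_transport_env(manifest_text: str) -> dict:
    """Return transport vars found in the text and critical ones that are absent."""
    present = sorted(set(ENV_NAME_RE.findall(manifest_text)))
    missing = [name for name in CRITICAL_VARS if name not in present]
    # Missing variables are only a warning: they may be baked into the image.
    status = "warn" if missing else "ok"
    return {"status": status, "present": present, "missing": missing}


# Hand-written sample manifest fragment, not taken from a real recipe.
sample = """
env:
  - name: UCX_TLS
    value: "rc,cuda_copy"
  - name: NCCL_DEBUG
    value: INFO
"""
result = scan_transport_env(sample)
print(result["status"], result["present"], result["missing"])
```

A text scan like this shares the limitation of any recipe-only inspection: it cannot see values injected at runtime, which is why a `warn` here needs confirmation from inside the pod.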

2. Check Node Capabilities

Locally on a GPU node, or inside a running worker pod:

```bash
python3 scripts/check_interconnect.py node \
  --namespace "${NAMESPACE}" --pod <worker-pod>
```

Probes (read-only) for: InfiniBand devices and Active links, GPUDirect RDMA (`nvidia_peermem`), GDRCopy, and NVLink in the GPU topology. Missing tools are reported as `skipped`, not failures.
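
To make the probe behaviour concrete, here is a minimal sketch of how such read-only checks could be written. It is an assumption about the approach, not the script's real code: the function names are hypothetical, the sample outputs are hand-written and abbreviated, and mapping a non-zero exit code to `fail` is a choice made for illustration.

```python
import shutil
import subprocess


def count_active_ib_ports(ibstat_text: str) -> int:
    """Count ports whose logical state line reads 'State: Active'."""
    return sum(
        1 for line in ibstat_text.splitlines() if line.strip() == "State: Active"
    )


def module_loaded(lsmod_text: str, module: str) -> bool:
    """True if `module` appears in the first column of lsmod output."""
    rows = lsmod_text.splitlines()[1:]  # skip the header row
    return any(row.split()[0] == module for row in rows if row.split())


def probe(tool: str, args: list[str]) -> tuple[str, str]:
    """Run a read-only tool; report 'skipped' when it is not installed."""
    if shutil.which(tool) is None:
        return "skipped", f"{tool} not found"
    done = subprocess.run([tool, *args], capture_output=True, text=True, check=False)
    # Assumption: a non-zero exit code is treated as a failed probe.
    return ("ok" if done.returncode == 0 else "fail"), done.stdout


# Hand-written, abbreviated samples for illustration.
IBSTAT_SAMPLE = """CA 'mlx5_0'
    Port 1:
        State: Active
        Physical state: LinkUp
CA 'mlx5_1'
    Port 1:
        State: Down
        Physical state: Disabled
"""
LSMOD_SAMPLE = """Module                  Size  Used by
nvidia_peermem         16384  0
nvidia              56000000  1 nvidia_peermem
"""
print(count_active_ib_ports(IBSTAT_SAMPLE))
print(module_loaded(LSMOD_SAMPLE, "nvidia_peermem"))
```

Keeping the parsing separate from the subprocess call means the parsers can be exercised on captured output, while a missing binary is reported as `skipped` before anything is executed.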

3. Validate NIXL Reachability

```bash
python3 scripts/check_interconnect.py nixl \
  --namespace "${NAMESPACE}" --pod <worker-pod>
```

Looks for NIXL test tooling in the pod and surfaces the exact next step to run a pairwise prefill↔decode transfer test. A full cross-pod transfer test requires two scheduled GPU pods on the fabric.

Available Scripts

| Script | Purpose | Arguments |
| --- | --- | --- |
| `scripts/check_interconnect.py env` | Inspect NIXL/UCX/NCCL env vars on a recipe | positional recipe path |
| `scripts/check_interconnect.py node` | Probe InfiniBand, GPUDirect RDMA, GDRCopy, NVLink on a node or pod | `--namespace`, `--pod` |
| `scripts/check_interconnect.py nixl` | Surface NIXL transfer-test readiness for a pod | `--namespace`, `--pod` |

Invoke via the agentskills.io `run_script()` protocol:

```python
run_script("scripts/check_interconnect.py", args=["env", "recipes/qwen3-coder-480b/sglang/disagg"])
run_script("scripts/check_interconnect.py", args=["node", "--namespace", "dynamo-demo", "--pod", "qwen-worker-0"])
```

Examples

Verify a disagg recipe's transport env shape before deploy:

```bash
python3 scripts/check_interconnect.py env recipes/qwen3-coder-480b/sglang/disagg
```

After deploy, validate a worker pod's fabric:

```bash
python3 scripts/check_interconnect.py node \
  --namespace dynamo-demo --pod qwen-worker-0
python3 scripts/check_interconnect.py nixl \
  --namespace dynamo-demo --pod qwen-worker-0
```

Equivalent through the agent protocol:

```python
run_script("scripts/check_interconnect.py", args=["nixl", "--namespace", "dynamo-demo", "--pod", "qwen-worker-0"])
```

Output Contract

Each check returns `ok` / `warn` / `fail` / `skipped` with a one-line detail, plus a rolled-up verdict on disagg transport readiness. Report:
  • transport env vars present vs. disagg-critical ones missing
  • RDMA / GPUDirect / NVLink capability status
  • whether NIXL reachability was validated, and the next command if not
  • a clear statement of whether disagg can be trusted, or what to fix first
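
One way to picture the rolled-up verdict is a small reduction over the per-check statuses. The rule below is an assumption made for illustration (the verdict labels and the precedence `fail` over `skipped` over `warn` are hypothetical, not the script's documented behaviour); it only encodes the principle that a `skipped` result is inconclusive rather than a pass.

```python
VALID_STATUSES = {"ok", "warn", "fail", "skipped"}


def roll_up(results: dict[str, str]) -> str:
    """Reduce per-check statuses to a single readiness verdict (illustrative rule)."""
    statuses = set(results.values())
    unknown = statuses - VALID_STATUSES
    if unknown:
        raise ValueError(f"unknown status values: {sorted(unknown)}")
    if "fail" in statuses:
        return "not ready"  # something must be fixed first
    if "skipped" in statuses:
        return "inconclusive"  # a missing tool is not a pass
    if "warn" in statuses:
        return "ready with warnings"
    return "ready"


# Hypothetical check names and results.
checks = {"env": "warn", "ib_links": "ok", "peermem": "skipped"}
print(roll_up(checks))
```

Under this rule a single `skipped` probe keeps the overall verdict from reading as a pass, even when every other check is `ok`.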

Limitations

  • Read-only fabric probe; does not run a full pairwise NIXL transfer (requires two scheduled GPU pods and the in-pod NIXL test tools).
  • `skipped` results for missing tools (`ibstat`, `nvidia-smi`, `lsmod`) are inconclusive, not a pass.
  • Env-var check inspects the recipe text; values injected at runtime via initContainers or operator-applied envs are not detected.
  • Single-node agg deployments do not exercise the transport; this skill is for disagg / multi-node validation.

Troubleshooting

| Symptom | Likely cause | Next step |
| --- | --- | --- |
| `env` reports all critical vars missing | Vars baked into image or injected by operator | Run the `node` check inside the worker pod to verify actual env |
| `node` reports no Active IB link | Fabric down or HCA not provisioned to the node | Contact cluster admin; verify `kubectl describe node` shows `nvidia.com/gpu` and IB labels |
| `nvidia_peermem` missing | GPUDirect RDMA module not loaded | Ask cluster admin to load `nvidia-peermem`; without it, NIXL falls back to staged copies |
| `nixl` finds no test tools | Worker image lacks NIXL test harness | Use a NIXL-enabled image or run the standalone transfer test from a debug pod |

Benchmark

See `BENCHMARK.md` for the NVCARPS-EVAL performance report (auto-generated by the NVSkills CI pipeline). To refresh, re-run `/nvskills-ci` on an upstream PR touching this skill.

References

  • `references/interconnect-env-vars.md`: NIXL/UCX/NCCL env var catalog and IB capability checklist.
  • Use `scripts/check_interconnect.py` for all read-only checks.