launch-nemo-rl

launch-nemo-rl — running NeMo-RL recipes on Kubernetes via nrl-k8s

This is the playbook for the `nrl-k8s` CLI at `infra/nrl_k8s/`. Follow it when the user asks to launch / iterate / debug a NeMo-RL recipe on a Kubernetes cluster. Verify current state (`kubectl`, `git log`, the recipe + infra files) before acting: the cluster is shared and the cost of a wrong action is high.

1. One command, two modes

There is a single top-level submission command: `nrl-k8s run`. It has two lifecycle modes.

| Mode | Invocation | When to use | Cluster after? |
| --- | --- | --- | --- |
| Ephemeral (default) | `nrl-k8s run` | One-shot. KubeRay applies a RayJob, runs, tears the cluster down. Best for most runs. | No (auto) |
| Long-lived | `nrl-k8s run --raycluster` | Dev loop. Reuses a matching live cluster, applies if absent, warns + reuses on drift (pass `--recreate` to replace). Then submits daemons and training. First-choice for iteration. | Yes |

Ask: Do I need this cluster after the run? If yes, use `--raycluster`. Otherwise use the default (ephemeral).
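As a rough illustration of that decision, the sketch below wraps it in a small shell helper. `nrl_run_cmd` is a hypothetical name, not part of the CLI, and the file names are placeholders; the only flags used are the ones listed in the mode table. It prints the command instead of running it, so it is safe to try away from the cluster.

```shell
# Hypothetical helper (not part of nrl-k8s): print the `nrl-k8s run` command
# for a recipe/infra pair, adding --raycluster only when the cluster must
# outlive the run.
nrl_run_cmd() {
  local recipe="$1" infra="$2" keep_cluster="${3:-no}"
  local cmd="nrl-k8s run ${recipe} --infra ${infra}"
  if [ "${keep_cluster}" = "yes" ]; then
    cmd="${cmd} --raycluster"
  fi
  printf '%s\n' "${cmd}"
}

# Placeholder file names, for illustration only.
nrl_run_cmd recipe.yaml recipe.gb300.infra.yaml       # ephemeral (default)
nrl_run_cmd recipe.yaml recipe.gb300.infra.yaml yes   # long-lived
```

Printing first and pasting second is a cheap way to honour the "verify before acting" rule on a shared cluster.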
The rest of the CLI is observability / stage-by-stage control:

| Command | Purpose |
| --- | --- |
| `nrl-k8s check` | Validate a recipe + infra pair; optionally write the fully-resolved manifests (`-o`). |
| `nrl-k8s status` | Per-role RayCluster state, head pod phase, worker pod phases, daemon job status. |
| `nrl-k8s cluster up/down/list/dashboard` | Manage RayClusters independently of a run (e.g. render a manifest with `--dry-run`). |
| `nrl-k8s job list/logs/stop` | Observability over Ray Jobs already submitted to a role's cluster. |
| `nrl-k8s logs` | Tail a role's pod / daemon logs without needing a submission id. |

2. Recipe + infra pair

Every launch takes two files. Pass the infra with `--infra`, not merged inline:

```bash
nrl-k8s run infra/nrl_k8s/examples/<recipe>.yaml \
  --infra infra/nrl_k8s/examples/<recipe>.<profile>.infra.yaml
```

- Recipe (e.g. `qwen3_30b_math_8n_4gpu.yaml`): the NeMo-RL config (model, GRPO/SFT knobs, `cluster.{gpus_per_node,num_nodes}`). Uses `defaults:` to inherit from `examples/configs/recipes/llm/...`.
- Infra (e.g. `*.<profile>.infra.yaml`): the K8s/Ray shape (namespace, image, service account, RayCluster spec under `kuberay:`, optional Deployments under `deployments:`, `submit.submitter`, `launch.{mode,codeSource,codePath,entrypoint}`). Pair names follow `<recipe>.<profile>[.prod].infra.yaml`, where `<profile>` names the hardware target (e.g. `gb300`).

Example pairs are in `infra/nrl_k8s/examples/`; read the neighbouring files to see the current conventions for the target profile.
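The naming convention can be sketched as a tiny shell function. `infra_path` is a hypothetical helper, not something the CLI ships, and the printed path only illustrates the `<recipe>.<profile>[.prod].infra.yaml` pattern; it is an assumption that any particular pair exists on disk.

```shell
# Hypothetical helper: derive the infra file name that pairs with a recipe,
# following <recipe>.<profile>[.prod].infra.yaml.
infra_path() {
  local recipe_yaml="$1" profile="$2" prod="${3:-}"
  local base="${recipe_yaml%.yaml}"   # strip the trailing .yaml
  if [ -n "${prod}" ]; then
    printf '%s.%s.prod.infra.yaml\n' "${base}" "${profile}"
  else
    printf '%s.%s.infra.yaml\n' "${base}" "${profile}"
  fi
}

# Prints the name the convention implies; check that the file actually exists.
infra_path infra/nrl_k8s/examples/qwen3_30b_math_8n_4gpu.yaml gb300
```

Listing `infra/nrl_k8s/examples/` remains the authoritative check, since the helper only encodes the pattern.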

3. Long-lived mode flags

Three independent dimensions. `--mode` is a macro that picks defaults; individual flags override it.

| `--mode` | Expands to | Behaviour |
| --- | --- | --- |
| `interactive` | `--submitter portForward --code-source upload` | tails logs |
| `batch` | `--submitter exec --code-source image` | returns after nohup |

- Submitter: `portForward` uses `kubectl port-forward` + Ray Job SDK (gets a `submission_id` the dashboard tracks). `exec` uses `kubectl exec` + `nohup` on the head pod (no `submission_id`; driver appears as `type=DRIVER` in the dashboard).
- Code source: `upload` stages a working_dir from the laptop (Ray 100 MiB cap). `image` / `lustre` expect code on the pod's filesystem, paired with `--code-path` (typically `/opt/nemo-rl`), which is a subPath of the shared-filesystem PVC mount in the standard infra examples.
- Wait: `--wait` tails logs until terminal; `--no-wait` returns as soon as the driver is running.

Other long-lived-only flags:

- `--replace`: stop any running training / daemon job before submitting new ones (suffixes daemon submissionIds with a timestamp so Ray accepts the resubmit).
- `--recreate`: delete + re-apply a RayCluster whose live spec has drifted from the rendered manifest (default is warn + reuse).
- `--skip-daemons`: bring up all declared clusters but only submit training. Use on disagg recipes where gym/generation are already healthy.

Gotcha: on infra where the entrypoint does `cd /opt/nemo-rl` (or another in-image / Lustre path) and loads the recipe from there, `--code-source upload` does NOT override the recipe on the pod: the uploaded working_dir sits in `/tmp/ray/...` but the entrypoint `cd`s away from it. To actually test a local recipe change, either sync your edits to the shared filesystem mounted into the pods or flip the Hydra overrides in the entrypoint.
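To make the `--mode` macro from this section concrete, here is a minimal sketch of the expansion as a lookup. `mode_defaults` is a hypothetical function written for illustration; it mirrors the two mappings listed above and says nothing about how the CLI implements them internally.

```shell
# Hypothetical lookup mirroring the documented --mode defaults.
# Explicit --submitter / --code-source flags would still override these.
mode_defaults() {
  case "$1" in
    interactive) printf '%s\n' "--submitter portForward --code-source upload" ;;
    batch)       printf '%s\n' "--submitter exec --code-source image" ;;
    *)           printf 'unknown mode: %s\n' "$1" >&2; return 1 ;;
  esac
}

mode_defaults interactive
mode_defaults batch
```

Reading the expansion this way helps spot the gotcha early: `interactive` implies `upload`, which is the code source that an entrypoint `cd` can silently bypass.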

4. Ephemeral mode flags (`--rayjob`)

When `--rayjob` is set, `run` branches into the RayJob code path. Relevant flags:

- `--rayjob-name NAME`: RayJob metadata name (defaults to the training cluster name).
- `--shutdown / --no-shutdown`: default `true`; KubeRay deletes the RayCluster once the Ray Job reaches a terminal state.
- `--ttl SECONDS`: default 3600s; keep the RayJob object around after the run finishes for post-mortem log access.
- `--wait / --no-wait`: default `wait`; poll `jobDeploymentStatus` until Complete/Failed. `--no-wait` returns as soon as the RayJob is applied.
- `--timeout SECONDS`: default 86400s (24h); bound the `--wait` poll.
- `--dry-run`: render the RayJob manifest and print it; do not apply.

`--replace` / `--recreate` / `--skip-daemons` are silently ignored in `--rayjob` mode (KubeRay owns lifecycle).

5. Iterating on a config without touching the shared filesystem

When the recipe on the pod filesystem has the wrong value for your experiment, use Hydra overrides on the entrypoint instead of forking the recipe. Pattern:

```yaml
entrypoint: |
  set -eu
  cd /opt/nemo-rl
  RUN_ID="\${RAY_JOB_SUBMISSION_ID:-\${NRL_K8S_RUN_ID:-$(date -u +%Y%m%d-%H%M%S)}}"
  python -u examples/run_grpo.py \
    --config infra/nrl_k8s/examples/<recipe>.yaml \
    logger.wandb_enabled=true \
    logger.wandb.project=<project> \
    "logger.wandb.name=<run-name>-\${RUN_ID}"
```

Escape `${…}` with a backslash. OmegaConf otherwise interprets it as interpolation and errors on shell-style `${VAR:-default}`.

`RUN_ID` resolves to `RAY_JOB_SUBMISSION_ID` (injected by KubeRay in rayjob mode) → `NRL_K8S_RUN_ID` (injected by the CLI in long-lived mode) → a UTC timestamp from `date -u`, so the name is unique across either path.
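The fallback chain is easy to check in a plain shell, where no backslash escaping is needed because OmegaConf is not involved. The sketch below is only an illustration: `resolve_run_id` is a hypothetical function name, and `raysubmit_example` and `dev-001` are made-up values.

```shell
# Same ${A:-${B:-default}} chain as the entrypoint, without the OmegaConf escapes.
resolve_run_id() {
  printf '%s\n' "${RAY_JOB_SUBMISSION_ID:-${NRL_K8S_RUN_ID:-$(date -u +%Y%m%d-%H%M%S)}}"
}

# Both set: the Ray submission id wins.
( RAY_JOB_SUBMISSION_ID=raysubmit_example; NRL_K8S_RUN_ID=dev-001; resolve_run_id )  # prints raysubmit_example

# Only the CLI-provided id set: it is used instead.
( unset RAY_JOB_SUBMISSION_ID; NRL_K8S_RUN_ID=dev-001; resolve_run_id )              # prints dev-001

# Neither set: falls back to a UTC timestamp such as 20250101-120000.
( unset RAY_JOB_SUBMISSION_ID NRL_K8S_RUN_ID; resolve_run_id )
```

Note that `:-` also falls through when a variable is set but empty, which is usually what you want for a run name.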

6. Per-profile concerns (hardware + scheduler + DRA)

Every infra YAML encodes a hardware/scheduler profile. The concrete examples in `infra/nrl_k8s/examples/` are authoritative for the profiles they target; read the neighbouring infra file before writing a new one. Things that commonly vary:

- Per-node GPUs (e.g. 4 vs 8): must match `cluster.gpus_per_node` in the recipe, otherwise workers stay `Pending`.
- Node selectors: head pods usually land on a CPU-only node pool; GPU workers match on `nvidia.com/gpu.product` or a node-group label.
- Scheduler: KAI (`schedulerName: kai-scheduler` + `kai.scheduler/queue` label) with topology annotations (`kai.scheduler/topology`, `kai.scheduler/topology-required-placement`) gang-schedules workers into one clique. Without it, pods may land on different racks and NVLink/RoCE won't span them.
- DRA claims: ComputeDomain + RoCE are attached via `resourceClaims` referencing `ResourceClaimTemplate`s. The CLI auto-creates/deletes these when the worker pod spec contains DRA claim references, so no manual setup is needed.
- Secrets: always via `secretKeyRef` (`wandb-api-key`, image pull secret). Never embed.
- Shared filesystem mounts: typically a Lustre PVC mounted twice, once at the code path (e.g. `/opt/nemo-rl` with a user-scoped `subPath`) and once at a workspace root (e.g. `/mnt/rl-workspace`) for datasets, HF cache, and checkpoints.

Before applying an infra, verify prereqs exist in the target namespace:

```bash
kubectl get pvc <workspace-pvc>
kubectl get secret <wandb-secret> <image-pull-secret>
kubectl get sa <service-account>
```
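The three prerequisite lookups can be folded into one pass that reports everything missing at once. This is a sketch under assumptions: `check_prereqs` is a hypothetical helper, the argument order is arbitrary, and it relies only on `kubectl get` exiting non-zero when an object is absent.

```shell
# Hypothetical pre-flight check: report every missing prerequisite, then
# return non-zero if anything was missing.
# Usage: check_prereqs <namespace> <workspace-pvc> <service-account> <secret>...
check_prereqs() {
  local ns="$1" pvc="$2" sa="$3" missing=0 secret
  shift 3
  kubectl get pvc "${pvc}" -n "${ns}" >/dev/null 2>&1 \
    || { echo "missing pvc/${pvc}"; missing=1; }
  kubectl get sa "${sa}" -n "${ns}" >/dev/null 2>&1 \
    || { echo "missing sa/${sa}"; missing=1; }
  for secret in "$@"; do
    kubectl get secret "${secret}" -n "${ns}" >/dev/null 2>&1 \
      || { echo "missing secret/${secret}"; missing=1; }
  done
  return "${missing}"
}
```

A call might look like `check_prereqs default <workspace-pvc> <service-account> <wandb-secret> <image-pull-secret>`, run before `nrl-k8s run` so a missing secret surfaces before any pods are created.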

7. End-to-end workflows

7a. Fresh one-shot run (rayjob)

```bash
# From the NeMo-RL repo root:
nrl-k8s check <recipe> --infra <infra>                      # validate first
nrl-k8s run <recipe> --infra <infra> --rayjob --dry-run     # render RayJob manifest
nrl-k8s run <recipe> --infra <infra> --rayjob --no-wait     # apply, returns fast
```

Watch status + teardown (works even after your laptop disconnects because KubeRay owns the lifecycle):

```bash
kubectl get rayjob -n default <name> -w
kubectl get raycluster -n default                                    # empty = teardown succeeded
```

7b. Dev loop (long-lived)

```bash
nrl-k8s run <recipe> --infra <infra> --run-id "$(date +%Y%m%d-%H%M%S)"

# Edits in the recipe? Just re-run: it reuses the live cluster.
# Pod spec changed? Add --recreate to delete + re-apply.
# Disagg recipe with gym/gen already healthy? Add --skip-daemons.
```

7c. First-time disaggregated bring-up

```bash
nrl-k8s run <recipe> --infra <disagg-infra> --mode batch --code-source image
```

7d. Cluster-only lifecycle

```bash
nrl-k8s cluster up   <recipe> --infra <infra> --target kuberay.training --wait
nrl-k8s cluster up   <recipe> --infra <infra> --target kuberay.training --dry-run   # render manifest
nrl-k8s cluster down <recipe> --infra <infra> --target kuberay.training --wait
nrl-k8s cluster down <recipe> --infra <infra>                                       # tear down all
nrl-k8s cluster list -n default
nrl-k8s cluster dashboard <cluster-name>                                  # port-forward + browser
```

7e. Deployments (e.g. nemo-skills sandbox)

```bash
# Bring up just the deployment
nrl-k8s cluster up <recipe> --infra <infra> --target deployments.nemo_skills

# Tear down just the deployment
nrl-k8s cluster down <recipe> --infra <infra> --target deployments.nemo_skills

# Tear down everything (RayClusters + Deployments)
nrl-k8s cluster down <recipe> --infra <infra>
```

The `deployments:` section in infra YAML declares Kubernetes Deployments managed alongside RayClusters. The CLI patches image, imagePullSecrets, and serviceAccountName from the top-level infra keys (same as RayClusters). Deployments start in parallel with cluster bring-up, with no ordering dependency.

8. Monitoring a run

```bash
# Status
nrl-k8s status <recipe> --infra <infra>
kubectl get rayjob,raycluster -n default

# Follow the driver
nrl-k8s job list <recipe> --infra <infra> --role training
nrl-k8s job logs <run-id> <recipe> --infra <infra> --role training -f
```

When the `nrl-k8s job logs -f` subprocess dies (`kubectl port-forward` i/o timeout after ~15 min idle), just re-run it. The training job keeps going.

To fetch driver logs for a terminal job (SUCCEEDED/FAILED) or a RayJob via the dashboard API:

```bash
RC=$(kubectl get rayjob -n default <rayjob-name> -o jsonpath='{.status.rayClusterName}')
kubectl port-forward -n default svc/${RC}-head-svc 18266:8265 &
curl -s http://localhost:18266/api/jobs/                              # lists jobs, find submission_id
curl -s "http://localhost:18266/api/jobs/<submission_id>/logs"        # full driver log
```

`type=DRIVER` with `submission_id=null` means an exec-submitter run (no dashboard log endpoint; use `nrl-k8s job logs` instead). `type=SUBMISSION` has `submission_id` set and `/api/jobs/<id>/logs` works.

The Wandb URL appears in the driver log on the first `wandb.init` call; extract it with `grep -oE 'https://wandb\.ai/[A-Za-z0-9_./-]+'`.
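The URL extraction can be tried locally before pointing it at a real driver log. In this sketch `extract_wandb_url` is a hypothetical wrapper around the `grep` pattern from this section, and the sample log line is invented for illustration; real driver output will look different.

```shell
# Hypothetical wrapper: read a driver log on stdin and print the first
# wandb URL found, using the same pattern as the grep command in this section.
extract_wandb_url() {
  grep -oE 'https://wandb\.ai/[A-Za-z0-9_./-]+' | head -n 1
}

# Invented sample line, only to exercise the pattern.
printf '%s\n' 'wandb: View run at https://wandb.ai/my-team/my-project/runs/abc123' \
  | extract_wandb_url
```

Against a live run, the same function can consume the output of `nrl-k8s job logs` or the dashboard `/logs` endpoint through a pipe.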

9. Stopping things

| What to stop | Command |
| --- | --- |
| One training run | `nrl-k8s job stop <run-id> <recipe> --infra <infra> --role training` |
| All running Ray jobs on a cluster (+ submit new) | `nrl-k8s run <recipe> --infra <infra> --replace` |
| A long-lived RayCluster | `nrl-k8s cluster down <recipe> --infra <infra> --target kuberay.training --wait` |
| A RayJob (ephemeral) | `kubectl delete rayjob <name> -n default` (only if `shutdownAfterJobFinishes` didn't fire) |

Confirm before deleting shared infra. The cost of `cluster down` on someone else's cluster is high.

10. Verifying RayJob teardown

After a `run --rayjob` completes with `--shutdown` (default), KubeRay should delete the RayCluster:

```bash
kubectl get rayjob   -n default <rayjob-name>                        # jobDeploymentStatus = Complete
kubectl get raycluster -n default | grep <rayjob-name>               # no output = torn down
```

The RayJob object itself sticks around for `--ttl` seconds (default 3600s) so you can still fetch logs.
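Both teardown checks can be combined into a single pass/fail function. This is a sketch, not part of the CLI: `rayjob_torn_down` is a hypothetical name, and it assumes, as the `grep` above does, that the RayCluster name contains the RayJob name.

```shell
# Hypothetical check: succeed only if the RayJob reports Complete and no
# RayCluster matching its name is left in the namespace.
# Usage: rayjob_torn_down <namespace> <rayjob-name>
rayjob_torn_down() {
  local ns="$1" name="$2" status
  status="$(kubectl get rayjob "${name}" -n "${ns}" \
    -o jsonpath='{.status.jobDeploymentStatus}')"
  if [ "${status}" != "Complete" ]; then
    echo "rayjob ${name}: ${status:-unknown}"
    return 1
  fi
  if kubectl get raycluster -n "${ns}" | grep -q "${name}"; then
    echo "raycluster for ${name} still present"
    return 1
  fi
  echo "rayjob ${name}: torn down"
}
```

It has to run within the `--ttl` window, because once the RayJob object is garbage-collected the first lookup fails and the function reports an unknown status.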

11. Common gotchas

- OmegaConf interpolation eats `${VAR}` in recipe/infra YAML. Escape shell variables with `\${VAR}` so OmegaConf passes them through to the pod shell verbatim.
- Megatron optimizer configs don't carry `foreach` / `fused`. Overrides like `~policy.optimizer.kwargs.foreach ~policy.optimizer.kwargs.fused` (valid for DTensor configs) break on Megatron recipes. Omit them for Megatron.
- DTensor vs Megatron: MoE recipes typically use `megatron_cfg.enabled=true`; ensure `dtensor_cfg.enabled=false` in inherited defaults.
- Shared filesystem vs git divergence: `codeSource: image|lustre` reads from the pod filesystem. If your local edits aren't on the shared filesystem the pods mount, the run is testing the on-disk version, not yours. Either sync via a helper pod (head pod exec is often blocked) or override via Hydra flags.
- Ephemeral-storage + readinessProbe are injected by kuberay/CDI webhooks at pod-apply time. Do NOT add them to the inline RayCluster spec.
- Node taints vary per cluster. `tolerations: [{operator: Exists}]` on workers is defensive and worth keeping.
- Dashboard blank page: Ray 2.52 installs dashboard assets as symlinks by default; `nrl-k8s cluster dashboard <name>` auto-reinstalls `ray[default] --link-mode=copy` to fix it. Bake `ENV UV_LINK_MODE=copy` in the image to avoid this entirely.
- `kubectl exec` is usually blocked in automation; route around it with `kubectl get ... -o yaml`, `kubectl logs`, and `kubectl port-forward` + Ray dashboard APIs.

12. Checklist before calling a run "done"

Before reporting a launch as successful, verify:

1. `kubectl get rayjob/raycluster -n default` shows the expected objects.
2. `nrl-k8s job list` (or `curl /api/jobs/`) shows the job in `RUNNING` / `SUCCEEDED`.
3. Driver log contains `wandb.ai/<project>/runs/<id>` (if wandb is enabled); share the URL with the user.
4. At least one `Processed prompts: 100%` line appears (confirms generation is wired).
5. For `--rayjob` mode only: after `jobDeploymentStatus=Complete`, confirm `kubectl get raycluster | grep <name>` is empty (teardown worked).

13. Dev pod

`nrl-k8s dev` manages a lightweight CPU pod on the cluster for code syncing, debugging, and running `kubectl` / `nrl-k8s` from within the cluster.

```bash
# One-time: set up secrets (HF token, wandb, SSH key, rclone)
nrl-k8s dev setup-secrets --ssh-key ~/.ssh/id_rsa --add-rclone

# Create pod and exec in (idempotent: reuses an existing pod)
nrl-k8s dev connect

# Switch image (must stop first: an image change is warned about but not auto-applied)
nrl-k8s dev stop
nrl-k8s dev connect --image nvcr.io/nvidian/nemo-rl:v0.7.0

# Tear down
nrl-k8s dev stop
```

The dev pod:
- Runs on a CPU-only node (anti-affinity to GPU nodes)
- Mounts the shared `rl-workspace` PVC at `/mnt/rl-workspace`
- Sets `USER` env var to the `nrl-k8s` username (so `$USER` and `getpass.getuser()` work correctly despite running as root)
- Installs `kubectl`, `rclone` (if configured) on first boot
- Injects SSH keys and tokens via `envFrom` on a per-user K8s Secret

The pod's `default` service account needs an `edit` RoleBinding in the namespace for `kubectl` to work inside. `dev connect` checks this and prints the required YAML if missing.

14. Where things live in the repo

- CLI code: `infra/nrl_k8s/src/nrl_k8s/` (`cli.py`, `orchestrate.py`, `manifest.py`, `rayjob.py`, `k8s.py`, `submitters/`, `schema.py`).
- Tests: `infra/nrl_k8s/tests/unit/`; run with `uv run --extra test pytest -x -q` from `infra/nrl_k8s/`.
- Recipe + infra examples: `infra/nrl_k8s/examples/`.
- Base recipes this tool wraps: `examples/configs/recipes/llm/…` and `examples/nemo_gym/…`.