launch-nemo-rl

launch-nemo-rl — running NeMo-RL recipes on Kubernetes via nrl-k8s

This is the playbook for the `nrl-k8s` CLI at `infra/nrl_k8s/`. Follow it when the user asks to launch / iterate / debug a NeMo-RL recipe on a Kubernetes cluster. Verify current state (`kubectl`, `git log`, the recipe + infra files) before acting: the cluster is shared and the cost of a wrong action is high.

1. One command, two modes

There is a single top-level submission command: `nrl-k8s run`. It has two lifecycle modes.

| Mode | Invocation | When to use | Cluster after? |
|---|---|---|---|
| Ephemeral (default) | `nrl-k8s run` | One-shot. KubeRay applies a RayJob, runs, tears the cluster down. Best for most runs. | No (auto) |
| Long-lived | `nrl-k8s run --raycluster` | Dev loop. Reuses a matching live cluster, applies if absent, warns + reuses on drift (pass `--recreate` to replace). Then submits daemons and training. First-choice for iteration. | Yes |

Ask: Do I need this cluster after the run? If yes, use `--raycluster`. Otherwise use the default (ephemeral).

The rest of the CLI is observability / stage-by-stage control:

| Command | Purpose |
|---|---|
| `nrl-k8s check` | Validate a recipe + infra pair; optionally write the fully-resolved manifests (`-o`). |
| `nrl-k8s status` | Per-role RayCluster state, head pod phase, worker pod phases, daemon job status. |
| `nrl-k8s cluster up/down/list/dashboard` | Manage RayClusters independently of a run (e.g. render a manifest with `--dry-run`). |
| `nrl-k8s job list/logs/stop` | Observability over Ray Jobs already submitted to a role's cluster. |
| `nrl-k8s logs` | Tail a role's pod / daemon logs without needing a submission id. |
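The mode question above can be encoded in a small wrapper. This is a minimal sketch, assuming a hypothetical helper named `nrl_run_cmd` (it is not part of the CLI). It only prints a command built from the `--infra` and `--raycluster` flags described above, so nothing touches the cluster until you run the output yourself.

```bash
# Hypothetical helper (not part of nrl-k8s): print the run command for a
# recipe/infra pair, adding --raycluster only when the cluster must outlive
# the run.
nrl_run_cmd() {
  recipe=$1
  infra=$2
  keep_cluster=${3:-no}   # "yes" = long-lived, anything else = ephemeral

  if [ "$keep_cluster" = "yes" ]; then
    printf 'nrl-k8s run %s --infra %s --raycluster\n' "$recipe" "$infra"
  else
    printf 'nrl-k8s run %s --infra %s\n' "$recipe" "$infra"
  fi
}
```

Calling `nrl_run_cmd <recipe> <infra> yes` prints the long-lived invocation; omitting the third argument prints the ephemeral default.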

2. Recipe + infra pair

Every launch takes two files. Pass the infra with `--infra`, not merged inline:

```bash
nrl-k8s run infra/nrl_k8s/examples/<recipe>.yaml \
  --infra infra/nrl_k8s/examples/<recipe>.<profile>.infra.yaml
```

- Recipe (e.g. `qwen3_30b_math_8n_4gpu.yaml`). NeMo-RL config: model, GRPO/SFT knobs, `cluster.{gpus_per_node,num_nodes}`. Uses `defaults:` to inherit from `examples/configs/recipes/llm/...`.
- Infra (e.g. `*.<profile>.infra.yaml`). K8s/Ray shape: namespace, image, service account, RayCluster spec under `kuberay:`, optional Deployments under `deployments:`, `submit.submitter`, `launch.{mode,codeSource,codePath,entrypoint}`. Pair names follow `<recipe>.<profile>[.prod].infra.yaml`, where `<profile>` names the hardware target (e.g. `gb300`).

Example pairs live in `infra/nrl_k8s/examples/`; read the neighbouring files to see the current conventions for the target profile.
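The pair-naming convention is mechanical enough to script. The sketch below assumes a hypothetical helper named `infra_for`. It only builds the expected file name from the `<recipe>.<profile>[.prod].infra.yaml` pattern and does not check that the file exists.

```bash
# Hypothetical helper: derive the infra file that pairs with a recipe,
# following the <recipe>.<profile>[.prod].infra.yaml naming convention.
infra_for() {
  recipe_path=$1
  profile=$2
  variant=${3:-}          # pass "prod" for the .prod variant

  dir=$(dirname "$recipe_path")
  stem=$(basename "$recipe_path" .yaml)

  if [ -n "$variant" ]; then
    printf '%s/%s.%s.%s.infra.yaml\n' "$dir" "$stem" "$profile" "$variant"
  else
    printf '%s/%s.%s.infra.yaml\n' "$dir" "$stem" "$profile"
  fi
}
```

For example, `infra_for infra/nrl_k8s/examples/<recipe>.yaml gb300` prints the matching `gb300` infra path next to the recipe. Whether that pair is actually present in the examples directory still has to be checked by hand.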

3. Long-lived mode flags

Three independent dimensions. `--mode` is a macro that picks defaults; individual flags override it.

```
--mode interactive   → --submitter portForward  --code-source upload  (tails logs)
--mode batch         → --submitter exec         --code-source image   (returns after nohup)
```

- Submitter: `portForward` uses `kubectl port-forward` + Ray Job SDK (gets a `submission_id` the dashboard tracks). `exec` uses `kubectl exec` + `nohup` on the head pod (no submission_id; driver appears as `type=DRIVER` in the dashboard).
- Code source: `upload` stages a working_dir from the laptop (Ray 100 MiB cap). `image` / `lustre` expect code on the pod's filesystem, paired with `--code-path` (typically `/opt/nemo-rl`), which is a subPath of the shared-filesystem PVC mount in the standard infra examples.
- Wait: `--wait` tails logs until terminal; `--no-wait` returns as soon as the driver is running.

Other long-lived-only flags:

- `--replace`: stop any running training / daemon job before submitting new ones (suffixes daemon submissionIds with a timestamp so Ray accepts the resubmit).
- `--recreate`: delete + re-apply a RayCluster whose live spec has drifted from the rendered manifest (default is warn + reuse).
- `--skip-daemons`: bring up all declared clusters but only submit training. Use on disagg recipes where gym/generation are already healthy.

Gotcha: on infra where the entrypoint does `cd /opt/nemo-rl` (or another in-image / Lustre path) and loads the recipe from there, `--code-source upload` does NOT override the recipe on the pod. The uploaded working_dir sits in `/tmp/ray/...` but the entrypoint `cd`s away from it. To actually test a local recipe change, either sync your edits to the shared filesystem mounted into the pods or flip the Hydra overrides in the entrypoint.
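To make the macro behaviour concrete, the sketch below expands a `--mode` value into the explicit flags listed in the table above. The function name `expand_mode` is hypothetical, and the mapping is copied from this section rather than from the CLI source, so treat it as illustration only.

```bash
# Sketch only: mirrors the --mode table above; the CLI's real defaults live
# in its own code and may differ.
expand_mode() {
  case $1 in
    interactive) echo "--submitter portForward --code-source upload" ;;
    batch)       echo "--submitter exec --code-source image" ;;
    *)           echo "unknown mode: $1" >&2; return 1 ;;
  esac
}
```

Because individual flags override the macro, an invocation that passes `--mode batch` together with an explicit `--code-source` keeps the `exec` submitter from the macro and takes the code source from the explicit flag.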

4. Ephemeral mode flags (`--rayjob`)

When `--rayjob` is set, `run` branches into the RayJob code path. Relevant flags:

- `--rayjob-name NAME`: RayJob metadata name (defaults to the training cluster name).
- `--shutdown / --no-shutdown`: default `true`; KubeRay deletes the RayCluster once the Ray Job reaches a terminal state.
- `--ttl SECONDS`: default 3600s; keep the RayJob object around after the run finishes for post-mortem log access.
- `--wait / --no-wait`: default `wait`; poll `jobDeploymentStatus` until Complete/Failed. `--no-wait` returns as soon as the RayJob is applied.
- `--timeout SECONDS`: default 86400s (24h); bound the `--wait` poll.
- `--dry-run`: render the RayJob manifest and print it; do not apply.

`--replace` / `--recreate` / `--skip-daemons` are silently ignored in `--rayjob` mode (KubeRay owns lifecycle).
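Since the ignored flags produce no message, a pre-flight check in a wrapper script can surface the mistake. This is a minimal sketch, assuming a hypothetical function named `warn_ignored_flags` that is handed the same arguments you are about to pass to `nrl-k8s run`. It inspects the argument list only and never calls the CLI.

```bash
# Hypothetical pre-flight check: warn when long-lived-only flags are combined
# with --rayjob, where the text above says they are silently ignored.
warn_ignored_flags() {
  rayjob=no
  for arg in "$@"; do
    if [ "$arg" = "--rayjob" ]; then rayjob=yes; fi
  done
  [ "$rayjob" = "yes" ] || return 0

  rc=0
  for arg in "$@"; do
    case $arg in
      --replace|--recreate|--skip-daemons)
        echo "warning: $arg has no effect with --rayjob" >&2
        rc=1
        ;;
    esac
  done
  return $rc
}
```

A non-zero return means at least one flag would have been dropped, which is usually a sign that the long-lived mode was intended.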

5. Iterating on a config without touching the shared filesystem

When the recipe on the pod filesystem has the wrong value for your experiment, use Hydra overrides on the entrypoint instead of forking the recipe. Pattern:

```yaml
entrypoint: |
  set -eu
  cd /opt/nemo-rl
  RUN_ID="\${RAY_JOB_SUBMISSION_ID:-\${NRL_K8S_RUN_ID:-$(date -u +%Y%m%d-%H%M%S)}}"
  python -u examples/run_grpo.py \
    --config infra/nrl_k8s/examples/<recipe>.yaml \
    logger.wandb_enabled=true \
    logger.wandb.project=<project> \
    "logger.wandb.name=<run-name>-\${RUN_ID}"
```

Escape `${…}` with a backslash. OmegaConf otherwise interprets it as interpolation and errors on shell-style `${VAR:-default}`. `RUN_ID` resolves to `RAY_JOB_SUBMISSION_ID` (injected by KubeRay in rayjob mode) → `NRL_K8S_RUN_ID` (injected by the CLI in long-lived mode) → local timestamp, so the name is unique across either path.
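The fallback chain is ordinary shell parameter expansion, so it can be tried locally before it goes into an infra file. The sketch below wraps it in a hypothetical function named `resolve_run_id`. Outside OmegaConf-parsed YAML the backslashes are not needed.

```bash
# The same fallback chain as the entrypoint, written as plain shell (no
# backslashes needed outside OmegaConf-parsed YAML).
resolve_run_id() {
  echo "${RAY_JOB_SUBMISSION_ID:-${NRL_K8S_RUN_ID:-$(date -u +%Y%m%d-%H%M%S)}}"
}
```

With both variables unset the function prints a UTC timestamp; setting either one makes that value win, with `RAY_JOB_SUBMISSION_ID` taking precedence.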

6. Per-profile concerns (hardware + scheduler + DRA)

Every infra YAML encodes a hardware/scheduler profile. The concrete examples in `infra/nrl_k8s/examples/` are authoritative for the profiles they target; read the neighbouring infra file before writing a new one. Things that commonly vary:

- Per-node GPUs (e.g. 4 vs 8): must match `cluster.gpus_per_node` in the recipe, otherwise workers stay `Pending`.
- Node selectors: head pods usually land on a CPU-only node pool; GPU workers match on `nvidia.com/gpu.product` or a node-group label.
- Scheduler: KAI (`schedulerName: kai-scheduler` + `kai.scheduler/queue` label) with topology annotations (`kai.scheduler/topology`, `kai.scheduler/topology-required-placement`) gang-schedules workers into one clique. Without it, pods may land on different racks and NVLink/RoCE won't span them.
- DRA claims: ComputeDomain + RoCE are attached via `resourceClaims` referencing `ResourceClaimTemplate`s. The CLI auto-creates/deletes these when the worker pod spec contains DRA claim references, so no manual setup is needed.
- Secrets: always via `secretKeyRef` (`wandb-api-key`, image pull secret). Never embed.
- Shared filesystem mounts: typically a Lustre PVC mounted twice, once at the code path (e.g. `/opt/nemo-rl` with a user-scoped `subPath`) and once at a workspace root (e.g. `/mnt/rl-workspace`) for datasets, HF cache, and checkpoints.

Before applying an infra, verify prereqs exist in the target namespace:

```bash
kubectl get pvc <workspace-pvc>
kubectl get secret <wandb-secret> <image-pull-secret>
kubectl get sa <service-account>
```
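The `kubectl get` checks above can be folded into one pre-flight function that reports every missing object instead of stopping at the first. This is a sketch under assumptions: the function name `check_prereqs` and the `kind/name` argument format are made up for illustration, and only `kubectl get <kind> <name> -n <namespace>` is used.

```bash
# Hypothetical pre-flight: report every missing prerequisite instead of
# stopping at the first one. Arguments after the namespace are "kind/name".
check_prereqs() {
  ns=$1
  shift
  missing=0
  for ref in "$@"; do
    kind=${ref%%/*}
    name=${ref#*/}
    if kubectl get "$kind" "$name" -n "$ns" >/dev/null 2>&1; then
      echo "ok       $ref"
    else
      echo "MISSING  $ref"
      missing=1
    fi
  done
  return $missing
}
```

A call such as `check_prereqs default pvc/<workspace-pvc> secret/<wandb-secret> secret/<image-pull-secret> sa/<service-account>` returns non-zero if anything is absent, which makes it easy to gate a launch script on it.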

7. End-to-end workflows

7a. Fresh one-shot run (rayjob)

```bash
# From the NeMo-RL repo root:
nrl-k8s check <recipe> --infra <infra>                      # validate first
nrl-k8s run <recipe> --infra <infra> --rayjob --dry-run     # render RayJob manifest
nrl-k8s run <recipe> --infra <infra> --rayjob --no-wait     # apply, returns fast
```

Watch status + teardown (works even after your laptop disconnects because KubeRay owns the lifecycle):

```bash
kubectl get rayjob -n default <name> -w
kubectl get raycluster -n default                                    # empty = teardown succeeded
```

7b. Dev loop (long-lived)

```bash
nrl-k8s run <recipe> --infra <infra> --run-id $(date +%Y%m%d-%H%M%S)

# Edits in the recipe? Just re-run; it reuses the live cluster.
# Pod spec changed? Add --recreate to delete + re-apply.
# Disagg recipe with gym/gen already healthy? --skip-daemons.
```

7c. First-time disaggregated bring-up

```bash
nrl-k8s run <recipe> --infra <disagg-infra> --mode batch --code-source image
```

7d. Cluster-only lifecycle

```bash
nrl-k8s cluster up   <recipe> --infra <infra> --target kuberay.training --wait
nrl-k8s cluster up   <recipe> --infra <infra> --target kuberay.training --dry-run   # render manifest
nrl-k8s cluster down <recipe> --infra <infra> --target kuberay.training --wait
nrl-k8s cluster down <recipe> --infra <infra>                                       # tear down all
nrl-k8s cluster list -n default
nrl-k8s cluster dashboard <cluster-name>                                  # port-forward + browser
```

7e. Deployments (e.g. nemo-skills sandbox)

```bash
# Bring up just the deployment
nrl-k8s cluster up <recipe> --infra <infra> --target deployments.nemo_skills

# Tear down just the deployment
nrl-k8s cluster down <recipe> --infra <infra> --target deployments.nemo_skills

# Tear down everything (RayClusters + Deployments)
nrl-k8s cluster down <recipe> --infra <infra>
```

The `deployments:` section in infra YAML declares Kubernetes Deployments managed alongside RayClusters. The CLI patches image, imagePullSecrets, and serviceAccountName from the top-level infra keys (same as RayClusters). Deployments start in parallel with cluster bring-up, with no ordering dependency.

8. Monitoring a run

```bash
# Status
nrl-k8s status <recipe> --infra <infra>
kubectl get rayjob,raycluster -n default

# Follow the driver
nrl-k8s job list <recipe> --infra <infra> --role training
nrl-k8s job logs <run-id> <recipe> --infra <infra> --role training -f
```

When the `nrl-k8s job logs -f` subprocess dies (`kubectl port-forward` i/o timeout after ~15 min idle), just re-run it. The training job keeps going.

To fetch driver logs for a terminal job (SUCCEEDED/FAILED) or a RayJob via the dashboard API:

```bash
RC=$(kubectl get rayjob -n default <rayjob-name> -o jsonpath='{.status.rayClusterName}')
kubectl port-forward -n default svc/${RC}-head-svc 18266:8265 &
curl -s http://localhost:18266/api/jobs/                              # lists jobs, find submission_id
curl -s "http://localhost:18266/api/jobs/<submission_id>/logs"        # full driver log
```

`type=DRIVER` with `submission_id=null` means an exec-submitter run (no dashboard log endpoint; use `nrl-k8s job logs` instead). `type=SUBMISSION` has `submission_id` set and `/api/jobs/<id>/logs` works.

Wandb URL appears in the driver log on the first `wandb.init` call; extract it with `grep -oE 'https://wandb\.ai/[A-Za-z0-9_./-]+'`.
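The `type` / `submission_id` rule can be written down as a tiny decision function. This is a sketch only: `log_path_for` is a hypothetical name, and the two arguments are assumed to be the `type` and `submission_id` values read by hand from one entry of the `/api/jobs/` listing.

```bash
# Sketch: given the type and submission_id of one entry from /api/jobs/,
# print which log path applies according to the rule described above.
log_path_for() {
  job_type=$1
  submission_id=$2
  if [ "$job_type" = "SUBMISSION" ] && [ -n "$submission_id" ] \
     && [ "$submission_id" != "null" ]; then
    echo "/api/jobs/$submission_id/logs"
  else
    echo "nrl-k8s job logs"
  fi
}
```

The first branch prints the dashboard endpoint to `curl` through the port-forward; the second names the CLI command to fall back to for exec-submitter runs.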

9. Stopping things

| What to stop | Command |
|---|---|
| One training run | `nrl-k8s job stop <run-id> <recipe> --infra <infra> --role training` |
| All running Ray jobs on a cluster (+ submit new) | `nrl-k8s run <recipe> --infra <infra> --replace` |
| A long-lived RayCluster | `nrl-k8s cluster down <recipe> --infra <infra> --target kuberay.training --wait` |
| A RayJob (ephemeral) | `kubectl delete rayjob <name> -n default` (only if `shutdownAfterJobFinishes` didn't fire) |

Confirm before deleting shared infra. The cost of `cluster down` on someone else's cluster is high.

10. Verifying RayJob teardown

After a `run --rayjob` completes with `--shutdown` (default), KubeRay should delete the RayCluster:

```bash
kubectl get rayjob   -n default <rayjob-name>                        # jobDeploymentStatus = Complete
kubectl get raycluster -n default | grep <rayjob-name>               # no output = torn down
```

The RayJob object itself sticks around for `--ttl` seconds (default 3600s) so you can still fetch logs.
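Both checks can be combined into one function that succeeds only when teardown really happened. This is a minimal sketch: the name `rayjob_torn_down` is hypothetical, and it assumes, as the `grep` above does, that the RayCluster name contains the RayJob name.

```bash
# Hypothetical check: succeed only when the RayJob reports Complete and no
# RayCluster with a matching name is left in the namespace.
rayjob_torn_down() {
  ns=$1
  name=$2
  state=$(kubectl get rayjob "$name" -n "$ns" \
    -o jsonpath='{.status.jobDeploymentStatus}') || return 2
  [ "$state" = "Complete" ] || { echo "rayjob is $state" >&2; return 1; }

  if kubectl get raycluster -n "$ns" --no-headers 2>/dev/null | grep -q "$name"; then
    echo "raycluster for $name still present" >&2
    return 1
  fi
}
```

Return code 2 means the RayJob itself could not be read, for example because the `--ttl` window has already expired and the object is gone.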

11. Common gotchas

- OmegaConf interpolation eats `${VAR}` in recipe/infra YAML. Escape shell variables with `\${VAR}` so OmegaConf passes them through to the pod shell verbatim.
- Megatron optimizer configs don't carry `foreach` / `fused`. Overrides like `~policy.optimizer.kwargs.foreach ~policy.optimizer.kwargs.fused` (valid for DTensor configs) break on Megatron recipes. Omit them for Megatron.
- DTensor vs Megatron: MoE recipes typically use `megatron_cfg.enabled=true`; ensure `dtensor_cfg.enabled=false` in inherited defaults.
- Shared filesystem vs git divergence: `codeSource: image|lustre` reads from the pod filesystem. If your local edits aren't on the shared filesystem the pods mount, the run is testing the on-disk version, not yours. Either sync via a helper pod (head pod exec is often blocked) or override via Hydra flags.
- Ephemeral-storage + readinessProbe are injected by kuberay/CDI webhooks at pod-apply time. Do NOT add them to the inline RayCluster spec.
- Node taints vary per cluster. `tolerations: [{operator: Exists}]` on workers is defensive and worth keeping.
- Dashboard blank page: Ray 2.52 installs dashboard assets as symlinks by default; `nrl-k8s cluster dashboard <name>` auto-reinstalls `ray[default] --link-mode=copy` to fix it. Bake `ENV UV_LINK_MODE=copy` in the image to avoid this entirely.
- `kubectl exec` is usually blocked in automation. Route around it with `kubectl get ... -o yaml`, `kubectl logs`, and `kubectl port-forward` + Ray dashboard APIs.
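The first gotcha is easy to catch before submitting. The sketch below is a rough lint built on `grep`, wrapped in a hypothetical function named `find_unescaped_vars`. It cannot tell a forgotten escape from an intentional OmegaConf interpolation, so its hits are lines to review rather than errors.

```bash
# Sketch of a lint: print (with line numbers) every line that contains ${
# without a backslash directly in front of it. Intentional OmegaConf
# interpolations are reported too.
find_unescaped_vars() {
  grep -nE '(^|[^\\])\$\{' "$1"
}
```

Run it against the infra file that holds the entrypoint; like `grep`, it returns non-zero when there are no hits.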

12. Checklist before calling a run "done"

Before reporting a launch as successful, verify:

1. `kubectl get rayjob,raycluster -n default` shows the expected objects.
2. `nrl-k8s job list` (or `curl /api/jobs/`) shows the job in `RUNNING` / `SUCCEEDED`.
3. Driver log contains `wandb.ai/<project>/runs/<id>` (if wandb is enabled); share the URL with the user.
4. At least one `Processed prompts: 100%` line appears (confirms generation is wired).
5. For `--rayjob` mode only: after `jobDeploymentStatus=Complete`, confirm `kubectl get raycluster | grep <name>` is empty (teardown worked).
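Items 3 and 4 can be checked mechanically once the driver log has been saved to a file. This is a minimal sketch, assuming a hypothetical function named `check_driver_log`; the wandb pattern is the `grep -oE` expression quoted in the monitoring section, and the generation marker is the literal string from item 4.

```bash
# Sketch covering checklist items 3 and 4 against a saved driver log.
check_driver_log() {
  log=$1
  rc=0
  # Item 3: first wandb URL, if any (absent is fine when wandb is disabled).
  url=$(grep -oE 'https://wandb\.ai/[A-Za-z0-9_./-]+' "$log" | head -n 1) || true
  if [ -n "$url" ]; then
    echo "wandb: $url"
  else
    echo "wandb: no URL found (expected if wandb is disabled)"
  fi
  # Item 4: at least one completed generation progress line.
  if grep -q 'Processed prompts: 100%' "$log"; then
    echo "generation: ok"
  else
    echo "generation: no 'Processed prompts: 100%' line yet"
    rc=1
  fi
  return $rc
}
```

Only the generation check affects the return code, because a missing wandb URL is legitimate when wandb is turned off in the recipe.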

13. Dev pod

`nrl-k8s dev` manages a lightweight CPU pod on the cluster for code syncing, debugging, and running `kubectl` / `nrl-k8s` from within the cluster.

```bash
# One-time: set up secrets (HF token, wandb, SSH key, rclone)
nrl-k8s dev setup-secrets --ssh-key ~/.ssh/id_rsa --add-rclone

# Create pod and exec in (idempotent: reuses existing pod)
nrl-k8s dev connect

# Switch image (must stop first; image change is warned but not auto-applied)
nrl-k8s dev stop
nrl-k8s dev connect --image nvcr.io/nvidian/nemo-rl:v0.7.0

# Tear down
nrl-k8s dev stop
```

The dev pod:
- Runs on a CPU-only node (anti-affinity to GPU nodes)
- Mounts the shared `rl-workspace` PVC at `/mnt/rl-workspace`
- Sets `USER` env var to the `nrl-k8s` username (so `$USER` and `getpass.getuser()` work correctly despite running as root)
- Installs `kubectl`, `rclone` (if configured) on first boot
- Injects SSH keys and tokens via `envFrom` on a per-user K8s Secret

The pod's `default` service account needs an `edit` RoleBinding in the namespace for `kubectl` to work inside. `dev connect` checks this and prints the required YAML if missing.

14. Where things live in the repo

- CLI code: `infra/nrl_k8s/src/nrl_k8s/` (`cli.py`, `orchestrate.py`, `manifest.py`, `rayjob.py`, `k8s.py`, `submitters/`, `schema.py`).
- Tests: `infra/nrl_k8s/tests/unit/`; run with `uv run --extra test pytest -x -q` from `infra/nrl_k8s/`.
- Recipe + infra examples: `infra/nrl_k8s/examples/`.
- Base recipes this tool wraps: `examples/configs/recipes/llm/…` and `examples/nemo_gym/…`.