launch-nemo-rl — running NeMo-RL recipes on Kubernetes via nrl-k8s

This is the playbook for the `nrl-k8s` CLI at `infra/nrl_k8s/`. Follow it when the user asks to launch, iterate on, or debug a NeMo-RL recipe on a Kubernetes cluster. Verify current state (`kubectl`, `git log`, the recipe and infra files) before acting: the cluster is shared and the cost of a wrong action is high.


1. One command, two modes

There is a single top-level submission command: `nrl-k8s run`. It has two lifecycle modes.

| Mode | Invocation | When to use | Cluster after? |
| --- | --- | --- | --- |
| Ephemeral (default) | `nrl-k8s run` | One-shot. KubeRay applies a RayJob, runs it, and tears the cluster down. Best for most runs. | No (auto) |
| Long-lived | `nrl-k8s run --raycluster` | Dev loop. Reuses a matching live cluster, applies one if absent, and warns + reuses on drift (pass `--recreate` to replace). Then submits daemons and training. First choice for iteration. | Yes |

Ask: do I need this cluster after the run? If yes, use `--raycluster`. Otherwise use the default (ephemeral).

The rest of the CLI is observability / stage-by-stage control:

| Command | Purpose |
| --- | --- |
| `nrl-k8s check` | Validate a recipe + infra pair; optionally write the fully-resolved manifests (`-o`). |
| `nrl-k8s status` | Per-role RayCluster state, head pod phase, worker pod phases, daemon job status. |
| `nrl-k8s cluster up/down/list/dashboard` | Manage RayClusters independently of a run (e.g. render a manifest with `--dry-run`). |
| `nrl-k8s job list/logs/stop` | Observability over Ray Jobs already submitted to a role's cluster. |
| `nrl-k8s logs` | Tail a role's pod / daemon logs without needing a submission id. |


2. Recipe + infra pair

Every launch takes two files. Pass the infra with `--infra`, not merged inline:

```bash
nrl-k8s run infra/nrl_k8s/examples/<recipe>.yaml \
  --infra infra/nrl_k8s/examples/<recipe>.<profile>.infra.yaml
```

- Recipe (e.g. `qwen3_30b_math_8n_4gpu.yaml`): the NeMo-RL config, covering model, GRPO/SFT knobs, and `cluster.{gpus_per_node,num_nodes}`. Uses `defaults:` to inherit from `examples/configs/recipes/llm/...`.
- Infra (e.g. `*.<profile>.infra.yaml`): the K8s/Ray shape, covering namespace, image, service account, RayCluster spec under `kuberay:`, optional Deployments under `deployments:`, `submit.submitter`, and `launch.{mode,codeSource,codePath,entrypoint}`. Pair names follow `<recipe>.<profile>[.prod].infra.yaml`, where `<profile>` names the hardware target (e.g. `gb300`).

Example pairs live in `infra/nrl_k8s/examples/`; read the neighbouring files to see the current conventions for the target profile.
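The naming convention above can be sketched as a small shell helper. This is only an illustration of the `<recipe>.<profile>[.prod].infra.yaml` pattern: the function name `infra_for` is hypothetical and not part of `nrl-k8s`, and the resulting file name is derived purely from the convention, so confirm the file actually exists in `infra/nrl_k8s/examples/` before passing it to `--infra`.

```bash
# infra_for RECIPE PROFILE [VARIANT]
# Derive the infra file name that pairs with a recipe, following the
# <recipe>.<profile>[.prod].infra.yaml convention.
# Hypothetical helper for illustration; not part of nrl-k8s.
infra_for() {
  recipe=$1
  profile=$2
  variant=${3:-}
  base=${recipe%.yaml}   # strip the .yaml suffix from the recipe name
  if [ -n "$variant" ]; then
    printf '%s.%s.%s.infra.yaml\n' "$base" "$profile" "$variant"
  else
    printf '%s.%s.infra.yaml\n' "$base" "$profile"
  fi
}

# Names below combine the recipe and profile examples from this section;
# the combined file may or may not exist in the repo.
infra_for qwen3_30b_math_8n_4gpu.yaml gb300
infra_for qwen3_30b_math_8n_4gpu.yaml gb300 prod
```

Keeping the derivation in one place makes it easier to spot a mismatched pair (for example, a recipe paired with an infra file written for a different profile) before anything is applied to the cluster.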


3. Long-lived mode flags

Long-lived mode has three independent dimensions: submitter, code source, and wait behaviour. `--mode` is a macro that picks defaults; individual flags override it.

```
--mode interactive   → --submitter portForward  --code-source upload  (tails logs)
--mode batch         → --submitter exec         --code-source image   (returns after nohup)
```
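The precedence rule (the mode supplies defaults, explicit flags win) can be sketched as follows. This is an illustration of the rule only, not the CLI's implementation; the function name `resolve_launch` and its positional arguments are assumptions made for the example.

```bash
# resolve_launch MODE [SUBMITTER] [CODE_SOURCE]
# Print the effective submitter and code source: the mode supplies the
# defaults, and any explicitly passed (non-empty) value overrides them.
# Illustrative sketch only; not how nrl-k8s is implemented.
resolve_launch() {
  case "$1" in
    interactive) submitter=portForward; code_source=upload ;;
    batch)       submitter=exec;        code_source=image  ;;
    *) echo "unknown mode: $1" >&2; return 1 ;;
  esac
  # Explicit values win over the macro defaults.
  submitter=${2:-$submitter}
  code_source=${3:-$code_source}
  printf 'submitter=%s code-source=%s\n' "$submitter" "$code_source"
}

resolve_launch interactive        # defaults for interactive mode
resolve_launch batch "" lustre    # batch defaults, code source overridden
```

The second call corresponds to something like `--mode batch --code-source lustre`: the submitter stays at the batch default while the code source is replaced.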

- Submitter: `portForward` uses `kubectl port-forward` + Ray Job SDK (gets a `submission_id` the dashboard tracks). `exec` uses `kubectl exec` + `nohup` on the head pod (no `submission_id`; driver appears as `type=DRIVER` in the dashboard).
- Code source: `upload` stages a working_dir from the laptop (Ray 100 MiB cap). `image` / `lustre` expect code on the pod's filesystem, paired with `--code-path` (typically `/opt/nemo-rl`), which is a subPath of the shared-filesystem PVC mount in the standard infra examples.
- Wait: `--wait` tails logs until the job reaches a terminal state; `--no-wait` returns as soon as the driver is running.

Other long-lived-only flags:

- `--replace`: stop any running training / daemon job before submitting new ones (suffixes daemon submissionIds with a timestamp so Ray accepts the resubmit).
- `--recreate`: delete + re-apply a RayCluster whose live spec has drifted from the rendered manifest (default is warn + reuse).
- `--skip-daemons`: bring up all declared clusters but only submit training. Use on disagg recipes where gym/generation are already healthy.

Gotcha: on infra where the entrypoint does `cd /opt/nemo-rl` (or another in-image / Lustre path) and loads the recipe from there, `--code-source upload` does NOT override the recipe on the pod. The uploaded working_dir sits in `/tmp/ray/...`, but the entrypoint `cd`s away from it. To actually test a local recipe change, either sync your edits to the shared filesystem mounted into the pods or flip the Hydra overrides in the entrypoint.


4. Ephemeral mode flags (`--rayjob`)

When `--rayjob` is set, `run` branches into the RayJob code path. Relevant flags:

- `--rayjob-name NAME`: RayJob metadata name (defaults to the training cluster name).
- `--shutdown / --no-shutdown`: default `true`. KubeRay deletes the RayCluster once the Ray Job reaches a terminal state.
- `--ttl SECONDS`: default 3600s. Keeps the RayJob object around after the run finishes for post-mortem log access.
- `--wait / --no-wait`: default `wait`. Polls `jobDeploymentStatus` until Complete/Failed. `--no-wait` returns as soon as the RayJob is applied.
- `--timeout SECONDS`: default 86400s (24h). Bounds the `--wait` poll.
- `--dry-run`: render the RayJob manifest and print it; do not apply.

`--replace`, `--recreate`, and `--skip-daemons` are silently ignored in `--rayjob` mode (KubeRay owns lifecycle).


5. Iterating on a config without touching the shared filesystem

When the recipe on the pod filesystem has the wrong value for your experiment, use Hydra overrides on the entrypoint instead of forking the recipe. Pattern:

```yaml
entrypoint: |
  set -eu
  cd /opt/nemo-rl
  RUN_ID="\${RAY_JOB_SUBMISSION_ID:-\${NRL_K8S_RUN_ID:-$(date -u +%Y%m%d-%H%M%S)}}"
  python -u examples/run_grpo.py \
    --config infra/nrl_k8s/examples/<recipe>.yaml \
    logger.wandb_enabled=true \
    logger.wandb.project=<project> \
    "logger.wandb.name=<run-name>-\${RUN_ID}"
```

- Escape `${…}` with a backslash. OmegaConf otherwise interprets it as interpolation and errors on shell-style `${VAR:-default}`.
- `RUN_ID` resolves to `RAY_JOB_SUBMISSION_ID` (injected by KubeRay in rayjob mode) → `NRL_K8S_RUN_ID` (injected by the CLI in long-lived mode) → a UTC timestamp generated when the entrypoint runs, so the name is unique across either path.
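The fallback chain can be tried in a plain shell, outside any recipe. The sketch below repeats the same nested `${VAR:-default}` expansion as the entrypoint; no backslashes are needed here because OmegaConf is not involved. The function name `resolve_run_id` and the sample values `raysubmit_demo` and `dev-1` are made up for illustration.

```bash
# resolve_run_id prints the run id using the same precedence as the
# entrypoint: RAY_JOB_SUBMISSION_ID, then NRL_K8S_RUN_ID, then a UTC timestamp.
# Hypothetical helper for illustration.
resolve_run_id() {
  printf '%s\n' "${RAY_JOB_SUBMISSION_ID:-${NRL_K8S_RUN_ID:-$(date -u +%Y%m%d-%H%M%S)}}"
}

# Each case runs in a subshell so the variables do not leak between cases.
( RAY_JOB_SUBMISSION_ID=raysubmit_demo; NRL_K8S_RUN_ID=dev-1; resolve_run_id )  # submission id wins
( unset RAY_JOB_SUBMISSION_ID; NRL_K8S_RUN_ID=dev-1; resolve_run_id )           # falls back to the CLI run id
( unset RAY_JOB_SUBMISSION_ID NRL_K8S_RUN_ID; resolve_run_id )                  # falls back to the timestamp
```

Because `:-` treats an empty variable the same as an unset one, an environment that exports `RAY_JOB_SUBMISSION_ID=""` still falls through to the next candidate.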


6. Per-profile concerns (hardware + scheduler + DRA)

Every infra YAML encodes a hardware/scheduler profile. The concrete examples in `infra/nrl_k8s/examples/` are authoritative for the profiles they target; read the neighbouring infra file before writing a new one. Things that commonly vary:

- Per-node GPUs (e.g. 4 vs 8): must match `cluster.gpus_per_node` in the recipe, otherwise workers stay `Pending`.
- Node selectors: head pods usually land on a CPU-only node pool; GPU workers match on `nvidia.com/gpu.product` or a node-group label.
- Scheduler: KAI (`schedulerName: kai-scheduler` + `kai.scheduler/queue` label) with topology annotations (`kai.scheduler/topology`, `kai.scheduler/topology-required-placement`) gang-schedules workers into one clique. Without it, pods may land on different racks and NVLink/RoCE won't span them.
- DRA claims: ComputeDomain + RoCE are attached via `resourceClaims` referencing `ResourceClaimTemplate`s. The CLI auto-creates/deletes these when the worker pod spec contains DRA claim references, so no manual setup is needed.
- Secrets: always via `secretKeyRef` (`wandb-api-key`, image pull secret). Never embed.
- Shared filesystem mounts: typically a Lustre PVC mounted twice, once at the code path (e.g. `/opt/nemo-rl` with a user-scoped `subPath`) and once at a workspace root (e.g. `/mnt/rl-workspace`) for datasets, HF cache, and checkpoints.

Before applying an infra, verify prereqs exist in the target namespace:

```bash
kubectl get pvc <workspace-pvc>
kubectl get secret <wandb-secret> <image-pull-secret>
kubectl get sa <service-account>
```


7. End-to-end workflows

7a. Fresh one-shot run (rayjob)

```bash
# From the NeMo-RL repo root:
nrl-k8s check <recipe> --infra <infra>                     # validate first
nrl-k8s run <recipe> --infra <infra> --rayjob --dry-run    # render RayJob manifest
nrl-k8s run <recipe> --infra <infra> --rayjob --no-wait    # apply, returns fast
```

Watch status + teardown (works even after your laptop disconnects because KubeRay owns the lifecycle):

```bash
kubectl get rayjob -n default <name> -w
kubectl get raycluster -n default    # empty = teardown succeeded
```

7b. Dev loop (long-lived)

```bash
nrl-k8s run <recipe> --infra <infra> --run-id $(date +%Y%m%d-%H%M%S)

# Edits in the recipe? Just re-run; it reuses the live cluster.
# Pod spec changed? Add --recreate to delete + re-apply.
# Disagg recipe with gym/gen already healthy? --skip-daemons.
```

7c. First-time disaggregated bring-up

```bash
nrl-k8s run <recipe> --infra <disagg-infra> --mode batch --code-source image
```

7d. Cluster-only lifecycle

```bash
nrl-k8s cluster up   <recipe> --infra <infra> --target kuberay.training --wait
nrl-k8s cluster up   <recipe> --infra <infra> --target kuberay.training --dry-run   # render manifest
nrl-k8s cluster down <recipe> --infra <infra> --target kuberay.training --wait
nrl-k8s cluster down <recipe> --infra <infra>                                       # tear down all
nrl-k8s cluster list -n default
nrl-k8s cluster dashboard <cluster-name>                                            # port-forward + browser
```

7e. Deployments (e.g. nemo-skills sandbox)

```bash
# Bring up just the deployment
nrl-k8s cluster up <recipe> --infra <infra> --target deployments.nemo_skills

# Tear down just the deployment
nrl-k8s cluster down <recipe> --infra <infra> --target deployments.nemo_skills

# Tear down everything (RayClusters + Deployments)
nrl-k8s cluster down <recipe> --infra <infra>
```

The `deployments:` section in infra YAML declares Kubernetes Deployments managed alongside RayClusters. The CLI patches image, imagePullSecrets, and serviceAccountName from the top-level infra keys (same as RayClusters). Deployments start in parallel with cluster bring-up, with no ordering dependency.

8. Monitoring a run

```bash
# Status
nrl-k8s status <recipe> --infra <infra>
kubectl get rayjob,raycluster -n default

# Follow the driver
nrl-k8s job list <recipe> --infra <infra> --role training
nrl-k8s job logs <run-id> <recipe> --infra <infra> --role training -f
```

When the `nrl-k8s job logs -f` subprocess dies (`kubectl port-forward` i/o timeout after ~15 min idle), just re-run it. The training job keeps going.

To fetch driver logs for a terminal job (SUCCEEDED/FAILED) or a RayJob via the dashboard API:

```bash
RC=$(kubectl get rayjob -n default <rayjob-name> -o jsonpath='{.status.rayClusterName}')
kubectl port-forward -n default svc/${RC}-head-svc 18266:8265 &
curl -s http://localhost:18266/api/jobs/                          # lists jobs, find submission_id
curl -s "http://localhost:18266/api/jobs/<submission_id>/logs"    # full driver log
```

- `type=DRIVER` with `submission_id=null` means an exec-submitter run (no dashboard log endpoint; use `nrl-k8s job logs` instead).
- `type=SUBMISSION` has `submission_id` set and `/api/jobs/<id>/logs` works.
- The Wandb URL appears in the driver log on the first `wandb.init` call; extract it with `grep -oE 'https://wandb\.ai/[A-Za-z0-9_./-]+'`.
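As a minimal sketch, the wandb `grep` pattern from this section can be wrapped in a helper that takes a saved driver log and prints the first URL it finds. The function name `extract_wandb_url` and the sample log line are hypothetical; real driver logs will look different, and how you capture the log to a file is up to you.

```bash
# extract_wandb_url FILE
# Print the first wandb.ai URL found in a saved driver log, using the
# grep pattern from this section. Returns non-zero if no URL is present.
# Hypothetical helper for illustration.
extract_wandb_url() {
  url=$(grep -oE -m 1 'https://wandb\.ai/[A-Za-z0-9_./-]+' "$1") || return 1
  printf '%s\n' "$url" | head -n 1
}

# Demo with a made-up log line standing in for a real driver log.
log=$(mktemp)
printf '%s\n' 'wandb: View run at https://wandb.ai/my-team/my-project/runs/abc123' > "$log"
extract_wandb_url "$log"
rm -f "$log"
```

Note that `.` is part of the matched character class, so a URL followed directly by a sentence-ending period in the log would include that period in the match.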

9. Stopping things

| What to stop | Command |
| --- | --- |
| One training run | `nrl-k8s job stop <run-id> <recipe> --infra <infra> --role training` |
| All running Ray jobs on a cluster (+ submit new) | `nrl-k8s run <recipe> --infra <infra> --replace` |
| A long-lived RayCluster | `nrl-k8s cluster down <recipe> --infra <infra> --target kuberay.training --wait` |
| A RayJob (ephemeral) | `kubectl delete rayjob <name> -n default` (only if `shutdownAfterJobFinishes` didn't fire) |

Confirm before deleting shared infra. The cost of `cluster down` on someone else's cluster is high.


10. Verifying RayJob teardown

After a `run --rayjob` completes with `--shutdown` (default), KubeRay should delete the RayCluster:

```bash
kubectl get rayjob   -n default <rayjob-name>             # jobDeploymentStatus = Complete
kubectl get raycluster -n default | grep <rayjob-name>    # no output = torn down
```

The RayJob object itself sticks around for `--ttl` seconds (default 3600s) so you can still fetch logs.
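The "no output = torn down" check can be turned into an explicit pass/fail test. The sketch below reads a RayCluster listing on stdin so it can be tried without a cluster; the function name `cluster_gone` and the sample listings are assumptions for illustration, and real `kubectl` output has more columns.

```bash
# cluster_gone NAME
# Read `kubectl get raycluster` output on stdin and succeed only when no
# line mentions NAME, i.e. the RayCluster no longer appears in the listing.
# Hypothetical helper for illustration.
cluster_gone() {
  ! grep -qF -- "$1"
}

# Made-up listings stand in for real kubectl output.
printf 'NAME            STATUS\nother-cluster   ready\n' | cluster_gone my-rayjob && echo "torn down"
printf 'NAME              STATUS\nmy-rayjob-abcde   ready\n' | cluster_gone my-rayjob || echo "still present"
```

Against a live cluster this would be used as `kubectl get raycluster -n default | cluster_gone <rayjob-name>`. An empty listing also occurs when `kubectl` itself fails (for example, with expired credentials), so run the listing on its own first if the result looks suspicious.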


11. Common gotchas

- OmegaConf interpolation eats `${VAR}` in recipe/infra YAML. Escape shell variables with `\${VAR}` so OmegaConf passes them through to the pod shell verbatim.
- Megatron optimizer configs don't carry `foreach` / `fused`. Overrides like `~policy.optimizer.kwargs.foreach ~policy.optimizer.kwargs.fused` (valid for DTensor configs) break on Megatron recipes. Omit them for Megatron.
- DTensor vs Megatron: MoE recipes typically use `megatron_cfg.enabled=true`; ensure `dtensor_cfg.enabled=false` in inherited defaults.
- Shared filesystem vs git divergence: `codeSource: image|lustre` reads from the pod filesystem. If your local edits aren't on the shared filesystem the pods mount, the run is testing the on-disk version, not yours. Either sync via a helper pod (head pod exec is often blocked) or override via Hydra flags.
- Ephemeral-storage + readinessProbe are injected by kuberay/CDI webhooks at pod-apply time. Do NOT add them to the inline RayCluster spec.
- Node taints vary per cluster. `tolerations: [{operator: Exists}]` on workers is defensive and worth keeping.
- Dashboard blank page: Ray 2.52 installs dashboard assets as symlinks by default; `nrl-k8s cluster dashboard <name>` auto-reinstalls `ray[default] --link-mode=copy` to fix it. Bake `ENV UV_LINK_MODE=copy` in the image to avoid this entirely.
- `kubectl exec` is usually blocked in automation. Route around it with `kubectl get ... -o yaml`, `kubectl logs`, and `kubectl port-forward` + Ray dashboard APIs.


12. Checklist before calling a run "done"

Before reporting a launch as successful, verify:

- `kubectl get rayjob,raycluster -n default` shows the expected objects.
- `nrl-k8s job list` (or `curl /api/jobs/`) shows the job in `RUNNING` or `SUCCEEDED`.
- Driver log contains `wandb.ai/<project>/runs/<id>` (if wandb is enabled). Share the URL with the user.
- At least one `Processed prompts: 100%` line appears (confirms generation is wired).
- For `--rayjob` mode only: after `jobDeploymentStatus=Complete`, confirm `kubectl get raycluster | grep <name>` is empty (teardown worked).
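The two log-based items in this checklist can be checked mechanically against a saved driver log. The following is a rough sketch under stated assumptions: `check_driver_log` is a hypothetical helper, the sample log lines are invented, and the wandb check only applies when wandb is enabled for the run.

```bash
# check_driver_log FILE
# Report whether a saved driver log contains the two markers from the
# checklist: a wandb run URL and a completed "Processed prompts" line.
# Returns non-zero if either marker is missing. Hypothetical helper.
check_driver_log() {
  status=0
  if grep -qE 'wandb\.ai/.+/runs/[A-Za-z0-9_-]+' "$1"; then
    echo "ok: wandb run URL found"
  else
    echo "missing: wandb run URL"
    status=1
  fi
  if grep -qF 'Processed prompts: 100%' "$1"; then
    echo "ok: generation progress line found"
  else
    echo "missing: 'Processed prompts: 100%' line"
    status=1
  fi
  return "$status"
}

# Demo with made-up log lines standing in for a real driver log.
log=$(mktemp)
printf '%s\n' \
  'wandb: View run at https://wandb.ai/my-team/my-project/runs/abc123' \
  'Processed prompts: 100%|##########| 8/8' > "$log"
check_driver_log "$log"
rm -f "$log"
```

A passing result only shows that the markers appear in the log; the cluster-side checks in the list (expected objects, job state, teardown) still need to be confirmed with `kubectl` and `nrl-k8s job list`.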


13. Dev pod

`nrl-k8s dev` manages a lightweight CPU pod on the cluster for code syncing, debugging, and running `kubectl` / `nrl-k8s` from within the cluster.

```bash
# One-time: set up secrets (HF token, wandb, SSH key, rclone)
nrl-k8s dev setup-secrets --ssh-key ~/.ssh/id_rsa --add-rclone

# Create pod and exec in (idempotent; reuses existing pod)
nrl-k8s dev connect

# Switch image (must stop first; image change is warned but not auto-applied)
nrl-k8s dev stop
nrl-k8s dev connect --image nvcr.io/nvidian/nemo-rl:v0.7.0

# Tear down
nrl-k8s dev stop
```

The dev pod:
- Runs on a CPU-only node (anti-affinity to GPU nodes)
- Mounts the shared `rl-workspace` PVC at `/mnt/rl-workspace`
- Sets `USER` env var to the `nrl-k8s` username (so `$USER` and `getpass.getuser()` work correctly despite running as root)
- Installs `kubectl`, `rclone` (if configured) on first boot
- Injects SSH keys and tokens via `envFrom` on a per-user K8s Secret

The pod's `default` service account needs an `edit` RoleBinding in the namespace for `kubectl` to work inside. `dev connect` checks this and prints the required YAML if missing.

14. Where things live in the repo

- CLI code: `infra/nrl_k8s/src/nrl_k8s/` (`cli.py`, `orchestrate.py`, `manifest.py`, `rayjob.py`, `k8s.py`, `submitters/`, `schema.py`).
- Tests: `infra/nrl_k8s/tests/unit/`; run with `uv run --extra test pytest -x -q` from `infra/nrl_k8s/`.
- Recipe + infra examples: `infra/nrl_k8s/examples/`.
- Base recipes this tool wraps: `examples/configs/recipes/llm/…` and `examples/nemo_gym/…`.
