launch-nemo-rl — running NeMo-RL recipes on Kubernetes via nrl-k8s

This is the playbook for the `nrl-k8s` CLI at `infra/nrl_k8s/`. Follow it when the user asks to launch, iterate on, or debug a NeMo-RL recipe on a Kubernetes cluster. Verify current state (`kubectl`, `git log`, the recipe + infra files) before acting: the cluster is shared and the cost of a wrong action is high.

1. One command, two modes

There is a single top-level submission command: `nrl-k8s run`. It has two lifecycle modes.

| Mode | Invocation | When to use | Cluster after? |
|---|---|---|---|
| Ephemeral (default) | `nrl-k8s run` | One-shot. KubeRay applies a RayJob, runs, tears the cluster down. Best for most runs. | No (auto) |
| Long-lived | `nrl-k8s run --raycluster` | Dev loop. Reuses a matching live cluster, applies if absent, warns + reuses on drift (pass `--recreate` to replace). Then submits daemons and training. First choice for iteration. | Yes |

Ask: Do I need this cluster after the run? If yes, use `--raycluster`. Otherwise use the default (ephemeral).

The rest of the CLI is observability / stage-by-stage control:

| Command | Purpose |
|---|---|
| `nrl-k8s check` | Validate a recipe + infra pair; optionally write the fully-resolved manifests (`-o`). |
| `nrl-k8s status` | Per-role RayCluster state, head pod phase, worker pod phases, daemon job status. |
| `nrl-k8s cluster up/down/list/dashboard` | Manage RayClusters independently of a run (e.g. render a manifest with `--dry-run`). |
| `nrl-k8s job list/logs/stop` | Observability over Ray Jobs already submitted to a role's cluster. |
| `nrl-k8s logs` | Tail a role's pod / daemon logs without needing a submission id. |

2. Recipe + infra pair

Every launch takes two files. Pass the infra with `--infra`, not merged inline:

nrl-k8s run infra/nrl_k8s/examples/<recipe>.yaml \
  --infra infra/nrl_k8s/examples/<recipe>.<profile>.infra.yaml

- Recipe (e.g. `qwen3_30b_math_8n_4gpu.yaml`): the NeMo-RL config (model, GRPO/SFT knobs, `cluster.{gpus_per_node,num_nodes}`). Uses `defaults:` to inherit from `examples/configs/recipes/llm/...`
- Infra (e.g. `*.<profile>.infra.yaml`): the K8s/Ray shape (namespace, image, service account, RayCluster spec under `kuberay:`, optional Deployments under `deployments:`, `submit.submitter`, `launch.{mode,codeSource,codePath,entrypoint}`). Pair names follow `<recipe>.<profile>[.prod].infra.yaml`, where `<profile>` names the hardware target (e.g. `gb300`).

Example pairs live in `infra/nrl_k8s/examples/`; read the neighbouring files to see the current conventions for the target profile.

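The pair-naming convention can be sketched as a small shell helper. This is only an illustration: `infra_for` is a hypothetical function (it is not part of `nrl-k8s`), and it assumes the recipe file name ends in `.yaml`.

```bash
# Hypothetical helper, not part of nrl-k8s: derive the infra file name that
# pairs with a recipe, following <recipe>.<profile>[.prod].infra.yaml.
infra_for() {
  recipe_file=$1            # e.g. qwen3_30b_math_8n_4gpu.yaml
  profile=$2                # hardware target, e.g. gb300
  variant=${3:-}            # pass "prod" for the optional .prod variant
  stem=$(basename "$recipe_file" .yaml)
  if [ -n "$variant" ]; then
    printf '%s.%s.%s.infra.yaml\n' "$stem" "$profile" "$variant"
  else
    printf '%s.%s.infra.yaml\n' "$stem" "$profile"
  fi
}

infra_for qwen3_30b_math_8n_4gpu.yaml gb300
infra_for qwen3_30b_math_8n_4gpu.yaml gb300 prod
```

The helper only builds a name; whether a matching file actually exists in `infra/nrl_k8s/examples/` still has to be checked by reading the directory.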

3. Long-lived mode flags

Three independent dimensions. `--mode` is a macro that picks defaults; individual flags override it.

--mode interactive   → --submitter portForward  --code-source upload  (tails logs)
--mode batch         → --submitter exec         --code-source image   (returns after nohup)

- Submitter: `portForward` uses `kubectl port-forward` + Ray Job SDK (gets a `submission_id` the dashboard tracks). `exec` uses `kubectl exec` + `nohup` on the head pod (no `submission_id`; driver appears as `type=DRIVER` in the dashboard).
- Code source: `upload` stages a working_dir from the laptop (Ray 100 MiB cap). `image` / `lustre` expect code on the pod's filesystem, paired with `--code-path` (typically `/opt/nemo-rl`), which is a subPath of the shared-filesystem PVC mount in the standard infra examples.
- Wait: `--wait` tails logs until terminal; `--no-wait` returns as soon as the driver is running.

Other long-lived-only flags:

- `--replace`: stop any running training / daemon job before submitting new ones (suffixes daemon submissionIds with a timestamp so Ray accepts the resubmit).
- `--recreate`: delete + re-apply a RayCluster whose live spec has drifted from the rendered manifest (default is warn + reuse).
- `--skip-daemons`: bring up all declared clusters but only submit training. Use on disagg recipes where gym/generation are already healthy.

Gotcha: on infra where the entrypoint does `cd /opt/nemo-rl` (or another in-image / Lustre path) and loads the recipe from there, `--code-source upload` does NOT override the recipe on the pod. The uploaded working_dir sits in `/tmp/ray/...` but the entrypoint `cd`s away from it. To actually test a local recipe change, either sync your edits to the shared filesystem mounted into the pods or flip the Hydra overrides in the entrypoint.

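The `--mode` macro can be read as a simple lookup from mode to default flags. The sketch below restates that mapping in shell; `mode_defaults` is a hypothetical name used for illustration and is not how the CLI itself is implemented.

```bash
# Illustration only (not nrl-k8s source): the defaults each --mode value
# expands to, as listed in this section.
mode_defaults() {
  case "$1" in
    interactive) echo "--submitter portForward --code-source upload" ;;
    batch)       echo "--submitter exec --code-source image" ;;
    *)           echo "unknown mode: $1" >&2; return 1 ;;
  esac
}

mode_defaults interactive
mode_defaults batch
```

Because individual flags override the macro, a command line such as `--mode batch --code-source image` simply restates one of the defaults explicitly.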

4. Ephemeral mode flags (`--rayjob`)

When `--rayjob` is set, `run` branches into the RayJob code path. Relevant flags:

- `--rayjob-name NAME`: RayJob metadata name (defaults to the training cluster name).
- `--shutdown / --no-shutdown`: default `true`; KubeRay deletes the RayCluster once the Ray Job reaches a terminal state.
- `--ttl SECONDS`: default 3600s; keep the RayJob object around after the run finishes for post-mortem log access.
- `--wait / --no-wait`: default `wait`; poll `jobDeploymentStatus` until Complete/Failed. `--no-wait` returns as soon as the RayJob is applied.
- `--timeout SECONDS`: default 86400s (24h); bound the `--wait` poll.
- `--dry-run`: render the RayJob manifest and print it; do not apply.

`--replace`, `--recreate`, and `--skip-daemons` are silently ignored in `--rayjob` mode (KubeRay owns lifecycle).

5. Iterating on a config without touching the shared filesystem

When the recipe on the pod filesystem has the wrong value for your experiment, use Hydra overrides on the entrypoint instead of forking the recipe. Pattern:

```yaml
entrypoint: |
  set -eu
  cd /opt/nemo-rl
  RUN_ID="\${RAY_JOB_SUBMISSION_ID:-\${NRL_K8S_RUN_ID:-$(date -u +%Y%m%d-%H%M%S)}}"
  python -u examples/run_grpo.py \
    --config infra/nrl_k8s/examples/<recipe>.yaml \
    logger.wandb_enabled=true \
    logger.wandb.project=<project> \
    "logger.wandb.name=<run-name>-\${RUN_ID}"
```

Escape `${…}` with a backslash. OmegaConf otherwise interprets it as interpolation and errors on shell-style `${VAR:-default}`.

`RUN_ID` resolves to `RAY_JOB_SUBMISSION_ID` (injected by KubeRay in rayjob mode) → `NRL_K8S_RUN_ID` (injected by the CLI in long-lived mode) → local timestamp, so the name is unique across either path.

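The `RUN_ID` fallback chain can be tried in a plain shell. The sketch below uses the same parameter expansion as the entrypoint, minus the backslashes (those are needed only inside the infra YAML); `resolve_run_id` and the sample values are hypothetical and exist only for demonstration.

```bash
# Same fallback chain as the entrypoint: submission id, then CLI run id,
# then a UTC timestamp generated on the spot.
resolve_run_id() {
  printf '%s\n' "${RAY_JOB_SUBMISSION_ID:-${NRL_K8S_RUN_ID:-$(date -u +%Y%m%d-%H%M%S)}}"
}

# The values below are made-up placeholders.
( RAY_JOB_SUBMISSION_ID=raysubmit_demo; NRL_K8S_RUN_ID=20250101-000000; resolve_run_id )
( unset RAY_JOB_SUBMISSION_ID; NRL_K8S_RUN_ID=20250101-000000; resolve_run_id )
( unset RAY_JOB_SUBMISSION_ID NRL_K8S_RUN_ID; resolve_run_id )   # timestamp fallback
```

Note that `:-` also falls through when a variable is set but empty, so an empty `RAY_JOB_SUBMISSION_ID` behaves the same as an unset one.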

6. Per-profile concerns (hardware + scheduler + DRA)

Every infra YAML encodes a hardware/scheduler profile. The concrete examples in `infra/nrl_k8s/examples/` are authoritative for the profiles they target; read the neighbouring infra file before writing a new one. Things that commonly vary:

- Per-node GPUs (e.g. 4 vs 8): must match `cluster.gpus_per_node` in the recipe, otherwise workers stay `Pending`.
- Node selectors: head pods usually land on a CPU-only node pool; GPU workers match on `nvidia.com/gpu.product` or a node-group label.
- Scheduler: KAI (`schedulerName: kai-scheduler` + `kai.scheduler/queue` label) with topology annotations (`kai.scheduler/topology`, `kai.scheduler/topology-required-placement`) gang-schedules workers into one clique. Without it, pods may land on different racks and NVLink/RoCE won't span them.
- DRA claims: ComputeDomain + RoCE are attached via `resourceClaims` referencing `ResourceClaimTemplate`s. The CLI auto-creates/deletes these when the worker pod spec contains DRA claim references, so no manual setup is needed.
- Secrets: always via `secretKeyRef` (`wandb-api-key`, image pull secret). Never embed.
- Shared filesystem mounts: typically a Lustre PVC mounted twice, once at the code path (e.g. `/opt/nemo-rl` with a user-scoped `subPath`) and once at a workspace root (e.g. `/mnt/rl-workspace`) for datasets, HF cache, and checkpoints.

Before applying an infra, verify prereqs exist in the target namespace:

```bash
kubectl get pvc <workspace-pvc>
kubectl get secret <wandb-secret> <image-pull-secret>
kubectl get sa <service-account>
```

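The prerequisite checks can be wrapped so that every missing object is reported in one pass. This is a sketch under assumptions: `check_prereqs` and the `KUBECTL` override variable are hypothetical names introduced here, and the resource names are whatever the infra file declares.

```bash
# Sketch only: report which prerequisites are missing in a namespace.
# KUBECTL may point at a stub command for dry testing; it defaults to kubectl.
check_prereqs() {
  ns=$1; pvc=$2; sa=$3; shift 3      # remaining args: secret names
  kubectl_cmd=${KUBECTL:-kubectl}
  missing=0
  "$kubectl_cmd" get pvc "$pvc" -n "$ns" >/dev/null 2>&1 || { echo "missing pvc/$pvc"; missing=1; }
  "$kubectl_cmd" get sa "$sa" -n "$ns" >/dev/null 2>&1 || { echo "missing sa/$sa"; missing=1; }
  for s in "$@"; do
    "$kubectl_cmd" get secret "$s" -n "$ns" >/dev/null 2>&1 || { echo "missing secret/$s"; missing=1; }
  done
  return "$missing"
}

# Example call (placeholders as in the commands above):
#   check_prereqs default <workspace-pvc> <service-account> <wandb-secret> <image-pull-secret>
```

The function returns non-zero if anything is missing, so it can gate a launch in a script. It only reads cluster state and never creates or deletes objects.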

7. End-to-end workflows

7a. Fresh one-shot run (rayjob)

```bash
# From the NeMo-RL repo root:
nrl-k8s check <recipe> --infra <infra>                       # validate first
nrl-k8s run   <recipe> --infra <infra> --rayjob --dry-run    # render RayJob manifest
nrl-k8s run   <recipe> --infra <infra> --rayjob --no-wait    # apply, returns fast
```

Watch status + teardown (works even after your laptop disconnects because KubeRay owns the lifecycle):
```bash
kubectl get rayjob -n default <name> -w
kubectl get raycluster -n default                            # empty = teardown succeeded
```

7b. Dev loop (long-lived)

```bash
nrl-k8s run <recipe> --infra <infra> --run-id $(date +%Y%m%d-%H%M%S)

# Edits in the recipe? Just re-run; this reuses the live cluster.
# Pod spec changed? Add --recreate to delete + re-apply.
# Disagg recipe with gym/gen already healthy? --skip-daemons.
```

7c. First-time disaggregated bring-up

```bash
nrl-k8s run <recipe> --infra <disagg-infra> --mode batch --code-source image
```

7d. Cluster-only lifecycle

```bash
nrl-k8s cluster up   <recipe> --infra <infra> --target kuberay.training --wait
nrl-k8s cluster up   <recipe> --infra <infra> --target kuberay.training --dry-run   # render manifest
nrl-k8s cluster down <recipe> --infra <infra> --target kuberay.training --wait
nrl-k8s cluster down <recipe> --infra <infra>                                       # tear down all
nrl-k8s cluster list -n default
nrl-k8s cluster dashboard <cluster-name>                                            # port-forward + browser
```

7e. Deployments (e.g. nemo-skills sandbox)

```bash
# Bring up just the deployment
nrl-k8s cluster up <recipe> --infra <infra> --target deployments.nemo_skills

# Tear down just the deployment
nrl-k8s cluster down <recipe> --infra <infra> --target deployments.nemo_skills

# Tear down everything (RayClusters + Deployments)
nrl-k8s cluster down <recipe> --infra <infra>
```

The `deployments:` section in infra YAML declares Kubernetes Deployments managed alongside RayClusters. The CLI patches image, imagePullSecrets, and serviceAccountName from the top-level infra keys (same as RayClusters). Deployments start in parallel with cluster bring-up; there is no ordering dependency.

8. Monitoring a run

```bash
# Status
nrl-k8s status <recipe> --infra <infra>
kubectl get rayjob,raycluster -n default

# Follow the driver
nrl-k8s job list <recipe> --infra <infra> --role training
nrl-k8s job logs <run-id> <recipe> --infra <infra> --role training -f
```

When the `nrl-k8s job logs -f` subprocess dies (`kubectl port-forward` i/o timeout after ~15 min idle), just re-run it. The training job keeps going.

To fetch driver logs for a terminal job (SUCCEEDED/FAILED) or a RayJob via the dashboard API:
```bash
RC=$(kubectl get rayjob -n default <rayjob-name> -o jsonpath='{.status.rayClusterName}')
kubectl port-forward -n default svc/${RC}-head-svc 18266:8265 &
curl -s http://localhost:18266/api/jobs/                              # lists jobs, find submission_id
curl -s "http://localhost:18266/api/jobs/<submission_id>/logs"        # full driver log
```

- `type=DRIVER` with `submission_id=null` means an exec-submitter run (no dashboard log endpoint; use `nrl-k8s job logs` instead).
- `type=SUBMISSION` has `submission_id` set and `/api/jobs/<id>/logs` works.

The wandb URL appears in the driver log on the first `wandb.init` call; extract it with `grep -oE 'https://wandb\.ai/[A-Za-z0-9_./-]+'`.
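The wandb URL extraction can be checked offline against a sample log. In the sketch below, `extract_wandb_url` is a hypothetical wrapper around the same `grep` pattern, and the log lines are fabricated sample input rather than real driver output.

```bash
# Reads a driver log on stdin and prints the first wandb URL, if any.
extract_wandb_url() {
  grep -oE 'https://wandb\.ai/[A-Za-z0-9_./-]+' | head -n 1
}

# Fabricated sample lines for demonstration only.
printf '%s\n' \
  'INFO driver started' \
  'wandb: View run at https://wandb.ai/my-team/my-project/runs/abc123' \
  | extract_wandb_url
```

In practice the input would come from `nrl-k8s job logs` or the dashboard `/api/jobs/<submission_id>/logs` endpoint described in this section.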

9. Stopping things

| What to stop | Command |
|---|---|
| One training run | `nrl-k8s job stop <run-id> <recipe> --infra <infra> --role training` |
| All running Ray jobs on a cluster (+ submit new) | `nrl-k8s run <recipe> --infra <infra> --replace` |
| A long-lived RayCluster | `nrl-k8s cluster down <recipe> --infra <infra> --target kuberay.training --wait` |
| A RayJob (ephemeral) | `kubectl delete rayjob <name> -n default` (only if `shutdownAfterJobFinishes` didn't fire) |

Confirm before deleting shared infra. The cost of `cluster down` on someone else's cluster is high.

10. Verifying RayJob teardown

After a `run --rayjob` completes with `--shutdown` (default), KubeRay should delete the RayCluster:

```bash
kubectl get rayjob   -n default <rayjob-name>                        # jobDeploymentStatus = Complete
kubectl get raycluster -n default | grep <rayjob-name>               # no output = torn down
```

The RayJob object itself sticks around for `--ttl` seconds (default 3600s) so you can still fetch logs.

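For scripts, the "no output = torn down" check can be turned into an exit status. This is a minimal sketch: `cluster_torn_down` is a hypothetical helper that reads a RayCluster listing on stdin, and the listing below is fabricated.

```bash
# Succeeds only if no line of the listing on stdin mentions the RayJob name.
cluster_torn_down() {
  ! grep -qF -- "$1"
}

# Fabricated listing for demonstration only.
printf '%s\n' 'NAME            STATUS' 'other-cluster   ready' \
  | cluster_torn_down my-rayjob && echo "torn down"

# Against a live cluster (placeholder name as above):
#   kubectl get raycluster -n default | cluster_torn_down <rayjob-name>
```

The match is a fixed-string substring match, mirroring the plain `grep <rayjob-name>` used above, so an unrelated cluster whose name contains the same string would also count as "still present".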

11. Common gotchas

- OmegaConf interpolation eats `${VAR}` in recipe/infra YAML. Escape shell variables with `\${VAR}` so OmegaConf passes them through to the pod shell verbatim.
- Megatron optimizer configs don't carry `foreach` / `fused`. Overrides like `~policy.optimizer.kwargs.foreach ~policy.optimizer.kwargs.fused` (valid for DTensor configs) break on Megatron recipes. Omit them for Megatron.
- DTensor vs Megatron: MoE recipes typically use `megatron_cfg.enabled=true`; ensure `dtensor_cfg.enabled=false` in inherited defaults.
- Shared filesystem vs git divergence: `codeSource: image|lustre` reads from the pod filesystem. If your local edits aren't on the shared filesystem the pods mount, the run is testing the on-disk version, not yours. Either sync via a helper pod (head pod exec is often blocked) or override via Hydra flags.
- Ephemeral-storage + readinessProbe are injected by kuberay/CDI webhooks at pod-apply time. Do NOT add them to the inline RayCluster spec.
- Node taints vary per cluster. `tolerations: [{operator: Exists}]` on workers is defensive and worth keeping.
- Dashboard blank page: Ray 2.52 installs dashboard assets as symlinks by default; `nrl-k8s cluster dashboard <name>` auto-reinstalls `ray[default] --link-mode=copy` to fix it. Bake `ENV UV_LINK_MODE=copy` in the image to avoid this entirely.
- `kubectl exec` is usually blocked in automation; route around it with `kubectl get ... -o yaml`, `kubectl logs`, and `kubectl port-forward` + Ray dashboard APIs.

12. Checklist before calling a run "done"

Before reporting a launch as successful, verify:

- `kubectl get rayjob/raycluster -n default` shows the expected objects.
- `nrl-k8s job list` (or `curl /api/jobs/`) shows the job in `RUNNING` or `SUCCEEDED`.
- Driver log contains `wandb.ai/<project>/runs/<id>` (if wandb is enabled); share the URL with the user.
- At least one `Processed prompts: 100%` line appears (confirms generation is wired).
- For `--rayjob` mode only: after `jobDeploymentStatus=Complete`, confirm `kubectl get raycluster | grep <name>` is empty (teardown worked).

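The two log-based checklist items can be scripted against a saved driver log. The sketch below is an assumption-laden illustration: `check_driver_log` is a hypothetical helper, and the sample log lines are fabricated.

```bash
# Checks a saved driver log for the two log-based checklist items.
# Fails only on the generation check, since wandb may be disabled.
check_driver_log() {
  log_file=$1
  status=0
  if grep -q 'Processed prompts: 100%' "$log_file"; then
    echo "ok: generation produced output"
  else
    echo "FAIL: no 'Processed prompts: 100%' line"; status=1
  fi
  if grep -q 'wandb\.ai/' "$log_file"; then
    echo "ok: wandb URL present"
  else
    echo "warn: no wandb URL (expected only if wandb is enabled)"
  fi
  return "$status"
}

# Fabricated sample log for demonstration only.
log=$(mktemp)
printf '%s\n' 'Processed prompts: 100%|####| 8/8' \
  'wandb: View run at https://wandb.ai/my-team/my-project/runs/abc123' > "$log"
check_driver_log "$log"
rm -f "$log"
```

The object-level checks (`kubectl get rayjob/raycluster`, job state, teardown) still need to be run against the cluster itself.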

13. Dev pod

`nrl-k8s dev` manages a lightweight CPU pod on the cluster for code syncing, debugging, and running `kubectl` / `nrl-k8s` from within the cluster.

```bash
# One-time: set up secrets (HF token, wandb, SSH key, rclone)
nrl-k8s dev setup-secrets --ssh-key ~/.ssh/id_rsa --add-rclone

# Create pod and exec in (idempotent; reuses existing pod)
nrl-k8s dev connect

# Switch image (must stop first; image change is warned but not auto-applied)
nrl-k8s dev stop
nrl-k8s dev connect --image nvcr.io/nvidian/nemo-rl:v0.7.0

# Tear down
nrl-k8s dev stop
```

The dev pod:
- Runs on a CPU-only node (anti-affinity to GPU nodes)
- Mounts the shared `rl-workspace` PVC at `/mnt/rl-workspace`
- Sets `USER` env var to the `nrl-k8s` username (so `$USER` and `getpass.getuser()` work correctly despite running as root)
- Installs `kubectl` and `rclone` (if configured) on first boot
- Injects SSH keys and tokens via `envFrom` on a per-user K8s Secret

The pod's `default` service account needs an `edit` RoleBinding in the namespace for `kubectl` to work inside. `dev connect` checks this and prints the required YAML if missing.

14. Where things live in the repo

- CLI code: `infra/nrl_k8s/src/nrl_k8s/` (`cli.py`, `orchestrate.py`, `manifest.py`, `rayjob.py`, `k8s.py`, `submitters/`, `schema.py`).
- Tests: `infra/nrl_k8s/tests/unit/`; run with `uv run --extra test pytest -x -q` from `infra/nrl_k8s/`.
- Recipe + infra examples: `infra/nrl_k8s/examples/`.
- Base recipes this tool wraps: `examples/configs/recipes/llm/…` and `examples/nemo_gym/…`
