launch-nemo-rl
launch-nemo-rl: running NeMo-RL recipes on Kubernetes via nrl-k8s
This is the playbook for the `nrl-k8s` CLI at `infra/nrl_k8s/`. Follow it when the user asks to launch, iterate on, or debug a NeMo-RL recipe on a Kubernetes cluster. Verify current state (`kubectl`, `git log`, the recipe + infra files) before acting: the cluster is shared and the cost of a wrong action is high.

1. One command, two modes
There is a single top-level submission command: `nrl-k8s run`. It has two lifecycle modes.

| Mode | Invocation | When to use | Cluster after? |
|---|---|---|---|
| Ephemeral (default) | `nrl-k8s run` (or with `--rayjob`) | One-shot. KubeRay applies a RayJob, runs it, then tears the cluster down. Best for most runs. | No (auto) |
| Long-lived | `nrl-k8s run --raycluster` | Dev loop. Reuses a matching live cluster, applies if absent, warns + reuses on drift (pass `--recreate` to delete + re-apply). | Yes |

Ask: Do I need this cluster after the run? If yes, use `--raycluster`. Otherwise use the default (ephemeral).

The rest of the CLI is observability / stage-by-stage control:

| Command | Purpose |
|---|---|
| `nrl-k8s check` | Validate a recipe + infra pair; optionally write the fully-resolved manifests. |
| `nrl-k8s status` | Per-role RayCluster state, head pod phase, worker pod phases, daemon job status. |
| `nrl-k8s cluster` | Manage RayClusters independently of a run (e.g. render a manifest with `cluster up --dry-run`). |
| `nrl-k8s job` | Observability over Ray Jobs already submitted to a role's cluster. |
| | Tail a role's pod / daemon logs without needing a submission id. |
2. Recipe + infra pair
Every launch takes two files. Pass the infra with `--infra`, not merged inline:

```bash
nrl-k8s run infra/nrl_k8s/examples/<recipe>.yaml \
  --infra infra/nrl_k8s/examples/<recipe>.<profile>.infra.yaml
```

- Recipe (e.g. `qwen3_30b_math_8n_4gpu.yaml`): the NeMo-RL config (model, GRPO/SFT knobs, `cluster.{gpus_per_node,num_nodes}`). Uses `defaults:` to inherit from `examples/configs/recipes/llm/...`.
- Infra (e.g. `*.<profile>.infra.yaml`): the K8s/Ray shape (namespace, image, service account, RayCluster spec under `kuberay:`, optional Deployments under `deployments:`, `submit.submitter`, `launch.{mode,codeSource,codePath,entrypoint}`). Pair names follow `<recipe>.<profile>[.prod].infra.yaml`, where `<profile>` names the hardware target (e.g. `gb300`).

Example pairs live in `infra/nrl_k8s/examples/`. Read the neighbouring files to see the current conventions for the target profile.
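The pairing convention can be sketched as a small shell helper. This is an illustration only: `infra_path` is a hypothetical function, not part of `nrl-k8s`, and it does nothing more than assemble the `<recipe>.<profile>[.prod].infra.yaml` name under `infra/nrl_k8s/examples/`.

```bash
# Hypothetical helper (not part of nrl-k8s): build the infra file path that
# pairs with a recipe, following <recipe>.<profile>[.prod].infra.yaml.
infra_path() {
  local recipe="$1"       # recipe file name, with or without the .yaml suffix
  local profile="$2"      # hardware target, e.g. gb300
  local variant="${3:-}"  # optional variant such as "prod"
  local base="${recipe%.yaml}"
  if [ -n "$variant" ]; then
    printf 'infra/nrl_k8s/examples/%s.%s.%s.infra.yaml\n' "$base" "$profile" "$variant"
  else
    printf 'infra/nrl_k8s/examples/%s.%s.infra.yaml\n' "$base" "$profile"
  fi
}
```

A call such as `infra_path qwen3_30b_math_8n_4gpu.yaml gb300` yields the path to pass to `--infra`; whether that file actually exists still has to be checked in the examples directory.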
3. Long-lived mode flags
Three independent dimensions. `--mode` is a macro that picks defaults; individual flags override it.

```
--mode interactive → --submitter portForward --code-source upload (tails logs)
--mode batch       → --submitter exec --code-source image (returns after nohup)
```

- Submitter: `portForward` uses `kubectl port-forward` + Ray Job SDK (gets a `submission_id` the dashboard tracks). `exec` uses `kubectl exec` + `nohup` on the head pod (no submission_id; driver appears as `type=DRIVER` in the dashboard).
- Code source: `upload` stages a working_dir from the laptop (Ray 100 MiB cap). `image` / `lustre` expect code on the pod's filesystem, paired with `--code-path` (typically `/opt/nemo-rl`), which is a subPath of the shared-filesystem PVC mount in the standard infra examples.
- Wait: `--wait` tails logs until terminal; `--no-wait` returns as soon as the driver is running.

Other long-lived-only flags:

- `--replace`: stop any running training / daemon job before submitting new ones (suffixes daemon submissionIds with a timestamp so Ray accepts the resubmit).
- `--recreate`: delete + re-apply a RayCluster whose live spec has drifted from the rendered manifest (default is warn + reuse).
- `--skip-daemons`: bring up all declared clusters but only submit training. Use on disagg recipes where gym/generation are already healthy.

Gotcha: on infra where the entrypoint does `cd /opt/nemo-rl` (or another in-image / Lustre path) and loads the recipe from there, `--code-source upload` does NOT override the recipe on the pod. The uploaded working_dir sits in `/tmp/ray/...`, but the entrypoint `cd`s away from it. To actually test a local recipe change, either sync your edits to the shared filesystem mounted into the pods or flip the Hydra overrides in the entrypoint.
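Because `upload` is bounded by the 100 MiB working_dir cap mentioned above, a rough pre-flight size check can save a failed submission. The sketch below is an assumption-laden illustration: `upload_size_ok` is a hypothetical helper, not an `nrl-k8s` feature, and `du` counts every file on disk, so the result is only an upper bound on what would actually be staged.

```bash
# Hypothetical pre-flight check (not part of nrl-k8s): estimate the size of the
# directory that --code-source upload would stage and compare it to a limit.
upload_size_ok() {
  local dir="${1:-.}"
  local limit_mib="${2:-100}"   # working_dir cap quoted in this playbook
  local size_kib
  size_kib=$(du -sk "$dir" 2>/dev/null | cut -f1)
  [ -n "$size_kib" ] || return 2          # du failed, e.g. directory missing
  [ "$size_kib" -le $((limit_mib * 1024)) ]
}
```

Running `upload_size_ok . || echo "working dir too large for upload"` from the repo root before submitting gives an early warning; a failing check suggests switching to `--code-source image` or `lustre`.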
4. Ephemeral mode flags (`--rayjob`)
When `--rayjob` is set, `run` branches into the RayJob code path. Relevant flags:

- `--rayjob-name NAME`: RayJob metadata name (defaults to the training cluster name).
- `--shutdown / --no-shutdown`: default `true`. KubeRay deletes the RayCluster once the Ray Job reaches a terminal state.
- `--ttl SECONDS`: default 3600s. Keeps the RayJob object around after the run finishes for post-mortem log access.
- `--wait / --no-wait`: default `wait`. Polls `jobDeploymentStatus` until Complete/Failed; `--no-wait` returns as soon as the RayJob is applied.
- `--timeout SECONDS`: default 86400s (24h). Bounds the `--wait` poll.
- `--dry-run`: render the RayJob manifest and print it; do not apply.

`--replace`, `--recreate`, and `--skip-daemons` are ignored in `--rayjob` mode (KubeRay owns the lifecycle).
5. Iterating on a config without touching the shared filesystem
When the recipe on the pod filesystem has the wrong value for your experiment, use Hydra overrides on the entrypoint instead of forking the recipe. Pattern:

```yaml
entrypoint: |
  set -eu
  cd /opt/nemo-rl
  RUN_ID="\${RAY_JOB_SUBMISSION_ID:-\${NRL_K8S_RUN_ID:-$(date -u +%Y%m%d-%H%M%S)}}"
  python -u examples/run_grpo.py \
    --config infra/nrl_k8s/examples/<recipe>.yaml \
    logger.wandb_enabled=true \
    logger.wandb.project=<project> \
    "logger.wandb.name=<run-name>-\${RUN_ID}"
```

Escape `${…}` with a backslash. OmegaConf otherwise interprets it as interpolation and errors on shell-style `${VAR:-default}`. `RUN_ID` resolves to `RAY_JOB_SUBMISSION_ID` (injected by KubeRay in rayjob mode) → `NRL_K8S_RUN_ID` (injected by the CLI in long-lived mode) → local timestamp, so the name is unique across either path.
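The fallback chain can be tried locally in plain shell. This sketch mirrors the entrypoint line without the OmegaConf backslashes, which are only needed inside the YAML file; the function name `resolve_run_id` is made up for the illustration.

```bash
# Same fallback order as the entrypoint: KubeRay's submission id first, then
# the CLI-injected run id, then a UTC timestamp. ${VAR:-default} also falls
# through when VAR is set but empty.
resolve_run_id() {
  printf '%s\n' "${RAY_JOB_SUBMISSION_ID:-${NRL_K8S_RUN_ID:-$(date -u +%Y%m%d-%H%M%S)}}"
}
```

With neither variable set, the function prints a timestamp; setting `NRL_K8S_RUN_ID` or `RAY_JOB_SUBMISSION_ID` before the call makes it print that value instead.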
6. Per-profile concerns (hardware + scheduler + DRA)
Every infra YAML encodes a hardware/scheduler profile. The concrete examples in `infra/nrl_k8s/examples/` are authoritative for the profiles they target, so read the neighbouring infra file before writing a new one. Things that commonly vary:

- Per-node GPUs (e.g. 4 vs 8): must match `cluster.gpus_per_node` in the recipe, otherwise workers stay `Pending`.
- Node selectors: head pods usually land on a CPU-only node pool; GPU workers match on `nvidia.com/gpu.product` or a node-group label.
- Scheduler: KAI (`schedulerName: kai-scheduler` + `kai.scheduler/queue` label) with topology annotations (`kai.scheduler/topology`, `kai.scheduler/topology-required-placement`) gang-schedules workers into one clique. Without it, pods may land on different racks and NVLink/RoCE won't span them.
- DRA claims: ComputeDomain + RoCE are attached via `resourceClaims` referencing `ResourceClaimTemplate`s. The CLI auto-creates/deletes these when the worker pod spec contains DRA claim references, so no manual setup is needed.
- Secrets: always via `secretKeyRef` (`wandb-api-key`, image pull secret). Never embed secret values.
- Shared filesystem mounts: typically a Lustre PVC mounted twice, once at the code path (e.g. `/opt/nemo-rl` with a user-scoped `subPath`) and once at a workspace root (e.g. `/mnt/rl-workspace`) for datasets, HF cache, and checkpoints.

Before applying an infra, verify prereqs exist in the target namespace:

```bash
kubectl get pvc <workspace-pvc>
kubectl get secret <wandb-secret> <image-pull-secret>
kubectl get sa <service-account>
```
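The three lookups above can be wrapped so that every missing prerequisite is reported in one pass rather than stopping at the first failure. This is a sketch under assumptions: `check_prereqs` is a hypothetical helper, not an `nrl-k8s` command, and the `kind/name` argument format is chosen only for this example.

```bash
# Hypothetical wrapper (not part of nrl-k8s): check each kind/name pair in a
# namespace and report all missing objects. Returns non-zero if any is absent.
check_prereqs() {
  local ns="$1"; shift
  local missing=0 item
  for item in "$@"; do
    # ${item%%/*} is the kind (pvc, secret, sa); ${item#*/} is the object name.
    if kubectl get "${item%%/*}" "${item#*/}" -n "$ns" >/dev/null 2>&1; then
      echo "ok       $item"
    else
      echo "MISSING  $item"
      missing=1
    fi
  done
  return "$missing"
}
# Example:
#   check_prereqs default pvc/<workspace-pvc> secret/<wandb-secret> sa/<service-account>
```

The helper only confirms that the objects exist; it does not validate their contents, for instance whether the secret holds the expected key.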
7. End-to-end workflows
7a. Fresh one-shot run (rayjob)
```bash
# From the NeMo-RL repo root:
nrl-k8s check <recipe> --infra <infra>                    # validate first
nrl-k8s run <recipe> --infra <infra> --rayjob --dry-run   # render RayJob manifest
nrl-k8s run <recipe> --infra <infra> --rayjob --no-wait   # apply, returns fast
```

Watch status + teardown (works even after your laptop disconnects because KubeRay owns the lifecycle):

```bash
kubectl get rayjob -n default <name> -w
kubectl get raycluster -n default      # empty = teardown succeeded
```
7b. Dev loop (long-lived)
```bash
nrl-k8s run <recipe> --infra <infra> --run-id $(date +%Y%m%d-%H%M%S)
# Edits in the recipe? Just re-run: it reuses the live cluster.
# Pod spec changed? Add --recreate to delete + re-apply.
# Disagg recipe with gym/gen already healthy? Add --skip-daemons.
```

7c. First-time disaggregated bring-up
```bash
nrl-k8s run <recipe> --infra <disagg-infra> --mode batch --code-source image
```

7d. Cluster-only lifecycle
```bash
nrl-k8s cluster up <recipe> --infra <infra> --target kuberay.training --wait
nrl-k8s cluster up <recipe> --infra <infra> --target kuberay.training --dry-run   # render manifest
nrl-k8s cluster down <recipe> --infra <infra> --target kuberay.training --wait
nrl-k8s cluster down <recipe> --infra <infra>                                     # tear down all
nrl-k8s cluster list -n default
nrl-k8s cluster dashboard <cluster-name>                                          # port-forward + browser
```

7e. Deployments (e.g. nemo-skills sandbox)
```bash
# Bring up just the deployment
nrl-k8s cluster up <recipe> --infra <infra> --target deployments.nemo_skills

# Tear down just the deployment
nrl-k8s cluster down <recipe> --infra <infra> --target deployments.nemo_skills

# Tear down everything (RayClusters + Deployments)
nrl-k8s cluster down <recipe> --infra <infra>
```

The `deployments:` section in infra YAML declares Kubernetes Deployments managed alongside RayClusters. The CLI patches image, imagePullSecrets, and serviceAccountName from the top-level infra keys (same as RayClusters). Deployments start in parallel with cluster bring-up, with no ordering dependency.

8. Monitoring a run
```bash
# Status
nrl-k8s status <recipe> --infra <infra>
kubectl get rayjob,raycluster -n default

# Follow the driver
nrl-k8s job list <recipe> --infra <infra> --role training
nrl-k8s job logs <run-id> <recipe> --infra <infra> --role training -f
```
When the `nrl-k8s job logs -f` subprocess dies (`kubectl port-forward` i/o timeout after ~15 min idle), just re-run it. The training job keeps going.

To fetch driver logs for a terminal job (SUCCEEDED/FAILED) or a RayJob via the dashboard API:

```bash
RC=$(kubectl get rayjob -n default <rayjob-name> -o jsonpath='{.status.rayClusterName}')
kubectl port-forward -n default "svc/${RC}-head-svc" 18266:8265 &
curl -s http://localhost:18266/api/jobs/                          # lists jobs, find submission_id
curl -s "http://localhost:18266/api/jobs/<submission_id>/logs"   # full driver log
```

Entries with `type=DRIVER` have `submission_id=null`, so `nrl-k8s job logs` cannot address them; only `type=SUBMISSION` entries carry a `submission_id` that `/api/jobs/<id>/logs` accepts.

The Wandb URL appears in the driver log on the first `wandb.init` call; extract it with `grep -oE 'https://wandb\.ai/[A-Za-z0-9_./-]+'`.
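The URL extraction can be packaged as a stdin filter. The sketch below reuses the `grep` pattern quoted above; the function name `first_wandb_url` is made up for the example, and the commented usage line assumes the port-forward on local port 18266 from the previous block is still running.

```bash
# Read a driver log on stdin and print the first Weights & Biases URL found.
# Prints nothing if the log contains no such URL.
first_wandb_url() {
  grep -oE 'https://wandb\.ai/[A-Za-z0-9_./-]+' | head -n 1
}
# Example (placeholder submission id):
#   curl -s "http://localhost:18266/api/jobs/<submission_id>/logs" | first_wandb_url
```

Taking only the first match avoids repeating the URL when the log prints it more than once.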
9. Stopping things
| What to stop | Command |
|---|---|
| One training run | |
| All running Ray jobs on a cluster (+ submit new) | `nrl-k8s run … --replace` |
| A long-lived RayCluster | `nrl-k8s cluster down …` |
| A RayJob (ephemeral) | |

Confirm before deleting shared infra. The cost of `cluster down` on someone else's cluster is high.
10. Verifying RayJob teardown
After a `run --rayjob` completes with `--shutdown` (default), KubeRay should delete the RayCluster:

```bash
kubectl get rayjob -n default <rayjob-name>               # jobDeploymentStatus = Complete
kubectl get raycluster -n default | grep <rayjob-name>    # no output = torn down
```

The RayJob object itself sticks around for `--ttl` seconds (default 3600s) so you can still fetch logs.
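Both checks can be combined into one pass/fail helper, for example to poll from a script. This is a sketch, not an `nrl-k8s` feature: `rayjob_torn_down` is a hypothetical name, and it assumes, as the `grep` above does, that the RayCluster name contains the RayJob name.

```bash
# Hypothetical helper (not part of nrl-k8s): succeed only when the RayJob
# reports Complete and no RayCluster with a matching name is left.
rayjob_torn_down() {
  local name="$1" ns="${2:-default}" status
  status=$(kubectl get rayjob "$name" -n "$ns" \
    -o jsonpath='{.status.jobDeploymentStatus}' 2>/dev/null)
  [ "$status" = "Complete" ] || return 1
  # grep -q exits 0 when a matching cluster still exists, so negate it.
  ! kubectl get raycluster -n "$ns" --no-headers 2>/dev/null | grep -q -- "$name"
}
```

A loop such as `until rayjob_torn_down <rayjob-name>; do sleep 30; done` waits for teardown. Note that it never returns for a job that ends in `Failed`, since only `Complete` is accepted here.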
11. Common gotchas
- OmegaConf interpolation eats `${VAR}` in recipe/infra YAML. Escape shell variables with `\${VAR}` so OmegaConf passes them through to the pod shell verbatim.
- Megatron optimizer configs don't carry `foreach` / `fused`. Overrides like `~policy.optimizer.kwargs.foreach ~policy.optimizer.kwargs.fused` (valid for DTensor configs) break on Megatron recipes. Omit them for Megatron.
- DTensor vs Megatron: MoE recipes typically use `megatron_cfg.enabled=true`; ensure `dtensor_cfg.enabled=false` in inherited defaults.
- Shared filesystem vs git divergence: `codeSource: image|lustre` reads from the pod filesystem. If your local edits aren't on the shared filesystem the pods mount, the run is testing the on-disk version, not yours. Either sync via a helper pod (head pod exec is often blocked) or override via Hydra flags.
- Ephemeral-storage + readinessProbe are injected by kuberay/CDI webhooks at pod-apply time. Do NOT add them to the inline RayCluster spec.
- Node taints vary per cluster. `tolerations: [{operator: Exists}]` on workers is defensive and worth keeping.
- Dashboard blank page: Ray 2.52 installs dashboard assets as symlinks by default; `nrl-k8s cluster dashboard <name>` auto-reinstalls `ray[default] --link-mode=copy` to fix it. Bake `ENV UV_LINK_MODE=copy` in the image to avoid this entirely.
- `kubectl exec` is usually blocked in automation. Route around it with `kubectl get ... -o yaml`, `kubectl logs`, and `kubectl port-forward` + Ray dashboard APIs.
12. Checklist before calling a run "done"
Before reporting a launch as successful, verify:

- `kubectl get rayjob/raycluster -n default` shows the expected objects.
- `nrl-k8s job list` (or `curl /api/jobs/`) shows the job in `RUNNING` / `SUCCEEDED`.
- Driver log contains `wandb.ai/<project>/runs/<id>` (if wandb is enabled). Share the URL with the user.
- At least one `Processed prompts: 100%` line appears (confirms generation is wired).
- For `--rayjob` mode only: after `jobDeploymentStatus=Complete`, confirm `kubectl get raycluster | grep <name>` is empty (teardown worked).
13. Dev pod
`nrl-k8s dev` manages a per-user dev pod in the cluster, with `kubectl` available inside:

```bash
# One-time: set up secrets (HF token, wandb, SSH key, rclone)
nrl-k8s dev setup-secrets --ssh-key ~/.ssh/id_rsa --add-rclone

# Create pod and exec in (idempotent: reuses existing pod)
nrl-k8s dev connect

# Switch image (must stop first: image change is warned but not auto-applied)
nrl-k8s dev stop
nrl-k8s dev connect --image nvcr.io/nvidian/nemo-rl:v0.7.0

# Tear down
nrl-k8s dev stop
```

The dev pod:

- Runs on a CPU-only node (anti-affinity to GPU nodes)
- Mounts the shared `rl-workspace` PVC at `/mnt/rl-workspace`
- Sets `USER` env var to the `nrl-k8s` username (so `$USER` and `getpass.getuser()` work correctly despite running as root)
- Installs `kubectl`, `rclone` (if configured) on first boot
- Injects SSH keys and tokens via `envFrom` on a per-user K8s Secret

The pod's `default` service account needs an `edit` RoleBinding in the namespace for `kubectl` to work inside. `dev connect` checks this and prints the required YAML if missing.
14. Where things live in the repo
- CLI code: `infra/nrl_k8s/src/nrl_k8s/` (`cli.py`, `orchestrate.py`, `manifest.py`, `rayjob.py`, `k8s.py`, `submitters/`, `schema.py`).
- Tests: `infra/nrl_k8s/tests/unit/`. Run with `uv run --extra test pytest -x -q` from `infra/nrl_k8s/`.
- Recipe + infra examples: `infra/nrl_k8s/examples/`.
- Base recipes this tool wraps: `examples/configs/recipes/llm/…` and `examples/nemo_gym/…`.