Playbook for launching, monitoring, stopping, and debugging NeMo-RL recipes on a Kubernetes cluster via the nrl-k8s CLI. Covers ephemeral vs long-lived RayCluster modes, iterating on runs, and debugging hung or failed training jobs.

    npx skill4agent add nvidia/skills launch-nemo-rl

`nrl-k8s`, `infra/nrl_k8s/`, `kubectl`, `git log`, `nrl-k8s run`

| Mode | Invocation | When to use | Cluster after? |
|---|---|---|---|
| Ephemeral (default) | | One-shot. KubeRay applies a RayJob, runs, tears the cluster down. Best for most runs. | No (auto) |
| Long-lived | | Dev loop. Reuses a matching live cluster, applies if absent, and warns then reuses on drift (pass `--recreate` to delete and re-apply). | Yes |
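
The Long-lived row describes a small decision table: reuse on a match, apply when absent, warn and reuse on drift. A minimal shell sketch of that logic follows. The function name, the state labels (`absent`, `match`, `drift`), and the returned action names are hypothetical and are not part of `nrl-k8s`; the sketch also assumes that a recreate request deletes and re-applies whenever a cluster already exists.

```shell
# Hypothetical helper: maps a cluster state to the action described in the
# Long-lived row. Not part of nrl-k8s; state and action names are made up.
decide_cluster_action() {
  state="$1"            # absent | match | drift (illustrative labels)
  recreate="${2:-no}"   # "yes" models passing --recreate (assumption)
  case "$state" in
    absent)
      echo apply ;;
    match|drift)
      if [ "$recreate" = yes ]; then
        echo recreate            # delete + re-apply
      elif [ "$state" = drift ]; then
        echo warn-and-reuse      # live spec differs, keep the cluster
      else
        echo reuse
      fi ;;
    *)
      echo "unknown state: $state" >&2
      return 2 ;;
  esac
}

# Print the default action for each state.
for s in absent match drift; do
  printf '%s -> %s\n' "$s" "$(decide_cluster_action "$s")"
done
```

Keeping drift as a warning rather than an automatic recreate matches the dev-loop intent of the row: a running cluster is not torn down unless that is requested explicitly.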

`--raycluster`

| Command | Purpose |
|---|---|
| `nrl-k8s check` | Validate a recipe + infra pair; optionally write the fully-resolved manifests. |
| `nrl-k8s status` | Per-role RayCluster state, head pod phase, worker pod phases, daemon job status. |
| `nrl-k8s cluster` | Manage RayClusters independently of a run (e.g. render a manifest with `--dry-run`). |
| `nrl-k8s job` | Observability over Ray Jobs already submitted to a role's cluster. |
| | Tail a role's pod / daemon logs without needing a submission id. |
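
The validate-then-render sequence this playbook uses before a launch (`nrl-k8s check`, then `nrl-k8s run ... --rayjob --dry-run`) can be wrapped in one function. This is only a sketch: `preflight` and the `NRL_K8S_BIN` override are hypothetical names introduced here so the flow can be exercised without a cluster; the subcommands and flags are the ones shown elsewhere in this document.

```shell
# Hypothetical wrapper: validate a recipe + infra pair, then render the RayJob
# manifest without applying it. NRL_K8S_BIN is an illustrative override that
# is not defined by nrl-k8s itself.
preflight() {
  recipe="$1"
  infra="$2"
  bin="${NRL_K8S_BIN:-nrl-k8s}"

  # Stop early if validation fails; nothing is rendered in that case.
  "$bin" check "$recipe" --infra "$infra" || return 1

  # Render only (--dry-run); nothing is applied to the cluster.
  "$bin" run "$recipe" --infra "$infra" --rayjob --dry-run
}

# Usage (requires nrl-k8s on PATH):
#   preflight <recipe> <infra>
```

Returning on the first failure keeps a broken recipe from reaching the render step, which mirrors the "validate first" ordering used in the launch commands.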

`--infra`

    nrl-k8s run infra/nrl_k8s/examples/<recipe>.yaml \
      --infra infra/nrl_k8s/examples/<recipe>.<profile>.infra.yaml

`qwen3_30b_math_8n_4gpu.yaml`, `cluster.{gpus_per_node,num_nodes}`, `defaults:`, `examples/configs/recipes/llm/...`

`*.<profile>.infra.yaml`, `kuberay:`, `deployments:`, `submit.submitter`, `launch.{mode,codeSource,codePath,entrypoint}`

`<recipe>.<profile>[.prod].infra.yaml`, `<profile>`, `gb300`, `infra/nrl_k8s/examples/`

`--mode`

    --mode interactive → --submitter portForward --code-source upload   (tails logs)
    --mode batch       → --submitter exec --code-source image           (returns after nohup)

`portForward`, `kubectl port-forward`, `submission_id`, `exec`, `kubectl exec`, `nohup`, `type=DRIVER`

`upload`, `image`, `lustre`, `--code-path`, `/opt/nemo-rl`

`--wait`, `--no-wait`, `--replace`, `--recreate`, `--skip-daemons`

`cd /opt/nemo-rl`, `--code-source upload`, `/tmp/ray/...`, `cd`

`--rayjob`, `run`, `--rayjob-name NAME`, `--shutdown / --no-shutdown`, `true`, `--ttl SECONDS`, `--wait / --no-wait`, `wait`, `jobDeploymentStatus`, `--no-wait`, `--timeout SECONDS`, `--wait`, `--dry-run`

`--replace`, `--recreate`, `--skip-daemons`, `--rayjob`

    entrypoint: |
      set -eu
      cd /opt/nemo-rl
      RUN_ID="\${RAY_JOB_SUBMISSION_ID:-\${NRL_K8S_RUN_ID:-$(date -u +%Y%m%d-%H%M%S)}}"
      python -u examples/run_grpo.py \
        --config infra/nrl_k8s/examples/<recipe>.yaml \
        logger.wandb_enabled=true \
        logger.wandb.project=<project> \
        "logger.wandb.name=<run-name>-\${RUN_ID}"

`${…}`, `${VAR:-default}`, `RUN_ID`, `RAY_JOB_SUBMISSION_ID`, `NRL_K8S_RUN_ID`

`infra/nrl_k8s/examples/`, `cluster.gpus_per_node`, `Pending`, `nvidia.com/gpu.product`

`schedulerName: kai-scheduler`, `kai.scheduler/queue`, `kai.scheduler/topology`, `kai.scheduler/topology-required-placement`

`resourceClaims`, `ResourceClaimTemplate`

`secretKeyRef`, `wandb-api-key`

`/opt/nemo-rl`, `subPath`, `/mnt/rl-workspace`

    kubectl get pvc <workspace-pvc>
    kubectl get secret <wandb-secret> <image-pull-secret>
    kubectl get sa <service-account>

    # From the NeMo-RL repo root:
    nrl-k8s check <recipe> --infra <infra>                      # validate first
    nrl-k8s run <recipe> --infra <infra> --rayjob --dry-run     # render RayJob manifest
    nrl-k8s run <recipe> --infra <infra> --rayjob --no-wait     # apply, returns fast

    kubectl get rayjob -n default <name> -w
    kubectl get raycluster -n default                           # empty = teardown succeeded

    nrl-k8s run <recipe> --infra <infra> --run-id $(date +%Y%m%d-%H%M%S)
    # Edits in the recipe? Just re-run: this reuses the live cluster.
    # Pod spec changed? Add --recreate to delete + re-apply.
    # Disagg recipe with gym/gen already healthy? --skip-daemons.

    nrl-k8s run <recipe> --infra <disagg-infra> --mode batch --code-source image

    nrl-k8s cluster up <recipe> --infra <infra> --target kuberay.training --wait
    nrl-k8s cluster up <recipe> --infra <infra> --target kuberay.training --dry-run   # render manifest
    nrl-k8s cluster down <recipe> --infra <infra> --target kuberay.training --wait
    nrl-k8s cluster down <recipe> --infra <infra>                                     # tear down all
    nrl-k8s cluster list -n default
    nrl-k8s cluster dashboard <cluster-name>                                          # port-forward + browser

    # Bring up just the deployment
    nrl-k8s cluster up <recipe> --infra <infra> --target deployments.nemo_skills
    # Tear down just the deployment
    nrl-k8s cluster down <recipe> --infra <infra> --target deployments.nemo_skills
    # Tear down everything (RayClusters + Deployments)
    nrl-k8s cluster down <recipe> --infra <infra>

`deployments:`

    # Status
    nrl-k8s status <recipe> --infra <infra>
    kubectl get rayjob,raycluster -n default
    # Follow the driver
    nrl-k8s job list <recipe> --infra <infra> --role training
    nrl-k8s job logs <run-id> <recipe> --infra <infra> --role training -f

`nrl-k8s job logs -f`, `kubectl port-forward`

    RC=$(kubectl get rayjob -n default <rayjob-name> -o jsonpath='{.status.rayClusterName}')
    kubectl port-forward -n default svc/${RC}-head-svc 18266:8265 &
    curl -s http://localhost:18266/api/jobs/                          # lists jobs, find submission_id
    curl -s "http://localhost:18266/api/jobs/<submission_id>/logs"    # full driver log

`type=DRIVER`, `submission_id=null`, `nrl-k8s job logs`, `type=SUBMISSION`, `submission_id`, `/api/jobs/<id>/logs`

`wandb.init`

    grep -oE 'https://wandb\.ai/[A-Za-z0-9_./-]+'

| What to stop | Command |
|---|---|
| One training run | |
| All running Ray jobs on a cluster (+ submit new) | |
| A long-lived RayCluster | |
| A RayJob (ephemeral) | |
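
For the ephemeral RayJob row, this document checks teardown by grepping the `kubectl get raycluster` listing for the RayJob name and treating empty output as "torn down". A minimal sketch of that check as a reusable function follows. `cluster_torn_down` is a hypothetical name, and the function reads the listing from stdin so it can be tried on sample text without cluster access.

```shell
# Hypothetical helper: reads a `kubectl get raycluster` listing on stdin and
# succeeds (exit 0) only when no line mentions the given RayJob name.
cluster_torn_down() {
  name="$1"
  # -F: match the name literally; -q: no output, exit status only.
  if grep -qF -- "$name"; then
    return 1    # a RayCluster for this RayJob is still listed
  fi
  return 0      # nothing matched: treated as torn down
}

# Usage (requires kubectl and cluster access):
#   kubectl get raycluster -n default | cluster_torn_down <rayjob-name> \
#     && echo "torn down"
```

Note that an empty listing also counts as torn down here, so a failing `kubectl` call upstream in the pipe would need to be checked separately.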

`cluster down`, `run --rayjob`, `--shutdown`

    kubectl get rayjob -n default <rayjob-name>                # jobDeploymentStatus = Complete
    kubectl get raycluster -n default | grep <rayjob-name>     # no output = torn down

`--ttl`

`${VAR}`, `\${VAR}`

`foreach`, `fused`, `~policy.optimizer.kwargs.foreach ~policy.optimizer.kwargs.fused`

`megatron_cfg.enabled=true`, `dtensor_cfg.enabled=false`

`codeSource: image|lustre`

`tolerations: [{operator: Exists}]`

`nrl-k8s cluster dashboard <name>`

`ray[default] --link-mode=copy`, `ENV UV_LINK_MODE=copy`

`kubectl exec`, `kubectl get ... -o yaml`, `kubectl logs`, `kubectl port-forward`

`kubectl get rayjob/raycluster -n default`, `nrl-k8s job list`, `curl /api/jobs/`, `RUNNING`, `SUCCEEDED`

`wandb.ai/<project>/runs/<id>`, `Processed prompts: 100%`

`--rayjob`, `jobDeploymentStatus=Complete`, `kubectl get raycluster | grep <name>`

`nrl-k8s dev`, `kubectl`, `nrl-k8s`

    # One-time: set up secrets (HF token, wandb, SSH key, rclone)
    nrl-k8s dev setup-secrets --ssh-key ~/.ssh/id_rsa --add-rclone
    # Create pod and exec in (idempotent: reuses existing pod)
    nrl-k8s dev connect
    # Switch image (must stop first; an image change is warned about but not auto-applied)
    nrl-k8s dev stop
    nrl-k8s dev connect --image nvcr.io/nvidian/nemo-rl:v0.7.0
    # Tear down
    nrl-k8s dev stop

`rl-workspace`, `/mnt/rl-workspace`

`USER`, `nrl-k8s`, `$USER`, `getpass.getuser()`

`kubectl`, `rclone`, `envFrom`

`default`, `edit`, `kubectl`, `dev connect`

`infra/nrl_k8s/src/nrl_k8s/`, `cli.py`, `orchestrate.py`, `manifest.py`, `rayjob.py`, `k8s.py`, `submitters/`, `schema.py`

`infra/nrl_k8s/tests/unit/`, `uv run --extra test pytest -x -q`

`infra/nrl_k8s/`, `infra/nrl_k8s/examples/`, `examples/configs/recipes/llm/…`, `examples/nemo_gym/…`
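
The `RUN_ID` line in the entrypoint shown earlier nests two `${VAR:-default}` expansions: `RAY_JOB_SUBMISSION_ID` wins if set, then `NRL_K8S_RUN_ID`, then a UTC timestamp. The sketch below reproduces that fallback chain in plain shell so its behaviour can be checked locally. It assumes the `\$` escapes in the YAML are resolved before the shell sees the script; `resolve_run_id` is a hypothetical helper name.

```shell
# Hypothetical helper mirroring the entrypoint's RUN_ID fallback chain:
#   1. RAY_JOB_SUBMISSION_ID if set and non-empty,
#   2. else NRL_K8S_RUN_ID if set and non-empty,
#   3. else a UTC timestamp of the form YYYYMMDD-HHMMSS.
resolve_run_id() {
  printf '%s\n' "${RAY_JOB_SUBMISSION_ID:-${NRL_K8S_RUN_ID:-$(date -u +%Y%m%d-%H%M%S)}}"
}

# Show the value that would be used in the current environment.
echo "RUN_ID=$(resolve_run_id)"
```

Because `:-` treats an empty variable the same as an unset one, exporting `NRL_K8S_RUN_ID=""` still falls through to the timestamp.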