dynamo-recipe-runner

Compare original and translation side by side

🇺🇸

Original

English
🇨🇳

Translation

Chinese

Dynamo Recipe Runner

Dynamo Recipe 运行器

<!-- SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. SPDX-License-Identifier: CC-BY-4.0 -->
<!-- SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. SPDX-License-Identifier: CC-BY-4.0 -->

Purpose

用途

Get from user intent to a working Dynamo recipe endpoint with minimal back and forth. Do not create new guide content. Operate on the existing
recipes/
tree, patch the smallest necessary set of manifests, deploy when the user has cluster access, and prove success with an OpenAI-compatible smoke request.
获取用户需求,以最少的沟通成本搭建可用的Dynamo Recipe端点。不创建新的指南内容。基于现有的
recipes/
目录结构操作,仅补丁修复必要的最小化清单文件,当用户拥有集群访问权限时进行部署,并通过兼容OpenAI的冒烟测试验证部署成功。

Prerequisites

前置条件

  • Python 3.10+ on the operator machine.
  • kubectl
    configured with a working cluster context.
  • Cluster has a default storage class for model-cache PVCs.
  • Hugging Face token stored in a Kubernetes secret named
    hf-token-secret
    (or equivalent) in the target namespace.
  • Read access to the
    recipes/
    tree in the ai-dynamo/dynamo repository.
  • 操作机器上安装Python 3.10及以上版本。
  • kubectl
    已配置可用的集群上下文。
  • 集群拥有用于model-cache PVC的默认存储类。
  • Hugging Face令牌存储在目标命名空间中名为
    hf-token-secret
    (或等效名称)的Kubernetes密钥中。
  • 拥有ai-dynamo/dynamo仓库中
    recipes/
    目录的读取权限。

Required Inputs

必需输入

Collect or infer these before changing manifests:
  • recipe target: model, framework (
    vllm
    ,
    sglang
    ,
    trtllm
    ,
    tokenspeed
    ), deployment mode, and GPU type/count
  • Kubernetes context and namespace
  • Hugging Face secret name, usually
    hf-token-secret
  • storage class for model cache PVCs
  • runtime image tag if the recipe uses a placeholder or stale test image
  • whether to run commands or only produce exact commands
If a required value is missing and cannot be inferred from the selected recipe, ask for only that value.
在修改清单文件前,收集或推断以下信息:
  • 目标配置方案:模型、框架(
    vllm
    sglang
    trtllm
    tokenspeed
    )、部署模式以及GPU类型/数量
  • Kubernetes上下文和命名空间
  • Hugging Face密钥名称,通常为
    hf-token-secret
  • 用于模型缓存PVC的存储类
  • 若配置方案使用占位符或过时测试镜像,需提供运行时镜像标签
  • 是直接执行命令还是仅生成精确命令
若某个必需值缺失且无法从所选配置方案中推断,仅询问该值。

Instructions

操作步骤

1. Preflight

1. 预检

Run read-only checks first:
bash
git status --short
python3 scripts/recipe_tool.py list --format table
kubectl config current-context
kubectl get storageclass
kubectl get nodes -o wide
kubectl get namespace "${NAMESPACE}"
kubectl get secret hf-token-secret -n "${NAMESPACE}"
If
kubectl
is unavailable or the cluster is unreachable, continue by selecting and validating the recipe, then return exact commands instead of pretending the deployment ran.
首先执行只读检查:
bash
git status --short
python3 scripts/recipe_tool.py list --format table
kubectl config current-context
kubectl get storageclass
kubectl get nodes -o wide
kubectl get namespace "${NAMESPACE}"
kubectl get secret hf-token-secret -n "${NAMESPACE}"
kubectl
不可用或集群无法访问,继续选择并验证配置方案,然后返回精确命令而非模拟部署执行。

2. Select The Recipe

2. 选择配置方案

Use the recipe matrix from
recipes/README.md
and the scanner:
bash
python3 scripts/recipe_tool.py list \
  --query qwen --framework vllm --mode disagg --format table
Prefer an exact existing recipe. Do not invent new manifests unless the user explicitly asks to author a new recipe.
使用
recipes/README.md
中的配置方案矩阵和扫描工具:
bash
python3 scripts/recipe_tool.py list \
  --query qwen --framework vllm --mode disagg --format table
优先选择完全匹配的现有配置方案。除非用户明确要求编写新配置方案,否则不要创建新的清单文件。

3. Inspect And Validate

3. 检查与验证

Read the selected recipe README, model-cache manifests,
deploy.yaml
, and
perf.yaml
if present. Then run:
bash
python3 scripts/recipe_tool.py validate \
  recipes/<model>/<framework>/<mode>
Resolve reported blockers before applying manifests: storage class, model cache PVC, image tag, HF token secret, GPU count, frontend service name, and router mode.
阅读所选配置方案的README、model-cache清单、
deploy.yaml
以及(若存在的)
perf.yaml
。然后执行:
bash
python3 scripts/recipe_tool.py validate \
  recipes/<model>/<framework>/<mode>
在应用清单文件前解决报告的阻塞问题:存储类、模型缓存PVC、镜像标签、HF令牌密钥、GPU数量、前端服务名称以及路由模式。

4. Patch Minimal Values

4. 最小化补丁修复

Patch only recipe-specific values needed for this run. Do not reformat whole YAML files. Common patches:
  • storageClassName
  • image repository/tag
  • model path or model cache mount path
  • GPU resource requests/limits
  • frontend
    DYN_ROUTER_MODE
  • namespace only when a manifest hardcodes it
Never write Hugging Face tokens into files or logs. Use Kubernetes secrets.
仅补丁修复本次运行所需的配置方案特定值。不要重新格式化整个YAML文件。常见的补丁修复内容:
  • storageClassName
  • 镜像仓库/标签
  • 模型路径或模型缓存挂载路径
  • GPU资源请求/限制
  • 前端
    DYN_ROUTER_MODE
  • 仅当清单文件硬编码命名空间时才修改命名空间
切勿将Hugging Face令牌写入文件或日志。使用Kubernetes密钥存储。

5. Deploy

5. 部署

Follow the selected recipe README when it differs from the default sequence. The default sequence is:
bash
kubectl apply -f recipes/<model>/model-cache/ -n "${NAMESPACE}"
kubectl wait --for=condition=Complete job/model-download -n "${NAMESPACE}" --timeout=6000s
kubectl apply -f recipes/<model>/<framework>/<mode>/deploy.yaml -n "${NAMESPACE}"
kubectl get dynamographdeployment -n "${NAMESPACE}"
kubectl get pods -n "${NAMESPACE}" -o wide
Wait for the frontend and workers to be ready before testing.
当所选配置方案的README与默认流程不同时,遵循README中的步骤。默认流程为:
bash
kubectl apply -f recipes/<model>/model-cache/ -n "${NAMESPACE}"
kubectl wait --for=condition=Complete job/model-download -n "${NAMESPACE}" --timeout=6000s
kubectl apply -f recipes/<model>/<framework>/<mode>/deploy.yaml -n "${NAMESPACE}"
kubectl get dynamographdeployment -n "${NAMESPACE}"
kubectl get pods -n "${NAMESPACE}" -o wide
在测试前等待前端和工作节点就绪。

6. Smoke Test

6. 冒烟测试

Port-forward the frontend service, then verify
/v1/models
and one chat completion:
bash
kubectl port-forward svc/<deployment-name>-frontend 8000:8000 -n "${NAMESPACE}"
curl http://127.0.0.1:8000/v1/models
If
dynamo-router-starter
is also installed, prefer its
scripts/check_router_health.py
for the full OpenAI-compatible smoke test. If this fails, switch to
dynamo-troubleshoot
.
端口转发前端服务,然后验证
/v1/models
接口和一次聊天补全请求:
bash
kubectl port-forward svc/<deployment-name>-frontend 8000:8000 -n "${NAMESPACE}"
curl http://127.0.0.1:8000/v1/models
若同时安装了
dynamo-router-starter
,优先使用其
scripts/check_router_health.py
进行完整的兼容OpenAI的冒烟测试。若测试失败,切换至
dynamo-troubleshoot
工具。

Available Scripts

可用脚本

ScriptPurposeArguments
scripts/recipe_tool.py list
Enumerate available recipes, optionally filtered
--query
,
--framework
,
--mode
,
--format
scripts/recipe_tool.py validate
Validate a recipe directory before applypositional recipe path
Invoke via the agentskills.io
run_script()
protocol:
python
run_script("scripts/recipe_tool.py", args=["list", "--framework", "sglang", "--format", "table"])
run_script("scripts/recipe_tool.py", args=["validate", "recipes/nemotron-3-super-fp8/sglang/agg"])
脚本用途参数
scripts/recipe_tool.py list
枚举可用配置方案,可按需过滤
--query
,
--framework
,
--mode
,
--format
scripts/recipe_tool.py validate
在应用前验证配置方案目录位置参数:配置方案路径
通过agentskills.io的
run_script()
协议调用:
python
run_script("scripts/recipe_tool.py", args=["list", "--framework", "sglang", "--format", "table"])
run_script("scripts/recipe_tool.py", args=["validate", "recipes/nemotron-3-super-fp8/sglang/agg"])

Examples

示例

List sglang recipes that fit a single 8xB200 node:
bash
python3 scripts/recipe_tool.py list --framework sglang --format table
Validate a specific recipe and resolve blockers before applying:
bash
python3 scripts/recipe_tool.py validate recipes/nemotron-3-super-fp8/sglang/agg
Equivalent through the agent protocol:
python
run_script("scripts/recipe_tool.py", args=["validate", "recipes/nemotron-3-super-fp8/sglang/agg"])
列出适配单台8xB200节点的sglang配置方案:
bash
python3 scripts/recipe_tool.py list --framework sglang --format table
验证特定配置方案并在应用前解决阻塞问题:
bash
python3 scripts/recipe_tool.py validate recipes/nemotron-3-super-fp8/sglang/agg
通过代理协议的等效调用:
python
run_script("scripts/recipe_tool.py", args=["validate", "recipes/nemotron-3-super-fp8/sglang/agg"])

Output Contract

输出约定

Return:
  • selected recipe path and why it was selected
  • exact values patched
  • commands run or commands to run
  • endpoint and smoke-test result
  • unresolved blockers, if any
  • next troubleshooting step when deployment does not become healthy
返回内容:
  • 所选配置方案路径及选择原因
  • 补丁修复的精确值
  • 已执行的命令或待执行的命令
  • 端点地址和冒烟测试结果
  • (若存在的)未解决阻塞问题
  • 当部署无法恢复健康时的下一步排查步骤

Limitations

限制

  • Operates on the existing
    recipes/
    tree only. Does not author new manifests.
  • Cluster-mutating apply steps require
    kubectl
    permission to the target namespace.
  • Smoke-test depth is intentionally minimal; for full router/endpoint coverage use
    dynamo-router-starter
    .
  • Multi-node disagg transport correctness is out of scope; use
    dynamo-interconnect-check
    after deploy.
  • 仅基于现有
    recipes/
    目录结构操作。不编写新的清单文件。
  • 修改集群的应用步骤需要
    kubectl
    拥有目标命名空间的权限。
  • 冒烟测试深度故意设置为最小;若需完整的路由/端点覆盖,请使用
    dynamo-router-starter
  • 多节点分离传输的正确性不在本工具范围内;部署完成后请使用
    dynamo-interconnect-check
    工具。

Troubleshooting

故障排查

SymptomLikely causeNext step
kubectl
cluster unreachable
Context not set or VPN downReturn exact commands instead of running them; resume when cluster is reachable
validate
reports missing storage class
Cluster has no default
StorageClass
Patch
storageClassName
on the model-cache manifest before applying
Model-cache job stuck
Pending
PVC unbound or HF secret missingInspect PVC events; create or rename the HF secret to match the recipe
Worker pods
ImagePullBackOff
Stale image tag or missing pull secretPatch the image tag; verify image pull secret in the namespace
/v1/models
4xx/5xx after deploy
Frontend not ready or wrong service portWait for pods Ready; re-run port-forward; switch to
dynamo-troubleshoot
if it persists
症状可能原因下一步操作
kubectl
无法连接集群
上下文未设置或VPN断开返回精确命令而非执行;集群恢复后再继续
validate
报告缺失存储类
集群无默认
StorageClass
在应用前补丁修复model-cache清单中的
storageClassName
Model-cache作业卡在
Pending
状态
PVC未绑定或HF密钥缺失检查PVC事件;创建或重命名HF密钥以匹配配置方案
工作节点Pod出现
ImagePullBackOff
镜像标签过时或缺失拉取密钥补丁修复镜像标签;验证命名空间中的镜像拉取密钥
部署后
/v1/models
返回4xx/5xx错误
前端未就绪或服务端口错误等待Pod就绪;重新执行端口转发;若问题持续,切换至
dynamo-troubleshoot
工具

Benchmark

基准测试

See
BENCHMARK.md
for the NVCARPS-EVAL performance report (auto-generated by the NVSkills CI pipeline). To refresh, re-run
/nvskills-ci
on an upstream PR touching this skill.
查看
BENCHMARK.md
获取NVCARPS-EVAL性能报告(由NVSkills CI管道自动生成)。如需刷新报告,请在触及本技能的上游PR上重新运行
/nvskills-ci

References

参考资料

  • Read
    references/k8s-recipe-workflow.md
    for command templates and readiness checks.
  • Use
    scripts/recipe_tool.py
    for recipe discovery and lightweight validation.
  • 阅读
    references/k8s-recipe-workflow.md
    获取命令模板和就绪检查说明。
  • 使用
    scripts/recipe_tool.py
    进行配置方案发现和轻量级验证。