dynamo-recipe-runner
Compare original and translation side by side
🇺🇸
Original
English🇨🇳
Translation
ChineseDynamo Recipe Runner
Dynamo Recipe 运行器
<!--
SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: CC-BY-4.0
-->
<!--
SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: CC-BY-4.0
-->
Purpose
用途
Get from user intent to a working Dynamo recipe endpoint with minimal back and
forth. Do not create new guide content. Operate on the existing
tree, patch the smallest necessary set of manifests, deploy when the user has
cluster access, and prove success with an OpenAI-compatible smoke request.
recipes/获取用户需求,以最少的沟通成本搭建可用的Dynamo Recipe端点。不创建新的指南内容。基于现有的目录结构操作,仅补丁修复必要的最小化清单文件,当用户拥有集群访问权限时进行部署,并通过兼容OpenAI的冒烟测试验证部署成功。
recipes/Prerequisites
前置条件
- Python 3.10+ on the operator machine.
- configured with a working cluster context.
kubectl - Cluster has a default storage class for model-cache PVCs.
- Hugging Face token stored in a Kubernetes secret named (or equivalent) in the target namespace.
hf-token-secret - Read access to the tree in the ai-dynamo/dynamo repository.
recipes/
- 操作机器上安装Python 3.10及以上版本。
- 已配置可用的集群上下文。
kubectl - 集群拥有用于model-cache PVC的默认存储类。
- Hugging Face令牌存储在目标命名空间中名为(或等效名称)的Kubernetes密钥中。
hf-token-secret - 拥有ai-dynamo/dynamo仓库中目录的读取权限。
recipes/
Required Inputs
必需输入
Collect or infer these before changing manifests:
- recipe target: model, framework (,
vllm,sglang,trtllm), deployment mode, and GPU type/counttokenspeed - Kubernetes context and namespace
- Hugging Face secret name, usually
hf-token-secret - storage class for model cache PVCs
- runtime image tag if the recipe uses a placeholder or stale test image
- whether to run commands or only produce exact commands
If a required value is missing and cannot be inferred from the selected recipe,
ask for only that value.
在修改清单文件前,收集或推断以下信息:
- 目标配置方案:模型、框架(、
vllm、sglang、trtllm)、部署模式以及GPU类型/数量tokenspeed - Kubernetes上下文和命名空间
- Hugging Face密钥名称,通常为
hf-token-secret - 用于模型缓存PVC的存储类
- 若配置方案使用占位符或过时测试镜像,需提供运行时镜像标签
- 是直接执行命令还是仅生成精确命令
若某个必需值缺失且无法从所选配置方案中推断,仅询问该值。
Instructions
操作步骤
1. Preflight
1. 预检
Run read-only checks first:
bash
git status --short
python3 scripts/recipe_tool.py list --format table
kubectl config current-context
kubectl get storageclass
kubectl get nodes -o wide
kubectl get namespace "${NAMESPACE}"
kubectl get secret hf-token-secret -n "${NAMESPACE}"If is unavailable or the cluster is unreachable, continue by
selecting and validating the recipe, then return exact commands instead of
pretending the deployment ran.
kubectl首先执行只读检查:
bash
git status --short
python3 scripts/recipe_tool.py list --format table
kubectl config current-context
kubectl get storageclass
kubectl get nodes -o wide
kubectl get namespace "${NAMESPACE}"
kubectl get secret hf-token-secret -n "${NAMESPACE}"若不可用或集群无法访问,继续选择并验证配置方案,然后返回精确命令而非模拟部署执行。
kubectl2. Select The Recipe
2. 选择配置方案
Use the recipe matrix from and the scanner:
recipes/README.mdbash
python3 scripts/recipe_tool.py list \
--query qwen --framework vllm --mode disagg --format tablePrefer an exact existing recipe. Do not invent new manifests unless the user
explicitly asks to author a new recipe.
使用中的配置方案矩阵和扫描工具:
recipes/README.mdbash
python3 scripts/recipe_tool.py list \
--query qwen --framework vllm --mode disagg --format table优先选择完全匹配的现有配置方案。除非用户明确要求编写新配置方案,否则不要创建新的清单文件。
3. Inspect And Validate
3. 检查与验证
Read the selected recipe README, model-cache manifests, , and
if present. Then run:
deploy.yamlperf.yamlbash
python3 scripts/recipe_tool.py validate \
recipes/<model>/<framework>/<mode>Resolve reported blockers before applying manifests: storage class, model cache
PVC, image tag, HF token secret, GPU count, frontend service name, and router
mode.
阅读所选配置方案的README、model-cache清单、以及(若存在的)。然后执行:
deploy.yamlperf.yamlbash
python3 scripts/recipe_tool.py validate \
recipes/<model>/<framework>/<mode>在应用清单文件前解决报告的阻塞问题:存储类、模型缓存PVC、镜像标签、HF令牌密钥、GPU数量、前端服务名称以及路由模式。
4. Patch Minimal Values
4. 最小化补丁修复
Patch only recipe-specific values needed for this run. Do not reformat whole
YAML files. Common patches:
storageClassName- image repository/tag
- model path or model cache mount path
- GPU resource requests/limits
- frontend
DYN_ROUTER_MODE - namespace only when a manifest hardcodes it
Never write Hugging Face tokens into files or logs. Use Kubernetes secrets.
仅补丁修复本次运行所需的配置方案特定值。不要重新格式化整个YAML文件。常见的补丁修复内容:
storageClassName- 镜像仓库/标签
- 模型路径或模型缓存挂载路径
- GPU资源请求/限制
- 前端
DYN_ROUTER_MODE - 仅当清单文件硬编码命名空间时才修改命名空间
切勿将Hugging Face令牌写入文件或日志。使用Kubernetes密钥存储。
5. Deploy
5. 部署
Follow the selected recipe README when it differs from the default sequence.
The default sequence is:
bash
kubectl apply -f recipes/<model>/model-cache/ -n "${NAMESPACE}"
kubectl wait --for=condition=Complete job/model-download -n "${NAMESPACE}" --timeout=6000s
kubectl apply -f recipes/<model>/<framework>/<mode>/deploy.yaml -n "${NAMESPACE}"
kubectl get dynamographdeployment -n "${NAMESPACE}"
kubectl get pods -n "${NAMESPACE}" -o wideWait for the frontend and workers to be ready before testing.
当所选配置方案的README与默认流程不同时,遵循README中的步骤。默认流程为:
bash
kubectl apply -f recipes/<model>/model-cache/ -n "${NAMESPACE}"
kubectl wait --for=condition=Complete job/model-download -n "${NAMESPACE}" --timeout=6000s
kubectl apply -f recipes/<model>/<framework>/<mode>/deploy.yaml -n "${NAMESPACE}"
kubectl get dynamographdeployment -n "${NAMESPACE}"
kubectl get pods -n "${NAMESPACE}" -o wide在测试前等待前端和工作节点就绪。
6. Smoke Test
6. 冒烟测试
Port-forward the frontend service, then verify and one chat
completion:
/v1/modelsbash
kubectl port-forward svc/<deployment-name>-frontend 8000:8000 -n "${NAMESPACE}"
curl http://127.0.0.1:8000/v1/modelsIf is also installed, prefer its
for the full OpenAI-compatible smoke test. If this fails, switch to
.
dynamo-router-starterscripts/check_router_health.pydynamo-troubleshoot端口转发前端服务,然后验证接口和一次聊天补全请求:
/v1/modelsbash
kubectl port-forward svc/<deployment-name>-frontend 8000:8000 -n "${NAMESPACE}"
curl http://127.0.0.1:8000/v1/models若同时安装了,优先使用其进行完整的兼容OpenAI的冒烟测试。若测试失败,切换至工具。
dynamo-router-starterscripts/check_router_health.pydynamo-troubleshootAvailable Scripts
可用脚本
| Script | Purpose | Arguments |
|---|---|---|
| Enumerate available recipes, optionally filtered | |
| Validate a recipe directory before apply | positional recipe path |
Invoke via the agentskills.io protocol:
run_script()python
run_script("scripts/recipe_tool.py", args=["list", "--framework", "sglang", "--format", "table"])
run_script("scripts/recipe_tool.py", args=["validate", "recipes/nemotron-3-super-fp8/sglang/agg"])| 脚本 | 用途 | 参数 |
|---|---|---|
| 枚举可用配置方案,可按需过滤 | |
| 在应用前验证配置方案目录 | 位置参数:配置方案路径 |
通过agentskills.io的协议调用:
run_script()python
run_script("scripts/recipe_tool.py", args=["list", "--framework", "sglang", "--format", "table"])
run_script("scripts/recipe_tool.py", args=["validate", "recipes/nemotron-3-super-fp8/sglang/agg"])Examples
示例
List sglang recipes that fit a single 8xB200 node:
bash
python3 scripts/recipe_tool.py list --framework sglang --format tableValidate a specific recipe and resolve blockers before applying:
bash
python3 scripts/recipe_tool.py validate recipes/nemotron-3-super-fp8/sglang/aggEquivalent through the agent protocol:
python
run_script("scripts/recipe_tool.py", args=["validate", "recipes/nemotron-3-super-fp8/sglang/agg"])列出适配单台8xB200节点的sglang配置方案:
bash
python3 scripts/recipe_tool.py list --framework sglang --format table验证特定配置方案并在应用前解决阻塞问题:
bash
python3 scripts/recipe_tool.py validate recipes/nemotron-3-super-fp8/sglang/agg通过代理协议的等效调用:
python
run_script("scripts/recipe_tool.py", args=["validate", "recipes/nemotron-3-super-fp8/sglang/agg"])Output Contract
输出约定
Return:
- selected recipe path and why it was selected
- exact values patched
- commands run or commands to run
- endpoint and smoke-test result
- unresolved blockers, if any
- next troubleshooting step when deployment does not become healthy
返回内容:
- 所选配置方案路径及选择原因
- 补丁修复的精确值
- 已执行的命令或待执行的命令
- 端点地址和冒烟测试结果
- (若存在的)未解决阻塞问题
- 当部署无法恢复健康时的下一步排查步骤
Limitations
限制
- Operates on the existing tree only. Does not author new manifests.
recipes/ - Cluster-mutating apply steps require permission to the target namespace.
kubectl - Smoke-test depth is intentionally minimal; for full router/endpoint coverage use .
dynamo-router-starter - Multi-node disagg transport correctness is out of scope; use after deploy.
dynamo-interconnect-check
- 仅基于现有目录结构操作。不编写新的清单文件。
recipes/ - 修改集群的应用步骤需要拥有目标命名空间的权限。
kubectl - 冒烟测试深度故意设置为最小;若需完整的路由/端点覆盖,请使用。
dynamo-router-starter - 多节点分离传输的正确性不在本工具范围内;部署完成后请使用工具。
dynamo-interconnect-check
Troubleshooting
故障排查
| Symptom | Likely cause | Next step |
|---|---|---|
| Context not set or VPN down | Return exact commands instead of running them; resume when cluster is reachable |
| Cluster has no default | Patch |
Model-cache job stuck | PVC unbound or HF secret missing | Inspect PVC events; create or rename the HF secret to match the recipe |
Worker pods | Stale image tag or missing pull secret | Patch the image tag; verify image pull secret in the namespace |
| Frontend not ready or wrong service port | Wait for pods Ready; re-run port-forward; switch to |
| 症状 | 可能原因 | 下一步操作 |
|---|---|---|
| 上下文未设置或VPN断开 | 返回精确命令而非执行;集群恢复后再继续 |
| 集群无默认 | 在应用前补丁修复model-cache清单中的 |
Model-cache作业卡在 | PVC未绑定或HF密钥缺失 | 检查PVC事件;创建或重命名HF密钥以匹配配置方案 |
工作节点Pod出现 | 镜像标签过时或缺失拉取密钥 | 补丁修复镜像标签;验证命名空间中的镜像拉取密钥 |
部署后 | 前端未就绪或服务端口错误 | 等待Pod就绪;重新执行端口转发;若问题持续,切换至 |
Benchmark
基准测试
See for the NVCARPS-EVAL performance report (auto-generated by the NVSkills CI pipeline). To refresh, re-run on an upstream PR touching this skill.
BENCHMARK.md/nvskills-ci查看获取NVCARPS-EVAL性能报告(由NVSkills CI管道自动生成)。如需刷新报告,请在触及本技能的上游PR上重新运行。
BENCHMARK.md/nvskills-ciReferences
参考资料
- Read for command templates and readiness checks.
references/k8s-recipe-workflow.md - Use for recipe discovery and lightweight validation.
scripts/recipe_tool.py
- 阅读获取命令模板和就绪检查说明。
references/k8s-recipe-workflow.md - 使用进行配置方案发现和轻量级验证。
scripts/recipe_tool.py