dynamo-recipe-runner

Compare original and translation side by side

🇺🇸

Original

English

🇨🇳

Translation

Chinese

Dynamo Recipe Runner

Dynamo Recipe 运行器

Purpose

用途

Get from user intent to a working Dynamo recipe endpoint with minimal back and forth. Do not create new guide content. Operate on the existing

recipes/

tree, patch the smallest necessary set of manifests, deploy when the user has cluster access, and prove success with an OpenAI-compatible smoke request.

获取用户需求，以最少的沟通成本搭建可用的Dynamo Recipe端点。不创建新的指南内容。基于现有的

recipes/

目录结构操作，仅补丁修复必要的最小化清单文件，当用户拥有集群访问权限时进行部署，并通过兼容OpenAI的冒烟测试验证部署成功。

Prerequisites

前置条件

Python 3.10+ on the operator machine.
```
kubectl
```
configured with a working cluster context.
Cluster has a default storage class for model-cache PVCs.
Hugging Face token stored in a Kubernetes secret named
```
hf-token-secret
```
(or equivalent) in the target namespace.
Read access to the
```
recipes/
```
tree in the ai-dynamo/dynamo repository.

操作机器上安装Python 3.10及以上版本。
```
kubectl
```
已配置可用的集群上下文。
集群拥有用于model-cache PVC的默认存储类。
Hugging Face令牌存储在目标命名空间中名为
```
hf-token-secret
```
（或等效名称）的Kubernetes密钥中。
拥有ai-dynamo/dynamo仓库中
```
recipes/
```
目录的读取权限。

Required Inputs

必需输入

Collect or infer these before changing manifests:

recipe target: model, framework (
```
vllm
```
,
```
sglang
```
,
```
trtllm
```
,
```
tokenspeed
```
), deployment mode, and GPU type/count
Kubernetes context and namespace
Hugging Face secret name, usually
```
hf-token-secret
```
storage class for model cache PVCs
runtime image tag if the recipe uses a placeholder or stale test image
whether to run commands or only produce exact commands

If a required value is missing and cannot be inferred from the selected recipe, ask for only that value.

在修改清单文件前，收集或推断以下信息：

目标配置方案：模型、框架（
```
vllm
```
、
```
sglang
```
、
```
trtllm
```
、
```
tokenspeed
```
）、部署模式以及GPU类型/数量
Kubernetes上下文和命名空间
Hugging Face密钥名称，通常为
```
hf-token-secret
```
用于模型缓存PVC的存储类
若配置方案使用占位符或过时测试镜像，需提供运行时镜像标签
是直接执行命令还是仅生成精确命令

若某个必需值缺失且无法从所选配置方案中推断，仅询问该值。

Instructions

操作步骤

1. Preflight

1. 预检

Run read-only checks first:

bash

git status --short
python3 scripts/recipe_tool.py list --format table
kubectl config current-context
kubectl get storageclass
kubectl get nodes -o wide
kubectl get namespace "${NAMESPACE}"
kubectl get secret hf-token-secret -n "${NAMESPACE}"

kubectl

is unavailable or the cluster is unreachable, continue by selecting and validating the recipe, then return exact commands instead of pretending the deployment ran.

首先执行只读检查：

bash

git status --short
python3 scripts/recipe_tool.py list --format table
kubectl config current-context
kubectl get storageclass
kubectl get nodes -o wide
kubectl get namespace "${NAMESPACE}"
kubectl get secret hf-token-secret -n "${NAMESPACE}"

若

kubectl

不可用或集群无法访问，继续选择并验证配置方案，然后返回精确命令而非模拟部署执行。

2. Select The Recipe

2. 选择配置方案

Use the recipe matrix from

recipes/README.md

and the scanner:

bash

python3 scripts/recipe_tool.py list \
  --query qwen --framework vllm --mode disagg --format table

Prefer an exact existing recipe. Do not invent new manifests unless the user explicitly asks to author a new recipe.

使用

recipes/README.md

中的配置方案矩阵和扫描工具：

bash

python3 scripts/recipe_tool.py list \
  --query qwen --framework vllm --mode disagg --format table

优先选择完全匹配的现有配置方案。除非用户明确要求编写新配置方案，否则不要创建新的清单文件。

3. Inspect And Validate

3. 检查与验证

Read the selected recipe README, model-cache manifests,

deploy.yaml

, and

perf.yaml

if present. Then run:

bash

python3 scripts/recipe_tool.py validate \
  recipes/<model>/<framework>/<mode>

Resolve reported blockers before applying manifests: storage class, model cache PVC, image tag, HF token secret, GPU count, frontend service name, and router mode.

阅读所选配置方案的README、model-cache清单、

deploy.yaml

以及（若存在的）

perf.yaml

。然后执行：

bash

python3 scripts/recipe_tool.py validate \
  recipes/<model>/<framework>/<mode>

在应用清单文件前解决报告的阻塞问题：存储类、模型缓存PVC、镜像标签、HF令牌密钥、GPU数量、前端服务名称以及路由模式。

4. Patch Minimal Values

4. 最小化补丁修复

Patch only recipe-specific values needed for this run. Do not reformat whole YAML files. Common patches:

```
storageClassName
```
image repository/tag
model path or model cache mount path
GPU resource requests/limits
frontend
```
DYN_ROUTER_MODE
```
namespace only when a manifest hardcodes it

Never write Hugging Face tokens into files or logs. Use Kubernetes secrets.

仅补丁修复本次运行所需的配置方案特定值。不要重新格式化整个YAML文件。常见的补丁修复内容：

```
storageClassName
```
镜像仓库/标签
模型路径或模型缓存挂载路径
GPU资源请求/限制
前端
```
DYN_ROUTER_MODE
```
仅当清单文件硬编码命名空间时才修改命名空间

切勿将Hugging Face令牌写入文件或日志。使用Kubernetes密钥存储。

5. Deploy

5. 部署

Follow the selected recipe README when it differs from the default sequence. The default sequence is:

bash

kubectl apply -f recipes/<model>/model-cache/ -n "${NAMESPACE}"
kubectl wait --for=condition=Complete job/model-download -n "${NAMESPACE}" --timeout=6000s
kubectl apply -f recipes/<model>/<framework>/<mode>/deploy.yaml -n "${NAMESPACE}"
kubectl get dynamographdeployment -n "${NAMESPACE}"
kubectl get pods -n "${NAMESPACE}" -o wide

Wait for the frontend and workers to be ready before testing.

当所选配置方案的README与默认流程不同时，遵循README中的步骤。默认流程为：

bash

kubectl apply -f recipes/<model>/model-cache/ -n "${NAMESPACE}"
kubectl wait --for=condition=Complete job/model-download -n "${NAMESPACE}" --timeout=6000s
kubectl apply -f recipes/<model>/<framework>/<mode>/deploy.yaml -n "${NAMESPACE}"
kubectl get dynamographdeployment -n "${NAMESPACE}"
kubectl get pods -n "${NAMESPACE}" -o wide

在测试前等待前端和工作节点就绪。

6. Smoke Test

6. 冒烟测试

Port-forward the frontend service, then verify

/v1/models

and one chat completion:

bash

kubectl port-forward svc/<deployment-name>-frontend 8000:8000 -n "${NAMESPACE}"
curl http://127.0.0.1:8000/v1/models

dynamo-router-starter

is also installed, prefer its

scripts/check_router_health.py

for the full OpenAI-compatible smoke test. If this fails, switch to

dynamo-troubleshoot

端口转发前端服务，然后验证

/v1/models

接口和一次聊天补全请求：

bash

kubectl port-forward svc/<deployment-name>-frontend 8000:8000 -n "${NAMESPACE}"
curl http://127.0.0.1:8000/v1/models

若同时安装了

dynamo-router-starter

，优先使用其

scripts/check_router_health.py

进行完整的兼容OpenAI的冒烟测试。若测试失败，切换至

dynamo-troubleshoot

工具。

Available Scripts

可用脚本

Script	Purpose	Arguments
`scripts/recipe_tool.py list`	Enumerate available recipes, optionally filtered	`--query` , `--framework` , `--mode` , `--format`
`scripts/recipe_tool.py validate`	Validate a recipe directory before apply	positional recipe path

Invoke via the agentskills.io

run_script()

protocol:

python

run_script("scripts/recipe_tool.py", args=["list", "--framework", "sglang", "--format", "table"])
run_script("scripts/recipe_tool.py", args=["validate", "recipes/nemotron-3-super-fp8/sglang/agg"])

脚本	用途	参数
`scripts/recipe_tool.py list`	枚举可用配置方案，可按需过滤	`--query` , `--framework` , `--mode` , `--format`
`scripts/recipe_tool.py validate`	在应用前验证配置方案目录	位置参数：配置方案路径

通过agentskills.io的

run_script()

协议调用：

python

run_script("scripts/recipe_tool.py", args=["list", "--framework", "sglang", "--format", "table"])
run_script("scripts/recipe_tool.py", args=["validate", "recipes/nemotron-3-super-fp8/sglang/agg"])

Examples

示例

List sglang recipes that fit a single 8xB200 node:

bash

python3 scripts/recipe_tool.py list --framework sglang --format table

Validate a specific recipe and resolve blockers before applying:

bash

python3 scripts/recipe_tool.py validate recipes/nemotron-3-super-fp8/sglang/agg

Equivalent through the agent protocol:

python

run_script("scripts/recipe_tool.py", args=["validate", "recipes/nemotron-3-super-fp8/sglang/agg"])

列出适配单台8xB200节点的sglang配置方案：

bash

python3 scripts/recipe_tool.py list --framework sglang --format table

验证特定配置方案并在应用前解决阻塞问题：

bash

python3 scripts/recipe_tool.py validate recipes/nemotron-3-super-fp8/sglang/agg

通过代理协议的等效调用：

python

run_script("scripts/recipe_tool.py", args=["validate", "recipes/nemotron-3-super-fp8/sglang/agg"])

Output Contract

输出约定

Return:

selected recipe path and why it was selected
exact values patched
commands run or commands to run
endpoint and smoke-test result
unresolved blockers, if any
next troubleshooting step when deployment does not become healthy

返回内容：

所选配置方案路径及选择原因
补丁修复的精确值
已执行的命令或待执行的命令
端点地址和冒烟测试结果
（若存在的）未解决阻塞问题
当部署无法恢复健康时的下一步排查步骤

Limitations

限制

Operates on the existing
```
recipes/
```
tree only. Does not author new manifests.
Cluster-mutating apply steps require
```
kubectl
```
permission to the target namespace.
Smoke-test depth is intentionally minimal; for full router/endpoint coverage use
```
dynamo-router-starter
```
.
Multi-node disagg transport correctness is out of scope; use
```
dynamo-interconnect-check
```
after deploy.

仅基于现有
```
recipes/
```
目录结构操作。不编写新的清单文件。
修改集群的应用步骤需要
```
kubectl
```
拥有目标命名空间的权限。
冒烟测试深度故意设置为最小；若需完整的路由/端点覆盖，请使用
```
dynamo-router-starter
```
。
多节点分离传输的正确性不在本工具范围内；部署完成后请使用
```
dynamo-interconnect-check
```
工具。

Troubleshooting

故障排查

Symptom	Likely cause	Next step
`kubectl` cluster unreachable	Context not set or VPN down	Return exact commands instead of running them; resume when cluster is reachable
`validate` reports missing storage class	Cluster has no default `StorageClass`	Patch `storageClassName` on the model-cache manifest before applying
Model-cache job stuck `Pending`	PVC unbound or HF secret missing	Inspect PVC events; create or rename the HF secret to match the recipe
Worker pods `ImagePullBackOff`	Stale image tag or missing pull secret	Patch the image tag; verify image pull secret in the namespace
`/v1/models` 4xx/5xx after deploy	Frontend not ready or wrong service port	Wait for pods Ready; re-run port-forward; switch to `dynamo-troubleshoot` if it persists

症状	可能原因	下一步操作
`kubectl` 无法连接集群	上下文未设置或VPN断开	返回精确命令而非执行；集群恢复后再继续
`validate` 报告缺失存储类	集群无默认 `StorageClass`	在应用前补丁修复model-cache清单中的 `storageClassName`
Model-cache作业卡在 `Pending` 状态	PVC未绑定或HF密钥缺失	检查PVC事件；创建或重命名HF密钥以匹配配置方案
工作节点Pod出现 `ImagePullBackOff`	镜像标签过时或缺失拉取密钥	补丁修复镜像标签；验证命名空间中的镜像拉取密钥
部署后 `/v1/models` 返回4xx/5xx错误	前端未就绪或服务端口错误	等待Pod就绪；重新执行端口转发；若问题持续，切换至 `dynamo-troubleshoot` 工具

Benchmark

基准测试

See

BENCHMARK.md

for the NVCARPS-EVAL performance report (auto-generated by the NVSkills CI pipeline). To refresh, re-run

/nvskills-ci

on an upstream PR touching this skill.

查看

BENCHMARK.md

获取NVCARPS-EVAL性能报告（由NVSkills CI管道自动生成）。如需刷新报告，请在触及本技能的上游PR上重新运行

/nvskills-ci

。

References

参考资料

Read
```
references/k8s-recipe-workflow.md
```
for command templates and readiness checks.
Use
```
scripts/recipe_tool.py
```
for recipe discovery and lightweight validation.

阅读
```
references/k8s-recipe-workflow.md
```
获取命令模板和就绪检查说明。
使用
```
scripts/recipe_tool.py
```
进行配置方案发现和轻量级验证。