Dynamo Recipe Runner

Purpose

Get from user intent to a working Dynamo recipe endpoint with minimal back and forth. Do not create new guide content. Operate on the existing

recipes/

tree, patch the smallest necessary set of manifests, deploy when the user has cluster access, and prove success with an OpenAI-compatible smoke request.

Prerequisites

Python 3.10+ on the operator machine.
```
kubectl
```
configured with a working cluster context.
Cluster has a default storage class for model-cache PVCs.
Hugging Face token stored in a Kubernetes secret named
```
hf-token-secret
```
(or equivalent) in the target namespace.
Read access to the
```
recipes/
```
tree in the ai-dynamo/dynamo repository.

Required Inputs

Collect or infer these before changing manifests:

recipe target: model, framework (
```
vllm
```
,
```
sglang
```
,
```
trtllm
```
,
```
tokenspeed
```
), deployment mode, and GPU type/count
Kubernetes context and namespace
Hugging Face secret name, usually
```
hf-token-secret
```
storage class for model cache PVCs
runtime image tag if the recipe uses a placeholder or stale test image
whether to run commands or only produce exact commands

If a required value is missing and cannot be inferred from the selected recipe, ask for only that value.

Instructions

1. Preflight

Run read-only checks first:

bash

git status --short
python3 scripts/recipe_tool.py list --format table
kubectl config current-context
kubectl get storageclass
kubectl get nodes -o wide
kubectl get namespace "${NAMESPACE}"
kubectl get secret hf-token-secret -n "${NAMESPACE}"

kubectl

is unavailable or the cluster is unreachable, continue by selecting and validating the recipe, then return exact commands instead of pretending the deployment ran.

2. Select The Recipe

Use the recipe matrix from

recipes/README.md

and the scanner:

bash

python3 scripts/recipe_tool.py list \
  --query qwen --framework vllm --mode disagg --format table

Prefer an exact existing recipe. Do not invent new manifests unless the user explicitly asks to author a new recipe.

3. Inspect And Validate

Read the selected recipe README, model-cache manifests,

deploy.yaml

, and

perf.yaml

if present. Then run:

bash

python3 scripts/recipe_tool.py validate \
  recipes/<model>/<framework>/<mode>

Resolve reported blockers before applying manifests: storage class, model cache PVC, image tag, HF token secret, GPU count, frontend service name, and router mode.

4. Patch Minimal Values

Patch only recipe-specific values needed for this run. Do not reformat whole YAML files. Common patches:

```
storageClassName
```
image repository/tag
model path or model cache mount path
GPU resource requests/limits
frontend
```
DYN_ROUTER_MODE
```
namespace only when a manifest hardcodes it

Never write Hugging Face tokens into files or logs. Use Kubernetes secrets.

5. Deploy

Follow the selected recipe README when it differs from the default sequence. The default sequence is:

bash

kubectl apply -f recipes/<model>/model-cache/ -n "${NAMESPACE}"
kubectl wait --for=condition=Complete job/model-download -n "${NAMESPACE}" --timeout=6000s
kubectl apply -f recipes/<model>/<framework>/<mode>/deploy.yaml -n "${NAMESPACE}"
kubectl get dynamographdeployment -n "${NAMESPACE}"
kubectl get pods -n "${NAMESPACE}" -o wide

Wait for the frontend and workers to be ready before testing.

6. Smoke Test

Port-forward the frontend service, then verify

/v1/models

and one chat completion:

bash

kubectl port-forward svc/<deployment-name>-frontend 8000:8000 -n "${NAMESPACE}"
curl http://127.0.0.1:8000/v1/models

dynamo-router-starter

is also installed, prefer its

scripts/check_router_health.py

for the full OpenAI-compatible smoke test. If this fails, switch to

dynamo-troubleshoot

Available Scripts

Script	Purpose	Arguments
`scripts/recipe_tool.py list`	Enumerate available recipes, optionally filtered	`--query` , `--framework` , `--mode` , `--format`
`scripts/recipe_tool.py validate`	Validate a recipe directory before apply	positional recipe path

Invoke via the agentskills.io

run_script()

protocol:

python

run_script("scripts/recipe_tool.py", args=["list", "--framework", "sglang", "--format", "table"])
run_script("scripts/recipe_tool.py", args=["validate", "recipes/nemotron-3-super-fp8/sglang/agg"])

Examples

List sglang recipes that fit a single 8xB200 node:

bash

python3 scripts/recipe_tool.py list --framework sglang --format table

Validate a specific recipe and resolve blockers before applying:

bash

python3 scripts/recipe_tool.py validate recipes/nemotron-3-super-fp8/sglang/agg

Equivalent through the agent protocol:

python

run_script("scripts/recipe_tool.py", args=["validate", "recipes/nemotron-3-super-fp8/sglang/agg"])

Output Contract

Return:

selected recipe path and why it was selected
exact values patched
commands run or commands to run
endpoint and smoke-test result
unresolved blockers, if any
next troubleshooting step when deployment does not become healthy

Limitations

Operates on the existing
```
recipes/
```
tree only. Does not author new manifests.
Cluster-mutating apply steps require
```
kubectl
```
permission to the target namespace.
Smoke-test depth is intentionally minimal; for full router/endpoint coverage use
```
dynamo-router-starter
```
.
Multi-node disagg transport correctness is out of scope; use
```
dynamo-interconnect-check
```
after deploy.

Troubleshooting

Symptom	Likely cause	Next step
`kubectl` cluster unreachable	Context not set or VPN down	Return exact commands instead of running them; resume when cluster is reachable
`validate` reports missing storage class	Cluster has no default `StorageClass`	Patch `storageClassName` on the model-cache manifest before applying
Model-cache job stuck `Pending`	PVC unbound or HF secret missing	Inspect PVC events; create or rename the HF secret to match the recipe
Worker pods `ImagePullBackOff`	Stale image tag or missing pull secret	Patch the image tag; verify image pull secret in the namespace
`/v1/models` 4xx/5xx after deploy	Frontend not ready or wrong service port	Wait for pods Ready; re-run port-forward; switch to `dynamo-troubleshoot` if it persists

Benchmark

See

BENCHMARK.md

for the NVCARPS-EVAL performance report (auto-generated by the NVSkills CI pipeline). To refresh, re-run

/nvskills-ci

on an upstream PR touching this skill.

References

Read
```
references/k8s-recipe-workflow.md
```
for command templates and readiness checks.
Use
```
scripts/recipe_tool.py
```
for recipe discovery and lightweight validation.

dynamo-recipe-runner

NPX Install

Tags

SKILL.md Content

Dynamo Recipe Runner

Purpose

Prerequisites

Required Inputs

Instructions

1. Preflight

2. Select The Recipe

3. Inspect And Validate

4. Patch Minimal Values

5. Deploy

6. Smoke Test

Available Scripts

Examples

Output Contract

Limitations

Troubleshooting

Benchmark

References