Recommend and customize Megatron Bridge recipes for a user's model, GPU count, and training goal. Indexes library recipes (pretrain/SFT/PEFT) and performance recipes.
```bash
npx skill4agent add nvidia/skills nemo-mbridge-recipe-recommender
```

```bash
# Pretrain with mock data
uv run python -m torch.distributed.run --nproc_per_node=8 scripts/training/run_recipe.py \
  --recipe <recipe_function_name> \
  --dataset llm-pretrain-mock

# SFT with SQuAD
uv run python -m torch.distributed.run --nproc_per_node=8 scripts/training/run_recipe.py \
  --recipe <recipe_function_name> \
  --dataset llm-finetune

# Override any field via CLI
uv run python -m torch.distributed.run --nproc_per_node=8 scripts/training/run_recipe.py \
  --recipe llama3_8b_pretrain_config \
  --dataset llm-pretrain-mock \
  'model.tensor_model_parallel_size=2' \
  'training.global_batch_size=64'
```

```bash
python scripts/performance/run_script.py \
  --recipe <model_family> \
  --gpu_type h100 \
  --num_gpus 64 \
  --data mock
```

Perf recipes are NOT fully validated for correctness. Most conversations and testing were on mock data. They are designed for upper-bound throughput measurement, not production training. Always validate loss curves and convergence independently.
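The `run_recipe.py` invocations above all share one shape: a launcher prefix, `--recipe`, `--dataset`, then zero or more `key=value` overrides. As a minimal sketch, the helper below assembles that command line from Python. The function name `build_run_recipe_command` and its parameters are hypothetical and not part of Megatron Bridge; only the flags and override keys already shown above are used.

```python
import shlex


def build_run_recipe_command(recipe, dataset, nproc_per_node=8, overrides=None):
    """Return the launch command as an argv list (hypothetical helper).

    Mirrors the shell examples above: launcher prefix, --recipe, --dataset,
    then dotted key=value overrides appended as positional arguments.
    """
    argv = [
        "uv", "run", "python", "-m", "torch.distributed.run",
        f"--nproc_per_node={nproc_per_node}",
        "scripts/training/run_recipe.py",
        "--recipe", recipe,
        "--dataset", dataset,
    ]
    for key, value in (overrides or {}).items():
        argv.append(f"{key}={value}")
    return argv


cmd = build_run_recipe_command(
    "llama3_8b_pretrain_config",
    "llm-pretrain-mock",
    overrides={
        "model.tensor_model_parallel_size": 2,
        "training.global_batch_size": 64,
    },
)
# shlex.join quotes any argument that needs it, so the result can be pasted into a shell.
print(shlex.join(cmd))
```

Building an argv list rather than a single string keeps overrides with unusual characters from being split by the shell; pass the list directly to `subprocess.run` if you launch from Python.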
`scripts/performance/configs/`, `src/megatron/bridge/recipes/<family>/<model>_perf.py`, `llama3_8b_h100_bf16_pretrain_config()`, `WorkloadBaseConfig`, `set_workload_base_configs`, `get_perf_optimized_recipe`, `_benchmark_common()`, `_perf_precision()`, `run_recipe.py`, `src/megatron/bridge/recipes/`, `ConfigContainer`

| Recipe | Mode | TP | PP | CP | SP | GPUs (min) | Seq Len |
|---|---|---|---|---|---|---|---|
| | Pretrain | 2 | 1 | — | — | 2 | 4K |
| | Pretrain | 2 | 1 | — | ✓ | 2 | 8K |
| | Pretrain | 2 | 1 | 2 | ✓ | 4 | 16K |
| | Pretrain | 2 | 1 | 4 | ✓ | 8 | 64K |
| | Pretrain | 2 | 1 | 8 | ✓ | 16 | 128K |
| | Pretrain | 8 | 4 | — | ✓ | 32 | 8K |
| | Pretrain | 8 | 4 | 2 | ✓ | 64 | 16K |
| | Pretrain | 8 | 4 | 4 | ✓ | 128 | 64K |
| | Pretrain | 8 | 16 | — | ✓ | 128 | 8K |
| | SFT | 2 | 1 | — | ✓ | 2 | 8K |
| | SFT | 4 | 4 | — | ✓ | 16 | 8K |
| | SFT | 8 | 8 | — | ✓ | 64 | 8K |
| | PEFT | 1 | 1 | — | — | 1 | 8K |
| | PEFT | 2 | 4 | — | ✓ | 8 | 8K |
| | PEFT | 4 | 8 | — | ✓ | 32 | 8K |
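
The "GPUs (min)" column in the table above is consistent with the product TP × PP × CP, treating "—" as 1. Sequence parallelism (SP) reuses the tensor-parallel group, so it adds no ranks. The sketch below checks that relation against a few rows copied from the table; `min_gpus` is a hypothetical helper name, and the rule is an inference from the table for dense models, not a documented Megatron Bridge API.

```python
def min_gpus(tp, pp, cp=1):
    """Smallest GPU count that holds one model replica (data-parallel size 1)."""
    return tp * pp * cp


# (TP, PP, CP, GPUs (min)) rows copied from the table above; CP "—" is written as 1.
table_rows = [
    (2, 1, 1, 2),     # Pretrain, 4K
    (2, 1, 2, 4),     # Pretrain, 16K
    (2, 1, 8, 16),    # Pretrain, 128K
    (8, 4, 4, 128),   # Pretrain, 64K
    (8, 16, 1, 128),  # Pretrain, 8K
    (4, 4, 1, 16),    # SFT, 8K
    (1, 1, 1, 1),     # PEFT, 8K
    (4, 8, 1, 32),    # PEFT, 8K
]

for tp, pp, cp, expected in table_rows:
    assert min_gpus(tp, pp, cp) == expected

print("all rows match TP x PP x CP")
```

Any GPU count above the minimum should be a multiple of it; the extra factor becomes the data-parallel size.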
| Recipe | Mode | TP | PP | Sizes |
|---|---|---|---|---|
| | All | 1–8 | 1–4 | 500M, 1.5B, 7B, 14B, 32B, 72B |
| | All | 1–8 | 1–4 | 500M, 1.5B, 3B, 7B, 14B, 32B, 72B |
| Recipe | Mode | TP | PP | CP | Sizes |
|---|---|---|---|---|---|
| | Pretrain | 1–8 | 1–2 | — | 600M–32B |
| | SFT | 1–8 | 1–2 | — | 600M–32B |
| | SFT | 1 | 1 | 8 | 600M (128K seq) |
| | PEFT | 1 | 1 | — | 600M–32B |
| Recipe | Mode | TP | PP | EP | CP | GPUs |
|---|---|---|---|---|---|---|
| | Pretrain | 1 | 1 | 8 | — | 8 |
| | SFT | 1 | 1 | 8 | — | 8 |
| | PEFT | 1 | 1 | 1 | — | 1 |
| | Pretrain | 4 | 16 | 8 | 2 | 512+ |
| | SFT | 4 | 8 | 8 | — | 256 |
| | PEFT | 1 | 4 | 4 | — | 16 |
| Recipe | Mode | TP | PP | EP |
|---|---|---|---|---|
| | Pretrain | 1 | 4 | 8 |
| | SFT | 1 | 2 | 8 |
| | PEFT | 1 | 1 | 4 |
| Recipe | Mode | TP | PP | EP | GPUs |
|---|---|---|---|---|---|
| | Pretrain | 1 | 1 | 8 | 8 |
| | Pretrain | 1 | 4 | 32 | 128 |
| | Pretrain | 2 | 16 | 64 | 2048 |
| | Pretrain | 2 | 8 | 32 | 256 |
| Recipe | Mode | TP | PP | EP | GPUs |
|---|---|---|---|---|---|
| | Pretrain | 2 | 8 | 16 | 256 |
| | Pretrain | 1 | 4 | 8 | 32 |
| | SFT | 2 | 8 | 16 | 256 |
| | SFT | 1 | 4 | 8 | 32 |
| | PEFT | 2 | 4 | 4 | 32 |
| | PEFT | 1 | 2 | 4 | 8 |
| Recipe | Mode | TP | PP | Sizes |
|---|---|---|---|---|
| | All | 2–8 | 1–2 | 2B, 9B, 27B |
| | All | 1 | 1 | 1B (32K seq) |
| Recipe | Mode | TP | PP | EP | Notes |
|---|---|---|---|---|---|
| | P/S/PEFT | 1–8 | 1–4 | — | Dense SSM-hybrid |
| | P/S/PEFT | varies | 1 | 8 | MoE + Mamba |
| | P/S/PEFT | 4 | 1 | 8 | MoE + Mamba, ~40% CUDA graph gain |
| | P/S/PEFT | varies | 1 | — | Dense |
| Recipe | Mode | Notes |
|---|---|---|
| | All | MoE EP=8 |
| | All | MoE EP=8 |
| | SFT/PEFT | Dense |
| | All | MoE + FP8/MXFP8 variants |
| | All | MoE |
| | Pretrain | MLM/Bridge parity baseline |
| | Pretrain | TP=4, PP=8, VP=6 |
| | Pretrain | 1T MoE, TP=2 PP=16 EP=32 |
| Recipe | Mode | TP | PP | EP | GPUs |
|---|---|---|---|---|---|
| | SFT/PEFT | 1–8 | 1–2 | — | 1–16 |
| | SFT/PEFT | 1–8 | 1–4 | — | 1–32 |
| | SFT/PEFT | 1–4 | 1–8 | 1–32 | 1–512 |
| | SFT/PEFT | varies | varies | varies | varies |
| | SFT/PEFT | 1 | 8 | 4–16 | 64–512 |
| | SFT/PEFT | 2–4 | 1 | — | 8 |
| Recipe | Mode | TP | CP |
|---|---|---|---|
| | P/SFT | 1 | 8 |
| | P/SFT | 2 | 4 |
| | P/SFT | 2 | 1 |
`scripts/performance/run_script.py`, `WorkloadBaseConfig`

Important: Perf recipes are designed for upper-bound throughput benchmarks, not production training. They run 50 iterations on mock data by default. Throughput numbers are aspirational targets, not validated convergence configs.

| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| Llama 3 8B | 8 | H100, B200, B300, GB200, GB300, R100 | CUDA graphs (local), FSDP on GB variants |
| Llama 3 70B | 64 | H100, B200, B300, GB200, GB300 | TP comm overlap (userbuffers), FSDP, CUDA graphs |
| Llama 3.1 405B | 128–1024 | H100, B200, B300, GB200, GB300 | TP+CP comm overlap (userbuffers), FSDP, heavy PP/VP |
| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| DeepSeek V3 (671B MoE) | 256–1024 | H100, B200, B300, GB200, GB300 | HybridEP dispatcher, MLA recompute, CUDA graphs (TE scoped) |
| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| Qwen3 30B-A3B | 8–16 | H100, B200, B300, GB200, GB300 | MoE alltoall/flex dispatcher |
| Qwen3 235B-A22B | 64–256 | H100, B200, B300, GB200, GB300 | TP comm overlap, CUDA graphs, MoE a2a overlap |
| Qwen3-Next 80B-A3B | 64–128 | H100, B200, B300, GB200, GB300 | EP 64–128 |
| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| Qwen3-VL 30B-A3B | 8–16 | H100, B200, B300, GB200, GB300 | VLM + MoE |
| Qwen3-VL 235B-A22B | 64–256 | H100, B200, B300, GB200, GB300 | VLM + MoE, TP comm overlap |
| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| Kimi K2 (1T MoE) | 256–1024 | H100, B200, B300, GB200, GB300 | Muon/Adam optimizer, HybridEP, pipeline layout helpers |
| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| Nemotron 3 Nano (30B MoE+Mamba) | 8–16 | H100, B200, B300, GB200, GB300 | TE CUDA graphs (attn+mamba+moe), HybridEP |
| Nemotron 3 Super | 64 | H100, B200, B300, GB200, GB300 | TE CUDA graphs, EP=64 |
| NemotronH 56B | 64 | H100, B200, B300 | TP=2–8, TE graphs (mamba+attn) |
| Model | GPUs | GPU Types | Key Features |
|---|---|---|---|
| GPT-OSS 120B | 64 | H100, B200, GB200 | EP=64, HybridEP on GB200 |
```
User wants to train a model
│
├─ Know the model name?
│   ├─ Yes → Look up in Library Recipe Index above
│   │   ├─ Has a recipe for their size + mode? → Use it directly
│   │   └─ No exact match? → Use closest size, adjust parallelism
│   └─ No → Ask for model name, size, and HF model ID
│
├─ What's the training goal?
│   ├─ Pretrain → Use *_pretrain_config
│   ├─ SFT (full fine-tune) → Use *_sft_config
│   └─ PEFT (LoRA/DoRA) → Use *_peft_config (lowest GPU requirement)
│
├─ How many GPUs?
│   ├─ 1 GPU → Only PEFT recipes work (TP=1, PP=1)
│   ├─ 8 GPUs (1 node) → Most 8B–16B models, small MoE (EP=8)
│   ├─ 16–64 GPUs → 70B dense, medium MoE
│   └─ 128+ GPUs → 405B+, large MoE (DeepSeek V3, Kimi K2)
│
├─ Want throughput benchmarks?
│   ├─ Yes → Use perf recipes (scripts/performance/)
│   │   └─ ⚠️ These run on mock data for upper-bound perf only
│   └─ No → Use library recipes (scripts/training/run_recipe.py)
│
└─ Long context?
    ├─ > 8K → Need CP (context parallelism), check *_16k / *_64k / *_128k variants
    └─ ≤ 8K → Default recipes work
```

`num_key_value_heads`, `num_key_value_heads=8`, `cp_comm_type`, `a2a+p2p`, `micro_batch_size`, `global_batch_size`, `GBS = micro_batch_size × DP × gradient_accumulation_steps`, `micro_batch_size=1`

| Pitfall | Symptom | Fix |
|---|---|---|
| TP > num_kv_heads | Crash: "TP must divide num_query_groups" | Reduce TP to a divisor of num_kv_heads |
| PP without VP | Poor throughput (large bubble) | Set |
| EP too low for large MoE | OOM on expert params | Increase EP; each EP rank holds num_experts/EP experts |
| CUDA graphs + packed sequences | Assert: "CUDA graph accepts only Tensor inputs" | Disable packing or use |
| CUDA graphs + full recompute | Assert: "full recompute only with full iteration CUDA graph" | Disable recompute or switch to |
| | Assert on provider init when CUDA graphs enabled | Set |
| FSDP + TP > 1 on H100 | Possible comm bottleneck | Prefer FSDP with TP=1 or TP=2 on H100; FSDP shines on GB/B-series |
| Long context without CP | OOM on activations | Add CP=2/4/8; use |
| MoE | May hurt perf (False in many H100 presets) | Set |
| VLM SFT missing image data | Runs but produces garbage | Provide actual multimodal dataset or use mock VLM data |
| Qwen35-VL MoE FSDP | Tested on Blackwell only | May not work on H100; validate first |
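
Two of the constraints in this section can be checked before launching a job: TP must divide the number of KV heads, and `GBS = micro_batch_size × DP × gradient_accumulation_steps`. The sketch below is a minimal pre-flight check under one stated assumption: for a dense model, DP is taken as `world_size / (TP × PP × CP)`, which is the usual Megatron-style decomposition but is not spelled out in this document. The function name `check_parallelism` is hypothetical.

```python
def check_parallelism(world_size, tp, pp, cp, num_kv_heads,
                      micro_batch_size, global_batch_size):
    """Validate a dense-model layout and return (dp, gradient_accumulation_steps).

    Assumption: DP = world_size / (TP * PP * CP). MoE expert parallelism
    is not modeled here.
    """
    if num_kv_heads % tp != 0:
        raise ValueError(f"TP={tp} must divide num_kv_heads={num_kv_heads}")

    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not a multiple of TP*PP*CP={model_parallel}"
        )
    dp = world_size // model_parallel

    samples_per_micro_step = micro_batch_size * dp
    if global_batch_size % samples_per_micro_step != 0:
        raise ValueError(
            f"global_batch_size={global_batch_size} is not a multiple of "
            f"micro_batch_size*DP={samples_per_micro_step}"
        )
    return dp, global_batch_size // samples_per_micro_step


# 8 GPUs, TP=2, 8 KV heads, micro batch 1, global batch 64
dp, grad_accum = check_parallelism(
    world_size=8, tp=2, pp=1, cp=1,
    num_kv_heads=8, micro_batch_size=1, global_batch_size=64,
)
print(dp, grad_accum)  # 4 16
```

Raising early on a bad layout is cheaper than waiting for the job to crash during model initialization on a multi-node allocation.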
```bash
# Scale Llama3 8B from 2 GPUs to 8 GPUs (increase DP)
uv run python -m torch.distributed.run --nproc_per_node=8 scripts/training/run_recipe.py \
  --recipe llama3_8b_pretrain_config \
  --dataset llm-pretrain-mock

# Reduce parallelism for Qwen3-MoE 30B to fit on 4 GPUs
uv run python -m torch.distributed.run --nproc_per_node=4 scripts/training/run_recipe.py \
  --recipe qwen3_30b_a3b_sft_config \
  --dataset llm-finetune \
  'model.expert_model_parallel_size=4'

# Add long context to an existing recipe
uv run python -m torch.distributed.run --nproc_per_node=8 scripts/training/run_recipe.py \
  --recipe llama3_8b_pretrain_config \
  --dataset llm-pretrain-mock \
  'model.seq_length=32768' \
  'model.context_parallel_size=4'

# Enable CUDA graphs on any recipe
uv run python -m torch.distributed.run --nproc_per_node=8 scripts/training/run_recipe.py \
  --recipe qwen3_30b_a3b_pretrain_config \
  --dataset llm-pretrain-mock \
  'model.cuda_graph_impl=transformer_engine' \
  'model.cuda_graph_scope=[attn,moe_router,moe_preprocess]' \
  'model.use_te_rng_tracker=True' \
  'rng.te_rng_tracker=True'
```

| I want to... | Start with | GPUs needed |
|---|---|---|
| Try Bridge for the first time | | 2 |
| Fine-tune a 7-8B model | | 2–8 |
| LoRA on 1 GPU | | 1 |
| Pretrain a dense 70B | | 32–64 |
| Train a small MoE | | 8 |
| Train a large MoE (235B+) | | 256–512 |
| Benchmark throughput | Perf recipes via | Varies |
| Long-context training | | 16+ |
| VLM fine-tuning | | 4–8 |
| Diffusion training | | 8 |
| What | Path |
|---|---|
| Library recipes root | |
| Recipe | |
| Common recipe helpers | |
| Training entry point | |
| Perf recipes root | |
| Perf entry point | |
| Perf workload configs | |
| Perf overrides (benchmark defaults) | |