Search Results: megatron-bridge

Found 42 Skills

nemo-mbridge-perf-moe-comm-overlap

MoE expert-parallel communication overlap in Megatron Bridge. Covers dispatch/combine overlap, flex dispatcher backends, and expert wgrad scheduling.

Testing & QA | nvidia/skills

testing

Testing reference for Megatron Bridge — unit and functional test layout, tier semantics (L0/L1/L2/flaky), script conventions, running tests locally, adding/moving/disabling tests, and pytest conventions.

AI & Machine Learning | nvidia/skills

parity-testing

Structured framework for verifying numerical parity of HF<->MCore weight conversions. References existing tools and the add-model-support skill.

AI & Machine Learning | promptingcompany/nv-skill...

nemo-mbridge-perf-activation-recompute

Validate and use selective and full activation recompute in Megatron Bridge to reduce GPU memory usage at the cost of extra compute.

AI & Machine Learning | nvidia/skills

recipe-recommender

Recommend and customize Megatron Bridge recipes for a user's model, GPU count, and training goal. Indexes library recipes (pretrain/SFT/PEFT) and performance recipes.

DevOps & Cloud Services | nvidia/skills

build-and-dependency

Dev environment setup for Megatron Bridge — container-based development, uv package management, lockfile regeneration, adding dependencies, Slurm container usage, and common build pitfalls.

AI & Machine Learning | nvidia/skills

perf-cpu-offloading

Validate and use CPU offloading in Megatron Bridge, including layer-level activation offloading and fractional optimizer state offloading with HybridDeviceOptimizer.

AI & Machine Learning | nvidia/skills

perf-activation-recompute

Validate and use selective and full activation recompute in Megatron Bridge to reduce GPU memory usage at the cost of extra compute.

AI & Machine Learning | nvidia/skills

nemo-mbridge-resiliency

Resiliency features in Megatron Bridge including fault tolerance, straggler detection, in-process restart, preemption, and re-run state machine.

AI & Machine Learning | nvidia/skills

perf-cuda-graphs

Validate and use CUDA graph capture in Megatron Bridge, including local full-iteration graphs and Transformer Engine scoped graphs for attention, MLP, and MoE modules.

AI & Machine Learning | nvidia/skills

resiliency

Resiliency features in Megatron Bridge including fault tolerance, straggler detection, in-process restart, preemption, and re-run state machine.

AI & Machine Learning | nvidia/skills

perf-memory-tuning

Techniques for reducing peak GPU memory in Megatron Bridge — expandable segments, parallelism resizing, activation recompute, CPU offloading constraints, and common OOM fixes.
