Loading...
Loading...
Found 98 Skills
Production-ready reinforcement learning algorithms (PPO, SAC, DQN, TD3, DDPG, A2C) with scikit-learn-like API. Use for standard RL experiments, quick prototyping, and well-documented algorithm implementations. Best for single-agent RL with Gymnasium environments. For high-performance parallel training, multi-agent systems, or custom vectorized environments, use pufferlib instead.
Techniques for reducing peak GPU memory in Megatron Bridge — expandable segments, parallelism resizing, activation recompute, CPU offloading constraints, and common OOM fixes.
Validate and use MoE expert-parallel communication overlap in Megatron-Bridge, including overlap_moe_expert_parallel_comm, delay_wgrad_compute, and flex dispatcher backends such as DeepEP and HybridEP.
Guide for selecting and configuring distributed training strategies in NeMo AutoModel, including FSDP2, Megatron FSDP, DDP, and parallelism settings.
Verify and build the required environment for Triton operator development on the Ascend platform, including configurations of dependencies such as CANN, Python/torch/torch_npu/triton-ascend and PATH environment variables. This is used when users need to configure the Triton operator development environment, check the installation of CANN/torch/triton-ascend, or verify whether the environment is available.
PyTorch-based TAO image classification. Supports a wide range of backbones (FAN, EfficientNet, ResNet, etc.) with distillation and quantization for deployment. Use when training, evaluating, distilling, quantizing, exporting, or running inference for a TAO image-classification (PyT) model. Trigger phrases include "train image classifier", "TAO classification", "ResNet/EfficientNet/FAN backbone classifier", "classification-pyt".
PyTorch implementation of TurboQuant for LLM KV cache compression using two-stage vector quantization (random rotation + Lloyd-Max + QJL residual correction).
Educational GPT implementation in ~300 lines. Reproduces GPT-2 (124M) on OpenWebText. Clean, hackable code for learning transformers. By Andrej Karpathy. Perfect for understanding GPT architecture from scratch. Train on Shakespeare (CPU) or OpenWebText (multi-GPU).
Enterprise LLM Fine-Tuning with LoRA, QLoRA, and PEFT techniques
Debug AOTInductor (AOTI) errors and crashes. Use when encountering AOTI segfaults, device mismatch errors, constant loading failures, or runtime errors from aot_compile, aot_load, aoti_compile_and_package, or aoti_load_package.
Use this skill when building computer vision applications, implementing image classification, object detection, or segmentation pipelines. Triggers on image classification, object detection, YOLO, semantic segmentation, image preprocessing, data augmentation, transfer learning, CNN architectures, vision transformers, and any task requiring visual recognition or image analysis.
Use when writing DALI data loading or preprocessing code with `nvidia.dali.experimental.dynamic` (ndd), or when converting DALI pipeline-mode code to dynamic mode, or when the user asks about DALI dynamic mode, imperative DALI, or ndd. Use this skill any time someone mentions 'ndd', 'dynamic mode', or wants to load/augment data with DALI outside of a pipeline definition.