Search Results: distributed-training

Found 20 Skills

AI & Machine Learning | k-dense-ai/claude-scienti...

pytorch-lightning

Deep learning framework (PyTorch Lightning). Organize PyTorch code into LightningModules, configure Trainers for multi-GPU/TPU, implement data pipelines, callbacks, and logging (W&B, TensorBoard), and set up distributed training (DDP, FSDP, DeepSpeed) for scalable neural network training.

3 scripts/Checked

AI & Machine Learning | nvidia/skills

nemo-mbridge-perf-megatron-fsdp

Operational guide for enabling Megatron FSDP in Megatron-Bridge, including config knobs, code anchors, pitfalls, and verification.

AI & Machine Learning | nvidia/skills

mcore-run-on-slurm

How to launch distributed Megatron-LM training jobs on a SLURM cluster. Covers a minimal sbatch skeleton, environment-variable setup for torch.distributed.run, CUDA_DEVICE_MAX_CONNECTIONS rules across hardware and parallelism modes, container conventions, monitoring, and per-rank failure diagnosis.

AI & Machine Learning | davila7/claude-code-templ...

pytorch-fsdp

Expert guidance for Fully Sharded Data Parallel training with PyTorch FSDP: parameter sharding, mixed precision, CPU offloading, and FSDP2.

AI & Machine Learning | ruvnet/ruflo

flow-nexus-neural

Train and deploy neural networks in distributed E2B sandboxes with Flow Nexus.

AI & Machine Learning | tondevrel/scientific-agen...

pytorch-research

Advanced sub-skill for PyTorch focused on deep research and production engineering. Covers custom Autograd functions, module hooks, advanced initialization, Distributed Data Parallel (DDP), and performance profiling.

AI & Machine Learning | kiterlin/intelligent-dete...

pytorch-fsdp2

Adds PyTorch FSDP2 (fully_shard) to training scripts with correct init, sharding, mixed precision/offload config, and distributed checkpointing. Use when models exceed single-GPU memory or when you need DTensor-based sharding with DeviceMesh.

AI & Machine Learning | kiterlin/intelligent-dete...

ray-train

Distributed training orchestration across clusters. Scales PyTorch/TensorFlow/HuggingFace from laptop to 1000s of nodes. Built-in hyperparameter tuning with Ray Tune, fault tolerance, elastic scaling. Use when training massive models across multiple machines or running distributed hyperparameter sweeps.

AI & Machine Learning | nvidia/skills

tao-run-on-kubernetes

Kubernetes execution platform: submits TAO container jobs as single-pod k8s Jobs with NVIDIA GPU scheduling. Use when running on EKS, GKE, AKS, or on-prem clusters with the NVIDIA GPU Operator installed, or when integrating TAO into an existing k8s-native ML platform.

AI & Machine Learning | itsmostafa/llm-engineerin...

pytorch

Building and training neural networks with PyTorch. Use when implementing deep learning models, training loops, data pipelines, model optimization with torch.compile, distributed training, or deploying PyTorch models.

AI & Machine Learning | ascend/agent-skills

hccl-test

HCCL (Huawei Collective Communication Library) performance testing for Ascend NPU clusters. Use for testing distributed communication bandwidth, verifying HCCL functionality, and benchmarking collective operations such as AllReduce and AllGather. Covers MPI installation, multi-node pre-flight checks (SSH, CANN version, NPU health), and production testing workflows.

5 scripts/Attention

AI & Machine Learning | wanshuiyin/auto-claude-co...

qzcli

Manage GPU compute jobs on the Qizhi (启智) platform using qzcli, a kubectl-style CLI tool. Use when the user says "qzcli", "启智平台" (Qizhi platform), "submit job", "stop job", "查计算组" (query compute groups), "avail", "list jobs", or "batch submit", or needs to manage distributed training jobs on a Qizhi instance.
