Loading...
Loading...
Found 21 Skills
GPU Code to Ascend NPU Adaptation Review Expert. When users need to migrate GPU-based code (especially deep learning and model inference-related code) to Huawei Ascend NPU, this skill must be used for comprehensive review. This skill can identify bottlenecks in GPU-to-NPU migration, write adaptation scripts, generate verification plans, and output a complete Markdown review report. Trigger scenarios include: users mentioning keywords such as "NPU adaptation", "Ascend migration", "GPU to NPU", "Ascend", "CANN", "model migration", "operator adaptation", or users requesting to review GPU code repositories and migrate to the NPU platform.
Deep learning framework development with tinygrad - a minimal tensor library with autograd, JIT compilation, and multi-device support. Use when writing neural networks, training models, implementing tensor operations, working with UOps/PatternMatcher for graph transformations, or contributing to tinygrad internals. Triggers on tinygrad imports, Tensor operations, nn modules, optimizer usage, schedule/codegen work, or device backends.
LeetCode-style PyTorch interview practice environment with auto-grading for implementing softmax, attention, GPT-2 and more from scratch.
Transform raw content (text/URL) into structured learning documents with 6-phase framework combining AI analysis + reflection prompts. Use when the user wants to deeply understand content, create study notes, or learn from articles/books/documents. Triggers on "learn this", "deep learn", "study this", "tạo tài liệu học", "phân tích nội dung", "hiểu sâu", or "deep learner".
High-level PyTorch framework with Trainer class, automatic distributed training (DDP/FSDP/DeepSpeed), callbacks system, and minimal boilerplate. Scales from laptop to supercomputer with same code. Use when you want clean training loops with built-in best practices.
Guidance for building Caffe from source and training CIFAR-10 models. This skill applies when tasks involve compiling Caffe deep learning framework, configuring Makefile.config, preparing CIFAR-10 dataset, or training CNN models with Caffe solvers. Use for legacy ML framework installation, LMDB dataset preparation, and CPU-only deep learning training tasks.
Nsight Systems (nsys) CLI for system-level timeline profiling. Use when the user wants to run nsys profile, analyze .nsys-rep reports, use nsys stats/analyze/recipe commands, diagnose GPU idle time from timeline traces, or profile distributed training with NCCL overlap analysis. NOT for kernel-level metrics like SOL%, occupancy, or roofline (use perf-nsight-compute-analysis for ncu). NOT for writing or generating kernels. NOT for applying optimizations like CUDA Graphs.
Apply CUDA Graphs to PyTorch workloads — API selection (torch.compile, PyTorch make_graphed_callables, TE make_graphed_callables, MCore CudaGraphManager, FullCudaGraphWrapper, manual torch.cuda.graph), code compatibility, capture workflows, dynamic pattern handling, and troubleshooting. Triggers: CUDA graph, torch.cuda.graph, make_graphed_callables, reduce-overhead, graph capture, graph replay, kernel launch overhead, CudaGraphManager, FullCudaGraphWrapper, full-iteration graph, stream capture.
Refactor PyTorch code to improve maintainability, readability, and adherence to best practices. Identifies and fixes DRY violations, long functions, deep nesting, SRP violations, and opportunities for modular components. Applies PyTorch 2.x patterns including torch.compile optimization, Automatic Mixed Precision (AMP), optimized DataLoader configuration, modular nn.Module design, gradient checkpointing, CUDA memory management, PyTorch Lightning integration, custom Dataset classes, model factory patterns, weight initialization, and reproducibility patterns.