Loading...
Loading...
Found 20 Skills
Use when "training LLM", "finetuning", "RLHF", "distributed training", "DeepSpeed", "Accelerate", "PyTorch Lightning", "Ray Train", "TRL", "Unsloth", "LoRA training", "flash attention", "gradient checkpointing"
LangGraph framework for building stateful, multi-agent AI applications with cyclical workflows, human-in-the-loop patterns, and persistent checkpointing.
Use this skill when building real-time data pipelines, stream processing jobs, or change data capture systems. Triggers on tasks involving Apache Kafka (producers, consumers, topics, partitions, consumer groups, Connect, Streams), Apache Flink (DataStream API, windowing, checkpointing, stateful processing), event sourcing implementations, CDC with Debezium, stream processing patterns (windowing, watermarks, exactly-once semantics), and any pipeline that processes unbounded data in motion rather than data at rest.
Refactor PyTorch code to improve maintainability, readability, and adherence to best practices. Identifies and fixes DRY violations, long functions, deep nesting, SRP violations, and opportunities for modular components. Applies PyTorch 2.x patterns including torch.compile optimization, Automatic Mixed Precision (AMP), optimized DataLoader configuration, modular nn.Module design, gradient checkpointing, CUDA memory management, PyTorch Lightning integration, custom Dataset classes, model factory patterns, weight initialization, and reproducibility patterns.
Create and modify NeMo AutoModel training and evaluation recipes, including YAML structure, builders, and execution flow.
Process multiple documents in bulk with parallel execution
Resiliency features in Megatron Bridge including fault tolerance, straggler detection, in-process restart, preemption, and re-run state machine.
Resiliency features in Megatron Bridge including fault tolerance, straggler detection, in-process restart, preemption, and re-run state machine.