Found 33 Skills
vLLM Ascend plugin for LLM inference serving on Huawei Ascend NPUs. Use for offline batch inference, API server deployment, quantized inference (with models quantized by msmodelslim), tensor/pipeline parallelism for distributed serving, and OpenAI-compatible API endpoints. Supports Qwen, DeepSeek, GLM, and LLaMA models with Ascend-optimized kernels.
Diagnoses and improves Qdrant search relevance. Use when someone reports 'search results are bad', 'wrong results', 'low precision', 'low recall', 'irrelevant matches', 'missing expected results', or asks 'how to improve search quality?', 'which embedding model?', 'should I use hybrid search?', 'should I use reranking?'. Also use when search quality degrades after quantization, model change, or data growth.
Vector search best practices for Azure DocumentDB using `cosmosSearch` — choosing between DiskANN / HNSW / IVF, creating indexes, tuning `lBuild` / `lSearch` / `maxDegree`, Product Quantization (up to 16,000 dims), half-precision (fp16) indexing, and normalizing embeddings for cosine similarity. Use when building RAG / semantic-search applications, creating a vector index, tuning recall/latency, or reducing vector-index memory footprint.
Develop, debug, and optimize the SGLang LLM serving engine. Use when the user mentions SGLang, sglang, srt, sgl-kernel, LLM serving, model inference, KV cache, attention backend, FlashInfer, MLA, MoE routing, speculative decoding, disaggregated serving, TP/PP/EP, radix cache, continuous batching, chunked prefill, CUDA graph, model loading, quantization FP8/GPTQ/AWQ, JIT kernel, triton kernel SGLang, or asks about serving LLMs with SGLang.
Optimizes LLM inference with NVIDIA TensorRT for high throughput and low latency. Use for production deployment on NVIDIA GPUs (A100/H100), when you need up to 10-100x faster inference than PyTorch, or for serving models with quantization (FP8/INT4), in-flight batching, and multi-GPU scaling.
Expert skill for using TileKernels, a library of optimized GPU kernels for LLM operations (MoE routing, quantization, transpose, engram gating, Manifold HyperConnection) built with TileLang.
Review, design, and refactor TensorRT-LLM PyTorch MoE code for architecture fit, clean code, maintainability, and testability. Always use for any modification, review, refactor, or design planning that touches MoE modules, including tensorrt_llm/_torch/modules/fused_moe, ConfigurableMoE, MoE backends, MoEScheduler/moe_scheduler.py, forward execution/chunking, communication strategies, EPLB, quantization/weight handling, routing, factories, MoE docs, or MoE tests. Also use when the user asks whether a MoE design follows the current architecture or whether a MoE refactor is reasonable.
PR-backed and current-main optimization manual for the `MiniMaxAI/MiniMax-M2` series, including M2, M2.1, M2.5, M2.7, and M2.7-highspeed. Use when Codex needs to recover, extend, or audit MiniMax-specific optimizations, TP QK norm/all-reduce behavior, parser contracts, distributed runtime behavior, quantized loading, or backend-specific validation.
Evaluates accuracy of quantized or unquantized LLMs using NeMo Evaluator Launcher (NEL). Triggers on "evaluate model", "benchmark accuracy", "run MMLU", "evaluate quantized model", "accuracy drop", "run nel". Handles deployment, config generation, and evaluation execution. Not for quantizing models (use ptq) or deploying/serving models (use deployment).