Found 39 Skills
Use when debugging a Nemo Gym run or reward profiling job. Covers rollout collection failures, empty or partial JSONL outputs, stale materialized inputs, verifier/schema errors, Ray or Slurm issues, vLLM readiness, judge failures, tool/sandbox failures, cache problems, and throughput bottlenecks.
LLM and ML model deployment for inference. Use when serving models in production, building AI APIs, or optimizing inference. Covers vLLM (LLM serving), TensorRT-LLM (GPU optimization), Ollama (local), BentoML (ML deployment), Triton (multi-model), LangChain (orchestration), LlamaIndex (RAG), and streaming patterns.
Use whenever the user mentions LLM prompt/prefix cache misses, cached_tokens=0, cache_read_input_tokens/cache_creation_input_tokens, prompt_cache_key, cache_control/cachePoint placement, stable prefixes, tool/schema stability, TTFT/prefill latency, OpenAI/Claude/Bedrock/OpenRouter routing, vLLM/SGLang KV reuse, or LLM cost/speed regressions on repeated long prompts. Use when reviewing LLM request shape changes: prompt text, message order, request builders, tools, schemas, response_format, provider API surface, model/router settings, agent loop structure, context compaction, or inference deployment. Use for speeding up agents only when prompt-cache stability, TTFT, or cache cost is central. Do not use for generic prompt writing, generic RAG design, token counting, or non-LLM performance.
Python code refactoring skills, covering code smell identification, design pattern application, readability improvement, and practical experience. This skill is applicable when users request "refactor code", "refactor", "code optimization", "improve code quality", "code smell review", "apply design patterns", "enhance readability", or submit code review requests. It supports generating structured refactoring documents after refactoring completion ("output refactoring document", "generate refactoring report"). It includes practical patterns extracted from 20+ real refactoring PRs in the vllm-ascend repository.
Guarantee valid JSON/XML/code structure during generation, use Pydantic models for type-safe outputs, support local models (Transformers, vLLM), and maximize inference speed with Outlines, dottxt.ai's structured generation library.
Meta's 7-8B specialized moderation model for LLM input/output filtering. 6 safety categories: violence/hate, sexual content, weapons, substances, self-harm, criminal planning. 94-95% accuracy. Deploy with vLLM, HuggingFace, SageMaker. Integrates with NeMo Guardrails.
Use when creating or revising model PR optimization history documents for SGLang, vLLM, or another serving framework that cite GitHub PRs. Requires manual, per-PR source-diff review and documentation of motivation, key implementation approach, most important code excerpts, reviewed files, and validation implications instead of generated or one-line summaries.
Fast structured generation and serving for LLMs with RadixAttention prefix caching. Use for JSON/regex outputs, constrained decoding, agentic workflows with tool calls, or when you need 5× faster inference than vLLM with prefix sharing. Powers 300,000+ GPUs at xAI, AMD, NVIDIA, and LinkedIn.
End-to-end SGLang SOTA performance workflow. Use when a user names an LLM model and wants SGLang to match or beat the best observed vLLM and TensorRT-LLM serving performance by searching each framework's best deployment command, benchmarking them fairly, profiling SGLang if it is slower, identifying kernel/overlap/fusion bottlenecks, patching SGLang code, and revalidating with real model runs.
Master local LLM inference, model selection, VRAM optimization, and local deployment using Ollama, llama.cpp, vLLM, and LM Studio. Expert in quantization formats (GGUF, EXL2) and local AI privacy.
External NeMo-RL end-to-end validation workflow for Megatron-Bridge model/provider changes, including downstream compatibility checks, external RL lifecycle behavior, Megatron policy setup, HF import/export, checkpoint/resume, non-colocated vLLM refit, delta weight transfer, optional LoRA/generation variants, and questions such as "does this model work in NeMo-RL", "run NeMo-RL e2e", or "external RL loop validation". Covers running NeMo-RL Megatron policy jobs from a Bridge checkout, choosing GRPO/SFT/checkpoint/non-colocated refit variants, setting PYTHONPATH so NeMo-RL imports the local Bridge tree, and reporting pass/fail evidence.
Mistral AI efficient open models. Use for efficient AI.