Search Results: tensorrt-llm

Found 24 Skills

ad-add-fusion-transformation

Claude Code skill (trtllm-agent-toolkit): implement or extend TensorRT-LLM AutoDeploy fusion transforms under transform/library/ in a TensorRT-LLM checkout. Prefer existing kernels and custom ops; use Triton only when no viable existing-kernel path exists. Use ad-graph-dump for AD_DUMP_GRAPHS_DIR workflows. Covers TRT-LLM paths, registry, default.yaml registration, graph validation, tests, and a review checklist — without prescribing profiling tools or throughput targets.

🇺🇸|EnglishTranslated

AI & Machine Learningnvidia/skills

exec-local-compile

Compile TensorRT-LLM on a compute node inside a Docker container. Use this when already on a compute node with GPUs visible.

🇺🇸|EnglishTranslated

AI & Machine Learningnvidia/skills

trtllm-flashinfer-upgrade

Upgrade flashinfer-python version in TensorRT-LLM. Fetches the latest releases from GitHub (stable and nightly), compares with the current pinned version, lets the user pick a target version, and updates all version references across the repo. Use when the user wants to bump or upgrade flashinfer.

🇺🇸|EnglishTranslated

AI & Machine Learningancoleman/ai-design-compo...

model-serving

LLM and ML model deployment for inference. Use when serving models in production, building AI APIs, or optimizing inference. Covers vLLM (LLM serving), TensorRT-LLM (GPU optimization), Ollama (local), BentoML (ML deployment), Triton (multi-model), LangChain (orchestration), LlamaIndex (RAG), and streaming patterns.

🇺🇸|EnglishTranslated

5 scripts/Attention

AI & Machine Learningnvidia/skills

nemoclaw-user-configure-inference

Connects NemoClaw to a local inference server. Use when setting up Ollama, vLLM, TensorRT-LLM, NIM, or any OpenAI-compatible local model server with NemoClaw. Trigger keywords - nemoclaw local inference, ollama nemoclaw, vllm nemoclaw, local model server, openai compatible endpoint, switch nemoclaw inference model, change inference runtime, nemoclaw additional model, nemoclaw sub-agent model, openclaw sub-agent, agents.list, sessions_spawn, vlm-demo, nemoclaw tool calling, ollama tool calls, vllm tool-call-parser, raw json in tui, nemoclaw inference options, nemoclaw onboarding providers, nemoclaw inference routing.

🇺🇸|EnglishTranslated

AI & Machine Learningbbuf/sglang-auto-driven-s...

model-pr-history-knowledge

Use when an SGLang, vLLM, or TensorRT-LLM serving/model optimization task needs prior model-family PR evidence. Query and read the PR-driven history docs under model-pr-optimization-history before choosing source paths, fast paths, kernel/fusion ideas, regression risks, or validation lanes.

🇺🇸|EnglishTranslated

1 scripts/Checked

AI & Machine Learningbbuf/sglang-auto-driven-s...

vllm-sota-humanize-loop

Run an autonomous Humanize-governed vLLM SOTA performance loop for one LLM model: first perform the fixed fair vLLM/SGLang/TensorRT-LLM deployment search and benchmark, then start one RLCR loop that repeatedly decides the gap, profiles the current bottleneck, runs layer/kernel pipeline analysis, patches vLLM code, optionally uses ncu-report-skill for kernel evidence, and revalidates until vLLM matches or beats the best observed framework under the same workload and SLA.

🇺🇸|EnglishTranslated

AI & Machine Learningbbuf/sglang-auto-driven-s...

sglang-sota-humanize-loop

Run an autonomous Humanize-governed SGLang SOTA performance loop for one LLM model: first perform the fixed fair SGLang/vLLM/TensorRT-LLM deployment search and benchmark, then start one RLCR loop that repeatedly decides the gap, profiles the current bottleneck, runs layer/kernel pipeline analysis, patches SGLang code, optionally uses ncu-report-skill for kernel evidence, and revalidates until SGLang matches or beats the best observed framework under the same workload and SLA.

🇺🇸|EnglishTranslated

AI & Machine Learningbbuf/sglang-auto-driven-s...

sglang-sota-performance

End-to-end SGLang SOTA performance workflow. Use when a user names an LLM model and wants SGLang to match or beat the best observed vLLM and TensorRT-LLM serving performance by searching each framework's best deployment command, benchmarking them fairly, profiling SGLang if it is slower, identifying kernel/overlap/fusion bottlenecks, patching SGLang code, and revalidating with real model runs.

🇺🇸|EnglishTranslated

AI & Machine Learningnvidia/skills

ad-conf-check

Check whether AutoDeploy YAML configs were actually applied by analyzing server logs and optionally graph dumps (AD_DUMP_GRAPHS_DIR). Use when the user wants to verify config application, debug config issues, or check if AutoDeploy transforms (piecewise CUDA graph, multi-stream, sharding, fusion, etc.) were applied or fell back. Triggers on: "check config", "verify config", "ad-conf-check", "were my configs applied", "config not working", "check if piecewise is enabled", "check log for config", or any request to compare AD YAML settings against runtime behavior.

🇺🇸|EnglishTranslated

1 scripts/Checked

AI & Machine Learningnvidia/skills

ad-accuracy-debug

Debug AutoDeploy accuracy regressions vs a reference score (PyTorch backend or published baseline). Use when an AutoDeploy model's eval score is significantly below the reference and the root cause is unknown.

🇺🇸|EnglishTranslated

AI & Machine Learningnvidia/skills

ad-model-onboard

Translates a HuggingFace model into a prefill-only AutoDeploy custom model using reference custom ops, validates with hierarchical equivalence tests.

🇺🇸|EnglishTranslated