vllm-sota-humanize-loop
vLLM SOTA Humanize Loop
Overview
Use this skill when the user names a model and wants the vLLM serving path to
autonomously keep improving until it matches or beats the best reproducible
SGLang or TensorRT-LLM result in the same target environment.
This workflow has two durable parts:
- A fixed baseline phase that must be completed once before any code patching.
- One Humanize RLCR loop that owns gap decision, profiling, required layer/kernel deep dive, vLLM patching, optional NCU evidence, and real-model revalidation.
Do not split the campaign into a pre-loop profiling phase plus a later patch
loop. After the fixed benchmark exists, Phase 2 gap decisions, Phase 3 profiling,
`llm-pipeline-analysis`, kernel evidence, and code changes all belong inside the
same model-level RLCR loop.
Runtime Roots
This skill can run from Claude Code, Codex, or another compatible skill runtime.
Resolve companion roots in this order:
- Prefer installed Claude Code skills under `~/.claude/skills` when running in Claude Code.
- Prefer installed Codex skills under `${CODEX_HOME:-~/.codex}/skills` when running in Codex.
- Fall back to checked-out repositories when the skills are symlinked or kept local for development.
Example local paths:
```text
Humanize runtime: ${CODEX_HOME:-~/.codex}/skills/humanize
ncu-report-skill: ${CODEX_HOME:-~/.codex}/skills/ncu-report-skill/SKILL.md
Model PR history knowledge: <repo>/model-pr-optimization-history
```
For Claude Code installs, the equivalent defaults are typically:
```text
Humanize runtime: ~/.claude/skills/humanize
ncu-report-skill: ~/.claude/skills/ncu-report-skill/SKILL.md
Model PR history knowledge: ~/.claude/skills/model-pr-history-knowledge
```
If the Humanize runtime is missing, locate a plugin or skill directory
containing `scripts/setup-rlcr-loop.sh`. If `ncu-report-skill` is unavailable,
kernel edits may still proceed from torch-profiler/source evidence, but record
the missing NCU evidence path as a blocker when a kernel change would normally
need Nsight Compute diagnostics.
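The resolution order above can be sketched as a small lookup. This is a minimal sketch and not part of Humanize: the function name `resolve_skill_root`, the `runtime` labels `"claude"` and `"codex"`, and the `fallback_repos` parameter are hypothetical, and the function only checks that a candidate directory exists.

```python
import os
from pathlib import Path


def resolve_skill_root(skill, runtime, fallback_repos=(), home=None, env=None):
    """Return the first existing directory for `skill`, or None.

    Order: the active runtime's installed skills, then checked-out repos.
    All parameter names here are illustrative assumptions.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    roots = []
    if runtime == "claude":
        roots.append(home / ".claude" / "skills")
    elif runtime == "codex":
        # Mirrors the shell default ${CODEX_HOME:-~/.codex}/skills.
        codex_home = env.get("CODEX_HOME") or str(home / ".codex")
        roots.append(Path(codex_home) / "skills")
    # Checked-out repositories are the last resort.
    roots.extend(Path(repo) for repo in fallback_repos)
    for root in roots:
        candidate = root / skill
        if candidate.is_dir():
            return candidate
    return None
```

A `None` result corresponds to the missing-runtime case described above, where the next step is to locate a directory containing the setup script or to record a blocker.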
Companion Skills
Read these before a real run:
- `../llm-serving-auto-benchmark/SKILL.md`
- `../llm-torch-profiler-analysis/SKILL.md`
- `../llm-pipeline-analysis/SKILL.md`
- `../../model-pr-optimization-history/SKILL.md`
- the matching host or operator skill for SSH, container, GPU, and artifact conventions
Read `ncu-report-skill/SKILL.md` only when the active RLCR round is writing or
evaluating a CUDA, Triton, CuTe, CUTLASS, TileLang, or torch.compile kernel path
and Nsight Compute evidence is needed.
Contract
Given a model-level vLLM SOTA request, do not ask the user to run separate
benchmark, profiler, gen-plan, refine-plan, or Humanize setup commands. Do the
setup yourself.
Ask the user only if the model, target GPU environment, or precision/quantization
policy is missing and cannot be inferred from local configs or the active host
skill.
Keep only the fixed benchmark phase outside the RLCR patch loop. Once the fixed
cross-framework benchmark and model PR history notes exist, start Humanize. The
RLCR loop itself must decide whether a gap still exists, collect current
profiler evidence, run layer pipeline analysis, patch vLLM, and revalidate.
Treat the model optimization campaign as the durable unit, not one terminal
session. The campaign is recoverable from the run artifact root, checkpoint
files, benchmark/profile artifacts, NCU digests, and ledgers.
Phase 0: Inputs And Run Directory
Collect or infer:
- model id or checkpoint path, tokenizer, precision, quantization, trust policy, and max context length
- target vLLM checkout to patch
- GPU type/count, visible GPU ids, container or remote shell, CUDA/NCCL versions, and whether multi-node is allowed
- framework set, defaulting to vLLM, SGLang, and TensorRT-LLM when available
- model-family history slug inferred from the model id, checkpoint, or hot vLLM/SGLang source path when possible
- artifact root
Create one run directory:
```text
runs/YYYYMMDD_<model_slug>_sota_humanize/
  manifest.md
  help/
  benchmark/
  profiles/
  analysis/
    root-cause.md
    layer-pipeline.md
  history/
    model-pr-history-notes.md
  kernel/
    ncu-digests/
  patches/
  humanize/
    model-loop-checkpoint.md
  final_report.md
```
Never save Hugging Face tokens or other secrets in artifacts.
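The run directory can be created up front so that later phases only fill in files. The following is a minimal sketch under stated assumptions: `create_run_dir` and `RUN_LAYOUT` are hypothetical names, and the nesting of files under `analysis/`, `history/`, `kernel/`, and `humanize/` is inferred from paths this skill references elsewhere (for example `analysis/root-cause.md` and `kernel/ncu-digests/<version>/`).

```python
from datetime import date
from pathlib import Path

# Entries ending in "/" are directories; the rest are placeholder files.
RUN_LAYOUT = [
    "manifest.md",
    "help/",
    "benchmark/",
    "profiles/",
    "analysis/root-cause.md",
    "analysis/layer-pipeline.md",
    "history/model-pr-history-notes.md",
    "kernel/ncu-digests/",
    "patches/",
    "humanize/model-loop-checkpoint.md",
    "final_report.md",
]


def create_run_dir(artifact_root, model_slug, today=None):
    """Create runs/YYYYMMDD_<model_slug>_sota_humanize/ and its skeleton."""
    stamp = (today or date.today()).strftime("%Y%m%d")
    run_dir = Path(artifact_root) / "runs" / f"{stamp}_{model_slug}_sota_humanize"
    for entry in RUN_LAYOUT:
        path = run_dir / entry
        if entry.endswith("/"):
            path.mkdir(parents=True, exist_ok=True)
        else:
            path.parent.mkdir(parents=True, exist_ok=True)
            path.touch(exist_ok=True)  # leaves existing content untouched
    return run_dir
```

Because every step uses `exist_ok=True`, calling the function again on a resumed campaign does not overwrite earlier artifacts.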
Phase 0.5: Model PR History Knowledge Gate
Before the fixed benchmark and before any patch planning, query and read
`model-pr-optimization-history` for the target model family.
Rules:
- If the slug is unclear, run `scripts/query.py "<model id or family>"` from the knowledge root and choose the closest model-family history.
- Read the vLLM history for that family whenever it exists.
- Read the SGLang history too when SGLang is in the comparison set, later becomes the leading competitor, or its source/trace suggests a missing vLLM fast path.
- Write `history/model-pr-history-notes.md` with the paths read, PR numbers, source files, symbols, validation risks, and the concrete decision each item influences.
- Treat these notes as source and PR memory that helps choose a better vLLM patch, not measured proof by itself.
If the knowledge root is unavailable, record the blocker in the same notes file
and continue with benchmark/profile evidence.
Phase 1: Fixed Fair Benchmark Gate
This phase is mandatory and happens exactly once before Humanize starts.
Use `llm-serving-auto-benchmark` as the source of truth for candidate generation,
result schema, workload, and comparison.
Hard requirements:
- Search vLLM, SGLang, and TensorRT-LLM best deployment commands when each framework is supported in the target environment.
- Do not compare tuned vLLM against competitor defaults. Every framework gets its own bounded search.
- Use the same model weights, tokenizer, precision, quantization, GPU type/count, GPU ids, endpoint path, sampling settings, and SLA.
- Use the default two dataset scenarios from `llm-serving-auto-benchmark` unless the user explicitly provides a production workload:
  - dataset kind `random`, `num_prompts: 80`
  - `chat`: random input `1000`, output `1000`
  - `summarization`: random input `8000`, output `1000`
  - treat the two input/output pairs as aligned scenarios, not a cartesian product
- Do not replace those scenarios with an easier smoke dataset for the real SOTA decision. Smoke runs are allowed only when labeled as flow checks.
- For TensorRT-LLM, keep `trtllm-serve serve --backend pytorch`; reject non-PyTorch TensorRT-LLM server backends for this skill.
- Keep failed, skipped, and SLA-failing candidates in the benchmark artifact.
Write:
- `benchmark/candidates.jsonl`
- `benchmark/summary.md`
- `benchmark/winning-commands.md`
- framework help outputs under `help/`
- the exact launch and benchmark commands for every winner
Do not choose a code patch outside RLCR. The fixed winner table is the baseline
input to the model loop.
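The two default dataset scenarios in the hard requirements are aligned rows, which is easy to get wrong when input and output lengths are passed as separate lists. A minimal sketch, assuming a hypothetical `Scenario` dataclass; the field names are illustrative and are not the `llm-serving-auto-benchmark` schema:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    input_len: int
    output_len: int
    dataset_kind: str = "random"
    num_prompts: int = 80


# Each scenario is one (input, output) row; lengths are never crossed.
DEFAULT_SCENARIOS = (
    Scenario("chat", input_len=1000, output_len=1000),
    Scenario("summarization", input_len=8000, output_len=1000),
)


def scenario_pairs(scenarios=DEFAULT_SCENARIOS):
    """Return the (input_len, output_len) pairs to benchmark, in order."""
    return [(s.input_len, s.output_len) for s in scenarios]
```

Every framework in the comparison would be benchmarked on exactly these two pairs, so a winner table row always refers to one named scenario.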
Phase 2: Build The Humanize Plan
Create a Humanize plan inside the vLLM checkout that will be patched:
```text
.humanize/vllm-sota-agent/refined-plan.md
```
Use `references/refined-plan-template.md`
as the skeleton and fill it with the actual model, workload, benchmark winners,
artifact root, model PR history notes, and target vLLM checkout.
The plan must require every RLCR round to:
- preserve the fixed benchmark workload and SLA
- preserve and consult `history/model-pr-history-notes.md` before choosing model-specific vLLM source paths
- run the gap decision inside the loop before patching
- run `llm-torch-profiler-analysis` inside the loop when vLLM is behind or when the previous patch changed the profiled path
- run `llm-pipeline-analysis` inside the loop after profiler triage and before choosing a source path, representative layer, or kernel target
- patch vLLM code, not just benchmark parameters
- use `ncu-report-skill` inside the same loop when a kernel edit needs Nsight Compute evidence
- re-run real model benchmark/profile after each accepted patch
- continue through multiple minimal patches when one patch only closes part of the gap
- record every attempt, failed idea, partial win, rejected source idea, and final selected patch in artifacts
Phase 3: Start RLCR
Before starting Humanize from the vLLM checkout:
- Ensure the vLLM checkout is a git repository with at least one commit and a clean working tree, excluding only gitignored Humanize runtime state.
- Ensure `.humanize*` is gitignored so RLCR state, round summaries, and local checkpoints cannot be staged accidentally.
- Ensure the intended review base branch is present locally. Pass `--base-branch <branch>` if Humanize's auto-detection would be ambiguous.
- Do not start a new loop if any existing `.humanize/rlcr/*/state.md` is active in the vLLM checkout. Resume, finish, or cancel the old model loop first.
From the vLLM checkout, run:
```bash
"$HUMANIZE_RUNTIME_ROOT/scripts/setup-rlcr-loop.sh" \
  .humanize/vllm-sota-agent/refined-plan.md --yolo --strict-success
```
If `HUMANIZE_RUNTIME_ROOT` is not already set by the client/plugin environment,
resolve it to the installed Humanize runtime first. In Codex, this is often
`${CODEX_HOME:-~/.codex}/skills/humanize`; in Claude Code it is often
`~/.claude/skills/humanize` or a plugin-provided Humanize runtime. If setup
exits non-zero, stop and report the error. Do not bypass the gate.
After setup succeeds:
- Find the active state file with `find .humanize/rlcr -maxdepth 2 -name state.md -print`.
- Verify the state file exists and contains `strict_success: true`.
- Read `.humanize/rlcr/<timestamp>/round-0-prompt.md`.
- Execute the current round.
- Commit vLLM changes.
- Write the required Humanize round summary.
- Stop normally so the native Humanize Stop hook can review.
If no active state file exists, or if `strict_success: true` is missing, stop
and report that RLCR did not start correctly. Do not continue into vLLM patch
work outside the Humanize loop. If the hook blocks exit, follow the generated
next-round prompt exactly.
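The post-setup verification above can be sketched as a short check. This is a minimal sketch with hypothetical function names. It assumes that a `state.md` under `.humanize/rlcr/<timestamp>/` marks the active loop and that timestamp directory names sort chronologically, neither of which is confirmed by the Humanize runtime.

```python
from pathlib import Path


def find_state_files(checkout):
    """Return state.md paths under .humanize/rlcr/<timestamp>/, sorted."""
    rlcr = Path(checkout) / ".humanize" / "rlcr"
    # Covers the <timestamp>/state.md case reported by the find command.
    return sorted(rlcr.glob("*/state.md"))


def verify_rlcr_started(checkout):
    """Return the newest state file, or raise if RLCR did not start."""
    states = find_state_files(checkout)
    if not states:
        raise RuntimeError("RLCR did not start: no state.md found")
    state = states[-1]  # assumption: last name is the most recent loop
    if "strict_success: true" not in state.read_text(encoding="utf-8"):
        raise RuntimeError(f"{state} does not contain strict_success: true")
    return state
```

Raising instead of returning a flag matches the rule that patch work must not continue when the loop failed to start.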
Inside Each RLCR Round
Gap Decision
At the start of every round, compute current vLLM's gap against the best
SLA-passing framework for each fixed scenario.
Use `1%` as the default stable noise threshold. If the current result is within
`+/-1%`, rerun the winning commands enough times to decide whether the gap is
stable before choosing a patch.
Patch only when vLLM is slower than the best framework by more than `1%`, fails
SLA while another framework passes, or has a profiled bottleneck that explains
the remaining gap under the fixed workload.
If vLLM is already best or tied within the stable threshold, write the final
report and stop under the normal Humanize review path.
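The numeric part of the gap decision can be sketched as follows. This is a minimal sketch with hypothetical names; it assumes the gap is measured relative to the best framework's value and covers only the threshold rule, not the SLA-failure or profiled-bottleneck criteria.

```python
NOISE_THRESHOLD = 0.01  # the 1% stable noise threshold


def relative_gap(vllm_value, best_value, higher_is_better=True):
    """Fraction by which vLLM trails the best result (negative if ahead)."""
    if higher_is_better:  # e.g. throughput
        return (best_value - vllm_value) / best_value
    return (vllm_value - best_value) / best_value  # e.g. latency


def gap_decision(vllm_value, best_value, higher_is_better=True,
                 threshold=NOISE_THRESHOLD):
    """Classify one fixed scenario as 'patch', 'rerun', or 'ahead'."""
    gap = relative_gap(vllm_value, best_value, higher_is_better)
    if gap > threshold:
        return "patch"
    if gap >= -threshold:
        return "rerun"  # within +/-1%: repeat the winning commands first
    return "ahead"
```

For example, `gap_decision(95.0, 100.0)` returns `"patch"`, while `gap_decision(99.5, 100.0)` returns `"rerun"` because the difference is inside the noise band.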
Required Profiling
When vLLM is behind, profile the current best vLLM command and the leading
competitor command with `llm-torch-profiler-analysis`.
Rules:
- Always profile vLLM when it is behind.
- Always profile at least the current best framework.
- If both SGLang and TensorRT-LLM are more than `1%` ahead of vLLM in a stable result, profile both.
- Use the slow benchmark scenario lengths, not the profiler defaults:
  - prefill profile: slow input length -> `1` output token
  - decode profile: `1` input token -> slow output length
- For mixed or production datasets, use the slowest representative p50 or p95 bucket already recorded by the benchmark artifact.
- Capture or analyze separate prefill and decode evidence when the framework supports it.
For every profiled framework, save the same three tables:
- kernel table
- overlap-opportunity table
- fuse-pattern table
Then write or update `analysis/root-cause.md` with the current cross-framework
comparison: which stage is slower, which table rows explain it, and which vLLM
source paths or kernel families are plausible patch targets.
Do not patch vLLM until this report exists for the current gap.
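The profiling rules above reduce to two small mappings: which lengths to profile and which frameworks to profile. A minimal sketch with hypothetical function names, where `lead_over_vllm` is an assumed dictionary of each competitor's fractional lead over vLLM in a stable result:

```python
def profile_shapes(slow_input_len, slow_output_len):
    """Map the slow benchmark scenario to prefill and decode profile lengths."""
    return {
        # Prefill: full slow input, a single output token.
        "prefill": {"input_len": slow_input_len, "output_len": 1},
        # Decode: a single input token, full slow output.
        "decode": {"input_len": 1, "output_len": slow_output_len},
    }


def frameworks_to_profile(lead_over_vllm, threshold=0.01):
    """Return vLLM, the leading competitor, and every competitor > 1% ahead.

    Intended for the case where vLLM is behind; the mapping must be
    non-empty.
    """
    leader = max(lead_over_vllm, key=lead_over_vllm.get)
    ahead = [name for name, lead in lead_over_vllm.items() if lead > threshold]
    return sorted({"vllm", leader, *ahead})
```

With the default summarization scenario, `profile_shapes(8000, 1000)` yields an 8000-token prefill run and a 1000-token decode run.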
Layer Pipeline Deep Dive
Run `llm-pipeline-analysis` inside every RLCR round after profiler triage and
before choosing a patch target.
The report must identify:
- the chosen forward pass and why it is representative
- the relevant layer types, especially for heterogeneous layers such as MoE, hash layers, or `compress_ratios`
- representative layers for the patch target
- top hot kernels in those representative layers
- any Perfetto time ranges needed for inspection
Use the profiled vLLM trace and the served model config. Write
`analysis/layer-pipeline.md` with the chosen forward pass, layer-type timing
table, representative layers, top hot kernels, and any Perfetto ranges used for
inspection. Do not choose a vLLM patch before this report exists for the current
round.
Kernel Evidence Assist
Use `ncu-report-skill` only when the active RLCR round is writing a concrete
kernel or small kernel-family patch and torch-profiler evidence is not enough to
choose or validate the next edit.
Kernel-level assistance is allowed only when all of these are true:
- vLLM is still more than `1%` behind the best framework for the fixed benchmark scenario after the required repeat/profiler checks.
- The slow stage has a concrete vLLM kernel or tightly scoped kernel family in the kernel table with at least `1%` cumulative GPU-time share. Do not spend kernel-specialist effort on a lone kernel below `1%` share unless a shared implementation affects an aggregated family above `1%`.
- `llm-pipeline-analysis` has identified the representative layer/forward pass and top hot kernels for the current round.
- The proposed kernel target has a clear correctness reference, representative shapes/dtypes/layouts from the model run, and a path to wire the candidate into the active vLLM serving code.
For each eligible kernel target:
- Read `ncu-report-skill/SKILL.md` and follow its Nsight Compute workflow for harness construction, `ncu` collection, report parsing, stall diagnosis, and evidence-backed next-edit selection.
- Store NCU outputs under `kernel/ncu-digests/<version>/` or the host's equivalent artifact root. Each digest must compare baseline vs candidate and end with exactly one concrete next edit.
- Patch the vLLM kernel or call path directly in the vLLM checkout, with focused correctness and microbench coverage when available. CUDA and C++ kernel code lives under `csrc/`, Triton kernels and attention/quantization wrappers live under `vllm/`, and torch.compile-driven paths live under `vllm/compilation/`.
- Wire the candidate into the active model-serving path that produced the original profiler row.
- Re-run the same real-model benchmark and profiler after the candidate is correct. A microbench or NCU win alone is not success.
If no focused harness exists, build the smallest harness that preserves the
model-derived shapes/dtypes/layouts. If NCU cannot run on the host, record the
blocker in the digest path and keep the next edit grounded in the available
torch-profiler, layer-pipeline, and source evidence.
Do not start any standalone `.humanize/rlcr` session for a kernel target. Kernel
work stays inside the active model RLCR loop.
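The GPU-time share criterion from the eligibility list can be sketched as a filter over kernel table rows. This is a minimal sketch: the row shape `(kernel_name, family, gpu_time_share)` and the function name are hypothetical, and the other eligibility conditions (remaining gap, pipeline analysis, correctness reference) still have to be checked separately.

```python
from collections import defaultdict

SHARE_THRESHOLD = 0.01  # 1% cumulative GPU-time share


def eligible_kernel_targets(kernel_rows, threshold=SHARE_THRESHOLD):
    """Return kernel or family names whose GPU-time share reaches the threshold.

    kernel_rows: iterable of (kernel_name, family, gpu_time_share) tuples;
    family is None when the kernel has no shared implementation.
    """
    family_share = defaultdict(float)
    targets = set()
    for name, family, share in kernel_rows:
        if family is not None:
            family_share[family] += share  # aggregate the shared family
        if share >= threshold:
            targets.add(name)
    targets.update(f for f, total in family_share.items() if total >= threshold)
    return sorted(targets)
```

A lone kernel at 0.2% share is filtered out, while two 0.6% and 0.7% kernels that share one implementation surface as a single family target.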
Model-Loop Checkpoint
After every accepted round, update `humanize/model-loop-checkpoint.md` with:
- original model, tokenizer, precision, quantization, hardware, workload, SLA, artifact root, and benchmark winner commands
- current vLLM branch, commit, patches applied, tests run, and current best vLLM benchmark row
- remaining gap, profiler rows, model PR history notes, layer-pipeline notes, NCU digest paths, rejected source ideas, and the next planned vLLM patch
This checkpoint is for campaign recovery inside the same model-level workflow.
It records enough context to resume the campaign without losing
benchmark/profile lineage.
Loop Ledgers
Keep these files under the run artifact root or the vLLM checkout, depending
on the host convention:
```text
humanize/attempt-ledger.md
humanize/optimization-ledger.md
humanize/source-idea-ledger.md
humanize/lineage.jsonl
humanize/profile-digests/
```
Every patch attempt gets an attempt row. Only correct patches with measured
improvement get optimization rows. Source ideas must include profiler rows,
layer-pipeline evidence, NCU report paths when used, and code provenance so
later rounds can avoid re-reading the same source. Model PR history evidence
should be recorded beside vLLM, SGLang, TensorRT-LLM, and NCU source ideas when
it influenced the patch.
After two consecutive rounds with less than `1%` geomean improvement over the
prior best vLLM result, expand code-first research before editing again. Prefer
code and PR evidence from vLLM, SGLang, TensorRT-LLM, and relevant kernel source
guides before prose-only articles.
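The stagnation rule above depends on a geometric mean across the fixed scenarios. A minimal sketch with hypothetical function names, assuming each scenario reports one higher-is-better metric such as throughput:

```python
import math


def geomean_improvement(prior_best, current):
    """Geometric-mean ratio of `current` over `prior_best`, minus 1.

    Both arguments map scenario name -> a higher-is-better metric.
    """
    ratios = [current[name] / prior_best[name] for name in prior_best]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios)) - 1.0


def should_expand_research(round_improvements, threshold=0.01):
    """True when the last two rounds each improved by less than 1%."""
    last_two = round_improvements[-2:]
    return len(last_two) == 2 and all(x < threshold for x in last_two)
```

Here `round_improvements` is the per-round list of `geomean_improvement` values, so `should_expand_research([0.05, 0.004, 0.002])` is true and signals that the next step is code-first research rather than another edit.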
Stop Conditions
Stop only when one of these is true:
- vLLM beats the best SLA-passing SGLang/TensorRT-LLM result on the fixed workload.
- vLLM is tied within the stable `1%` threshold after repeat runs.
- The remaining gap is proven external to vLLM, such as unavailable hardware support, missing framework dependency, unsupported TensorRT-LLM PyTorch backend, or model weights that cannot be loaded fairly.
- Profile evidence shows the remaining hot path is already near the relevant hardware or algorithmic limit and no low-risk vLLM patch remains.
The final report must include the fixed benchmark table, post-patch benchmark
table, all winner commands, model PR history paths, profile paths,
layer-pipeline paths when used, NCU digest paths when used, vLLM changed files,
tests, and whether vLLM reached target-environment SOTA.