vllm-sota-humanize-loop

vLLM SOTA Humanize Loop

Overview

Use this skill when the user names a model and wants the vLLM serving path to autonomously keep improving until it matches or beats the best reproducible SGLang or TensorRT-LLM result in the same target environment.
This workflow has two durable parts:
  1. A fixed baseline phase that must be completed once before any code patching.
  2. One Humanize RLCR loop that owns gap decision, profiling, required layer/kernel deep dive, vLLM patching, optional NCU evidence, and real-model revalidation.
Do not split the campaign into a pre-loop profiling phase plus a later patch loop. After the fixed benchmark exists, gap decisions, profiling, `llm-pipeline-analysis`, kernel evidence, and code changes all belong inside the same model-level RLCR loop.

Runtime Roots

This skill can run from Claude Code, Codex, or another compatible skill runtime. Resolve companion roots in this order:
  1. Prefer installed Claude Code skills under `~/.claude/skills` when running in Claude Code.
  2. Prefer installed Codex skills under `${CODEX_HOME:-~/.codex}/skills` when running in Codex.
  3. Fall back to checked-out repositories when the skills are symlinked or kept local for development.
Example local paths:
text
Humanize runtime: ${CODEX_HOME:-~/.codex}/skills/humanize
ncu-report-skill: ${CODEX_HOME:-~/.codex}/skills/ncu-report-skill/SKILL.md
Model PR history knowledge: <repo>/model-pr-optimization-history
For Claude Code installs, the equivalent defaults are typically:
text
Humanize runtime: ~/.claude/skills/humanize
ncu-report-skill: ~/.claude/skills/ncu-report-skill/SKILL.md
Model PR history knowledge: ~/.claude/skills/model-pr-history-knowledge
If the Humanize runtime is missing, locate a plugin or skill directory containing `scripts/setup-rlcr-loop.sh`. If `ncu-report-skill` is unavailable, kernel edits may still proceed from torch-profiler/source evidence, but record the missing NCU evidence path as a blocker when a kernel change would normally need Nsight Compute diagnostics.
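The resolution order above can be sketched in Python. This is a minimal illustration, not part of the skill: the function names, the `runtime` labels (`"claude-code"`, `"codex"`), and the ordering used for other runtimes are assumptions; only the directory layout and the `scripts/setup-rlcr-loop.sh` marker come from the text.

```python
from pathlib import Path
from typing import Optional


def candidate_skill_roots(runtime: str, env: dict, home: Path) -> list[Path]:
    """Return skill roots to try, most preferred first (hypothetical helper)."""
    claude_root = home / ".claude" / "skills"
    # Mirrors ${CODEX_HOME:-~/.codex}/skills: unset or empty falls back to ~/.codex.
    codex_root = Path(env.get("CODEX_HOME") or home / ".codex") / "skills"
    if runtime == "codex":
        return [codex_root, claude_root]
    # Assumption: Claude Code and any other runtime try the Claude root first.
    return [claude_root, codex_root]


def find_humanize_runtime(roots: list[Path], fallbacks: list[Path]) -> Optional[Path]:
    """Pick the first directory that contains scripts/setup-rlcr-loop.sh.

    `fallbacks` stands for checked-out repositories or plugin directories.
    """
    for base in [*(root / "humanize" for root in roots), *fallbacks]:
        if (base / "scripts" / "setup-rlcr-loop.sh").is_file():
            return base
    return None
```

Returning `None` rather than raising keeps the caller free to record the missing runtime as a blocker instead of aborting.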

Companion Skills

Read these before a real run:
  • ../llm-serving-auto-benchmark/SKILL.md
  • ../llm-torch-profiler-analysis/SKILL.md
  • ../llm-pipeline-analysis/SKILL.md
  • ../../model-pr-optimization-history/SKILL.md
  • the matching host or operator skill for SSH, container, GPU, and artifact conventions
Read `ncu-report-skill/SKILL.md` only when the active RLCR round is writing or evaluating a CUDA, Triton, CuTe, CUTLASS, TileLang, or torch.compile kernel path and Nsight Compute evidence is needed.

Contract

Given a model-level vLLM SOTA request, do not ask the user to run separate benchmark, profiler, gen-plan, refine-plan, or Humanize setup commands. Do the setup yourself.
Ask the user only if the model, target GPU environment, or precision/quantization policy is missing and cannot be inferred from local configs or the active host skill.
Keep only the fixed benchmark phase outside the RLCR patch loop. Once the fixed cross-framework benchmark and model PR history notes exist, start Humanize. The RLCR loop itself must decide whether a gap still exists, collect current profiler evidence, run layer pipeline analysis, patch vLLM, and revalidate.
Treat the model optimization campaign as the durable unit, not one terminal session. The campaign is recoverable from the run artifact root, checkpoint files, benchmark/profile artifacts, NCU digests, and ledgers.

Phase 0: Inputs And Run Directory

Collect or infer:
  • model id or checkpoint path, tokenizer, precision, quantization, trust policy, and max context length
  • target vLLM checkout to patch
  • GPU type/count, visible GPU ids, container or remote shell, CUDA/NCCL versions, and whether multi-node is allowed
  • framework set, defaulting to vLLM, SGLang, and TensorRT-LLM when available
  • model-family history slug inferred from the model id, checkpoint, or hot vLLM/SGLang source path when possible
  • artifact root
Create one run directory:
text
runs/YYYYMMDD_<model_slug>_sota_humanize/
  manifest.md
  help/
  benchmark/
  profiles/
  analysis/
    root-cause.md
    layer-pipeline.md
  history/
    model-pr-history-notes.md
  kernel/
    ncu-digests/
  patches/
  humanize/
    model-loop-checkpoint.md
  final_report.md
Never save Hugging Face tokens or other secrets in artifacts.
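The directory layout above could be created with a short script. This is a sketch under assumptions: `create_run_dir`, `artifact_root`, and `model_slug` are illustrative names, and placing `runs/` directly under the artifact root is a guess about the host convention.

```python
from datetime import date
from pathlib import Path

# Directories and placeholder files copied from the layout above.
RUN_DIRS = ["help", "benchmark", "profiles", "analysis", "history",
            "kernel/ncu-digests", "patches", "humanize"]
RUN_FILES = ["manifest.md", "analysis/root-cause.md", "analysis/layer-pipeline.md",
             "history/model-pr-history-notes.md",
             "humanize/model-loop-checkpoint.md", "final_report.md"]


def create_run_dir(artifact_root: Path, model_slug: str, day: date) -> Path:
    """Create runs/YYYYMMDD_<model_slug>_sota_humanize/ and return its path."""
    run_dir = artifact_root / "runs" / f"{day:%Y%m%d}_{model_slug}_sota_humanize"
    for rel in RUN_DIRS:
        (run_dir / rel).mkdir(parents=True, exist_ok=True)
    for rel in RUN_FILES:
        # touch() leaves existing content alone, so re-running is safe.
        (run_dir / rel).touch(exist_ok=True)
    return run_dir
```

Because both `mkdir` and `touch` are idempotent here, the same call can be used when resuming a campaign from an existing run directory.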

Phase 0.5: Model PR History Knowledge Gate

Before the fixed benchmark and before any patch planning, query and read `model-pr-optimization-history` for the target model family.
Rules:
  • If the slug is unclear, run `scripts/query.py "<model id or family>"` from the knowledge root and choose the closest model-family history.
  • Read the vLLM history for that family whenever it exists.
  • Read the SGLang history too when SGLang is in the comparison set, later becomes the leading competitor, or its source/trace suggests a missing vLLM fast path.
  • Write `history/model-pr-history-notes.md` with the paths read, PR numbers, source files, symbols, validation risks, and the concrete decision each item influences.
  • Treat these notes as source and PR memory that helps choose a better vLLM patch, not as measured proof in themselves.
If the knowledge root is unavailable, record the blocker in the same notes file and continue with benchmark/profile evidence.

Phase 1: Fixed Fair Benchmark Gate

This phase is mandatory and happens exactly once before Humanize starts.
Use `llm-serving-auto-benchmark` as the source of truth for candidate generation, result schema, workload, and comparison.
Hard requirements:
  • Search for the best deployment command for each of vLLM, SGLang, and TensorRT-LLM that is supported in the target environment.
  • Do not compare tuned vLLM against competitor defaults. Every framework gets its own bounded search.
  • Use the same model weights, tokenizer, precision, quantization, GPU type/count, GPU ids, endpoint path, sampling settings, and SLA.
  • Use the default two dataset scenarios from `llm-serving-auto-benchmark` unless the user explicitly provides a production workload:
    • dataset kind `random`, `num_prompts: 80`
    • `chat`: random input 1000, output 1000
    • `summarization`: random input 8000, output 1000
    • treat the two input/output pairs as aligned scenarios, not a cartesian product
  • Do not replace those scenarios with an easier smoke dataset for the real SOTA decision. Smoke runs are allowed only when labeled as flow checks.
  • For TensorRT-LLM, keep `trtllm-serve serve --backend pytorch`; reject non-PyTorch TensorRT-LLM server backends for this skill.
  • Keep failed, skipped, and SLA-failing candidates in the benchmark artifact.
Write:
  • benchmark/candidates.jsonl
  • benchmark/summary.md
  • benchmark/winning-commands.md
  • framework help outputs under `help/`
  • the exact launch and benchmark commands for every winner
Do not choose a code patch outside RLCR. The fixed winner table is the baseline input to the model loop.
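To make the "aligned scenarios, not a cartesian product" rule concrete, the default workload can be written down as data. This is a sketch only: the `Scenario` class, its field names, and `workload_matrix` are illustrative and are not the schema used by `llm-serving-auto-benchmark`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """One aligned input/output pair (field names are hypothetical)."""
    name: str
    input_len: int
    output_len: int
    dataset_kind: str = "random"
    num_prompts: int = 80


# The two default scenarios from the text; each keeps its own paired lengths.
DEFAULT_SCENARIOS = (
    Scenario("chat", input_len=1000, output_len=1000),
    Scenario("summarization", input_len=8000, output_len=1000),
)


def workload_matrix(frameworks, scenarios=DEFAULT_SCENARIOS):
    """Every framework runs every scenario; lengths are never recombined."""
    return [(framework, scenario) for framework in frameworks for scenario in scenarios]
```

With three frameworks this yields six runs. A cartesian product over input and output lengths would instead invent pairs that neither scenario defines.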

Phase 2: Build The Humanize Plan

Create a Humanize plan inside the vLLM checkout that will be patched:
text
.humanize/vllm-sota-agent/refined-plan.md
Use references/refined-plan-template.md as the skeleton and fill it with the actual model, workload, benchmark winners, artifact root, model PR history notes, and target vLLM checkout.
The plan must require every RLCR round to:
  • preserve the fixed benchmark workload and SLA
  • preserve and consult `history/model-pr-history-notes.md` before choosing model-specific vLLM source paths
  • run the gap decision inside the loop before patching
  • run `llm-torch-profiler-analysis` inside the loop when vLLM is behind or when the previous patch changed the profiled path
  • run `llm-pipeline-analysis` inside the loop after profiler triage and before choosing a source path, representative layer, or kernel target
  • patch vLLM code, not just benchmark parameters
  • use `ncu-report-skill` inside the same loop when a kernel edit needs Nsight Compute evidence
  • re-run real model benchmark/profile after each accepted patch
  • continue through multiple minimal patches when one patch only closes part of the gap
  • record every attempt, failed idea, partial win, rejected source idea, and final selected patch in artifacts

Phase 3: Start RLCR

Before starting Humanize from the vLLM checkout:
  • Ensure the vLLM checkout is a git repository with at least one commit and a clean working tree, excluding only gitignored Humanize runtime state.
  • Ensure `.humanize*` is gitignored so RLCR state, round summaries, and local checkpoints cannot be staged accidentally.
  • Ensure the intended review base branch is present locally. Pass `--base-branch <branch>` if Humanize's auto-detection would be ambiguous.
  • Do not start a new loop if any existing `.humanize/rlcr/*/state.md` is active in the vLLM checkout. Resume, finish, or cancel the old model loop first.
From the vLLM checkout, run:
bash
"$HUMANIZE_RUNTIME_ROOT/scripts/setup-rlcr-loop.sh" \
  .humanize/vllm-sota-agent/refined-plan.md --yolo --strict-success
If `HUMANIZE_RUNTIME_ROOT` is not already set by the client/plugin environment, resolve it to the installed Humanize runtime first. In Codex, this is often `${CODEX_HOME:-~/.codex}/skills/humanize`; in Claude Code it is often `~/.claude/skills/humanize` or a plugin-provided Humanize runtime. If setup exits non-zero, stop and report the error. Do not bypass the gate.
After setup succeeds:
  1. Find the active state file with `find .humanize/rlcr -maxdepth 2 -name state.md -print`.
  2. Verify the state file exists and contains `strict_success: true`.
  3. Read `.humanize/rlcr/<timestamp>/round-0-prompt.md`.
  4. Execute the current round.
  5. Commit vLLM changes.
  6. Write the required Humanize round summary.
  7. Stop normally so the native Humanize Stop hook can review.
If no active state file exists, or if `strict_success: true` is missing, stop and report that RLCR did not start correctly. Do not continue into vLLM patch work outside the Humanize loop. If the hook blocks exit, follow the generated next-round prompt exactly.
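The post-setup check (steps 1 and 2) can be expressed as a small guard. This is a sketch: the function names are illustrative, and treating the lexicographically last state file as the active one is an assumption, since the text does not say how Humanize marks a loop as active.

```python
from pathlib import Path


def find_state_files(checkout: Path) -> list[Path]:
    """Mirror `find .humanize/rlcr -maxdepth 2 -name state.md`."""
    rlcr = checkout / ".humanize" / "rlcr"
    candidates = [rlcr / "state.md", *rlcr.glob("*/state.md")]
    return sorted(p for p in candidates if p.is_file())


def rlcr_started_correctly(checkout: Path) -> bool:
    """True only if a state file exists and records strict_success: true."""
    states = find_state_files(checkout)
    if not states:
        return False
    # Assumption: the lexicographically last path belongs to the newest loop.
    lines = states[-1].read_text(encoding="utf-8").splitlines()
    return any(line.strip() == "strict_success: true" for line in lines)
```

A `False` result corresponds to the "stop and report" branch above; no vLLM patch work should follow it.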

Inside Each RLCR Round

Gap Decision

At the start of every round, compute current vLLM's gap against the best SLA-passing framework for each fixed scenario.
Use 1% as the default stable noise threshold. If the current result is within +/-1%, rerun the winning commands enough times to decide whether the gap is stable before choosing a patch.
Patch only when vLLM is slower than the best framework by more than 1%, fails SLA while another framework passes, or has a profiled bottleneck that explains the remaining gap under the fixed workload.
If vLLM is already best or tied within the stable threshold, write the final report and stop under the normal Humanize review path.
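The decision rule for one fixed scenario can be sketched as follows. The `Result` type, the higher-is-better `throughput` metric, and the `repeats_done` flag are assumptions made for illustration; the third patch trigger (a profiled bottleneck that explains the remaining gap) needs profiler evidence and is not modeled here.

```python
from dataclasses import dataclass

NOISE_PCT = 1.0  # default stable noise threshold from the text


@dataclass(frozen=True)
class Result:
    framework: str
    throughput: float  # assumption: a single higher-is-better metric
    sla_pass: bool


def gap_decision(results, repeats_done=False):
    """Return 'patch', 'rerun', or 'stop' for one fixed scenario."""
    vllm = next(r for r in results if r.framework == "vllm")
    rivals = [r for r in results if r.framework != "vllm" and r.sla_pass]
    if not rivals:
        # The text does not define this case, so the sketch refuses to guess.
        raise ValueError("no SLA-passing competitor to compare against")
    if not vllm.sla_pass:
        return "patch"  # vLLM fails SLA while another framework passes
    best = max(rivals, key=lambda r: r.throughput)
    gap_pct = (best.throughput - vllm.throughput) / best.throughput * 100.0
    if gap_pct > NOISE_PCT:
        return "patch"
    if abs(gap_pct) <= NOISE_PCT and not repeats_done:
        return "rerun"  # inside the noise band: repeat before deciding
    return "stop"  # best, or tied within the stable threshold
```

For example, vLLM at 100 against a competitor at 105 gives a gap of about 4.8% and returns `patch`, while 100 against 100.5 returns `rerun` until repeat runs confirm the tie.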

Required Profiling

When vLLM is behind, profile the current best vLLM command and the leading competitor command with `llm-torch-profiler-analysis`.
Rules:
  • Always profile vLLM when it is behind.
  • Always profile at least the current best framework.
  • If both SGLang and TensorRT-LLM are more than 1% ahead of vLLM in a stable result, profile both.
  • Use the slow benchmark scenario lengths, not the profiler defaults:
    • prefill profile: slow input length -> 1 output token
    • decode profile: 1 input token -> slow output length
  • For mixed or production datasets, use the slowest representative p50 or p95 bucket already recorded by the benchmark artifact.
  • Capture or analyze separate prefill and decode evidence when the framework supports it.
For every profiled framework, save the same three tables:
  • kernel table
  • overlap-opportunity table
  • fuse-pattern table
Then write or update `analysis/root-cause.md` with the current cross-framework comparison: which stage is slower, which table rows explain it, and which vLLM source paths or kernel families are plausible patch targets.
Do not patch vLLM until this report exists for the current gap.
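The profiling rules above reduce to two small mappings: which frameworks to profile, and which lengths to use. The sketch below uses hypothetical function names and dictionary keys; it does not reflect the actual interface of `llm-torch-profiler-analysis`.

```python
def profile_shapes(slow_input_len: int, slow_output_len: int) -> dict:
    """Derive prefill/decode profile lengths from the slow benchmark scenario."""
    return {
        "prefill": {"input_len": slow_input_len, "output_len": 1},
        "decode": {"input_len": 1, "output_len": slow_output_len},
    }


def frameworks_to_profile(lead_pct_by_rival: dict) -> list:
    """lead_pct_by_rival maps a competitor to its stable lead over vLLM in percent.

    vLLM is always profiled, the leader is always profiled, and every
    competitor more than 1% ahead is profiled as well.
    """
    leader = max(lead_pct_by_rival, key=lead_pct_by_rival.get)
    ahead = {fw for fw, lead in lead_pct_by_rival.items() if lead > 1.0}
    return ["vllm", *sorted(ahead | {leader})]
```

If the summarization scenario (input 8000, output 1000) is the slow one, `profile_shapes(8000, 1000)` gives a prefill profile of 8000 input tokens with 1 output token and a decode profile of 1 input token with 1000 output tokens.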

Layer Pipeline Deep Dive

Run `llm-pipeline-analysis` inside every RLCR round after profiler triage and before choosing a patch target.
The report must identify:
  • the chosen forward pass and why it is representative
  • the relevant layer types, especially for heterogeneous layers such as MoE, hash layers, or `compress_ratios`
  • representative layers for the patch target
  • top hot kernels in those representative layers
  • any Perfetto time ranges needed for inspection
Use the profiled vLLM trace and the served model config. Write `analysis/layer-pipeline.md` with the chosen forward pass, layer-type timing table, representative layers, top hot kernels, and any Perfetto ranges used for inspection. Do not choose a vLLM patch before this report exists for the current round.

Kernel Evidence Assist

Use `ncu-report-skill` only when the active RLCR round is writing a concrete kernel or small kernel-family patch and torch-profiler evidence is not enough to choose or validate the next edit.
Kernel-level assistance is allowed only when all of these are true:
  • vLLM is still more than 1% behind the best framework for the fixed benchmark scenario after the required repeat/profiler checks.
  • The slow stage has a concrete vLLM kernel or tightly scoped kernel family in the kernel table with at least 1% cumulative GPU-time share. Do not spend kernel-specialist effort on a lone kernel below 1% share unless a shared implementation affects an aggregated family above 1%.
  • `llm-pipeline-analysis` has identified the representative layer/forward pass and top hot kernels for the current round.
  • The proposed kernel target has a clear correctness reference, representative shapes/dtypes/layouts from the model run, and a path to wire the candidate into the active vLLM serving code.
For each eligible kernel target:
  1. Read `ncu-report-skill/SKILL.md` and follow its Nsight Compute workflow for harness construction, `ncu` collection, report parsing, stall diagnosis, and evidence-backed next-edit selection.
  2. Store NCU outputs under `kernel/ncu-digests/<version>/` or the host's equivalent artifact root. Each digest must compare baseline vs candidate and end with exactly one concrete next edit.
  3. Patch the vLLM kernel or call path directly in the vLLM checkout, with focused correctness and microbench coverage when available. CUDA and C++ kernel code lives under `csrc/`, Triton kernels and attention/quantization wrappers live under `vllm/`, and torch.compile-driven paths live under `vllm/compilation/`.
  4. Wire the candidate into the active model-serving path that produced the original profiler row.
  5. Re-run the same real-model benchmark and profiler after the candidate is correct. A microbench or NCU win alone is not success.
If no focused harness exists, build the smallest harness that preserves the model-derived shapes/dtypes/layouts. If NCU cannot run on the host, record the blocker in the digest path and keep the next edit grounded in the available torch-profiler, layer-pipeline, and source evidence.
Do not start any standalone `.humanize/rlcr` session for a kernel target. Kernel work stays inside the active model RLCR loop.
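The GPU-time share condition from the eligibility list can be sketched as a filter over kernel-table rows. This covers only that one condition: the remaining gates (stable gap above 1%, completed layer-pipeline analysis, correctness reference) still apply. The row format and the `family_of` grouping function are hypothetical, not the kernel-table schema.

```python
from collections import defaultdict

MIN_SHARE_PCT = 1.0  # share threshold from the text


def eligible_kernel_targets(kernel_rows, family_of):
    """Filter kernels by GPU-time share.

    kernel_rows: list of (kernel_name, gpu_time_share_pct) tuples.
    family_of: callable mapping a kernel name to its family label.
    """
    family_share = defaultdict(float)
    for name, share in kernel_rows:
        family_share[family_of(name)] += share
    targets = []
    for name, share in kernel_rows:
        # A lone kernel below the threshold qualifies only through a family
        # whose aggregated share is above the threshold.
        if share >= MIN_SHARE_PCT or family_share[family_of(name)] > MIN_SHARE_PCT:
            targets.append(name)
    return targets
```

How kernels are grouped into families is the judgment call here; the text only requires that the family share a single implementation.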

Model-Loop Checkpoint

After every accepted round, update `humanize/model-loop-checkpoint.md` with:
  • original model, tokenizer, precision, quantization, hardware, workload, SLA, artifact root, and benchmark winner commands
  • current vLLM branch, commit, patches applied, tests run, and current best vLLM benchmark row
  • remaining gap, profiler rows, model PR history notes, layer-pipeline notes, NCU digest paths, rejected source ideas, and the next planned vLLM patch
This checkpoint is for campaign recovery inside the same model-level workflow. It records enough context to resume the campaign without losing benchmark/profile lineage.

Loop Ledgers

Keep these files under the run artifact root or the vLLM checkout, depending on the host convention:
text
humanize/attempt-ledger.md
humanize/optimization-ledger.md
humanize/source-idea-ledger.md
humanize/lineage.jsonl
humanize/profile-digests/
Every patch attempt gets an attempt row. Only correct patches with measured improvement get optimization rows. Source ideas must include profiler rows, layer-pipeline evidence, NCU report paths when used, and code provenance so later rounds can avoid re-reading the same source. Model PR history evidence should be recorded beside vLLM, SGLang, TensorRT-LLM, and NCU source ideas when it influenced the patch.
After two consecutive rounds with less than 1% geomean improvement over the prior best vLLM result, expand code-first research before editing again. Prefer code and PR evidence from vLLM, SGLang, TensorRT-LLM, and relevant kernel source guides before prose-only articles.
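The stall rule can be checked mechanically from the ledger. The sketch below assumes one higher-is-better number per fixed scenario and takes the geometric mean of the per-scenario ratios; the function names and the dictionary layout are illustrative, not the ledger format.

```python
from statistics import geometric_mean


def geomean_improvement_pct(prior_best: dict, current: dict) -> float:
    """Geomean improvement of `current` over `prior_best`, in percent.

    Both arguments map scenario name -> a higher-is-better metric (assumption).
    """
    ratios = [current[name] / prior_best[name] for name in prior_best]
    return (geometric_mean(ratios) - 1.0) * 100.0


def needs_code_first_research(round_improvements_pct: list) -> bool:
    """True after two consecutive rounds that each improved by less than 1%."""
    last_two = round_improvements_pct[-2:]
    return len(last_two) == 2 and all(pct < 1.0 for pct in last_two)
```

A 4% gain on `chat` with no change on `summarization` is roughly a 1.98% geomean improvement, so that round alone would not count toward the stall rule.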

Stop Conditions

Stop only when one of these is true:
  • vLLM beats the best SLA-passing SGLang/TensorRT-LLM result on the fixed workload.
  • vLLM is tied within the stable 1% threshold after repeat runs.
  • The remaining gap is proven external to vLLM, such as unavailable hardware support, missing framework dependency, unsupported TensorRT-LLM PyTorch backend, or model weights that cannot be loaded fairly.
  • Profile evidence shows the remaining hot path is already near the relevant hardware or algorithmic limit and no low-risk vLLM patch remains.
The final report must include the fixed benchmark table, post-patch benchmark table, all winner commands, model PR history paths, profile paths, layer-pipeline paths when used, NCU digest paths when used, vLLM changed files, tests, and whether vLLM reached target-environment SOTA.