orchestration-workflow
Orchestration Workflow
Optimization Cycle
1. ANALYZE -> Read devlog overview + current best results
2. PROFILE -> Launch Profiler Agent -> get compact bottleneck analysis
3. SELECT -> Choose implementation language based on the required control level
4. STRATEGIZE -> Load the shared optimization guidance plus the language-specific optimization catalog
5. DESIGN -> Launch Kernel Designer Agent -> implement optimizations in new version
6. VERIFY -> Launch Profiler Agent on new version -> compare with previous
7. EVALUATE -> Target met? -> Done. Not met? -> Back to step 3
8. LEARN -> If novel insight: update knowledge base (mandatory)

The `LEARN` step is mandatory whenever a profiling or implementation iteration reveals a reusable pattern, a refined applicability condition, or a refuted prior assumption.
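The cycle can be sketched as a small driver loop. This is only an illustration under stated assumptions, not the orchestrator's actual implementation: `run_cycle`, the `Profile` record, the `profile` and `design` callables, and the latency-based target are all hypothetical stand-ins for the agent launches, and steps 3 to 5 are folded into the single `design` callable.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Profile:
    version: str
    latency_us: float               # hypothetical target metric
    insight: Optional[str] = None   # novel finding reported by the iteration, if any


def run_cycle(
    start_version: str,
    target_latency_us: float,
    profile: Callable[[str], Profile],       # stands in for the Profiler Agent
    design: Callable[[str, Profile], str],   # stands in for SELECT + STRATEGIZE + DESIGN
    max_iterations: int = 5,
) -> Tuple[Profile, List[str]]:
    """Iterate PROFILE -> DESIGN -> VERIFY -> EVALUATE and collect LEARN notes."""
    knowledge_updates: List[str] = []
    best = profile(start_version)                   # step 2: PROFILE
    for _ in range(max_iterations):
        if best.latency_us <= target_latency_us:    # step 7: EVALUATE
            break
        new_version = design(best.version, best)    # steps 3-5
        candidate = profile(new_version)            # step 6: VERIFY
        if candidate.insight:                       # step 8: LEARN, whenever an insight exists
            knowledge_updates.append(candidate.insight)
        if candidate.latency_us < best.latency_us:  # keep the better version as the baseline
            best = candidate
    return best, knowledge_updates
```

Note that the LEARN note is recorded even when the candidate regresses, since a refuted assumption is also a reason to update the knowledge base.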
Default Configurations
Speculative Decoding
b=32, s=16, t=4096
Default Trial Matrix for Exploratory Tuning
- Decode-like: b=32, s=1, t=4096
- Speculative-like: b=32, s=16, t=4096
- Add a third stress configuration only when a strategy or devlog evidence shows the first two points are insufficient.
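One way to keep the trial matrix explicit is to encode it as data. The sketch below is an assumption-laden illustration: `TrialConfig`, `DEFAULT_TRIALS`, and `trial_matrix` are hypothetical names, and the fields simply mirror the `b`, `s`, `t` names used above without interpreting them.

```python
from typing import List, NamedTuple, Optional


class TrialConfig(NamedTuple):
    label: str
    b: int
    s: int
    t: int


# The two default points from the trial matrix.
DEFAULT_TRIALS = [
    TrialConfig("decode-like", b=32, s=1, t=4096),
    TrialConfig("speculative-like", b=32, s=16, t=4096),
]


def trial_matrix(stress: Optional[TrialConfig] = None) -> List[TrialConfig]:
    """Return the two default points, plus a stress point only if the caller supplies one."""
    return DEFAULT_TRIALS + ([stress] if stress is not None else [])
```

The stress point is deliberately caller-supplied: no third configuration is suggested here, since it should come from a strategy or from devlog evidence.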
Tuning Space Exploration
- Start by testing a manually tuned tiling configuration that is expected to perform well.
- Explore a few tiling configurations to understand how performance changes with tile size.
- If the situation is complex, consider coarse-grained autotuning to find good configurations, but be mindful of combinatorial explosion: use domain knowledge to prune the search space. Autotuning time should stay small enough to keep kernel design exploration fast (e.g., < 1 hour per kernel version).
- If a kernel version shows significant potential, consider more fine-grained autotuning to refine its performance further. Otherwise, proceed to the next optimization iteration; do not spend too much time autotuning a kernel version that is not promising enough.

For instance, a bad example is `mla_var6_plus_v3`, which has 10,000 autotuning configurations and does not provide any significant improvement over `mla_var6_plus_v2`. Kernel design is more important than autotuning.
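The pruning and time-budget advice can be illustrated with a small search-space sketch. Everything specific here is an assumption made for illustration: the tuning axes in `SPACE`, the shared-memory rule in `fits_smem` (including the 96 KiB limit and 2-byte elements), and the per-configuration timing passed to `within_budget`.

```python
from itertools import product

# Hypothetical tuning axes; real kernels will expose different knobs.
SPACE = {
    "tile_m": [32, 64, 128, 256],
    "tile_n": [32, 64, 128, 256],
    "tile_k": [16, 32, 64],
    "stages": [2, 3, 4],
}


def candidates(space, keep):
    """Yield every configuration in the Cartesian product that passes `keep`."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        cfg = dict(zip(keys, values))
        if keep(cfg):
            yield cfg


def fits_smem(cfg, bytes_per_elem=2, smem_limit=96 * 1024):
    """Example pruning rule (assumed limit): staged A and B tiles must fit in shared memory."""
    tile_bytes = (cfg["tile_m"] + cfg["tile_n"]) * cfg["tile_k"] * bytes_per_elem
    return cfg["stages"] * tile_bytes <= smem_limit


def within_budget(n_configs, seconds_per_config, budget_s=3600):
    """Check a sweep against the < 1 hour per kernel version guideline."""
    return n_configs * seconds_per_config <= budget_s


full = list(candidates(SPACE, lambda cfg: True))
pruned = list(candidates(SPACE, fits_smem))
```

With an assumed 20 seconds per configuration, `within_budget(len(pruned), 20)` holds for this small space, while `within_budget(10_000, 20)` does not, which is the kind of sweep the guidance above warns against.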
Sub-Agent Launch Templates
Agent definitions are in `.claude/agents/`. Spawn agents by name with a task prompt.
Profiler
Spawn the `profiler` agent with a task prompt:

Profile <kernel> <version> at b=X, s=X, t=X.
Return compact summary per the output contract.
Update the devlog performance section in docs/kernels/<kernel>.md.

The profiler has `/profile-kernel` preloaded via its agent definition.

Kernel Designer
Spawn the `kernel-designer` agent with a task prompt:

Create <kernel> <new_version> from <current_version>.
Language: <cutile-dsl|cute-dsl>.
Load /design-<language>-kernel and the matching reference skill if one exists.
Apply optimizations: [specific list with rationale].
Return implementation summary per the output contract.

The kernel-designer has `/design-kernel` preloaded. Language-specific skills must still be loaded on demand, since the language depends on the orchestrator's selection.
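Filling the two launch templates programmatically might look like the sketch below. The helper names (`profiler_prompt`, `designer_prompt`) and the `(name, rationale)` pair format are assumptions, and the `<language>` placeholder is substituted literally; the source does not confirm the resulting skill names.

```python
PROFILER_PROMPT = (
    "Profile {kernel} {version} at b={b}, s={s}, t={t}.\n"
    "Return compact summary per the output contract.\n"
    "Update the devlog performance section in docs/kernels/{kernel}.md."
)

DESIGNER_PROMPT = (
    "Create {kernel} {new_version} from {current_version}.\n"
    "Language: {language}.\n"
    "Load /design-{language}-kernel and the matching reference skill if one exists.\n"
    "Apply optimizations: {optimizations}.\n"
    "Return implementation summary per the output contract."
)

# The two language options listed in the designer template.
LANGUAGES = ("cutile-dsl", "cute-dsl")


def profiler_prompt(kernel, version, b, s, t):
    """Fill the profiler task prompt for one trial configuration."""
    return PROFILER_PROMPT.format(kernel=kernel, version=version, b=b, s=s, t=t)


def designer_prompt(kernel, current_version, new_version, language, optimizations):
    """Fill the designer task prompt; `optimizations` is a list of (name, rationale) pairs."""
    if language not in LANGUAGES:
        raise ValueError(f"unknown language: {language!r}")
    listed = "; ".join(f"{name} ({why})" for name, why in optimizations)
    return DESIGNER_PROMPT.format(
        kernel=kernel,
        current_version=current_version,
        new_version=new_version,
        language=language,
        optimizations=listed,
    )
```

Rejecting an unknown language early keeps a mistyped selection from reaching the sub-agent as a malformed skill name.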
Agent Communication Contracts
Profiler -> Orchestrator (compact output)

Profile: [kernel] [version] | b=X, s=X, t=X
Stages
| Stage | Duration | TC% | DRAM% | Occ% | Bottleneck | Key Issue |
|---|---|---|---|---|---|---|
Bottleneck: [Memory/Compute/Latency]-bound
Root cause: [2 sentences]
Top 3 Opportunities (ranked by estimated impact)
- [name] -- est. X% gain -- trigger: [metric=value]
vs Baseline (if applicable)
| Metric | Previous | Current | Change |
|---|---|---|---|
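On the orchestrator side, the compact output could be checked with a small parser. This is a minimal sketch assuming the header and bottleneck lines appear exactly as in the contract; `ProfileSummary` and `parse_summary` are hypothetical names, and the stage table and opportunity list are left unparsed.

```python
import re
from dataclasses import dataclass

# Matches "Profile: <kernel> <version> | b=X, s=X, t=X".
HEADER = re.compile(
    r"Profile:\s*(?P<kernel>\S+)\s+(?P<version>\S+)\s*\|\s*"
    r"b=(?P<b>\d+),\s*s=(?P<s>\d+),\s*t=(?P<t>\d+)"
)
# Requires the colon, so the "Bottleneck" table column header does not match.
BOTTLENECK = re.compile(r"Bottleneck:\s*(Memory|Compute|Latency)-bound")


@dataclass
class ProfileSummary:
    kernel: str
    version: str
    b: int
    s: int
    t: int
    bottleneck: str


def parse_summary(text: str) -> ProfileSummary:
    """Extract the header fields and bottleneck class; raise if either is missing."""
    header = HEADER.search(text)
    bound = BOTTLENECK.search(text)
    if header is None or bound is None:
        raise ValueError("summary does not follow the compact output contract")
    return ProfileSummary(
        kernel=header["kernel"],
        version=header["version"],
        b=int(header["b"]),
        s=int(header["s"]),
        t=int(header["t"]),
        bottleneck=bound.group(1),
    )
```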
Orchestrator -> Designer (instructions)

Optimization Task: [kernel] [current] -> [new_version]
Current Bottleneck: [from profiler]
Optimizations to Apply:
- [specific optimization + rationale + link to the shared or language-specific knowledge file]
Constraints
- register budget, target occupancy, required control level, and language-specific constraints
Designer -> Orchestrator (summary)

New Version: [kernel] [version]
Changes Applied: [list]
Files: Created/Modified [paths]
Correctness: [PASS/FAIL]
Trial Configurations Checked
- [b, s, t] -- [why this point matters]
Devlog Entry Written: [path]

---
Knowledge Base Update Protocol
When to Update
- After profiling a kernel, if a relevant new optimization or anti-pattern is identified that is not yet in the catalog
- After profiling a kernel, if an existing optimization or anti-pattern is refuted: update its validity conditions, or reclassify it as an anti-pattern or optimization as needed
- When new performance evidence refines the estimated impact of an optimization or the failure mode of an anti-pattern
- When new interactions between optimizations are discovered
- When a device-specific result can be abstracted into a reusable rule
New Optimization Validated
- Decide whether the finding is shared algorithmic/hardware knowledge or language-specific implementation knowledge.
- Shared knowledge goes under `docs/knowledge/optimizations/<name>.md`.
- Language-specific knowledge goes under `docs/knowledge/languages/<language>/optimizations/<name>.md`.
- Add the corresponding row to the optimization index in the `/optimization-catalog` skill.
- Capture the reusable pattern, applicability context, and the primary metrics affected.
- Explicitly separate local evidence from generalization.
Optimization Caused Clear Regression
- Decide whether the failure mode is shared or language-specific.
- Shared anti-patterns go under `docs/knowledge/anti-patterns/<name>.md`.
- Language-specific anti-patterns go under `docs/knowledge/languages/<language>/anti-patterns/<name>.md`.
- Add the corresponding row to the anti-pattern index in the `/optimization-catalog` skill.
- Document the failure mode in reusable terms, not just the failing kernel/version.
- Record which metrics exposed the problem and under what context it appears.
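The routing rules for both cases (validated optimization, clear regression) reduce to one path computation. The sketch below follows the four path patterns listed above; the function name `knowledge_path` and its signature are assumptions for illustration.

```python
from pathlib import PurePosixPath
from typing import Optional

KINDS = ("optimizations", "anti-patterns")


def knowledge_path(name: str, kind: str, language: Optional[str] = None) -> PurePosixPath:
    """Return the detail-file path for a shared or language-specific finding."""
    if kind not in KINDS:
        raise ValueError(f"kind must be one of {KINDS}")
    root = PurePosixPath("docs/knowledge")
    if language is not None:
        # Language-specific findings live under languages/<language>/.
        root = root / "languages" / language
    return root / kind / f"{name}.md"
```

For example, `knowledge_path("split-k", "optimizations")` yields a shared path, while passing `language="cute-dsl"` routes the same finding under the language directory. The finding name `split-k` is only a placeholder.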
Detail File Template
Every optimization and anti-pattern must state its applicable context clearly (e.g., MLA-specific, any online-softmax kernel, any kernel with a certain pattern). The context goes in the `When to Apply` section.

[Optimization Name]
When to Apply
- [Context 1: e.g., specific kernel design or reference layer/kernel]
- [Context 2]
- [Metric condition 1]
- [Metric condition 2]
Mechanism
[How and why this optimization works]
Affected Metrics
- [Metric 1: e.g. occupancy]
- [Metric 2: e.g. registers/thread]
- [Metric 3: e.g. Tensor Core utilization, DRAM throughput, L2 hit rate, local-memory traffic]
Implementation
```python
# Code snippet
```
Performance Evidence
Source type: [local experiment / external report]
| Config | Before | After | Change |
|---|---|---|---|
Generalization
[Device-agnostic takeaway. Mention architecture/device facts only insofar as they sharpen the reusable rule.]
Pitfalls
- [Known failure modes]
Interactions
- [How this interacts with other optimizations]
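A new detail file can be checked against the template before it is indexed. This is a minimal sketch: `missing_sections` is a hypothetical helper, and it assumes section titles appear on their own line, with or without leading `#` markers.

```python
REQUIRED_SECTIONS = (
    "When to Apply",
    "Mechanism",
    "Affected Metrics",
    "Implementation",
    "Performance Evidence",
    "Generalization",
    "Pitfalls",
    "Interactions",
)


def missing_sections(text: str) -> list:
    """Return the required section titles that never appear as a heading line."""
    # Strip leading '#' markers and whitespace so both styles of heading match.
    headings = {line.lstrip("#").strip() for line in text.splitlines()}
    return [title for title in REQUIRED_SECTIONS if title not in headings]
```

An empty result means every section of the template is present; it says nothing about whether the content of each section is adequate.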