orchestration-workflow

Orchestration Workflow

Optimization Cycle

```text
1. ANALYZE    -> Read devlog overview + current best results
2. PROFILE    -> Launch Profiler Agent -> get compact bottleneck analysis
3. SELECT     -> Choose implementation language based on the required control level
4. STRATEGIZE -> Load the shared optimization guidance plus the language-specific optimization catalog
5. DESIGN     -> Launch Kernel Designer Agent -> implement optimizations in new version
6. VERIFY     -> Launch Profiler Agent on new version -> compare with previous
7. EVALUATE   -> Target met? -> Done. Not met? -> Back to step 3
8. LEARN      -> If novel insight: update knowledge base (mandatory)
```

The `LEARN` step is mandatory whenever a profiling or implementation iteration reveals a reusable pattern, a refined applicability condition, or a refuted prior assumption.
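
The control flow of the cycle can be sketched as a small loop. This is only an illustration under stated assumptions: `CycleState`, `run_cycle`, and the three callables are hypothetical stand-ins for the agents, versions are plain integers, and the target is expressed as a latency. The sketch runs the LEARN check on every iteration, which is one way to honor the "mandatory" rule.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class CycleState:
    """Bookkeeping for one optimization run (hypothetical structure)."""
    version: int = 0
    best_latency_us: float = float("inf")
    learned: List[str] = field(default_factory=list)


def run_cycle(
    profile: Callable[[int], float],                   # PROFILE / VERIFY: version -> latency
    design: Callable[[int], int],                      # SELECT + STRATEGIZE + DESIGN: best version -> new version
    insight: Callable[[float, float], Optional[str]],  # LEARN: (old, new) latency -> note or None
    target_us: float,
    max_iterations: int = 10,
) -> CycleState:
    state = CycleState()
    state.best_latency_us = profile(state.version)     # steps 1-2: analyze and profile the baseline
    for _ in range(max_iterations):
        if state.best_latency_us <= target_us:         # step 7: target met, done
            break
        new_version = design(state.version)            # steps 3-5: back to SELECT, then design
        new_latency = profile(new_version)             # step 6: verify against the previous best
        note = insight(state.best_latency_us, new_latency)
        if note is not None:                           # step 8: record any novel insight
            state.learned.append(note)
        if new_latency < state.best_latency_us:        # keep the new version only if it is faster
            state.version, state.best_latency_us = new_version, new_latency
    return state
```

A regression still passes through the `insight` callback, so a refuted assumption can be recorded even when the new version is discarded.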

Default Configurations

Speculative Decoding

`b=32, s=16, t=4096` -- reduce if OOM.

Default Trial Matrix for Exploratory Tuning

Decode-like:
```
b=32, s=1, t=4096
```
Speculative-like:
```
b=32, s=16, t=4096
```
Add a third stress configuration only when a strategy or devlog evidence shows the first two points are insufficient.
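
One way to keep the default matrix explicit is to hold it as data and let a stress point in only when it is passed deliberately. A minimal sketch: `TrialConfig`, `DEFAULT_TRIALS`, and `trial_matrix` are hypothetical names, and the field names `b`, `s`, `t` simply mirror the configurations above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class TrialConfig:
    label: str
    b: int
    s: int
    t: int


# The two default points listed above.
DEFAULT_TRIALS = (
    TrialConfig("decode-like", b=32, s=1, t=4096),
    TrialConfig("speculative-like", b=32, s=16, t=4096),
)


def trial_matrix(stress: Optional[TrialConfig] = None) -> List[TrialConfig]:
    """Return the two default points, plus a stress point only when one is supplied."""
    trials = list(DEFAULT_TRIALS)
    if stress is not None:
        trials.append(stress)
    return trials
```

Calling `trial_matrix()` yields the two defaults; any third configuration has to be constructed and justified by the caller.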

Tuning Space Exploration

Start by testing a manually tuned tiling configuration that is expected to perform well.
Explore a few tiling configurations to understand how performance changes with tile size.
If the situation is complex, consider coarse autotuning to find good configurations, but be mindful of combinatorial explosion and use domain knowledge to prune the search space. Autotuning time should stay small enough to keep kernel design exploration fast (e.g., < 1 hour per kernel version).
If a kernel version shows significant potential, consider finer-grained autotuning to refine its performance further. Otherwise, proceed to the next optimization iteration; do not spend much time autotuning a kernel version that is not promising enough.

For instance, `mla_var6_plus_v3` is a bad example: it has 10,000 autotuning configurations and does not provide any significant improvement over `mla_var6_plus_v2`. Kernel design is more important than autotuning.
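
The effect of pruning on search-space size and autotuning time can be estimated before launching anything. The sketch below is illustrative only: the candidate lists, the two filters (`max_tile_elems`, `min_stages`), and the per-configuration timing are hypothetical, while the one-hour budget comes from the guideline above.

```python
from itertools import product

# Hypothetical candidates; real values depend on the kernel and the device.
TILE_M = [16, 32, 64, 128]
TILE_N = [16, 32, 64, 128, 256]
NUM_STAGES = [1, 2, 3, 4]


def full_space():
    """Every combination of the candidate values (the combinatorial explosion)."""
    return list(product(TILE_M, TILE_N, NUM_STAGES))


def pruned_space(max_tile_elems=8192, min_stages=2):
    """Keep only configurations that pass simple domain-knowledge filters."""
    return [
        (m, n, stages)
        for m, n, stages in product(TILE_M, TILE_N, NUM_STAGES)
        if m * n <= max_tile_elems and stages >= min_stages
    ]


def fits_budget(num_configs, seconds_per_config, budget_seconds=3600):
    """Check the estimated autotuning time against the per-version budget."""
    return num_configs * seconds_per_config <= budget_seconds
```

With these made-up numbers the full space has 80 configurations and the pruned one has 51, so at one minute per configuration only the pruned space fits within an hour.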

Sub-Agent Launch Templates

Agent definitions are in `.claude/agents/`. Spawn agents by name with a task prompt.

Profiler

Spawn the `profiler` agent with a task prompt:

```text
Profile <kernel> <version> at b=X, s=X, t=X.
Return compact summary per the output contract.
Update the devlog performance section in docs/kernels/<kernel>.md.
```

The profiler has `/profile-kernel` preloaded via its agent definition.

Kernel Designer

Spawn the `kernel-designer` agent with a task prompt:

```text
Create <kernel> <new_version> from <current_version>.
Language: <cutile-dsl|cute-dsl>.
Load /design-<language>-kernel and the matching reference skill if one exists.
Apply optimizations: [specific list with rationale].
Return implementation summary per the output contract.
```

The kernel-designer has `/design-kernel` preloaded. Language-specific skills must still be loaded on-demand since the language depends on the orchestrator's selection.
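
Because the skill to load depends on the selected language, it can help to fill the designer template programmatically and reject unknown languages early. A minimal sketch, assuming the orchestrator assembles prompts as plain strings; `designer_prompt` and `SUPPORTED_LANGUAGES` are hypothetical names, and how the agent is actually spawned is left out.

```python
from typing import List, Tuple

# The two language options named in the template above.
SUPPORTED_LANGUAGES = ("cutile-dsl", "cute-dsl")


def designer_prompt(
    kernel: str,
    current_version: str,
    new_version: str,
    language: str,
    optimizations: List[Tuple[str, str]],  # (optimization, rationale) pairs
) -> str:
    """Fill the kernel-designer task template, one template line per output line."""
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language: {language!r}")
    if not optimizations:
        raise ValueError("at least one optimization with a rationale is required")
    listed = "; ".join(f"{name} ({why})" for name, why in optimizations)
    return "\n".join([
        f"Create {kernel} {new_version} from {current_version}.",
        f"Language: {language}.",
        f"Load /design-{language}-kernel and the matching reference skill if one exists.",
        f"Apply optimizations: {listed}.",
        "Return implementation summary per the output contract.",
    ])
```

Requiring a rationale alongside each optimization keeps the "specific list with rationale" part of the template from being silently dropped.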

Agent Communication Contracts

Profiler -> Orchestrator (compact output)

```markdown
Profile: [kernel] [version] | b=X, s=X, t=X

Stages

Bottleneck: [Memory/Compute/Latency]-bound

Root cause: [2 sentences]

Top 3 Opportunities (ranked by estimated impact)

[name] -- est. X% gain -- trigger: [metric=value]

[name] -- est. X% gain -- trigger: [metric=value]

vs Baseline (if applicable)

| Metric | Previous | Current | Change |
|--------|----------|---------|--------|
```

Orchestrator -> Designer (instructions)

```markdown
Optimization Task: [kernel] [current] -> [new_version]

Current Bottleneck: [from profiler]

Optimizations to Apply:

[specific optimization + rationale + link to the shared or language-specific knowledge file]

[specific optimization + rationale + link to the shared or language-specific knowledge file]

Constraints
```

Designer -> Orchestrator (summary)

```markdown
New Version: [kernel] [version]

Changes Applied: [list]

Files: Created/Modified [paths]

Correctness: [PASS/FAIL]

Trial Configurations Checked

[b, s, t] -- [why this point matters]

[b, s, t] -- [why this point matters]

Devlog Entry Written: [path]
```

---

Knowledge Base Update Protocol

When to Update

After profiling a kernel, when a relevant new optimization or anti-pattern is identified that is not yet in the catalog
After profiling a kernel, when an existing optimization or anti-pattern is refuted: update its validity conditions, or reclassify it (optimization to anti-pattern, or the reverse) as needed
When new performance evidence refines the estimated impact of an optimization or the failure mode of an anti-pattern
When new interactions between optimizations are discovered
When a device-specific result can be abstracted into a reusable rule

New Optimization Validated

Decide whether the finding is shared algorithmic/hardware knowledge or language-specific implementation knowledge.
Shared knowledge goes under `docs/knowledge/optimizations/<name>.md`.
Language-specific knowledge goes under `docs/knowledge/languages/<language>/optimizations/<name>.md`.
Add the corresponding row to the optimization index in the `/optimization-catalog` skill.
Capture the reusable pattern, applicability context, and the primary metrics affected.
Explicitly separate local evidence and generalization.

Optimization Caused Clear Regression

Decide whether the failure mode is shared or language-specific.
Shared anti-patterns go under `docs/knowledge/anti-patterns/<name>.md`.
Language-specific anti-patterns go under `docs/knowledge/languages/<language>/anti-patterns/<name>.md`.
Add the corresponding row to the anti-pattern index in the `/optimization-catalog` skill.
Document the failure mode in reusable terms, not just the failing kernel/version.
Record which metrics exposed the problem and under what context it appears.
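
The shared versus language-specific routing for both optimizations and anti-patterns reduces to a small path rule. A sketch under assumptions: `knowledge_path` and `KINDS` are hypothetical names, and the directory layout is taken from the paths quoted in this section.

```python
from pathlib import PurePosixPath
from typing import Optional

KINDS = ("optimizations", "anti-patterns")


def knowledge_path(kind: str, name: str, language: Optional[str] = None) -> PurePosixPath:
    """Return the detail-file path for a finding.

    Shared findings go directly under docs/knowledge/<kind>/; language-specific
    findings go under docs/knowledge/languages/<language>/<kind>/.
    """
    if kind not in KINDS:
        raise ValueError(f"kind must be one of {KINDS}, got {kind!r}")
    root = PurePosixPath("docs/knowledge")
    if language is not None:
        root = root / "languages" / language
    return root / kind / f"{name}.md"
```

The function only computes the file location; adding the matching row to the index in the `/optimization-catalog` skill remains a separate step.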

Detail File Template

Every optimization and anti-pattern must refer clearly to the applicable context (e.g., MLA-specific, any online-softmax kernel, any kernel with a certain pattern). The context goes in the `When to Apply` section.

````markdown
[Optimization Name]

When to Apply

[Context 1: e.g., specific kernel design or reference layer/kernel]
[Context 2]
[Metric condition 1]
[Metric condition 2]

Mechanism

[How and why this optimization works]

Affected Metrics

[Metric 1: e.g. occupancy]
[Metric 2: e.g. registers/thread]
[Metric 3: e.g. Tensor Core utilization, DRAM throughput, L2 hit rate, local-memory traffic]

Implementation

```python
# Code snippet
```

Performance Evidence

Source type: [local experiment / external report]

| Config | Before | After | Change |
|--------|--------|-------|--------|

Generalization

[Device-agnostic takeaway. Mention architecture/device facts only insofar as they sharpen the reusable rule.]

Pitfalls

[Known failure modes]

Interactions

[How this interacts with other optimizations]
````
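
A draft detail file can be checked against the template before it is indexed. The sketch below is a hypothetical helper, not part of the workflow itself: it assumes each section title sits on its own line, with or without leading `#` heading markers.

```python
REQUIRED_SECTIONS = (
    "When to Apply",
    "Mechanism",
    "Affected Metrics",
    "Implementation",
    "Performance Evidence",
    "Generalization",
    "Pitfalls",
    "Interactions",
)


def missing_sections(text: str) -> list:
    """Return the required section titles that never appear as a line of their own."""
    # Strip whitespace and any leading '#' markers so "## Mechanism" matches "Mechanism".
    present = {line.strip().lstrip("#").strip() for line in text.splitlines()}
    return [title for title in REQUIRED_SECTIONS if title not in present]
```

An empty result means every section of the template is present; it says nothing about whether the content under each heading is adequate.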