| Language key | Python package path | Design skill | API reference skill | Use when |
|---|---|---|---|---|
| | | | | Block-level control, tiling, CTA remapping, and compiler hints are sufficient |
| | | | | Explicit thread/warp scheduling, TMA pipelines, or shared memory control is needed |
- Language key and package: `cute-dsl`, `cute_python`
- Source layout: `src/mla_var3/kernel/<lang_pkg>/mla/<design>/...`
- Run a kernel: `python -m mla_var3.kernel <kernel> [<version>]`
- Package hierarchy: `kernel`, `kernel.cutile`, `kernel.cute_python`, `kernel.cutile.mla`, `kernel.cutile.mla.flash_mla`, `kernel.cutile.mla.flash_mla.flash_mla_v2`, `flash_mla_v2.py`, `KernelPlan`
- General pattern: `kernel.<lang_pkg>.mla.<design>.<design>[_v<N>].<design>[_v<N>].py`
- Directories: `<design>/<design>/`, `<design>/<design>_vN/`
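The naming pattern above can be sketched as a pair of helpers that turn a language key, design name, and optional version into a dotted module name and a source file path. This is only an illustration: `LANG_PACKAGES`, `kernel_module`, and `kernel_source_file` are hypothetical names, the `cutile` language key is a guess (only the `cutile` package and the `cute-dsl`/`cute_python` pair appear in the text), and the file layout is inferred from the pattern rather than confirmed.

```python
from pathlib import PurePosixPath
from typing import Optional

# Assumed mapping from language key to package name (the "cutile" key is a guess).
LANG_PACKAGES = {"cutile": "cutile", "cute-dsl": "cute_python"}


def _leaf(design: str, version: Optional[int]) -> str:
    """Return `<design>` or `<design>_v<N>` depending on whether a version is given."""
    return design if version is None else f"{design}_v{version}"


def kernel_module(lang_key: str, design: str, version: Optional[int] = None) -> str:
    """Dotted package name following kernel.<lang_pkg>.mla.<design>.<design>[_v<N>]."""
    pkg = LANG_PACKAGES[lang_key]
    return f"kernel.{pkg}.mla.{design}.{_leaf(design, version)}"


def kernel_source_file(lang_key: str, design: str, version: Optional[int] = None) -> PurePosixPath:
    """Path of the .py file, assuming it sits in a directory of the same name."""
    leaf = _leaf(design, version)
    base = PurePosixPath("src/mla_var3/kernel") / LANG_PACKAGES[lang_key] / "mla" / design
    return base / leaf / f"{leaf}.py"
```

For example, `kernel_module("cutile", "flash_mla", 2)` reproduces the `kernel.cutile.mla.flash_mla.flash_mla_v2` name listed above.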
---
```bash
source .venv/bin/activate
python ./scripts/clone-kernel.py <kernel_full_name> <new_suffix>
```

Relevant decorators and classes: `@ct.kernel`, `@cute.kernel`, `@cute.jit`, `KernelPlan`, `Tiling`

```bash
source .venv/bin/activate
python -m mla_var3.kernel.<lang_pkg>.mla.<design> <version> --prof_type=disabled --check
```

`KernelPlan` subclass skeleton, including `plan()`:

```python
from dataclasses import dataclass, field

# KernelPlan, Tiling, and BenchmarkFn are provided by the project framework.
@dataclass
class MyKernel(KernelPlan):
    b: int = 64; s: int = 1; t: int = 4096  # problem dimensions
    tiling: MyTiling = field(default_factory=MyTiling)

    def prepare_inputs(self, device) -> tuple:
        """Allocate and return input tensors."""

    def reference_fn(self, *inputs) -> tuple:
        """Reference implementation for --check."""

    def _autotune_configs(self) -> list[MyTiling]:
        """Candidate tiling configs for autotuner search."""

    def _algorithmic_flops_bytes(self, tiling) -> tuple[int, int]:
        """Analytical (FLOPs, bytes) for roofline."""

    def plan(self, *inputs) -> BenchmarkFn:
        """Build executable runtime object (DSL-specific)."""

    def plan_empty(self, peak_tflops, peak_gbps) -> BenchmarkFn:
        """Roofline-only prediction (no real tensors)."""
```
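The `_algorithmic_flops_bytes` and `plan_empty` hooks suggest a standard roofline estimate: given analytical FLOPs and bytes plus the device peaks, the predicted time is the slower of the compute and memory terms. The sketch below shows that arithmetic only. `roofline_time_us` and `roofline_bound` are hypothetical helper names, and the framework's actual prediction logic is not shown in the source.

```python
def roofline_time_us(flops: int, nbytes: int, peak_tflops: float, peak_gbps: float) -> float:
    """Lower-bound runtime in microseconds under a simple roofline model."""
    compute_s = flops / (peak_tflops * 1e12)  # time if limited by arithmetic throughput
    memory_s = nbytes / (peak_gbps * 1e9)     # time if limited by memory bandwidth
    return max(compute_s, memory_s) * 1e6


def roofline_bound(flops: int, nbytes: int, peak_tflops: float, peak_gbps: float) -> str:
    """Classify a kernel as compute- or memory-bound by arithmetic intensity."""
    ridge = (peak_tflops * 1e12) / (peak_gbps * 1e9)  # FLOP/byte at the roofline ridge
    intensity = flops / nbytes
    return "compute" if intensity >= ridge else "memory"
```

A kernel whose arithmetic intensity falls below the ridge point is limited by bandwidth, so tiling changes that reduce bytes moved matter more there than changes that reduce FLOPs.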
```python
@dataclass
class MyTiling(Tiling):
    # DSL-specific fields; see the language-specific skill for examples
    def validate(self, pd: "MyKernel") -> bool:
        # Return True if this tiling is valid for the given problem dimensions
        ...
```
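To make the `validate` contract concrete, here is a minimal self-contained sketch of a tiling that checks divisibility against the problem dimensions, together with a candidate generator that filters invalid configs in the spirit of `_autotune_configs`. `Problem`, `ToyTiling`, and the fields `tile_t` and `num_splits` are hypothetical stand-ins, not the project's real classes or fields.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Problem:
    """Hypothetical stand-in for the problem dimensions held by a KernelPlan."""
    b: int = 64
    s: int = 1
    t: int = 4096


@dataclass(frozen=True)
class ToyTiling:
    """Hypothetical tiling: split t into chunks, then tile each chunk."""
    tile_t: int
    num_splits: int

    def validate(self, pd: Problem) -> bool:
        # t must divide evenly into splits, and each split into whole tiles.
        if pd.t % self.num_splits != 0:
            return False
        return (pd.t // self.num_splits) % self.tile_t == 0


def autotune_configs(pd: Problem) -> list[ToyTiling]:
    """Enumerate a small search space and keep only valid tilings."""
    candidates = (ToyTiling(tt, ns) for tt, ns in product((64, 128, 256), (1, 2, 3, 4)))
    return [c for c in candidates if c.validate(pd)]
```

Filtering at generation time keeps the autotuner from launching configurations that would fail or silently compute on partial tiles.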
```python
def plan(self, *inputs) -> KernelPipeline:
    stage1 = stage1_plan.plan(...)
    stage2 = stage2_plan.plan(...)
    return KernelPipeline(_name="my_pipeline", stages=[stage1, stage2])
```
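The staged `plan()` composes already-planned stages into one runnable object. The source does not show how `KernelPipeline` executes its stages, so the sketch below assumes the simplest behavior, running them in list order. `ToyStage` and `ToyPipeline` are hypothetical stand-ins written for illustration only.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ToyStage:
    """Hypothetical stand-in for a planned kernel: a named callable."""
    name: str
    fn: Callable[[], None]

    def __call__(self) -> None:
        self.fn()


@dataclass
class ToyPipeline:
    """Hypothetical stand-in for a pipeline; assumes stages run in list order."""
    name: str
    stages: List[ToyStage] = field(default_factory=list)

    def __call__(self) -> None:
        for stage in self.stages:
            stage()


def run_demo() -> List[str]:
    """Build a two-stage pipeline and return the order in which stages ran."""
    log: List[str] = []
    pipeline = ToyPipeline(
        name="my_pipeline",
        stages=[
            ToyStage("stage1", lambda: log.append("stage1")),
            ToyStage("stage2", lambda: log.append("stage2")),
        ],
    )
    pipeline()
    return log
```

Because the pipeline is itself callable, it can be benchmarked or nested the same way as a single stage.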
```python
def plan(self, *inputs) -> KernelPipeline:
    a = plan_a.plan(...)
    b = plan_b.plan(...)
    concurrent = ConcurrentKernels(
        _name="overlap_group", concurrent_kernels=[a, b],
        validate_joint_tiling_fn=validate_fn,
    )
    combine = combine_plan.plan(...)
    return KernelPipeline(_name="pipeline", stages=[concurrent, combine])
```
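The `validate_joint_tiling_fn` argument implies that tilings valid in isolation may still be invalid together, for example when overlapped kernels compete for SMs. The signature of that callback is not given in the source, so the sketch below assumes it receives the participating launch configurations and returns a bool. `ToyLaunch`, `validate_joint_tiling`, and the resource defaults are hypothetical and illustrative (roughly H100-class numbers), and the check assumes one resident CTA per SM.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ToyLaunch:
    """Hypothetical per-kernel launch footprint derived from its tiling."""
    num_ctas: int
    smem_bytes_per_cta: int


def validate_joint_tiling(
    launches: Sequence[ToyLaunch],
    num_sms: int = 132,              # illustrative default, adjust for the target GPU
    smem_per_sm: int = 227 * 1024,   # illustrative default, adjust for the target GPU
) -> bool:
    """Return True if all kernels can be co-resident, assuming one CTA per SM."""
    fits_sms = sum(l.num_ctas for l in launches) <= num_sms
    fits_smem = all(l.smem_bytes_per_cta <= smem_per_sm for l in launches)
    return fits_sms and fits_smem
```

A check like this lets an autotuner reject tiling pairs that would serialize instead of overlapping, before spending time benchmarking them.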
Workflow references: `python ./scripts/clone-kernel.py`, `docs/knowledge/`, `--prof_type=disabled --check`

Knowledge base directories:

- `docs/knowledge/optimizations/`
- `docs/knowledge/anti-patterns/`
- `docs/knowledge/languages/<language>/...`

`docs/kernels/<kernel>.md`, section `## Development log`:

`src/mla_var3/kernel/<lang_pkg>/mla/<kernel>/<kernel>_v<N>/`

Performance metrics, bottleneck analysis, issues, and insights are filled in by the profiler agent after profiling.

Related files: `docs/knowledge/optimizations/`, `docs/knowledge/anti-patterns/`, `docs/kernels/<kernel>.md`