Full optimization workflow, sub-agent launch templates, agent communication contracts, default configurations, tuning strategy, and knowledge base update protocol. Use when: (1) starting an optimization cycle, (2) launching a Profiler or Designer sub-agent, (3) interpreting or formatting agent communication, (4) updating the knowledge base after a profiling or implementation iteration, (5) deciding default configurations or tuning strategy for a kernel.
Install: `npx skill4agent add pepperu96/hyper-mla orchestration-workflow`

## Optimization Workflow
1. ANALYZE -> Read devlog overview + current best results
2. PROFILE -> Launch Profiler Agent -> get compact bottleneck analysis
3. SELECT -> Choose implementation language based on the required control level
4. STRATEGIZE -> Load the shared optimization guidance plus the language-specific optimization catalog
5. DESIGN -> Launch Kernel Designer Agent -> implement optimizations in new version
6. VERIFY -> Launch Profiler Agent on new version -> compare with previous
7. EVALUATE -> Target met? -> Done. Not met? -> Back to step 3
8. LEARN -> If novel insight: update knowledge base (mandatory)

## Default Configurations
- b=32, s=16, t=4096
- b=32, s=1, t=4096
- b=32, s=16, t=4096

Current versions: mla_var6_plus_v3, mla_var6_plus_v2

## Profiler Agent
Agent definition: .claude/agents/profiler
Launch template:
Profile <kernel> <version> at b=X, s=X, t=X.
Return compact summary per the output contract.
Update the devlog performance section in docs/kernels/<kernel>.md.
Slash command: /profile-kernel

## Kernel Designer Agent
Agent name: kernel-designer
Launch template:
Create <kernel> <new_version> from <current_version>.
Language: <cutile-dsl|cute-dsl>.
Load /design-<language>-kernel and the matching reference skill if one exists.
Apply optimizations: [specific list with rationale].
Return implementation summary per the output contract.
Slash command: /design-kernel

Profiler output contract:
## Profile: [kernel] [version] | b=X, s=X, t=X
### Stages
| Stage | Duration | TC% | DRAM% | Occ% | Bottleneck | Key Issue |
|-------|----------|-----|-------|------|------------|-----------|
### Bottleneck: [Memory/Compute/Latency]-bound
Root cause: [2 sentences]
### Top 3 Opportunities (ranked by estimated impact)
1. [name] -- est. X% gain -- trigger: [metric=value]
### vs Baseline (if applicable)
| Metric | Previous | Current | Change |
|--------|----------|---------|--------|

Designer input contract:
## Optimization Task: [kernel] [current] -> [new_version]
### Current Bottleneck: [from profiler]
### Optimizations to Apply:
1. [specific optimization + rationale + link to the shared or language-specific knowledge file]
### Constraints
- register budget, target occupancy, required control level, and language-specific constraints

Designer output contract:
## New Version: [kernel] [version]
### Changes Applied: [list]
### Files: Created/Modified [paths]
### Correctness: [PASS/FAIL]
### Trial Configurations Checked
1. [b, s, t] -- [why this point matters]
### Devlog Entry Written: [path]

## Knowledge Base Update Protocol
File locations:
- Shared optimizations: docs/knowledge/optimizations/<name>.md
- Language-specific optimizations: docs/knowledge/languages/<language>/optimizations/<name>.md (index: /optimization-catalog)
- Shared anti-patterns: docs/knowledge/anti-patterns/<name>.md
- Language-specific anti-patterns: docs/knowledge/languages/<language>/anti-patterns/<name>.md (index: /optimization-catalog)

Knowledge file template:
# [Optimization Name]
## When to Apply
- [Context 1: e.g., specific kernel design or reference layer/kernel]
- [Context 2]
- [Metric condition 1]
- [Metric condition 2]
## Mechanism
[How and why this optimization works]
## Affected Metrics
- [Metric 1: e.g. occupancy]
- [Metric 2: e.g. registers/thread]
- [Metric 3: e.g. Tensor Core utilization, DRAM throughput, L2 hit rate, local-memory traffic]
## Implementation
```python
# Code snippet
```
## Performance Evidence
Source type: [local experiment / external report]
| Config | Before | After | Change |
|--------|--------|-------|--------|
## Generalization
[Device-agnostic takeaway. Mention architecture/device facts only insofar as they sharpen the reusable rule.]
## Pitfalls
- [Known failure modes]
## Interactions
- [How this interacts with other optimizations]
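As a rough illustration (not part of the skill itself), the control flow of workflow steps 1-8 can be sketched in Python. Everything here is a hypothetical stand-in: `profile_agent`, `designer_agent`, and the toy latency model are placeholders for the real Profiler and Kernel Designer sub-agent launches and their output contracts.

```python
from dataclasses import dataclass

@dataclass
class Profile:
    latency_us: float
    bottleneck: str  # e.g. "memory", "compute", or "latency"

def profile_agent(version: str) -> Profile:
    """Stand-in for launching the Profiler sub-agent (steps 2 and 6).
    Toy model: each version iteration shaves latency; real numbers come
    from the profiler's compact summary."""
    iteration = int(version.rsplit("_v", 1)[1])
    return Profile(latency_us=100.0 / iteration, bottleneck="memory")

def designer_agent(kernel: str, current: str, optimizations: list[str]) -> str:
    """Stand-in for the Kernel Designer sub-agent (step 5): creates
    <new_version> from <current_version>."""
    base, n = current.rsplit("_v", 1)
    return f"{base}_v{int(n) + 1}"

def optimize(kernel: str, version: str, target_us: float, max_iters: int = 5) -> str:
    best = profile_agent(version)                        # 1-2. ANALYZE + PROFILE
    for _ in range(max_iters):
        if best.latency_us <= target_us:                 # 7. EVALUATE
            break
        opts = [f"catalog pick for {best.bottleneck}"]   # 3-4. SELECT + STRATEGIZE
        version = designer_agent(kernel, version, opts)  # 5. DESIGN
        best = profile_agent(version)                    # 6. VERIFY
        # 8. LEARN: persist any novel insight under docs/knowledge/ here.
    return version
```

The loop terminates either when the profiled latency meets the target (step 7) or when the iteration budget runs out, mirroring the "target met? done, else back to step 3" rule above.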