MLA (Multi-Latent Attention) cost models, regime analysis, and kernel selection guide. Use when: (1) reasoning about which kernel approach to use for a given regime, (2) understanding cost model tradeoffs between FlashMLA, FlashAttention, and MLAvar6+, (3) analyzing roofline behavior across decode/speculative/prefill regimes, (4) setting optimization targets, (5) understanding MLA math and absorption trick.
Install: `npx skill4agent add pepperu96/hyper-mla mla-analysis`

Kernel selection by regime (a selector sketch follows the table):

| Regime | s Range | Best Kernel | Why |
|---|---|---|---|
| Decode | s=1 | FlashMLA | 16x latency reduction vs FlashAttention (compressed KV) |
| Speculative | s=2-32 | MLAvar6+ or FlashMLA | MLAvar6+ should outperform both FlashMLA and FlashAttention in this regime |
| Prefill | s>128 | FlashAttention | Avoids 4x FLOP penalty of latent-space compute |
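A minimal selector sketch implementing the table above; the function name, kernel identifiers, and fallback behavior are illustrative assumptions, not the repo's API:

```python
def select_kernel(s: int) -> str:
    """Pick an attention kernel from query length s (hypothetical helper).

    Regimes follow the table above: decode (s=1), speculative (s=2-32),
    prefill (s>128). The 32 < s <= 128 gap is not covered by the table,
    so profile both candidates there.
    """
    if s == 1:
        return "flash_mla"        # compressed KV gives the big decode win
    if s <= 32:
        return "mla_var6"         # speculative regime; fall back to flash_mla if slower
    if s > 128:
        return "flash_attention"  # avoids the 4x latent-space FLOP penalty
    return "flash_mla"            # uncovered middle regime: measure before trusting
```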
Cost models (FLOPs and bytes moved per attention call; symbols and defaults are tabulated below):

| Kernel | FLOPs | Bytes moved |
|---|---|---|
| FlashAttention | 2bhst(2d + p) = 2bhst * 320 | w * bh(s+t)(2d + p) = w * bh(s+t) * 320 |
| FlashMLA | 2bhst(2k + p) = 2bhst * 1088 | w * (bhs(2k+p) + bt(k+p)) |
| MLAvar6+ | 2bhstp + 4bhsnd + 4bhsok | w * (bhsp + bhsd + bhsk + bok + btp + 2bhnd + bhsd + bhsk) |

(n and o are MLAvar6+-specific dimensions not listed in the symbol table; see docs/cost-analysis.md.)
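A sketch that plugs the defaults into the two closed-form models above and reports arithmetic intensity (FLOPs/byte). The function name and w=2 (bf16) are assumptions; the MLAvar6+ model is omitted because its n and o dimensions are not defined here:

```python
def attn_costs(b=64, h=128, s=1, t=4096, d=128, p=64, k=512, w=2):
    """Evaluate the cost models above; w is bytes per element (assumed bf16)."""
    models = {
        "FlashAttention": (
            2 * b * h * s * t * (2 * d + p),                   # FLOPs: 2bhst * 320 at defaults
            w * b * h * (s + t) * (2 * d + p),                 # bytes: per-head KV dominates
        ),
        "FlashMLA": (
            2 * b * h * s * t * (2 * k + p),                   # FLOPs: 2bhst * 1088 at defaults
            w * (b * h * s * (2 * k + p) + b * t * (k + p)),   # latent cache shared across heads
        ),
    }
    for name, (flops, nbytes) in models.items():
        print(f"{name:>15}: {flops:.2e} FLOPs, {nbytes:.2e} B, "
              f"intensity {flops / nbytes:6.1f} FLOPs/B")

attn_costs(s=1)   # decode: FlashMLA moves far fewer bytes despite ~3.4x the FLOPs
attn_costs(s=16)  # speculative: the gap narrows, which is where MLAvar6+ aims
```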
Absorption trick (fold the up-projection weights into the query and output sides so attention runs directly against the latent cache Z):

Score: Q @ K^T = Q @ (Z @ Wk)^T = (Q @ Wk^T) @ Z^T = Qz @ Z^T
Value: softmax(A) @ V = softmax(A) @ (Z @ Wv) = (softmax(A) @ Z) @ Wv
Output: O_latent @ W_kvb2^T
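A numpy check of the score and value absorption identities above, with single-head shapes for brevity; all variable names are illustrative:

```python
import numpy as np

s, t, k, d = 4, 16, 512, 128          # query len, KV len, latent dim, head dim
rng = np.random.default_rng(0)
Q  = rng.standard_normal((s, d))
Z  = rng.standard_normal((t, k))      # compressed (latent) KV cache
Wk = rng.standard_normal((k, d))      # key up-projection
Wv = rng.standard_normal((k, d))      # value up-projection

# Score: Q @ (Z @ Wk)^T == (Q @ Wk^T) @ Z^T -- absorb Wk into Q once,
# then attend directly against the latent cache Z.
naive    = Q @ (Z @ Wk).T
absorbed = (Q @ Wk.T) @ Z.T
assert np.allclose(naive, absorbed)

# Value: softmax(A) @ (Z @ Wv) == (softmax(A) @ Z) @ Wv -- aggregate in
# latent space, then up-project the s x k latent output once.
A = absorbed / np.sqrt(d)
P = np.exp(A - A.max(axis=-1, keepdims=True))
P /= P.sum(axis=-1, keepdims=True)
assert np.allclose(P @ (Z @ Wv), (P @ Z) @ Wv)
```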
Note: the timings below are from RTX 5090 development. Reprofile on the current device before using them as optimization targets; ridge point, bandwidth, and compute ceilings differ across devices (see src/mla_var3/conf/devices.json, and the roofline sketch after the table).

| Config | FlashMLA | FlashAttention | MLAvar6+ V3 (best) |
|---|---|---|---|
| s=1 | 419 μs | 6,781 μs | 829 μs (V2) |
| s=16 | 5,161 μs | 6,727 μs | 4,444 μs |
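A sketch of the roofline check the note above implies, reading per-device ceilings from src/mla_var3/conf/devices.json; the JSON schema (device names, `peak_flops`, `bandwidth_bytes` keys) is purely an assumption:

```python
import json

def classify(flops: float, bytes_moved: float, device: str = "rtx5090",
             conf_path: str = "src/mla_var3/conf/devices.json") -> str:
    """Classify a kernel as memory- or compute-bound via the device ridge point.

    Assumes devices.json maps a device name to peak compute (FLOP/s) and
    bandwidth (B/s); the key names below are hypothetical.
    """
    with open(conf_path) as f:
        dev = json.load(f)[device]
    ridge = dev["peak_flops"] / dev["bandwidth_bytes"]  # FLOPs/byte at the roofline knee
    intensity = flops / bytes_moved
    return "compute-bound" if intensity > ridge else "memory-bound"
```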
Cost model symbols:

| Symbol | Description | Default |
|---|---|---|
| b | Batch size | 64 |
| h | Number of heads | 128 |
| s | Query sequence length | varies |
| t | KV context length | 4096 |
| d | Head dimension | 128 |
| p | Positional embedding dim | 64 |
| k | Latent (compressed) dim | 512 |
| w | Bytes per element | 2 (bf16/fp16) |
Tensor shapes (from the symbols above): Q [B, H, S, D]; latent Q after absorption [B, H, S, K]; decompressed K/V [B, T, H, D]; compressed KV cache [B, T, K]; positional Q [B, H, S, P]; positional K [B, T, P].

See docs/cost-analysis.md for the full cost analysis.