- sparse_mode == 0 (full attention):
$$
\text{FLOPs} = \text{full_attention}
$$
- sparse_mode == 2 or 3, and $q_s == k_s$ (causal or similar, equal sequence lengths):
$$
\text{FLOPs} = \text{full_attention} \times 0.5
$$
- sparse_mode == 2, and $q_s > k_s$ (causal, longer query):
$$
\text{FLOPs} = \text{full_attention} \times \frac{q_s \times k_s - k_s \times k_s / 2}{q_s \times k_s}
$$
- sparse_mode == 3, and $q_s > k_s$ (causal mask aligned to the bottom-right corner, longer query):
$$
\text{FLOPs} = \text{full_attention} \times \frac{k_s \times k_s / 2}{q_s \times k_s}
$$
- sparse_mode == 2, and $q_s < k_s$ (causal, shorter query):
$$
\text{FLOPs} = \text{full_attention} \times \frac{q_s \times q_s / 2}{q_s \times k_s}
$$
- sparse_mode == 3, and $q_s < k_s$ (causal mask aligned to the bottom-right corner, shorter query):
$$
\text{FLOPs} = \text{full_attention} \times \frac{q_s \times k_s - q_s \times q_s / 2}{q_s \times k_s}
$$
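
The case analysis above can be sketched as a small helper. This is a minimal illustration, not the reference implementation: the function names `sparse_flops_factor` and `sparse_attention_flops` are hypothetical, and the sketch assumes every branch is normalized by the full score-matrix area $q_s \times k_s$, so the returned factor is the fraction of the matrix that is actually computed.

```python
def sparse_flops_factor(sparse_mode: int, q_s: int, k_s: int) -> float:
    """Return the fraction of the q_s x k_s score matrix that is computed.

    Hypothetical helper: only sparse_mode 0, 2 and 3 are handled, matching
    the cases listed above. Other modes raise ValueError.
    """
    if q_s <= 0 or k_s <= 0:
        raise ValueError("q_s and k_s must be positive")
    if sparse_mode == 0:
        return 1.0  # full attention: nothing is masked
    if sparse_mode not in (2, 3):
        raise ValueError(f"unsupported sparse_mode: {sparse_mode}")
    if q_s == k_s:
        return 0.5  # square causal mask keeps about half of the entries

    full_area = q_s * k_s  # assumption: common denominator for all branches
    if q_s > k_s:
        if sparse_mode == 2:
            kept = q_s * k_s - k_s * k_s / 2
        else:  # sparse_mode == 3
            kept = k_s * k_s / 2
    else:  # q_s < k_s
        if sparse_mode == 2:
            kept = q_s * q_s / 2
        else:  # sparse_mode == 3
            kept = q_s * k_s - q_s * q_s / 2
    return kept / full_area


def sparse_attention_flops(full_attention: float, sparse_mode: int,
                           q_s: int, k_s: int) -> float:
    """Scale the full-attention FLOPs count by the sparsity factor."""
    return full_attention * sparse_flops_factor(sparse_mode, q_s, k_s)
```

For example, with $q_s = 8$ and $k_s = 4$ the factor is 0.75 for sparse_mode 2 and 0.25 for sparse_mode 3. Whenever $q_s \neq k_s$, the two factors sum to 1, since the two formulas for a given shape add up to $q_s \times k_s$.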