
Conversation

0cc4m (Collaborator) commented on Dec 22, 2025

See #17715
Supersedes #18033

This is based on some guesswork on my part, since I'm not that familiar with the Vulkan Flash Attention code yet. Let me know if this makes sense. Performance looks alright.
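(For context: the before/after columns below follow the usual llama-bench comparison format; numbers like these are typically collected with an invocation along the lines of `llama-bench -m <model.gguf> -ngl 99 -fa 1 -p 512 -n 128 -d 0,8192`, where the depth parameter corresponds to the `@ d8192` entries in the test column. The exact command is an assumption on the editor's part, not stated in the PR.)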

AMD Radeon 8060S

With coopmat:

| model | size | params | ngl | fa | test | t/s (before) | t/s (after) | diff |
| --- | ---: | ---: | --: | --: | --- | ---: | ---: | ---: |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | pp512 | 488.62 ± 5.65 | 487.46 ± 4.62 | -0.2% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | tg128 | 43.67 ± 0.17 | 43.29 ± 0.18 | -0.9% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | pp512 @ d8192 | 266.87 ± 2.25 | 269.40 ± 0.56 | +0.9% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | tg128 @ d8192 | 33.05 ± 0.10 | 32.99 ± 0.13 | -0.2% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | pp512 | 627.26 ± 2.75 | 631.85 ± 20.90 | +0.7% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | tg128 | 45.44 ± 0.24 | 45.47 ± 0.05 | +0.1% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | pp512 @ d8192 | 306.56 ± 1.99 | 309.79 ± 2.08 | +1.1% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | tg128 @ d8192 | 33.93 ± 0.15 | 34.50 ± 0.14 | +1.7% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 1 | pp512 | 1261.92 ± 70.65 | 1322.13 ± 30.72 | +4.8% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 1 | tg128 | 117.06 ± 0.26 | 117.71 ± 0.34 | +0.6% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 1 | pp512 @ d8192 | 1102.37 ± 24.31 | 1136.17 ± 9.01 | +3.1% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 1 | tg128 @ d8192 | 110.35 ± 1.02 | 111.08 ± 0.94 | +0.7% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 | 611.04 ± 1.81 | 633.09 ± 14.23 | +3.6% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 | 94.54 ± 1.12 | 101.33 ± 0.67 | +7.2% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 @ d8192 | 277.39 ± 1.15 | 284.11 ± 4.44 | +2.4% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 @ d8192 | 62.02 ± 0.21 | 62.65 ± 0.18 | +1.0% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | pp512 | 956.45 ± 43.81 | 990.88 ± 22.63 | +3.6% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | tg128 | 74.06 ± 0.52 | 75.84 ± 0.15 | +2.4% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | pp512 @ d8192 | 635.69 ± 7.64 | 667.39 ± 4.65 | +5.0% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | tg128 @ d8192 | 64.86 ± 0.82 | 66.59 ± 0.25 | +2.7% |

Without coopmat:

| model | size | params | ngl | fa | test | t/s (before) | t/s (after) | diff |
| --- | ---: | ---: | --: | --: | --- | ---: | ---: | ---: |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | pp512 | 743.42 ± 17.17 | 812.31 ± 7.25 | +9.3% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | tg128 | 43.37 ± 0.13 | 44.12 ± 0.19 | +1.7% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | pp512 @ d8192 | 166.39 ± 0.82 | 167.35 ± 0.25 | +0.6% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | tg128 @ d8192 | 32.51 ± 0.13 | 32.79 ± 0.04 | +0.9% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | pp512 | 813.98 ± 16.18 | 868.31 ± 13.72 | +6.7% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | tg128 | 45.00 ± 0.04 | 46.18 ± 0.11 | +2.6% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | pp512 @ d8192 | 169.09 ± 1.41 | 170.72 ± 0.31 | +1.0% |
| llama 8B Q4_0 | 4.33 GiB | 8.03 B | 99 | 1 | tg128 @ d8192 | 33.14 ± 0.07 | 33.72 ± 0.09 | +1.8% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 1 | pp512 | 1666.99 ± 75.09 | 1707.99 ± 45.57 | +2.5% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 1 | tg128 | 116.60 ± 0.34 | 117.58 ± 1.01 | +0.8% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 1 | pp512 @ d8192 | 1329.74 ± 6.17 | 1344.71 ± 20.99 | +1.1% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 1 | tg128 @ d8192 | 110.72 ± 0.87 | 110.19 ± 0.45 | -0.5% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 | 548.95 ± 3.79 | 584.59 ± 13.05 | +6.5% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 | 93.77 ± 0.50 | 101.50 ± 0.36 | +8.2% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 @ d8192 | 172.44 ± 0.44 | 170.28 ± 0.59 | -1.3% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 @ d8192 | 61.79 ± 0.29 | 62.44 ± 0.26 | +1.1% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | pp512 | 1153.83 ± 43.07 | 1166.10 ± 30.29 | +1.1% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | tg128 | 72.02 ± 0.34 | 75.37 ± 0.44 | +4.7% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | pp512 @ d8192 | 671.43 ± 4.57 | 679.51 ± 3.69 | +1.2% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | tg128 @ d8192 | 67.25 ± 0.64 | 68.46 ± 0.34 | +1.8% |
AMD Radeon Pro VII

| model | size | params | ngl | fa | test | t/s (before) | t/s (after) | diff |
| --- | ---: | ---: | --: | --: | --- | ---: | ---: | ---: |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | pp512 | 680.94 ± 1.48 | 665.84 ± 0.73 | -2.2% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | tg128 | 76.77 ± 0.23 | 80.15 ± 0.27 | +4.4% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | pp512 @ d8192 | 168.38 ± 0.23 | 167.81 ± 0.31 | -0.3% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | tg128 @ d8192 | 51.26 ± 0.07 | 50.91 ± 0.01 | -0.7% |
| llama 13B Q5_K - Small | 15.18 GiB | 23.57 B | 99 | 1 | pp512 | 184.39 ± 0.23 | 181.04 ± 0.49 | -1.8% |
| llama 13B Q5_K - Small | 15.18 GiB | 23.57 B | 99 | 1 | tg128 | 24.62 ± 0.03 | 24.95 ± 0.02 | +1.3% |
| llama 13B Q5_K - Small | 15.18 GiB | 23.57 B | 99 | 1 | pp512 @ d8192 | 31.45 ± 0.19 | 31.41 ± 0.21 | -0.1% |
| llama 13B Q5_K - Small | 15.18 GiB | 23.57 B | 99 | 1 | tg128 @ d8192 | 11.58 ± 0.00 | 11.55 ± 0.01 | -0.3% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 1 | pp512 | 1364.56 ± 2.24 | 1311.74 ± 2.57 | -3.9% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 1 | tg128 | 116.31 ± 0.19 | 117.18 ± 0.07 | +0.7% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 1 | pp512 @ d8192 | 1077.54 ± 5.86 | 1046.76 ± 6.43 | -2.9% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 1 | tg128 @ d8192 | 110.14 ± 0.23 | 109.77 ± 0.16 | -0.3% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 | 590.86 ± 3.50 | 596.91 ± 2.34 | +1.0% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 | 84.25 ± 0.09 | 96.28 ± 0.07 | +14.3% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 @ d8192 | 162.30 ± 0.18 | 161.83 ± 0.23 | -0.3% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 @ d8192 | 57.34 ± 0.09 | 57.31 ± 0.09 | -0.1% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | pp512 | 1171.81 ± 7.10 | 1191.57 ± 4.13 | +1.7% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | tg128 | 131.52 ± 0.13 | 139.36 ± 0.11 | +6.0% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | pp512 @ d8192 | 558.73 ± 1.25 | 563.02 ± 1.66 | +0.8% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | tg128 @ d8192 | 116.72 ± 0.22 | 119.10 ± 0.23 | +2.0% |
RTX 3090

| model | size | params | ngl | fa | test | t/s (before) | t/s (after) | diff |
| --- | ---: | ---: | --: | --: | --- | ---: | ---: | ---: |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | pp512 | 4863.26 ± 78.80 | 4807.87 ± 19.28 | -1.1% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | tg128 | 129.86 ± 1.15 | 127.44 ± 0.69 | -1.9% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | pp512 @ d8192 | 2821.90 ± 83.73 | 2776.25 ± 84.02 | -1.6% |
| llama 8B Q4_K - Small | 4.36 GiB | 8.03 B | 99 | 1 | tg128 @ d8192 | 108.18 ± 0.05 | 107.21 ± 0.25 | -0.9% |
| llama 13B Q5_K - Small | 15.18 GiB | 23.57 B | 99 | 1 | pp512 | 1656.30 ± 7.07 | 1623.96 ± 3.64 | -2.0% |
| llama 13B Q5_K - Small | 15.18 GiB | 23.57 B | 99 | 1 | tg128 | 43.32 ± 0.13 | 42.82 ± 0.07 | -1.2% |
| llama 13B Q5_K - Small | 15.18 GiB | 23.57 B | 99 | 1 | pp512 @ d8192 | 1274.57 ± 14.86 | 1260.11 ± 13.05 | -1.1% |
| llama 13B Q5_K - Small | 15.18 GiB | 23.57 B | 99 | 1 | tg128 @ d8192 | 39.42 ± 0.05 | 38.97 ± 0.04 | -1.1% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 1 | pp512 | 4538.43 ± 24.11 | 4525.94 ± 16.30 | -0.3% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 1 | tg128 | 223.54 ± 0.46 | 221.74 ± 0.70 | -0.8% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 1 | pp512 @ d8192 | 3360.39 ± 92.82 | 3391.57 ± 117.26 | +0.9% |
| granitehybrid 1B Q4_K - Small | 3.75 GiB | 6.94 B | 99 | 1 | tg128 @ d8192 | 206.29 ± 1.05 | 203.94 ± 0.89 | -1.1% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 | 2929.01 ± 6.71 | 2925.94 ± 33.82 | -0.1% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 | 184.45 ± 0.72 | 183.88 ± 0.88 | -0.3% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | pp512 @ d8192 | 1885.31 ± 18.20 | 1869.33 ± 14.37 | -0.8% |
| qwen3moe 30B.A3B Q2_K - Medium | 10.48 GiB | 30.53 B | 99 | 1 | tg128 @ d8192 | 143.63 ± 0.32 | 143.19 ± 0.34 | -0.3% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | pp512 | 4070.55 ± 17.57 | 4067.58 ± 41.42 | -0.1% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | tg128 | 183.32 ± 0.73 | 182.58 ± 0.53 | -0.4% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | pp512 @ d8192 | 2926.57 ± 121.41 | 2898.52 ± 119.52 | -1.0% |
| gpt-oss 20B Q8_0 | 11.27 GiB | 20.91 B | 99 | 1 | tg128 @ d8192 | 157.61 ± 0.86 | 157.41 ± 1.05 | -0.1% |

jeffbolznv (Collaborator) commented:
I think this is OK; for smaller KV sizes the memory reuse from having more rows is less important, and this may spread the work out over more workgroups.
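For readers less familiar with the scheduling trade-off jeffbolznv is describing, here is a minimal C++ sketch of the idea, assuming the tuning adjusts how many query rows each flash-attention workgroup processes based on the KV length. Every name, value, and threshold below is a placeholder for illustration; this is not the PR's actual code.

```cpp
#include <cstdint>

// More query rows per workgroup means the K/V tiles a workgroup loads are reused
// across more rows (helps at large KV lengths), but it also launches fewer
// workgroups, which hurts GPU occupancy when the KV range is short.
static uint32_t pick_query_rows_per_workgroup(uint32_t kv_len) {
    if (kv_len <= 4096) {   // threshold is a placeholder, not taken from the PR
        return 16;          // small KV: favor more, smaller workgroups
    }
    return 32;              // large KV: favor K/V reuse within each workgroup
}

static uint32_t flash_attn_workgroup_count(uint32_t n_query_rows, uint32_t kv_len) {
    const uint32_t rows = pick_query_rows_per_workgroup(kv_len);
    return (n_query_rows + rows - 1) / rows;   // ceil(n_query_rows / rows)
}
```

The point is simply that a smaller row count turns a given number of query rows into more workgroups, which matters most when the KV range is short and K/V reuse buys little.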
