vulkan: fix UMA performance by preferring cached host memory and handling non…#23762

Open
winstonma wants to merge 1 commit into ggml-org:master from winstonma:vulkan-uma-cache-optimization

@winstonma
Contributor

@winstonma winstonma commented May 27, 2026

Overview

On UMA devices, the original code could allocate write-combining (WC) memory for device buffers. WC memory is fast for the CPU to write but painfully slow to read back.

In addition, ggml_vk_create_buffer was setting memory_property_flags to the requested flags rather than the flags of the memory that was actually allocated. This PR sets memory_property_flags from the memory type the driver actually selected.
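The difference between requested and actual flags can be illustrated with a minimal sketch. It is not the code from this PR: `Allocation` and `pick_memory_type` are hypothetical names, and the flag constants only mirror the values of Vulkan's `VkMemoryPropertyFlagBits` so that the example compiles without the Vulkan headers. It assumes a preference list is tried in order and that the flags of the chosen memory type are what gets recorded.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <initializer_list>
#include <vector>

// Values mirror VkMemoryPropertyFlagBits; redefined locally for this sketch.
constexpr uint32_t DEVICE_LOCAL  = 0x1;
constexpr uint32_t HOST_VISIBLE  = 0x2;
constexpr uint32_t HOST_COHERENT = 0x4;
constexpr uint32_t HOST_CACHED   = 0x8;

// Hypothetical result of choosing a memory type.
struct Allocation {
    int      type_index;   // -1 means no suitable memory type was found
    uint32_t actual_flags; // flags of the memory type that was really chosen
};

// Try each requested flag set in order of preference. type_bits is the mask
// of usable memory types (as in VkMemoryRequirements::memoryTypeBits).
Allocation pick_memory_type(const std::vector<uint32_t>& type_flags,
                            uint32_t type_bits,
                            std::initializer_list<uint32_t> preferences) {
    assert(type_flags.size() <= 32);
    for (uint32_t wanted : preferences) {
        for (std::size_t i = 0; i < type_flags.size(); ++i) {
            const bool allowed = ((type_bits >> i) & 1u) != 0;
            if (allowed && (type_flags[i] & wanted) == wanted) {
                // Record the flags of the chosen type, not the requested ones.
                return Allocation{ static_cast<int>(i), type_flags[i] };
            }
        }
    }
    return Allocation{ -1, 0 };
}
```

If a caller requests `HOST_VISIBLE | HOST_CACHED` and the matching type is also `HOST_COHERENT`, `actual_flags` carries that extra bit, whereas storing the requested flags would lose it.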

Additional information

I used this benchmark script to test ggml_backend_tensor_set and ggml_backend_tensor_get on my AMD UMA device.

Before the patch:

❯ cmake --build build --target test-vulkan-uma-perf && ./build/bin/test-vulkan-uma-perf
[  0%] Built target cpp-httplib
[  3%] Built target ggml-base
[  3%] Performing build step for 'vulkan-shaders-gen'
[100%] Built target vulkan-shaders-gen
[  3%] Performing install step for 'vulkan-shaders-gen'
-- Up-to-date: /home/winston/Code/llama.cpp/build/Release/./vulkan-shaders-gen
[  4%] Completed 'vulkan-shaders-gen'
[  4%] Built target vulkan-shaders-gen
[ 63%] Built target ggml-vulkan
[ 65%] Built target ggml-cpu
[ 65%] Built target ggml
[ 93%] Built target llama
[ 93%] Built target llama-common-base
[100%] Built target llama-common
[100%] Built target test-vulkan-uma-perf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon 880M Graphics (RADV STRIX1) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat

Vulkan UMA Transfer Performance Matrix
=============================================
Size (B)       Set BW (GB/s)  Get BW (GB/s)  
---------------------------------------------
2048           0.03           0.05           
4096           0.09           0.10           
8192           0.19           0.20           
16384          0.40           0.41           
32768          0.70           0.79           
65536          1.60           1.54           
131072         3.08           2.32           
262144         4.17           3.27           
524288         3.93           4.62           
1048576        7.13           8.35           
2097152        9.01           9.07           
4194304        10.67          12.03          
8388608        12.39          12.79          
16777216       13.22          13.34          
33554432       12.55          13.69          
67108864       13.44          13.83          
134217728      13.63          14.17          
268435456      14.34          14.44          
536870912      14.52          14.97          
1073741824     14.52          14.97          
=============================================

After the patch:

❯ cmake --build build --target test-vulkan-uma-perf && ./build/bin/test-vulkan-uma-perf
[  0%] Built target cpp-httplib
[  3%] Built target ggml-base
[  3%] Performing build step for 'vulkan-shaders-gen'
[100%] Built target vulkan-shaders-gen
[  3%] Performing install step for 'vulkan-shaders-gen'
-- Up-to-date: /home/winston/Code/llama.cpp/build/Release/./vulkan-shaders-gen
[  4%] Completed 'vulkan-shaders-gen'
[  4%] Built target vulkan-shaders-gen
[ 62%] Built target ggml-vulkan
[ 64%] Built target ggml-cpu
[ 64%] Built target ggml
[ 92%] Built target llama
[ 93%] Built target llama-common-base
[100%] Built target llama-common
[100%] Built target test-vulkan-uma-perf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon 880M Graphics (RADV STRIX1) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
ggml_vulkan: Allocated buffer of size 2048 using memory type 5 flags { HostVisible | HostCoherent | HostCached } (import=0)
ggml_vulkan: Allocated buffer of size 4096 using memory type 5 flags { HostVisible | HostCoherent | HostCached } (import=0)
ggml_vulkan: Allocated buffer of size 8192 using memory type 5 flags { HostVisible | HostCoherent | HostCached } (import=0)
ggml_vulkan: Allocated buffer of size 16384 using memory type 5 flags { HostVisible | HostCoherent | HostCached } (import=0)
ggml_vulkan: Allocated buffer of size 32768 using memory type 5 flags { HostVisible | HostCoherent | HostCached } (import=0)
ggml_vulkan: Allocated buffer of size 65536 using memory type 5 flags { HostVisible | HostCoherent | HostCached } (import=0)
ggml_vulkan: Allocated buffer of size 131072 using memory type 5 flags { HostVisible | HostCoherent | HostCached } (import=0)
ggml_vulkan: Allocated buffer of size 262144 using memory type 5 flags { HostVisible | HostCoherent | HostCached } (import=0)
ggml_vulkan: Allocated buffer of size 524288 using memory type 5 flags { HostVisible | HostCoherent | HostCached } (import=0)
ggml_vulkan: Allocated buffer of size 1048576 using memory type 5 flags { HostVisible | HostCoherent | HostCached } (import=0)
ggml_vulkan: Allocated buffer of size 2097152 using memory type 5 flags { HostVisible | HostCoherent | HostCached } (import=0)
ggml_vulkan: Allocated buffer of size 4194304 using memory type 5 flags { HostVisible | HostCoherent | HostCached } (import=0)
ggml_vulkan: Allocated buffer of size 8388608 using memory type 5 flags { HostVisible | HostCoherent | HostCached } (import=0)
ggml_vulkan: Allocated buffer of size 16777216 using memory type 5 flags { HostVisible | HostCoherent | HostCached } (import=0)
ggml_vulkan: Allocated buffer of size 33554432 using memory type 5 flags { HostVisible | HostCoherent | HostCached } (import=0)
ggml_vulkan: Allocated buffer of size 67108864 using memory type 5 flags { HostVisible | HostCoherent | HostCached } (import=0)
ggml_vulkan: Allocated buffer of size 134217728 using memory type 5 flags { HostVisible | HostCoherent | HostCached } (import=0)
ggml_vulkan: Allocated buffer of size 268435456 using memory type 5 flags { HostVisible | HostCoherent | HostCached } (import=0)
ggml_vulkan: Allocated buffer of size 536870912 using memory type 5 flags { HostVisible | HostCoherent | HostCached } (import=0)
ggml_vulkan: Allocated buffer of size 1073741824 using memory type 5 flags { HostVisible | HostCoherent | HostCached } (import=0)

Vulkan UMA Transfer Performance Matrix
=============================================
Size (B)       Set BW (GB/s)  Get BW (GB/s)  
---------------------------------------------
2048           9.29           25.45          
4096           37.47          64.55          
8192           29.36          89.59          
16384          36.84          108.79         
32768          42.76          47.14          
65536          38.88          44.74          
131072         41.56          45.46          
262144         44.48          48.03          
524288         36.47          40.33          
1048576        28.50          30.79          
2097152        25.10          28.33          
4194304        31.85          35.26          
8388608        22.62          30.74          
16777216       21.06          24.17          
33554432       21.99          23.16          
67108864       20.90          21.50          
134217728      22.71          22.23          
268435456      21.92          22.23          
536870912      22.21          21.94          
1073741824     22.21          22.09          
=============================================
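For readers who want to reproduce figures of this kind, the following is a minimal sketch of how a GB/s value can be derived from a timed copy loop. It is not the benchmark script used above: the function names are hypothetical, the copy is a plain `memcpy` between ordinary heap buffers rather than a ggml_backend_tensor_set call, and it assumes 1 GB = 1e9 bytes, which may differ from the script's convention.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

// Convert a byte count and elapsed time to GB/s (assumes 1 GB = 1e9 bytes).
double gb_per_s(std::size_t bytes, std::size_t iterations, double seconds) {
    return static_cast<double>(bytes) * static_cast<double>(iterations) / seconds / 1e9;
}

// Time `iterations` copies of `bytes` bytes between two host buffers.
// Plain heap memory only: this shows the timing method, not the behavior
// of mapped Vulkan memory.
double measure_memcpy_gb_per_s(std::size_t bytes, std::size_t iterations) {
    assert(bytes > 0);
    std::vector<unsigned char> src(bytes, 0xAB), dst(bytes, 0);
    volatile unsigned char sink = 0;
    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        std::memcpy(dst.data(), src.data(), bytes);
        sink = dst[i % bytes]; // read back so the copy is less likely to be elided
    }
    const auto stop = std::chrono::steady_clock::now();
    (void)sink;
    const double seconds = std::chrono::duration<double>(stop - start).count();
    return gb_per_s(bytes, iterations, seconds);
}
```

Small sizes are dominated by per-call overhead and timer resolution, so many iterations per size are needed before the numbers stabilize.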

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Yes, for finding and implementing the fix and for creating the benchmark test

@winstonma winstonma requested a review from a team as a code owner May 27, 2026 06:26
@winstonma
Contributor Author

winstonma commented May 27, 2026

It should supersede #22930.

The results from the benchmark script show that memory allocated with the Coherent flag is write-combining, which hurts read speed.

❯ wget https://gist.githubusercontent.com/winstonma/86f3cfac104e6abbcef4ae534eacdc84/raw/8d86474aa9d5394c8f66e578b015f57143af516b/uma_benchmark.cpp && g++ -O3 uma_benchmark.cpp -o uma_benchmark -lvulkan -mavx && ./uma_benchmark
--2026-05-27 11:00:42--  https://gist.githubusercontent.com/winstonma/86f3cfac104e6abbcef4ae534eacdc84/raw/8d86474aa9d5394c8f66e578b015f57143af516b/uma_benchmark.cpp
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
Resolving gist.githubusercontent.com (gist.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to gist.githubusercontent.com (gist.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12368 (12K) [text/plain]
Saving to: ‘uma_benchmark.cpp’

uma_benchmark.cpp                         100%[=====================================================================================>]  12.08K  --.-KB/s    in 0.1s    

2026-05-27 11:00:43 (99.3 KB/s) - ‘uma_benchmark.cpp’ saved [12368/12368]

Running Vulkan Benchmark...
Vulkan Device: AMD Radeon 880M Graphics (RADV STRIX1)

Size      --------- Vulkan (Cached) ---------        --------- Vulkan (Coherent) -------        
          Write BW   Read BW    S-Write    Raw BW    Write BW   Read BW    S-Write    Raw BW    
--------------------------------------------------------------------------------------
2 KB      32.55      38.15      31.67      32.87     21.94      0.18       32.04      35.60     
4 KB      49.46      54.38      42.62      48.06     41.28      0.18       42.97      50.96     
8 KB      62.17      62.85      42.44      59.69     56.95      0.17       42.96      64.03     
16 KB     73.93      75.12      52.70      71.49     59.60      0.18       53.11      76.15     
32 KB     41.16      41.88      58.16      42.70     55.21      0.17       55.76      42.22     
64 KB     41.03      47.85      52.56      35.57     50.64      0.34       52.25      41.48     
128 KB    41.29      34.38      50.94      38.32     49.64      0.20       50.84      46.19     
256 KB    107.90     75.51      57.56      107.09    58.28      0.20       58.22      107.70    
512 KB    80.02      91.38      58.13      79.91     58.11      0.20       58.14      88.85     
1 MB      30.40      31.57      47.95      33.39     46.72      0.20       48.63      30.52     
2 MB      66.30      61.88      57.93      71.60     57.78      0.20       57.85      60.00     
4 MB      44.35      45.57      57.64      38.90     57.35      0.19       57.85      40.05     
8 MB      33.92      48.01      39.30      32.15     30.91      0.18       33.11      29.51     
16 MB     23.54      22.94      27.87      23.93     23.29      0.18       27.94      22.87     
32 MB     23.15      22.58      29.19      23.32     24.37      0.18       30.45      23.08     

@winstonma winstonma changed the title fix UMA performance by preferring cached host memory and handling non… vulkan: fix UMA performance by preferring cached host memory and handling non… May 27, 2026
@github-actions github-actions Bot added Vulkan Issues specific to the Vulkan backend ggml changes relating to the ggml tensor library for machine learning labels May 27, 2026
@jeffbolznv
Contributor

This needs to be justified by real model performance, not a synthetic test. We shouldn't be reading from write combined memory.

@winstonma
Contributor Author

winstonma commented May 28, 2026

> This needs to be justified by real model performance, not a synthetic test. We shouldn't be reading from write combined memory.

After adding logging to ggml_backend_tensor_get, I confirmed its behavior is correct, meaning we can safely overlook potential gains from Get bandwidth (BW). Conversely, ggml_backend_tensor_set is called frequently.

Shifting focus to Set BW, the frequency distribution below (from a single llama-bench run on Gemma4-26B) shows a high volume of tiny writes (<16KB):

[image: frequency distribution of ggml_backend_tensor_set write sizes]
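A distribution like this can be collected by bucketing each transfer size by power of two. The sketch below is an assumption about how such logging might be done, not the instrumentation actually used for the image; `SizeHistogram` and its members are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: counts transfer sizes in power-of-two buckets.
// Bucket k holds sizes in [2^k, 2^(k+1)); sizes 0 and 1 both land in bucket 0.
struct SizeHistogram {
    std::array<uint64_t, 64> buckets{};

    static int bucket_of(std::size_t size) {
        int k = 0;
        while (size > 1) {
            size >>= 1;
            ++k;
        }
        return k;
    }

    void record(std::size_t size) { ++buckets[bucket_of(size)]; }

    // Number of recorded transfers strictly smaller than `limit`.
    // `limit` must be a power of two so it falls on a bucket boundary.
    uint64_t count_below(std::size_t limit) const {
        assert(limit != 0 && (limit & (limit - 1)) == 0);
        uint64_t total = 0;
        for (int k = 0; k < bucket_of(limit); ++k) {
            total += buckets[k];
        }
        return total;
    }
};
```

Calling `record(size)` from a wrapper around each set call and then `count_below(16384)` gives the number of writes under 16 KB.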

Comparing the Set BW benchmark results reveals a massive throughput improvement for small allocations:

Size (B)       Baseline (GB/s)  Patched (GB/s)
2048           0.03             9.29
4096           0.09             37.47
8192           0.19             29.36
16384          0.40             36.84

While this micro-optimization may not translate to a massive real-world wall-clock difference (perhaps saving 0.1 seconds over a full benchmark run), optimizing these frequent, small set operations is still highly beneficial for eliminating overhead.
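The wall-clock effect can be estimated from the bandwidth figures with simple arithmetic. The sketch below assumes that the benchmark bandwidth applies to every call in a real run and that 1 GB = 1e9 bytes; both are assumptions, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Time for one transfer in microseconds, assuming 1 GB = 1e9 bytes.
double transfer_time_us(std::size_t bytes, double gb_per_s) {
    assert(gb_per_s > 0.0);
    return static_cast<double>(bytes) / (gb_per_s * 1e9) * 1e6;
}

// Total seconds saved over `calls` transfers of the same size.
double seconds_saved(std::size_t bytes, double baseline_gb_per_s,
                     double patched_gb_per_s, std::size_t calls) {
    const double per_call_us = transfer_time_us(bytes, baseline_gb_per_s)
                             - transfer_time_us(bytes, patched_gb_per_s);
    return per_call_us * static_cast<double>(calls) / 1e6;
}
```

For the 2048-byte row, this gives roughly 68 µs per call at 0.03 GB/s versus about 0.22 µs at 9.29 GB/s. With a purely illustrative count of 1,000 such calls, `seconds_saved(2048, 0.03, 9.29, 1000)` comes to about 0.068 seconds; the real call counts would have to be taken from the logged distribution.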

EDIT: After adding logging to ggml_backend_tensor_get_async, I found that it is called 647 times. This is only a note, because I didn't benchmark ggml_backend_tensor_get_async.

EDIT2: I ran llama-bench again with -ngl 0, and there were 1019 calls to ggml_backend_tensor_get.

@0cc4m
Contributor

0cc4m commented May 28, 2026

That doesn't change anything. As far as I know, cached memory is supposed to be used for outputs, where the GPU calculates something that gets transferred to the CPU afterwards. It is possible that using it everywhere reduces GPU access performance in general. Additionally, you tested only a single iGPU, and one that doesn't even have a non-coherent cached memory type.
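Whether a device exposes a non-coherent cached memory type can be checked by scanning its memory type flags. The following is a minimal sketch with a hypothetical function name; the constants mirror the values of `VkMemoryPropertyFlagBits` and are redefined locally so the example compiles without the Vulkan headers. In real code the flags would come from `vkGetPhysicalDeviceMemoryProperties`.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Values mirror VkMemoryPropertyFlagBits; redefined locally for this sketch.
constexpr uint32_t HOST_VISIBLE  = 0x2;
constexpr uint32_t HOST_COHERENT = 0x4;
constexpr uint32_t HOST_CACHED   = 0x8;

// True if any memory type is host-visible and cached but not host-coherent.
bool has_noncoherent_cached_type(const std::vector<uint32_t>& type_flags) {
    assert(type_flags.size() <= 32); // Vulkan allows at most 32 memory types
    for (uint32_t flags : type_flags) {
        const bool visible_cached =
            (flags & (HOST_VISIBLE | HOST_CACHED)) == (HOST_VISIBLE | HOST_CACHED);
        const bool coherent = (flags & HOST_COHERENT) != 0;
        if (visible_cached && !coherent) {
            return true;
        }
    }
    return false;
}
```

On the device logged above, the cached type is reported as HostVisible | HostCoherent | HostCached, so a check like this would return false for that type. On devices where it returns true, using such memory requires explicit `vkFlushMappedMemoryRanges` and `vkInvalidateMappedMemoryRanges` calls around host access.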

@winstonma
Contributor Author

Thanks, that makes sense. I agree the current change could be too broad for the evidence I have so far. The memory_property_flags fix should still be independently correct, since the buffer should record what was actually allocated rather than what was originally requested, but the UMA allocation policy change is a different question.

Right now the PR changes the default buffer preference for UMA devices in a way that could help my small-write case while still risking regressions for general GPU access or for other devices and workloads that I couldn't test. I totally agree that the change needs validation on more hardware.

In addition, I'm not trying to claim that my response above describes the generally correct policy for all Vulkan UMA devices. I'm offering it as a counterpoint to the current direction, based on one observed workload and device, to show there may be cases where the opposite tradeoff is preferable.

If the memory_property_flags fix makes sense to you, then I think it is better to split it out into a separate PR.

Labels

ggml: changes relating to the ggml tensor library for machine learning
Vulkan: Issues specific to the Vulkan backend


3 participants