UMA buffers prefer host-visible memory by winstonma · Pull Request #22930 · ggml-org/llama.cpp

winstonma · 2026-05-11T01:54:17Z

Overview

On UMA (Unified Memory Architecture) devices the CPU and GPU share the same physical memory. Despite this, the Vulkan backend was still allocating device-local buffers without the eHostVisible flag, which prevented the CPU from directly accessing GPU tensor data. This meant that even on hardware where a zero-copy path was physically possible, the backend was forced to go through an unnecessary staging copy whenever tensor data needed to be read back to the host (e.g. during loss evaluation or prediction readback in training). The UMA zero copy was not implemented in this PR.

Additional information

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: YES, to AI to find out the zero copy for the UMA path

ggml-gh-bot · 2026-05-11T01:58:16Z

Hi @winstonma, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

Multiple open PRs from a new contributor: We limit new contributors (those without a previously merged PR) to 1 open PR at a time. You currently have 3 open PRs.

Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

jeffbolznv · 2026-05-11T14:24:32Z

Which buffers need this? What performance benefits do you see? I think this function is generally used for internal allocations that are not transferred to/from.

winstonma · 2026-05-12T03:40:11Z

After further testing, I determined that the necessary logic could be handled more effectively on the other side. Consequently, I have replaced the previous commit with a refined implementation.

Context & Investigation

While implementing zero-copy support for the UMA path (aiming for parity with the Metal backend), I developed a test script to evaluate performance gains. Initial benchmark results were underwhelming; investigation revealed that the root cause was a memory allocation issue that prevented zero-copy from functioning correctly. Therefore this commit is being submitted.

Benchmarks

Testing conducted is testing latency of moving data from the backend (GPU/Vulkan) back to the host (CPU) over 1 million time on AMD Radeon 880M Graphics:

Before: 36.888 s
After: 0.327 s

And the uncommitted code change would improve the speed to ~0.2 s

Lastly since it is not on the hot memory read/write path the pp and tg performance won't be improved.

0cc4m · 2026-05-12T05:42:51Z

Even for UMA systems, preferring device-local is still correct. It may still be "easier" to access for the GPU, go through less intermediate steps, depending on the architecture. Posting random unverifiable numbers is not gonna help your case, we'd need something reproduceable that shows what cases this improves. There are a lot of devices this would affect.

winstonma · 2026-05-12T21:04:53Z

The benchmark results was posted as requested. The performance gain comes from how ggml-vulkan handles eHostVisible memory more efficiently.

There are a lot of devices this would affect.

Would you mind explaining the disadvantage/limitations of falling back to device-local memory?

0cc4m · 2026-05-13T07:47:14Z

Your attitude is wrong, it is not my job to disprove your claims, it is your job to prove them. A benchmark result is not worth anything to me if I cannot reproduce it, and you didn't even share your script.

But you're in luck, I just found an easily measurable performance issue on Nvidia DGX Spark caused by preferring just device-local memory, and this kind of change is needed to fix it.

Co-authored-by: Ruben Ortlam <picard12@live.de>

winstonma · 2026-05-13T09:42:24Z

Your attitude is wrong, it is not my job to disprove your claims, it is your job to prove them. A benchmark result is not worth anything to me if I cannot reproduce it, and you didn't even share your script.

I am sorry that I make you feel this. I could agree that I don't have all the uma architecture so I am not certain, like you mentioned, how the change would affect others. So I would sincerely want to see how the fallback would affect other uma architecture (e.g. driver problem that allow the system to assign eHostVisible but break the downstream?). Since you are the maintainer so I would like to understand your a lot of devices this would affect concern.

Since you would like to reproduce the benchmark situation. Should I commit the cpp file in this branch? Would the cpp file get merged? This is my concern.

Sorry again.

0cc4m · 2026-05-13T11:12:41Z

I could agree that I don't have all the uma architecture so I am not certain, like you mentioned, how the change would affect others. So I would sincerely want to see how the fallback would affect other uma architecture (e.g. driver problem that allow the system to assign eHostVisible but break the downstream?). Since you are the maintainer so I would like to understand your a lot of devices this would affect concern.

Vulkan runs on many devices, so changes like this affect a lot of architectures. Besides your AMD APU, there are older AMD devices with different architectures, Intel Xe iGPUs, Intel UHD iGPUs, Nvidia iGPUs and even phone SoCs that this affects. That's why some care is necessary.

Same as with the other PR, you can upload benchmark code to a Github gist.

0cc4m · 2026-05-13T11:17:22Z

@rillomas That's a lot of failures in the Linux Intel Vulkan CI runner. I don't really see how this change could cause that, can you take a look?

rillomas · 2026-05-14T06:33:09Z

So far I see random unit tests failing with test-backend-ops on f3c217b even when tested manually on a Linux PantherLake. For example when I run test-backend-ops -o POOL_2D the tests failing/passing are random and sometimes all test pass.

0cc4m · 2026-05-14T06:36:50Z

So device-local buffers are reliable, but device-local | host-visible | host-coherent randomly fail? That sounds like a driver bug to me.

rillomas · 2026-05-14T07:29:36Z

Possibly? I also tried the latest Vulkan SDK 1.4.350.0 from LunarG rather than the Ubuntu 24.04.4 default libvulkan-dev (1.3.275.0-1build1), though I still see random failing with tests like test-backend-ops -o SET_ROWS

jeffbolznv · 2026-05-14T12:35:53Z

Can you check if it helps to have eHostRead/eHost in the destination stages when we do pipelinebarrier?

rillomas · 2026-05-15T07:44:04Z

Adding the following sync seems to fix the unit test failures (diagnosed by Claude Opus 4.7 and me)

diff --git a/ggml/src/ggml-vulkan/ggml-vulkan.cpp b/ggml/src/ggml-vulkan/ggml-vulkan.cpp
index 303c5a8d6..7abc1c97e 100644
--- a/ggml/src/ggml-vulkan/ggml-vulkan.cpp
+++ b/ggml/src/ggml-vulkan/ggml-vulkan.cpp
@@ -7189,6 +7189,46 @@ static void ggml_vk_buffer_read_2d(vk_buffer& src, size_t offset, void * dst, si
     if(src->memory_property_flags & vk::MemoryPropertyFlagBits::eHostVisible && src->device->uma) {
         GGML_ASSERT(src->memory_property_flags & vk::MemoryPropertyFlagBits::eHostCoherent);

+        // The UMA fast path bypasses the submit/fence machinery, so we must
+        // explicitly drain any in-flight GPU writes to this buffer and make
+        // them visible to the host before the memcpy below. Without this, a
+        // graph_compute on the support_async path (which does not call
+        // ggml_vk_synchronize) can race with the host read and produce
+        // stale/partial results. Observed as intermittent SET_ROWS failures
+        // on Intel Linux iGPUs.
+        //
+        // We submit a tiny command buffer to the compute queue containing
+        // a barrier that makes shader/transfer writes available to the host
+        // domain, and wait on a fence. Submission order guarantees this
+        // executes after any prior compute work, the barrier flushes any
+        // device caches into the coherent memory, and the fence wait gives
+        // the host a memory dependency on those writes.
+        {
+            std::lock_guard<std::recursive_mutex> guard(src->device->mutex);
+
+            {
+                std::lock_guard<std::mutex> qguard(queue_mutex);
+                src->device->compute_queue.queue.waitIdle();
+                if (src->device->transfer_queue.queue != src->device->compute_queue.queue) {
+                    src->device->transfer_queue.queue.waitIdle();
+                }
+            }
+
+            vk_context subctx = ggml_vk_create_temporary_context(src->device->compute_queue.cmd_pool);
+            ggml_vk_ctx_begin(src->device, subctx);
+            subctx->s->buffer->buf.pipelineBarrier(
+                vk::PipelineStageFlagBits::eAllCommands,
+                vk::PipelineStageFlagBits::eHost,
+                {},
+                { { { vk::AccessFlagBits::eMemoryWrite, vk::AccessFlagBits::eHostRead } } },
+                {}, {});
+            ggml_vk_ctx_end(subctx);
+            ggml_vk_submit(subctx, src->device->fence);
+            VK_CHECK(src->device->device.waitForFences({ src->device->fence }, true, UINT64_MAX), "vk_buffer_read_2d uma waitForFences");
+            src->device->device.resetFences({ src->device->fence });
+            ggml_vk_queue_command_pools_cleanup(src->device);
+        }
+
         for (size_t i = 0; i < height; i++) {
             memcpy((uint8_t *) dst + i * dpitch, (const uint8_t *) src->ptr + offset + i * spitch, width);
         }

0cc4m · 2026-05-15T10:47:35Z

This is a pretty deep question about Vulkan API and llama.cpp backend library assumptions. I'm not sure the pipelineBarrier makes sense, host-coherent means the memory should be accessible from the host without manual cache management. Did you check if only the fence or a waitIdle() is enough? The non-memcpy path does a fence synchronization, so maybe the memcpy does need to make sure the GPU queue is idle. But for AMD and Nvidia this does not seem to be required, so I'm not sure if this is a driver bug or an issue on our side.

What do you think @jeffbolznv ?

jeffbolznv · 2026-05-15T13:44:41Z

Host-coherent means vkFlush/InvalidateMappedMemoryRanges aren't needed, but doesn't mean that you don't need to do a pipeline barrier from device to host. In practice, this pipeline barrier flushes the GPU L2 cache, whereas vkFlush/Invalidate are on the CPU cache.

I think the pipelinebarrier could either go here or in ggml_vk_synchronize. If we put it here, I think we also need it in some other places (like potentially read_async), but if we put it in ggml_vk_synchronize then we may do it more often than needed. I guess I lean toward putting it in the read paths.

I don't think the additional idling in this diff is specifically what's fixing the issue, since there ought to be a ggml_vk_synchronize before the read.

0cc4m · 2026-05-27T08:47:29Z

I opened a PR with a possible solution in #23770

winstonma requested a review from a team as a code owner May 11, 2026 01:54

github-actions Bot added Vulkan Issues specific to the Vulkan backend ggml changes relating to the ggml tensor library for machine learning labels May 11, 2026

implement UMA host-visible memory

08319b9

winstonma force-pushed the vulkan-uma-allocation branch from 95fdd41 to 08319b9 Compare May 12, 2026 03:25

winstonma changed the title ~~UMA buffers get host-visible memory at allocation time~~ UMA buffers prefer host-visible memory May 12, 2026

0cc4m reviewed May 13, 2026

View reviewed changes

Comment thread ggml/src/ggml-vulkan/ggml-vulkan.cpp Outdated

Update ggml/src/ggml-vulkan/ggml-vulkan.cpp

88803d5

Co-authored-by: Ruben Ortlam <picard12@live.de>

engrtipusultan mentioned this pull request May 13, 2026

vulkan: avoid preferring transfer queue on AMD UMA devices #22455

Merged

Merge branch 'ggml-org:master' into vulkan-uma-allocation

f3c217b

Merge branch 'ggml-org:master' into vulkan-uma-allocation

dddb22e

0cc4m mentioned this pull request May 27, 2026

vulkan: add pipeline barriers for memcpy read/write operations #23770

Open

winstonma mentioned this pull request May 27, 2026

vulkan: fix UMA performance by preferring cached host memory and handling non… #23762

Open

Conversation

winstonma commented May 11, 2026

Overview

Additional information

Requirements

Uh oh!

ggml-gh-bot Bot commented May 11, 2026

Uh oh!

jeffbolznv commented May 11, 2026

Uh oh!

winstonma commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Context & Investigation

Benchmarks

Uh oh!

0cc4m commented May 12, 2026

Uh oh!

winstonma commented May 12, 2026

Uh oh!

0cc4m commented May 13, 2026

Uh oh!

Uh oh!

winstonma commented May 13, 2026

Uh oh!

0cc4m commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

0cc4m commented May 13, 2026

Uh oh!

rillomas commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

0cc4m commented May 14, 2026

Uh oh!

rillomas commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jeffbolznv commented May 14, 2026

Uh oh!

rillomas commented May 15, 2026

Uh oh!

0cc4m commented May 15, 2026

Uh oh!

jeffbolznv commented May 15, 2026

Uh oh!

0cc4m commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

winstonma commented May 12, 2026 •

edited

Loading

0cc4m commented May 13, 2026 •

edited

Loading

rillomas commented May 14, 2026 •

edited

Loading

rillomas commented May 14, 2026 •

edited

Loading