
PCIe transfer performance #126

Draft · amd-vserbu wants to merge 23 commits into Xilinx:dev from amd-vserbu:perf/pcie-transfer-performance



@amd-vserbu

Collaborator

PCIe transfer performance: registered buffers + two-channel transfers

Branch: perf/pcie-transfer-performance · Target: dev · Status: draft

Summary

This PR reworks the QDMA host↔device data path to get much more bandwidth out of bulk transfers. Two changes do most of the work: a new registered-buffer path that pins and DMA-maps a host buffer once and reuses it for many transfers (instead of paying that cost per transfer), and a placement-aware policy that splits each transfer across both of the V80's PCIe NoC channels so both paths stay busy. v80-smi validate gains the bandwidth-benchmark knobs used to measure all of this.

What changed

  • Driver: new registered-buffer ioctls (register once, transfer many times, auto-cleanup on close), plus per-queue NoC channel selection so the two PCIe masters can be A/B tested.
  • Channel policy: a transfer is split across both channels based on the buffer's device address, so both the host-side ingress and the memory-side egress paths are driven (DDR splits in half; HBM routes by its half-memory boundary).
  • vrtd/libvrtd: buffer open now hands back two queue-pair fds (one per channel) and the client decides how to spread each transfer across them; host buffers use hugepages.
  • v80-smi: validate gains bandwidth modes over two backends (raw SLASH and the stock Xilinx QDMA driver) reporting Read/Write/Total for HBM and DDR, with knobs for channel selection, ring size, iteration/duration, and buffer placement.
  • Docs & packaging: documented the new ABI and validate options; Debian/RPM ship the local libqdma patches.

Earlier commits experimented with large-page transfers and custom libqdma scatter-gather/channel patches; those were dropped. The final path is 4 KiB-only, and the speedup comes from the registered-buffer fast path and the two-channel split.
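The channel policy in the list above can be pictured with a small sketch. Everything in it is hypothetical and not taken from the driver: the `span` type, both function names, and the boundary value supplied by the caller are illustrative assumptions. It only shows the two ideas the description names: halving a DDR transfer on a 4 KiB boundary, and cutting an HBM transfer at a fixed device-address boundary.

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_SZ 4096u /* 4 KiB granule, matching the PR's final 4 KiB-only path */

/* Hypothetical type: one channel's share of a transfer. */
struct span {
    uint64_t dev_addr;
    uint64_t length;
};

/* DDR-style policy (assumed): halve the transfer, rounding the cut down to a
 * page boundary. Transfers shorter than two pages land entirely in out[1]. */
static void split_in_half(uint64_t dev_addr, uint64_t length, struct span out[2])
{
    uint64_t first = (length / 2) & ~(uint64_t)(PAGE_SZ - 1);

    out[0] = (struct span){ dev_addr, first };
    out[1] = (struct span){ dev_addr + first, length - first };
}

/* HBM-style policy (assumed): cut at a fixed device-address boundary.
 * The caller supplies the boundary; dev_addr + length must not wrap. */
static void split_at_boundary(uint64_t dev_addr, uint64_t length,
                              uint64_t boundary, struct span out[2])
{
    uint64_t end = dev_addr + length;
    uint64_t cut = boundary;

    /* Clamp the cut into [dev_addr, end] so one side may be empty. */
    if (cut < dev_addr)
        cut = dev_addr;
    if (cut > end)
        cut = end;

    out[0] = (struct span){ dev_addr, cut - dev_addr };
    out[1] = (struct span){ cut, end - cut };
}
```

A span with `length == 0` means that channel has nothing to do for this transfer; a real policy would probably skip it rather than submit an empty sub-transfer.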

Results

Sustained one-directional bandwidth with the registered-buffer path and the two-channel split:

| Path | C2H (device→host) | H2C (host→device) |
|------|-------------------|-------------------|
| DDR  | ~23 GB/s          | ~23 GB/s          |
| HBM  | ~12 GB/s          | ~23 GB/s          |

  • 20 GB/s+ is reached on DDR with a single buffer, 2 threads, and 2 queues in sustained mode, as long as the buffer is large enough (64 MB is sufficient).
  • 2 MB pages were tried but discarded: they were slower than 4 KiB pages at all sizes.
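For reading figures like these, a bandwidth number can be derived from a byte count and an elapsed time as sketched below. The PR does not say how v80-smi computes its values, so this is an assumption: the helper name is hypothetical and GB is taken to mean 10^9 bytes.

```c
#include <stdint.h>

/* Decimal GB/s (1 GB = 1e9 bytes, an assumption) from bytes moved and elapsed
 * nanoseconds. One byte per nanosecond equals 1e9 bytes per second, so the
 * plain ratio is already in GB/s. Returns 0 for a zero-length interval. */
static double gb_per_s(uint64_t bytes, uint64_t elapsed_ns)
{
    if (elapsed_ns == 0)
        return 0.0;
    return (double)bytes / (double)elapsed_ns;
}
```

In a sustained-mode measurement the byte count would be the sum over all iterations, and the elapsed time would typically come from a monotonic clock read before the first and after the last transfer.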

Still to do

  • Running read and write at the same time almost halves throughput; needs investigation into whether the path is full-duplex.
  • Running HBM and DDR H2C together still caps at ~23 GB/s, with no gain over a single memory (an improvement was expected).
  • HBM C2H remains slow (~12 GB/s) compared to everything else.

Why this is still a draft

The transfer API is not in its final shape. Today it requires userspace to use threads to keep multiple transfers in flight. Before merging, we want to move back to plain read/write calls so multiple transfers can be submitted at once without threading.

@amd-vserbu amd-vserbu force-pushed the perf/pcie-transfer-performance branch from dc24ce1 to 67b2251 Compare June 17, 2026 15:12
…for RHEL 9.8

Signed-off-by: Vlad-Gabriel Serbu <Vlad-Gabriel.Serbu@amd.com>
…e /tmp

…scriptor size

… knobs

…ansfers

…olicy

…-only

@amd-vserbu amd-vserbu force-pushed the perf/pcie-transfer-performance branch from 67b2251 to 6599ce7 Compare June 17, 2026 16:46
@amd-vserbu

Collaborator Author

Updates

The earlier registered-buffer API worked but required userspace to run its own threads to keep both of the V80's PCIe NoC channels busy. That was the one thing blocking this from leaving draft. The API has now been reshaped so a single, plain transfer call keeps both channels saturated.

New commits:

  • Changed API to kernel allocated-buffers
  • vrtd: track client ownership on raw buffers and close per-owner

Goals of this update

  • Take the transfer API out of draft.
  • Make the fast path the simple path. A normal synchronous call should drive both NoC channels at once, with no special effort from the caller.
  • Move buffer lifetime management somewhere safe. Replace the register/unregister bookkeeping with a single, leak-proof ownership model.
  • Keep the door open for async io_uring as future work (unfortunately, it is not supported on the RHEL 9 and Ubuntu 22.04 GA kernels).

Key decisions

  • The kernel owns the DMA buffer now. Instead of userspace allocating memory and registering it, the kernel allocates the buffer, does all the expensive setup once up front, and hands back a buffer that userspace maps for CPU access. The buffer's lifetime is tied to that handle, so closing it is the only cleanup step and there is no leak path. This also removes the separate bounce-buffer allocation userspace used to manage.

  • One transfer can span both channels in a single call. A transfer handle now covers both NoC channels, and one call can carry several sub-transfers that the kernel runs concurrently. This is the decision that removes the threading requirement: a plain synchronous call now saturates both channels.

  • Async is optional. There is an io_uring-based async path where the kernel supports it, but the synchronous path is always available as the universal fallback. Once users start moving to newer distributions, we will be able to offer async support, which will require some VRT API changes.
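The first decision above, a buffer whose lifetime is tied to one handle that userspace maps for CPU access, can be illustrated entirely in userspace. This is only an analogy under stated assumptions: an anonymous temporary file stands in for the fd that the driver would return, and `demo_buffer`, `demo_buffer_create`, and `demo_buffer_destroy` are hypothetical names, not part of libslash. Nothing here performs DMA.

```c
#include <stdio.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for a kernel-owned buffer: an fd plus its CPU mapping. */
struct demo_buffer {
    int fd;
    void *addr;
    size_t length;
};

/* Create: obtain an fd, size it, and map it. A real client would receive the
 * fd from the driver; here an unlinked temporary file plays that role. */
static int demo_buffer_create(size_t length, struct demo_buffer *out)
{
    FILE *f = tmpfile();
    if (f == NULL)
        return -1;

    int fd = dup(fileno(f)); /* keep our own descriptor; fclose() drops the FILE */
    fclose(f);
    if (fd < 0)
        return -1;

    if (ftruncate(fd, (off_t)length) != 0) {
        close(fd);
        return -1;
    }

    void *p = mmap(NULL, length, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }

    *out = (struct demo_buffer){ fd, p, length };
    return 0;
}

/* Destroy: the single cleanup step. Unmaps, closes, and poisons the handle so
 * a second destroy cannot touch a stale fd or address. */
static int demo_buffer_destroy(struct demo_buffer *buf)
{
    int rc = munmap(buf->addr, buf->length);

    if (close(buf->fd) != 0)
        rc = -1;
    buf->fd = -1;
    buf->addr = NULL;
    buf->length = 0;
    return rc;
}
```

The point of the shape is that there is no separate register/unregister pair to get out of step: whoever holds the handle owns the memory, and dropping the handle releases it.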

Bugfix: per-client raw-buffer ownership

Separately from the API reshape, this update fixes a multi-client correctness problem with raw buffers (vrtd: track client ownership on raw buffers and close per-owner):

  • Raw buffers are now attributed to the client that created them and are cleaned up automatically when that client disconnects.
  • Because raw buffers can legitimately share the same address across different clients, closing one now selects the caller's own buffer, rather than rejecting the request or acting on the wrong buffer at the first address match.
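The ownership rule in the two points above amounts to keying the lookup on (owner, address) instead of address alone. The sketch below is a guess at that shape, not vrtd's code: the `raw_buf` table, the integer client id, and all three function names are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical bookkeeping entry for one raw buffer. */
struct raw_buf {
    int owner;     /* id of the client that created the buffer */
    uint64_t addr; /* buffer address; may repeat across different clients */
    int open;      /* nonzero while the buffer is live */
};

/* Find the live buffer at `addr` that belongs to `owner`, or NULL.
 * Matching on the address alone could stop at another client's entry. */
static struct raw_buf *find_owned(struct raw_buf *tab, size_t n,
                                  int owner, uint64_t addr)
{
    for (size_t i = 0; i < n; i++) {
        if (tab[i].open && tab[i].addr == addr && tab[i].owner == owner)
            return &tab[i];
    }
    return NULL;
}

/* Close the caller's own buffer at `addr`; -1 if the caller owns none there. */
static int close_owned(struct raw_buf *tab, size_t n, int owner, uint64_t addr)
{
    struct raw_buf *b = find_owned(tab, n, owner, addr);

    if (b == NULL)
        return -1;
    b->open = 0;
    return 0;
}

/* On disconnect: close everything the client still owns; returns the count. */
static size_t close_all_for(struct raw_buf *tab, size_t n, int owner)
{
    size_t closed = 0;

    for (size_t i = 0; i < n; i++) {
        if (tab[i].open && tab[i].owner == owner) {
            tab[i].open = 0;
            closed++;
        }
    }
    return closed;
}
```

With two clients holding buffers at the same address, a close from the second client skips the first client's entry and releases only its own.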

API shape

The finalized kernel ABI:

#define SLASH_QDMA_FD_MAX_QPAIRS 2u

/* QPAIR_GET_FD: bind 1..2 qpairs into one transfer fd; array index == qpair_index. */
struct slash_qdma_qpair_fd_request {
    __u32 size;
    __u32 qid;          /* legacy single qpair; used only when qpair_count == 0 */
    __u32 flags;        /* O_CLOEXEC only */
    __u32 qpair_count;  /* 1..SLASH_QDMA_FD_MAX_QPAIRS; 0 = use qid */
    __u32 qpair_ids[SLASH_QDMA_FD_MAX_QPAIRS];
};

/* BUF_CREATE: kernel allocates + DMA-maps once, returns a mappable buffer fd. */
struct slash_qdma_buf_create {
    __u32 size;
    __u32 flags;         /* O_CLOEXEC only */
    __u64 length;        /* [in]  page multiple */
    __u32 granule;       /* [out] bytes per descriptor (host page size) */
    __u32 transfer_hint; /* [out] enum slash_qdma_transfer_hint */
};

/* One per-qpair sub-transfer within a batch. */
struct slash_qdma_subxfer {
    __u32 qpair_index;   /* index into the fd's bound qpairs */
    __u32 direction;     /* enum slash_qdma_transfer_dir (H2C or C2H) */
    __s32 buf_fd;        /* kernel buffer fd from BUF_CREATE */
    __u32 pad0;
    __u64 buf_offset;
    __u64 dev_addr;      /* device-side endpoint address */
    __u64 length;
};

/* QPAIR_IOCTL_TRANSFER: submit up to FD_MAX_QPAIRS sub-transfers; distinct
 * qpairs run concurrently in-kernel. Returns total bytes transferred. */
struct slash_qdma_transfer {
    __u32 size;
    __u32 count;         /* 1..SLASH_QDMA_FD_MAX_QPAIRS */
    struct slash_qdma_subxfer xfers[SLASH_QDMA_FD_MAX_QPAIRS];
};

#define SLASH_QDMA_IOCTL_QPAIR_GET_FD   _IOWR('v', 0x53, struct slash_qdma_qpair_fd_request)
#define SLASH_QDMA_IOCTL_BUF_CREATE     _IOWR('v', 0x54, struct slash_qdma_buf_create)
/* 'v' 0x55 reserved (was SLASH_QDMA_IOCTL_BUF_UNREGISTER, removed) */
#define SLASH_QDMA_QPAIR_IOCTL_TRANSFER _IOWR('v', 0x56, struct slash_qdma_transfer)
#define SLASH_QDMA_URING_CMD_TRANSFER   0x56u  /* optional io_uring SQE cmd_op */
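As a rough illustration of how a caller might fill the batch request above, the sketch below builds a `struct slash_qdma_transfer` from a list of per-channel lengths. Several things are assumptions rather than facts from the driver: the structs are re-declared with `<stdint.h>` types so the block builds without the real header, `build_transfer` is a hypothetical helper, `size` is assumed to carry `sizeof` of the struct, and the parts are assumed to be contiguous in both the host buffer and device memory. The `direction` value is passed through untouched because the enum values are not shown here.

```c
#include <stdint.h>
#include <string.h>

#define SLASH_QDMA_FD_MAX_QPAIRS 2u

/* Mirrors of the ABI structs quoted above (field order and widths copied). */
struct slash_qdma_subxfer {
    uint32_t qpair_index;
    uint32_t direction;
    int32_t  buf_fd;
    uint32_t pad0;
    uint64_t buf_offset;
    uint64_t dev_addr;
    uint64_t length;
};

struct slash_qdma_transfer {
    uint32_t size;
    uint32_t count;
    struct slash_qdma_subxfer xfers[SLASH_QDMA_FD_MAX_QPAIRS];
};

/* Hypothetical helper: lay out `parts` back-to-back sub-transfers, one per
 * bound qpair, starting at buffer offset 0 and device address `dev_addr`.
 * Returns 0 on success, -1 on an invalid part count or an empty part. */
static int build_transfer(struct slash_qdma_transfer *req, int32_t buf_fd,
                          uint32_t direction, uint64_t dev_addr,
                          const uint64_t *part_len, uint32_t parts)
{
    uint64_t off = 0;

    if (parts == 0 || parts > SLASH_QDMA_FD_MAX_QPAIRS)
        return -1;

    memset(req, 0, sizeof(*req)); /* also zeroes pad0 */
    req->size = (uint32_t)sizeof(*req); /* ASSUMPTION: size means sizeof(struct) */

    for (uint32_t i = 0; i < parts; i++) {
        struct slash_qdma_subxfer *x = &req->xfers[i];

        if (part_len[i] == 0)
            return -1;
        x->qpair_index = i; /* index into the fd's bound qpairs */
        x->direction = direction;
        x->buf_fd = buf_fd;
        x->buf_offset = off;
        x->dev_addr = dev_addr + off;
        x->length = part_len[i];
        off += part_len[i];
    }
    req->count = parts;
    return 0;
}
```

A caller would then hand `req` to `ioctl(qpair_fd, SLASH_QDMA_QPAIR_IOCTL_TRANSFER, &req)` on a transfer fd obtained with two bound qpairs, and compare the returned byte count with the sum of the part lengths.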

The libslash (slash/qdma.h) wrappers over that ABI:

/* A kernel-owned DMA buffer and its CPU mapping. */
struct slash_qdma_buffer {
    int fd;                                       /* close via destroy */
    void *addr;                                   /* mmap of the buffer */
    uint64_t length;
    uint32_t granule;                             /* bytes per descriptor */
    enum slash_qdma_transfer_hint transfer_hint;  /* advisory channel policy */
};

/* Transfer fd bound to one or more qpairs (channels). */
int slash_qdma_qpair_get_fd_multi(struct slash_qdma *qdma, const uint32_t *qids,
                                  uint32_t qpair_count, int flags);

/* Create / destroy a kernel-owned buffer (also a qpair-fd form for SCM_RIGHTS clients). */
int slash_qdma_buffer_create(struct slash_qdma *qdma, uint64_t length,
                             struct slash_qdma_buffer *buf_out);
int slash_qdma_qpair_buffer_create(int qpair_fd, uint64_t length,
                                   struct slash_qdma_buffer *buf_out);
int slash_qdma_buffer_destroy(struct slash_qdma_buffer *buf);

/* Single sub-transfer convenience wrapper, and the batched multi-channel form. */
ssize_t slash_qdma_qpair_transfer(int qpair_fd, int buf_fd, uint64_t buf_offset,
                                  uint64_t dev_addr, uint64_t length, uint32_t direction);
ssize_t slash_qdma_qpair_transfer_batch(int qpair_fd,
                                        const struct slash_qdma_subxfer *xfers,
                                        uint32_t count);
