PCIe transfer performance #126
…for RHEL 9.8 Signed-off-by: Vlad-Gabriel Serbu <Vlad-Gabriel.Serbu@amd.com>
…e /tmp Signed-off-by: Vlad-Gabriel Serbu <Vlad-Gabriel.Serbu@amd.com>
…scriptor size Signed-off-by: Vlad-Gabriel Serbu <Vlad-Gabriel.Serbu@amd.com>
… knobs Signed-off-by: Vlad-Gabriel Serbu <Vlad-Gabriel.Serbu@amd.com>
…ansfers Signed-off-by: Vlad-Gabriel Serbu <Vlad-Gabriel.Serbu@amd.com>
…olicy Signed-off-by: Vlad-Gabriel Serbu <Vlad-Gabriel.Serbu@amd.com>
…-only Signed-off-by: Vlad-Gabriel Serbu <Vlad-Gabriel.Serbu@amd.com>
Updates
The earlier registered-buffer API worked but required userspace to run its own threads to keep both of the V80's PCIe NoC channels busy. That was the one thing blocking this from leaving draft. The API has now been reshaped so a single, plain transfer call keeps both channels saturated. New commits:
Goals of this update
Key decisions
Bugfix: per-client raw-buffer ownership
Separately from the API reshape, this update fixes a multi-client correctness problem with raw buffers (
API shape
The finalized kernel ABI:
#define SLASH_QDMA_FD_MAX_QPAIRS 2u
/* QPAIR_GET_FD: bind 1..2 qpairs into one transfer fd; array index == qpair_index. */
struct slash_qdma_qpair_fd_request {
    __u32 size;
    __u32 qid; /* legacy single qpair; used only when qpair_count == 0 */
    __u32 flags; /* O_CLOEXEC only */
    __u32 qpair_count; /* 1..SLASH_QDMA_FD_MAX_QPAIRS; 0 = use qid */
    __u32 qpair_ids[SLASH_QDMA_FD_MAX_QPAIRS];
};

/* BUF_CREATE: kernel allocates + DMA-maps once, returns a mappable buffer fd. */
struct slash_qdma_buf_create {
    __u32 size;
    __u32 flags; /* O_CLOEXEC only */
    __u64 length; /* [in] page multiple */
    __u32 granule; /* [out] bytes per descriptor (host page size) */
    __u32 transfer_hint; /* [out] enum slash_qdma_transfer_hint */
};

/* One per-qpair sub-transfer within a batch. */
struct slash_qdma_subxfer {
    __u32 qpair_index; /* index into the fd's bound qpairs */
    __u32 direction; /* enum slash_qdma_transfer_dir (H2C or C2H) */
    __s32 buf_fd; /* kernel buffer fd from BUF_CREATE */
    __u32 pad0;
    __u64 buf_offset;
    __u64 dev_addr; /* device-side endpoint address */
    __u64 length;
};

/* QPAIR_IOCTL_TRANSFER: submit up to FD_MAX_QPAIRS sub-transfers; distinct
 * qpairs run concurrently in-kernel. Returns total bytes transferred. */
struct slash_qdma_transfer {
    __u32 size;
    __u32 count; /* 1..SLASH_QDMA_FD_MAX_QPAIRS */
    struct slash_qdma_subxfer xfers[SLASH_QDMA_FD_MAX_QPAIRS];
};

#define SLASH_QDMA_IOCTL_QPAIR_GET_FD _IOWR('v', 0x53, struct slash_qdma_qpair_fd_request)
#define SLASH_QDMA_IOCTL_BUF_CREATE _IOWR('v', 0x54, struct slash_qdma_buf_create)
/* 'v' 0x55 reserved (was SLASH_QDMA_IOCTL_BUF_UNREGISTER, removed) */
#define SLASH_QDMA_QPAIR_IOCTL_TRANSFER _IOWR('v', 0x56, struct slash_qdma_transfer)
#define SLASH_QDMA_URING_CMD_TRANSFER 0x56u /* optional io_uring SQE cmd_op */
The libslash (
/* A kernel-owned DMA buffer and its CPU mapping. */
struct slash_qdma_buffer {
    int fd; /* close via destroy */
    void *addr; /* mmap of the buffer */
    uint64_t length;
    uint32_t granule; /* bytes per descriptor */
    enum slash_qdma_transfer_hint transfer_hint; /* advisory channel policy */
};

/* Transfer fd bound to one or more qpairs (channels). */
int slash_qdma_qpair_get_fd_multi(struct slash_qdma *qdma, const uint32_t *qids,
                                  uint32_t qpair_count, int flags);

/* Create / destroy a kernel-owned buffer (also a qpair-fd form for SCM_RIGHTS clients). */
int slash_qdma_buffer_create(struct slash_qdma *qdma, uint64_t length,
                             struct slash_qdma_buffer *buf_out);
int slash_qdma_qpair_buffer_create(int qpair_fd, uint64_t length,
                                   struct slash_qdma_buffer *buf_out);
int slash_qdma_buffer_destroy(struct slash_qdma_buffer *buf);

/* Single sub-transfer convenience wrapper, and the batched multi-channel form. */
ssize_t slash_qdma_qpair_transfer(int qpair_fd, int buf_fd, uint64_t buf_offset,
                                  uint64_t dev_addr, uint64_t length, uint32_t direction);
ssize_t slash_qdma_qpair_transfer_batch(int qpair_fd,
                                        const struct slash_qdma_subxfer *xfers,
                                        uint32_t count);
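To make the batched form concrete, here is a minimal sketch of how a caller might build the sub-transfer array for one contiguous transfer. It assumes the split is cut at a granule-aligned midpoint and that the device address advances linearly with the buffer offset; neither is confirmed by this PR, and the driver's real placement policy may differ. The helper `split_across_qpairs` is hypothetical and not part of libslash, and the struct is a local mirror of the ABI listing above with `<stdint.h>` types in place of the kernel `__u32`/`__s32`/`__u64`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SLASH_QDMA_FD_MAX_QPAIRS 2u

/* Local mirror of struct slash_qdma_subxfer from the ABI listing above. */
struct slash_qdma_subxfer {
    uint32_t qpair_index; /* index into the fd's bound qpairs */
    uint32_t direction;   /* enum slash_qdma_transfer_dir, passed through opaquely */
    int32_t  buf_fd;      /* kernel buffer fd from BUF_CREATE */
    uint32_t pad0;
    uint64_t buf_offset;
    uint64_t dev_addr;
    uint64_t length;
};

/*
 * Hypothetical helper, not part of libslash: split one contiguous transfer
 * into at most two sub-transfers, one per bound qpair. The cut point is the
 * midpoint rounded up to a whole number of granules (an assumption).
 * Returns the number of entries written to out[], or 0 on invalid input.
 */
static uint32_t split_across_qpairs(int32_t buf_fd, uint64_t buf_offset,
                                    uint64_t dev_addr, uint64_t length,
                                    uint32_t granule, uint32_t direction,
                                    struct slash_qdma_subxfer out[SLASH_QDMA_FD_MAX_QPAIRS])
{
    assert(out != NULL);
    if (length == 0 || granule == 0)
        return 0;

    /* First part: half the length, rounded up to a granule boundary. */
    uint64_t first = ((length / 2 + granule - 1) / granule) * granule;
    if (first > length)
        first = length;

    out[0] = (struct slash_qdma_subxfer){
        .qpair_index = 0, .direction = direction, .buf_fd = buf_fd,
        .buf_offset = buf_offset, .dev_addr = dev_addr, .length = first,
    };
    if (first == length)
        return 1; /* too small to split; use a single qpair */

    /* Second part: the remainder, on the other qpair. */
    out[1] = (struct slash_qdma_subxfer){
        .qpair_index = 1, .direction = direction, .buf_fd = buf_fd,
        .buf_offset = buf_offset + first, .dev_addr = dev_addr + first,
        .length = length - first,
    };
    return 2;
}
```

The returned count and the filled array have the shape that `slash_qdma_qpair_transfer_batch(qpair_fd, xfers, count)` takes, with `granule` taken from the `slash_qdma_buffer` returned at buffer creation. A transfer no larger than one granule stays on a single qpair.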
PCIe transfer performance: registered buffers + two-channel transfers
Branch: perf/pcie-transfer-performance → target: dev · Status: draft
Summary
This PR reworks the QDMA host↔device data path to get much more bandwidth out of bulk transfers. Two changes do most of the work: a new registered-buffer path that pins and DMA-maps a host buffer once and reuses it for many transfers (instead of paying that cost per transfer), and a placement-aware policy that splits each transfer across both of the V80's PCIe NoC channels so both paths stay busy.
v80-smi validate gains the bandwidth-benchmark knobs used to measure all of this.
What changed
validate gains bandwidth modes over two backends (raw SLASH and the stock Xilinx QDMA driver) reporting Read/Write/Total for HBM and DDR, with knobs for channel selection, ring size, iteration/duration, and buffer placement.
validate options; Debian/RPM ship the local libqdma patches.
Earlier commits experimented with large-page transfers and custom libqdma scatter-gather/channel patches; those were dropped. The final path is 4 KiB-only, and the speedup comes from the registered-buffer fast path and the two-channel split.
Results
Sustained one-directional bandwidth with the registered-buffer path and the two-channel split:
Still to do
Why this is still a draft
The transfer API is not in its final shape. Today it requires userspace to use threads to keep multiple transfers in flight. Before merging, we want to move back to plain read/write calls so multiple transfers can be submitted at once without threading.