Fused kernel for calculating offsets from first dim splits #2755
ksivaman wants to merge 6 commits into NVIDIA:main
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
/te-ci
Greptile Summary
This PR replaces a three-operation PyTorch sequence (
Confidence Score: 4/5
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["splits_to_offsets(first_dims, logical_last_dim)
[PyTorch Python API]"]
B["splits_to_offsets()
misc.cpp
— validate: CUDA, int64, 1D, logical_last_dim > 0
— allocate output[num_tensors + 1]"]
C["nvte_splits_to_offsets()
common.cu [C API]
— validate: non-null, num_tensors > 0, logical_last_dim > 0"]
D["splits_to_offsets_kernel<<<1, 256>>>
— thread 0: output[0] = 0, chunk_prefix = 0
— chunk loop (stride 256):
load first_dims × logical_last_dim → block_scan
Hillis-Steele inclusive scan
write output[idx+1] = chunk_prefix + block_scan[tid]
thread 255: chunk_prefix += block_scan[255]"]
E["build_grouped_tensor_offsets()
quantizer.cpp [internal path]"]
F["output[0..num_tensors]
device int64 prefix-sum tensor"]
A --> B --> C --> D --> F
E -->|"NVTE_SCOPED_GIL_RELEASE"| C
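The chunked scan in node D of the flowchart can be modeled on the host as follows. This is a minimal sketch in plain Python, not the CUDA kernel from the PR: the function names `hillis_steele_inclusive_scan` and `splits_to_offsets_reference` are hypothetical, and zero-padding the final partial chunk is an assumption about how out-of-range lanes are handled.

```python
BLOCK = 256  # lanes per chunk, matching the <<<1, 256>>> launch in the flowchart


def hillis_steele_inclusive_scan(values):
    """Inclusive prefix sum using the Hillis-Steele doubling pattern."""
    scan = list(values)
    stride = 1
    while stride < len(scan):
        # Each "thread" reads the previous round's values, then all write at once.
        prev = scan[:]
        for tid in range(stride, len(scan)):
            scan[tid] = prev[tid] + prev[tid - stride]
        stride *= 2
    return scan


def splits_to_offsets_reference(first_dims, logical_last_dim):
    """Host-side model of the chunked scan (hypothetical helper, not the real kernel)."""
    num_tensors = len(first_dims)
    output = [0] * (num_tensors + 1)  # output[0] = 0
    chunk_prefix = 0                  # running total carried between chunks
    for start in range(0, num_tensors, BLOCK):
        # Load first_dims * logical_last_dim for this chunk.
        chunk = [d * logical_last_dim for d in first_dims[start:start + BLOCK]]
        # Assumption: lanes past the end contribute zero to the scan.
        chunk += [0] * (BLOCK - len(chunk))
        block_scan = hillis_steele_inclusive_scan(chunk)
        for tid in range(BLOCK):
            idx = start + tid
            if idx < num_tensors:
                output[idx + 1] = chunk_prefix + block_scan[tid]
        # The last lane holds the chunk total, which seeds the next chunk.
        chunk_prefix += block_scan[BLOCK - 1]
    return output
```

For example, `splits_to_offsets_reference([2, 3, 1], 4)` returns `[0, 8, 20, 24]`: the scaled sizes are 8, 12, and 4, and the output holds their running sum with a leading zero.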
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Description
Introduce a single kernel that calculates offsets from first-dimension splits for the case where the last dimension does not vary. Previously, the scaling, cumulative sum, and zero concatenation ran as separate, unfused operations.
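For reference, the three unfused steps named above can be written out as follows. This is a sketch in plain Python rather than the PyTorch calls the PR replaces, which are not shown here; `unfused_offsets` is a hypothetical name used only for illustration.

```python
from itertools import accumulate


def unfused_offsets(first_dims, logical_last_dim):
    """Illustrative model of the three separate steps (hypothetical helper)."""
    # 1. Scale: element count of each split, given a fixed last dimension.
    scaled = [d * logical_last_dim for d in first_dims]
    # 2. Cumulative sum: end offset of each split.
    running = list(accumulate(scaled))
    # 3. Zero concat: prepend 0 so entry i is the start offset of split i.
    return [0] + running
```

With `first_dims = [2, 3, 1]` and `logical_last_dim = 4`, this returns `[0, 8, 20, 24]`. The fused kernel is intended to produce the same `num_tensors + 1` offsets in a single launch.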
Type of change
Changes
Checklist: