Gemlite 0.6.0 major upgrade #56

Open
mobicham wants to merge 56 commits into dropbox:master from mobicham:rtx6000_debug
@mobicham
Collaborator

This PR makes major changes to gemlite to fix various bugs and improve performance, especially on sm_120 GPUs (more specifically, the RTX PRO 6000).

  • Various bug fixes for unfriendly shapes, and improved performance for some kernels on friendly shapes by skipping masking during loads and stores.
  • Activation quant kernels have been rewritten for improved speed. For example, the MXFP4/NVFP4 kernels now use cvt.rn.satfinite.e2m1x2.f16x2 when ptxas 13 is available, and fall back to an alternative implementation with older ptxas versions.
  • Big improvement for MXFP/NVFP kernels, sometimes outperforming FlashInfer's CUTLASS NVFP4 kernel. The kernels use TMA for a, b, and b_scales, but not for a_scales, to improve decoding speed on the RTX PRO 6000.
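
To illustrate the first point, here is a minimal host-side sketch of how a launcher might decide whether masking can be skipped. The function name, the block-size parameters, and the divisibility rule used to define a "friendly" shape are assumptions for illustration and are not taken from gemlite's code.

```python
def needs_masking(M: int, N: int, K: int,
                  block_m: int, block_n: int, block_k: int) -> bool:
    """Return True if any dimension is not a multiple of its tile size.

    Hypothetical helper: when this returns False, every tile is full, so a
    kernel could load and store without bounds masks.
    """
    return (M % block_m != 0) or (N % block_n != 0) or (K % block_k != 0)


# A shape whose dimensions divide evenly by the tile sizes needs no masks.
print(needs_masking(1024, 4096, 4096, 64, 128, 64))
# A shape with M=1000 leaves a partial tile along M, so masks are required.
print(needs_masking(1000, 4096, 4096, 64, 128, 64))
```

In a Triton kernel, a flag like this would typically be passed as a compile-time constant so the unmasked path is specialized at compile time rather than branched on at run time.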

mobicham self-assigned this on Mar 14, 2026
mobicham changed the title from "Gemlite 0.6.0 upgrade Fixes" to "Gemlite 0.6.0 major upgrade" on Mar 14, 2026
mobicham and others added 5 commits March 19, 2026 08:02
…ersion)

Requires CUDA 13.0+ ptxas. Gives ~10% speedup on activation quantization
kernels, translating to 3-5% end-to-end improvement at M=1024.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds GEMLITE_ENABLE_PTX_PACK global (default False) and
gemlite.set_ptx_pack(True/False) API. When enabled, MXFP4/NVFP4
activation quantization kernels use hardware cvt.rn.satfinite.e2m1x2
PTX instruction instead of threshold comparisons. Requires CUDA 13.0+
ptxas to be installed in the Triton backends directory.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
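
For readers unfamiliar with the "threshold comparisons" path mentioned in the commit message above, the following is a rough scalar sketch of encoding a value to FP4 E2M1 with round-to-nearest-even and saturation. It is plain Python rather than the Triton kernel, the helper names are hypothetical, and the nibble order used when packing two codes into a byte is an assumption. NaN handling is not modeled.

```python
# Magnitudes representable in FP4 E2M1, indexed by their 3-bit code.
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)

# Midpoints between neighbouring magnitudes, paired with whether an exact tie
# rounds up. Ties go to the neighbour with an even code (round-to-nearest-even).
THRESHOLDS = (
    (0.25, False),
    (0.75, True),
    (1.25, False),
    (1.75, True),
    (2.5, False),
    (3.5, True),
    (5.0, False),
)


def encode_e2m1(x: float) -> int:
    """Return the 4-bit E2M1 code (sign bit in bit 3) nearest to x."""
    sign = 0x8 if x < 0 else 0x0
    mag = abs(x)
    code = 0
    # Count how many thresholds the magnitude passes. Anything above 5.0
    # lands on code 7 (value 6.0), which gives the saturating behaviour.
    for threshold, tie_rounds_up in THRESHOLDS:
        if mag > threshold or (tie_rounds_up and mag == threshold):
            code += 1
    return sign | code


def decode_e2m1(code: int) -> float:
    """Inverse of encode_e2m1 for the 16 possible codes."""
    value = E2M1_VALUES[code & 0x7]
    return -value if code & 0x8 else value


def pack_pair(lo: float, hi: float) -> int:
    """Pack two codes into one byte (assumed order: lo in the low nibble)."""
    return (encode_e2m1(hi) << 4) | encode_e2m1(lo)


print(decode_e2m1(encode_e2m1(2.4)))
print(hex(pack_pair(1.0, -6.0)))
```

The hardware instruction performs the conversion and packing in one step, which is presumably where the speedup reported in the earlier commit comes from; this sketch only shows the arithmetic that the comparison-based path has to reproduce.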