Improve TE Group MLP CPU Overhead by zhongbozhu · Pull Request #2991 · NVIDIA/TransformerEngine

zhongbozhu · 2026-05-14T06:37:23Z

Description

Reduce the CPU overhead of the TE grouped MLP for the case where CUDA graphs are not enabled.

This PR addresses issue #2897.


Type of change

Documentation change (change only to the documentation, either a fix or new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Please list the changes introduced in this PR:

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

for more information, see https://pre-commit.ci

zhongbozhu and others added 2 commits May 13, 2026 23:35

updated

b6938ff

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

5643f46

for more information, see https://pre-commit.ci