[CI] Refactor CI build on GitHub #2723

Open
ptrendx wants to merge 26 commits into NVIDIA:main from ptrendx:pr_github_build

Conversation

@ptrendx
Member

@ptrendx ptrendx commented Mar 2, 2026

Description

Run the build in a minimal container and use pip-wheel-based CUDA to minimize the disk space used.

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

ptrendx and others added 24 commits February 24, 2026 15:16
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
@ptrendx ptrendx marked this pull request as ready for review March 10, 2026 18:02
@greptile-apps
Contributor

greptile-apps bot commented Mar 10, 2026

Greptile Summary

This PR refactors the GitHub CI build pipeline to use pre-built minimal containers (ghcr.io/ptrendx/te_gha_pytorch:latest and ghcr.io/ptrendx/te_gha_all:latest) and pip-wheel-based CUDA instead of spinning up a full DEVEL container at runtime. It also replaces sccache with standard ccache backed by actions/cache@v4, removes the disk-space maximization steps, and adds a get_cuda_library_dirs() utility for wheel-based CUDA library discovery.

Key changes:

  • build.yml: All four jobs (core, pytorch, jax, all) now use actions/cache@v4 + ccache; pytorch and all jobs use new pre-built containers and set CUDA_PATH/CUDNN_PATH from the nvidia pip namespace.
  • New Dockerfiles (Dockerfile.pytorch, Dockerfile.all): Define the new minimal CI containers with CUDA 13.x pip wheels; create unversioned .so symlinks for cudart, cublas, and nccl — but omit a corresponding libcudnn.so symlink.
  • logging.h: Moves #include "nccl.h" inside #ifdef NVTE_WITH_CUBLASMP, but the NVTE_CHECK_NCCL macro (which references ncclResult_t etc.) remains defined unconditionally — a latent compilation hazard for any consumer that doesn't separately include nccl.h.
  • utils.py: Adds get_cuda_library_dirs() and a fallback for the nvidia-cuda-runtime (non-cu12-suffixed) package in cuda_version().
  • jax.py: Wires get_cuda_library_dirs() into the JAX extension build.
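
The wheel-based library discovery mentioned above can be sketched roughly as follows. This is a hypothetical approximation, not the PR's actual implementation: the real get_cuda_library_dirs() in build_tools/utils.py mirrors get_cuda_include_dirs() and may walk the nvidia namespace package differently.

```python
from pathlib import Path


def get_cuda_library_dirs() -> list[Path]:
    """Collect lib/ directories shipped by the nvidia CUDA pip wheels.

    Hypothetical sketch: returns an empty list when no nvidia wheels
    are installed, otherwise every lib/ directory found under the
    nvidia namespace package (cudnn, nccl, cu13 subpackages, etc.).
    """
    try:
        import nvidia  # namespace package installed by nvidia-* wheels
    except ImportError:
        return []
    lib_dirs: list[Path] = []
    for base in nvidia.__path__:
        for lib in Path(base).glob("**/lib"):
            if lib.is_dir():
                lib_dirs.append(lib)
    return lib_dirs
```

The result would then be passed as extra library search paths to the extension build, the way build_tools/jax.py wires it in.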

Confidence Score: 3/5

  • PR is mostly safe but has one latent C++ compilation issue and a potential link-time failure in the new Dockerfiles.
  • The CI workflow refactor is well-structured, but moving nccl.h inside #ifdef NVTE_WITH_CUBLASMP while leaving NVTE_CHECK_NCCL unconditionally defined in logging.h creates a broken public API contract that will break compilation for future consumers. Additionally, the new Dockerfiles create .so symlinks for cudart, cublas, and nccl but not for cudnn, which could produce link errors depending on what the cudnn pip wheel ships.
  • transformer_engine/common/util/logging.h (NVTE_CHECK_NCCL / nccl.h guard mismatch) and build_tools/ci/Dockerfile.pytorch / build_tools/ci/Dockerfile.all (missing libcudnn.so symlink).

Important Files Changed

Filename Overview
.github/workflows/build.yml Major CI refactor: replaces manual Docker container spin-up and disk-space maximization steps with pre-built containers and native ccache via actions/cache@v4. Switches from sccache to standard ccache. Adds CUDA_PATH/CUDNN_PATH exports derived from nvidia pip wheels.
build_tools/ci/Dockerfile.pytorch New Dockerfile for the pytorch CI container; installs CUDA 13.x nvidia wheels and creates symlinks for cudart, cublas, and nccl — but is missing a libcudnn.so unversioned symlink which could cause link failures.
build_tools/ci/Dockerfile.all New Dockerfile for the all-framework CI container; adds JAX and PyTorch with CUDA 13 nvidia wheels, but also missing a libcudnn.so unversioned symlink.
build_tools/utils.py Adds get_cuda_library_dirs() mirroring get_cuda_include_dirs(), and adds a fallback for the generic nvidia-cuda-runtime package in cuda_version(). The return type annotation (Tuple[str, str]) is copy-pasted incorrectly from the include-dir counterpart but this is a pre-existing style issue.
build_tools/jax.py Imports and passes get_cuda_library_dirs() to the JAX extension build; clean, straightforward addition.
transformer_engine/common/util/logging.h Moves nccl.h include inside #ifdef NVTE_WITH_CUBLASMP, but NVTE_CHECK_NCCL macro (outside the guard) still references NCCL types, creating a latent compilation failure for consumers that don't separately include nccl.h.
transformer_engine/jax/csrc/extensions/cgemm_helper.h Adds direct #include <nccl.h> to compensate for the removal from logging.h; correct for this file but doesn't fix the broader logging.h API contract issue.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[GitHub Actions Trigger] --> B{Job}
    B --> C[core job - ubuntu-latest runner]
    B --> D[pytorch job]
    B --> E[jax job]
    B --> F[all job]

    C --> C1[apt install ccache + cuda libs]
    C1 --> C2[actions/cache for /root/.ccache]
    C2 --> C3[pip install TE - NVTE_FRAMEWORK=none]
    C3 --> C4[Sanity check]

    D --> D1[Container: te_gha_pytorch - Dockerfile.pytorch]
    D1 --> D2[actions/cache for /root/.ccache]
    D2 --> D3[Set CUDA_PATH and CUDNN_PATH from nvidia wheels]
    D3 --> D4[pip install TE - NVTE_FRAMEWORK=pytorch - MAX_JOBS=2]
    D4 --> D5[Sanity check]

    E --> E1[Container: ghcr.io/nvidia/jax:jax]
    E1 --> E2[apt install ccache]
    E2 --> E3[actions/cache for /root/.ccache]
    E3 --> E4[pip install TE - NVTE_FRAMEWORK=jax]
    E4 --> E5[Sanity check]

    F --> F1[Container: te_gha_all - Dockerfile.all]
    F1 --> F2[actions/cache for /root/.ccache]
    F2 --> F3[Set CUDA_PATH and CUDNN_PATH from nvidia wheels]
    F3 --> F4[pip install TE - NVTE_FRAMEWORK=all - MAX_JOBS=2]
    F4 --> F5[Sanity check PyTorch]
    F5 --> F6[Sanity check JAX]

Comments Outside Diff (1)

  1. transformer_engine/common/util/logging.h, lines 109–115

    NVTE_CHECK_NCCL macro missing nccl.h include

    NVTE_CHECK_NCCL (lines 109–115) references ncclResult_t, ncclSuccess, and ncclGetErrorString, all of which are declared in nccl.h. However, nccl.h is now only included inside the #ifdef NVTE_WITH_CUBLASMP guard (lines 15–19). Since NVTE_CHECK_NCCL is defined unconditionally, any translation unit that:

    • includes logging.h (directly or transitively), AND
    • expands NVTE_CHECK_NCCL, AND
    • does NOT separately #include <nccl.h> before use

    …will fail to compile with "unknown type ncclResult_t" or similar errors when NVTE_WITH_CUBLASMP is not defined.

    The current codebase avoids this only because cgemm_helper.h (the sole non-logging user) now adds its own #include <nccl.h>. But NVTE_CHECK_NCCL is a public macro in a general-purpose utility header, so this is a fragile and surprising contract for future consumers.

    The fix is to either move NVTE_CHECK_NCCL inside the same #ifdef NVTE_WITH_CUBLASMP block, or add an unconditional #include "nccl.h" back at the top of the file alongside the other unconditional CUDA library includes.

Last reviewed commit: 42c7836

Comment on lines +322 to +323
except:
    pass
Contributor

Bare except: swallows all exceptions

The bare except: clause silently catches everything including KeyboardInterrupt, SystemExit, and MemoryError, making debugging harder if the failure isn't PackageNotFoundError. It should be narrowed to the expected exception type:

Suggested change
-except:
-    pass
+except importlib.metadata.PackageNotFoundError:
+    pass
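
The narrowed handler composes naturally into a small lookup helper. try_version() below is a hypothetical illustration of the pattern, not code from this PR:

```python
import importlib.metadata


def try_version(package: str):
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return importlib.metadata.version(package)
    except importlib.metadata.PackageNotFoundError:
        # Only the expected "package not installed" case is swallowed;
        # KeyboardInterrupt, SystemExit, and MemoryError still propagate.
        return None


print(try_version("surely-not-a-real-package"))  # → None
```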

if not force_wheels and cuda_toolkit_include_path() is not None:
    return []

# Use pip wheels to include all headers.
Contributor

Copy-paste comment refers to "headers" instead of "libraries"

This comment was copied from get_cuda_include_dirs and still says "headers" instead of "libraries":

Suggested change
-# Use pip wheels to include all headers.
+# Use pip wheels to find all library directories.

ptrendx added 2 commits March 10, 2026 21:42
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
Comment on lines +22 to +28

# Create symlinks for CUDA libraries
RUN CUDA_PATH=$(python3 -c "import nvidia; print(list(nvidia.__path__)[0] + '/cu13')") && \
    ln -s $CUDA_PATH/lib/libcudart.so.13 $CUDA_PATH/lib/libcudart.so && \
    ln -s $CUDA_PATH/lib/libcublas.so.13 $CUDA_PATH/lib/libcublas.so && \
    ln -s $CUDA_PATH/../nccl/lib/libnccl.so.2 $CUDA_PATH/../nccl/lib/libnccl.so && \
    ln -s $CUDA_PATH/lib $CUDA_PATH/lib64
Contributor

Missing libcudnn.so unversioned symlink

The symlink block creates unversioned aliases for libcudart, libcublas, and libnccl (needed for -lcudart, -lcublas, -lnccl at link time), but no equivalent symlink for libcudnn.so. The logging.h header unconditionally includes <cudnn.h>, and the build links against cudnn. If the cudnn pip wheel (typically pulled in as a torch dependency) only ships a versioned filename such as libcudnn.so.9, the linker will fail to resolve -lcudnn without an unversioned symlink.

Consider adding a similar symlink for cudnn, e.g.:

CUDNN_LIB=$(python3 -c "import nvidia; print(list(nvidia.__path__)[0] + '/cudnn/lib')") && \
    ln -s $CUDNN_LIB/libcudnn.so.9 $CUDNN_LIB/libcudnn.so

The same applies to Dockerfile.all.
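
The reviewer's proposed fix generalizes to any versioned shared library. The helper below is a hypothetical Python sketch of creating unversioned .so aliases; the Dockerfiles in this PR use plain shell ln -s, not this code:

```python
from pathlib import Path


def ensure_unversioned_symlinks(lib_dir: Path) -> list[Path]:
    """For every libfoo.so.N in lib_dir, create a libfoo.so symlink.

    Mirrors the effect of `ln -s libfoo.so.N libfoo.so` for each
    versioned library, skipping names that already exist.
    """
    created: list[Path] = []
    for versioned in sorted(lib_dir.glob("lib*.so.*")):
        # Strip everything after ".so" to get the unversioned name,
        # e.g. libcudnn.so.9 -> libcudnn.so.
        unversioned = lib_dir / (versioned.name.split(".so")[0] + ".so")
        if not unversioned.exists():
            unversioned.symlink_to(versioned.name)  # relative, like ln -s
            created.append(unversioned)
    return created
```

Running this once against the cudnn wheel's lib/ directory before the build step would cover the missing -lcudnn case the comment describes.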
