Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
for more information, see https://pre-commit.ci
Greptile Summary
This PR refactors the GitHub CI build pipeline to use pre-built minimal containers.
Key changes:
Confidence Score: 3/5
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[GitHub Actions Trigger] --> B{Job}
B --> C[core job - ubuntu-latest runner]
B --> D[pytorch job]
B --> E[jax job]
B --> F[all job]
C --> C1[apt install ccache + cuda libs]
C1 --> C2[actions/cache for /root/.ccache]
C2 --> C3[pip install TE - NVTE_FRAMEWORK=none]
C3 --> C4[Sanity check]
D --> D1[Container: te_gha_pytorch - Dockerfile.pytorch]
D1 --> D2[actions/cache for /root/.ccache]
D2 --> D3[Set CUDA_PATH and CUDNN_PATH from nvidia wheels]
D3 --> D4[pip install TE - NVTE_FRAMEWORK=pytorch - MAX_JOBS=2]
D4 --> D5[Sanity check]
E --> E1[Container: ghcr.io/nvidia/jax:jax]
E1 --> E2[apt install ccache]
E2 --> E3[actions/cache for /root/.ccache]
E3 --> E4[pip install TE - NVTE_FRAMEWORK=jax]
E4 --> E5[Sanity check]
F --> F1[Container: te_gha_all - Dockerfile.all]
F1 --> F2[actions/cache for /root/.ccache]
F2 --> F3[Set CUDA_PATH and CUDNN_PATH from nvidia wheels]
F3 --> F4[pip install TE - NVTE_FRAMEWORK=all - MAX_JOBS=2]
F4 --> F5[Sanity check PyTorch]
F5 --> F6[Sanity check JAX]
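The "Set CUDA_PATH and CUDNN_PATH from nvidia wheels" steps in the flowchart can be illustrated with a minimal sketch. The workflow steps themselves are not shown here, so this is an assumption about the layout rather than the PR's actual code: the `cu13` and `cudnn` subdirectory names are taken from the Dockerfile snippets quoted in the review comments, and `wheel_cuda_env` is a hypothetical helper name.

```python
from pathlib import Path
from typing import Dict


def wheel_cuda_env(nvidia_root: str, cuda_subdir: str = "cu13") -> Dict[str, str]:
    """Build CUDA_PATH / CUDNN_PATH values from the root of the pip-installed
    `nvidia` namespace package (hypothetical helper, assumed layout)."""
    root = Path(nvidia_root)
    return {
        # Assumed layout: <site-packages>/nvidia/cu13 holds the CUDA toolkit wheels.
        "CUDA_PATH": str(root / cuda_subdir),
        # Assumed layout: <site-packages>/nvidia/cudnn holds the cuDNN wheel.
        "CUDNN_PATH": str(root / "cudnn"),
    }
```

In a real environment the root would likely come from `list(nvidia.__path__)[0]`, as the Dockerfile snippet does, and the resulting values would be exported before running `pip install`.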
    except:
        pass
Bare except: swallows all exceptions
The bare except: clause silently catches everything including KeyboardInterrupt, SystemExit, and MemoryError, making debugging harder if the failure isn't PackageNotFoundError. It should be narrowed to the expected exception type:
Suggested change:
-    except:
-        pass
+    except importlib.metadata.PackageNotFoundError:
+        pass
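As a self-contained illustration of the narrowed handler, here is a minimal sketch. The surrounding code in the PR is not shown, so `installed_version` is a hypothetical helper, not the function under review; only `importlib.metadata.version` and `PackageNotFoundError` are real standard-library names.

```python
import importlib.metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed.

    Hypothetical helper for illustration only.
    """
    try:
        return importlib.metadata.version(package)
    except importlib.metadata.PackageNotFoundError:
        # Only the expected "not installed" case is handled here.
        # KeyboardInterrupt, SystemExit, MemoryError, etc. still propagate.
        return None
```

With this shape, an unexpected failure inside the metadata lookup surfaces as a traceback instead of being silently discarded.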
build_tools/utils.py
Outdated
    if not force_wheels and cuda_toolkit_include_path() is not None:
        return []

    # Use pip wheels to include all headers.
Copy-paste comment refers to "headers" instead of "libraries"
This comment was copied from get_cuda_include_dirs and still says "headers" instead of "libraries":
Suggested change:
-    # Use pip wheels to include all headers.
+    # Use pip wheels to find all library directories.
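To make the "library directories" wording concrete, here is a rough sketch of what collecting library directories from pip wheels could look like. This is not the implementation in build_tools/utils.py, which is not shown here; `wheel_library_dirs` is a hypothetical name, and the assumption is that each wheel under the `nvidia` package keeps its shared objects in a directory named `lib`.

```python
from pathlib import Path
from typing import List


def wheel_library_dirs(nvidia_root: str) -> List[str]:
    """Collect every directory named `lib` below the pip-installed `nvidia`
    package root (hypothetical helper, assumed wheel layout)."""
    root = Path(nvidia_root)
    if not root.is_dir():
        # No wheels installed at this location: nothing to add to the link path.
        return []
    # rglob("lib") matches entries named exactly "lib" at any depth.
    return sorted(str(p) for p in root.rglob("lib") if p.is_dir())
```

The header counterpart would search for `include` directories instead, which is why the copied comment was misleading.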
# Create symlinks for CUDA libraries
RUN CUDA_PATH=$(python3 -c "import nvidia; print(list(nvidia.__path__)[0] + '/cu13')") && \
    ln -s $CUDA_PATH/lib/libcudart.so.13 $CUDA_PATH/lib/libcudart.so && \
    ln -s $CUDA_PATH/lib/libcublas.so.13 $CUDA_PATH/lib/libcublas.so && \
    ln -s $CUDA_PATH/../nccl/lib/libnccl.so.2 $CUDA_PATH/../nccl/lib/libnccl.so && \
    ln -s $CUDA_PATH/lib $CUDA_PATH/lib64
Missing libcudnn.so unversioned symlink
The symlink block creates unversioned aliases for libcudart, libcublas, and libnccl (needed for -lcudart, -lcublas, -lnccl at link time), but no equivalent symlink for libcudnn.so. The logging.h header unconditionally includes <cudnn.h>, and the build links against cudnn. If the cudnn pip wheel (typically pulled in as a torch dependency) only ships a versioned filename such as libcudnn.so.9, the linker will fail to resolve -lcudnn without an unversioned symlink.
Consider adding a similar symlink for cudnn, e.g.:
CUDNN_LIB=$(python3 -c "import nvidia; print(list(nvidia.__path__)[0] + '/cudnn/lib')") && \
    ln -s $CUDNN_LIB/libcudnn.so.9 $CUDNN_LIB/libcudnn.so
The same applies to Dockerfile.all.
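An alternative to listing each library by hand would be to generate the unversioned links generically. The following is only a sketch of that idea, not something proposed in the PR: `ensure_unversioned_symlinks` is a hypothetical helper, and it assumes the wheels ship files named like `libfoo.so.N` with a single numeric suffix.

```python
import os
import re
from pathlib import Path
from typing import List

# Matches names such as "libcudnn.so.9" and captures "libcudnn.so".
_VERSIONED_SO = re.compile(r"^(lib.+\.so)\.\d+$")


def ensure_unversioned_symlinks(lib_dir: str) -> List[str]:
    """Create `libfoo.so -> libfoo.so.N` links where the unversioned name is missing.

    Returns the names of the links that were created (hypothetical helper).
    """
    created = []
    # sorted() snapshots the listing, so links created below are not revisited.
    for entry in sorted(Path(lib_dir).iterdir()):
        match = _VERSIONED_SO.match(entry.name)
        if match is None:
            continue
        link = entry.with_name(match.group(1))
        if link.exists() or link.is_symlink():
            # Leave existing files and links untouched.
            continue
        # Relative target keeps the link valid if the directory is moved.
        os.symlink(entry.name, link)
        created.append(link.name)
    return created
```

If several versions of the same library are present, this sketch links the first one in sorted order, so an explicit `ln -s` as in the Dockerfile remains the more predictable choice when the exact version matters.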
Description
Run the build in a minimal container and use pip-wheel-based CUDA to minimize disk usage.