
Commit ccd190b (2 parents: b4de28c + c959e97)
Merge pull request #672 from casparvl/update_gpu_support_for_202506
Update documentation to account for changes in GPU support in 2025.06

1 file changed: `docs/site_specific_config/gpu.md` (89 additions, 78 deletions)
…can use the GPU in your system is available below.

EESSI supports running CUDA-enabled software. All CUDA-enabled modules are marked with the `(gpu)` feature,
which is visible in the output produced by `module avail`.

### Configuring runtime support {: #nvidia_drivers }

For CUDA-enabled software to run, it needs to be able to find the **NVIDIA GPU drivers** of the host system.
The challenge here is that the NVIDIA GPU drivers are not _always_ in a standard system location, and that we
can not install the GPU drivers in EESSI (since they are too closely tied to the client OS and GPU hardware).

#### Enabling runtime support for a native EESSI installation (using the helper script) {: #nvidia_eessi_native }

To get runtime support, we need to ensure that the EESSI runtime linker can find the drivers. To do this, we symlink the drivers
in a predictable location that is searched by the EESSI runtime linker.

*Step 1:* [initialize a version of EESSI](../using_eessi/setting_up_environment.md).

*Step 2 (EESSI 2025.06 and newer, mandatory):* define the `EESSI_NVIDIA_OVERRIDE_DEFAULT` variable in your local CernVM-FS configuration to point to a directory where you want
to store the symlinks to the drivers. For example, to store these under `/opt/eessi/nvidia`, one would run:

```{ .bash .copy }
sudo bash -c "echo 'EESSI_NVIDIA_OVERRIDE_DEFAULT=/opt/eessi/nvidia' >> /etc/cvmfs/default.local"
```

*Step 2 (EESSI 2023.06, optional):* change the location in which the symlinks will end up by configuring `EESSI_HOST_INJECTIONS` explicitly (default: `/opt/eessi`):

```{ .bash .copy }
sudo bash -c "echo 'EESSI_HOST_INJECTIONS=/desired/path/to/host/injections' >> /etc/cvmfs/default.local"
```

*Step 3:* run the helper script:

```{ .bash .copy }
/cvmfs/software.eessi.io/versions/${EESSI_VERSION}/scripts/gpu_support/nvidia/link_nvidia_host_libraries.sh
```

!!! tip "Rerun script after each driver update"
    You should re-run this script every time you update the NVIDIA GPU drivers on the host system, as it may expose libraries that are new to your driver version.
    Note that it is safe to re-run the script even if no driver updates were done: the script should detect that the current version of the drivers has already been symlinked.

!!! tip "Maintaining different driver versions for each EESSI version"
    The standard approach for EESSI >= 2025.06 means that the drivers may be found by any EESSI version. If you prefer to create one set of symlinks per EESSI
    version, instead of defining a single location through `EESSI_NVIDIA_OVERRIDE_DEFAULT`, you can define one per EESSI version by setting `EESSI_<VERSION>_NVIDIA_OVERRIDE`.
    For example:
    ```{ .bash .copy }
    sudo bash -c "echo 'EESSI_202506_NVIDIA_OVERRIDE=/opt/eessi/2025.06/nvidia' >> /etc/cvmfs/default.local"
    ```

!!! note "How does EESSI find the linked drivers?"
    The runtime linker provided by the EESSI [compatibility layer](../compatibility_layer.md) is configured to search an
    additional directory (run `ld.so --help | grep -A 10 "Shared library search path"` after initializing EESSI).
    For `EESSI/2025.06` and later, that is `/cvmfs/software.eessi.io/versions/<EESSI_VERSION>/compat/<OS>/<ARCH>/lib/nvidia`.
    This directory is special, since it is a CernVM-FS [Variant Symlink](https://cvmfs.readthedocs.io/en/stable/cpt-repo.html#variant-symlinks).
    The target of this symlink is what you configure in your local CernVM-FS configuration.

#### Enabling runtime support for a native EESSI installation (using manual symlinking)

If, for some reason, the helper script is unable to locate the drivers on your system, you _can_ link them manually.
To do so, grab the list of libraries that need to be symlinked from [here](https://raw.githubusercontent.com/apptainer/apptainer/main/etc/nvliblist.conf).
Then, change to the correct directory:

- For EESSI 2025.06 and later: `/cvmfs/software.eessi.io/versions/${EESSI_VERSION}/compat/${EESSI_OS_TYPE}/${EESSI_CPU_FAMILY}/lib/nvidia`
- For EESSI 2023.06: `/cvmfs/software.eessi.io/host_injections/${EESSI_VERSION}/compat/${EESSI_OS_TYPE}/${EESSI_CPU_FAMILY}/lib`

Finally, for each library in the list that exists on your host system, create a symlink to it in this directory.
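The manual procedure can be sketched as follows. This is a minimal illustration against a throwaway directory layout: all paths and library names below are stand-ins for the real host driver libraries and the EESSI directory listed above.

```{ .bash .copy }
# Sketch: symlink each library from a list into a target directory,
# skipping comments, blank lines, and libraries absent from the "host".
set -eu

demo=$(mktemp -d)                    # stand-in for the real locations
mkdir -p "$demo/host" "$demo/target"
touch "$demo/host/libcuda.so" "$demo/host/libnvidia-ml.so"

# stand-in for nvliblist.conf: one library name per line
cat > "$demo/nvliblist.conf" <<'EOF'
# libraries to expose to EESSI
libcuda.so
libnvidia-ml.so
libnotondisk.so
EOF

cd "$demo/target"
while IFS= read -r lib; do
  case "$lib" in '#'*|'') continue ;; esac
  if [ -e "$demo/host/$lib" ]; then  # only link what actually exists
    ln -sf "$demo/host/$lib" "$lib"
  fi
done < "$demo/nvliblist.conf"

ls                                   # libcuda.so  libnvidia-ml.so
```

In a real run, you would resolve each library name via `ldconfig -p` (or locate it under your driver installation directory) instead of the `$demo/host` stand-in, and create the symlinks in the EESSI directory for your version.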

#### Runtime support when using EESSI in a container {: #nvidia_eessi_container }

If you are running your own [Apptainer](https://apptainer.org/)/[Singularity](https://sylabs.io/singularity) container,
it is sufficient to use the [`--nv` option](https://apptainer.org/docs/user/latest/gpu.html#nvidia-gpus-cuda-standard)
to enable access to GPUs from within the container. This will ensure the container runtime exposes the drivers through
`$LD_LIBRARY_PATH`.

If you are using the [EESSI container](../getting_access/eessi_container.md) to access the EESSI software,
simply pass `--nvidia run` or `--nvidia all` to enable NVIDIA GPU runtime support.
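As a concrete illustration of the two options (the image name is a placeholder, and the EESSI container is assumed to be driven via its `eessi_container.sh` script):

```{ .bash .copy }
# your own container: let Apptainer expose the host GPU drivers
apptainer shell --nv my_container.sif

# the EESSI container script: enable NVIDIA GPU runtime support
./eessi_container.sh --nvidia all
```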

### Configuring compile-time support {: #cuda_sdk }

To compile new CUDA software using dependencies from EESSI, additional configuration is needed.

The [CUDA license](https://docs.nvidia.com/cuda/eula/index.html) and [cuDNN license](https://docs.nvidia.com/deeplearning/cudnn/latest/reference/eula.html)
only allow redistribution of their _runtime_ libraries. Thus, the installations of CUDA and cuDNN that come with EESSI have been stripped down to contain
only the runtime libraries. A local installation of CUDA and cuDNN is required to compile new software.

!!! note "A full CUDA SDK or cuDNN SDK is only needed to *compile* CUDA or cuDNN software"
    Without a full CUDA SDK or cuDNN SDK on the host system, you will still
    be able to *run* CUDA-enabled or cuDNN-enabled software from the EESSI stack
    (provided the required configuration for runtime support was done, see above);
    you just won't be able to *compile* additional CUDA or cuDNN software.

First, [initialize a version of EESSI](../using_eessi/setting_up_environment.md).

Second, (optionally) define the `EESSI_HOST_INJECTIONS` variable in your local CernVM-FS configuration to point to a directory where you want to
store the local installations of CUDA and cuDNN (the default location is `/opt/eessi`):

```{ .bash .copy }
sudo bash -c "echo 'EESSI_HOST_INJECTIONS=/my/custom/prefix' >> /etc/cvmfs/default.local"
```

Third, run the helper script to install the CUDA and cuDNN versions that are used _in that version of EESSI_:

```{ .bash .copy }
/cvmfs/software.eessi.io/versions/${EESSI_VERSION}/scripts/gpu_support/nvidia/install_cuda_and_libraries.sh
```

Note that this script uses EasyBuild in order to install CUDA and cuDNN, and EasyBuild does not allow running as root by default.
The recommended approach is to change ownership of the `host_injections` directory to a non-root user, and perform the installation with
that user. Alternatively (but not recommended), you can override EasyBuild's behaviour and install as root by setting
`export EASYBUILD_ALLOW_USE_AS_ROOT_AND_ACCEPT_CONSEQUENCES=1` before running `install_cuda_and_libraries.sh`.
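For example, the recommended non-root approach could look like the following, assuming the default `host_injections` target of `/opt/eessi` and a hypothetical non-root user `eessi-install` (both are placeholders for your own setup):

```{ .bash .copy }
# one-time: hand the host_injections target directory to a non-root user
sudo mkdir -p /opt/eessi
sudo chown -R eessi-install /opt/eessi

# then, logged in as that user with EESSI initialized, run:
/cvmfs/software.eessi.io/versions/${EESSI_VERSION}/scripts/gpu_support/nvidia/install_cuda_and_libraries.sh
```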

The script searches `/cvmfs/software.eessi.io/versions/${EESSI_VERSION}/scripts/gpu_support/nvidia/easystacks` for any file
named `eessi-*CUDA*.yml`, and installs all CUDA and cuDNN versions defined in those files.
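The file selection amounts to a plain shell glob; a throwaway demonstration (the non-matching filename below is purely illustrative):

```{ .bash .copy }
# scratch directory with one matching and one non-matching easystack file
demo_dir=$(mktemp -d)
touch "$demo_dir/eessi-2023.06-eb-4.9.4-2023a-CUDA-host-injections.yml" \
      "$demo_dir/eessi-2023.06-eb-4.9.4-2023a-foss.yml"

# only files matching eessi-*CUDA*.yml are picked up
for f in "$demo_dir"/eessi-*CUDA*.yml; do
  basename "$f"   # eessi-2023.06-eb-4.9.4-2023a-CUDA-host-injections.yml
done
```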

Thus, you may want to periodically re-run this script to pick up new CUDA and cuDNN versions that get added to EESSI over time.

!!! note "How does EESSI find the local installation of CUDA/cuDNN?"
    The non-redistributable components of CUDA/cuDNN in EESSI have been replaced by symlinks that point to a specific directory
    in the `/cvmfs/software.eessi.io/host_injections` prefix. For example, the `nvcc` compiler can not be redistributed, so it
    is replaced in EESSI with a symlink:
    ```
    $ ls -l /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen3/software/CUDA/12.1.1/bin/nvcc
    lrwxrwxrwx 1 cvmfs cvmfs 109 Dec 21 14:49 /cvmfs/software.eessi.io/versions/2023.06/software/linux/x86_64/amd/zen3/software/CUDA/12.1.1/bin/nvcc -> /cvmfs/software.eessi.io/host_injections/2023.06/software/linux/x86_64/amd/zen3/software/CUDA/12.1.1/bin/nvcc
    ```
    The `/cvmfs/software.eessi.io/host_injections` directory is special, since it is not part of the actual EESSI repository:
    it is a CernVM-FS [Variant Symlink](https://cvmfs.readthedocs.io/en/stable/cpt-repo.html#variant-symlinks) that points to
    a directory on the local system (`/opt/eessi` by default).
    The `install_cuda_and_libraries.sh` script installs CUDA and cuDNN in this local directory, thus un-breaking the symlinks.
    This means that from an end-user point of view, the EESSI CUDA module now 'just works', all while adhering to the EULA
    (e.g. not redistributing the compiler through EESSI itself).

### Testing the GPU support {: #gpu_cuda_testing }

The quickest way to test if software installations included in EESSI can access and use your GPU is to run the
`deviceQuery` executable that is part of the `CUDA-Samples` module:
```
module load CUDA-Samples
