diff --git a/GPU/documentation/README.md b/GPU/documentation/README.md index e69de29bb2d1d..de888ab6e2436 100644 --- a/GPU/documentation/README.md +++ b/GPU/documentation/README.md @@ -0,0 +1,13 @@ +[build-O2.md](https://github.com/AliceO2Group/AliceO2/blob/dev/GPU/documentation/build-O2.md) : +- Instructions on how to build O2 with GPU support. +- Description of the CMake variables used. + +[build-standalone.md](https://github.com/AliceO2Group/AliceO2/blob/dev/GPU/documentation/build-standalone.md) : +- Instructions on how to build and run the standalone benchmark. +- Instructions on how to extract data sets for the standalone benchmark from real data or using simulation. + +[deterministic-mode.md](https://github.com/AliceO2Group/AliceO2/blob/dev/GPU/documentation/deterministic-mode.md) : +- Instructions on how to use the deterministic mode for both the standalone benchmark and O2. + +[run-time-compilation.md](https://github.com/AliceO2Group/AliceO2/blob/dev/GPU/documentation/run-time-compilation.md) : +- Instructions on how to use run time compilation (RTC) for the GPU code. diff --git a/GPU/documentation/build-O2.md b/GPU/documentation/build-O2.md index 809d1fe0d5439..098629f45a832 100644 --- a/GPU/documentation/build-O2.md +++ b/GPU/documentation/build-O2.md @@ -12,17 +12,17 @@ If you just want to reproduce the GPU build locally without running it, it might The provisioning script of the container also demonstrates which patches need to be applied such that everything works correctly. *GPU Tracking with CUDA* - * The CMake option -DENABLE_CUDA=ON/OFF/AUTO steers whether CUDA is forced enabled / unconditionally disabled / auto-detected. - * The CMake option -DCUDA_COMPUTETARGET= fixes a GPU target, e.g. 61 for PASCAL or 75 for Turing (if unset, it compiles for the lowest supported architecture) + * The CMake option `-DENABLE_CUDA=ON/OFF/AUTO` steers whether CUDA is force-enabled / unconditionally disabled / auto-detected. + * The CMake option `-DCUDA_COMPUTETARGET=...` fixes a GPU target, e.g. 61 for Pascal or 75 for Turing (if unset, it compiles for the lowest supported architecture). * CUDA is detected via the CMake language feature, so essentially nvcc must be in the Path. - * We require CUDA version >= 11.2 + * We require CUDA version >= 12.8 * CMake will report "Building GPUTracking with CUDA support" when enabled. *GPU Tracking with HIP* * HIP and HCC must be installed, and CMake must be able to detect HIP via find_package(hip). - * If HIP and HCC are not installed to /opt/rocm, the environment variables $HIP_PATH and $HCC_HOME must point to the installation directories. + * If HIP and HCC are not installed to /opt/rocm, the environment variables `$HIP_PATH` and `$HCC_HOME` must point to the installation directories. * HIP from ROCm >= 4.0 is required. - * The CMake option -DHIP_AMDGPUTARGET= forces a GPU target, e.g. gfx906 for Radeon VII (if unset, it auto-detects the GPU). + * The CMake option `-DHIP_AMDGPUTARGET=...` forces a GPU target, e.g. gfx906 for Radeon VII (if unset, it auto-detects the GPU). * CMake will report "Building GPUTracking with HIP support" when enabled. * It may be that some patches must be applied to ROCm after the installation. You find the details in the provisioning script of the GPU CI container below. @@ -49,14 +49,14 @@ The provisioning script of the container also demonstrates which patches need to * The docker images is `alisw/slc8-gpu-builder`. * The container exports the `ALIBUILD_O2_FORCE_GPU` env variable, which force-enables all GPU builds.
* Note that it might not be possible out-of-the-box to run the GPU version from within the container. In case of HIP it should work when you forwards the necessary GPU devices in the container. For CUDA however, you would either need to (in addition to device forwarding) match the system CUDA driver and toolkit installation to the files present in the container, or you need to use the CUDA docker runtime, which is currently not installed in the container. - * There are currently some patches needed to install all the GPU backends in a proper way and together. Please refer to the container provisioning script https://github.com/alisw/docks/blob/master/slc9-gpu-builder/provision.sh. If you want to reproduce the installation locally, it is recommended to follow the steps from the script. + * There are currently some patches needed to install all the GPU backends properly and together. Please refer to the container provisioning script [provision.sh](https://github.com/alisw/docks/blob/master/slc9-gpu-builder/provision.sh). If you want to reproduce the installation locally, it is recommended to follow the steps from the script. *Summary* If you want to enforce the GPU builds on a system without GPU, please set the following CMake settings: - * ENABLE_CUDA=ON - * ENABLE_HIP=ON - * ENABLE_OPENCL=ON - * HIP_AMDGPUTARGET=gfx906;gfx908 - * CUDA_COMPUTETARGET=86 89 -Alternatively you can set the environment variables ALIBUILD_ENABLE_CUDA and ALIBUILD_ENABLE_HIP to enforce building CUDA or HIP without modifying the alidist scripts. + * `ENABLE_CUDA=ON` + * `ENABLE_HIP=ON` + * `ENABLE_OPENCL=ON` + * `HIP_AMDGPUTARGET=default` + * `CUDA_COMPUTETARGET=default` +Alternatively, you can set the environment variables `ALIBUILD_ENABLE_CUDA=1` and `ALIBUILD_ENABLE_HIP=1` to enforce building CUDA or HIP without modifying the alidist scripts. diff --git a/GPU/documentation/build-standalone.md b/GPU/documentation/build-standalone.md index d4e9da5cd5bf3..891d16b4dc2c4 100644 --- a/GPU/documentation/build-standalone.md +++ b/GPU/documentation/build-standalone.md @@ -30,7 +30,7 @@ nano config.cmake # edit config file to enable / disable dependencies as needed. make install -j32 ``` -You can edit certain build settings in `config.cmake`. Some of them are identical to the GPU build settings for O2, as described in O2-786. +You can edit certain build settings in `config.cmake`. Some of them are identical to the GPU build settings for O2, as described in [build-O2.md](https://github.com/AliceO2Group/AliceO2/blob/dev/GPU/documentation/build-O2.md). And there are plenty of additional settings to enable/disable event display, qa, usage of ROOT, FMT, etc. libraries. This will create the `ca` binary in `~/standalone`, which is basically the same as the `o2-gpu-standalone-benchmark`, but built outside of O2. @@ -68,7 +68,7 @@ This will dump the event data to the local folder, all dumped files have a `.dump` extension. Data can be dumped from raw data, or from MC data, e.g. generated by the Full System Test. In case of MC data, also MC labels are dumped, such that they are used in the `./ca --qa` mode. -To get a dump from simulated data, please run e.g. the FST simulation as described in O2-2633. +To get a dump from simulated data, please run e.g. the FST simulation as described in [full-system-test-setup.md](https://github.com/AliceO2Group/AliceO2/blob/dev/prodtests/full-system-test/documentation/full-system-test-setup.md).
A simple run as ``` DISABLE_PROCESSING=1 NEvents=5 NEventsQED=100 SHMSIZE=16000000000 $O2_ROOT/prodtests/full_system_test.sh diff --git a/GPU/documentation/deterministic-mode.md b/GPU/documentation/deterministic-mode.md new file mode 100644 index 0000000000000..9c8db2930ceaa --- /dev/null +++ b/GPU/documentation/deterministic-mode.md @@ -0,0 +1,31 @@ +The TPC tracking code is not fully deterministic, i.e. running multiple times on the same data set might yield a slightly different number of tracks at the O(per mille) level. +- This comes from concurrency, i.e. when tracks are processed in parallel, the output order might change, which might have small effects on the subsequent steps. +- Also compile options and optimizations play a role, e.g. using `-ffast-math` or fused multiply-add might slightly change floating-point rounding, and in rare cases lead to the acceptance or rejection of a track, and thus a different number of tracks. + +For debugging, testing, and validation, a deterministic mode is implemented, which should yield 100% reproducible results, on CPU and on GPU and when running multiple times. +It uses a combination of: +- Compile time options, e.g. disabling all optimizations that change floating-point rounding. +- Run time options, e.g. to use deterministic sorting, and to add additional sorting steps after kernels to make the output, including intermediate outputs, deterministic. + +This is steered by 3 options: +- The `-DGPUCA_DETERMINISTIC_MODE` CMake setting : Compile-time setting. +- The `--PROCdeterministicGPUReconstruction` command line option / `GPU_proc.deterministicGPUReconstruction` `--configKeyValue` setting : Run time setting. +- The `--RTCdeterministic` command line option / `GPU_proc_rtc.deterministic` `--configKeyValue` setting. (Auto-enabled by the `deterministicGPUReconstruction` setting.) : Compile-time setting for RTC code. + +In order to be fully deterministic, all settings must be enabled, where the RTC setting is automatically enabled if not explicitly disabled. + +`GPUCA_DETERMINISTIC_MODE` has multiple levels, which are described here: [FindO2GPU.cmake](https://github.com/AliceO2Group/AliceO2/blob/80a80a17f5a1d9cb77743e2a39b15b653fe1a4f9/dependencies/FindO2GPU.cmake#L72). +- In order to have fully deterministic GPUReconstruction (i.e. all algorithms that come with the GPUTracking library, like TPC tracking), the level `GPUCA_DETERMINISTIC_MODE=GPU` is needed. +- In order to apply it to all of O2, e.g. for ITS tracking, please use `GPUCA_DETERMINISTIC_MODE=WHOLEO2`. + +Enabling the options is a bit different for O2 and for the standalone benchmark: +- For enabling it in the standalone benchmark, please set `GPUCA_DETERMINISTIC_MODE=GPU` in [config.cmake](https://github.com/AliceO2Group/AliceO2/blob/dev/GPU/GPUTracking/Standalone/cmake/config.cmake) and use the command line argument `--PROCdeterministicGPUReconstruction 1` (see the sketch below). +- For O2, either add `set(GPUCA_DETERMINISTIC_MODE GPU)` to the beginning of the [GPU CMakeLists.txt](https://github.com/AliceO2Group/AliceO2/blob/dev/GPU/CMakeLists.txt) or add `set(GPUCA_DETERMINISTIC_MODE WHOLEO2)` to the beginning of the [Global CMakeLists.txt](https://github.com/AliceO2Group/AliceO2/blob/dev/CMakeLists.txt), and use the `configKeyValue` `GPU_proc.deterministicGPUReconstruction`.
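+
+A minimal sketch of the standalone case, assuming the build flow from [build-standalone.md](https://github.com/AliceO2Group/AliceO2/blob/dev/GPU/documentation/build-standalone.md); `<dataset/backend options>` is a placeholder for your usual `./ca` arguments selecting the dumped data set and the compute backend, which are not spelled out here:
+```
+# In the standalone build directory: set the compile-time switch in config.cmake
+# (same set() syntax as for the O2 CMakeLists.txt), then rebuild.
+echo 'set(GPUCA_DETERMINISTIC_MODE GPU)' >> config.cmake   # or edit config.cmake by hand
+make install -j32
+# Run with the run-time switch enabled.
+./ca <dataset/backend options> --PROCdeterministicGPUReconstruction 1
+```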
In order to enable this for the Full-System-Test or with [dpl-workflow.sh](https://github.com/AliceO2Group/AliceO2/blob/dev/prodtests/full-system-test/dpl-workflow.sh), please export `CONFIG_EXTRA_PROCESS_o2_gpu_reco_workflow=GPU_proc.deterministicGPUReconstruction=1;`. + +With these settings, if one runs multiple times, the number of clusters and the number of tracks should always be fully identical. +Note that this yields a significant performance penalty during processing; therefore the deterministic mode is not compiled in by default, but must be enabled explicitly and the code recompiled. + +Beyond comparing only the number of clusters and number of tracks, it is also possible to compare intermediate results. To do so, please use the standalone benchmark (either the `./ca` or the `o2-gpu-standalone-benchmark` binary) with the `--debug 6` option. +It will create a dump containing all (most) intermediate results in text form, which can be compared. The output file is called `CPU.out` when using the CPU backend, and `GPU.out` for the GPU backend. +Note that the dump files will be huge and the processing will be slow and consume much more memory than normal with `--debug 6`. It has been tested with datasets containing up to 50 Pb-Pb collisions, and might fail for larger data. +If the deterministic mode is used with both compile-time and run-time activation, the dump files should be 100% identical and can simply be compared with `diff`. diff --git a/GPU/documentation/run-time-compilation.md b/GPU/documentation/run-time-compilation.md new file mode 100644 index 0000000000000..accfceb47b870 --- /dev/null +++ b/GPU/documentation/run-time-compilation.md @@ -0,0 +1,21 @@ +Run time compilation is a feature of the GPUReconstruction library, which can recompile the GPU code for HIP and for CUDA at runtime, and apply some optimizations and changes. It is planned to add support for CPU code and OpenCL code in the future. + +The changes that can be applied are: +- `constexpr` optimization: configuration values that are constant during the processing are replaced by `constexpr` expressions, which allows the compiler to optimize the code better. Benchmarks in 2024 have shown a 5% performance improvement with CUDA and a 2% improvement with HIP. +- Disabling of unused code: currently this is used to remove the TPC code for the V/M shape correction during online processing, which simplifies the code and yields better compiler optimization, for a 20%-30% speedup on the MI50 GPUs. +- Use different GPU constant parameters / launch bounds: These are tuning parameters, which are architecture-dependent. The default values are taken from the first architecture the GPU code is compiled for in the normal compilation phase. If the architecture we are running on differs, different parameters can be loaded for RTC. +- Compiling for different target architectures. This allows running on hardware for which the code was not compiled in the original compilation. + +Generally, RTC is enabled via the `--RTCenable` flag for the standalone benchmark, or via the `GPU_proc_rtc.enable=1` `configKeyValue` for O2. +For a list of RTC options, please see [GPUSettingsList.h](https://github.com/AliceO2Group/AliceO2/blob/80a80a17f5a1d9cb77743e2a39b15b653fe1a4f9/GPU/GPUTracking/Definitions/GPUSettingsList.h#L215). + +Caching the output: +- The RTC output can be cached and reused, so that when running multiple times, compilation is not repeated. This is enabled via the `--RTCcacheOutput` setting (see the example below).
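+
+For example, a standalone invocation with RTC and output caching could look like the following sketch (again, `<dataset/backend options>` is a placeholder for your usual `./ca` arguments; the `1` value syntax for boolean options follows the other command line examples in this documentation):
+```
+# Enable run-time compilation and cache the compiled RTC output,
+# so that subsequent runs with unchanged code skip the recompilation.
+./ca <dataset/backend options> --RTCenable 1 --RTCcacheOutput 1
+```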
The folder to store the cache files can be selected via `--RTCTECHcacheFolder`, and with `--RTCTECHcacheMutex` (default: enabled), a file-lock mutex can be used to synchronize access to the cache folder. The cached code is checked against the to-be-compiled source code with SHA1 hashes, and only if the code has not changed is the cache used; otherwise the code is recompiled and the cache updated. It is possible to force using outdated cache files via the `--RTCTECHignoreCacheValid` option. + +For changing the launch bounds and other parameters, please consider `--RTCTECHloadLaunchBoundsFromFile` (and `--RTCTECHprintLaunchBounds`), which can load a parameter set that can be created via [dumpGPUDefParam.C](https://github.com/AliceO2Group/AliceO2/blob/dev/GPU/GPUTracking/Standalone/tools/dumpGPUDefParam.C). A set of default parameters is stored in `[INSTALL_FOLDER]/share/GPU`. + +It is possible to select a different target architecture for the compilation via `--RTCTECHoverrideArchitecture`, and the compilation can be prefixed with a command via `--RTCTECHprependCommand`, e.g. for CPU pinning. See for example [dpl-workflow.sh](https://github.com/AliceO2Group/AliceO2/blob/80a80a17f5a1d9cb77743e2a39b15b653fe1a4f9/prodtests/full-system-test/dpl-workflow.sh#L335). + +`--RTCdeterministic` enables the [Deterministic Mode](https://github.com/AliceO2Group/AliceO2/blob/dev/GPU/documentation/deterministic-mode.md) (compile-time setting) for RTC. Usually you don't need to bother, as for the deterministic mode it is auto-enabled by `--PROCdeterministicGPUReconstruction`, but the explicit `--RTCdeterministic` is available for tests. + +Finally, `--RTCoptConstexpr` and `--RTCoptSpecialCode` enable the constexpr and code removal optimizations. For an example of how the TPC V/M shape corrections are removed, see [TPCFastTransform.h](https://github.com/AliceO2Group/AliceO2/blob/fc3ace17eca580c338751163ef4528e3ec47f9d6/GPU/TPCFastTransformation/TPCFastTransform.h#L445). diff --git a/prodtests/full-system-test/documentation/README.md b/prodtests/full-system-test/documentation/README.md new file mode 100644 index 0000000000000..1fdef1da36ecd --- /dev/null +++ b/prodtests/full-system-test/documentation/README.md @@ -0,0 +1,17 @@ +[full-system-test.md](https://github.com/AliceO2Group/AliceO2/blob/dev/prodtests/full-system-test/documentation/full-system-test.md) : +- Full system test quick start guide + +[full-system-test-setup.md](https://github.com/AliceO2Group/AliceO2/blob/dev/prodtests/full-system-test/documentation/full-system-test-setup.md) : +- More detailed description of the full-system-test scripts, the simulation of the data set, and the script to run the workflow + +[full-system-test-as-stress-test.md](https://github.com/AliceO2Group/AliceO2/blob/dev/prodtests/full-system-test/documentation/full-system-test-as-stress-test.md) : +- Details on how to use the full system test as a stress test and for validation of an EPN online compute node + +[dpl-workflow-options.md](https://github.com/AliceO2Group/AliceO2/blob/dev/prodtests/full-system-test/documentation/dpl-workflow-options.md) : +- Description of the main workflow script [dpl-workflow.sh](https://github.com/AliceO2Group/AliceO2/blob/dev/prodtests/full-system-test/dpl-workflow.sh) and its options.
+ +[env-variables.md](https://github.com/AliceO2Group/AliceO2/blob/dev/prodtests/full-system-test/documentation/env-variables.md) : +- List of common environment variables used by the workflow scripts (defaults set by https://github.com/davidrohr/O2DPG/blob/master/DATA/common/setenv.sh) + +[raw-tf-conversion.md](https://github.com/AliceO2Group/AliceO2/blob/dev/prodtests/full-system-test/documentation/raw-tf-conversion.md) : +- This is now automated in a script, but, just in case, this describes how readout files are converted to a .tf file for use in the full system test with replay from DataDistribution. diff --git a/prodtests/full-system-test/documentation/env-variables.md b/prodtests/full-system-test/documentation/env-variables.md index b93622c0a0f94..5a13f2ee9e19d 100644 --- a/prodtests/full-system-test/documentation/env-variables.md +++ b/prodtests/full-system-test/documentation/env-variables.md @@ -1,4 +1,4 @@ -The `setenv-sh` script sets the following environment options +The [setenv.sh](https://github.com/davidrohr/O2DPG/blob/master/DATA/common/setenv.sh) script sets the following environment options: * `NTIMEFRAMES`: Number of time frames to process. * `TFDELAY`: Delay in seconds between publishing time frames (1 / rate). * `NGPUS`: Number of GPUs to use, data distributed round-robin. @@ -25,7 +25,7 @@ The `setenv-sh` script sets the following environment options * `EXTINPUT`: Receive input from raw FMQ channel instead of running o2-raw-file-reader. * 0: `dpl-workflow.sh` can run as standalone benchmark, and will read the input itself. * 1: To be used in combination with either `datadistribution.sh` or `raw-reader.sh` or with another DataDistribution instance. -* `CTFINPUT`: Read input from CTF ROOT file. This option is incompatible to EXTINPUT=1. The CTF ROOT file can be stored via SAVECTF=1. +* `CTFINPUT`: Read input from CTF ROOT file. This option is incompatible with `EXTINPUT=1`. The CTF ROOT file can be stored via `SAVECTF=1`. * `NHBPERTF`: Time frame length (in HBF) * `GLOBALDPLOPT`: Global DPL workflow options appended to o2-dpl-run. * `EPNPIPELINES`: Set default EPN pipeline multiplicities. diff --git a/prodtests/full-system-test/documentation/full-system-test-as-stress-test.md b/prodtests/full-system-test/documentation/full-system-test-as-stress-test.md index 0c4637ece0920..c78d81b236c1c 100644 --- a/prodtests/full-system-test/documentation/full-system-test-as-stress-test.md +++ b/prodtests/full-system-test/documentation/full-system-test-as-stress-test.md @@ -7,7 +7,7 @@ This is a quick summary how to run the full system test (FST) as stress test on - Enter the O2PDPSuite environment either vie `alienv enter O2PDPSuite/latest Readout/latest`. - Go to an empty directory. - Run the FST simulation via: `NEvents=650 NEventsQED=10000 SHMSIZE=128000000000 TPCTRACKERSCRATCHMEMORY=40000000000 SPLITTRDDIGI=0 GENERATE_ITSMFT_DICTIONARIES=1 $O2_ROOT/prodtests/full_system_test.sh` - - Get a current matbud.root (e.g. from here https://alice.its.cern.ch/jira/browse/O2-2288) and place it in that folder. + - The material budget table (previously taken e.g. from https://alice.its.cern.ch/jira/browse/O2-2288) now comes from the CCDB; there is no need to pull it manually anymore. - Create a timeframe file from the raw files: `$O2_ROOT/prodtests/full-system-test/convert-raw-to-tf-file.sh`.
- Prepare the ramdisk folder: `mv raw/timeframe raw/timeframe-org; mkdir raw/timeframe-tmpfs; ln -s timeframe-tmpfs raw/timeframe` diff --git a/prodtests/full-system-test/documentation/full-system-test-setup.md b/prodtests/full-system-test/documentation/full-system-test-setup.md index 82ef9b7d0c74f..e90a3984dd3da 100644 --- a/prodtests/full-system-test/documentation/full-system-test-setup.md +++ b/prodtests/full-system-test/documentation/full-system-test-setup.md @@ -16,7 +16,7 @@ If you just want to test a small dataset, you can skip the following steps, and - I'd suggest to do a first small test with 1-5 events to check the machinery, 100 events is already a good size which should not exhaust the memory, I'd go to 600 only after 100 works. 1. Compile O2 with GPU support, in addition you need O2sim, DataDistribution, and Readout (latest versions from alidist will do). GPUs for O2 should be auto-detected, but you can set the environment variables ALIBUILD_ENABLE_CUDA / ALIBUILD_ENABLE_HIP to enforce it (and get a failure when detection fails). Look for CMake log messages "Building GPUTracking with CUDA support" (etc) to verify. - For more information, see https://github.com/AliceO2Group/AliceO2/blob/dev/GPU/documentation/build.md + For more information, see https://github.com/AliceO2Group/AliceO2/blob/dev/GPU/documentation/build-O2.md 1. Optionally place some binary configuration files in the simulation folder. Default objects will be used if no such files are placed. There are instructions at the end of this post how to generate these files. (Currently, these files are: matbud.root, ITSdictionary.bin, ctf_dictionary.root, tpctransform.root, dedxsplines.root, and tpcpadgaincalib.root) 1. Load the O2sim environment (`alienv enter O2sim/latest`) and run the following full system test script for a full simulation and digits to raw conversion (this will already include 1 CPU reconstruction run): ``` @@ -37,7 +37,7 @@ If you just want to test a small dataset, you can skip the following steps, and ``` This will use 4 GPU with the HIP backend and allocate 22 GB of scratch memory on the GPU (should be sufficient for 128 orbit TF). You can change the GPU type as indicated in the linked README.md above, e.g. `GPUTYPE=CUDA NGPUS=1` for 1 CUDA GPU. 1. With this, the full chain is running inside O2 DPL. Next we are adding DataDistribution. - 1. Ceate the TF files as explained in the subtask (https://github.com/AliceO2Group/AliceO2/blob/dev/prodtests/full-system-test/documentation/raw-data-simulation.md). For convenience, there is a script that should do it automatically, from a shell that has loaded both DataDistribution and Readout: `$O2_ROOT/prodtests/full-system-test/convert-raw-to-tf-file.sh`. + 1. Create the TF files as explained in [raw-tf-conversion.md](https://github.com/AliceO2Group/AliceO2/blob/dev/prodtests/full-system-test/documentation/raw-tf-conversion.md). For convenience, there is a script that should do it automatically, from a shell that has loaded both DataDistribution and Readout: `$O2_ROOT/prodtests/full-system-test/convert-raw-to-tf-file.sh`. 1. Enter the O2 environment, and run the following script (please adjust the variables as in the test before).
``` EXTINPUT=1 SHMSIZE=128000000000 GPUTYPE=CPU $O2_ROOT/prodtests/full-system-test/dpl-workflow.sh diff --git a/prodtests/full-system-test/documentation/raw-data-simulation.md b/prodtests/full-system-test/documentation/raw-tf-conversion.md similarity index 100% rename from prodtests/full-system-test/documentation/raw-data-simulation.md rename to prodtests/full-system-test/documentation/raw-tf-conversion.md