Fix nvbug6084457: Make NVLINK_MAX_LINKS version-dependent#2192
Conversation
|
|
||
| NVLINK_MAX_LINKS = 18 | ||
|
|
||
| if tuple(int(x) for x in system_get_nvml_version().split(".")) < (3, 13): |
There was a problem hiding this comment.
Could we obtain this value from the source of truth?
E.g. I see:
$ grep NVLINK_MAX_LINKS /usr/local/cuda_*/include/nvml.h
/usr/local/cuda_13.3.0_610.43.02_linux_kitpick035/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 36 //!< Maximum number of NVLink links supported.
I have every single /usr/local/cuda-*.* from 12.0 to 13.3, but I only see NVML_NVLINK_MAX_LINKS in the 13.3 include file.
Maybe something like this could work?
int internal_only_get_NVLINK_MAX_LINKS() {
#ifdef NVML_NVLINK_MAX_LINKS
return NVML_NVLINK_MAX_LINKS;
#else
return 18; // Prior to CUDA 13.3 this value was hard-wired.
#endif
}
If there isn't a practical solution, could you please add a comment to explain?
There was a problem hiding this comment.
I have every single /usr/local/cuda-. from 12.0 to 13.3, but I only see NVML_NVLINK_MAX_LINKS in the 13.3 include file.
Are you sure? I see it in 12.9 through 13.3 (though the value changed in 13.3).
We have to build a single binary that works for every version 12.9 - 13.3, so the only way to do this is with a runtime computation. Additionally, we build without the nvml.h header present. I'll add a comment.
There was a problem hiding this comment.
Oh, sorry, I messed up like this (retrieved from my bash history; note the cuda_):
grep NVLINK_MAX_LINKS /usr/local/cuda_*/include/nvml.h
(I have a custom softlink for 13.3, which is why that matched.)
It looks much different like this:
$ grep NVLINK_MAX_LINKS /usr/local/cuda-*/include/nvml.h
/usr/local/cuda-12.0/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18
/usr/local/cuda-12.1/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18
/usr/local/cuda-12.2/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18
/usr/local/cuda-12.3/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18
/usr/local/cuda-12.4/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18
/usr/local/cuda-12.5/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18
/usr/local/cuda-12.6/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18
/usr/local/cuda-12.8/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18
/usr/local/cuda-12.9/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18
/usr/local/cuda-13.0/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18
/usr/local/cuda-13.1/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18 //!< Maximum number of NVLink links supported.
/usr/local/cuda-13.2/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18 //!< Maximum number of NVLink links supported.
/usr/local/cuda-13.3/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 36 //!< Maximum number of NVLink links supported.
There was a problem hiding this comment.
I didn't think it through before, but at second look, of course the helper function I was envisioning would need to live in nvml.h itself.
I had my agent do a quick check in CUDA 13.3's nvml.h and the NVML docs. It didn't surface any runtime C API that reports NVML_NVLINK_MAX_LINKS. Your solution does indeed appear to be our only option. Ideally we'd request a runtime API, but that's for another day.
There was a problem hiding this comment.
This is a module-level constant so there's not much we can do with a helper function. @mdboom that said, this would force nvml to be loaded at import time, which breaks CPU-only envs. Can we hide this behind module __getattr__
There was a problem hiding this comment.
Experimentally, it doesn't break CPU-only builds, but it does break on systems without NVML installed, so I agree the __getattr__ trick is probably justified, but it's narrower than you think. nvmlSystemGetVersion doesn't require nvmlInit and doesn't require a GPU.
There was a problem hiding this comment.
CPU-only envs have a broad definition, including but not limited to: GPU driver is not installed 😉
|
|
||
| NVLINK_MAX_LINKS = 18 | ||
|
|
||
| if tuple(int(x) for x in system_get_nvml_version().split(".")) < (3, 13): |
There was a problem hiding this comment.
I didn't think it through before, but at second look, of course the helper function I was envisioning would need to live in nvml.h itself.
I had my agent do a quick check in CUDA 13.3's nvml.h and the NVML docs. It didn't surface any runtime C API that reports NVML_NVLINK_MAX_LINKS. Your solution does indeed appear to be our only option. Ideally we'd request a runtime API, but that's for another day.
|
|
@leofang: There seem to be failures on specific hardware (a100, t4) on Windows only that we can't use all of the stated nvlinks. I have asked internally why this might be the case. It /shouldn't/ be a driver vs. CTK difference since NVML ships with the driver and we are asking NVML (not the CTK or cuda-bindings itself or something) what version we have to determine how many links we have. But check my work, it could be that I'm not checking the appropriate thing. |
NVLINK_MAX_LINKS was updated in CTK 13.3. We therefore need to make this value dynamic based on the NVML version.