Skip to content

Fix nvbug6084457: Make NVLINK_MAX_LINKS version-dependent#2192

Open
mdboom wants to merge 8 commits into
NVIDIA:mainfrom
mdboom:nvlink-max-links-dynamic
Open

Fix nvbug6084457: Make NVLINK_MAX_LINKS version-dependent#2192
mdboom wants to merge 8 commits into
NVIDIA:mainfrom
mdboom:nvlink-max-links-dynamic

Conversation

@mdboom

@mdboom mdboom commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

NVLINK_MAX_LINKS was updated in CTK 13.3. We therefore need to make this value dynamic based on the NVML version.

@mdboom mdboom added this to the cuda.bindings 13.3.1 milestone Jun 10, 2026
@mdboom mdboom self-assigned this Jun 10, 2026
@mdboom mdboom added bug Something isn't working cuda.bindings Everything related to the cuda.bindings module labels Jun 10, 2026
@github-actions github-actions Bot added the cuda.core Everything related to the cuda.core module label Jun 10, 2026
Comment thread cuda_bindings/cuda/bindings/nvml.pyx Outdated

NVLINK_MAX_LINKS = 18

if tuple(int(x) for x in system_get_nvml_version().split(".")) < (3, 13):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we obtain this value from the source of truth?

E.g. I see:

$ grep NVLINK_MAX_LINKS /usr/local/cuda_*/include/nvml.h
/usr/local/cuda_13.3.0_610.43.02_linux_kitpick035/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 36 //!< Maximum number of NVLink links supported.

I have every single /usr/local/cuda-*.* from 12.0 to 13.3, but I only see NVML_NVLINK_MAX_LINKS in the 13.3 include file.

Maybe something like this could work?

int internal_only_get_NVLINK_MAX_LINKS() {
#ifdef NVML_NVLINK_MAX_LINKS
    return NVML_NVLINK_MAX_LINKS;
#else
    return 18; // Prior to CUDA 13.3 this value was hard-wired. 
#endif
}

If there isn't a practical solution, could you please add a comment to explain?

@mdboom mdboom Jun 11, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have every single /usr/local/cuda-. from 12.0 to 13.3, but I only see NVML_NVLINK_MAX_LINKS in the 13.3 include file.

Are you sure? I see it in 12.9 through 13.3 (though the value changed in 13.3).

We have to build a single binary that works for every version 12.9 - 13.3, so the only way to do this is with a runtime computation. Additionally, we build without the nvml.h header present. I'll add a comment.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, sorry, I messed up like this (retrieved from my bash history; note the cuda_):

grep NVLINK_MAX_LINKS /usr/local/cuda_*/include/nvml.h

(I have a custom softlink for 13.3, which is why that matched.)

It looks much different like this:

$ grep NVLINK_MAX_LINKS /usr/local/cuda-*/include/nvml.h
/usr/local/cuda-12.0/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18
/usr/local/cuda-12.1/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18
/usr/local/cuda-12.2/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18
/usr/local/cuda-12.3/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18
/usr/local/cuda-12.4/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18
/usr/local/cuda-12.5/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18
/usr/local/cuda-12.6/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18
/usr/local/cuda-12.8/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18
/usr/local/cuda-12.9/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18
/usr/local/cuda-13.0/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18
/usr/local/cuda-13.1/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18 //!< Maximum number of NVLink links supported.
/usr/local/cuda-13.2/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 18 //!< Maximum number of NVLink links supported.
/usr/local/cuda-13.3/include/nvml.h:#define NVML_NVLINK_MAX_LINKS 36 //!< Maximum number of NVLink links supported.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't think it through before, but at second look, of course the helper function I was envisioning would need to live in nvml.h itself.

I had my agent do a quick check in CUDA 13.3's nvml.h and the NVML docs. It didn't surface any runtime C API that reports NVML_NVLINK_MAX_LINKS. Your solution does indeed appear to be our only option. Ideally we'd request a runtime API, but that's for another day.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a module-level constant so there's not much we can do with a helper function. @mdboom that said, this would force nvml to be loaded at import time, which breaks CPU-only envs. Can we hide this behind module __getattr__

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(For example this would break the import test.)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Experimentally, it doesn't break CPU-only builds, but it does break on systems without NVML installed, so I agree the __getattr__ trick is probably justified, but it's narrower than you think. nvmlSystemGetVersion doesn't require nvmlInit and doesn't require a GPU.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CPU-only envs have a broad definition, including but not limited to: GPU driver is not installed 😉

@mdboom mdboom requested a review from rwgk June 11, 2026 19:13
Comment thread cuda_bindings/cuda/bindings/nvml.pyx Outdated

NVLINK_MAX_LINKS = 18

if tuple(int(x) for x in system_get_nvml_version().split(".")) < (3, 13):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't think it through before, but at second look, of course the helper function I was envisioning would need to live in nvml.h itself.

I had my agent do a quick check in CUDA 13.3's nvml.h and the NVML docs. It didn't surface any runtime C API that reports NVML_NVLINK_MAX_LINKS. Your solution does indeed appear to be our only option. Ideally we'd request a runtime API, but that's for another day.

@github-actions

Copy link
Copy Markdown

@mdboom

mdboom commented Jun 12, 2026

Copy link
Copy Markdown
Contributor Author

@leofang: There seem to be failures on specific hardware (a100, t4) on Windows only that we can't use all of the stated nvlinks. I have asked internally why this might be the case. It /shouldn't/ be a driver vs. CTK difference since NVML ships with the driver and we are asking NVML (not the CTK or cuda-bindings itself or something) what version we have to determine how many links we have. But check my work, it could be that I'm not checking the appropriate thing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working cuda.bindings Everything related to the cuda.bindings module cuda.core Everything related to the cuda.core module

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants