Update compatibility matrix files #2400

Merged
tempusfrangit merged 2 commits into main from md/fix-compatgen
Feb 19, 2026
@michaeldwan (Member) commented Jun 9, 2025

Summary

Fix compatgen tool to correctly extract CuDNN versions from newer nvidia/cuda images, and regenerate all compatibility matrices.

Problem

Newer nvidia/cuda image tags no longer include the CuDNN version number (e.g. 12.9.1-cudnn-devel-ubuntu24.04 instead of 12.6.3-cudnn9-devel-ubuntu22.04). The tag parser in compatgen couldn't extract the CuDNN version from these tags, so new CUDA versions had to be patched into the JSON by hand (see #2036).

Fix

  • tools/compatgen/internal/cuda.go: Fetch the Docker image config via go-containerregistry and read NV_CUDNN_VERSION from environment variables instead of parsing the tag string. Uses authn.DefaultKeychain for Docker Hub auth to avoid rate limits. Adds deterministic sorting of output.
  • tools/compatgen/internal/torch.go: Filter out torch entries with no supported Python versions (all below 3.10 minimum) instead of emitting them with "Pythons": [].
  • tools/compatgen/main.go: Pass context.Context to support the image fetching.
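
As a rough illustration of the env-based lookup described above: an image config stores its environment as a list of "KEY=VALUE" strings, so once the config has been fetched, reading NV_CUDNN_VERSION is a simple scan. The sketch below uses only the standard library and a hardcoded list. The helper name envValue and the sample values are assumptions for illustration, not code or data from this PR; in the actual tool the list would come from the config fetched via go-containerregistry.

```go
package main

import (
	"fmt"
	"strings"
)

// envValue looks up key in a list of "KEY=VALUE" strings, the form in which
// a Docker image config stores its environment. (Hypothetical helper.)
func envValue(env []string, key string) (string, bool) {
	for _, kv := range env {
		k, v, ok := strings.Cut(kv, "=")
		if ok && k == key {
			return v, true
		}
	}
	return "", false
}

func main() {
	// Hypothetical values, for illustration only.
	env := []string{
		"PATH=/usr/local/cuda/bin:/usr/bin",
		"CUDA_VERSION=12.9.1",
		"NV_CUDNN_VERSION=9.10.2.21-1",
	}
	if v, ok := envValue(env, "NV_CUDNN_VERSION"); ok {
		fmt.Println("cudnn:", v)
	} else {
		fmt.Println("image does not declare NV_CUDNN_VERSION")
	}
}
```

Returning a separate found flag lets the caller distinguish an image with no CuDNN from one with an empty value.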

Regenerated files

  • cuda_compatibility.json: Adds CUDA 13.0.x and 13.1.x. CuDNN values now correctly extracted from image configs.
  • torch_compatibility.json: Adds torch 2.10.0, 2.9.x. Drops 41 entries for torch versions with no supported Python versions.
  • tf_compatibility.json: Adds TensorFlow 2.20.0.
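
The torch filtering could look roughly like the following sketch: keep only Python versions at or above the 3.10 minimum, and drop any entry left with none. The torchEntry type, its fields, and the function names are hypothetical stand-ins; the real matrix rows carry more fields, and the sample data is invented for illustration.

```go
package main

import "fmt"

// torchEntry is a hypothetical, trimmed-down stand-in for one matrix row.
type torchEntry struct {
	Torch   string
	Pythons []string
}

// supportedPython reports whether a "major.minor" version is at least 3.10.
func supportedPython(py string) bool {
	var major, minor int
	if _, err := fmt.Sscanf(py, "%d.%d", &major, &minor); err != nil {
		return false
	}
	return major > 3 || (major == 3 && minor >= 10)
}

// filterEntries keeps only supported Python versions per entry and drops
// entries that end up with none, instead of emitting "Pythons": [].
func filterEntries(entries []torchEntry) []torchEntry {
	kept := make([]torchEntry, 0, len(entries))
	for _, e := range entries {
		var pys []string
		for _, py := range e.Pythons {
			if supportedPython(py) {
				pys = append(pys, py)
			}
		}
		if len(pys) > 0 {
			kept = append(kept, torchEntry{Torch: e.Torch, Pythons: pys})
		}
	}
	return kept
}

func main() {
	// Invented sample rows, for illustration only.
	entries := []torchEntry{
		{Torch: "1.7.1", Pythons: []string{"3.6", "3.7", "3.8"}},
		{Torch: "2.0.1", Pythons: []string{"3.9", "3.10", "3.11"}},
	}
	for _, e := range filterEntries(entries) {
		fmt.Println(e.Torch, e.Pythons)
	}
}
```

Comparing the minor version numerically matters here: a plain string comparison would rank "3.9" above "3.10".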

Test updates

Updated 4 tests that referenced dropped torch versions (1.7.x, 1.8.0) to use versions still in the matrix (1.11.0, 1.13.1, 2.0.1).

@michaeldwan michaeldwan marked this pull request as ready for review June 9, 2025 21:11
@michaeldwan michaeldwan requested a review from a team June 9, 2025 21:12
@michaeldwan michaeldwan force-pushed the md/fix-compatgen branch 2 times, most recently from ac77130 to dc9197c Compare July 3, 2025 23:21
@michaeldwan michaeldwan requested a review from markphelps July 8, 2025 15:16

func NewVersion(s string) (version *Version, err error) {
// TODO[md]: handle prerelease versions (0.1.2-rc1) so they aren't appended to the previous component
// todo[md]: tbh just switch to hashicorp/go-version or github.com/Masterminds/semver/v3
Contributor

💯 I've used semver/v3 in the past, it's pretty nice

Member Author

I always go to that one too, but I found the other day that it doesn't support "invalid" semver input like Ubuntu's "22.04" with leading zeros. I was hoping to use the same version code in cog and the new base image generator code 😒
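
For context on the leading-zero point, a lenient parser that accepts input such as "22.04" might look like the sketch below. The name parseLoose is hypothetical; this is neither cog's NewVersion nor any semver library's API, just a minimal standard-library illustration.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseLoose splits a dotted version string into integer components.
// Leading zeros are accepted, so "22.04" parses as [22 4].
// (Hypothetical helper, not part of cog or any semver library.)
func parseLoose(s string) ([]int, error) {
	fields := strings.Split(s, ".")
	parts := make([]int, 0, len(fields))
	for _, f := range fields {
		n, err := strconv.Atoi(f)
		if err != nil || n < 0 {
			return nil, fmt.Errorf("invalid version component %q in %q", f, s)
		}
		parts = append(parts, n)
	}
	return parts, nil
}

func main() {
	for _, s := range []string{"22.04", "12.9.1", "0.1.2-rc1"} {
		parts, err := parseLoose(s)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(s, "->", parts)
	}
}
```

Two caveats: the numeric form drops the zero, so printing the parsed value back yields "22.4" unless the original string is kept alongside it, and prerelease suffixes such as "-rc1" are rejected rather than handled.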

@markphelps (Contributor) left a comment

one question, but overall lgtm

if len(parts) != 4 {
return nil, fmt.Errorf("Tag must be in the format <cudaVersion>-cudnn<cudnnVersion>-{devel,runtime}-ubuntu<ubuntuVersion>. Invalid tag: %s", tag)
func parseCUDABaseImage(ctx context.Context, tag string) (*config.CUDABaseImage, error) {
fmt.Println("parsing", tag)
Contributor

debug printlns? do we want to keep these?

images := make([]config.CUDABaseImage, len(tags))
eg, egctx := errgroup.WithContext(context.TODO())
// set a concurrency limit to avoid throttling by the docker hub api (since these are authenticated requests)
eg.SetLimit(1)
Contributor

why use error group at all then if we are running them serially?

Member Author

natural evolution of intermittent issues and sloppy code. fixing :)

@michaeldwan (Member Author)

@markphelps I removed the unnecessary errgroup and excessive print statements. I left one for each image since the process takes a few minutes and it's nice to see some output to know it's not hanging

markphelps previously approved these changes Jul 8, 2025

@markphelps (Contributor) left a comment

lgtm!

Newer nvidia/cuda image tags no longer include the CuDNN version number
(e.g. '12.9.1-cudnn-devel-ubuntu24.04' instead of '12.6.3-cudnn9-devel-ubuntu22.04').
This fetches the Docker image config and reads CUDA_VERSION and NV_CUDNN_VERSION
from environment variables instead of parsing the tag string.

Also adds deterministic sorting of output and Docker Hub auth to avoid rate limits.
…hon versions

- Regenerated cuda_compatibility.json: adds CUDA 13.0.x and 13.1.x, CuDNN
  versions now correctly extracted from image configs instead of manual patches
- Regenerated torch_compatibility.json: adds torch 2.10.0, 2.9.x; drops 41
  entries for torch versions with no supported Python versions (all < 3.10)
- Updated tests to use torch versions still in the compatibility matrix
@tempusfrangit tempusfrangit merged commit faa27ae into main Feb 19, 2026
31 checks passed
@tempusfrangit tempusfrangit deleted the md/fix-compatgen branch February 19, 2026 17:59