Fix: image vulnerabilities #124
Open
Chmokachka wants to merge 31 commits into
Summary
Drives all `runpod/*` images to a clean Trivy / Hadolint scan, plus a few CI fixes that surfaced along the way. Targets every image we ship out of `official-templates/` and `helper-templates/`.
What's fixed
Image vulnerabilities (Trivy `--severity HIGH,CRITICAL`)
- `jupyterlab`, `notebook`, OpenSSH-related deps; stripped the `efa_metrics` directory from NVIDIA Nsight Compute. That directory ships an internal Go binary (`nic_sampler`) that NVIDIA builds with an old Go toolchain, which was triggering recurring Go-stdlib HIGH/CRITICAL findings on every rebuild. The plugin is AWS-EFA-only (x86, AWS hardware) and never runs on RunPod, so deleting it is safe, and the `find ... || true` guard keeps it a no-op on ROCm / CPU images.
- Set `max-parallelism` to 3 in CI and increased the workflow timeout (the matrix was OOM-killing the runner before).
- Added `scrub-stale-metadata.py` (see below) to remove orphan `.dist-info` / `.egg-info` trees that kept Trivy reporting fixed wheels as still-vulnerable.
Hadolint
- `DL3008` / `DL3009` / `DL3015` findings fixed across the touched Dockerfiles (`--no-install-recommends`, `apt-get clean && rm -rf /var/lib/apt/lists/*`, version pins where reasonable).
CI / tooling
- `nvidia.yml`, `rocm.yml`, `hadolint-pr.yml`, `hadolint-push.yml`.
- `.github/actions/trivy`: exposes a `skip_files` input so `nvidia-pytorch` can skip the publicly-known CA bundle that Trivy flags as a "secret". The cert is the upstream NGC trust bundle published on GitHub, so flagging it is a false positive.
- Pinned `RUNPODCTL_VERSION=v2.3.0` in `base/Dockerfile` to stop tracking `latest`.
- Fixed the `docker/setup-qemu-action` invocation that started failing after the action's input rename.
New: `scripts/scrub-stale-metadata.py`
Small helper invoked by Dockerfiles after `pip install`. NGC base images bundle several Python packages as in-tree source builds whose `.egg-info` lives next to the source. `pip install --upgrade` upgrades the wheel install but cannot reach those bundled trees, so Trivy keeps reporting the old version even though the runtime resolves to the new one. The script reads our pinned `requirements.txt` and deletes any `.dist-info` / `.egg-info` whose `Version:` disagrees with the pin.
What's NOT fixed (deliberate)
Three images still have findings we can't act on in this PR:
- `runpod/base:...-rocm644-...-pytorch251`
- `runpod/autoresearch:...-cuda1281-ubuntu2204`
- `runpod/autoresearch:...-cuda1281-ubuntu2404`
These are tracked separately; everything else is now clean.
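For reviewers who want the pruning rule of `scripts/scrub-stale-metadata.py` in concrete form, here is a minimal sketch of the behaviour described in this PR. It is not the script itself. The function names (`parse_pins`, `read_name_version`, `find_stale`, `scrub`) are hypothetical, only plain `name==version` pins are recognised, and versions are compared as exact strings; the real script may differ on all of these points.

```python
import re
import shutil
from email.parser import Parser
from pathlib import Path


def normalize(name: str) -> str:
    """Normalize a project name (lowercase, runs of -, _, . become one -)."""
    return re.sub(r"[-_.]+", "-", name).lower()


def parse_pins(requirements_text: str) -> dict:
    """Collect `name==version` pins; every other requirement form is ignored."""
    pins = {}
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        match = re.fullmatch(r"([A-Za-z0-9._-]+)\s*==\s*([^\s;]+)", line)
        if match:
            pins[normalize(match.group(1))] = match.group(2)
    return pins


def read_name_version(meta_dir: Path):
    """Return (normalized name, version) from METADATA or PKG-INFO, else None."""
    for filename in ("METADATA", "PKG-INFO"):
        meta_file = meta_dir / filename
        if meta_file.is_file():
            text = meta_file.read_text(encoding="utf-8", errors="replace")
            headers = Parser().parsestr(text, headersonly=True)
            name, version = headers.get("Name"), headers.get("Version")
            if name and version:
                return normalize(name), version.strip()
    return None


def find_stale(root: Path, pins: dict) -> list:
    """List metadata dirs whose Version: disagrees with the pin for that name."""
    stale = []
    for pattern in ("*.dist-info", "*.egg-info"):
        for meta_dir in sorted(root.rglob(pattern)):
            found = read_name_version(meta_dir)
            if found is None:
                continue  # unreadable or unrelated entry: leave it alone
            name, version = found
            # Assumption: exact string comparison, so "2.0" != "2.0.0".
            if name in pins and pins[name] != version:
                stale.append(meta_dir)
    return stale


def scrub(root: Path, pins: dict) -> list:
    """Delete every stale metadata dir under root and return what was removed."""
    stale = find_stale(root, pins)
    for meta_dir in stale:
        shutil.rmtree(meta_dir, ignore_errors=True)
    return stale
```

Under these assumptions, a Dockerfile step would call something like `scrub(Path(site_packages_dir), parse_pins(Path("requirements.txt").read_text()))` after `pip install`. Packages that are not pinned are never touched, which keeps the sketch conservative.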
Validation
Follow-ups (separate PRs)