Fix: image vulnerabilities #124

Open
Chmokachka wants to merge 31 commits into feat/image-security-scanner from fix/image-vulnerabilities

@Chmokachka (Collaborator) commented May 18, 2026

Summary

Drives all runpod/* images to a clean Trivy / Hadolint scan, plus a few CI fixes that surfaced along the way. Targets every image we ship out of official-templates/ and helper-templates/.

What's fixed

Image vulnerabilities (Trivy `--severity HIGH,CRITICAL`)

  • base: bumped jupyterlab, notebook, and OpenSSH-related deps; stripped the efa_metrics directory from NVIDIA Nsight Compute. That directory ships an internal Go binary (nic_sampler) that NVIDIA builds with an old Go toolchain, which triggered recurring Go-stdlib HIGH/CRITICAL findings on every rebuild. The plugin is AWS-EFA-only (x86, AWS hardware) and never runs on RunPod, so deleting it is safe; the find ... || true guard makes the removal a no-op on ROCm / CPU images.
  • autoresearch: fixed Hadolint findings; aligned with the new base.
  • pytorch: Hadolint fixes; bumped max-parallelism to 3 in CI and increased the workflow timeout (the matrix was OOM-killing the runner before).
  • rocm: addressed all fixable CVEs; pinned the relevant deps.
  • nvidia-pytorch: patched OS-package CVEs; added scrub-stale-metadata.py (see below) to remove orphan .dist-info / .egg-info trees that made Trivy keep reporting already-fixed wheels as vulnerable.
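
For readers who want to see the efa_metrics removal logic spelled out, here is a rough Python equivalent of a guarded directory removal. It is only a sketch: the Dockerfiles in this PR use a find ... || true shell step, and the function name strip_named_dirs and its root argument are hypothetical names chosen for illustration.

```python
import shutil
from pathlib import Path


def strip_named_dirs(root: str, dirname: str = "efa_metrics") -> list[str]:
    """Delete every directory called `dirname` under `root`; return the removed paths.

    Hypothetical helper: it mirrors the intent of a guarded `find ... || true`
    step, not the exact command used in the Dockerfiles.
    """
    base = Path(root)
    if not base.is_dir():
        # Nothing installed here (e.g. an image without Nsight): silent no-op.
        return []

    # Materialise the matches first so the tree is not mutated while walking it.
    matches = sorted(
        p for p in base.rglob(dirname) if p.is_dir() and not p.is_symlink()
    )

    removed = []
    for match in matches:
        if match.exists():  # may already be gone if nested inside an earlier match
            shutil.rmtree(match, ignore_errors=True)
            removed.append(str(match))
    return removed
```

Calling it on a root that does not exist, or one with no matching directory, returns an empty list instead of failing, which is the same property the || true guard provides.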

Hadolint

  • All DL3008 / DL3009 / DL3015 findings fixed across the touched Dockerfiles (--no-install-recommends, apt-get clean && rm -rf /var/lib/apt/lists/*, version pins where reasonable).
  • Hadolint-on-push workflow now ignores the rules we already chose to accept project-wide (matches the PR check behaviour).

CI / tooling

  • Upgraded GitHub Actions versions across nvidia.yml, rocm.yml, hadolint-pr.yml, hadolint-push.yml.
  • Replaced the brittle Trivy action call with our internal .github/actions/trivy, which exposes a skip_files input so nvidia-pytorch can skip the publicly known CA bundle that Trivy flags as a "secret". The cert is the upstream NGC trust bundle published on GitHub, so the finding is a false positive.
  • Pinned RUNPODCTL_VERSION=v2.3.0 in base/Dockerfile to stop tracking latest.
  • Fixed docker/setup-qemu-action invocation that started failing after the action's input rename.

New: scripts/scrub-stale-metadata.py

Small helper invoked by Dockerfiles after pip install. NGC base images bundle several Python packages as in-tree source builds whose .egg-info lives next to the source. pip install --upgrade upgrades the wheel install but cannot reach those bundled trees, so Trivy keeps reporting the old version even though the runtime resolves to the new one. The script reads our pinned requirements.txt and deletes any .dist-info / .egg-info whose Version: disagrees with the pin.
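
The behaviour described above can be sketched roughly as follows. This is not the script from the PR: the function names (read_pins, read_name_version, scrub), the dry_run flag, and the choice to compare versions as plain strings are all assumptions made for illustration.

```python
from __future__ import annotations

import re
import shutil
from pathlib import Path


def normalize(name: str) -> str:
    """PEP 503 normalisation so 'Foo_Bar' and 'foo-bar' compare equal."""
    return re.sub(r"[-_.]+", "-", name).lower()


def read_pins(requirements: Path) -> dict[str, str]:
    """Collect exact `name==version` pins; any other specifier is ignored."""
    pins = {}
    for raw in requirements.read_text().splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        m = re.match(r"^([A-Za-z0-9._-]+)(?:\[[^\]]*\])?\s*==\s*([^\s;]+)", line)
        if m:
            pins[normalize(m.group(1))] = m.group(2)
    return pins


def read_name_version(info_dir: Path) -> tuple[str, str] | None:
    """Read Name/Version headers from METADATA (.dist-info) or PKG-INFO (.egg-info)."""
    for filename in ("METADATA", "PKG-INFO"):
        meta = info_dir / filename
        if not meta.is_file():
            continue
        fields = {}
        for line in meta.read_text(encoding="utf-8", errors="replace").splitlines():
            if not line.strip():
                break  # the header block ends at the first blank line
            key, sep, value = line.partition(":")
            if sep and key in ("Name", "Version"):
                fields.setdefault(key, value.strip())
        if "Name" in fields and "Version" in fields:
            return normalize(fields["Name"]), fields["Version"]
    return None


def scrub(site_dir: Path, pins: dict[str, str], dry_run: bool = False) -> list[Path]:
    """Remove metadata directories whose Version: disagrees with the pin."""
    # Collect candidates up front so deletions do not disturb the directory walk.
    candidates = sorted(
        p
        for pattern in ("*.dist-info", "*.egg-info")
        for p in site_dir.rglob(pattern)
        if p.is_dir()
    )
    removed = []
    for info_dir in candidates:
        parsed = read_name_version(info_dir)
        if parsed is None:
            continue
        name, version = parsed
        pinned = pins.get(name)
        # Simplification: plain string comparison, so "1.0" and "1.0.0" differ.
        if pinned is not None and version != pinned:
            if not dry_run:
                shutil.rmtree(info_dir)
            removed.append(info_dir)
    return removed
```

Packages without an exact pin are left alone, so the sketch can only delete metadata for packages the requirements file explicitly names.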

What's NOT fixed (deliberate)

Three images still have findings we can't act on in this PR:

| Image | Reason |
| --- | --- |
| runpod/base:...-rocm644-...-pytorch251 | All remaining CVEs are in PyTorch 2.5.1 itself, fixed only in 2.6.0+. Two options: drop the 2.5.1 variant, or wait for an upstream backport. Left for a separate decision. |
| runpod/autoresearch:...-cuda1281-ubuntu2204 | Findings are in transitive deps that need an autoresearch app-level dependency upgrade, which is out of scope for this PR. |
| runpod/autoresearch:...-cuda1281-ubuntu2404 | Same as above. |

These are tracked separately; everything else is now clean.

Validation

  • Trivy table-mode scans of each rebuilt tag: no HIGH/CRITICAL findings on any targeted image.
  • Hadolint runs against the touched Dockerfiles: clean.
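
The scans above were run in table mode. If a scripted check is wanted later, one possible approach is to have Trivy emit a JSON report (--format json) and count findings by severity. The sketch below assumes that JSON layout (a top-level Results list whose entries may carry a Vulnerabilities list); the function name count_findings is hypothetical and nothing like it is part of this PR.

```python
import json
from collections import Counter


def count_findings(report: dict, severities=("HIGH", "CRITICAL")) -> Counter:
    """Count vulnerabilities per severity in a parsed Trivy JSON report."""
    counts = Counter()
    # "Results" and "Vulnerabilities" can be missing or null when nothing is found.
    for result in report.get("Results") or []:
        for vuln in result.get("Vulnerabilities") or []:
            severity = vuln.get("Severity")
            if severity in severities:
                counts[severity] += 1
    return counts


def is_clean(report_text: str) -> bool:
    """True when the report contains no HIGH or CRITICAL findings."""
    return sum(count_findings(json.loads(report_text)).values()) == 0
```

A CI step could then fail the job whenever is_clean returns False for an image that is expected to be clean.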

Follow-ups (separate PRs)

  • Open autoresearch-side PR to upgrade transitive deps.

@Chmokachka Chmokachka changed the base branch from main to feat/image-security-scanner May 18, 2026 13:19
@Chmokachka Chmokachka marked this pull request as ready for review May 21, 2026 21:01