Prune stale per-container version metrics to stop unbounded Prometheus series growth #434

Open
Copilot wants to merge 3 commits into main from copilot/fix-version-checker-memory-leak


Copilot AI commented May 9, 2026

version-checker was retaining superseded version_checker_is_latest_version series for active containers. In clusters that run long enough and observe image/version churn, this causes Prometheus label cardinality to grow over time and can present as steady memory growth/OOMs.

  • Root cause

    • AddImage writes version_checker_is_latest_version with current_version and latest_version in the label set.
    • For the same pod/container, each newly observed version creates a new gauge series, while the previous series remains registered until the pod/container is removed.
  • Change

    • Before recording the latest image state for a container, remove the existing per-container gauge series for:
      • version_checker_is_latest_version
      • version_checker_last_checked
    • Keep duration and error metrics unchanged; this only de-duplicates the “current state” metrics.
  • Result

    • Each pod/container now has a single current version/check timestamp series instead of an ever-growing set of historical version-labeled gauges.
    • This bounds metric memory usage for long-lived workloads while preserving the intended exported state.
  • Regression coverage

    • Add a focused metrics test that updates the same container twice and asserts the stale version-labeled series is no longer present in the registry.
labels := buildContainerPartialLabels(namespace, pod, container, containerType)

m.containerImageVersion.DeletePartialMatch(labels)
m.containerImageChecked.DeletePartialMatch(labels)

m.containerImageVersion.With(
	buildFullLabels(namespace, pod, container, containerType, imageURL, currentVersion, latestVersion),
).Set(isLatestF)
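
To make the mechanism concrete, here is a minimal, standard-library-only sketch of the behaviour described above. It does not use client_golang: `gaugeVec`, `record`, and the label values are hypothetical stand-ins that only model "one series per distinct label set" and "delete every series matching a partial label set".

```go
package main

import "fmt"

// labels is a simplified stand-in for a Prometheus label set.
type labels map[string]string

// entry is one series: a full label set plus its current value.
type entry struct {
	labels labels
	value  float64
}

// gaugeVec is a toy model of a gauge vector (NOT the client_golang type).
type gaugeVec struct {
	series []entry
}

// contains reports whether full holds every key/value pair in partial.
func contains(full, partial labels) bool {
	for k, v := range partial {
		if got, ok := full[k]; !ok || got != v {
			return false
		}
	}
	return true
}

// set updates the series with exactly these labels, or creates a new one.
func (g *gaugeVec) set(l labels, v float64) {
	for i := range g.series {
		if len(g.series[i].labels) == len(l) && contains(g.series[i].labels, l) {
			g.series[i].value = v
			return
		}
	}
	g.series = append(g.series, entry{labels: l, value: v})
}

// deletePartialMatch drops every series whose labels include all pairs in
// partial and returns how many were removed.
func (g *gaugeVec) deletePartialMatch(partial labels) int {
	kept := g.series[:0]
	removed := 0
	for _, e := range g.series {
		if contains(e.labels, partial) {
			removed++
			continue
		}
		kept = append(kept, e)
	}
	g.series = kept
	return removed
}

// record models the write path: optionally prune the container's existing
// series, then write the series for the newly observed versions.
func record(g *gaugeVec, prune bool, current, latest string, isLatest float64) {
	container := labels{"namespace": "ns", "pod": "pod", "container": "app"}
	if prune {
		g.deletePartialMatch(container)
	}
	full := labels{"current_version": current, "latest_version": latest}
	for k, v := range container {
		full[k] = v
	}
	g.set(full, isLatest)
}

func main() {
	for _, prune := range []bool{false, true} {
		g := &gaugeVec{}
		record(g, prune, "1.0.0", "1.1.0", 0)
		record(g, prune, "1.1.0", "1.1.0", 1)
		fmt.Printf("prune=%v series=%d\n", prune, len(g.series))
	}
}
```

In this model the unpruned path ends with two series for the same container, while the pruned path ends with one. The partial label set deliberately omits the version labels, so a single delete call matches whichever version-labelled series was written last.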

Copilot AI requested review from Copilot and removed request for Copilot May 9, 2026 14:32
Copilot AI linked an issue May 9, 2026 that may be closed by this pull request
Agent-Logs-Url: https://github.com/jetstack/version-checker/sessions/f773d382-07fb-40b4-8c53-eec3cb652854

Co-authored-by: davidcollom <1504448+davidcollom@users.noreply.github.com>
Copilot AI requested review from Copilot and removed request for Copilot May 9, 2026 14:37
Agent-Logs-Url: https://github.com/jetstack/version-checker/sessions/f773d382-07fb-40b4-8c53-eec3cb652854

Co-authored-by: davidcollom <1504448+davidcollom@users.noreply.github.com>
Copilot AI requested review from Copilot and removed request for Copilot May 9, 2026 14:40
Copilot AI changed the title from "[WIP] Fix memory leak in version-checker causing OOM kills" to "Prune stale per-container version metrics to stop unbounded Prometheus series growth" May 9, 2026
Copilot AI requested a review from davidcollom May 9, 2026 14:41
@davidcollom davidcollom marked this pull request as ready for review May 9, 2026 14:44
@davidcollom davidcollom requested a review from maria-reynoso as a code owner May 9, 2026 14:44
Copilot AI review requested due to automatic review settings May 9, 2026 14:44
@davidcollom davidcollom enabled auto-merge (squash) May 9, 2026 14:44

Copilot AI left a comment

Pull request overview

This PR addresses unbounded Prometheus series growth by pruning stale per-container “current state” gauge series when the observed image/version for a container changes, keeping metric cardinality bounded for long-lived clusters.

Changes:

  • In AddImage, delete existing per-container gauge series for is_latest_version and last_checked before recording the newest values.
  • Add a regression test to ensure stale version-labeled is_latest_version series are removed when the same container is updated multiple times.
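
As a rough illustration of why this bounds cardinality, the sketch below counts surviving series for a simulated fleet. The function name `seriesCount`, the container keys, and the figures (200 containers, 12 observed versions each) are made up for illustration and are not taken from this PR or from any real cluster.

```go
package main

import "fmt"

// seriesCount simulates `containers` containers that each observe `upgrades`
// distinct versions, and returns how many version-labelled series remain.
func seriesCount(containers, upgrades int, prune bool) int {
	// registry maps a container key to the set of version-labelled series it owns.
	registry := make(map[string]map[string]struct{})
	for c := 0; c < containers; c++ {
		key := fmt.Sprintf("ns/pod-%d/app", c)
		for u := 0; u < upgrades; u++ {
			// With pruning, the container's old series are dropped before each write.
			if prune || registry[key] == nil {
				registry[key] = make(map[string]struct{})
			}
			registry[key][fmt.Sprintf("1.%d.0", u)] = struct{}{}
		}
	}
	total := 0
	for _, versions := range registry {
		total += len(versions)
	}
	return total
}

func main() {
	fmt.Println(seriesCount(200, 12, false)) // grows with every observed version
	fmt.Println(seriesCount(200, 12, true))  // one series per container
}
```

Without pruning, the count scales with containers times observed versions. With pruning, it scales with containers only, which is the bound this change aims for.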

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| pkg/metrics/metrics.go | Prunes existing per-container gauge series prior to writing updated “current state” metrics to prevent series buildup. |
| pkg/metrics/metrics_test.go | Adds a regression test asserting stale version-labeled gauge series are removed after updates. |


Comment thread pkg/metrics/metrics.go
Comment on lines +121 to +127
labels := buildContainerPartialLabels(namespace, pod, container, containerType)

// Remove any existing "current state" gauge series for this container before
// registering the newest values. Otherwise each version change leaves behind
// a distinct Prometheus series due to the current/latest version labels.
m.containerImageVersion.DeletePartialMatch(labels)
m.containerImageChecked.DeletePartialMatch(labels)

Comment thread pkg/metrics/metrics_test.go
Comment on lines +76 to +88
func TestAddImageReplacesExistingVersionMetrics(t *testing.T) {
	reg := prometheus.NewRegistry()
	m := New(logrus.NewEntry(logrus.New()), reg, fakek8s)

	m.AddImage("namespace", "pod", "container", "container", "url", false, "1.0.0", "1.1.0")
	m.AddImage("namespace", "pod", "container", "container", "url", true, "1.1.0", "1.1.0")

	assert.Equal(t, 1,
		testutil.CollectAndCount(m.containerImageVersion.MetricVec, MetricNamespace+"_is_latest_version"),
	)
	assert.Equal(t, 1,
		testutil.CollectAndCount(m.containerImageChecked.MetricVec, MetricNamespace+"_last_checked"),
	)

Development

Successfully merging this pull request may close these issues.

version-checker seemingly leaks memory and gets oom-killed

3 participants