WIP OCPNODE-4055: Add comprehensive test suite for Additional Storage Support feature #31083

Draft
BhargaviGudi wants to merge 1 commit into openshift:main from BhargaviGudi:additionalArtifactsAutomation

Conversation

@BhargaviGudi
Contributor

@BhargaviGudi BhargaviGudi commented Apr 28, 2026

This adds API validation and E2E tests for the Additional Storage Configurations feature (TechPreviewNoUpgrade):

  • additionalArtifactStores (max 10 stores)
  • additionalImageStores (max 10 stores)
  • additionalLayerStores (max 5 stores, supports :ref suffix)
  • Combined storage configurations

Test suite: openshift/additional-storage-support

Summary by CodeRabbit

  • Tests
    • Added comprehensive E2E and validation tests for CRI-O additional stores functionality, covering artifact stores, image stores, layer stores, and combined configurations.
    • Added testing utilities for container runtime configuration validation and worker node setup.

@openshift-merge-bot
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will use /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: automatic mode

@openshift-ci
Contributor

openshift-ci Bot commented Apr 28, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci-robot

openshift-ci-robot commented Apr 28, 2026

@BhargaviGudi: This pull request references OCPNODE-4055 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "5.0.0" version, but no target version was set.

Details

In response to this:

This adds API validation and E2E tests for the Additional Storage Configurations feature (TechPreviewNoUpgrade):

  • additionalArtifactStores (max 10 stores)
  • additionalImageStores (max 10 stores)
  • additionalLayerStores (max 5 stores, supports :ref suffix)
  • Combined storage configurations

Test suite: openshift/additional-storage-support

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 28, 2026
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 28, 2026
@openshift-ci openshift-ci Bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 28, 2026
@coderabbitai

coderabbitai Bot commented Apr 28, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

This pull request adds comprehensive test coverage for OpenShift's CRI-O additional storage support feature (AdditionalLayerStores, AdditionalImageStores, AdditionalArtifactStores). It includes API validation test suites, disruptive E2E lifecycle tests, and supporting test utilities for managing container storage configurations across worker nodes.

Changes

Additional Stores Test Suite

Layer / File(s) | Summary
Test Infrastructure & Utilities
test/extended/node/node_utils.go, test/extended/node/stargz_store_setup.go
Adds helper functions for MCP polling, directory management via MachineConfig/chroot, ContainerRuntimeConfig validation, Pod lifecycle management, image prepopulation, and DaemonSet utilities. Introduces StargzStoreSetup struct with Deploy/Cleanup methods for stargz-store provisioning across worker nodes via privileged DaemonSet.
API Validation Tests
test/extended/node/additional_artifact_stores.go, test/extended/node/additional_image_stores.go, test/extended/node/additional_layer_stores.go, test/extended/node/additional_stores_combined.go
Defines non-disruptive API validation suites (TC1–TC9 per file) testing path format validation, count limits, uniqueness, invalid characters, length constraints, consecutive slashes, and cross-type path sharing for each store type and combinations thereof. Gated by TechPreviewNoUpgrade feature flag.
E2E Lifecycle Tests
test/extended/node/additional_artifact_stores.go, test/extended/node/additional_image_stores.go, test/extended/node/additional_layer_stores.go, test/extended/node/additional_stores_combined.go
Implements serial disruptive E2E suites (TC10–TC15 per file) covering full lifecycle: directory creation on worker nodes, ContainerRuntimeConfig deployment, MCO/MCP rollout waits, storage.conf verification on all workers, configuration updates and removals, pod deployment/validation, and cleanup. Includes Stargz-store and Nydus-store integration scenarios with snapshot and mount verification.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 7 | ❌ 5

❌ Failed checks (5 warnings)

Check name | Status | Explanation | Resolution
• Docstring Coverage | ⚠️ Warning | Docstring coverage is 53.33%, which is insufficient; the required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.
• Test Structure And Quality | ⚠️ Warning | Tests lack proper structure: no AfterEach blocks, 288+ assertions without messages, operations missing timeouts, and large E2E tests that exercise 6+ behaviors, violating single responsibility. | Add AfterEach for cleanup, add failure messages to all Expect assertions, wrap cluster operations in context.WithTimeout, and split large E2E tests into focused single-behavior tests.
• MicroShift Test Compatibility | ⚠️ Warning | Tests use OpenShift APIs that are unavailable on MicroShift (FeatureGate, MachineConfigPool, MachineConfig, ContainerRuntimeConfig) and make multi-node assumptions without MicroShift protection. | Add [apigroup:config.openshift.io] and [apigroup:machineconfiguration.openshift.io] tags to test names, add [Skipped:MicroShift] labels, or guard with exutil.IsMicroShiftCluster() checks.
• Topology-Aware Scheduling Compatibility | ⚠️ Warning | stargz_store_setup.go verifyStargzStoreRunning() filters out control-plane nodes; on SNO, the single node carries both labels and the check fails with "no pure worker nodes found". | Handle the SNO case: skip verification when no pure workers are found, or verify all worker-labeled nodes, including control-plane/worker combinations.
• IPv6 And Disconnected Network Test Compatibility | ⚠️ Warning | Tests require external connectivity to quay.io and github.com without [Skipped:Disconnected] markers. | Add [Skipped:Disconnected] to tests pulling from external registries and to those deploying stargz-store via GitHub downloads.
✅ Passed checks (7 passed)
Check name | Status | Explanation
• Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled.
• Title check | ✅ Passed | The title clearly summarizes the main change: adding a comprehensive test suite for the Additional Storage Support feature, which aligns with the substantial additions of test files for artifact/image/layer stores.
• Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
• Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
• Stable And Deterministic Test Names | ✅ Passed | All 48 test cases use stable, deterministic names with no dynamic information (variables, timestamps, node names, etc.). Test names are descriptive static strings with [TCn] identifiers.
• Single Node OpenShift (SNO) Test Compatibility | ✅ Passed | Tests skip gracefully on SNO: getPureWorkerNodes() filters control-plane nodes, returning an empty list on SNO, and tests then skip with e2eskipper.Skipf().
• OTE Binary Stdout Contract | ✅ Passed | All test files follow the OTE Binary Stdout Contract. No stdout violations: no package-level print statements, no init() functions, and all logging uses framework.Logf, which routes to stderr.


Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 18

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pkg/testsuites/standard_suites.go`:
- Around line 504-506: The suite's Qualifiers currently filter for
`"[Suite:openshift/additional-storage-support]"` but the tests are tagged
`"[Feature:AdditionalStorageSupport]"`, so update the Qualifiers slice in the
suite definition (the `Qualifiers: []string{ ... }` entry) to match the test
tag—e.g. replace or add `name.contains("[Feature:AdditionalStorageSupport]")`
(or include both selectors if you want to run both forms) so the new tests are
actually selected.

In `@test/extended/node/additional_artifact_stores.go`:
- Around line 460-481: The test races because it calls waitForMCP(...)
immediately after mutating a ContainerRuntimeConfig; add a prior gate call to
waitForMCPToStartUpdating(...) so the helper observes the pool beginning to
update before waiting for rollout: call waitForMCPToStartUpdating(ctx, mcClient,
"worker", <short timeout>) and check the error, then call waitForMCP(ctx,
mcClient, "worker", 25*time.Minute) as before; apply the same change for all
other ContainerRuntimeConfig mutation sites (e.g., the blocks referencing
ContainerRuntimeConfig, waitForMCP, and generatedMCName).
- Around line 45-89: The test currently only logs when an invalid
ContainerRuntimeConfig (ctrcfg) is accepted instead of failing the spec; change
the negative-path assertions so each invalid case explicitly fails the test:
inside the loop replace the "if err != nil { framework.Logf(...) } else { ... }"
logic with an assertion that the create returned an error (e.g.,
o.Expect(err).To(o.HaveOccurred())) OR, if the API accepts the object, poll MCO
status and assert the object reaches ContainerRuntimeConfigFailure=True (use the
existing cleanupContainerRuntimeConfig helper for accepted cases). Apply this
pattern for the current g.It block and the other similar blocks referenced
(lines 91-141, 143-177, 209-255, 257-291, 293-325) so tests fail when invalid
configs are accepted instead of merely logging.

In `@test/extended/node/additional_image_stores.go`:
- Around line 534-587: The test currently only logs failures from
prepopulateImageOnNode and registry pulls instead of failing; change it to
assert failures so the spec enforces use of additionalImageStores: replace the
soft log when prepopulateImageOnNode returns an error with a hard assertion
(e.g., o.Expect(err).NotTo(o.HaveOccurred()) or framework.Failf) referencing
prepopulateImageOnNode, and after scanning pod events use the pulledFromRegistry
boolean to assert the pod did NOT pull from the registry
(o.Expect(pulledFromRegistry).To(o.BeFalse()) or framework.Failf if true) so the
test fails when prepopulation fails or when a registry pull is observed for
testPod.
- Around line 612-615: The tests mutate ContainerRuntimeConfig then immediately
call waitForMCP(...), which can race if the MCP is still marked Updated on the
old config; insert a call to waitForMCPToStartUpdating(ctx, mcClient, "worker",
<suitable timeout>) immediately after each
waitForContainerRuntimeConfigSuccess(...) (and before the subsequent
waitForMCP(...)) to ensure the MCP has begun rolling before asserting readiness.
Update all occurrences mentioned (e.g., the instances around
waitForContainerRuntimeConfigSuccess -> waitForMCP at lines referenced and the
other listed locations) so the sequence is:
waitForContainerRuntimeConfigSuccess(...), waitForMCPToStartUpdating(...), then
waitForMCP(...).
- Around line 848-966: The baseline measurement is contaminated because the
image pulled by baselinePod remains in CRI-O local storage on testNode and
speeds up optimizedPod; after baselinePod finishes (or before creating
optimizedPod) remove the image from the node's local container storage so the
optimized test only benefits from additionalImageStores. Specifically, after
waitForPodDeleted/baseline cleanup and before prepopulateImageOnNode/creating
optimizedPod, run a removal on testNode (via ExecOnNodeWithChroot) to delete the
image (e.g., crictl rmi or podman/rpm command appropriate for CRI-O) targeting
testImageDefault, or alternatively change the baseline/optimized pod placement
or pull policies so the baseline pull occurs on a different node than testNode;
update references: baselinePod, optimizedPod, testNode, testImageDefault,
ExecOnNodeWithChroot, prepopulateImageOnNode, waitForPodDeleted accordingly.
- Around line 47-91: Tests currently treat API acceptance alone as success,
masking validation regressions; update the spec so each invalid-path case fails
unless the Create call returns an error or the MachineConfigOperator surfaces
the ContainerRuntimeConfigFailure=True condition. Concretely, in the test that
calls mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create (and in
the other similar blocks), change the success branch to poll the MCO for the
ContainerRuntimeConfigFailure condition for the created resource (and only defer
cleanupContainerRuntimeConfig if you will delete it on success), and assert the
test fails if neither the Create returned an error nor the MCO condition becomes
True within a timeout. Include these checks for all analogous test blocks
mentioned (lines ranges in the comment).

In `@test/extended/node/additional_layer_stores.go`:
- Around line 721-725: After mutating the ContainerRuntimeConfig (e.g., after
calling waitForContainerRuntimeConfigSuccess with ctrcfg.Name), ensure you block
until the MCP has left its old steady state by calling
waitForMCPToStartUpdating(ctx, mcClient, "worker", <suitable timeout>) before
calling waitForMCP(ctx, mcClient, "worker", ...); specifically, insert a call to
waitForMCPToStartUpdating between waitForContainerRuntimeConfigSuccess(...) and
waitForMCP(...). Apply the same insertion for the other similar flows referenced
(lines around the other waitForMCP usages) so the final waitForMCP polls the new
rollout rather than returning on the previous rendered config.
- Around line 167-213: The test currently treats successful creation of a
ContainerRuntimeConfig with an invalid path as a warning; change it to fail the
test when the API accepts an invalid path. In the loop that calls
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create (and
currently logs "Warning: Path ... accepted"), replace that branch to defer
cleanupContainerRuntimeConfig(ctx, mcClient, ctrcfg.Name) and then assert
failure using the test's assertion helpers (e.g.,
o.Expect(err).To(o.HaveOccurred(), "expected create to fail for path %s",
tc.path) or explicitly call g.Fail/fatal) so the test fails if the create
succeeds; keep the existing behavior that logs success when creation is
rejected.
- Around line 1448-1453: Replace the hardcoded personal image string assigned to
eStargzImage with the project-managed image lookup using image.LocationFor(...)
and add that image's repository:tag to the allowedImages registry in the image
utility so CI doesn't depend on a personal quay repo; specifically, add the
stable repository entry to allowedImages (in the image registry map) and change
eStargzImage to use image.LocationFor("your-org/your-image:tag") or the chosen
stable org image name following the same pattern used by other node tests.

In `@test/extended/node/additional_stores_combined.go`:
- Around line 209-218: The test currently treats both success and failure as
passing because it only logs on error; change the branching to assert
explicitly: where you call
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(ctx, ctrcfg,
metav1.CreateOptions{}) (and the similar block at the other occurrence), replace
the current if/else with an assertion that err did not occur (e.g.,
o.Expect(err).NotTo(o.HaveOccurred()) or equivalent) when the spec is supposed
to be accepted, then defer cleanupContainerRuntimeConfig and run the o.Expect
checks on created.Spec...; if the test is intended to assert rejection instead,
assert that err did occur (o.Expect(err).To(HaveOccurred())) and remove the
success-path expectations. Reference symbols:
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create,
cleanupContainerRuntimeConfig,
created.Spec.ContainerRuntimeConfig.AdditionalLayerStores/AdditionalImageStores/AdditionalArtifactStores,
and apply the same change to the second block around the other Create call.
- Around line 563-566: After computing pureWorkers via getPureWorkerNodes
(called after getNodesByLabel), add a guard that fails the test when pureWorkers
is empty; specifically, immediately after pureWorkers :=
getPureWorkerNodes(workerNodes) assert that len(pureWorkers) > 0 (e.g.,
o.Expect(len(pureWorkers)).NotTo(o.BeZero()) with a descriptive message) so
TC12/TC13 don’t silently no-op when there are zero pure workers.
- Around line 363-365: The current test only asserts an error occurred when
creating the ContainerRuntimeConfig but may match unrelated failures; after the
existing o.Expect(err).To(o.HaveOccurred()) add a stronger assertion that the
error message references the specific invalid image path used in the test (the
invalid image path value supplied when building ctrcfg). Locate the Create call
(mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create) and the
ctrcfg variable and assert err.Error() (or the testing matcher equivalent)
contains the exact invalid image path string used to construct ctrcfg so only
the intended validation rejection passes.
- Around line 70-73: The test is creating real ContainerRuntimeConfig objects
(mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create with
metav1.CreateOptions{}) which triggers MCO reconciliation; change these TC1–TC10
Create calls to use a dry-run by passing metav1.CreateOptions{DryRun:
[]string{"All"}} so the API validates without persisting, and remove or skip the
matching deferred cleanupContainerRuntimeConfig(ctx, mcClient, ctrcfg.Name) for
those dry-run cases since nothing is created; update every occurrence of
Create(...) and the subsequent defer in the file (references:
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create,
metav1.CreateOptions{}, and cleanupContainerRuntimeConfig) accordingly.

In `@test/extended/node/stargz_store_setup.go`:
- Around line 409-425: The loop over pureWorkers currently swallows errors from
ExecOnNodeWithChroot (used for "systemctl is-active stargz-store" and "mount"),
logging warnings and continuing, which allows Deploy to falsely succeed; change
both error branches to return a wrapped error instead of continuing so the
caller fails fast (include node.Name and the original err in the returned
fmt.Errorf) and keep the existing status check that returns an error when the
service is not "active"; update the branches around ExecOnNodeWithChroot calls
in this block to propagate errors rather than logging and continuing.
- Around line 79-107: The Cleanup method (StargzStoreSetup.Cleanup) currently
only deletes the namespace (s.namespace) and must also undo node-level changes:
for each worker node reachable via s.oc (e.g., using the cluster's node exec/SSH
helper), stop and disable the stargz-store systemd unit, remove the installed
binary and config directories (the host paths the deployment created), reload
systemd, and restart CRI-O to clear side effects; perform these steps before or
after namespace delete, ensure s.deployed is set false only after node cleanup,
and log warnings on non-fatal failures while returning an error only on fatal
failures.
- Around line 45-75: Set the setup as deployed earlier and make cleanup safe to
call on partial failures: in Deploy, mark s.deployed = true immediately after
the cluster-mutating resources exist (e.g., after successful createNamespace and
createConfigMap or at least right after createDaemonSet is called) so
AfterEach/Cleanup will run even on later verification failures; also update
Cleanup/IsDeployed/Cleanup helpers to be idempotent (safe to delete
namespace/daemonset/configmap even if some resources were never created) and
make createDaemonSet, waitForDaemonSetReady, and verifyStargzStoreRunning
tolerant of repeated cleanup attempts.
- Around line 217-220: The test currently downloads the stargz-snapshotter
binary via curl (the stargz-snapshotter release fetch block); remove that
external GitHub download and instead load the binary from a cluster-local or
prepackaged source: change the setup to first check a configurable local path or
environment variable (e.g., STARGZ_SNAPSHOTTER_BIN) and use that binary if
present, otherwise fall back to a cluster-local payload URL served inside CI;
also remove the curl/URL usage and add a clear error message if neither local
binary nor cluster payload is available so tests fail fast in restricted-network
environments.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 626a6a31-816a-4934-9431-1563660f7dba

📥 Commits

Reviewing files that changed from the base of the PR and between b6f09c2 and bf419f3.

📒 Files selected for processing (7)
  • pkg/testsuites/standard_suites.go
  • test/extended/node/additional_artifact_stores.go
  • test/extended/node/additional_image_stores.go
  • test/extended/node/additional_layer_stores.go
  • test/extended/node/additional_stores_combined.go
  • test/extended/node/node_utils.go
  • test/extended/node/stargz_store_setup.go

Comment thread pkg/testsuites/standard_suites.go
Comment on lines +45 to +89
// TC1: Validate Path Format Restrictions
g.It("should reject invalid path formats for additionalArtifactStores [TC1]", func(ctx context.Context) {
    mcClient, err := mcclient.NewForConfig(oc.KubeFramework().ClientConfig())
    o.Expect(err).NotTo(o.HaveOccurred())

    invalidPaths := []struct {
        name        string
        path        string
        description string
    }{
        {"relative-path", "relative/path", "relative path without leading slash"},
        {"empty-path", "", "empty path"},
    }

    for _, tc := range invalidPaths {
        g.By(fmt.Sprintf("Testing invalid path: %s (%s)", tc.path, tc.description))
        ctrcfg := &machineconfigv1.ContainerRuntimeConfig{
            ObjectMeta: metav1.ObjectMeta{
                Name: fmt.Sprintf("artifact-invalid-path-test-%s", tc.name),
            },
            Spec: machineconfigv1.ContainerRuntimeConfigSpec{
                MachineConfigPoolSelector: &metav1.LabelSelector{
                    MatchLabels: map[string]string{
                        "pools.operator.machineconfiguration.openshift.io/worker": "",
                    },
                },
                ContainerRuntimeConfig: &machineconfigv1.ContainerRuntimeConfiguration{
                    AdditionalArtifactStores: []machineconfigv1.AdditionalArtifactStore{
                        {Path: machineconfigv1.StorePath(tc.path)},
                    },
                },
            },
        }

        _, err := mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(ctx, ctrcfg, metav1.CreateOptions{})
        if err != nil {
            framework.Logf("Path '%s' correctly rejected at API level: %v", tc.path, err)
        } else {
            defer cleanupContainerRuntimeConfig(ctx, mcClient, ctrcfg.Name)
            framework.Logf("Path '%s' accepted at API level, checking MCO validation", tc.path)
        }
    }

    framework.Logf("Test PASSED: Invalid path formats handled correctly")
})


⚠️ Potential issue | 🟠 Major

These “should reject …” specs do not actually fail when the invalid config is accepted.

Several negative-path tests only log a warning, or poll the status without asserting that MCO set a failure condition. If validation regresses, the suite will still go green. Each of these specs should Expect a create error or explicitly Expect ContainerRuntimeConfigFailure=True.

Also applies to: 91-141, 143-177, 209-255, 257-291, 293-325
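
One way to encode that contract, sketched against the mcclient alias, Gomega assertions, and the cleanupContainerRuntimeConfig helper already used by these tests; the helper name expectCtrcfgRejected is hypothetical and is not part of this PR:

func expectCtrcfgRejected(ctx context.Context, mcClient mcclient.Interface, ctrcfg *machineconfigv1.ContainerRuntimeConfig, createErr error) {
    if createErr != nil {
        // Rejected at the API layer; nothing was persisted, so the spec can pass.
        framework.Logf("config %q rejected at API level: %v", ctrcfg.Name, createErr)
        return
    }
    // The API accepted the object, so the MCO must mark it as failed within the timeout.
    defer cleanupContainerRuntimeConfig(ctx, mcClient, ctrcfg.Name)
    pollErr := wait.PollUntilContextTimeout(ctx, 5*time.Second, 2*time.Minute, true, func(ctx context.Context) (bool, error) {
        cfg, err := mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Get(ctx, ctrcfg.Name, metav1.GetOptions{})
        if err != nil {
            return false, err
        }
        for _, cond := range cfg.Status.Conditions {
            if cond.Type == machineconfigv1.ContainerRuntimeConfigFailure && cond.Status == corev1.ConditionTrue {
                return true, nil
            }
        }
        return false, nil
    })
    o.Expect(pollErr).NotTo(o.HaveOccurred(), "config %q was accepted and never reached ContainerRuntimeConfigFailure=True", ctrcfg.Name)
}

Each negative case in the loop would then call expectCtrcfgRejected(ctx, mcClient, ctrcfg, err) instead of branching on framework.Logf.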

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/additional_artifact_stores.go` around lines 45 - 89, The
test currently only logs when an invalid ContainerRuntimeConfig (ctrcfg) is
accepted instead of failing the spec; change the negative-path assertions so
each invalid case explicitly fails the test: inside the loop replace the "if err
!= nil { framework.Logf(...) } else { ... }" logic with an assertion that the
create returned an error (e.g., o.Expect(err).To(o.HaveOccurred())) OR, if the
API accepts the object, poll MCO status and assert the object reaches
ContainerRuntimeConfigFailure=True (use the existing
cleanupContainerRuntimeConfig helper for accepted cases). Apply this pattern for
the current g.It block and the other similar blocks referenced (lines 91-141,
143-177, 209-255, 257-291, 293-325) so tests fail when invalid configs are
accepted instead of merely logging.

Comment thread test/extended/node/additional_artifact_stores.go
Comment thread test/extended/node/additional_image_stores.go
Comment thread test/extended/node/additional_image_stores.go Outdated
Comment on lines +563 to +566
workerNodes, err := getNodesByLabel(ctx, oc, "node-role.kubernetes.io/worker")
o.Expect(err).NotTo(o.HaveOccurred())
pureWorkers := getPureWorkerNodes(workerNodes)



⚠️ Potential issue | 🟠 Major

Guard TC12/TC13 against zero pure workers.

Line 565 and Line 660 compute pureWorkers but don’t enforce non-empty. If it’s empty, setup and verification loops can no-op and the test passes without validation.

Proposed guard (same pattern in TC12 and TC13)
 workerNodes, err := getNodesByLabel(ctx, oc, "node-role.kubernetes.io/worker")
 o.Expect(err).NotTo(o.HaveOccurred())
 pureWorkers := getPureWorkerNodes(workerNodes)
+if len(pureWorkers) < 1 {
+    e2eskipper.Skipf("Need at least 1 pure worker node for this test")
+}

Also applies to: 658-661

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/additional_stores_combined.go` around lines 563 - 566,
After computing pureWorkers via getPureWorkerNodes (called after
getNodesByLabel), add a guard that fails the test when pureWorkers is empty;
specifically, immediately after pureWorkers := getPureWorkerNodes(workerNodes)
assert that len(pureWorkers) > 0 (e.g.,
o.Expect(len(pureWorkers)).NotTo(o.BeZero()) with a descriptive message) so
TC12/TC13 don’t silently no-op when there are zero pure workers.

Comment on lines +45 to +75
func (s *StargzStoreSetup) Deploy(ctx context.Context) error {
    framework.Logf("Deploying stargz-store to cluster...")

    // Create namespace
    if err := s.createNamespace(ctx); err != nil {
        return fmt.Errorf("failed to create namespace: %w", err)
    }

    // Create ConfigMap with stargz-store config and systemd service
    if err := s.createConfigMap(ctx); err != nil {
        return fmt.Errorf("failed to create configmap: %w", err)
    }

    // Create DaemonSet
    if err := s.createDaemonSet(ctx); err != nil {
        return fmt.Errorf("failed to create daemonset: %w", err)
    }

    // Wait for DaemonSet to be ready
    if err := s.waitForDaemonSetReady(ctx, 10*time.Minute); err != nil {
        return fmt.Errorf("failed waiting for daemonset: %w", err)
    }

    // Verify stargz-store is running on nodes
    if err := s.verifyStargzStoreRunning(ctx); err != nil {
        return fmt.Errorf("stargz-store verification failed: %w", err)
    }

    s.deployed = true
    framework.Logf("stargz-store deployed successfully")
    return nil


⚠️ Potential issue | 🟠 Major

Partial deploy failures will leak cluster state.

AfterEach only calls Cleanup when IsDeployed() is true, but deployed is set only after all verification succeeds. If createDaemonSet, rollout, or node verification fails after any host mutation, cleanup is skipped entirely. Make cleanup idempotent and mark the setup as deployed as soon as the namespace/resources exist.
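
A minimal sketch of that ordering change, reusing the method and field names from the quoted Deploy; it assumes Cleanup is (or is made) safe to call even when later steps never ran:

func (s *StargzStoreSetup) Deploy(ctx context.Context) error {
    framework.Logf("Deploying stargz-store to cluster...")

    if err := s.createNamespace(ctx); err != nil {
        return fmt.Errorf("failed to create namespace: %w", err)
    }
    // From this point on the cluster has been mutated, so flag the setup as
    // deployed now; AfterEach will then run Cleanup even if a later step fails.
    s.deployed = true

    if err := s.createConfigMap(ctx); err != nil {
        return fmt.Errorf("failed to create configmap: %w", err)
    }
    if err := s.createDaemonSet(ctx); err != nil {
        return fmt.Errorf("failed to create daemonset: %w", err)
    }
    if err := s.waitForDaemonSetReady(ctx, 10*time.Minute); err != nil {
        return fmt.Errorf("failed waiting for daemonset: %w", err)
    }
    if err := s.verifyStargzStoreRunning(ctx); err != nil {
        return fmt.Errorf("stargz-store verification failed: %w", err)
    }

    framework.Logf("stargz-store deployed successfully")
    return nil
}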

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/stargz_store_setup.go` around lines 45 - 75, Set the setup
as deployed earlier and make cleanup safe to call on partial failures: in
Deploy, mark s.deployed = true immediately after the cluster-mutating resources
exist (e.g., after successful createNamespace and createConfigMap or at least
right after createDaemonSet is called) so AfterEach/Cleanup will run even on
later verification failures; also update Cleanup/IsDeployed/Cleanup helpers to
be idempotent (safe to delete namespace/daemonset/configmap even if some
resources were never created) and make createDaemonSet, waitForDaemonSetReady,
and verifyStargzStoreRunning tolerant of repeated cleanup attempts.

Comment on lines +79 to +107
func (s *StargzStoreSetup) Cleanup(ctx context.Context) error {
    if !s.deployed {
        return nil
    }

    framework.Logf("Cleaning up stargz-store...")

    // Delete namespace (cascades to all resources)
    err := s.oc.AdminKubeClient().CoreV1().Namespaces().Delete(ctx, s.namespace, metav1.DeleteOptions{})
    if err != nil && !apierrors.IsNotFound(err) {
        framework.Logf("Warning: failed to delete namespace %s: %v", s.namespace, err)
    }

    // Wait for namespace to be deleted
    err = wait.PollUntilContextTimeout(ctx, 5*time.Second, 2*time.Minute, true, func(ctx context.Context) (bool, error) {
        _, err := s.oc.AdminKubeClient().CoreV1().Namespaces().Get(ctx, s.namespace, metav1.GetOptions{})
        if apierrors.IsNotFound(err) {
            return true, nil
        }
        return false, nil
    })

    if err != nil {
        framework.Logf("Warning: namespace deletion timed out: %v", err)
    }

    s.deployed = false
    framework.Logf("stargz-store cleanup completed")
    return nil


⚠️ Potential issue | 🔴 Critical

Cleanup does not actually uninstall stargz-store from the nodes.

This only deletes the namespace. The binary, config, systemd unit, enabled service, and CRI-O restart side effects all remain on each worker, so later specs inherit modified nodes and can pass or fail for the wrong reason. Cleanup needs to stop/disable the service on every worker, remove the host files/directories, reload systemd, and restart CRI-O before dropping the namespace.
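
A hedged sketch of that node-level teardown, reusing the ExecOnNodeWithChroot helper from this PR; the unit name, binary path, and store directory below are assumptions that must match whatever the DaemonSet script actually installs:

// cleanupNodes undoes the host-level changes on each worker; callers would pass
// the same pure-worker list used during deployment.
func (s *StargzStoreSetup) cleanupNodes(workers []corev1.Node) {
    for _, node := range workers {
        // Stop and disable the service; tolerate nodes where it was never installed.
        if _, err := ExecOnNodeWithChroot(s.oc, node.Name, "systemctl", "disable", "--now", "stargz-store"); err != nil {
            framework.Logf("Warning: could not stop stargz-store on %s: %v", node.Name, err)
        }
        // Remove the installed binary, unit file, and store directory (assumed paths).
        if _, err := ExecOnNodeWithChroot(s.oc, node.Name, "rm", "-rf",
            "/usr/local/bin/stargz-store", "/etc/systemd/system/stargz-store.service", "/var/lib/stargz-store"); err != nil {
            framework.Logf("Warning: could not remove stargz-store files on %s: %v", node.Name, err)
        }
        // Reload systemd and restart CRI-O so the node returns to its pre-test state.
        if _, err := ExecOnNodeWithChroot(s.oc, node.Name, "systemctl", "daemon-reload"); err != nil {
            framework.Logf("Warning: daemon-reload failed on %s: %v", node.Name, err)
        }
        if _, err := ExecOnNodeWithChroot(s.oc, node.Name, "systemctl", "restart", "crio"); err != nil {
            framework.Logf("Warning: crio restart failed on %s: %v", node.Name, err)
        }
    }
}

Cleanup would run this loop before (or after) deleting the namespace and only set s.deployed = false once it has completed.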

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/stargz_store_setup.go` around lines 79 - 107, The Cleanup
method (StargzStoreSetup.Cleanup) currently only deletes the namespace
(s.namespace) and must also undo node-level changes: for each worker node
reachable via s.oc (e.g., using the cluster's node exec/SSH helper), stop and
disable the stargz-store systemd unit, remove the installed binary and config
directories (the host paths the deployment created), reload systemd, and restart
CRI-O to clear side effects; perform these steps before or after namespace
delete, ensure s.deployed is set false only after node cleanup, and log warnings
on non-fatal failures while returning an error only on fatal failures.

Comment on lines +217 to +220
# Download stargz-snapshotter release
echo "Downloading stargz-snapshotter %s..."
curl -L -o /tmp/stargz.tar.gz \
https://github.com/containerd/stargz-snapshotter/releases/download/%s/stargz-snapshotter-%s-linux-amd64.tar.gz


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, let's check the target file and surrounding context
head -n 230 test/extended/node/stargz_store_setup.go | tail -n 25

Repository: openshift/origin

Length of output: 949


🏁 Script executed:

# Search for references to stargz in the test setup
rg "stargz" test/extended/node/ -A 3 -B 3

Repository: openshift/origin

Length of output: 50372


🏁 Script executed:

# Check if there are any test image definitions or Dockerfiles
fd -e Dockerfile -e dockerfile test/

Repository: openshift/origin

Length of output: 386


🏁 Script executed:

# Look for environment variable handling or cluster-local payload alternatives
rg "stargz|cluster.*local|payload" test/extended/ --type go -i

Repository: openshift/origin

Length of output: 50374


Remove GitHub release download dependency during test initialization.

The curl download from a public GitHub release (lines 217-220) introduces an external internet dependency that makes the test vulnerable to network flakiness and restricted network policies. In CI environments with limited or no external connectivity, this turns the test suite into an infrastructure verification rather than a feature verification.

Pre-package the stargz-snapshotter binary (v0.18.2) in the test container image or implement a cluster-local payload source with a fallback mechanism to handle environments where GitHub is unreachable.
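
A small sketch of that fallback in Go; STARGZ_SNAPSHOTTER_BIN comes from the suggestion elsewhere in this review, while STARGZ_SNAPSHOTTER_URL and the function name are placeholders rather than anything defined in this PR (only the standard os and fmt packages are assumed):

func resolveStargzSnapshotterSource() (location string, localFile bool, err error) {
    // Prefer a binary already available to the test (e.g. baked into the test image).
    if p := os.Getenv("STARGZ_SNAPSHOTTER_BIN"); p != "" {
        if _, statErr := os.Stat(p); statErr == nil {
            return p, true, nil
        }
    }
    // Otherwise fall back to a cluster-local payload URL served inside CI, if configured.
    if u := os.Getenv("STARGZ_SNAPSHOTTER_URL"); u != "" {
        return u, false, nil
    }
    return "", false, fmt.Errorf("no stargz-snapshotter source configured; set STARGZ_SNAPSHOTTER_BIN (local binary) or STARGZ_SNAPSHOTTER_URL (cluster-local payload) so the test does not depend on github.com")
}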

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/stargz_store_setup.go` around lines 217 - 220, The test
currently downloads the stargz-snapshotter binary via curl (the
stargz-snapshotter release fetch block); remove that external GitHub download
and instead load the binary from a cluster-local or prepackaged source: change
the setup to first check a configurable local path or environment variable
(e.g., STARGZ_SNAPSHOTTER_BIN) and use that binary if present, otherwise fall
back to a cluster-local payload URL served inside CI; also remove the curl/URL
usage and add a clear error message if neither local binary nor cluster payload
is available so tests fail fast in restricted-network environments.

Comment on lines +409 to +425
for _, node := range pureWorkers {
    // Check stargz-store service status
    output, err := ExecOnNodeWithChroot(s.oc, node.Name, "systemctl", "is-active", "stargz-store")
    if err != nil {
        framework.Logf("Warning: failed to check stargz-store status on node %s: %v", node.Name, err)
        continue
    }

    if strings.TrimSpace(output) != "active" {
        return fmt.Errorf("stargz-store is not active on node %s (status: %s)", node.Name, strings.TrimSpace(output))
    }

    // Check FUSE mount
    mountOutput, err := ExecOnNodeWithChroot(s.oc, node.Name, "mount")
    if err != nil {
        framework.Logf("Warning: failed to check mounts on node %s: %v", node.Name, err)
        continue


⚠️ Potential issue | 🟠 Major

Fail verification on per-node command errors instead of logging and continuing.

If ExecOnNodeWithChroot fails on a worker, Deploy can still report success even though that node was never validated. That produces false greens and very hard-to-debug downstream failures in the E2E specs.
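
The same loop with the error branches propagated, sketched against the quoted verifyStargzStoreRunning code; the final Contains check is an assumption, since the original snippet is truncated before its mount assertion:

for _, node := range pureWorkers {
    output, err := ExecOnNodeWithChroot(s.oc, node.Name, "systemctl", "is-active", "stargz-store")
    if err != nil {
        // Propagate instead of continuing so Deploy cannot report success for an unverified node.
        return fmt.Errorf("failed to check stargz-store status on node %s: %w", node.Name, err)
    }
    if strings.TrimSpace(output) != "active" {
        return fmt.Errorf("stargz-store is not active on node %s (status: %s)", node.Name, strings.TrimSpace(output))
    }
    mountOutput, err := ExecOnNodeWithChroot(s.oc, node.Name, "mount")
    if err != nil {
        return fmt.Errorf("failed to check mounts on node %s: %w", node.Name, err)
    }
    if !strings.Contains(mountOutput, "stargz") {
        return fmt.Errorf("no stargz FUSE mount found on node %s", node.Name)
    }
}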

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/stargz_store_setup.go` around lines 409 - 425, The loop
over pureWorkers currently swallows errors from ExecOnNodeWithChroot (used for
"systemctl is-active stargz-store" and "mount"), logging warnings and
continuing, which allows Deploy to falsely succeed; change both error branches
to return a wrapped error instead of continuing so the caller fails fast
(include node.Name and the original err in the returned fmt.Errorf) and keep the
existing status check that returns an error when the service is not "active";
update the branches around ExecOnNodeWithChroot calls in this block to propagate
errors rather than logging and continuing.

@BhargaviGudi BhargaviGudi force-pushed the additionalArtifactsAutomation branch from bf419f3 to f09da80 on Apr 28, 2026 at 07:43
@openshift-ci openshift-ci Bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 28, 2026

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (9)
test/extended/node/additional_layer_stores.go (3)

167-213: ⚠️ Potential issue | 🟠 Major

TC5 still allows a green pass when invalid paths are accepted.

In the success branch, the test only logs a warning and continues, so invalid-character validation regressions won’t fail the spec.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/additional_layer_stores.go` around lines 167 - 213, The
test loop currently treats acceptance of invalid paths as a warning instead of
failing the spec; update the loop in the "should reject path with invalid
characters for additionalLayerStores [TC5]" test so that when
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create returns no
error for a test case (i.e., the API accepted the invalid path), the test fails
(use the test assertion helper, e.g., o.Expect(err).To(o.HaveOccurred()) or
o.Expect(false).To(o.BeTrue()) with a clear message) rather than merely logging
a warning; keep the cleanup via cleanupContainerRuntimeConfig(ctx, mcClient,
ctrcfg.Name) for the success branch but ensure the assertion runs so the spec
fails on regressions.

721-725: ⚠️ Potential issue | 🟠 Major

Several mutation flows still miss MCP “start updating” synchronization.

Direct waitForMCP(...) calls here can complete against the pre-change state and race subsequent assertions.

Also applies to: 738-740, 791-795, 815-819, 861-866, 938-942, 1049-1053, 1129-1135, 1192-1212, 1496-1500, 1726-1730
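
A possible shape for that gate, assuming the mcclient alias and the MachineConfigPoolUpdating condition type from the machineconfiguration v1 API; the name mirrors the waitForMCPToStartUpdating helper referenced in these comments, which may still need to be added to node_utils.go:

func waitForMCPToStartUpdating(ctx context.Context, mcClient mcclient.Interface, pool string, timeout time.Duration) error {
    return wait.PollUntilContextTimeout(ctx, 5*time.Second, timeout, true, func(ctx context.Context) (bool, error) {
        mcp, err := mcClient.MachineconfigurationV1().MachineConfigPools().Get(ctx, pool, metav1.GetOptions{})
        if err != nil {
            return false, nil // tolerate transient API errors while polling
        }
        for _, cond := range mcp.Status.Conditions {
            if cond.Type == machineconfigv1.MachineConfigPoolUpdating && cond.Status == corev1.ConditionTrue {
                return true, nil // the pool has begun rolling out the new rendered config
            }
        }
        return false, nil
    })
}

Each mutation flow would call this between the ContainerRuntimeConfig change and the final waitForMCP(...), matching the proposed sequence shown for the image-stores tests below.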

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/additional_layer_stores.go` around lines 721 - 725, The
direct waitForMCP(...) calls can race the MCP "start updating" transition;
before each waitForMCP(ctx, mcClient, "worker", ...), insert a call to a
synchronization helper that waits for the MCP to enter the "Updating" state
(e.g., call waitForMCPStartUpdating(ctx, mcClient, "worker", <short timeout>))
so the subsequent waitForMCP observes the in-progress update rather than the
pre-change state; update every occurrence (the shown call and the other
duplicated spots) to perform waitForMCPStartUpdating(...) immediately before
waitForMCP(...).

1435-1453: ⚠️ Potential issue | 🟠 Major

Replace the personal external image dependency with a project-managed image lookup.

Hardcoding quay.io/bgudi/test-small:estargz makes CI fragile (ownership/tag availability risk). Prefer a repository-managed image via the standard test-image registry mechanism.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/additional_layer_stores.go` around lines 1435 - 1453,
Replace the hardcoded personal image string assigned to eStargzImage in the test
(additional_layer_stores.go) with the project-managed test-image lookup helper
(instead of "quay.io/bgudi/test-small:estargz"); call the centralized test image
registry/helper used elsewhere in the repo (for example the project’s test image
getter or framework image helper) to obtain the estargz test image, assign its
result to eStargzImage, and remove the literal so CI uses a repository-managed
image tag.
test/extended/node/additional_image_stores.go (4)

612-616: ⚠️ Potential issue | 🟠 Major

Rollout wait is racy without MCP “start updating” gate.

These flows mutate ContainerRuntimeConfig and then immediately call waitForMCP(...), which can observe the prior steady state and return early.

Proposed sequence for each mutation path
 err = waitForContainerRuntimeConfigSuccess(ctx, mcClient, ctrcfg.Name, 5*time.Minute)
 o.Expect(err).NotTo(o.HaveOccurred())

+err = waitForMCPToStartUpdating(ctx, mcClient, "worker", 5*time.Minute)
+o.Expect(err).NotTo(o.HaveOccurred())
+
 err = waitForMCP(ctx, mcClient, "worker", 25*time.Minute)
 o.Expect(err).NotTo(o.HaveOccurred())

Also applies to: 629-631, 682-686, 706-710, 753-757, 809-813, 1028-1032, 1100-1104, 1210-1214, 1290-1296, 1353-1357, 1369-1373

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/additional_image_stores.go` around lines 612 - 616, The
test is racy because after mutating ContainerRuntimeConfig (ctrcfg.Name) the
code calls waitForMCP(...) immediately and may see the old steady state; update
each mutation flow (where waitForContainerRuntimeConfigSuccess(ctx, mcClient,
ctrcfg.Name, ...) is followed by waitForMCP(ctx, mcClient, "worker", ...)) to
first wait for the MCP to enter the "updating" phase (i.e., a start-updating
gate) before calling waitForMCP for completion: add a helper/check that polls
the MachineConfigPool/MC status for .status.degraded===false and
.status.conditions includes an Updating/Progressing condition (or reuse an
existing waitForMCPState helper if present) using mcClient and ctrcfg.Name, then
only call waitForMCP(ctx, mcClient, "worker", ...) to wait for finished rollout;
apply this change to all similar sites (the listed line groups).

47-91: ⚠️ Potential issue | 🟠 Major

Negative-path specs can pass even when invalid configs are accepted.

These branches only log on successful create and/or don’t assert the poll result, so validation regressions can still report green.

Proposed pattern fix
- _, err := mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(ctx, ctrcfg, metav1.CreateOptions{})
- if err != nil {
-   framework.Logf("... rejected ...")
- } else {
-   defer cleanupContainerRuntimeConfig(ctx, mcClient, ctrcfg.Name)
-   framework.Logf("... accepted ..., checking MCO validation")
- }
+ _, err := mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(ctx, ctrcfg, metav1.CreateOptions{})
+ if err == nil {
+   defer cleanupContainerRuntimeConfig(ctx, mcClient, ctrcfg.Name)
+   err = wait.PollUntilContextTimeout(ctx, 5*time.Second, 2*time.Minute, true, func(ctx context.Context) (bool, error) {
+     cfg, getErr := mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Get(ctx, ctrcfg.Name, metav1.GetOptions{})
+     if getErr != nil {
+       return false, getErr
+     }
+     for _, condition := range cfg.Status.Conditions {
+       if condition.Type == machineconfigv1.ContainerRuntimeConfigFailure && condition.Status == corev1.ConditionTrue {
+         return true, nil
+       }
+     }
+     return false, nil
+   })
+ }
+ o.Expect(err).To(o.HaveOccurred(), "invalid config must be rejected by API or by MCO failure condition")

Also applies to: 120-142, 145-179, 211-257, 259-293, 295-327

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/additional_image_stores.go` around lines 47 - 91, The test
currently only logs when an invalid ContainerRuntimeConfig (ctrcfg) is accepted,
so create success can silently pass; update the loop that calls
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create to assert
outcomes: if Create returns an error, keep the current log; if Create succeeds,
immediately defer cleanupContainerRuntimeConfig(ctx, mcClient, ctrcfg.Name) and
then poll the MCO validation (replace the current log path) — use an assertion
(o.Expect / g.Eventually) to fail the test if the config remains accepted by the
MCO (i.e., wait for the object to be rejected/rolled back or for the MCO to
surface an error), and fail the test instead of just logging; ensure these
checks reference ctrcfg, ContainerRuntimeConfigs().Create, and
cleanupContainerRuntimeConfig so the validation branch cannot silently pass.

535-538: ⚠️ Potential issue | 🟠 Major

Feature assertion is soft; tests still pass on prepopulation or fallback pull failures.

Both TC10 and TC15 only log when prepopulation fails or registry pulls occur, so the core “use additionalImageStores” behavior isn’t enforced.

Also applies to: 559-587, 933-937, 989-999
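
A sketch of the hardened checks, assuming the prepopulateImageOnNode signature and the pulledFromRegistry flag that TC10/TC15 already derive from pod events:

err = prepopulateImageOnNode(ctx, oc, testNode, testImageDefault, additionalImageStorePath)
o.Expect(err).NotTo(o.HaveOccurred(), "prepopulating %s into the additional image store on %s must succeed", testImageDefault, testNode)

// ... deploy testPod and scan its events into pulledFromRegistry as before ...

o.Expect(pulledFromRegistry).To(o.BeFalse(),
    "pod %s pulled %s from the registry instead of the additional image store", testPod.Name, testImageDefault)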

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/additional_image_stores.go` around lines 535 - 538, The
tests currently only log failures from prepopulateImageOnNode and registry pulls
(e.g., calls using prepopulateImageOnNode(ctx, oc, testNode, testImageDefault,
additionalImageStorePath)), making the feature check soft; change those log-only
branches to assert/fail the test instead (use the test framework's failure
helper such as framework.Failf or ExpectNoError) so the test fails when
prepopulation or pull fails; apply the same change to the other occurrences
where prepopulateImageOnNode or registry pull errors are merely logged (the
duplicated call sites mentioned) so the behavior is actually enforced rather
than only advertised in logs.

850-883: ⚠️ Potential issue | 🟠 Major

Performance comparison is biased by baseline cache warm-up.

The baseline pod pulls the same image on the same node, but local image cache is not cleared before the optimized run. The “optimized faster” result can be unrelated to additionalImageStores.

Also applies to: 939-987
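
One hedged way to level the cache between the two runs, reusing waitForPodDeleted and ExecOnNodeWithChroot from this PR; the crictl rmi invocation is an assumption about the tooling available on the node:

err = waitForPodDeleted(ctx, oc, baselinePod.Name, 2*time.Minute)
o.Expect(err).NotTo(o.HaveOccurred(), "baseline pod %s was not deleted in time", baselinePod.Name)

// Evict the baseline pull from CRI-O local storage so the optimized run starts cold
// and can only be faster because of additionalImageStores.
out, err := ExecOnNodeWithChroot(oc, testNode, "crictl", "rmi", testImageDefault)
o.Expect(err).NotTo(o.HaveOccurred(), "failed to remove %s from CRI-O storage on %s: %s", testImageDefault, testNode, out)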

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/additional_image_stores.go` around lines 850 - 883, The
baseline measurement is biased by a warm image cache because the same image
(testImageDefault) is pulled on the node and not removed before the optimized
run; fix by ensuring a consistent cache state before each measurement: remove
the image from the node(s) (or use a fresh unique image tag) prior to creating
baselinePod and again prior to the optimized run so both runs start from the
same cold-cache state. Locate the pod creation flow (baselinePod,
waitForPodRunning) and add a pre-step that evicts the image (for example by
running a node-level image remove via the test harness/oc or executing a
privileged pod that runs the container runtime image remove) or switch to using
a uniquely tagged test image for each run; apply the same change for the similar
block around the second range referenced in the comment.
test/extended/node/additional_artifact_stores.go (2)

45-89: ⚠️ Potential issue | 🟠 Major

Invalid-input tests can false-pass when bad configs are accepted.

Several “should reject …” specs only log acceptance (or poll without asserting final failure), so regressions can pass undetected.

Also applies to: 91-141, 143-177, 209-255, 257-291, 293-325

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/additional_artifact_stores.go` around lines 45 - 89, The
test currently just logs when an invalid ContainerRuntimeConfig is accepted (in
the loop around invalidPaths) which can false-pass; change the logic so that
after mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(...)
returns nil you do not merely log but explicitly fail the test if MCO does not
reject the bad config: use the Gomega/o.Expect (or g.Fail) to assert the
creation should not succeed, and if you must accept API creation for negative
tests then replace the log+defer with a deterministic check that the MCO rejects
the config (polling status/conditions or checking the rendered/validated state)
and call o.Expect(...).To(o.BeTrue())/o.Fail() on the rejection condition; keep
cleanupContainerRuntimeConfig to remove any created resources on failure. Ensure
this pattern is applied to the invalid-path loop and the other similar blocks
referenced (the other test ranges).

460-481: ⚠️ Potential issue | 🟠 Major

Add MCP update-start gating before readiness waits.

After CRC create/update/delete, these tests call waitForMCP(...) directly. That can return on the old rendered state before rollout actually starts.

Also applies to: 548-567, 618-646, 688-693, 744-749, 810-814, 920-924, 1000-1006, 1063-1083

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/additional_artifact_stores.go` around lines 460 - 481, The
test must gate the rollout readiness wait with an MCP "update-start" check after
creating/updating/deleting the ContainerRuntimeConfig and after you discover the
generated MachineConfig (variables/functions: ContainerRuntimeConfig,
generatedMCName, mcClient, waitForMCP); add a short wait that asserts the worker
MachineConfigPool has begun rolling out the new config (for example
implement/use a helper like waitForMCPUpdateStart(ctx, mcClient, "worker",
generatedMCName, shortTimeout) or poll the worker MachineConfigPool status and
return when its current/configuration name equals generatedMCName and
UpdatedMachineCount < MachineCount or when a Condition/Updating flag is set)
before calling the existing waitForMCP(...) readiness waiter so the test no
longer returns the old rendered state.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/extended/node/additional_image_stores.go`:
- Line 883: The call to waitForPodDeleted(ctx, oc, baselinePod.Name,
2*time.Minute) ignores its returned error which can mask pod-deletion timing
failures; update the test to capture and assert the error (e.g., use
require.NoError/Expect/if err != nil { t.Fatalf(...) }) immediately after
calling waitForPodDeleted with the same ctx, oc and baselinePod.Name so the test
fails fast on deletion timeouts instead of proceeding and introducing racey
cleanup behavior.

In `@test/extended/node/additional_layer_stores.go`:
- Around line 1605-1623: The test currently ignores errors from
ExecOnNodeWithChroot calls (used for "crictl rmi", "crictl images", and "ls
/var/lib/stargz-store/store/"), which can hide failures; modify the code around
the ExecOnNodeWithChroot invocations that set rmiOutput, crioImages, and
storeAfterDelete to capture and check the returned error and fail the test when
non-nil (e.g., use o.Expect(err).NotTo(o.HaveOccurred()) or return the error),
and ensure any useful output (rmiOutput, crioImages, storeAfterDelete) is still
logged only after the error check so the test reliably asserts that the CRI-O
image removal and subsequent verifications actually succeeded.

---

Duplicate comments:
In `@test/extended/node/additional_artifact_stores.go`:
- Around line 45-89: The test currently just logs when an invalid
ContainerRuntimeConfig is accepted (in the loop around invalidPaths) which can
false-pass; change the logic so that after
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(...) returns
nil you do not merely log but explicitly fail the test if MCO does not reject
the bad config: use the Gomega/o.Expect (or g.Fail) to assert the creation
should not succeed, and if you must accept API creation for negative tests then
replace the log+defer with a deterministic check that the MCO rejects the config
(polling status/conditions or checking the rendered/validated state) and call
o.Expect(...).To(o.BeTrue())/o.Fail() on the rejection condition; keep
cleanupContainerRuntimeConfig to remove any created resources on failure. Ensure
this pattern is applied to the invalid-path loop and the other similar blocks
referenced (the other test ranges).
- Around line 460-481: The test must gate the rollout readiness wait with an MCP
"update-start" check after creating/updating/deleting the ContainerRuntimeConfig
and after you discover the generated MachineConfig (variables/functions:
ContainerRuntimeConfig, generatedMCName, mcClient, waitForMCP); add a short wait
that asserts the worker MachineConfigPool has begun rolling out the new config
(for example implement/use a helper like waitForMCPUpdateStart(ctx, mcClient,
"worker", generatedMCName, shortTimeout) or poll the worker MachineConfigPool
status and return when its current/configuration name equals generatedMCName and
UpdatedMachineCount < MachineCount or when a Condition/Updating flag is set)
before calling the existing waitForMCP(...) readiness waiter so the test no
longer returns the old rendered state.

In `@test/extended/node/additional_image_stores.go`:
- Around line 612-616: The test is racy because after mutating
ContainerRuntimeConfig (ctrcfg.Name) the code calls waitForMCP(...) immediately
and may see the old steady state; update each mutation flow (where
waitForContainerRuntimeConfigSuccess(ctx, mcClient, ctrcfg.Name, ...) is
followed by waitForMCP(ctx, mcClient, "worker", ...)) to first wait for the MCP
to enter the "updating" phase (i.e., a start-updating gate) before calling
waitForMCP for completion: add a helper/check that polls the
MachineConfigPool/MC status for .status.degraded===false and .status.conditions
includes an Updating/Progressing condition (or reuse an existing waitForMCPState
helper if present) using mcClient and ctrcfg.Name, then only call
waitForMCP(ctx, mcClient, "worker", ...) to wait for finished rollout; apply
this change to all similar sites (the listed line groups).
- Around line 47-91: The test currently only logs when an invalid
ContainerRuntimeConfig (ctrcfg) is accepted, so create success can silently
pass; update the loop that calls
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create to assert
outcomes: if Create returns an error, keep the current log; if Create succeeds,
immediately defer cleanupContainerRuntimeConfig(ctx, mcClient, ctrcfg.Name) and
then poll the MCO validation (replace the current log path) — use an assertion
(o.Expect / g.Eventually) to fail the test if the config remains accepted by the
MCO (i.e., wait for the object to be rejected/rolled back or for the MCO to
surface an error), and fail the test instead of just logging; ensure these
checks reference ctrcfg, ContainerRuntimeConfigs().Create, and
cleanupContainerRuntimeConfig so the validation branch cannot silently pass.
- Around line 535-538: The tests currently only log failures from
prepopulateImageOnNode and registry pulls (e.g., calls using
prepopulateImageOnNode(ctx, oc, testNode, testImageDefault,
additionalImageStorePath)), making the feature check soft; change those log-only
branches to assert/fail the test instead (use the test framework's failure
helper such as framework.Failf or ExpectNoError) so the test fails when
prepopulation or pull fails; apply the same change to the other occurrences
where prepopulateImageOnNode or registry pull errors are merely logged (the
duplicated call sites mentioned) so the behavior is actually enforced rather
than only advertised in logs.
- Around line 850-883: The baseline measurement is biased by a warm image cache
because the same image (testImageDefault) is pulled on the node and not removed
before the optimized run; fix by ensuring a consistent cache state before each
measurement: remove the image from the node(s) (or use a fresh unique image tag)
prior to creating baselinePod and again prior to the optimized run so both runs
start from the same cold-cache state. Locate the pod creation flow (baselinePod,
waitForPodRunning) and add a pre-step that evicts the image (for example by
running a node-level image remove via the test harness/oc or executing a
privileged pod that runs the container runtime image remove) or switch to using
a uniquely tagged test image for each run; apply the same change for the similar
block around the second range referenced in the comment.
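
A small sketch of the cache-eviction step for the measurement prompt above, reusing the suite's ExecOnNodeWithChroot helper and the testNode/testImageDefault variables; "crictl rmi" is one possible node-level eviction mechanism, chosen here as an assumption rather than taken from the PR.

// Run immediately before each timed pod creation (baseline and optimized) so both
// runs start from a cold local image cache on the node.
evictCmd := fmt.Sprintf("crictl rmi %s || true", testImageDefault)
out, err := ExecOnNodeWithChroot(oc, testNode, "sh", "-c", evictCmd)
o.Expect(err).NotTo(o.HaveOccurred(), "failed to evict %s on node %s: %s", testImageDefault, testNode, out)
framework.Logf("Evicted %s from node %s local storage before timed run", testImageDefault, testNode)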

In `@test/extended/node/additional_layer_stores.go`:
- Around line 167-213: The test loop currently treats acceptance of invalid
paths as a warning instead of failing the spec; update the loop in the "should
reject path with invalid characters for additionalLayerStores [TC5]" test so
that when mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create
returns no error for a test case (i.e., the API accepted the invalid path), the
test fails (use the test assertion helper, e.g.,
o.Expect(err).To(o.HaveOccurred()) or o.Expect(false).To(o.BeTrue()) with a
clear message) rather than merely logging a warning; keep the cleanup via
cleanupContainerRuntimeConfig(ctx, mcClient, ctrcfg.Name) for the success branch
but ensure the assertion runs so the spec fails on regressions.
- Around line 721-725: The direct waitForMCP(...) calls can race the MCP "start
updating" transition; before each waitForMCP(ctx, mcClient, "worker", ...),
insert a call to a synchronization helper that waits for the MCP to enter the
"Updating" state (e.g., call waitForMCPStartUpdating(ctx, mcClient, "worker",
<short timeout>)) so the subsequent waitForMCP observes the in-progress update
rather than the pre-change state; update every occurrence (the shown call and
the other duplicated spots) to perform waitForMCPStartUpdating(...) immediately
before waitForMCP(...).
- Around line 1435-1453: Replace the hardcoded personal image string assigned to
eStargzImage in the test (additional_layer_stores.go) with the project-managed
test-image lookup helper (instead of "quay.io/bgudi/test-small:estargz"); call
the centralized test image registry/helper used elsewhere in the repo (for
example the project’s test image getter or framework image helper) to obtain the
estargz test image, assign its result to eStargzImage, and remove the literal so
CI uses a repository-managed image tag.
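
For reference, the later review names image.LocationFor (test/extended/util/image) as the repository-managed lookup; a hedged sketch of what the replacement could look like, where the image reference is a placeholder that would first have to exist in origin's mirrored/allowed test-image list:

// Hypothetical replacement for the hardcoded personal image; the reference below is a
// placeholder, not an existing mirrored tag.
eStargzImage := image.LocationFor("<repository-managed-estargz-image>:<tag>")
framework.Logf("Using eStargz test image %s", eStargzImage)
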
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 60404f2d-f230-4e26-93ed-7e3d1c85b473

📥 Commits

Reviewing files that changed from the base of the PR and between bf419f3 and f09da80.

📒 Files selected for processing (6)
  • test/extended/node/additional_artifact_stores.go
  • test/extended/node/additional_image_stores.go
  • test/extended/node/additional_layer_stores.go
  • test/extended/node/additional_stores_combined.go
  • test/extended/node/node_utils.go
  • test/extended/node/stargz_store_setup.go
✅ Files skipped from review due to trivial changes (1)
  • test/extended/node/stargz_store_setup.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/extended/node/additional_stores_combined.go

Comment thread test/extended/node/additional_image_stores.go Outdated
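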
Comment thread test/extended/node/additional_layer_stores.go Outdated
@BhargaviGudi BhargaviGudi force-pushed the additionalArtifactsAutomation branch from f09da80 to af7b257 Compare April 30, 2026 09:50
@BhargaviGudi
Contributor Author

/test all

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 8

♻️ Duplicate comments (6)
test/extended/node/stargz_store_setup.go (4)

411-425: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail fast on per-node verification command errors.

If ExecOnNodeWithChroot(...) fails on a worker, Deploy() currently logs and continues. That allows success without actually verifying every targeted node.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/stargz_store_setup.go` around lines 411 - 425, The
per-node checks in Deploy currently swallow ExecOnNodeWithChroot errors (they
call framework.Logf and continue), which can allow Deploy to report success
without verifying all nodes; change the behavior so that when
ExecOnNodeWithChroot returns a non-nil err (both for the "systemctl is-active"
call and the "mount" call) the function returns an error immediately (e.g.,
fmt.Errorf) that includes the node name and the original error; update the logic
around ExecOnNodeWithChroot usages in Deploy to fail-fast instead of continuing
so calling code knows verification failed.

79-107: ⚠️ Potential issue | 🔴 Critical | 🏗️ Heavy lift

Cleanup only deletes the namespace; it does not revert host changes.

The service, binary, systemd unit, config, mount, and CRI-O side effects remain on each worker after Cleanup(). That contaminates later specs with modified nodes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/stargz_store_setup.go` around lines 79 - 107, The Cleanup
method on StargzStoreSetup only deletes the namespace but must also undo
host-level changes applied during deployment: update Cleanup to iterate the
affected nodes and idempotently stop/disable the stargz-store systemd unit,
remove the installed binary and config files, unmount any mounts created, revert
any CRI-O config changes (and reload/restart CRI-O/kubelet as needed), and
remove any sidecar/static-pod artifacts; use the StargzStoreSetup.Cleanup
function to call helper routines (or add new helpers) that run the required
commands via the existing cluster exec/ssh mechanism (referencing s.namespace,
s.oc, s.deployed and the Cleanup method) and ensure each step logs failures but
continues so cleanup is best-effort and safe to re-run.

217-220: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Avoid downloading the stargz binary from public GitHub during test setup.

This makes the suite depend on external internet reachability and GitHub availability instead of just cluster behavior. A prepackaged or cluster-local artifact source would make the test much more stable in CI.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/stargz_store_setup.go` around lines 217 - 220, Replace the
external GitHub curl download (the curl -L -o /tmp/stargz.tar.gz and the
hardcoded https://github.com/.../stargz-snapshotter-%s-linux-amd64.tar.gz URL
and the echo "Downloading stargz-snapshotter %s...") with a
cluster-local/prepackaged artifact lookup: read the stargz-snapshotter tarball
from the test artifact store or a local registry (e.g., a helper like
getLocalArtifact("stargz-snapshotter", version) or the test fixture artifact
directory) and write it to /tmp/stargz.tar.gz, failing fast if the artifact is
missing; ensure any code that expects the filename
stargz-snapshotter-%s-linux-amd64.tar.gz still works with the local artifact
source.

48-75: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Partial deploy failures can still leak cluster state.

deployed flips only after all verification succeeds, but node mutation starts much earlier. If namespace/configmap/daemonset creation succeeds and a later step fails, callers using IsDeployed() will skip cleanup entirely.

Also applies to: 79-83

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/stargz_store_setup.go` around lines 48 - 75, The current
flow only sets s.deployed = true after all verification succeeds, so partial
failures (namespace/configmap/daemonset created but verification failed) leave
resources in cluster because IsDeployed() returns false; change the logic to
mark the deployment as present as soon as cluster-affecting resources are
created: set s.deployed = true immediately after successful createDaemonSet (or
right after createConfigMap/createDaemonSet sequence) and before calling
waitForDaemonSetReady/verifyStargzStoreRunning, and mirror the same change in
the similar block around lines 79-83 so callers invoking IsDeployed() can
trigger cleanup when verification fails.
test/extended/node/additional_artifact_stores.go (1)

79-89: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

These negative specs still do not fail when invalid configs are accepted.

Several should reject ... cases only log success on API rejection and otherwise continue, and TC2 still does not assert the MCO failure path. A validation regression can still produce a green run here.

Also applies to: 118-140, 168-177, 246-255, 283-291, 317-325

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/additional_artifact_stores.go` around lines 79 - 89, The
test currently only logs when an invalid ContainerRuntimeConfig is accepted
(using mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create and
cleanupContainerRuntimeConfig) instead of failing; update each negative case
(including the blocks around the ranges noted) to assert failure when the API
accepts an invalid config by calling t.Fatalf or framework.Failf (so the test
fails immediately) and, for cases that expect MCO-level rejection (e.g., TC2),
add the additional wait/assert logic to verify the MCO validation rejects the
config (polling the MCO status or relevant resource and asserting the expected
failure) rather than just logging “checking MCO validation”; ensure each branch
that currently only logs on API acceptance is replaced with an explicit test
failure or a follow-up MCO validation assertion referencing tc.path, the Create
call, and cleanupContainerRuntimeConfig to locate the affected code.
test/extended/node/additional_image_stores.go (1)

80-89: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

These negative specs still go green when invalid configs are accepted.

Several should reject ... cases only log acceptance, and TC2 also ignores the result of the MCO polling path. If API validation regresses, this suite can still pass without ever asserting ContainerRuntimeConfigFailure=True.

Also applies to: 119-140, 169-177, 247-255, 284-291, 318-325

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/additional_image_stores.go` around lines 80 - 89, The test
currently only logs when invalid ContainerRuntimeConfig paths are accepted by
the API (using
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create) and never
fails the test if MCO validation doesn't set ContainerRuntimeConfigFailure;
update the test logic so that for each tc.path: if Create returns an error,
assert/log that as expected, but if Create succeeds (and you defer
cleanupContainerRuntimeConfig), actively poll the MCO status for the created
config (check for the ContainerRuntimeConfigFailure condition on the
corresponding object) within a timeout and fail the test if the failure
condition is not observed; apply this change to the other similar blocks
referenced (the ranges around lines 119-140, 169-177, 247-255, 284-291, 318-325)
so all “should reject” cases assert MCO-level rejection rather than only logging
acceptance.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/extended/node/additional_artifact_stores.go`:
- Around line 567-570: After calling getPureWorkerNodes(workerNodes) check
whether pureWorkers is empty and skip the spec if so; specifically, add a guard
like if len(pureWorkers) == 0 { o.Skip("no pure worker nodes available; skipping
spec") } immediately after the pureWorkers := getPureWorkerNodes(workerNodes)
line in the block that uses getNodesByLabel and getPureWorkerNodes (this
prevents the node-verification loops from being vacuously successful). Apply the
same guard to the other occurrence where getPureWorkerNodes is used (the block
around the other call at the later occurrence).
- Around line 540-545: Replace the fragile "cat" check with a command that
succeeds but only prints when the file exists so we can assert absence
specifically: use ExecOnNodeWithChroot(oc, node.Name, "sh", "-c", "if [ -e
/etc/crio/crio.conf.d/01-ctrcfg-additionalArtifactStores ]; then echo present;
fi") and then assert the command returned no error and produced an empty stdout
(meaning the file is absent); keep framework.Logf("Node %s: CRI-O config file
removed successfully", node.Name) and update the expectation to check output ==
"" rather than just expecting ExecOnNodeWithChroot to fail.

In `@test/extended/node/additional_image_stores.go`:
- Around line 718-720: The tests call getNodesByLabel(...) and derive
pureWorkers via getPureWorkerNodes(workerNodes) but don't skip when pureWorkers
is empty, causing SNO/compact clusters to exercise no-op loops and falsely pass;
after computing pureWorkers (the variable), add a guard like if len(pureWorkers)
== 0 { o.Skip("no pure worker nodes for this cluster topology") } where the
first instance occurs (around the getNodesByLabel/getPureWorkerNodes usage) and
apply the same guard at the other similar location that computes pureWorkers
later in the file so these specs are skipped when no pure workers exist.
- Around line 631-644: The hard 2x timing assertion using podStartupTime and
pod2StartupTime is too brittle; change the Expect that currently uses
o.BeNumerically("<", pod2StartupTime/2) to a relaxed check (e.g.,
o.BeNumerically("<", pod2StartupTime)) or remove the numeric timing assertion
entirely and rely on the existing event/source checks for additionalImageStores;
also update the failure message near the Expect and the Logf that prints
"Speedup" to avoid implying a strict 2x requirement.
- Around line 550-576: The test reads Events once right after
waitForPodRunning(...) which can miss the asynchronous "Pulled" event; change
the logic around oc.AdminKubeClient().CoreV1().Events(...).List(...) and the
foundAlreadyPresentEvent check to poll until a timeout (e.g., using
framework/poll or wait.PollImmediate) and retry listing events and scanning for
event.Reason == "Pulled" and the "already present on machine and can be accessed
by the pod" substring, failing only after the timeout; apply the same polling
fix to the other similar block that checks events (the block around the second
occurrence noted in the comment, currently at lines 607-629) so both checks
reliably wait for the Pulled event before asserting using
additionalImageStorePath and testPod.
- Around line 583-599: After calling Delete on testPod, wait for the pod to be
fully removed (or to reach a terminal state) before running the podman rmi and
creating the fallback pod: poll with wait.PollImmediate
(k8s.io/apimachinery/pkg/util/wait) against
oc.AdminKubeClient().CoreV1().Pods(oc.Namespace()).Get(ctx, testPod.Name,
metav1.GetOptions{}) and loop until the Get returns a NotFound error or the
pod.Status.Phase is Succeeded/Failed (or deletionTimestamp is set and pod is
gone); only then run ExecOnNodeWithChroot (the removeCmd) and call createTestPod
for testPod2. Ensure the wait has a reasonable timeout and fail the test if the
pod does not terminate in time.

In `@test/extended/node/node_utils.go`:
- Around line 950-959: waitForDaemonSetReady currently always uses
oc.Namespace() when calling
oc.AdminKubeClient().AppsV1().DaemonSets(oc.Namespace()).Get(...), which can
poll the wrong DaemonSet when createTestDaemonSet created resources in a
different namespace; change the helper to accept an explicit namespace parameter
(e.g., add ns string to waitForDaemonSetReady signature) and use
oc.AdminKubeClient().AppsV1().DaemonSets(ns).Get(ctx, name, metav1.GetOptions{})
instead, and apply the same namespace-parameter fix to the other occurrence
referenced (lines 1125-1158) so callers pass the intended namespace rather than
relying on oc.Namespace().
- Around line 1085-1092: In waitForPodDeleted, non-NotFound GET errors are being
swallowed (returning false, nil) which masks RBAC/transport/apiserver failures;
update the closure in waitForPodDeleted to return the actual error (return
false, err) for any err that is not apierrors.IsNotFound(err) so polling will
surface and abort on real API errors while still returning (true, nil) when
apierrors.IsNotFound(err).

---

Duplicate comments:
In `@test/extended/node/additional_artifact_stores.go`:
- Around line 79-89: The test currently only logs when an invalid
ContainerRuntimeConfig is accepted (using
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create and
cleanupContainerRuntimeConfig) instead of failing; update each negative case
(including the blocks around the ranges noted) to assert failure when the API
accepts an invalid config by calling t.Fatalf or framework.Failf (so the test
fails immediately) and, for cases that expect MCO-level rejection (e.g., TC2),
add the additional wait/assert logic to verify the MCO validation rejects the
config (polling the MCO status or relevant resource and asserting the expected
failure) rather than just logging “checking MCO validation”; ensure each branch
that currently only logs on API acceptance is replaced with an explicit test
failure or a follow-up MCO validation assertion referencing tc.path, the Create
call, and cleanupContainerRuntimeConfig to locate the affected code.

In `@test/extended/node/additional_image_stores.go`:
- Around line 80-89: The test currently only logs when invalid
ContainerRuntimeConfig paths are accepted by the API (using
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create) and never
fails the test if MCO validation doesn't set ContainerRuntimeConfigFailure;
update the test logic so that for each tc.path: if Create returns an error,
assert/log that as expected, but if Create succeeds (and you defer
cleanupContainerRuntimeConfig), actively poll the MCO status for the created
config (check for the ContainerRuntimeConfigFailure condition on the
corresponding object) within a timeout and fail the test if the failure
condition is not observed; apply this change to the other similar blocks
referenced (the ranges around lines 119-140, 169-177, 247-255, 284-291, 318-325)
so all “should reject” cases assert MCO-level rejection rather than only logging
acceptance.

In `@test/extended/node/stargz_store_setup.go`:
- Around line 411-425: The per-node checks in Deploy currently swallow
ExecOnNodeWithChroot errors (they call framework.Logf and continue), which can
allow Deploy to report success without verifying all nodes; change the behavior
so that when ExecOnNodeWithChroot returns a non-nil err (both for the "systemctl
is-active" call and the "mount" call) the function returns an error immediately
(e.g., fmt.Errorf) that includes the node name and the original error; update
the logic around ExecOnNodeWithChroot usages in Deploy to fail-fast instead of
continuing so calling code knows verification failed.
- Around line 79-107: The Cleanup method on StargzStoreSetup only deletes the
namespace but must also undo host-level changes applied during deployment:
update Cleanup to iterate the affected nodes and idempotently stop/disable the
stargz-store systemd unit, remove the installed binary and config files, unmount
any mounts created, revert any CRI-O config changes (and reload/restart
CRI-O/kubelet as needed), and remove any sidecar/static-pod artifacts; use the
StargzStoreSetup.Cleanup function to call helper routines (or add new helpers)
that run the required commands via the existing cluster exec/ssh mechanism
(referencing s.namespace, s.oc, s.deployed and the Cleanup method) and ensure
each step logs failures but continues so cleanup is best-effort and safe to
re-run.
- Around line 217-220: Replace the external GitHub curl download (the curl -L -o
/tmp/stargz.tar.gz and the hardcoded
https://github.com/.../stargz-snapshotter-%s-linux-amd64.tar.gz URL and the echo
"Downloading stargz-snapshotter %s...") with a cluster-local/prepackaged
artifact lookup: read the stargz-snapshotter tarball from the test artifact
store or a local registry (e.g., a helper like
getLocalArtifact("stargz-snapshotter", version) or the test fixture artifact
directory) and write it to /tmp/stargz.tar.gz, failing fast if the artifact is
missing; ensure any code that expects the filename
stargz-snapshotter-%s-linux-amd64.tar.gz still works with the local artifact
source.
- Around line 48-75: The current flow only sets s.deployed = true after all
verification succeeds, so partial failures (namespace/configmap/daemonset
created but verification failed) leave resources in cluster because IsDeployed()
returns false; change the logic to mark the deployment as present as soon as
cluster-affecting resources are created: set s.deployed = true immediately after
successful createDaemonSet (or right after createConfigMap/createDaemonSet
sequence) and before calling waitForDaemonSetReady/verifyStargzStoreRunning, and
mirror the same change in the similar block around lines 79-83 so callers
invoking IsDeployed() can trigger cleanup when verification fails.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: b5c9d8fe-95ec-4b74-8c24-f0b1dcb4c3b2

📥 Commits

Reviewing files that changed from the base of the PR and between f09da80 and af7b257.

📒 Files selected for processing (7)
  • pkg/testsuites/standard_suites.go
  • test/extended/node/additional_artifact_stores.go
  • test/extended/node/additional_image_stores.go
  • test/extended/node/additional_layer_stores.go
  • test/extended/node/additional_stores_combined.go
  • test/extended/node/node_utils.go
  • test/extended/node/stargz_store_setup.go
✅ Files skipped from review due to trivial changes (1)
  • pkg/testsuites/standard_suites.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • test/extended/node/additional_layer_stores.go
  • test/extended/node/additional_stores_combined.go

Comment on lines +540 to +545
g.By("PHASE 6: Verifying CRI-O config file is removed after CRC deletion")
for _, node := range pureWorkers {
_, err := ExecOnNodeWithChroot(oc, node.Name, "cat", "/etc/crio/crio.conf.d/01-ctrcfg-additionalArtifactStores")
o.Expect(err).To(o.HaveOccurred(),
"CRI-O config file should be removed after ContainerRuntimeConfig deletion on node %s", node.Name)
framework.Logf("Node %s: CRI-O config file removed successfully", node.Name)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Assert that the file is absent, not just that cat failed.

Any ExecOnNodeWithChroot(...) failure satisfies this expectation today, so a broken debug session or transient node error looks identical to successful cleanup.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/additional_artifact_stores.go` around lines 540 - 545,
Replace the fragile "cat" check with a command that succeeds but only prints
when the file exists so we can assert absence specifically: use
ExecOnNodeWithChroot(oc, node.Name, "sh", "-c", "if [ -e
/etc/crio/crio.conf.d/01-ctrcfg-additionalArtifactStores ]; then echo present;
fi") and then assert the command returned no error and produced an empty stdout
(meaning the file is absent); keep framework.Logf("Node %s: CRI-O config file
removed successfully", node.Name) and update the expectation to check output ==
"" rather than just expecting ExecOnNodeWithChroot to fail.

Comment thread test/extended/node/additional_artifact_stores.go
Comment on lines +550 to +576
g.By("Verifying pod events and image pull behavior")
events, err := oc.AdminKubeClient().CoreV1().Events(oc.Namespace()).List(ctx, metav1.ListOptions{
FieldSelector: fmt.Sprintf("involvedObject.name=%s", testPod.Name),
})
o.Expect(err).NotTo(o.HaveOccurred())

var foundAlreadyPresentEvent bool
for _, event := range events.Items {
if event.Reason == "Pulled" {
framework.Logf("Image pulled event: %s", event.Message)
// Check if the event indicates image was already present on machine
// This is the expected message when image is loaded from additionalImageStores
if strings.Contains(event.Message, "already present on machine and can be accessed by the pod") {
foundAlreadyPresentEvent = true
framework.Logf("SUCCESS: Image was loaded from additionalImageStore - event message: %s", event.Message)
break
}
}
}

// Validation: Verify image was loaded from additional storage
// Expected event message: "Container image ... already present on machine and can be accessed by the pod"
o.Expect(foundAlreadyPresentEvent).To(o.BeTrue(),
"Image should have been loaded from additionalImageStore (%s). "+
"Expected event message containing 'already present on machine and can be accessed by the pod' but did not find it. "+
"This indicates the image was not pre-populated correctly or not loaded from additional storage.", additionalImageStorePath)
framework.Logf("Verified: Image was loaded from additional storage at %s", additionalImageStorePath)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Poll for the pull event instead of reading events once.

Event recording is asynchronous. Listing events immediately after waitForPodRunning(...) can miss the Pulled message and make both the additional-store and fallback assertions flaky.

Also applies to: 607-629

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/additional_image_stores.go` around lines 550 - 576, The
test reads Events once right after waitForPodRunning(...) which can miss the
asynchronous "Pulled" event; change the logic around
oc.AdminKubeClient().CoreV1().Events(...).List(...) and the
foundAlreadyPresentEvent check to poll until a timeout (e.g., using
framework/poll or wait.PollImmediate) and retry listing events and scanning for
event.Reason == "Pulled" and the "already present on machine and can be accessed
by the pod" substring, failing only after the timeout; apply the same polling
fix to the other similar block that checks events (the block around the second
occurrence noted in the comment, currently at lines 607-629) so both checks
reliably wait for the Pulled event before asserting using
additionalImageStorePath and testPod.
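
A sketch of the polling variant, reusing the same variables as the excerpt above (ctx, oc, testPod, additionalImageStorePath) and the wait package these tests already import; the 2-minute timeout is an assumption.

var foundAlreadyPresentEvent bool
err = wait.PollUntilContextTimeout(ctx, 5*time.Second, 2*time.Minute, true, func(ctx context.Context) (bool, error) {
	events, listErr := oc.AdminKubeClient().CoreV1().Events(oc.Namespace()).List(ctx, metav1.ListOptions{
		FieldSelector: fmt.Sprintf("involvedObject.name=%s", testPod.Name),
	})
	if listErr != nil {
		return false, listErr
	}
	for _, event := range events.Items {
		if event.Reason == "Pulled" &&
			strings.Contains(event.Message, "already present on machine and can be accessed by the pod") {
			foundAlreadyPresentEvent = true
			return true, nil
		}
	}
	return false, nil
})
o.Expect(err).NotTo(o.HaveOccurred(), "timed out waiting for a Pulled event showing the image came from %s", additionalImageStorePath)
o.Expect(foundAlreadyPresentEvent).To(o.BeTrue())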

Comment on lines +583 to +599
g.By("Deleting first pod")
err = oc.AdminKubeClient().CoreV1().Pods(oc.Namespace()).Delete(ctx, testPod.Name, metav1.DeleteOptions{})
o.Expect(err).NotTo(o.HaveOccurred())
framework.Logf("First pod deleted")

g.By("Removing image from additional storage to test fallback")
removeCmd := fmt.Sprintf("podman --root %s rmi %s", additionalImageStorePath, testImageDefault)
removeOutput, err := ExecOnNodeWithChroot(oc, testNode, "sh", "-c", removeCmd)
o.Expect(err).NotTo(o.HaveOccurred())
framework.Logf("Image removed from additional storage: %s", removeOutput)

g.By("Creating second pod to test fallback to registry")
testPod2 := createTestPod("imagestore-fallback-pod", testImageDefault, testNode)
startTime2 := time.Now()
_, err = oc.AdminKubeClient().CoreV1().Pods(oc.Namespace()).Create(ctx, testPod2, metav1.CreateOptions{})
o.Expect(err).NotTo(o.HaveOccurred())
defer oc.AdminKubeClient().CoreV1().Pods(oc.Namespace()).Delete(ctx, testPod2.Name, metav1.DeleteOptions{})

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Wait for the first pod to terminate before removing the image and creating the fallback pod.

Delete is asynchronous. Right now Phase 5 can race a terminating first pod that is still using the image, which makes the podman rmi step and the fallback behavior nondeterministic.

Suggested fix
 		g.By("Deleting first pod")
 		err = oc.AdminKubeClient().CoreV1().Pods(oc.Namespace()).Delete(ctx, testPod.Name, metav1.DeleteOptions{})
 		o.Expect(err).NotTo(o.HaveOccurred())
+		err = waitForPodDeleted(ctx, oc, testPod.Name, 2*time.Minute)
+		o.Expect(err).NotTo(o.HaveOccurred())
 		framework.Logf("First pod deleted")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
g.By("Deleting first pod")
err = oc.AdminKubeClient().CoreV1().Pods(oc.Namespace()).Delete(ctx, testPod.Name, metav1.DeleteOptions{})
o.Expect(err).NotTo(o.HaveOccurred())
framework.Logf("First pod deleted")
g.By("Removing image from additional storage to test fallback")
removeCmd := fmt.Sprintf("podman --root %s rmi %s", additionalImageStorePath, testImageDefault)
removeOutput, err := ExecOnNodeWithChroot(oc, testNode, "sh", "-c", removeCmd)
o.Expect(err).NotTo(o.HaveOccurred())
framework.Logf("Image removed from additional storage: %s", removeOutput)
g.By("Creating second pod to test fallback to registry")
testPod2 := createTestPod("imagestore-fallback-pod", testImageDefault, testNode)
startTime2 := time.Now()
_, err = oc.AdminKubeClient().CoreV1().Pods(oc.Namespace()).Create(ctx, testPod2, metav1.CreateOptions{})
o.Expect(err).NotTo(o.HaveOccurred())
defer oc.AdminKubeClient().CoreV1().Pods(oc.Namespace()).Delete(ctx, testPod2.Name, metav1.DeleteOptions{})
g.By("Deleting first pod")
err = oc.AdminKubeClient().CoreV1().Pods(oc.Namespace()).Delete(ctx, testPod.Name, metav1.DeleteOptions{})
o.Expect(err).NotTo(o.HaveOccurred())
err = waitForPodDeleted(ctx, oc, testPod.Name, 2*time.Minute)
o.Expect(err).NotTo(o.HaveOccurred())
framework.Logf("First pod deleted")
g.By("Removing image from additional storage to test fallback")
removeCmd := fmt.Sprintf("podman --root %s rmi %s", additionalImageStorePath, testImageDefault)
removeOutput, err := ExecOnNodeWithChroot(oc, testNode, "sh", "-c", removeCmd)
o.Expect(err).NotTo(o.HaveOccurred())
framework.Logf("Image removed from additional storage: %s", removeOutput)
g.By("Creating second pod to test fallback to registry")
testPod2 := createTestPod("imagestore-fallback-pod", testImageDefault, testNode)
startTime2 := time.Now()
_, err = oc.AdminKubeClient().CoreV1().Pods(oc.Namespace()).Create(ctx, testPod2, metav1.CreateOptions{})
o.Expect(err).NotTo(o.HaveOccurred())
defer oc.AdminKubeClient().CoreV1().Pods(oc.Namespace()).Delete(ctx, testPod2.Name, metav1.DeleteOptions{})
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/additional_image_stores.go` around lines 583 - 599, After
calling Delete on testPod, wait for the pod to be fully removed (or to reach a
terminal state) before running the podman rmi and creating the fallback pod:
poll with wait.PollImmediate (k8s.io/apimachinery/pkg/util/wait) against
oc.AdminKubeClient().CoreV1().Pods(oc.Namespace()).Get(ctx, testPod.Name,
metav1.GetOptions{}) and loop until the Get returns a NotFound error or the
pod.Status.Phase is Succeeded/Failed (or deletionTimestamp is set and pod is
gone); only then run ExecOnNodeWithChroot (the removeCmd) and call createTestPod
for testPod2. Ensure the wait has a reasonable timeout and fail the test if the
pod does not terminate in time.

Comment on lines +631 to +644
g.By("Verifying performance improvement with additionalImageStores")
framework.Logf("Performance comparison:")
framework.Logf(" Pod 1 (prepopulated from additionalImageStore): %v", podStartupTime)
framework.Logf(" Pod 2 (pulled from registry): %v", pod2StartupTime)
speedup := float64(pod2StartupTime) / float64(podStartupTime)
framework.Logf(" Speedup: %.2fx faster with additionalImageStores", speedup)

// Verify that prepopulated image is significantly faster than registry pull
// For a 6GB image, prepopulated should be at least 2x faster
o.Expect(podStartupTime).To(o.BeNumerically("<", pod2StartupTime/2),
"Pod using prepopulated image from additionalImageStore should be significantly faster. "+
"Expected pod1 (%v) to be at least 2x faster than pod2 (%v) pulling from registry.",
podStartupTime, pod2StartupTime)
framework.Logf("Performance improvement verified: Prepopulated image is %.2fx faster", speedup)

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

The hard 2x timing assertion is likely to be noisy in CI.

The event checks already validate the source of the image. Requiring a strict wall-clock speedup couples the test to registry performance and cluster load, which can fail even when the feature works.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/additional_image_stores.go` around lines 631 - 644, The
hard 2x timing assertion using podStartupTime and pod2StartupTime is too
brittle; change the Expect that currently uses o.BeNumerically("<",
pod2StartupTime/2) to a relaxed check (e.g., o.BeNumerically("<",
pod2StartupTime)) or remove the numeric timing assertion entirely and rely on
the existing event/source checks for additionalImageStores; also update the
failure message near the Expect and the Logf that prints "Speedup" to avoid
implying a strict 2x requirement.

Comment thread test/extended/node/additional_image_stores.go
Comment thread test/extended/node/node_utils.go Outdated
Comment on lines +1085 to +1092
func waitForPodDeleted(ctx context.Context, oc *exutil.CLI, podName string, timeout time.Duration) error {
	return wait.PollUntilContextTimeout(ctx, 5*time.Second, timeout, true, func(ctx context.Context) (bool, error) {
		_, err := oc.AdminKubeClient().CoreV1().Pods(oc.Namespace()).Get(ctx, podName, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			return true, nil
		}
		return false, nil
	})

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't turn pod GET failures into silent timeouts.

Any non-NotFound error is currently returned as false, nil, so RBAC, transport, or apiserver failures get masked until the poll times out.

Suggested fix
 func waitForPodDeleted(ctx context.Context, oc *exutil.CLI, podName string, timeout time.Duration) error {
 	return wait.PollUntilContextTimeout(ctx, 5*time.Second, timeout, true, func(ctx context.Context) (bool, error) {
 		_, err := oc.AdminKubeClient().CoreV1().Pods(oc.Namespace()).Get(ctx, podName, metav1.GetOptions{})
 		if apierrors.IsNotFound(err) {
 			return true, nil
 		}
+		if err != nil {
+			return false, err
+		}
 		return false, nil
 	})
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
func waitForPodDeleted(ctx context.Context, oc *exutil.CLI, podName string, timeout time.Duration) error {
	return wait.PollUntilContextTimeout(ctx, 5*time.Second, timeout, true, func(ctx context.Context) (bool, error) {
		_, err := oc.AdminKubeClient().CoreV1().Pods(oc.Namespace()).Get(ctx, podName, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			return true, nil
		}
		if err != nil {
			return false, err
		}
		return false, nil
	})
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/node/node_utils.go` around lines 1085 - 1092, In
waitForPodDeleted, non-NotFound GET errors are being swallowed (returning false,
nil) which masks RBAC/transport/apiserver failures; update the closure in
waitForPodDeleted to return the actual error (return false, err) for any err
that is not apierrors.IsNotFound(err) so polling will surface and abort on real
API errors while still returning (true, nil) when apierrors.IsNotFound(err).

@openshift-ci
Contributor

openshift-ci Bot commented Apr 30, 2026

@BhargaviGudi: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@BhargaviGudi BhargaviGudi force-pushed the additionalArtifactsAutomation branch from af7b257 to 990e5f3 Compare May 5, 2026 08:43

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (8)
test/extended/node/node_utils.go (2)

1085-1092: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't swallow non-NotFound pod GET errors.

Line 1091 returns false, nil for every error besides IsNotFound, so real API failures only show up as a timeout instead of failing the poll immediately.

Suggested fix
 func waitForPodDeleted(ctx context.Context, oc *exutil.CLI, podName string, timeout time.Duration) error {
 	return wait.PollUntilContextTimeout(ctx, 5*time.Second, timeout, true, func(ctx context.Context) (bool, error) {
 		_, err := oc.AdminKubeClient().CoreV1().Pods(oc.Namespace()).Get(ctx, podName, metav1.GetOptions{})
 		if apierrors.IsNotFound(err) {
 			return true, nil
 		}
+		if err != nil {
+			return false, err
+		}
 		return false, nil
 	})
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/node_utils.go` around lines 1085 - 1092, The poll in
function waitForPodDeleted is swallowing non-NotFound GET errors; modify the
closure used by wait.PollUntilContextTimeout so that after calling
oc.AdminKubeClient().CoreV1().Pods(...).Get(...) you check
apierrors.IsNotFound(err) -> return true,nil, else if err != nil -> return
false, err (so the poll fails fast with the actual API error), and only return
false,nil when err == nil and the pod still exists; update the closure around
the Get call in waitForPodDeleted accordingly.

950-958: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Poll the intended DaemonSet and surface real GET failures.

Line 952 hardcodes oc.Namespace() even though createTestDaemonSet(...) lets callers choose a namespace, so this helper can wait on the wrong DaemonSet. Lines 953-954 also turn every API error into a silent timeout, which makes RBAC/apiserver failures look like readiness flakes.

Suggested fix
-func waitForDaemonSetReady(ctx context.Context, oc *exutil.CLI, name string, timeout time.Duration) error {
+func waitForDaemonSetReady(ctx context.Context, oc *exutil.CLI, namespace, name string, timeout time.Duration) error {
 	return wait.PollUntilContextTimeout(ctx, 10*time.Second, timeout, true, func(ctx context.Context) (bool, error) {
-		ds, err := oc.AdminKubeClient().AppsV1().DaemonSets(oc.Namespace()).Get(ctx, name, metav1.GetOptions{})
-		if err != nil {
+		ds, err := oc.AdminKubeClient().AppsV1().DaemonSets(namespace).Get(ctx, name, metav1.GetOptions{})
+		if apierrors.IsNotFound(err) {
 			return false, nil
 		}
+		if err != nil {
+			return false, err
+		}
 		return ds.Status.NumberReady == ds.Status.DesiredNumberScheduled &&
 			ds.Status.DesiredNumberScheduled > 0, nil
 	})
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/node_utils.go` around lines 950 - 958, The helper
waitForDaemonSetReady currently always polls oc.Namespace() and silences GET
errors; change its signature to accept an explicit namespace (e.g.
waitForDaemonSetReady(ctx context.Context, oc *exutil.CLI, namespace string,
name string, timeout time.Duration)) then call
oc.AdminKubeClient().AppsV1().DaemonSets(namespace).Get(...) so it polls the
intended DaemonSet created by createTestDaemonSet, and do not swallow API
errors—return the actual err from the GET (return false, err) instead of
converting all errors into silent timeouts; update all call sites to pass the
correct namespace.
test/extended/node/additional_layer_stores.go (3)

460-462: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Avoid a personal Quay image here.

Line 461 hardcodes quay.io/bgudi/test-small:estargz, which makes this suite depend on a non-repository-owned external image. If that tag is removed, permissions change, or mirroring rules differ, CI breaks for reasons unrelated to the code under test. Prefer a repository-managed image routed through image.LocationFor(...) and registered in the allowed image list.

#!/bin/bash
set -euo pipefail

rg -n 'quay\.io/bgudi/test-small:estargz' test/extended/node
fd -i 'image.go' test/extended/util | xargs rg -n 'allowedImages|bgudi/test-small|estargz'
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_layer_stores.go` around lines 460 - 462,
Replace the hardcoded external image value assigned to eStargzImage
("quay.io/bgudi/test-small:estargz") with a repository-managed image lookup
using image.LocationFor(...) (or another test helper that returns an allowed
image), and ensure the chosen image is registered in the allowed image list used
by the test suite; update the assignment where eStargzImage is defined and any
test harness references so they use the LocationFor result rather than the
personal quay URL.

207-213: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail TC5 if the invalid path is accepted.

Lines 207-213 only log a warning on a successful create, so this spec still passes if invalid-character validation regresses.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_layer_stores.go` around lines 207 - 213, The
test currently only logs a warning when an invalid path is accepted, so add a
failing assertion instead: after
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(...)
succeeds, call the test-failure helper (e.g. framework.Failf or t.Fatalf
depending on test harness) with a clear message including tc.char and
ctrcfg.Name rather than just logging; keep the existing
cleanupContainerRuntimeConfig defer and replace the framework.Logf warning
branch with a failure call to ensure the spec fails when invalid paths are
accepted.
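
A sketch of that assertion, keeping the identifiers the prompt references (ctrcfg, tc.char, cleanupContainerRuntimeConfig); note that a defer placed inside the test-case loop only fires when the spec function returns.

_, err := mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(ctx, ctrcfg, metav1.CreateOptions{})
if err == nil {
	// Clean up whatever was created, then fail the spec instead of only warning.
	defer cleanupContainerRuntimeConfig(ctx, mcClient, ctrcfg.Name)
	framework.Failf("ContainerRuntimeConfig %q with invalid character %q was accepted; expected validation to reject it", ctrcfg.Name, tc.char)
}
framework.Logf("Invalid path containing %q rejected as expected: %v", tc.char, err)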

744-748: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Wait for the worker MCP to leave the old steady state before waitForMCP(...).

These paths jump straight from waitForContainerRuntimeConfigSuccess(...) to waitForMCP(...). If the pool is still Updated when polling starts, waitForMCP(...) can return on the previous rendered config and race the later storage.conf assertions.

Also applies to: 770-774, 840-844

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_layer_stores.go` around lines 744 - 748, After
waitForContainerRuntimeConfigSuccess(ctx, mcClient, ctrcfg.Name, ...), ensure
the worker MachineConfigPool leaves its previous steady state before calling
waitForMCP(...): insert a short wait that polls the "worker" MCP until its
status no longer reports the prior steady-state condition (e.g., not
"Updated"/not matching the old rendered configuration) — you can reuse the
existing waitForMCP helper with a flag/variant or add a small helper (e.g.,
waitForMCPToLeaveSteadyState) to poll the "worker" pool and return only when it
has transitioned away from the old steady state, then proceed to the existing
waitForMCP(ctx, mcClient, "worker", 25*time.Minute); apply the same insertion at
the other occurrences noted (around the blocks at the other two locations).
test/extended/node/additional_artifact_stores.go (3)

567-575: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Skip these specs when there are no pure workers.

After getPureWorkerNodes(...), both specs continue with an empty slice. On SNO/compact clusters that makes the node-verification loops vacuously succeed and the test can pass without checking a worker at all.

Also applies to: 656-664

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_artifact_stores.go` around lines 567 - 575, The
test continues even when getPureWorkerNodes(...) returns an empty slice, causing
vacuous success on SNO/compact clusters; add an early skip when pureWorkers is
empty by checking len(pureWorkers)==0 and invoking the test skip helper (e.g.,
o.Skip or t.Skip) with a clear message before calling createDirectoriesOnNodes;
apply the same guard at the other occurrence around the block referencing
createDirectoriesOnNodes/cleanupDirectoriesOnNodes (the second instance around
lines 656-664) so both specs bail out when no pure workers exist.

79-85: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Make these negative-path specs fail when invalid configs are accepted.

Lines 79-85 and the similar blocks only log success when the API accepts an invalid ContainerRuntimeConfig, so the spec still goes green on a validation regression. TC2 has the same problem after the poll: if ObservedGeneration == Generation without ever reaching ContainerRuntimeConfigFailure=True, the test still passes.

Also applies to: 118-140, 168-174, 246-253, 283-290, 317-324

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_artifact_stores.go` around lines 79 - 85, The
test currently treats API acceptance of invalid ContainerRuntimeConfig as
success; change the Create() success branch (the block around
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create and the defer
cleanupContainerRuntimeConfig call) to explicitly fail the test when an invalid
config is accepted (use the test framework's failure/assert method instead of
only logging), and likewise update the poll/validation checks that compare
ObservedGeneration to Generation to fail if ObservedGeneration==Generation
occurs without ever observing the ContainerRuntimeConfigFailure=True condition
(i.e., after your existing poll of the resource state, assert failure when no
failure condition was ever set). Target the create/accept branches and the poll
that checks ObservedGeneration/Generation and ContainerRuntimeConfigFailure to
make negative-path specs fail on invalid-config acceptance.

540-545: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Assert file absence explicitly instead of expecting cat to fail.

Any ExecOnNodeWithChroot(...) failure satisfies this check today, so a broken debug session is indistinguishable from successful cleanup. Check for existence with a shell test and assert empty output instead.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_artifact_stores.go` around lines 540 - 545,
Replace the fragile "cat" failure check with an explicit existence test: call
ExecOnNodeWithChroot(oc, node.Name, "sh", "-c", "test -e
/etc/crio/crio.conf.d/01-ctrcfg-additionalArtifactStores && echo exists") for
each node in pureWorkers, capture the stdout, and assert that the output is
empty (e.g. o.Expect(out).To(o.BeEmpty(), "CRI-O config file should be removed
after ContainerRuntimeConfig deletion on node %s", node.Name)); leave the
framework.Logf success message unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/extended/node/additional_layer_stores.go`:
- Around line 507-514: After successfully creating the ContainerRuntimeConfig
(the call to
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(ctx, ctrcfg,
metav1.CreateOptions{})), immediately schedule its cleanup by calling defer
cleanupContainerRuntimeConfig(ctx, mcClient, ctrcfg.Name) right after the
o.Expect(err).NotTo(o.HaveOccurred()) that follows the create; this ensures the
ctrcfg (variable ctrcfg) is removed on test exit even if subsequent assertions
(including waitForContainerRuntimeConfigSuccess) fail.
- Around line 712-719: The test currently proceeds even when
getPureWorkerNodes(...) returns an empty slice, letting directory
setup/verification be no-ops and falsely passing; fix by checking
len(pureWorkers) immediately after pureWorkers :=
getPureWorkerNodes(workerNodes) and, if zero, skip the spec (or return) with a
clear message so the createDirectoriesOnNodes and subsequent verification loops
are not run; apply the same guard in both places where pureWorkers is used (the
block around createDirectoriesOnNodes and the later verification loop) and
reference getPureWorkerNodes, pureWorkers, and createDirectoriesOnNodes when
adding the check.

---

Duplicate comments:
In `@test/extended/node/additional_artifact_stores.go`:
- Around line 567-575: The test continues even when getPureWorkerNodes(...)
returns an empty slice, causing vacuous success on SNO/compact clusters; add an
early skip when pureWorkers is empty by checking len(pureWorkers)==0 and
invoking the test skip helper (e.g., o.Skip or t.Skip) with a clear message
before calling createDirectoriesOnNodes; apply the same guard at the other
occurrence around the block referencing
createDirectoriesOnNodes/cleanupDirectoriesOnNodes (the second instance around
lines 656-664) so both specs bail out when no pure workers exist.
- Around line 79-85: The test currently treats API acceptance of invalid
ContainerRuntimeConfig as success; change the Create() success branch (the block
around mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create and
the defer cleanupContainerRuntimeConfig call) to explicitly fail the test when
an invalid config is accepted (use the test framework's failure/assert method
instead of only logging), and likewise update the poll/validation checks that
compare ObservedGeneration to Generation to fail if
ObservedGeneration==Generation occurs without ever observing the
ContainerRuntimeConfigFailure=True condition (i.e., after your existing poll of
the resource state, assert failure when no failure condition was ever set).
Target the create/accept branches and the poll that checks
ObservedGeneration/Generation and ContainerRuntimeConfigFailure to make
negative-path specs fail on invalid-config acceptance.
- Around line 540-545: Replace the fragile "cat" failure check with an explicit
existence test: call ExecOnNodeWithChroot(oc, node.Name, "sh", "-c", "test -e
/etc/crio/crio.conf.d/01-ctrcfg-additionalArtifactStores && echo exists") for
each node in pureWorkers, capture the stdout, and assert that the output is
empty (e.g. o.Expect(out).To(o.BeEmpty(), "CRI-O config file should be removed
after ContainerRuntimeConfig deletion on node %s", node.Name)); leave the
framework.Logf success message unchanged.

In `@test/extended/node/additional_layer_stores.go`:
- Around line 460-462: Replace the hardcoded external image value assigned to
eStargzImage ("quay.io/bgudi/test-small:estargz") with a repository-managed
image lookup using image.LocationFor(...) (or another test helper that returns
an allowed image), and ensure the chosen image is registered in the allowed
image list used by the test suite; update the assignment where eStargzImage is
defined and any test harness references so they use the LocationFor result
rather than the personal quay URL.
- Around line 207-213: The test currently only logs a warning when an invalid
path is accepted, so add a failing assertion instead: after
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(...)
succeeds, call the test-failure helper (e.g. framework.Failf or t.Fatalf
depending on test harness) with a clear message including tc.char and
ctrcfg.Name rather than just logging; keep the existing
cleanupContainerRuntimeConfig defer and replace the framework.Logf warning
branch with a failure call to ensure the spec fails when invalid paths are
accepted.
- Around line 744-748: After waitForContainerRuntimeConfigSuccess(ctx, mcClient,
ctrcfg.Name, ...), ensure the worker MachineConfigPool leaves its previous
steady state before calling waitForMCP(...): insert a short wait that polls the
"worker" MCP until its status no longer reports the prior steady-state condition
(e.g., not "Updated"/not matching the old rendered configuration) — you can
reuse the existing waitForMCP helper with a flag/variant or add a small helper
(e.g., waitForMCPToLeaveSteadyState) to poll the "worker" pool and return only
when it has transitioned away from the old steady state, then proceed to the
existing waitForMCP(ctx, mcClient, "worker", 25*time.Minute); apply the same
insertion at the other occurrences noted (around the blocks at the other two
locations).

In `@test/extended/node/node_utils.go`:
- Around line 1085-1092: The poll in function waitForPodDeleted is swallowing
non-NotFound GET errors; modify the closure used by wait.PollUntilContextTimeout
so that after calling oc.AdminKubeClient().CoreV1().Pods(...).Get(...) you check
apierrors.IsNotFound(err) -> return true,nil, else if err != nil -> return
false, err (so the poll fails fast with the actual API error), and only return
false,nil when err == nil and the pod still exists; update the closure around
the Get call in waitForPodDeleted accordingly.
- Around line 950-958: The helper waitForDaemonSetReady currently always polls
oc.Namespace() and silences GET errors; change its signature to accept an
explicit namespace (e.g. waitForDaemonSetReady(ctx context.Context, oc
*exutil.CLI, namespace string, name string, timeout time.Duration)) then call
oc.AdminKubeClient().AppsV1().DaemonSets(namespace).Get(...) so it polls the
intended DaemonSet created by createTestDaemonSet, and do not swallow API
errors—return the actual err from the GET (return false, err) instead of
converting all errors into silent timeouts; update all call sites to pass the
correct namespace.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 8bbb8226-1e43-42a4-8e65-d214abf75a43

📥 Commits

Reviewing files that changed from the base of the PR and between af7b257 and 990e5f3.

📒 Files selected for processing (7)
  • pkg/testsuites/standard_suites.go
  • test/extended/node/additional_artifact_stores.go
  • test/extended/node/additional_image_stores.go
  • test/extended/node/additional_layer_stores.go
  • test/extended/node/additional_stores_combined.go
  • test/extended/node/node_utils.go
  • test/extended/node/stargz_store_setup.go
✅ Files skipped from review due to trivial changes (2)
  • test/extended/node/additional_stores_combined.go
  • test/extended/node/stargz_store_setup.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/extended/node/additional_image_stores.go

Comment thread test/extended/node/additional_layer_stores.go
Comment on lines +712 to +719
workerNodes, err := getNodesByLabel(ctx, oc, "node-role.kubernetes.io/worker")
o.Expect(err).NotTo(o.HaveOccurred())
pureWorkers := getPureWorkerNodes(workerNodes)

g.By("Creating shared layer directories on worker nodes")
layerDirs := []string{"/var/lib/layerstore-1", "/var/lib/layerstore-2", "/var/lib/layerstore-3"}
err = createDirectoriesOnNodes(oc, pureWorkers, layerDirs)
o.Expect(err).NotTo(o.HaveOccurred())

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Skip when getPureWorkerNodes(...) returns none.

TC12 and TC13 do not guard the pureWorkers slice. On SNO/compact clusters, the directory setup and verification loops become no-ops, so the specs can pass without validating a worker node.

Also applies to: 798-812

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_layer_stores.go` around lines 712 - 719, The
test currently proceeds even when getPureWorkerNodes(...) returns an empty
slice, letting directory setup/verification be no-ops and falsely passing; fix
by checking len(pureWorkers) immediately after pureWorkers :=
getPureWorkerNodes(workerNodes) and, if zero, skip the spec (or return) with a
clear message so the createDirectoriesOnNodes and subsequent verification loops
are not run; apply the same guard in both places where pureWorkers is used (the
block around createDirectoriesOnNodes and the later verification loop) and
reference getPureWorkerNodes, pureWorkers, and createDirectoriesOnNodes when
adding the check.
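
A short sketch of the guard, reusing the getNodesByLabel and getPureWorkerNodes helper names from the comment and Ginkgo's Skip; the skip message is illustrative:

workerNodes, err := getNodesByLabel(ctx, oc, "node-role.kubernetes.io/worker")
o.Expect(err).NotTo(o.HaveOccurred())
pureWorkers := getPureWorkerNodes(workerNodes)
if len(pureWorkers) == 0 {
    // SNO/compact topologies have no worker-only nodes; skip rather than pass vacuously.
    g.Skip("no pure worker nodes available, skipping additional layer store directory verification")
}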

This adds API validation and E2E tests for the Additional Storage
Configurations feature (TechPreviewNoUpgrade):
- additionalArtifactStores (max 10 stores)
- additionalImageStores (max 10 stores)
- additionalLayerStores (max 5 stores, supports :ref suffix)
- Combined storage configurations
Test suite: openshift/additional-storage-support

Signed-off-by: Bhargavi Gudi <BhargaviGudi@users.noreply.github.com>

Add comprehensive test cases for Additional Storage Support feature

Signed-off-by: Bhargavi Gudi <BhargaviGudi@users.noreply.github.com>

Updated testcases
@BhargaviGudi BhargaviGudi force-pushed the additionalArtifactsAutomation branch from 990e5f3 to d98f6f9 on May 6, 2026, 05:02
@openshift-ci
Contributor

openshift-ci Bot commented May 6, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: BhargaviGudi
Once this PR has been reviewed and has the lgtm label, please assign cpmeadors for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (18)
test/extended/node/additional_layer_stores.go (3)

427-429: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Defer CRC cleanup immediately after create.

If any later assertion fails before Phase 11, the worker pool stays mutated and can poison the rest of the suite.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_layer_stores.go` around lines 427 - 429, After
creating the ContainerRuntimeConfig with
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(ctx, ctrcfg,
metav1.CreateOptions{}), immediately schedule its cleanup with a defer that
deletes the resource (use
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Delete(ctx,
ctrcfg.Name, metav1.DeleteOptions{})) so the worker pool is restored if later
assertions fail; place the defer directly after the Create call and keep using
the existing ctx and ctrcfg.Name to identify the resource.
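
A minimal sketch of the deferred cleanup, assuming the ctx, ctrcfg, and mcClient names used in the spec; context.Background() is used in the defer only so cleanup still runs if ctx has already been cancelled:

created, err := mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(ctx, ctrcfg, metav1.CreateOptions{})
o.Expect(err).NotTo(o.HaveOccurred())
// Register cleanup immediately so a failed assertion later still restores the worker pool.
defer func() {
    delErr := mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Delete(
        context.Background(), created.Name, metav1.DeleteOptions{})
    if delErr != nil && !apierrors.IsNotFound(delErr) {
        framework.Logf("failed to clean up ContainerRuntimeConfig %s: %v", created.Name, delErr)
    }
}()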

632-635: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Skip these specs when there are no pure workers.

Without the guard, directory setup and verification loops can no-op on SNO/compact clusters and still report success.

Also applies to: 730-733

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_layer_stores.go` around lines 632 - 635, After
computing pureWorkers with getPureWorkerNodes(workerNodes), add a guard that
checks if len(pureWorkers) == 0 and skips the spec (e.g., call ginkgo.Skip with
a short message) so the test does not proceed on SNO/compact clusters; apply the
same change at the other occurrence around lines 730-733 where
getPureWorkerNodes is used. Ensure you reference getNodesByLabel and
getPureWorkerNodes to locate the blocks to modify.

208-214: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail this negative test when create succeeds.

Right now an invalid-character regression only logs a warning and the spec still passes.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_layer_stores.go` around lines 208 - 214, The
negative test currently only logs a warning when creating the
ContainerRuntimeConfig unexpectedly succeeds; change that to fail the test so
regressions break the suite. Replace the else branch body that currently defers
cleanupContainerRuntimeConfig(ctx, mcClient, ctrcfg.Name) and calls
framework.Logf("Warning: Path with '%s' accepted at API level", tc.char) with
code that still defers cleanupContainerRuntimeConfig(ctx, mcClient, ctrcfg.Name)
but then calls framework.Failf (or the appropriate test-failing helper used in
this test suite) with a clear message referencing tc.char and ctrcfg.Name so the
test is marked failed when
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(...)
unexpectedly succeeds.
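
A sketch of the failing branch, keeping the existing cleanup but promoting the warning to a spec failure; framework here is the upstream e2e framework package and the other names come from the spec:

created, err := mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(ctx, ctrcfg, metav1.CreateOptions{})
if err != nil {
    framework.Logf("Path with %q rejected at the API level as expected: %v", tc.char, err)
} else {
    defer cleanupContainerRuntimeConfig(ctx, mcClient, created.Name)
    framework.Failf("ContainerRuntimeConfig %s with invalid character %q was accepted at the API level", created.Name, tc.char)
}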
test/extended/node/node_utils.go (1)

950-955: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Stop turning polling API failures into timeouts.

These helpers currently return false, nil on GET errors, so real apiserver/RBAC/transport failures get masked until the poll times out. Return the error for non-NotFound cases and only retry on the states that are actually expected during polling.

Also applies to: 1052-1056, 1087-1091

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/node_utils.go` around lines 950 - 955, The helper
waitForDaemonSetReady currently swallows GET errors by returning false, nil
which masks apiserver/RBAC/transport failures; change the
oc.AdminKubeClient().AppsV1().DaemonSets(namespace).Get(...) error handling to
return the error for all non-missing cases (i.e., if err != nil and
!apierrors.IsNotFound(err) return false, err) and only treat NotFound or
specific transient states as retryable; apply the same fix pattern to the other
similar helpers referenced (the GET calls at the blocks around lines 1052-1056
and 1087-1091) so polling only suppresses expected transient states and surfaces
real errors.
test/extended/node/additional_image_stores.go (5)

80-86: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

These negative API tests still pass if invalid configs are accepted.

The success branches only log, so a validation regression can still produce a green suite. Each case should either assert create-time rejection or, if the API accepts the object, assert that MCO reports ContainerRuntimeConfigFailure=True.

Also applies to: 119-141, 169-175, 247-253, 284-290, 318-324

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_image_stores.go` around lines 80 - 86, The
tests currently only log when Create succeeds, so invalid configs can slip
through; update each test block that calls
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create (and the
other similar blocks at the noted ranges) to assert behavior: if Create returns
an error then fail the test only if the error is unexpected (or explicitly
assert rejection for negative tests), and if Create succeeds then defer
cleanupContainerRuntimeConfig(ctx, mcClient, ctrcfg.Name) and immediately
poll/assert that the Machine Config Operator reports
ContainerRuntimeConfigFailure=True for the created object (use the existing MCO
status-check helper or implement a short wait loop checking the MCO status
condition for the object referenced by ctrcfg.Name and tc.path), failing the
test if the MCO does not set ContainerRuntimeConfigFailure=True within the
timeout; replace the plain framework.Logf success branches with these assertions
for all mentioned blocks.
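
A sketch of the post-create assertion, assuming the machineconfiguration clientset and API types are imported as mcfgclient and mcfgv1 (as in the MachineConfigPool sketch earlier) and that the failure condition constant in openshift/api is named ContainerRuntimeConfigFailure; waitForContainerRuntimeConfigFailure is an illustrative name:

// waitForContainerRuntimeConfigFailure waits until the MCO marks the object Failure=True.
func waitForContainerRuntimeConfigFailure(ctx context.Context, mcClient mcfgclient.Interface, name string, timeout time.Duration) error {
    return wait.PollUntilContextTimeout(ctx, 5*time.Second, timeout, true, func(ctx context.Context) (bool, error) {
        cfg, err := mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Get(ctx, name, metav1.GetOptions{})
        if err != nil {
            return false, err
        }
        for _, cond := range cfg.Status.Conditions {
            if cond.Type == mcfgv1.ContainerRuntimeConfigFailure && cond.Status == corev1.ConditionTrue {
                return true, nil
            }
        }
        return false, nil // no failure condition yet, keep polling
    })
}

Negative specs whose Create call succeeds would then call this helper and fail on timeout instead of only logging.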

550-576: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Poll for the pull event instead of reading events once.

Event recording is asynchronous. Listing once immediately after waitForPodRunning(...) makes both the additional-store and fallback assertions flaky.

Also applies to: 607-629

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_image_stores.go` around lines 550 - 576, The
test currently lists events once after waitForPodRunning and checks for a
"Pulled" event message using variables/events from
oc.AdminKubeClient().CoreV1().Events(...) and foundAlreadyPresentEvent, which is
flaky because events are recorded asynchronously; replace the single List() call
with a polling loop (e.g., wait.PollImmediate or a retry with timeout) that
repeatedly calls oc.AdminKubeClient().CoreV1().Events(...).List(ctx,
metav1.ListOptions{FieldSelector: fmt.Sprintf("involvedObject.name=%s",
testPod.Name)}) and scans events.Items for the Pulled reason and the "already
present on machine and can be accessed by the pod" message until found or
timeout, then assert foundAlreadyPresentEvent; apply the same polling change to
the similar block referenced at lines 607-629 so both the additional-store and
fallback assertions use polling and are no longer flaky (keep references to
testPod and additionalImageStorePath in the assertion message).
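
A sketch of the polling variant, assuming the testPod, additionalImageStorePath, and oc names from the spec plus the usual fmt/strings/wait/Gomega imports; the exact kubelet event message substring may differ by version:

foundAlreadyPresentEvent := false
err = wait.PollUntilContextTimeout(ctx, 5*time.Second, 2*time.Minute, true, func(ctx context.Context) (bool, error) {
    events, listErr := oc.AdminKubeClient().CoreV1().Events(oc.Namespace()).List(ctx, metav1.ListOptions{
        FieldSelector: fmt.Sprintf("involvedObject.name=%s", testPod.Name),
    })
    if listErr != nil {
        return false, listErr
    }
    for _, ev := range events.Items {
        if ev.Reason == "Pulled" && strings.Contains(ev.Message, "already present on machine") {
            foundAlreadyPresentEvent = true
            return true, nil
        }
    }
    return false, nil // event recording is asynchronous, keep polling
})
o.Expect(err).NotTo(o.HaveOccurred())
o.Expect(foundAlreadyPresentEvent).To(o.BeTrue(),
    "expected a Pulled event showing the image was served from %s", additionalImageStorePath)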

718-720: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Skip these specs when there are no pure workers.

Without a guard, the setup and verification loops can no-op on SNO/compact clusters and still report success.

Also applies to: 807-809

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_image_stores.go` around lines 718 - 720, The
test assumes pure worker nodes exist but doesn't handle SNO/compact clusters;
after calling getPureWorkerNodes(workerNodes) add a guard that checks if
len(pureWorkers) == 0 and skip the spec (e.g., call framework.Skipf or
ginkgo.Skip with a clear message like "no pure worker nodes found, skipping") so
the setup and verification loops do not run and report false success; apply the
same guard to the other occurrence that mirrors this logic near the later block.

583-592: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Wait for the first pod to disappear before removing the image.

Deletion is asynchronous here. Running podman rmi immediately can race the terminating pod still using that image on the node, which makes the fallback phase nondeterministic.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_image_stores.go` around lines 583 - 592, After
calling AdminKubeClient().CoreV1().Pods(...).Delete for testPod, wait for the
pod to be fully removed before running the podman rmi on the node: poll the API
(e.g., using wait.PollImmediate or a simple loop calling
AdminKubeClient().CoreV1().Pods(oc.Namespace()).Get(ctx, testPod.Name,
metav1.GetOptions{})) until the Get returns a NotFound error (use
apierrors.IsNotFound(err)) or the pod is clearly terminated, then proceed to
construct removeCmd and call ExecOnNodeWithChroot to run podman rmi against
additionalImageStorePath/testImageDefault. Ensure the wait uses a reasonable
timeout to avoid flakiness.

631-644: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

The hard 2x startup-time assertion is too brittle for CI.

Registry latency and cluster load can dominate wall-clock time even when the event checks already proved the image source. This is likely to create false reds unrelated to the feature.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_image_stores.go` around lines 631 - 644, The
hard 2x assertion using o.Expect(podStartupTime).To(o.BeNumerically("<",
pod2StartupTime/2)) is brittle; change it to a more tolerant, configurable check
on the computed speedup variable (speedup :=
float64(pod2StartupTime)/float64(podStartupTime)) instead of dividing times
directly. Replace the strict check with an assertion like
o.Expect(speedup).To(o.BeNumerically(">", minSpeedup)) where minSpeedup is a
constant (e.g., 1.2) or read from an env var (e.g.,
ADDITIONAL_IMAGE_STORES_MIN_SPEEDUP) so CI can relax the threshold, and keep the
framework.Logf output of speedup for diagnostics; update references to
podStartupTime, pod2StartupTime, speedup and the o.Expect call in
additional_image_stores.go.
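
A sketch of the relaxed check, assuming podStartupTime and pod2StartupTime are time.Duration values from the spec; the ADDITIONAL_IMAGE_STORES_MIN_SPEEDUP variable and the 1.2 default are illustrative:

speedup := float64(pod2StartupTime) / float64(podStartupTime)
framework.Logf("startup speedup with the additional image store: %.2fx", speedup)

minSpeedup := 1.2 // modest default instead of a hard 2x requirement
if v := os.Getenv("ADDITIONAL_IMAGE_STORES_MIN_SPEEDUP"); v != "" {
    if parsed, parseErr := strconv.ParseFloat(v, 64); parseErr == nil {
        minSpeedup = parsed
    }
}
o.Expect(speedup).To(o.BeNumerically(">", minSpeedup),
    "expected the pre-populated store to start the pod at least %.1fx faster", minSpeedup)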
test/extended/node/stargz_store_setup.go (4)

49-84: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Mark the setup as deployed before post-create verification starts.

If namespace/DaemonSet creation succeeds but readiness or node verification fails, callers skip Cleanup() because deployed is still false. That leaks partial cluster state from failed setups.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/stargz_store_setup.go` around lines 49 - 84, The setup
currently sets s.deployed = true only after all post-create verifications, which
prevents Cleanup() from running on partial failures; move the deployment marker
so cleanup runs on failures during readiness/verification: after
createDaemonSet(ctx) returns nil (successful creation of the DaemonSet) set
s.deployed = true, then proceed to waitForDaemonSetReady(ctx, ...) and
verifyStargzStoreRunning(ctx). Update the code around createDaemonSet,
waitForDaemonSetReady, and verifyStargzStoreRunning to reflect this ordering
(set s.deployed = true immediately after createDaemonSet succeeds).

468-484: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fail deployment when per-node verification commands fail.

Continuing after ExecOnNodeWithChroot(...) errors lets Deploy() report success without actually validating every worker.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/stargz_store_setup.go` around lines 468 - 484, The loop
that verifies stargz-store currently ignores failures from ExecOnNodeWithChroot
and continues, allowing Deploy() to succeed despite missing per-node validation;
change the error handling in the checks that call ExecOnNodeWithChroot (both the
"systemctl is-active stargz-store" call and the "mount" call) to return a
descriptive error (e.g., fmt.Errorf with node.Name and the original err) instead
of logging a warning and continuing so that Deploy() fails when per-node
verification cannot be performed.

90-118: ⚠️ Potential issue | 🔴 Critical | 🏗️ Heavy lift

Cleanup does not undo the host-level installation.

Deleting the namespace leaves the binary, systemd unit, config, mount, enabled service, and CRI-O side effects on the workers. Later specs will inherit modified nodes unless cleanup also stops/disables stargz-store, removes the host files, reloads systemd, and restarts CRI-O.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/stargz_store_setup.go` around lines 90 - 118, The Cleanup
method currently only deletes the namespace and leaves host-level artifacts;
update Cleanup(ctx context.Context) error to also perform host cleanup on all
worker nodes where stargz-store was installed: for each node (discover via the
cluster client or stored node list on the setup struct) SSH/exec to stop and
disable the systemd unit (e.g., "stargz-store.service"), remove the unit file
and any drop-ins, remove the installed binary and config files, unmount and
remove any mountpoints created by stargz-store, reload systemd daemon, and
restart the CRI-O service to undo side effects; ensure all operations handle
NotFound errors and surface failures via framework.Logf, and keep s.deployed set
to false only after host cleanup completes.

275-284: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Avoid downloading stargz-store from GitHub during test setup.

This makes the spec depend on external internet reachability and release availability instead of cluster behavior. A repo/CI-provided artifact or payload-owned image is much safer for extended tests.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/stargz_store_setup.go` around lines 275 - 284, The test
setup currently downloads stargz-store from GitHub using the curl/tar commands
(the "Downloading stargz-snapshotter" block and subsequent "curl -L -o
/tmp/stargz.tar.gz", "tar -xzf /tmp/stargz.tar.gz -C /tmp/", and "cp
/tmp/stargz-store /host/usr/local/bin/" lines), which creates an external
dependency; replace that download sequence with a deterministic source: pull the
binary from a repo/CI-provided artifact or extract it from a payload-owned image
(for example unpack a test-assets tar in the repo or use a container image that
contains stargz-store), and change the script to copy from that local path
(e.g., ../test-assets/stargz-store or an image extraction step) into
/host/usr/local/bin/, adding a clear failure message if the artifact is missing.
Ensure the curl/tar lines are removed and that the script still sets +x on the
copied stargz-store.
test/extended/node/additional_stores_combined.go (2)

269-271: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Skip these specs when there are no pure workers.

On SNO/compact topologies, getPureWorkerNodes(...) can be empty. In that case the directory setup and verification loops no-op, and the spec can pass without validating any worker node.

Also applies to: 367-369, 441-443, 546-548

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_stores_combined.go` around lines 269 - 271, The
spec must be skipped when there are no pure worker nodes: after calling
getNodesByLabel(...) and computing pureWorkers :=
getPureWorkerNodes(workerNodes) add a guard that if len(pureWorkers) == 0 you
call the test skip helper (e.g., Skip/Skipf from your test framework) and return
so the setup/verification loops are not treated as a passing test; apply the
same check at the other occurrences of getPureWorkerNodes (the blocks around
lines 367-369, 441-443, and 546-548) to consistently skip those specs when
pureWorkers is empty.

67-69: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Assert the rejection reason, not just that some error happened.

This passes on unrelated create failures too. Checking that the error mentions the invalid image path under test keeps TC1 tied to the actual validation being exercised.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_stores_combined.go` around lines 67 - 69, The
test currently only asserts that Create returned some error; change it to assert
the rejection reason explicitly by checking the error message contains the
invalid image path string used in ctrcfg. After the Create call on
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create, keep the
existence check but add an assertion that err.Error() (or the returned Status
message) contains the specific bad image path value from the ctrcfg object (the
invalid image path constant/field you set for the test) using a substring
matcher (e.g. Gomega ContainSubstring) so the failure is tied to the intended
validation.
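
A sketch of the tightened assertion; invalidImagePath stands for whatever bad path the spec puts into ctrcfg and is a hypothetical name here:

_, err = mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create(ctx, ctrcfg, metav1.CreateOptions{})
o.Expect(err).To(o.HaveOccurred())
// Tie the rejection to the validation under test, not to any arbitrary create failure.
o.Expect(err.Error()).To(o.ContainSubstring(invalidImagePath))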
test/extended/node/additional_artifact_stores.go (3)

567-570: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Skip these specs when there are no pure workers.

If getPureWorkerNodes(...) returns empty, the setup and verification loops become no-ops and the spec can report success without touching a worker node.

Also applies to: 656-658

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_artifact_stores.go` around lines 567 - 570, The
spec must early-skip when no pure worker nodes exist: after calling
getPureWorkerNodes (the pureWorkers variable) check if len(pureWorkers) == 0 and
call the test-skip helper (e.g., ginkgo.Skip / o.Skip with a clear message) so
the setup and verification loops become no-ops safely; apply the same guard in
the second occurrence around the later getPureWorkerNodes usage (the block at
lines ~656-658) to avoid falsely passing when no worker nodes are available.

79-85: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

These negative specs still go green when invalid configs are accepted.

Several cases only log on success, and TC2's fallback poll path never asserts that ContainerRuntimeConfigFailure=True was actually reached. If validation regresses, this suite will still pass. Each case should either require create-time rejection or explicitly assert the failure condition after creation.

Also applies to: 118-140, 168-174, 246-252, 283-289, 317-323

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_artifact_stores.go` around lines 79 - 85, The
negative tests currently only log success and never fail the test when an
invalid ContainerRuntimeConfig is accepted; update each case that expects
rejection (e.g., the Create call via
mcClient.MachineconfigurationV1().ContainerRuntimeConfigs().Create and the TC2
fallback poll) so they either assert create-time failure (t.Fatalf/t.Errorf)
when Create returns nil error, or, if creation is allowed, immediately poll and
assert the MCO reached ContainerRuntimeConfigFailure=True via the existing poll
helper (the same helper used in other checks) before deferring
cleanupContainerRuntimeConfig; apply the same change pattern to the other
similar blocks you flagged (around the ranges noted) so invalid configs cause
test failures instead of only logging.

540-545: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Assert that the config file is absent, not just that cat failed.

Any debug-session or node-command failure satisfies this expectation today. Use a command that succeeds and prints only when the file exists, then assert empty stdout.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_artifact_stores.go` around lines 540 - 545,
Replace the fragile "cat" failure check in the PHASE 6 loop: instead of
expecting ExecOnNodeWithChroot(oc, node.Name, "cat",
"/etc/crio/crio.conf.d/01-ctrcfg-additionalArtifactStores") to error, run
ExecOnNodeWithChroot(oc, node.Name, "sh", "-c", "<command that only emits
content if the file exists>") (or equivalent) so the command always succeeds and
prints only when the file is present; then assert err is nil and that stdout is
empty for each node in pureWorkers. Use the existing ExecOnNodeWithChroot call
and the known path "/etc/crio/crio.conf.d/01-ctrcfg-additionalArtifactStores" to
locate the code and replace the Expect(err).To(o.HaveOccurred()) check with an
assertion that stdout == "" (file absent) and no error occurred.
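
A sketch of the absence check, reusing ExecOnNodeWithChroot as it appears elsewhere in this PR; the shell one-liner only prints the file when it exists, so empty stdout plus a nil error means the drop-in is gone:

const crioDropIn = "/etc/crio/crio.conf.d/01-ctrcfg-additionalArtifactStores"
checkCmd := fmt.Sprintf("test -f %s && cat %s || true", crioDropIn, crioDropIn)
out, err := ExecOnNodeWithChroot(oc, node.Name, "sh", "-c", checkCmd)
o.Expect(err).NotTo(o.HaveOccurred())
o.Expect(strings.TrimSpace(out)).To(o.BeEmpty(),
    "expected %s to be absent on node %s after ContainerRuntimeConfig deletion", crioDropIn, node.Name)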
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/extended/node/additional_layer_stores.go`:
- Around line 1044-1057: The test hardcodes a workstation-local manifest path in
the nydusSetupYAML constant which will not exist in CI; update the test to load
the nydus-store manifest from the repo or embed it and apply programmatically
instead of using "/home/bgudi/...". Replace the nydusSetupYAML constant with a
relative path within the repository (or read the YAML from a testdata resource
or an embedded string) and change the oc.AsAdmin().Run("apply").Args("-f",
nydusSetupYAML).Output() call to use that resource; ensure the test still sets
nydusNamespace/nydusStorePath/nydusTestPodImage as before and validate the file
is packaged in the repo or embedded so CI can access it.

In `@test/extended/node/additional_stores_combined.go`:
- Around line 671-680: The test currently only copies files into imageStorePath
(skopeoCmd/ExecOnNodeWithChroot) and asserts directory contents, but never
verifies CRI-O actually used additionalImageStores; add a pod-level pull/launch
using the prepopulated image (create a Pod spec that references the
prepopulated-image used in skopeoCmd) and after scheduling ensure the pod
reaches Running (or ImagePullBackOff failure absence) and then on the node use
ExecOnNodeWithChroot to run the container runtime inspection (e.g., crictl
images or crictl inspecti) to assert the image is present and was resolved
locally from imageStorePath (check the image entry/ID and any path/source
metadata), and fail the test if the pod had to fetch the image from registry or
if the runtime shows no local image; reference skopeoCmd, imageStorePath,
ExecOnNodeWithChroot, and lsOutput to find where to insert the pod creation and
runtime-inspection assertions.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 85d07e88-830c-48ae-9844-1ef372fa2007

📥 Commits

Reviewing files that changed from the base of the PR and between 990e5f3 and d98f6f9.

📒 Files selected for processing (6)
  • test/extended/node/additional_artifact_stores.go
  • test/extended/node/additional_image_stores.go
  • test/extended/node/additional_layer_stores.go
  • test/extended/node/additional_stores_combined.go
  • test/extended/node/node_utils.go
  • test/extended/node/stargz_store_setup.go

Comment on lines +1044 to +1057
const (
nydusNamespace = "nydus-store"
nydusSetupYAML = "/home/bgudi/work/src/github.com/openshift/epic/additionalArtifactsStore/testfiles/nydus-store-setup.yaml"
nydusStorePath = "/var/lib/nydus-store"
nydusTestPodImage = "quay.io/bgudi/test-small:oci"
nydusTestName = "test-nydus-layerstore"
)

g.By("Phase 1: Deploying nydus-store DaemonSet")
framework.Logf("Applying nydus-store setup from: %s", nydusSetupYAML)

// Apply the nydus-store setup DaemonSet
output, err := oc.AsAdmin().Run("apply").Args("-f", nydusSetupYAML).Output()
o.Expect(err).NotTo(o.HaveOccurred())

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

TC15 depends on a workstation-local manifest path.

/home/bgudi/.../nydus-store-setup.yaml will not exist in CI or on other developers' machines, so this spec cannot run outside the author's environment. The manifest needs to live in-repo or be embedded/applied programmatically.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_layer_stores.go` around lines 1044 - 1057, The
test hardcodes a workstation-local manifest path in the nydusSetupYAML constant
which will not exist in CI; update the test to load the nydus-store manifest
from the repo or embed it and apply programmatically instead of using
"/home/bgudi/...". Replace the nydusSetupYAML constant with a relative path
within the repository (or read the YAML from a testdata resource or an embedded
string) and change the oc.AsAdmin().Run("apply").Args("-f",
nydusSetupYAML).Output() call to use that resource; ensure the test still sets
nydusNamespace/nydusStorePath/nydusTestPodImage as before and validate the file
is packaged in the repo or embedded so CI can access it.
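
A sketch of the in-repo alternative using go:embed; the testdata path is illustrative, and it assumes the exutil CLI wrapper can feed the manifest on stdin (otherwise write the embedded YAML to a temp file and apply that):

import _ "embed"

//go:embed testdata/nydus-store-setup.yaml
var nydusSetupYAML string

// ...inside the spec, apply the embedded manifest instead of a workstation path:
err := oc.AsAdmin().Run("apply").Args("-f", "-").InputString(nydusSetupYAML).Execute()
o.Expect(err).NotTo(o.HaveOccurred())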

Comment on lines +671 to +680
// Use skopeo to copy image to additional image store
skopeoCmd := fmt.Sprintf("skopeo copy docker://%s dir:%s/prepopulated-image", testImage, imageStorePath)
_, err = ExecOnNodeWithChroot(oc, testNode.Name, "bash", "-c", skopeoCmd)
o.Expect(err).NotTo(o.HaveOccurred())

// Verify image was copied
lsOutput, err := ExecOnNodeWithChroot(oc, testNode.Name, "ls", "-la", imageStorePath+"/prepopulated-image")
o.Expect(err).NotTo(o.HaveOccurred())
o.Expect(lsOutput).To(o.ContainSubstring("manifest.json"))
framework.Logf("Image pre-populated successfully: %s", lsOutput)

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

TC9 never proves that additionalImageStores was actually used.

The current flow only copies files into imageStorePath and later checks the directory plus storage.conf. This still passes if CRI-O ignores additionalImageStores entirely. To match the test name, the image-store phase needs a pod-level assertion that the image was resolved from the configured store.

Also applies to: 792-804

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/extended/node/additional_stores_combined.go` around lines 671 - 680, The
test currently only copies files into imageStorePath
(skopeoCmd/ExecOnNodeWithChroot) and asserts directory contents, but never
verifies CRI-O actually used additionalImageStores; add a pod-level pull/launch
using the prepopulated image (create a Pod spec that references the
prepopulated-image used in skopeoCmd) and after scheduling ensure the pod
reaches Running (or ImagePullBackOff failure absence) and then on the node use
ExecOnNodeWithChroot to run the container runtime inspection (e.g., crictl
images or crictl inspecti) to assert the image is present and was resolved
locally from imageStorePath (check the image entry/ID and any path/source
metadata), and fail the test if the pod had to fetch the image from registry or
if the runtime shows no local image; reference skopeoCmd, imageStorePath,
ExecOnNodeWithChroot, and lsOutput to find where to insert the pod creation and
runtime-inspection assertions.
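
A rough sketch of the runtime-level check the comment asks for, run after the pod that uses the pre-populated image reaches Running; testImage, testNode, and ExecOnNodeWithChroot are the names already used in this spec, prepopulatedPod is a hypothetical name for the pod the comment asks to create, and the exact crictl output format may vary:

// Confirm CRI-O can see the image at all before asserting where it came from.
imgOut, err := ExecOnNodeWithChroot(oc, testNode.Name, "crictl", "images", "--digests")
o.Expect(err).NotTo(o.HaveOccurred())
o.Expect(imgOut).To(o.ContainSubstring(testImage),
    "expected %s to be visible to CRI-O via the additional image store", testImage)

// And verify the kubelet did not have to pull it from the registry.
events, err := oc.AdminKubeClient().CoreV1().Events(oc.Namespace()).List(ctx, metav1.ListOptions{
    FieldSelector: fmt.Sprintf("involvedObject.name=%s", prepopulatedPod.Name),
})
o.Expect(err).NotTo(o.HaveOccurred())
for _, ev := range events.Items {
    o.Expect(ev.Reason).NotTo(o.Equal("Pulling"), "pod should not pull %s from the registry", testImage)
}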

