e2e: PP: cover ExecCPUAffinity support in tests#1432

Open
shajmakh wants to merge 3 commits into openshift:main from shajmakh:exec-affinity-pp-e2e

Conversation

@shajmakh
Contributor

@shajmakh shajmakh commented Nov 13, 2025

Add basic e2e tests that check the default behavior of the performance profile with ExecCPUAffinity: first enabled by default.

@openshift-ci openshift-ci bot requested review from MarSik and swatisehgal November 13, 2025 13:46
@openshift-ci
Contributor

openshift-ci bot commented Nov 13, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: shajmakh
Once this PR has been reviewed and has the lgtm label, please assign ffromani for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@shajmakh
Contributor Author

Depends on #1426.

@shajmakh shajmakh force-pushed the exec-affinity-pp-e2e branch 3 times, most recently from 6ef4f1a to beeea3d Compare November 18, 2025 10:34
@shajmakh shajmakh force-pushed the exec-affinity-pp-e2e branch 4 times, most recently from 565820a to 0af067c Compare January 20, 2026 11:14
@shajmakh
Contributor Author

Regarding ci/prow/e2e-gcp-pao-updating-profile: the newly added test in this PR is failing because the exec process was always (for 20 retries) pinned to the first CPU of the set, even though the execCPUAffinity feature is disabled.
This was tested locally several times and it passed. Looking deeper into the must-gather, we can see that the PP cpu config is as follows:
cpu: isolated: 1-3 reserved: "0"
while the test logs show that the exclusive CPUs assigned to the running (guaranteed) container were:
first exclusive CPU: 1, all exclusive CPUs: 1,4
which means CPU 4 is likely offline, leaving only CPU 1 for the process to be pinned to.
Looking at the node's allocatable CPUs:
allocatable: cpu: "5"
we see that the PP did not distribute the rest of the unreserved CPUs, which caused misalignment when scheduling a workload.
Investigation is ongoing to solve this.

/test e2e-gcp-pao-workloadhints

shajmakh added a commit to shajmakh/release that referenced this pull request Jan 22, 2026
The GCP cluster profile uses the ipi-gcp flow, which by default uses 6 vCPUs for compute machines (see `step-registry/ipi/conf/gcp/ipi-conf-ref.yaml`). The performance profile suite configures a profile with `reserved: "0"` and `isolated: "1-3"` (see openshift/cluster-node-tuning-operator#909), unless environment variables are specified.
In general it is good practice to include all of the node's CPUs in the
PP cpu section, but the reason we need this now is that some new tests require most of the CPUs to be distributed via the PP (see openshift/cluster-node-tuning-operator#1432 (comment)).

In this commit we update only the affected job on which the test
runs; later we will need to add this setting to all other jobs that
consume the ipi-gcp cluster configuration.
Note: this is subject to change should the CPU specifications on GCP get
modified.

Signed-off-by: Shereen Haj <shajmakh@redhat.com>
shajmakh added a commit to shajmakh/release that referenced this pull request Jan 22, 2026
@shajmakh
Contributor Author

/retest

shajmakh added a commit to shajmakh/release that referenced this pull request Jan 22, 2026
@shajmakh
Contributor Author

When the failing test was temporarily removed (it fails because the node topology misaligns with the PP cpu section), the
ci/prow/e2e-gcp-pao-updating-profile lane passed. A fix for the infra issue is proposed here: openshift/release#73835

@shajmakh
Contributor Author

/hold
for prow change to be merged

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 23, 2026
openshift-merge-bot bot pushed a commit to openshift/release that referenced this pull request Jan 23, 2026
@shajmakh shajmakh force-pushed the exec-affinity-pp-e2e branch 2 times, most recently from 0af067c to 41afeca Compare January 26, 2026 06:56
@shajmakh
Contributor Author

/test e2e-aws-ovn
/test e2e-aws-operator

@shajmakh shajmakh force-pushed the exec-affinity-pp-e2e branch 2 times, most recently from acdb51a to a8b7158 Compare January 27, 2026 20:50
@shajmakh
Contributor Author

/retest

@shajmakh
Contributor Author

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 29, 2026
dhensel-rh pushed a commit to dhensel-rh/release that referenced this pull request Feb 19, 2026
…#73835)

@SargunNarula
Contributor

@shajmakh I think all the concerns are covered now and the tests look good to me.
/lgtm

By("Run exec command on the pod and verify the process is pinned not only to the first exclusive CPU")

for i := 0; i < retries; i++ {
	cmd := []string{"/bin/bash", "-c", "sleep 10 & SLPID=$!; ps -o psr -p $SLPID;"}
Contributor


we really need a more robust approach here

Contributor Author

@shajmakh shajmakh Mar 3, 2026


I updated the code to a more robust verification. Let me know if this meets your expectations. Thanks.

@shajmakh shajmakh force-pushed the exec-affinity-pp-e2e branch 2 times, most recently from 305d0c1 to 33c0996 Compare March 3, 2026 15:33
@shajmakh
Contributor Author

shajmakh commented Mar 4, 2026

/retest

@shajmakh shajmakh force-pushed the exec-affinity-pp-e2e branch from 33c0996 to 6e54136 Compare March 4, 2026 10:55
@shajmakh
Contributor Author

shajmakh commented Mar 4, 2026

@ffromani @Tal-or Thank you both for your comments and insights. I addressed them and adjusted the retries logic to a more robust approach. Please have a look and let me know your thoughts.

@shajmakh shajmakh force-pushed the exec-affinity-pp-e2e branch from 6e54136 to fe1870d Compare March 9, 2026 13:33
@coderabbitai

coderabbitai bot commented Mar 9, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Adds exec-cpu-affinity e2e tests, new pod/node/resource test utilities, consolidates pod deletion to a DeleteAndSync API across tests, adds sibling-CPU detection and CPU rounding helpers, and promotes golang.org/x/sync to a direct dependency in go.mod.

Changes

Cohort / File(s) Summary
Dependency
go.mod
Promoted golang.org/x/sync v0.19.0 to a direct requirement.
Exec-CPU-Affinity tests
test/e2e/performanceprofile/functests/11_mixedcpus/mixedcpus.go, test/e2e/performanceprofile/functests/1_performance/cpu_management.go, test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go
Added Describe/When test contexts and DescribeTable matrices to validate exec CPU pinning across QoS classes, with concurrent sampling, retries, and profile update/restore flows. Note: a duplicated test block appears in mixedcpus.go.
Pod deletion refactor (call sites)
test/e2e/performanceprofile/functests/...
Replaced local deleteTestPod usages with pods.DeleteAndSync(ctx, client, pod) and updated teardown flows to assert success across multiple test files (irqbalance, llc, hugepages, memorymanager, latency, cgroups, workloadhints, etc.).
Pod utilities (new API)
test/e2e/performanceprofile/functests/utils/pods/pods.go
Added MakePodWithResources(nodeName, qos, containersResources), DeleteAndSync(ctx, client, pod), and changed WaitForDeletion to accept a client.Client parameter.
Node utilities
test/e2e/performanceprofile/functests/utils/nodes/nodes.go
Added GetTwoSiblingsFromCPUSet(siblings map[int]map[int][]int, cpuSet cpuset.CPUSet) (cpuset.CPUSet, error) to locate two-sibling CPU pairs inside a CPUSet.
Node utilities tests
test/e2e/performanceprofile/functests/utils/nodes/nodes_test.go
New table-driven tests covering success and error cases for GetTwoSiblingsFromCPUSet.
Resource utilities
test/e2e/performanceprofile/functests/utils/resources/resources.go
Added TotalCPUsRoundedUp(containersResources []corev1.ResourceList) int and MaxCPURequestsRoundedUp(containersResources []corev1.ResourceList) int.
Resource utilities tests
test/e2e/performanceprofile/functests/utils/resources/resources_test.go
New comprehensive tests for total and max CPU rounding behavior across empty, fractional, and mixed resource lists.
Misc test cleanups
test/e2e/performanceprofile/functests/...
Removed unused k8s.io/apimachinery/pkg/api/errors imports, eliminated local delete helper functions, minor formatting and teardown adjustments across many test files.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 36.84% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Structure And Quality ⚠️ Warning Pull request contains multiple unaddressed review comments regarding test structure and quality issues. Replace testclient.Client.Create() with testclient.DataPlaneClient.Create() for pod creation; register cleanup immediately after pod creation; reset needsUpdate flag in BeforeEach; fix CPU rounding functions; update malformed core handling in GetTwoSiblingsFromCPUSet.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately reflects the main objective of the pull request: adding end-to-end tests for ExecCPUAffinity support in the performance-profile feature.
Stable And Deterministic Test Names ✅ Passed All test names use stable, deterministic hardcoded strings without runtime-generated identifiers.


@shajmakh shajmakh force-pushed the exec-affinity-pp-e2e branch from fe1870d to 2abfb4c Compare March 9, 2026 13:40

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (3)
test/e2e/performanceprofile/functests/11_mixedcpus/mixedcpus.go (1)

696-704: Formula explanation is helpful; consider extracting to a named constant or helper.

The retry formula logic is now well-documented. For clarity and reusability, consider extracting the magic number 20 to a named constant:

const baseRetries = 20 // minimum retries for cpuset of size 2 to avoid false positives
retries = int(math.Ceil(float64(baseRetries) / float64(cpuRequest)))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/performanceprofile/functests/11_mixedcpus/mixedcpus.go` around lines
696 - 704, Extract the magic number used in the retries calculation into a named
constant to improve clarity and reusability: replace the literal 20 in the
expression that sets retries (retries = int(math.Ceil(float64(20) /
float64(cpuRequest)))) with a well-named constant (e.g., baseRetries) declared
near the top of the file or function, and use that constant in the math.Ceil
call so the code reads retries = int(math.Ceil(float64(baseRetries) /
float64(cpuRequest))).
test/e2e/performanceprofile/functests/utils/nodes/nodes.go (1)

362-376: Consider improving error message clarity.

The error message on line 367 uses %d format for core which is []int, resulting in output like core [0 4] has 2 siblings. Consider using a more descriptive message:

 		for coreID, core := range node {
 			if len(core) != 2 {
-				return cpuset.New(), fmt.Errorf("core %d has %d siblings, wanted 2", core, len(core))
+				return cpuset.New(), fmt.Errorf("core with CPUs %v has %d siblings, wanted 2", core, len(core))
 			}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/performanceprofile/functests/utils/nodes/nodes.go` around lines 362
- 376, The error message in GetTwoSiblingsFromCPUSet incorrectly formats the
slice variable core with %d and is unclear; update the fmt.Errorf call to print
the core CPU list (e.g., use %v or join the ints) and keep the sibling count via
len(core) so the message reads something like "core [0 4] has 2 siblings, wanted
2" or a clearer variant; locate the fmt.Errorf in GetTwoSiblingsFromCPUSet and
replace the format specifier for core accordingly.
test/e2e/performanceprofile/functests/1_performance/cpu_management.go (1)

1228-1232: Assert the pod’s actual QoS class before checking affinity.

All later expectations are keyed off qos, but after Line 1231 the spec never verifies the created pod actually landed in that class. If MakePodWithResources changes, the best-effort/burstable rows can silently exercise the wrong branch.

Suggested fix
 					Expect(testclient.Client.Create(ctx, testPod)).To(Succeed(), "Failed to create test pod")
 					testPod, err = pods.WaitForCondition(ctx, client.ObjectKeyFromObject(testPod), corev1.PodReady, corev1.ConditionTrue, 5*time.Minute)
 					Expect(err).ToNot(HaveOccurred())
+					Expect(testPod.Status.QOSClass).To(Equal(qos),
+						"unexpected QoS class for pod %s/%s", testPod.Namespace, testPod.Name)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/performanceprofile/functests/1_performance/cpu_management.go` around
lines 1228 - 1232, After creating and waiting for the pod, assert that the pod
actually landed in the expected QoS class: after pods.WaitForCondition returns
the updated testPod, add an Expect that compares testPod.Status.QOSClass to the
expected qos (using the same corev1.PodQOSClass type or string form) before any
affinity-related assertions; this ensures MakePodWithResources and
pods.WaitForCondition produced the intended QoS (refer to symbols
MakePodWithResources, testPod, pods.WaitForCondition, and qos).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/performanceprofile/functests/1_performance/cpu_management.go`:
- Around line 1230-1240: The test registers pod cleanup after waiting for
readiness, so if Expect(Create(...)) or pods.WaitForCondition(...) fails before
the defer is set the pod can be leaked; move the defer that deletes testPod to
immediately after creating testPod (right after
Expect(testclient.Client.Create(ctx, testPod)).To(Succeed(), ...)) so the
cleanup closure referencing testPod is always registered, and keep the readiness
wait using pods.WaitForCondition(...) and its error checks unchanged.

In `@test/e2e/performanceprofile/functests/11_mixedcpus/mixedcpus.go`:
- Line 634: Replace the call to the legacy cluster client with the data-plane
client so pod creation targets the workload cluster: change the call that uses
testclient.Client.Create(ctx, testPod) to use
testclient.DataPlaneClient.Create(ctx, testPod) (the Expect assertion around the
Create should remain unchanged) so utilities and Hypershift deployments use the
correct client.

In
`@test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go`:
- Around line 1358-1368: Move the pod cleanup defer to immediately after the pod
is created so it always registers before any subsequent waits/operations;
specifically, after the successful call to testclient.Client.Create (and after
testPod is assigned) install the existing defer cleanup block that checks
testPod==nil and calls pods.Delete, then call pods.WaitForCondition as currently
done. This ensures the defer (the cleanup closure that logs and calls
pods.Delete) is registered before pods.WaitForCondition and prevents leaked
pods/CPU allocation if Create or Wait fails.
- Around line 1370-1413: The test currently only checks that execProcessCPUs are
not always the first exclusive CPU (firstExclusiveCPU) which allows values on
reserved/shared/offline CPUs; update the check to first compute the runnable set
by intersecting cpusList (from cpuset.Parse on cpusetCfg.Cpus) with the node’s
online CPU set, assert the runnable set size is >= 2 (skip the test if <2),
validate each parsed sample from outputparser.ParsePSROutput (collected into
execProcessCPUs via pods.ExecCommandOnPod) is contained in that runnable set,
and then also assert that not all samples equal the first element of the
runnable set (instead of the original firstExclusiveCPU).

In `@test/e2e/performanceprofile/functests/utils/pods/pods.go`:
- Around line 87-106: In Delete, after calling testclient.DataPlaneClient.Get in
the Delete function, currently only errors.IsNotFound(err) is checked; add an
explicit check for err != nil that returns the error (e.g., if err != nil {
return err }) so non-NotFound GET errors (network/permission issues) are
propagated instead of proceeding to Delete; modify the Delete function's
early-return logic around testclient.DataPlaneClient.Get to handle this case.

In `@test/e2e/performanceprofile/functests/utils/resources/resources.go`:
- Around line 18-27: MaxCPURequestsRounded currently truncates fractional CPUs
by using containerResources.Cpu().Value(), so replace that with MilliValue-based
rounding: for each containerResources in MaxCPURequestsRounded call
containerResources.Cpu().MilliValue(), compute rounded CPUs as int((milli + 999)
/ 1000) (ceiling division) and use that to update maxPodCpus; update the
variable names accordingly in the function to preserve behavior for values like
"500m".


ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6a597827-4f33-4d68-816b-6dabba149f48

📥 Commits

Reviewing files that changed from the base of the PR and between d6c7df0 and fe1870d.

📒 Files selected for processing (11)
  • go.mod
  • test/e2e/performanceprofile/functests/11_mixedcpus/mixedcpus.go
  • test/e2e/performanceprofile/functests/1_performance/cpu_management.go
  • test/e2e/performanceprofile/functests/1_performance/irqbalance.go
  • test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go
  • test/e2e/performanceprofile/functests/utils/nodes/nodes.go
  • test/e2e/performanceprofile/functests/utils/nodes/nodes_test.go
  • test/e2e/performanceprofile/functests/utils/outputparser/outputparser.go
  • test/e2e/performanceprofile/functests/utils/pods/pods.go
  • test/e2e/performanceprofile/functests/utils/resources/resources.go
  • test/e2e/performanceprofile/functests/utils/resources/resources_test.go


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

♻️ Duplicate comments (6)
test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go (2)

1358-1368: ⚠️ Potential issue | 🟠 Major

Register pod cleanup before the readiness wait.

If pods.WaitForCondition(...) fails, the defer is never installed and the guaranteed pod can leak exclusive CPUs into later specs. Register cleanup immediately after the successful Create.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go`
around lines 1358 - 1368, The cleanup defer for deleting the test pod must be
registered immediately after a successful Create to avoid leaking resources if
pods.WaitForCondition fails; move the defer that references pods.Delete, testPod
and testlog.Infof to directly follow the Expect(testclient.Client.Create(ctx,
testPod)).To(Succeed()) call (before calling pods.WaitForCondition) and adjust
the deferred closure to handle a nil or non-existent testPod safely while
preserving the existing log and Expect deletion behavior.

1370-1410: ⚠️ Potential issue | 🟠 Major

Validate against runnable exclusive CPUs, not the raw cgroup set.

The failure mode described in the PR discussion is still possible here: cpusetCfg.Cpus can include offline CPUs, so the “not always the first CPU” assertion becomes impossible when only one exclusive CPU is actually runnable. Intersect with nodes.GetOnlineCPUsSet(...), skip when fewer than two CPUs remain, and assert every sampled CPU stays inside that runnable set.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go`
around lines 1370 - 1410, The test currently uses cpusetCfg.Cpus (via
cpuset.Parse) which may include offline CPUs; change the logic to intersect that
parsed set with the runnable set returned by nodes.GetOnlineCPUsSet(...) to
compute a runnableExclusiveSet, skip the rest of the test when
runnableExclusiveSet.Size() < 2, update firstExclusiveCPU to come from
runnableExclusiveSet.List()[0], and, after collecting execProcessCPUs, assert
(1) every sampled CPU is contained in runnableExclusiveSet and (2) the samples
are not all equal to firstExclusiveCPU; update variable references (cpusetCfg,
cpuset.Parse, firstExclusiveCPU, runnableExclusiveSet, execProcessCPUs)
accordingly.
test/e2e/performanceprofile/functests/utils/resources/resources.go (1)

18-26: ⚠️ Potential issue | 🟠 Major

MaxCPURequestsRounded still truncates milliCPU requests.

500m becomes 0 here, so burstable sizing can undercount the CPUs the test actually needs. This is the same rounding bug already raised earlier on this helper.

Proposed fix
 func MaxCPURequestsRounded(containersResources []corev1.ResourceList) int {
 	maxPodCpus := 0
 	for _, containerResources := range containersResources {
-		current := int(containerResources.Cpu().Value())
+		milliValue := containerResources.Cpu().MilliValue()
+		current := int((milliValue + 999) / 1000)
 		if current > maxPodCpus {
 			maxPodCpus = current
 		}
 	}
 	return maxPodCpus
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/performanceprofile/functests/utils/resources/resources.go` around
lines 18 - 26, MaxCPURequestsRounded currently truncates milliCPU values (e.g.,
500m -> 0) because it uses Cpu().Value(); change it to use Cpu().MilliValue(),
convert millis to CPUs with ceiling division (cpu = (milli + 999) / 1000) so
partial CPUs are rounded up, and update the loop in MaxCPURequestsRounded to
compute each container's cpu this way and take the max.
test/e2e/performanceprofile/functests/utils/pods/pods.go (1)

89-95: ⚠️ Potential issue | 🟠 Major

Propagate non-NotFound lookup errors before deleting.

If the pre-delete Get fails for RBAC/network/API reasons, this still falls through to Delete, which hides the original failure mode. Return on err != nil right after the IsNotFound check.

Proposed fix
 	err := testclient.DataPlaneClient.Get(ctx, client.ObjectKeyFromObject(pod), pod)
 	if errors.IsNotFound(err) {
 		klog.InfoS("pod already deleted")
 		return nil
 	}
+	if err != nil {
+		return err
+	}
 
 	err = testclient.DataPlaneClient.Delete(ctx, pod)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/performanceprofile/functests/utils/pods/pods.go` around lines 89 -
95, The pre-delete lookup using testclient.DataPlaneClient.Get currently ignores
non-NotFound errors and proceeds to Delete; update the logic in the function
containing the Get/Delete sequence so that after calling
testclient.DataPlaneClient.Get(ctx, client.ObjectKeyFromObject(pod), pod) you
check errors.IsNotFound(err) (return nil) and then if err != nil return that err
(propagate it) before calling testclient.DataPlaneClient.Delete(ctx, pod),
ensuring Get failures (RBAC/network/API) are not masked by Delete.
test/e2e/performanceprofile/functests/11_mixedcpus/mixedcpus.go (1)

632-634: ⚠️ Potential issue | 🟠 Major

Create the test pod through DataPlaneClient.

This is the same client-selection issue raised earlier in this file: workload pods should be created on the data-plane client, not the legacy shared client.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/performanceprofile/functests/11_mixedcpus/mixedcpus.go` around lines
632 - 634, The test creates workload pods using the legacy shared client; change
the creation to use the data-plane client instead: after constructing testPod
with pods.MakePodWithResources, call testclient.DataPlaneClient.Create(ctx,
testPod) (and handle the error with Expect as before) instead of
testclient.Client.Create(ctx, testPod) so workload pods are created on the
data-plane client.
test/e2e/performanceprofile/functests/1_performance/cpu_management.go (1)

1230-1240: ⚠️ Potential issue | 🟠 Major

Install cleanup immediately after Create.

A failed readiness wait leaves the pod behind and can contaminate the next table entry by holding CPUs. Register the cleanup right after the successful create call.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/performanceprofile/functests/1_performance/cpu_management.go` around
lines 1230 - 1240, Move the pod cleanup defer to immediately after the
successful create call so the created pod is always deleted even if
WaitForCondition fails: after Expect(testclient.Client.Create(ctx,
testPod)).To(Succeed(), ...) register the existing defer block (the anonymous
func that checks testPod == nil, logs and calls pods.Delete) and leave the
pods.WaitForCondition(...) call and subsequent err check as-is; keep using the
same testPod variable so the deferred closure deletes the final testPod value
returned/updated by WaitForCondition.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/performanceprofile/functests/1_performance/cpu_management.go`:
- Around line 1228-1230: The pod for workload testing is being created with
testclient.Client but must use the Hypershift-aware data-plane client; change
the creation call that uses testclient.Client.Create(ctx, testPod) to
testclient.DataPlaneClient.Create(ctx, testPod) (keeping the same
Expect(...).To(Succeed()) assertion and error message). Locate the code around
the MakePodWithResources call and replace the Client usage with DataPlaneClient
so the pod lifecycle runs via the data plane client.

In `@test/e2e/performanceprofile/functests/11_mixedcpus/mixedcpus.go`:
- Around line 634-644: The cleanup defer that deletes testPod must be registered
immediately after the successful creation call, not after WaitForCondition; move
the defer block that references testPod (currently after pods.WaitForCondition)
to directly follow Expect(testclient.Client.Create(ctx, testPod)).To(Succeed(),
...), so the pod is always scheduled for deletion even if pods.WaitForCondition
(or pods.WaitForCondition error path) fails; keep or simplify the nil check
inside the defer but ensure testPod variable used by pods.Delete is the same
object returned/created by testclient.Client.Create and that the defer runs
before any early returns or test failures.
- Around line 527-594: The BeforeEach that adjusts the performance profile must
reset the shared flag needsUpdate at the start of each spec to avoid running the
revert path unintentionally; inside the BeforeEach (function BeforeEach) add
needsUpdate = false as the first operation so each table entry starts with a
clean state before checking/updating profile, ensuring subsequent entries that
don't modify profile won't wait for a rollout that never began.

In
`@test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go`:
- Around line 1358-1359: The test creates the data-plane workload pod using
testclient.Client which targets the wrong cluster; replace the create call to
use DataPlaneClient instead (i.e., call DataPlaneClient.Create(ctx, testPod)
where testclient.Client.Create is used) so the pod is created on the data plane,
and keep the subsequent pods.WaitForCondition(...) and testPod references
unchanged to wait for PodReady.

In `@test/e2e/performanceprofile/functests/utils/nodes/nodes.go`:
- Around line 364-375: The loop over siblings in nodes.go currently aborts on
the first malformed core because the check "if len(core) != 2 { return
cpuset.New(), fmt.Errorf(...)" returns immediately; change this to skip that
entry (use continue) so the search keeps scanning other entries (i.e., in the
nested loop over "core" inside "for _, node := range siblings" replace the
immediate return with a continue), and only after both loops complete return the
final error "no two siblings found..." if nothing matched; keep the existing
logic that constructs siblingCPUSet with cpuset.New(core...) and checks
siblingCPUSet.IsSubsetOf(cpuSet) to return a valid pair early.

In `@test/e2e/performanceprofile/functests/utils/resources/resources.go`:
- Around line 10-15: TotalCPUsRounded and MaxCPURequestsRounded currently use
resource.Quantity.Value() which truncates fractional CPUs; change both to
sum/compare using MilliValue() to preserve milli-CPUs and then perform ceiling
division by 1000 (e.g. totalMilli := sum of container.Cpu().MilliValue(); result
:= int((totalMilli + 999) / 1000)). Update TotalCPUsRounded to accumulate
containerResources.Cpu().MilliValue() and return ceil/1000, and update
MaxCPURequestsRounded to compute the max of Cpu().MilliValue() across containers
and return its ceiling/1000.

---

Duplicate comments:
In `@test/e2e/performanceprofile/functests/1_performance/cpu_management.go`:
- Around line 1230-1240: Move the pod cleanup defer to immediately after the
successful create call so the created pod is always deleted even if
WaitForCondition fails: after Expect(testclient.Client.Create(ctx,
testPod)).To(Succeed(), ...) register the existing defer block (the anonymous
func that checks testPod == nil, logs and calls pods.Delete) and leave the
pods.WaitForCondition(...) call and subsequent err check as-is; keep using the
same testPod variable so the deferred closure deletes the final testPod value
returned/updated by WaitForCondition.

In `@test/e2e/performanceprofile/functests/11_mixedcpus/mixedcpus.go`:
- Around line 632-634: The test creates workload pods using the legacy shared
client; change the creation to use the data-plane client instead: after
constructing testPod with pods.MakePodWithResources, call
testclient.DataPlaneClient.Create(ctx, testPod) (and handle the error with
Expect as before) instead of testclient.Client.Create(ctx, testPod) so workload
pods are created on the data-plane client.

In
`@test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go`:
- Around line 1358-1368: The cleanup defer for deleting the test pod must be
registered immediately after a successful Create to avoid leaking resources if
pods.WaitForCondition fails; move the defer that references pods.Delete, testPod
and testlog.Infof to directly follow the Expect(testclient.Client.Create(ctx,
testPod)).To(Succeed()) call (before calling pods.WaitForCondition) and adjust
the deferred closure to handle a nil or non-existent testPod safely while
preserving the existing log and Expect deletion behavior.
- Around line 1370-1410: The test currently uses cpusetCfg.Cpus (via
cpuset.Parse) which may include offline CPUs; change the logic to intersect that
parsed set with the runnable set returned by nodes.GetOnlineCPUsSet(...) to
compute a runnableExclusiveSet, skip the rest of the test when
runnableExclusiveSet.Size() < 2, update firstExclusiveCPU to come from
runnableExclusiveSet.List()[0], and, after collecting execProcessCPUs, assert
(1) every sampled CPU is contained in runnableExclusiveSet and (2) the samples
are not all equal to firstExclusiveCPU; update variable references (cpusetCfg,
cpuset.Parse, firstExclusiveCPU, runnableExclusiveSet, execProcessCPUs)
accordingly.

In `@test/e2e/performanceprofile/functests/utils/pods/pods.go`:
- Around line 89-95: The pre-delete lookup using testclient.DataPlaneClient.Get
currently ignores non-NotFound errors and proceeds to Delete; update the logic
in the function containing the Get/Delete sequence so that after calling
testclient.DataPlaneClient.Get(ctx, client.ObjectKeyFromObject(pod), pod) you
check errors.IsNotFound(err) (return nil) and then if err != nil return that err
(propagate it) before calling testclient.DataPlaneClient.Delete(ctx, pod),
ensuring Get failures (RBAC/network/API) are not masked by Delete.

In `@test/e2e/performanceprofile/functests/utils/resources/resources.go`:
- Around line 18-26: MaxCPURequestsRounded currently truncates milliCPU values
(e.g., 500m -> 0) because it uses Cpu().Value(); change it to use
Cpu().MilliValue(), convert millis to CPUs with ceiling division (cpu = (milli +
999) / 1000) so partial CPUs are rounded up, and update the loop in
MaxCPURequestsRounded to compute each container's cpu this way and take the max.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d7e5230c-c821-48d3-a236-5ad0b9f7bfe3

📥 Commits

Reviewing files that changed from the base of the PR and between fe1870d and 2abfb4c.

📒 Files selected for processing (11)
  • go.mod
  • test/e2e/performanceprofile/functests/11_mixedcpus/mixedcpus.go
  • test/e2e/performanceprofile/functests/1_performance/cpu_management.go
  • test/e2e/performanceprofile/functests/1_performance/irqbalance.go
  • test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go
  • test/e2e/performanceprofile/functests/utils/nodes/nodes.go
  • test/e2e/performanceprofile/functests/utils/nodes/nodes_test.go
  • test/e2e/performanceprofile/functests/utils/outputparser/outputparser.go
  • test/e2e/performanceprofile/functests/utils/pods/pods.go
  • test/e2e/performanceprofile/functests/utils/resources/resources.go
  • test/e2e/performanceprofile/functests/utils/resources/resources_test.go
🚧 Files skipped from review as they are similar to previous changes (4)
  • go.mod
  • test/e2e/performanceprofile/functests/1_performance/irqbalance.go
  • test/e2e/performanceprofile/functests/utils/outputparser/outputparser.go
  • test/e2e/performanceprofile/functests/utils/nodes/nodes_test.go

@ffromani
Contributor

ffromani commented Mar 9, 2026

Coderabbit AI adds even more review burden. Now I need to review the code AND the AI review itself to sift what's worthy and what's noise. There can totally be valuable suggestions from AI - I think I spotted a couple at a glance, but I want to go on record that this INCREASES the maintainer burden, not the other way around

Contributor

@ffromani ffromani left a comment


partial review

@shajmakh shajmakh force-pushed the exec-affinity-pp-e2e branch from 2abfb4c to e9e8dd1 Compare March 16, 2026 08:08

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
test/e2e/performanceprofile/functests/13_llc/llc.go (1)

1420-1432: ⚠️ Potential issue | 🟠 Major

Deletion wait condition is inverted and can fail cleanup

Line 1424 treats len(podList.Items) == 0 as failure. For a deletion wait helper, zero pods is the success case. This can time out after fast deletions and fail teardown.

💡 Proposed fix
 func waitForDeploymentPodsDeletion(ctx context.Context, targetNode *corev1.Node, podLabel map[string]string) {
@@
 	podList := &corev1.PodList{}
 	Eventually(func() bool {
 		if err := testclient.DataPlaneClient.List(ctx, podList, listOptions); err != nil {
 			return false
 		}
-		if len(podList.Items) == 0 {
-			return false
-		}
-		return true
+		return len(podList.Items) == 0
 	}).WithTimeout(time.Minute * 5).WithPolling(time.Second * 30).Should(BeTrue())
-	for _, pod := range podList.Items {
-		testlog.TaggedInfof("cleanup", "Deleting pod %s", pod.Name)
-		err := pods.WaitForDeletion(ctx, testclient.DataPlaneClient, &pod, pods.DefaultDeletionTimeout*time.Second)
-		Expect(err).ToNot(HaveOccurred())
-	}
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/performanceprofile/functests/13_llc/llc.go` around lines 1420 -
1432, The current Eventually check treats an empty podList as failure; invert
that condition so zero pods is considered success: inside the Eventually(func()
bool { ... }) return true when len(podList.Items) == 0 (and return false
otherwise), so the waiter completes immediately when no pods remain; keep using
testclient.DataPlaneClient and podList as before and then only iterate pods when
items exist before calling pods.WaitForDeletion for each pod.
♻️ Duplicate comments (1)
test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go (1)

1371-1410: ⚠️ Potential issue | 🟠 Major

Validate against runnable exclusive CPUs, not just “not always first”

Line 1376 picks the first CPU from cpusetCfg.Cpus without filtering offline CPUs, and Line 1409 only checks diversity. This can fail when only one exclusive CPU is online and can miss out-of-set CPU placements.

💡 Proposed fix
 cpusetCfg := &controller.CpuSet{}
 Expect(getter.Container(ctx, testPod, testPod.Spec.Containers[0].Name, cpusetCfg)).To(Succeed(), "Failed to get cpuset config for test pod")
 cpusList, err := cpuset.Parse(cpusetCfg.Cpus)
 Expect(err).ToNot(HaveOccurred())
-Expect(cpusList.Size()).ToNot(BeZero())
-firstExclusiveCPU := cpusList.List()[0]
-testlog.Infof("first exclusive CPU: %d, all exclusive CPUs: %s", firstExclusiveCPU, cpusList.String())
+onlineCPUs, err := nodes.GetOnlineCPUsSet(ctx, &workerRTNodes[0])
+Expect(err).ToNot(HaveOccurred())
+runnableCPUs := cpusList.Intersection(onlineCPUs)
+if runnableCPUs.Size() < 2 {
+	Skip(fmt.Sprintf("need at least two online exclusive CPUs, got %s (container=%s online=%s)", runnableCPUs.String(), cpusList.String(), onlineCPUs.String()))
+}
+firstExclusiveCPU := runnableCPUs.List()[0]
+testlog.Infof("first exclusive CPU: %d, runnable exclusive CPUs: %s", firstExclusiveCPU, runnableCPUs.String())
@@
 Expect(g.Wait()).ToNot(HaveOccurred(), "One or more exec commands failed")
+for _, cpu := range execProcessCPUs {
+	Expect(runnableCPUs.Contains(cpu)).To(BeTrue(), "exec process CPU %d is outside runnable exclusive CPUs %s", cpu, runnableCPUs.String())
+}
 Expect(execProcessCPUs).ToNot(HaveEach(firstExclusiveCPU),
 	"exec process was always pinned to the first exclusive CPU %d across all %d retries", firstExclusiveCPU, retries)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go`
around lines 1371 - 1410, The test currently picks firstExclusiveCPU from
cpusetCfg.Cpus without filtering offline CPUs and only asserts diversity; update
the logic to (1) derive the set of runnable exclusive CPUs by intersecting
cpusList (from cpuset.Parse on cpusetCfg.Cpus) with the system's online CPUs,
fail the test if that intersection is empty, (2) select firstExclusiveCPU from
that runnableExclusive set (not raw cpusList), (3) when checking execProcessCPUs
ensure two things: if runnableExclusive has more than one CPU assert that not
all execProcessCPUs equal firstExclusiveCPU, and always assert that each
observed cpu in execProcessCPUs is contained in the original cpusList (to detect
out-of-set placements). Use symbols cpusetCfg.Cpus, cpusList, runnableExclusive
(new), firstExclusiveCPU, and execProcessCPUs to locate and update the code.
🧹 Nitpick comments (1)
test/e2e/performanceprofile/functests/1_performance/cpu_management.go (1)

1291-1291: Consider using the errgroup context for cancellation on first error.

The context returned by errgroup.WithContext is cancelled when any goroutine returns an error, allowing other goroutines to exit early. Currently it's discarded with _.

However, since pods.ExecCommandOnPod likely doesn't accept a context parameter, this may be intentional. If the exec function doesn't support context cancellation, discarding is acceptable.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/performanceprofile/functests/1_performance/cpu_management.go` at
line 1291, The errgroup.WithContext call currently discards the derived context
(g, _ := errgroup.WithContext(ctx)); keep and use the returned context (e.g.,
egCtx, g := errgroup.WithContext(ctx)) so goroutines started on g can observe
cancellation when any goroutine fails; update goroutine bodies that call
pods.ExecCommandOnPod to either accept the new egCtx (if pods.ExecCommandOnPod
supports a context) or wrap the exec calls to check egCtx.Done() and return
early—ensure the errgroup variable (g) and the derived context name (egCtx) are
used consistently to cancel remaining work on first error.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/performanceprofile/functests/1_performance/irqbalance.go`:
- Around line 343-346: The cleanup defer for deleting the created pod is
registered after the assertion Expect(err).ToNot(HaveOccurred()), so if
createPodWithHouskeeping(...) creates a pod and then returns an error the
assertion will abort before cleanup is registered and the pod will leak; fix by
registering the defer immediately after the call to
createPodWithHouskeeping(...) (or after confirming the returned testpod is
non-nil) but before any failing assertions, using the same
pods.DeleteAndSync(context.TODO(), testclient.DataPlaneClient, testpod) call so
cleanup always runs even if subsequent Expect(...) fails.

In
`@test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go`:
- Around line 1273-1276: The DeferCleanup closure captures loop-scoped variables
by reference causing a panic and wrong pod deletion; fix by copying the current
loop values into new locals before registering DeferCleanup (e.g., capture
nodeName := workerRTNodes[i].Name and podCopy := testpod) and use those locals
inside the closure passed to DeferCleanup so pods.DeleteAndSync(context.TODO(),
testclient.DataPlaneClient, podCopy) references the correct pod and nodeName for
the log.

In `@test/e2e/performanceprofile/functests/utils/pods/pods.go`:
- Around line 115-120: The WaitForDeletion polling currently swallows all
non-NotFound errors from cli.Get; update the closure in WaitForDeletion to check
the error returned by cli.Get and return that error (stop polling) when err !=
nil and !errors.IsNotFound(err), only treating errors.IsNotFound(err) as success
(return true, nil); reference the WaitForDeletion function and the cli.Get(...)
call to locate where to add the explicit error return so transient
RBAC/API/network failures surface immediately instead of being retried until
timeout.
- Around line 99-101: DeleteAndSync dereferences pod immediately which can panic
when callers pass nil during cleanup; add a nil guard at the top of
DeleteAndSync (check pod == nil) and handle it as a safe no-op by logging a
debug/warn and returning nil instead of proceeding; move or protect the existing
klog.InfoS("delete pod...","namespace", pod.Namespace, "name", pod.Name) so it
only runs after the nil check, and ensure other uses of pod within DeleteAndSync
are not executed when pod is nil.

---

Outside diff comments:
In `@test/e2e/performanceprofile/functests/13_llc/llc.go`:
- Around line 1420-1432: The current Eventually check treats an empty podList as
failure; invert that condition so zero pods is considered success: inside the
Eventually(func() bool { ... }) return true when len(podList.Items) == 0 (and
return false otherwise), so the waiter completes immediately when no pods
remain; keep using testclient.DataPlaneClient and podList as before and then
only iterate pods when items exist before calling pods.WaitForDeletion for each
pod.

---

Duplicate comments:
In
`@test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go`:
- Around line 1371-1410: The test currently picks firstExclusiveCPU from
cpusetCfg.Cpus without filtering offline CPUs and only asserts diversity; update
the logic to (1) derive the set of runnable exclusive CPUs by intersecting
cpusList (from cpuset.Parse on cpusetCfg.Cpus) with the system's online CPUs,
fail the test if that intersection is empty, (2) select firstExclusiveCPU from
that runnableExclusive set (not raw cpusList), (3) when checking execProcessCPUs
ensure two things: if runnableExclusive has more than one CPU assert that not
all execProcessCPUs equal firstExclusiveCPU, and always assert that each
observed cpu in execProcessCPUs is contained in the original cpusList (to detect
out-of-set placements). Use symbols cpusetCfg.Cpus, cpusList, runnableExclusive
(new), firstExclusiveCPU, and execProcessCPUs to locate and update the code.

---

Nitpick comments:
In `@test/e2e/performanceprofile/functests/1_performance/cpu_management.go`:
- Line 1291: The errgroup.WithContext call currently discards the derived
context (g, _ := errgroup.WithContext(ctx)); keep and use the returned context
(e.g., egCtx, g := errgroup.WithContext(ctx)) so goroutines started on g can
observe cancellation when any goroutine fails; update goroutine bodies that call
pods.ExecCommandOnPod to either accept the new egCtx (if pods.ExecCommandOnPod
supports a context) or wrap the exec calls to check egCtx.Done() and return
early—ensure the errgroup variable (g) and the derived context name (egCtx) are
used consistently to cancel remaining work on first error.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0e24cbda-4baf-447d-9236-98f40390a22e

📥 Commits

Reviewing files that changed from the base of the PR and between 2abfb4c and e9e8dd1.

📒 Files selected for processing (16)
  • go.mod
  • test/e2e/performanceprofile/functests/11_mixedcpus/mixedcpus.go
  • test/e2e/performanceprofile/functests/13_llc/llc.go
  • test/e2e/performanceprofile/functests/1_performance/cpu_management.go
  • test/e2e/performanceprofile/functests/1_performance/hugepages.go
  • test/e2e/performanceprofile/functests/1_performance/irqbalance.go
  • test/e2e/performanceprofile/functests/2_performance_update/memorymanager.go
  • test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go
  • test/e2e/performanceprofile/functests/4_latency/latency.go
  • test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go
  • test/e2e/performanceprofile/functests/8_performance_workloadhints/workloadhints.go
  • test/e2e/performanceprofile/functests/utils/nodes/nodes.go
  • test/e2e/performanceprofile/functests/utils/nodes/nodes_test.go
  • test/e2e/performanceprofile/functests/utils/pods/pods.go
  • test/e2e/performanceprofile/functests/utils/resources/resources.go
  • test/e2e/performanceprofile/functests/utils/resources/resources_test.go
✅ Files skipped from review due to trivial changes (1)
  • test/e2e/performanceprofile/functests/2_performance_update/memorymanager.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/e2e/performanceprofile/functests/utils/nodes/nodes_test.go

Comment on lines 1273 to 1276
DeferCleanup(func() {
By(fmt.Sprintf("deleting the test pod from node %s", workerRTNodes[i].Name))
Expect(testclient.DataPlaneClient.Delete(context.TODO(), testpod)).ToNot(HaveOccurred())
Expect(pods.WaitForDeletion(context.TODO(), testpod, pods.DefaultDeletionTimeout*time.Second)).ToNot(HaveOccurred())
Expect(pods.DeleteAndSync(context.TODO(), testclient.DataPlaneClient, testpod)).To(Succeed())
})

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

cat -n test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go | sed -n '1250,1285p'

Repository: openshift/cluster-node-tuning-operator

Length of output: 2554


Capture loop values before registering DeferCleanup to prevent cleanup from referencing wrong pod and causing index-out-of-range panic

Lines 1273-1276 register a cleanup closure that captures two variables by reference:

  1. Loop index i: By cleanup time, the loop has completed and i is out of range, causing workerRTNodes[i].Name to panic.
  2. testpod variable: Reassigned at line 1277 after cleanup registration. Since Go for loops share scope across iterations, the closure captures the variable reference, not the value, so cleanup will attempt to delete the pod from the last loop iteration rather than the current one.

Both must be captured before DeferCleanup:

💡 Proposed fix
 for i := 0; i < len(workerRTNodes); i++ {
+    nodeName := workerRTNodes[i].Name
     By("Determining the default container runtime used in the node")
     tunedPod, err := tuned.GetPod(context.TODO(), &workerRTNodes[i])
     Expect(err).ToNot(HaveOccurred())
     expectedRuntime, err = runtime.GetContainerRuntimeTypeFor(context.TODO(), testclient.DataPlaneClient, tunedPod)
     Expect(err).ToNot(HaveOccurred())
     testlog.Infof("Container runtime used for the node: %s", expectedRuntime)
     By("verifying pod using high-performance runtime class handled by the default container runtime aswell")
     Expect(err).ToNot(HaveOccurred())
     testpod := testpodTemplate.DeepCopy()
-    testpod.Spec.NodeName = workerRTNodes[i].Name
-    testpod.Spec.NodeSelector = map[string]string{testutils.LabelHostname: workerRTNodes[i].Name}
-    By(fmt.Sprintf("creating a test pod using high-performance runtime class on node %s", workerRTNodes[i].Name))
+    testpod.Spec.NodeName = nodeName
+    testpod.Spec.NodeSelector = map[string]string{testutils.LabelHostname: nodeName}
+    By(fmt.Sprintf("creating a test pod using high-performance runtime class on node %s", nodeName))
     Expect(testclient.DataPlaneClient.Create(context.TODO(), testpod)).ToNot(HaveOccurred())
+    podToDelete := testpod
     DeferCleanup(func() {
-        By(fmt.Sprintf("deleting the test pod from node %s", workerRTNodes[i].Name))
-        Expect(pods.DeleteAndSync(context.TODO(), testclient.DataPlaneClient, testpod)).To(Succeed())
+        By(fmt.Sprintf("deleting the test pod from node %s", nodeName))
+        Expect(pods.DeleteAndSync(context.TODO(), testclient.DataPlaneClient, podToDelete)).To(Succeed())
     })
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go`
around lines 1273 - 1276, The DeferCleanup closure captures loop-scoped
variables by reference causing a panic and wrong pod deletion; fix by copying
the current loop values into new locals before registering DeferCleanup (e.g.,
capture nodeName := workerRTNodes[i].Name and podCopy := testpod) and use those
locals inside the closure passed to DeferCleanup so
pods.DeleteAndSync(context.TODO(), testclient.DataPlaneClient, podCopy)
references the correct pod and nodeName for the log.

@shajmakh shajmakh force-pushed the exec-affinity-pp-e2e branch from e9e8dd1 to a6f7bb8 Compare March 16, 2026 12:42

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
test/e2e/performanceprofile/functests/1_performance/cpu_management.go (1)

157-163: ⚠️ Potential issue | 🟠 Major

CPU allocatable assertion is still incorrect when shared CPUs are configured

Line 162 explicitly notes the gap, but the assertion still enforces `capacity - allocatable == reserved`, which can fail for valid shared-CPU profiles and make this test flaky across environments. Please gate or adjust this expectation before release.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/performanceprofile/functests/1_performance/cpu_management.go` around
lines 157 - 163, The assertion currently enforces capacityCPU - allocatableCPU
== int64(len(listReservedCPU)) which fails when shared CPUs exist; modify the
check in the cpu_management.go test (around workerRTNode, listReservedCPU,
differenceCPUGot, differenceCPUExpected) to account for shared-CPU profiles by
either (A) gating the strict equality: if a shared-CPU configuration is detected
(e.g., inspect the node's topology/CPU manager state or a helper flag indicating
shared CPUs) then assert differenceCPUGot <= differenceCPUExpected (or skip the
equality check), or (B) replace the Expect(...).To(Equal(...)) with
Expect(differenceCPUGot).To(BeNumerically("<=", differenceCPUExpected)) and add
a clear By() note explaining the relaxed check for shared CPUs so the test is
not flaky across environments.
♻️ Duplicate comments (6)
test/e2e/performanceprofile/functests/utils/pods/pods.go (2)

123-129: ⚠️ Potential issue | 🟠 Major

WaitForDeletion swallows non-NotFound errors.

At Line 125, only IsNotFound is handled; other cli.Get failures are retried as false, nil until timeout, which hides the real failure mode.

As per coding guidelines "**: Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."

Proposed fix
 func WaitForDeletion(ctx context.Context, cli client.Client, pod *corev1.Pod, timeout time.Duration) error {
 	return wait.PollUntilContextTimeout(ctx, time.Second, timeout, true, func(ctx context.Context) (bool, error) {
-		if err := cli.Get(ctx, client.ObjectKeyFromObject(pod), pod); errors.IsNotFound(err) {
+		err := cli.Get(ctx, client.ObjectKeyFromObject(pod), pod)
+		if errors.IsNotFound(err) {
 			return true, nil
 		}
-		return false, nil
+		if err != nil {
+			return false, err
+		}
+		return false, nil
 	})
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/performanceprofile/functests/utils/pods/pods.go` around lines 123 -
129, The WaitForDeletion function currently swallows all non-NotFound errors
from cli.Get; update the polling closure (used by wait.PollUntilContextTimeout)
to propagate real errors by returning (false, err) when cli.Get returns an error
that is not errors.IsNotFound(err), so that these failures break the poll and
surface the underlying error instead of silently retrying; keep the existing
behavior of returning (true, nil) when errors.IsNotFound(err) to signal
successful deletion.

99-103: ⚠️ Potential issue | 🟠 Major

Nil check is currently after dereference in DeleteAndSync.

At Line 100, pod.Namespace and pod.Name are accessed before the nil guard on Line 101, so teardown can panic when cleanup is invoked with nil.

As per coding guidelines "**: Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."

Proposed fix
 func DeleteAndSync(ctx context.Context, cli client.Client, pod *corev1.Pod) error {
-	klog.InfoS("delete pod and waiting for deletion", "namespace", pod.Namespace, "name", pod.Name)
 	if pod == nil {
 		klog.InfoS("pod is nil, skipping deletion")
 		return nil
 	}
 
 	if cli == nil {
 		return fmt.Errorf("client is nil")
 	}
+	klog.InfoS("delete pod and waiting for deletion", "namespace", pod.Namespace, "name", pod.Name)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/performanceprofile/functests/utils/pods/pods.go` around lines 99 -
103, The function DeleteAndSync dereferences pod (accessing pod.Namespace and
pod.Name) before checking for nil which can panic; move the nil guard to the top
of DeleteAndSync so you return early if pod == nil, then perform the klog.InfoS
call using pod.Namespace and pod.Name only after the nil check; update any
related logging or downstream logic in DeleteAndSync to assume pod is non-nil
once past the guard.
test/e2e/performanceprofile/functests/1_performance/irqbalance.go (1)

340-346: ⚠️ Potential issue | 🟠 Major

Register cleanup before the failing assertion.

Line 341 can fail before the defer on Line 343 is registered. If pod creation partially succeeds and then errors, cleanup is skipped and the pod can leak.

As per coding guidelines "**: Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."

Proposed fix
 testpod, err := createPodWithHouskeeping(numOfContainersInPod, cpusPerContainer, profile, context.TODO(), targetNode)
-Expect(err).ToNot(HaveOccurred(), "Failed to create pod with housekeeping annotation")
-
 defer func() {
-	testlog.Infof("deleting pod %q", testpod.Name)
-	Expect(pods.DeleteAndSync(context.TODO(), testclient.DataPlaneClient, testpod)).To(Succeed())
+	if testpod != nil {
+		testlog.Infof("deleting pod %q", testpod.Name)
+		Expect(pods.DeleteAndSync(context.TODO(), testclient.DataPlaneClient, testpod)).To(Succeed())
+	}
 }()
+Expect(err).ToNot(HaveOccurred(), "Failed to create pod with housekeeping annotation")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/performanceprofile/functests/1_performance/irqbalance.go` around
lines 340 - 346, createPodWithHouskeeping may return a non-nil testpod and an
error, so register cleanup immediately after the call to avoid leaking pods;
move the defer that calls pods.DeleteAndSync/testclient.DataPlaneClient so it
runs before the Expect(err).ToNot(HaveOccurred()) assertion, but guard it by
checking testpod != nil (or capture a local copy of testpod inside the defer) so
you only attempt deletion when a pod was actually returned; reference
createPodWithHouskeeping, testpod, pods.DeleteAndSync and
testclient.DataPlaneClient when making the change.
test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go (2)

1273-1276: ⚠️ Potential issue | 🟠 Major

Fix deferred cleanup closure capturing loop index.

DeferCleanup closes over i; by cleanup time, i may be out of range or point to a different node/pod. Capture stable locals before registering cleanup.

Proposed fix
-				for i := 0; i < len(workerRTNodes); i++ {
+				for i := 0; i < len(workerRTNodes); i++ {
+					nodeName := workerRTNodes[i].Name
 					By("Determining the default container runtime used in the node")
 					tunedPod, err := tuned.GetPod(context.TODO(), &workerRTNodes[i])
@@
-					testpod := testpodTemplate.DeepCopy()
-					testpod.Spec.NodeName = workerRTNodes[i].Name
-					testpod.Spec.NodeSelector = map[string]string{testutils.LabelHostname: workerRTNodes[i].Name}
-					By(fmt.Sprintf("creating a test pod using high-performance runtime class on node %s", workerRTNodes[i].Name))
+					testpod := testpodTemplate.DeepCopy()
+					testpod.Spec.NodeName = nodeName
+					testpod.Spec.NodeSelector = map[string]string{testutils.LabelHostname: nodeName}
+					By(fmt.Sprintf("creating a test pod using high-performance runtime class on node %s", nodeName))
 					Expect(testclient.DataPlaneClient.Create(context.TODO(), testpod)).ToNot(HaveOccurred())
+					podToDelete := testpod
 					DeferCleanup(func() {
-						By(fmt.Sprintf("deleting the test pod from node %s", workerRTNodes[i].Name))
-						Expect(pods.DeleteAndSync(context.TODO(), testclient.DataPlaneClient, testpod)).To(Succeed())
+						By(fmt.Sprintf("deleting the test pod from node %s", nodeName))
+						Expect(pods.DeleteAndSync(context.TODO(), testclient.DataPlaneClient, podToDelete)).To(Succeed())
 					})

As per coding guidelines: "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go`
around lines 1273 - 1276, DeferCleanup's closure captures the loop index i so by
the time the deferred function runs it may reference the wrong node/pod; fix by
capturing stable locals before registering the cleanup (e.g., create local
copies like node := workerRTNodes[i] and podToDelete := testpod) and have the
closure call pods.DeleteAndSync(context.TODO(), testclient.DataPlaneClient,
podToDelete) and reference node.Name for the log; this ensures the deferred
closure uses the intended node and pod rather than a mutated loop variable.

1391-1433: ⚠️ Potential issue | 🟠 Major

Validate against runnable exclusive CPUs, not raw cpuset.

This assertion can be unsatisfiable when exclusive cpuset includes offline CPUs and only one runnable exclusive CPU remains. Intersect with online CPUs, skip when <2, and validate samples against that runnable set.

Proposed fix
-				exclusiveCPUs := cpusList.Difference(reservedCPUs).Difference(sharedCPUs)
-				Expect(exclusiveCPUs.Size()).ToNot(BeZero())
-				firstExclusiveCPU := exclusiveCPUs.List()[0]
-				testlog.Infof("first exclusive CPU: %d, all exclusive CPUs: %s", firstExclusiveCPU, exclusiveCPUs.String())
+				exclusiveCPUs := cpusList.Difference(reservedCPUs).Difference(sharedCPUs)
+				Expect(exclusiveCPUs.Size()).ToNot(BeZero())
+				onlineCPUs, err := nodes.GetOnlineCPUsSet(ctx, &workerRTNodes[0])
+				Expect(err).ToNot(HaveOccurred())
+				runnableExclusiveCPUs := exclusiveCPUs.Intersection(onlineCPUs)
+				if runnableExclusiveCPUs.Size() < 2 {
+					Skip(fmt.Sprintf("need at least 2 runnable exclusive CPUs, got %s (exclusive=%s online=%s)",
+						runnableExclusiveCPUs.String(), exclusiveCPUs.String(), onlineCPUs.String()))
+				}
+				firstExclusiveCPU := runnableExclusiveCPUs.List()[0]
+				testlog.Infof("first runnable exclusive CPU: %d, runnable exclusive CPUs: %s", firstExclusiveCPU, runnableExclusiveCPUs.String())
@@
-						if !exclusiveCPUs.Contains(cpu) {
+						if !runnableExclusiveCPUs.Contains(cpu) {
 							return fmt.Errorf("entry %d: exec process was pinned to a reserved or shared CPU %d", idx, cpu)
 						}
@@
-				Expect(execProcessCPUs).ToNot(HaveEach(firstExclusiveCPU),
+				Expect(execProcessCPUs).ToNot(HaveEach(firstExclusiveCPU),
 					"exec process was always pinned to the first exclusive CPU %d across all %d retries", firstExclusiveCPU, retries)

As per coding guidelines: "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go`
around lines 1391 - 1433, The test currently validates samples against
exclusiveCPUs (derived from
cpusList.Difference(reservedCPUs).Difference(sharedCPUs)) which can include
offline CPUs; change the logic to compute runnableExclusiveCPUs by intersecting
exclusiveCPUs with the set of online CPUs (e.g., get online set then
runnableExclusive := exclusiveCPUs.Intersection(onlineCPUs)), skip the remainder
of the check when runnableExclusiveCPUs.Size() < 2, and use
runnableExclusiveCPUs (not exclusiveCPUs) for membership checks in the exec
goroutine and the final Expect on execProcessCPUs vs firstExclusiveCPU; update
references to exclusiveCPUs, firstExclusiveCPU, execProcessCPUs and the final
Expect to use runnableExclusiveCPUs so the assertion is satisfiable.
test/e2e/performanceprofile/functests/1_performance/cpu_management.go (1)

1015-1016: ⚠️ Potential issue | 🟠 Major

Pod lifecycle client mismatch can leak test pods

Cleanup now uses testclient.DataPlaneClient, but pod creation in this same test still uses testclient.Client (Line 1008 and Line 1041). In split control-plane/data-plane environments, this can leave pods orphaned or fail cleanup.

Suggested fix
- err := testclient.Client.Create(ctx, guaranteedPod)
+ err := testclient.DataPlaneClient.Create(ctx, guaranteedPod)
@@
- err = testclient.Client.Create(ctx, bestEffortPod)
+ err = testclient.DataPlaneClient.Create(ctx, bestEffortPod)

Also applies to: 1046-1047

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/performanceprofile/functests/1_performance/cpu_management.go` around
lines 1015 - 1016, The cleanup is calling pods.DeleteAndSync(ctx,
testclient.DataPlaneClient, guaranteedPod) while pod creation in this test uses
testclient.Client, which can orphan pods in split control-plane/data-plane
setups; update the test so pod creation and deletion use the same client: either
change the DeleteAndSync calls (and other cleanup calls around guaranteedPod) to
use testclient.Client, or change the pod creation calls to use
testclient.DataPlaneClient—ensure all references to pods.DeleteAndSync,
guaranteedPod creation, and any subsequent cleanup (the other occurrence at the
same block around lines noted) consistently use the same client object.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/performanceprofile/functests/1_performance/cpu_management.go`:
- Around line 1316-1318: The code currently only warns when len(samples) <
samplesPerRun (using testlog.Warningf), which lets undersampled runs pass;
change that to fail the test immediately by replacing the warning with a fatal
failure — e.g. call testlog.Fatalf (or the test runner's equivalent) with the
same message and parameters ("entry %d: got %d/%d samples (process may have
exited early)", idx, len(samples), samplesPerRun) so any undersampled run causes
the test to stop and report failure.

---

Outside diff comments:
In `@test/e2e/performanceprofile/functests/1_performance/cpu_management.go`:
- Around line 157-163: The assertion currently enforces capacityCPU -
allocatableCPU == int64(len(listReservedCPU)) which fails when shared CPUs
exist; modify the check in the cpu_management.go test (around workerRTNode,
listReservedCPU, differenceCPUGot, differenceCPUExpected) to account for
shared-CPU profiles by either (A) gating the strict equality: if a shared-CPU
configuration is detected (e.g., inspect the node's topology/CPU manager state
or a helper flag indicating shared CPUs) then assert differenceCPUGot <=
differenceCPUExpected (or skip the equality check), or (B) replace the
Expect(...).To(Equal(...)) with Expect(differenceCPUGot).To(BeNumerically("<=",
differenceCPUExpected)) and add a clear By() note explaining the relaxed check
for shared CPUs so the test is not flaky across environments.

---

Duplicate comments:
In `@test/e2e/performanceprofile/functests/utils/pods/pods.go`:
- Around line 123-129: The WaitForDeletion function currently swallows all
non-NotFound errors from cli.Get; update the polling closure (used by
wait.PollUntilContextTimeout) to propagate real errors by returning (false, err)
when cli.Get returns an error that is not errors.IsNotFound(err), so that these
failures break the poll and surface the underlying error instead of silently
retrying; keep the existing behavior of returning (true, nil) when
errors.IsNotFound(err) to signal successful deletion.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 135c6ac5-fd02-4fe8-b6a0-bede26a89279

📥 Commits

Reviewing files that changed from the base of the PR and between e9e8dd1 and a6f7bb8.

📒 Files selected for processing (16)
  • go.mod
  • test/e2e/performanceprofile/functests/11_mixedcpus/mixedcpus.go
  • test/e2e/performanceprofile/functests/13_llc/llc.go
  • test/e2e/performanceprofile/functests/1_performance/cpu_management.go
  • test/e2e/performanceprofile/functests/1_performance/hugepages.go
  • test/e2e/performanceprofile/functests/1_performance/irqbalance.go
  • test/e2e/performanceprofile/functests/2_performance_update/memorymanager.go
  • test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go
  • test/e2e/performanceprofile/functests/4_latency/latency.go
  • test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go
  • test/e2e/performanceprofile/functests/8_performance_workloadhints/workloadhints.go
  • test/e2e/performanceprofile/functests/utils/nodes/nodes.go
  • test/e2e/performanceprofile/functests/utils/nodes/nodes_test.go
  • test/e2e/performanceprofile/functests/utils/pods/pods.go
  • test/e2e/performanceprofile/functests/utils/resources/resources.go
  • test/e2e/performanceprofile/functests/utils/resources/resources_test.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • test/e2e/performanceprofile/functests/2_performance_update/memorymanager.go
  • test/e2e/performanceprofile/functests/utils/resources/resources_test.go
  • go.mod

@shajmakh shajmakh force-pushed the exec-affinity-pp-e2e branch from a6f7bb8 to b76ea67 Compare March 16, 2026 12:55

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (4)
test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go (2)

1273-1275: ⚠️ Potential issue | 🟠 Major

Capture loop values before DeferCleanup to avoid wrong cleanup target/panic.

The cleanup closure uses workerRTNodes[i].Name from the loop variable. By cleanup time, i may be out of range, and cleanup can panic or reference the wrong node/pod.

Proposed fix
 				for i := 0; i < len(workerRTNodes); i++ {
+					nodeName := workerRTNodes[i].Name
 					By("Determining the default container runtime used in the node")
 					tunedPod, err := tuned.GetPod(context.TODO(), &workerRTNodes[i])
@@
 					testpod := testpodTemplate.DeepCopy()
-					testpod.Spec.NodeName = workerRTNodes[i].Name
-					testpod.Spec.NodeSelector = map[string]string{testutils.LabelHostname: workerRTNodes[i].Name}
-					By(fmt.Sprintf("creating a test pod using high-performance runtime class on node %s", workerRTNodes[i].Name))
+					testpod.Spec.NodeName = nodeName
+					testpod.Spec.NodeSelector = map[string]string{testutils.LabelHostname: nodeName}
+					By(fmt.Sprintf("creating a test pod using high-performance runtime class on node %s", nodeName))
 					Expect(testclient.DataPlaneClient.Create(context.TODO(), testpod)).ToNot(HaveOccurred())
+					podToDelete := testpod
 					DeferCleanup(func() {
-						By(fmt.Sprintf("deleting the test pod from node %s", workerRTNodes[i].Name))
-						Expect(pods.DeleteAndSync(context.TODO(), testclient.DataPlaneClient, testpod)).To(Succeed())
+						By(fmt.Sprintf("deleting the test pod from node %s", nodeName))
+						Expect(pods.DeleteAndSync(context.TODO(), testclient.DataPlaneClient, podToDelete)).To(Succeed())
 					})

As per coding guidelines "**: Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go`
around lines 1273 - 1275, The DeferCleanup closure captures the loop variable i
which may change by cleanup time; capture the needed values (e.g., nodeName :=
workerRTNodes[i].Name and podRef := testpod) immediately before calling
DeferCleanup and use those local copies inside the closure so DeferCleanup calls
pods.DeleteAndSync(context.TODO(), testclient.DataPlaneClient, podRef) and logs
nodeName instead of referencing workerRTNodes[i].Name or i.

1371-1376: ⚠️ Potential issue | 🟠 Major

Derive expected affinity from runnable exclusive CPUs (online intersection).

The test chooses firstExclusiveCPU from raw cpuset data. In CI, cpusets can include offline CPUs; this makes the assertion flaky or unsatisfiable. Intersect with node online CPUs first, then assert against that runnable set.

Proposed fix
 				cpusetCfg := &controller.CpuSet{}
 				Expect(getter.Container(ctx, testPod, testPod.Spec.Containers[0].Name, cpusetCfg)).To(Succeed(), "Failed to get cpuset config for test pod")
 				cpusList, err := cpuset.Parse(cpusetCfg.Cpus)
 				Expect(err).ToNot(HaveOccurred())
 				Expect(cpusList.Size()).ToNot(BeZero())
@@
 				exclusiveCPUs := cpusList.Difference(reservedCPUs).Difference(sharedCPUs)
-				Expect(exclusiveCPUs.Size()).ToNot(BeZero())
-				firstExclusiveCPU := exclusiveCPUs.List()[0]
-				testlog.Infof("first exclusive CPU: %d, all exclusive CPUs: %s", firstExclusiveCPU, exclusiveCPUs.String())
+				onlineCPUs, err := nodes.GetOnlineCPUsSet(ctx, &workerRTNodes[0])
+				Expect(err).ToNot(HaveOccurred())
+				runnableExclusiveCPUs := exclusiveCPUs.Intersection(onlineCPUs)
+				if runnableExclusiveCPUs.Size() < 2 {
+					Skip(fmt.Sprintf("need at least two online exclusive CPUs, got %s (exclusive=%s, online=%s)",
+						runnableExclusiveCPUs.String(), exclusiveCPUs.String(), onlineCPUs.String()))
+				}
+				firstExclusiveCPU := runnableExclusiveCPUs.List()[0]
+				testlog.Infof("first exclusive CPU: %d, runnable exclusive CPUs: %s", firstExclusiveCPU, runnableExclusiveCPUs.String())
@@
-						if !exclusiveCPUs.Contains(cpu) {
+						if !runnableExclusiveCPUs.Contains(cpu) {
 							return fmt.Errorf("entry %d: exec process was pinned to a reserved or shared CPU %d", idx, cpu)
 						}

As per coding guidelines "**: Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."

Also applies to: 1391-1394, 1432-1433

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go`
around lines 1371 - 1376, The test currently derives expected affinity from raw
cpuset data (cpusetCfg -> cpusList) which may include offline CPUs; change the
logic that picks firstExclusiveCPU and makes assertions to first compute the
runnable/exclusive set by intersecting cpusList with the node's online CPUs (use
the node online CPU set before any further filtering), then derive
firstExclusiveCPU and assert against that runnable intersection; update all
occurrences using cpusetCfg/cpusList/firstExclusiveCPU (including the blocks
around cpusetCfg, cpusList parsing and the assertions at 1391-1394 and
1432-1433) to use the online-intersected cpuset instead.
test/e2e/performanceprofile/functests/utils/pods/pods.go (1)

123-129: ⚠️ Potential issue | 🟠 Major

Propagate non-NotFound errors in WaitForDeletion.

cli.Get failures (RBAC/API/network) are currently swallowed and retried until timeout. This hides root cause and makes teardown flaky.

Proposed fix
 func WaitForDeletion(ctx context.Context, cli client.Client, pod *corev1.Pod, timeout time.Duration) error {
 	return wait.PollUntilContextTimeout(ctx, time.Second, timeout, true, func(ctx context.Context) (bool, error) {
-		if err := cli.Get(ctx, client.ObjectKeyFromObject(pod), pod); errors.IsNotFound(err) {
+		err := cli.Get(ctx, client.ObjectKeyFromObject(pod), pod)
+		if err == nil {
+			return false, nil
+		}
+		if errors.IsNotFound(err) {
 			return true, nil
 		}
-		return false, nil
+		return false, err
 	})
 }

As per coding guidelines "**: Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/performanceprofile/functests/utils/pods/pods.go` around lines 123 -
129, The WaitForDeletion function swallows all cli.Get errors and only treats
NotFound as success; propagate non-NotFound errors immediately so
RBAC/API/network failures aren't retried and hidden. In WaitForDeletion (which
uses client.Client cli and wait.PollUntilContextTimeout), check the error
returned by cli.Get: if errors.IsNotFound(err) return true,nil; if err != nil
return false, err to surface the failure; only return false,nil when there is no
error and the object still exists. This ensures callers see real failures
instead of timeouts.
test/e2e/performanceprofile/functests/1_performance/cpu_management.go (1)

1316-1318: ⚠️ Potential issue | 🟠 Major

Fail on undersampled runs to prevent false positives

If sampling returns fewer than samplesPerRun, the test only warns and still proceeds. This can pass affinity checks with incomplete evidence.

Suggested change
-								if len(samples) < samplesPerRun {
-									testlog.Warningf("entry %d: got %d/%d samples (process may have exited early)", idx, len(samples), samplesPerRun)
-								}
+								if len(samples) < samplesPerRun {
+									return fmt.Errorf("entry %d: got %d/%d samples (process may have exited early)", idx, len(samples), samplesPerRun)
+								}

As per coding guidelines, "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/performanceprofile/functests/1_performance/cpu_management.go` around
lines 1316 - 1318, The test currently only warns when a run is undersampled
(len(samples) < samplesPerRun) which can produce false positives; update the
check around samples, samplesPerRun (and idx) to fail the test instead of
continuing—use the test harness failure path (e.g., testlog.Fatalf or the
appropriate test failure function used elsewhere in this file) to abort the test
with a clear message including idx, len(samples) and samplesPerRun so
undersampled runs do not proceed to affinity checks.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/e2e/performanceprofile/functests/1_performance/cpu_management.go`:
- Around line 1291-1323: The loop launches all retry goroutines concurrently
which can create too many simultaneous pods.ExecCommandOnPod streams; limit
concurrency by adding a semaphore or worker pool when spawning goroutines for
the errgroup (use errgroup.WithContext as before) — for example create a
buffered channel (or semaphore) sized to a safe maxConcurrent (e.g.,
min(retries, 4)), acquire before calling g.Go and release inside the goroutine
when done; keep using idx := i, write results into execProcessCPUSamples[idx],
and ensure g.Wait() still checks for errors.

In `@test/e2e/performanceprofile/functests/11_mixedcpus/mixedcpus.go`:
- Around line 569-575: The selection of "first" CPUs uses raw cpuset order and
can pick offline CPUs; change the logic in the block that calls
nodes.GetCoreSiblings and nodes.GetTwoSiblingsFromCPUSet (and similar blocks
that set updatedShared/updatedIsolated) to intersect the candidate cpuset with
the current online CPU set before choosing CPUs—i.e., compute onlineCandidates
:= candidateCpuset.Intersection(onlineCPUs) and then pick the first two from
onlineCandidates (or fall back appropriately) instead of using
updatedIsolated.List() directly; apply the same intersection-based selection
wherever updatedShared/updatedIsolated are constructed (the other occurrences
noted in the comment).


ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9fe94267-8a2b-4f0d-905b-3817ed39bb59

📥 Commits

Reviewing files that changed from the base of the PR and between a6f7bb8 and b76ea67.

📒 Files selected for processing (16)
  • go.mod
  • test/e2e/performanceprofile/functests/11_mixedcpus/mixedcpus.go
  • test/e2e/performanceprofile/functests/13_llc/llc.go
  • test/e2e/performanceprofile/functests/1_performance/cpu_management.go
  • test/e2e/performanceprofile/functests/1_performance/hugepages.go
  • test/e2e/performanceprofile/functests/1_performance/irqbalance.go
  • test/e2e/performanceprofile/functests/2_performance_update/memorymanager.go
  • test/e2e/performanceprofile/functests/2_performance_update/updating_profile.go
  • test/e2e/performanceprofile/functests/4_latency/latency.go
  • test/e2e/performanceprofile/functests/7_performance_kubelet_node/cgroups.go
  • test/e2e/performanceprofile/functests/8_performance_workloadhints/workloadhints.go
  • test/e2e/performanceprofile/functests/utils/nodes/nodes.go
  • test/e2e/performanceprofile/functests/utils/nodes/nodes_test.go
  • test/e2e/performanceprofile/functests/utils/pods/pods.go
  • test/e2e/performanceprofile/functests/utils/resources/resources.go
  • test/e2e/performanceprofile/functests/utils/resources/resources_test.go
🚧 Files skipped from review as they are similar to previous changes (10)
  • test/e2e/performanceprofile/functests/4_latency/latency.go
  • test/e2e/performanceprofile/functests/utils/resources/resources_test.go
  • test/e2e/performanceprofile/functests/13_llc/llc.go
  • go.mod
  • test/e2e/performanceprofile/functests/1_performance/hugepages.go
  • test/e2e/performanceprofile/functests/utils/nodes/nodes.go
  • test/e2e/performanceprofile/functests/utils/nodes/nodes_test.go
  • test/e2e/performanceprofile/functests/8_performance_workloadhints/workloadhints.go
  • test/e2e/performanceprofile/functests/1_performance/irqbalance.go
  • test/e2e/performanceprofile/functests/utils/resources/resources.go

Comment on lines +1291 to +1323
g, _ := errgroup.WithContext(ctx)
for i := 0; i < retries; i++ {
	idx := i
	g.Go(func() error {
		output, err := pods.ExecCommandOnPod(testclient.K8sClient, testPod, container.Name, cmd)
		if err != nil {
			return fmt.Errorf("exec command failed for run %d: %w", idx, err)
		}
		strout := strings.TrimSpace(string(output))
		testlog.Infof("entry %d: exec command output (samples): \n%s", idx, strout)
		var samples []int
		for _, line := range strings.Split(strout, "\n") {
			line = strings.TrimSpace(line)
			if line == "" {
				continue
			}
			cpu, err := strconv.Atoi(line)
			if err != nil {
				return fmt.Errorf("entry %d: failed to parse processor number from %q: %w", idx, line, err)
			}
			samples = append(samples, cpu)
		}
		if len(samples) == 0 {
			return fmt.Errorf("entry %d: no valid samples", idx)
		}
		if len(samples) < samplesPerRun {
			testlog.Warningf("entry %d: got %d/%d samples (process may have exited early)", idx, len(samples), samplesPerRun)
		}
		execProcessCPUSamples[idx] = samples
		return nil
	})
}
Expect(g.Wait()).ToNot(HaveOccurred(), "One or more exec commands failed")

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, check the broader context around lines 1291-1323 to see how `retries` is defined
cat -n test/e2e/performanceprofile/functests/1_performance/cpu_management.go | sed -n '1270,1330p'

Repository: openshift/cluster-node-tuning-operator

Length of output: 3455


🏁 Script executed:

# Search for where `retries` is defined in this file
rg -n "retries\s*:=|retries\s*=" test/e2e/performanceprofile/functests/1_performance/cpu_management.go

Repository: openshift/cluster-node-tuning-operator

Length of output: 177


🏁 Script executed:

# Check if samplesPerRun is also defined nearby
rg -n "samplesPerRun" test/e2e/performanceprofile/functests/1_performance/cpu_management.go | head -20

Repository: openshift/cluster-node-tuning-operator

Length of output: 338


🌐 Web query:

Go errgroup.SetLimit() availability go.dev

💡 Result:

errgroup.(*Group).SetLimit is available on Go’s official package docs site (pkg.go.dev / go.dev) as part of the module golang.org/x/sync/errgroup (it is not in the standard library). The current docs show SetLimit(n int) and describe how it caps the number of active goroutines. [1]

It was added to x/sync/errgroup in the change “add TryGo and SetLimit to control concurrency” (review dated May 9, 2022), so you need a golang.org/x/sync version from after that change (e.g., modern tagged releases). [2]

Use: import golang.org/x/sync/errgroup and ensure your go.mod depends on a recent golang.org/x/sync (commonly via go get golang.org/x/sync@latest). [1]

Sources: [1] [2]


🏁 Script executed:

# Check errgroup import in the file
rg -n "^import|errgroup" test/e2e/performanceprofile/functests/1_performance/cpu_management.go | head -30

Repository: openshift/cluster-node-tuning-operator

Length of output: 170


🏁 Script executed:

# Check go.mod for golang.org/x/sync dependency
cat go.mod | grep -A 5 -B 5 "golang.org/x/sync"

Repository: openshift/cluster-node-tuning-operator

Length of output: 454


Cap parallel exec sampling to reduce CI flakiness

All retries are launched concurrently; with current retry math this can trigger up to 20 simultaneous exec streams per container. That increases API/server pressure and can make the test flaky under load.

Suggested change
-						g, _ := errgroup.WithContext(ctx)
+						g, _ := errgroup.WithContext(ctx)
+						const maxParallelExecSamples = 4
+						g.SetLimit(maxParallelExecSamples)
 						for i := 0; i < retries; i++ {
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
g, _ := errgroup.WithContext(ctx)
const maxParallelExecSamples = 4
g.SetLimit(maxParallelExecSamples)
for i := 0; i < retries; i++ {
	idx := i
	g.Go(func() error {
		output, err := pods.ExecCommandOnPod(testclient.K8sClient, testPod, container.Name, cmd)
		if err != nil {
			return fmt.Errorf("exec command failed for run %d: %w", idx, err)
		}
		strout := strings.TrimSpace(string(output))
		testlog.Infof("entry %d: exec command output (samples): \n%s", idx, strout)
		var samples []int
		for _, line := range strings.Split(strout, "\n") {
			line = strings.TrimSpace(line)
			if line == "" {
				continue
			}
			cpu, err := strconv.Atoi(line)
			if err != nil {
				return fmt.Errorf("entry %d: failed to parse processor number from %q: %w", idx, line, err)
			}
			samples = append(samples, cpu)
		}
		if len(samples) == 0 {
			return fmt.Errorf("entry %d: no valid samples", idx)
		}
		if len(samples) < samplesPerRun {
			testlog.Warningf("entry %d: got %d/%d samples (process may have exited early)", idx, len(samples), samplesPerRun)
		}
		execProcessCPUSamples[idx] = samples
		return nil
	})
}
Expect(g.Wait()).ToNot(HaveOccurred(), "One or more exec commands failed")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/performanceprofile/functests/1_performance/cpu_management.go` around
lines 1291 - 1323, The loop launches all retry goroutines concurrently which can
create too many simultaneous pods.ExecCommandOnPod streams; limit concurrency by
adding a semaphore or worker pool when spawning goroutines for the errgroup (use
errgroup.WithContext as before) — for example create a buffered channel (or
semaphore) sized to a safe maxConcurrent (e.g., min(retries, 4)), acquire before
calling g.Go and release inside the goroutine when done; keep using idx := i,
write results into execProcessCPUSamples[idx], and ensure g.Wait() still checks
for errors.

Comment on lines +569 to +575
coreSiblings, err := nodes.GetCoreSiblings(ctx, workerRTNode)
Expect(err).ToNot(HaveOccurred())
updatedShared, err = nodes.GetTwoSiblingsFromCPUSet(coreSiblings, updatedIsolated)
if err != nil {
	testlog.Info("no two siblings found in the given CPU set, looks like the initial profile does not respect hyperthreading; proceed then with this state and pick first two isolated CPUs as the shared CPUs")
	updatedShared = cpuset.New(updatedIsolated.List()[0], updatedIsolated.List()[1])
}
⚠️ Potential issue | 🟠 Major

Choose/validate affinity against online CPUs, not raw cpuset order.

This logic can select an offline CPU as the expected “first” CPU, which makes the test flaky in constrained CI topologies. Build expected sets from cpuset ∩ onlineCPUs before selecting firstCPU and asserting samples.

Proposed fix
 				for _, container := range testPod.Spec.Containers {
+					onlineCPUs, err := nodes.GetOnlineCPUsSet(ctx, workerRTNode)
+					Expect(err).ToNot(HaveOccurred())
@@
 					cpusIncludingShared, err := cpuset.Parse(cpusetCfg.Cpus)
 					Expect(err).ToNot(HaveOccurred(), "Failed to parse cpuset config for test pod cpus=%q", cpusetCfg.Cpus)
 					Expect(cpusIncludingShared.Size()).ToNot(BeZero(), "Failed to get cpuset config for test pod cpus=%q", cpusetCfg.Cpus)
+					runnableCPUs := cpusIncludingShared.Intersection(onlineCPUs)
+					Expect(runnableCPUs.Size()).ToNot(BeZero(), "no runnable CPUs in cpuset %s (online=%s)", cpusIncludingShared.String(), onlineCPUs.String())
 					testlog.Infof("all CPUs dedicated for the container (including shared if requested): %s", cpusIncludingShared.String())
-					firstCPU := cpusIncludingShared.List()[0]
+					firstCPU := runnableCPUs.List()[0]
@@
 						cntShared := cpusIncludingShared.Difference(updatedIsolated)
 						reserved := mustParse(string(*profile.Spec.CPU.Reserved))
-						cntShared = cntShared.Difference(*reserved)
+						cntShared = cntShared.Difference(*reserved).Intersection(onlineCPUs)
 						Expect(cntShared.Size()).ToNot(BeZero(), "no shared CPUs found for container %s", container.Name)
@@
 							if isExclusiveCPURequest {
 								Expect(cpu).To(Equal(firstCPU), "Exec process CPU is not the correct first CPU")
 							} else {
-								Expect(cpusIncludingShared.Contains(cpu)).To(BeTrue(), "Exec process CPU %d is not part of the cpuset: %s", cpu, cpusIncludingShared.String())
+								Expect(runnableCPUs.Contains(cpu)).To(BeTrue(), "Exec process CPU %d is not part of the runnable cpuset: %s", cpu, runnableCPUs.String())
 							}

As per coding guidelines: "Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity."

Also applies to: 675-676, 760-764

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/e2e/performanceprofile/functests/11_mixedcpus/mixedcpus.go` around lines
569 - 575, The selection of "first" CPUs uses raw cpuset order and can pick
offline CPUs; change the logic in the block that calls nodes.GetCoreSiblings and
nodes.GetTwoSiblingsFromCPUSet (and similar blocks that set
updatedShared/updatedIsolated) to intersect the candidate cpuset with the
current online CPU set before choosing CPUs—i.e., compute onlineCandidates :=
candidateCpuset.Intersection(onlineCPUs) and then pick the first two from
onlineCandidates (or fall back appropriately) instead of using
updatedIsolated.List() directly; apply the same intersection-based selection
wherever updatedShared/updatedIsolated are constructed (the other occurrences
noted in the comment).
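The intersection-based selection suggested above can be illustrated with a plain map-backed set; the real test would use `cpuset.CPUSet.Intersection` from k8s.io/utils/cpuset, which this stdlib-only sketch mirrors (`intersect` and the sample CPU lists are illustrative):

```go
package main

import (
	"fmt"
	"sort"
)

// intersect returns the sorted intersection of two CPU ID lists,
// mirroring cpuset.CPUSet.Intersection semantics.
func intersect(candidate, online []int) []int {
	onlineSet := make(map[int]bool, len(online))
	for _, c := range online {
		onlineSet[c] = true
	}
	var out []int
	for _, c := range candidate {
		if onlineSet[c] {
			out = append(out, c)
		}
	}
	sort.Ints(out)
	return out
}

func main() {
	candidate := []int{0, 1, 2, 3} // cpuset assigned to the container
	online := []int{1, 2, 3, 5}    // CPUs currently online on the node
	runnable := intersect(candidate, online)
	fmt.Println(runnable)    // [1 2 3] — pick the "first CPU" from this set
	fmt.Println(runnable[0]) // 1, not the offline CPU 0
}
```

Selecting `runnable[0]` rather than `candidate[0]` is exactly the change the review asks for: the expected affinity target is the first online CPU of the set, so an offline CPU at the head of the raw cpuset cannot make the assertion flaky.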

Introduce pods.DeleteAndSync helper (with client.Client parameter)
and replace all local deleteTestPod functions and inline
Get+Delete+WaitForDeletion patterns across functest suites
with calls to this shared utility.

Assisted-by: Cursor v2.3.37
AIA Attribution: AIA Primarily human, Stylistic edits, Human-initiated, Reviewed
Signed-off-by: Shereen Haj <shajmakh@redhat.com>
Add e2e tests to verify exec-cpu-affinity behavior across
cpu_management, updating_profile, and mixedcpus suites.
Tests verify that exec processes are pinned to the correct
exclusive, shared, or isolated CPU sets based on pod QoS class
and container resource requests.

Assisted-by: Cursor v2.3.37
AIA Attribution: AIA Primarily human, Stylistic edits, Human-initiated, Reviewed v1.0
Signed-off-by: Shereen Haj <shajmakh@redhat.com>
Add unit tests for functions in resources helper package for tests.

Assisted-by: Cursor v1.2.2
AI-Attribution: AIA Entirely AI, Human-initiated, Reviewed, Cursor v1.2.2 v1.0

Signed-off-by: Shereen Haj <shajmakh@redhat.com>
@shajmakh
Contributor Author

/retest

@openshift-ci
Contributor

openshift-ci bot commented Mar 17, 2026

@shajmakh: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
--- | --- | --- | --- | ---
ci/prow/vet | 3604b53 | link | true | /test vet
ci/prow/okd-scos-images | 3604b53 | link | true | /test okd-scos-images
ci/prow/e2e-gcp-pao-workloadhints | 3604b53 | link | true | /test e2e-gcp-pao-workloadhints
ci/prow/lint | 3604b53 | link | true | /test lint
ci/prow/e2e-hypershift | 3604b53 | link | true | /test e2e-hypershift

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
