
CI: Target IB-capable nodes for tests #11173

Open
Alexey-Rivkin wants to merge 9 commits into openucx:master from Alexey-Rivkin:gtest_ib_infra

@Alexey-Rivkin (Contributor) commented Feb 9, 2026

What?

Update UCX gtest k8s jobs to run on RDMA-capable nodes in Blossom:

  • RoCE (ConnectX-7): network sriov-cx7-p2, resource nvidia.com/sriov_cx7_p2
  • IB (ConnectX-8): network cx8-ib-network, resource nvidia.com/cx8_vfs

Why?

Ensure the tests exercise both the IB and the RoCE paths, each on its own HCA type, with the RDMA access they require.

How?

Add two runs_on_dockers entries (hca-roce and hca-ib) to .ci/pipeline/test_matrix.yaml, each with its corresponding annotations, resource limits/requests, and caps_add: [IPC_LOCK, NET_RAW]. The RoCE entry runs first.
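To illustrate what the two entries are meant to request at the pod level, here is a minimal Python sketch that builds the corresponding Kubernetes Pod fragments. The network names, resource names and capabilities come from the description above. The Multus annotation key k8s.v1.cni.cncf.io/networks, the one-VF quantity, and the helper names HCA_VARIANTS and pod_fragment are assumptions for illustration only; this is not the test_matrix.yaml schema or the CI framework's own code.

```python
# Hypothetical sketch: Kubernetes Pod fragments that the hca-roce and hca-ib
# entries are intended to produce. Field names follow the core Kubernetes Pod
# API; the Multus annotation key is an ASSUMPTION about how Blossom attaches
# the secondary RDMA network.
import json

MULTUS_NETWORKS_KEY = "k8s.v1.cni.cncf.io/networks"  # assumption: Multus CNI

# Network and device-plugin resource names are taken from the PR description.
# List order matters: RoCE runs first, then IB.
HCA_VARIANTS = [
    {"name": "hca-roce", "network": "sriov-cx7-p2", "resource": "nvidia.com/sriov_cx7_p2"},
    {"name": "hca-ib", "network": "cx8-ib-network", "resource": "nvidia.com/cx8_vfs"},
]


def pod_fragment(variant, vfs=1):
    """Build annotations, resources and capabilities for one HCA variant.

    `vfs` (number of virtual functions) is a hypothetical default of 1.
    """
    quantity = str(vfs)
    return {
        "metadata": {"annotations": {MULTUS_NETWORKS_KEY: variant["network"]}},
        "container": {
            "resources": {
                # Extended resources cannot be overcommitted, so requests == limits.
                "limits": {variant["resource"]: quantity},
                "requests": {variant["resource"]: quantity},
            },
            "securityContext": {
                # IPC_LOCK: lock (pin) memory beyond RLIMIT_MEMLOCK, which RDMA
                # memory registration relies on. NET_RAW: raw packet access
                # (e.g. raw-packet QPs).
                "capabilities": {"add": ["IPC_LOCK", "NET_RAW"]},
            },
        },
    }


if __name__ == "__main__":
    for v in HCA_VARIANTS:
        print(v["name"], json.dumps(pod_fragment(v), indent=2))
```

Keeping the variants in an ordered list mirrors the RoCE-first ordering, and emitting identical limits and requests reflects the Kubernetes rule that extended resources such as nvidia.com/cx8_vfs must not be overcommitted.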

@Alexey-Rivkin (Contributor, Author)

/build

Alexey-Rivkin changed the title from "Gtest ib infra" to "CI: Target IB-capable nodes for tests" on Feb 9, 2026
Alexey-Rivkin force-pushed the gtest_ib_infra branch 21 times, most recently from 09c7c1a to 6d2947d, on February 15, 2026 09:28
Alexey-Rivkin marked this pull request as ready for review on February 15, 2026 13:13
Review comment thread on .ci/pipeline/test_matrix.yaml (outdated)
Alexey-Rivkin force-pushed the gtest_ib_infra branch 2 times, most recently from 32c240d to 0abba6a, on February 23, 2026 15:48
Review comment thread on .ci/pipeline/test_matrix.yaml
Review comment thread on .ci/pipeline/test_matrix.yaml (outdated)
Review comment thread on .ci/pipeline/test_matrix.yaml (outdated)
@Alexey-Rivkin (Contributor, Author)

/build

Review comment thread on .ci/pipeline/test_matrix.yaml (outdated)
@Alexey-Rivkin (Contributor, Author)

/build

Review comment thread on .ci/pipeline/test_matrix.yaml
@Alexey-Rivkin (Contributor, Author)

/build

1 similar comment
@dpressle (Contributor) commented Mar 1, 2026

/build

dpressle self-requested a review on March 1, 2026 14:52
dpressle previously approved these changes on Mar 1, 2026
@dpressle (Contributor) commented Mar 2, 2026

/build

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>
Gtest fails when running in the k8s environment, because an unlimited
max_threads causes resource exhaustion. Setting the CPU affinity
limits max_threads to 2 dynamically.

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>
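The commit above relies on the thread count being derived from the CPU affinity mask rather than left unbounded. A minimal Python sketch of that idea follows, assuming (hypothetically) a one-thread-per-allowed-CPU rule; pick_test_cpus and max_threads_from_affinity are illustrative names, not UCX gtest functions. os.sched_getaffinity and os.sched_setaffinity are the Linux-only standard-library wrappers around the same system call that `taskset` uses.

```python
import os


def pick_test_cpus(count=2):
    """Return the first `count` CPUs this process is currently allowed to use.

    Linux-only: os.sched_getaffinity(0) reads the calling process's mask.
    """
    allowed = sorted(os.sched_getaffinity(0))
    return set(allowed[:count])


def max_threads_from_affinity(cpus):
    """Hypothetical rule: one worker thread per CPU in the mask, never fewer than 1."""
    return max(1, len(cpus))


if __name__ == "__main__":
    cpus = pick_test_cpus(2)
    # Restrict this process (pid 0 means "self") to the chosen CPUs, as
    # `taskset -c` would; child processes such as a test binary inherit the mask.
    os.sched_setaffinity(0, cpus)
    # Derive the thread limit from the mask instead of leaving it unbounded.
    print("max_threads =", max_threads_from_affinity(os.sched_getaffinity(0)))
```

On a host where the process is allowed at least two CPUs, this prints `max_threads = 2`, matching the limit described in the commit message.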
Run RoCE first, then IB.

Signed-off-by: Alexey Rivkin <arivkin@nvidia.com>
@Alexey-Rivkin (Contributor, Author)

/build

2 participants