
Add XPU workflow #1302

Closed
scotts wants to merge 17 commits into pytorch:main from scotts:ci_xpu

Conversation

@scotts
Contributor

@scotts scotts commented Mar 13, 2026

Adds an XPU workflow to CI. Mirrors the existing CUDA and ROCm workflows.

@pytorch-bot

pytorch-bot Bot commented Mar 13, 2026

Warning: Unknown label ciflow/xpu.
Currently recognized labels are

  • ciflow/rocm

Please add the new label to .github/pytorch-probot.yml
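The probot config typically lists the recognized ciflow labels in one place. A sketch of the requested addition (the `ciflow_push_tags` key name is an assumption carried over from pytorch/pytorch's probot config, not confirmed for this repo):

```yaml
# .github/pytorch-probot.yml
ciflow_push_tags:
  - ciflow/rocm
  - ciflow/xpu   # newly added label for the XPU workflow
```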


jobs:
  pr-test:
    uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
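For context, a fuller sketch of what such a reusable-workflow job might look like. The runner label and input names below are assumptions modeled on the existing CUDA/ROCm workflows, not taken from the actual PR:

```yaml
jobs:
  pr-test:
    uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
    with:
      runner: linux.idc.xpu          # assumed label for the Intel-hosted XPU runners
      timeout: 120                   # assumed; mirrors typical GPU CI timeouts
      script: |
        # assumed entry point; the PR adds .github/scripts/config_xpu.sh
        .github/scripts/config_xpu.sh
```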
Contributor


Because all XPU runners under the pytorch org are currently self-hosted runners maintained directly by Intel, ECR permissions need to be configured. I have submitted two PRs, pytorch/test-infra#7853 and pytorch/pytorch#177831, to enable this.

@scotts
Contributor Author

scotts commented Mar 20, 2026

@chuanqi129, we're still getting the failures. Are there any other steps we need to take?

@chuanqi129
Contributor

> @chuanqi129, we're still getting the failures. Are there any other steps we need to take?

Hi @scotts, I have double-checked the failure log, and it's very strange: according to the log, it seems PRs pytorch/test-infra#7853 and pytorch/test-infra#7860 are not working as expected. Could you please try rebasing your PR instead of rerunning the failed job?
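"Rebase your PR" here means replaying the branch's commits on top of the updated base so the new infra changes take effect. A self-contained sketch in a throwaway repo (the branch name mirrors this PR's `ci_xpu`, but the commands are generic, not the exact ones used):

```shell
#!/usr/bin/env bash
set -e
# Demonstrate the rebase flow in a throwaway repository.
tmp=$(mktemp -d) && cd "$tmp"
git init -q -b main
git config user.email ci@example.com && git config user.name ci

echo base > file && git add file && git commit -qm "base"

# PR branch with one commit
git checkout -qb ci_xpu
echo xpu > workflow && git add workflow && git commit -qm "Add XPU workflow"

# Meanwhile, main moves forward (e.g. the test-infra fixes land)
git checkout -q main
echo update >> file && git commit -qam "upstream change"

# Replay the PR commits on top of the updated main
git checkout -q ci_xpu
git rebase -q main

git rev-list --count HEAD   # 3 commits: base, upstream change, Add XPU workflow
```

In the real PR the equivalent is `git fetch` of the upstream, `git rebase` onto its main branch, then a force-push of the branch.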

@scotts
Contributor Author

scotts commented Mar 23, 2026

@chuanqi129, done, rebased and pushed. We're getting different failures now, during "Setup XPU": https://github.com/pytorch/kineto/actions/runs/23438522997/job/68182791603?pr=1302.

@chuanqi129
Contributor

> @chuanqi129, done, rebased and pushed. We're getting different failures now, during "Setup XPU": https://github.com/pytorch/kineto/actions/runs/23438522997/job/68182791603?pr=1302.

Hi @scotts, I have submitted a new PR, pytorch/pytorch#178143, to address this cross-repo issue. Please help review it.

@scotts
Contributor Author

scotts commented Mar 24, 2026

@chuanqi129, we're getting an error when trying to pull the docker image: https://github.com/pytorch/kineto/actions/runs/23438522997/job/68423941333?pr=1302

Comment thread .github/workflows/linux_xpu.yml
@chuanqi129
Contributor

> @chuanqi129, we're getting an error when trying to pull the docker image: https://github.com/pytorch/kineto/actions/runs/23438522997/job/68423941333?pr=1302

Hi @scotts, we don't enable the pytorch/almalinux-builder:xpu docker image build in pytorch, so this failure is expected. Could you please try the modification suggested above again?

@chuanqi129
Contributor

chuanqi129 commented Mar 25, 2026

Hi @scotts, I have checked the latest xpu workflow failure and created another PR, pytorch/pytorch#178380, to address it. Please help review it again.

As for the kineto/pytorch build for xpu, I think some extra steps are needed. I will get back to you later.

Comment thread .github/scripts/config_xpu.sh
scotts and others added 2 commits March 27, 2026 10:55
Co-authored-by: Wang, Chuanqi <chuanqi.wang@intel.com>
@scotts
Contributor Author

scotts commented Mar 27, 2026

@chuanqi129, progress! But now we're getting certificate issues on the host when trying to use conda to update packages: https://github.com/pytorch/kineto/actions/runs/23652712326/job/68936825671?pr=1302#step:16:66

@chuanqi129
Contributor

chuanqi129 commented Mar 30, 2026

> @chuanqi129, progress! But now we're getting certificate issues on the host when trying to use conda to update packages: https://github.com/pytorch/kineto/actions/runs/23652712326/job/68936825671?pr=1302#step:16:66

Thanks @scotts, this should be caused by the anaconda default channel, which can't be used on Intel-owned machines; we can use the conda-forge channel instead. Let me check how to fix it.

@scotts
Contributor Author

scotts commented Mar 31, 2026

@chuanqi129, I was curious if it would make a difference if I used conda-forge before you made any changes, and it still fails in the same way. Let me know if there's anything I can do to help!

@chuanqi129
Contributor

chuanqi129 commented Apr 1, 2026

Hi @scotts, sorry for the late reply. I have tried it locally; the cause is that the default channel also comes from anaconda. I can resolve this issue with the workaround below:

# show current channels
conda config --show channels
# remove defaults channel
conda config --remove channels defaults
# add conda-forge
conda config --add channels conda-forge
conda install -y 'cmake>=3.27'

Could you please try it again? If it resolves the issue, maybe we can consider adding this workaround directly into https://github.com/pytorch/pytorch/blob/main/.ci/docker/common/install_conda.sh

@chuanqi129
Contributor

Thanks @scotts, we got a new failure now: https://github.com/pytorch/kineto/actions/runs/23869014674/job/69595758258?pr=1302#step:16:1692. This failure is caused by the xpu env setup scripts containing unbound variables, so for now we need to run `set +u` before sourcing the xpu environment.
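To illustrate the mechanism: under `set -u` (common in CI scripts), sourcing a script that expands an unset variable aborts the run, so strict mode is relaxed just around the `source`. The env-script contents below are a hypothetical stand-in, not the actual Intel setup script:

```shell
#!/usr/bin/env bash
set -u   # CI scripts often treat unbound variables as errors

# Hypothetical stand-in for the XPU env script: it expands a variable
# that may be unset, which would abort the run under `set -u`.
cat > /tmp/fake_xpu_env.sh <<'EOF'
EXTRA_PATH="$XPU_EXTRA_PATH"
export PATH="$PATH:$EXTRA_PATH"
EOF

set +u                       # temporarily tolerate unbound variables
source /tmp/fake_xpu_env.sh
set -u                       # restore strict mode afterwards
echo "env sourced"
```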

@chuanqi129
Contributor

Thanks @scotts for the progress update. Regarding the linker error, I have invited our developer @moksiuc to help double-check. Since we can build kineto xpu successfully as part of the pytorch xpu build, there is no reason we can't build kineto xpu standalone. Let's check it.

@chuanqi129
Contributor

chuanqi129 commented Apr 6, 2026

Hi @scotts, for the linker error, I have submitted PR #1349 to fix it (with Copilot's help) and verified the fix in a local env. Please help review it. CC @moksiuc

@scotts
Contributor Author

scotts commented Apr 9, 2026

@chuanqi129, the build succeeded and we're running tests! 🎉 It looks like we have two potential issues remaining:

  1. A warning is flooding the logs: warning: Double arithmetic operation is not supported on this platform with FP64 conversion emulation mode (poison FP64 kernels is enabled). I don't see that message anywhere in the PyTorch or Kineto repos, so I suspect the hardware itself is emitting it. I found FP64 Emulation Support is Broken; Cannot Run Own Scripts intel/intel-extension-for-pytorch#257. Is there a setting we need in the environment scripts?
  2. There's an infra error uploading results files: https://github.com/pytorch/kineto/actions/runs/24159003509/job/70505315113?pr=1302#step:19:55

@chuanqi129
Contributor

Hi @scotts, thanks for the update! For issue 1, the warning is caused by the torch xpu default AOT target list including older Intel GPU platforms that don't support FP64; it's not a real issue. Let's limit the default AOT build here https://github.com/pytorch/kineto/blob/main/.github/scripts/pytorch_build_test.sh#L36 by adding the lines below.

elif [[ "${GPU_ARCH}" == "xpu" ]]; then
  export TORCH_XPU_ARCH_LIST=pvc

For issue 2, let me check it.

@scotts
Contributor Author

scotts commented Apr 13, 2026

@chuanqi129 success! I'm going to clean this up a little and then we should be good to go. (I suspect the infra error was caused by the logs being too big because of all of the warnings.)

@chuanqi129
Contributor

That's great! Thanks @scotts

@scotts scotts marked this pull request as ready for review April 14, 2026 02:12
Contributor

@chuanqi129 chuanqi129 left a comment


LGTM. Thanks @scotts

@scotts
Contributor Author

scotts commented Apr 14, 2026

@chuanqi129, it looks like we're actually still getting that failure: https://github.com/pytorch/kineto/actions/runs/24377248598/job/71193215400?pr=1302

@meta-codesync

meta-codesync Bot commented Apr 14, 2026

@scotts has imported this pull request. If you are a Meta employee, you can view this in D100734710.

@scotts
Contributor Author

scotts commented Apr 14, 2026

Submitted pytorch/test-infra#7966 to fix.

atalman pushed a commit to pytorch/test-infra that referenced this pull request Apr 14, 2026
The `linux.idc.*` runners (e.g. XPU) don't have the ECR alpine image
cached and lack ECR credentials at the chown-directory cleanup stage,
causing post-test failures even when all tests pass. Use the public
Docker Hub alpine image for these runners, matching the existing pattern
for `linux.arm64` runners.

I hit this when trying to get XPU workflows going on Kineto:
pytorch/kineto#1302
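The runner-based image selection that commit describes could be sketched as follows. The runner labels come from the commit message above, but the variable names and the ECR registry path are placeholders, not the real workflow code:

```shell
#!/usr/bin/env bash
# Pick the alpine image used for the chown cleanup step based on runner type.
RUNNER_TYPE="linux.idc.xpu.1"   # example value; set by the workflow in practice

case "$RUNNER_TYPE" in
  linux.arm64.*|linux.idc.*)
    # These runners lack ECR credentials and the cached ECR image:
    # fall back to the public Docker Hub alpine image.
    ALPINE_IMAGE="alpine"
    ;;
  *)
    ALPINE_IMAGE="<ecr-registry>/alpine"   # placeholder for the ECR-hosted image
    ;;
esac

echo "$ALPINE_IMAGE"
```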
@meta-codesync meta-codesync Bot closed this in 11cc1e0 Apr 15, 2026
@meta-codesync

meta-codesync Bot commented Apr 15, 2026

@scotts merged this pull request in 11cc1e0.

ryanzhang22 added a commit to ryanzhang22/pytorch that referenced this pull request Apr 15, 2026
Includes the following commits:

- Stop allocating CUPTI buffers after exceeding max buffer count (pytorch/kineto#1362) 666f62c
- Add XPU workflow (pytorch/kineto#1302) 11cc1e0
ryanzhang22 added a commit to ryanzhang22/pytorch that referenced this pull request Apr 15, 2026
Includes the following commits:

- Stop allocating CUPTI buffers after exceeding max buffer count (pytorch/kineto#1362) 666f62c
- Add XPU workflow (pytorch/kineto#1302) 11cc1e0
- Remove RocprofActivity.h/RoctracerActivity.h from RocmActivityProfiler.h (pytorch/kineto#1357) 896068d
- Split ActivityProfilerController into Sync and Async Handlers (pytorch/kineto#1269) 6d7f045
- Add priority field to kernel metadata (pytorch/kineto#1361) f2a7423
- Add kineto-release skill (pytorch/kineto#1360) 675b6cd
scotts added a commit to scotts/pytorch that referenced this pull request Apr 16, 2026
Includes the following commits:

- Fix stream wait events referencing future correlation IDs (pytorch/kineto#1339) 23b5bb5
- Remove kineto tb_plugin directory entirely (pytorch/kineto#1368) 9497960
- Move Stream Sync events to a new row in JSON trace export (pytorch/kineto#1356) 041e7ce
- Expose isGpuCollectionStopped() through Kineto's public API (pytorch/kineto#1367) 17708f5
- Fix toggle test (pytorch/kineto#1369) ee2103c
- Link to correct fmt repo (pytorch/kineto#1345) 3447834
- Fix data race on CuptiActivityApi::externalCorrelationEnabled_ (pytorch/kineto#1365) 0e86499
- Stop allocating CUPTI buffers after exceeding max buffer count (pytorch/kineto#1362) 666f62c
- Add XPU workflow (pytorch/kineto#1302) 11cc1e0
- Remove RocprofActivity.h/RoctracerActivity.h from RocmActivityProfiler.h (pytorch/kineto#1357) 896068d
- Split ActivityProfilerController into Sync and Async Handlers (pytorch/kineto#1269) 6d7f045
- Add priority field to kernel metadata (pytorch/kineto#1361) f2a7423
- Add kineto-release skill (pytorch/kineto#1360) 675b6cd

Authored with Claude.
pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request Apr 17, 2026
Includes the following commits:

- Fix stream wait events referencing future correlation IDs (pytorch/kineto#1339) 23b5bb5
- Remove kineto tb_plugin directory entirely (pytorch/kineto#1368) 9497960
- Move Stream Sync events to a new row in JSON trace export (pytorch/kineto#1356) 041e7ce
- Expose isGpuCollectionStopped() through Kineto's public API (pytorch/kineto#1367) 17708f5
- Fix toggle test (pytorch/kineto#1369) ee2103c
- Link to correct fmt repo (pytorch/kineto#1345) 3447834
- Fix data race on CuptiActivityApi::externalCorrelationEnabled_ (pytorch/kineto#1365) 0e86499
- Stop allocating CUPTI buffers after exceeding max buffer count (pytorch/kineto#1362) 666f62c
- Add XPU workflow (pytorch/kineto#1302) 11cc1e0
- Remove RocprofActivity.h/RoctracerActivity.h from RocmActivityProfiler.h (pytorch/kineto#1357) 896068d
- Split ActivityProfilerController into Sync and Async Handlers (pytorch/kineto#1269) 6d7f045
- Add priority field to kernel metadata (pytorch/kineto#1361) f2a7423
- Add kineto-release skill (pytorch/kineto#1360) 675b6cd

Authored with Claude.
Pull Request resolved: #180606
Approved by: https://github.com/ryanzhang22, https://github.com/Skylion007