
Add XPU workflow #1302

Closed
scotts wants to merge 17 commits into pytorch:main from scotts:ci_xpu

Conversation

@scotts
Contributor

@scotts scotts commented Mar 13, 2026

Adds an XPU workflow to CI. Mirrors the existing CUDA and ROCm workflows.

@pytorch-bot

pytorch-bot Bot commented Mar 13, 2026

Warning: Unknown label ciflow/xpu.
Currently recognized labels are

  • ciflow/rocm

Please add the new label to .github/pytorch-probot.yml
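The probot config typically lists the recognized ciflow labels in one place. A sketch of the requested addition (the `ciflow_push_tags` key name is an assumption carried over from pytorch/pytorch's probot config, not confirmed for this repo):

```yaml
# .github/pytorch-probot.yml
ciflow_push_tags:
  - ciflow/rocm
  - ciflow/xpu   # newly added label for the XPU workflow
```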


jobs:
  pr-test:
    uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
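For context, a fuller sketch of what such a reusable-workflow job might look like. The runner label and input names below are assumptions modeled on the existing CUDA/ROCm workflows, not taken from the actual PR:

```yaml
jobs:
  pr-test:
    uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
    with:
      runner: linux.idc.xpu          # assumed label for the Intel-hosted XPU runners
      timeout: 120                   # assumed; mirrors typical GPU CI timeouts
      script: |
        # assumed entry point; the PR adds .github/scripts/config_xpu.sh
        .github/scripts/config_xpu.sh
```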
Contributor


Because all XPU runners under the pytorch org are currently self-hosted runners maintained directly by Intel, ECR permissions need to be configured. I have submitted two PRs, pytorch/test-infra#7853 and pytorch/pytorch#177831, to enable this.

@scotts
Contributor Author

scotts commented Mar 20, 2026

@chuanqi129, we're still getting the failures. Are there any other steps we need to take?

@chuanqi129
Contributor

> @chuanqi129, we're still getting the failures. Are there any other steps we need to take?

Hi @scotts, I have double-checked the failure log, and it's very strange: according to the log, it seems PRs pytorch/test-infra#7853 and pytorch/test-infra#7860 are not working as expected. Could you please try rebasing your PR instead of rerunning the failed job?
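"Rebase your PR" here means replaying the branch's commits on top of the updated base so the new infra changes take effect. A self-contained sketch in a throwaway repo (the branch name mirrors this PR's `ci_xpu`, but the commands are generic, not the exact ones used):

```shell
#!/usr/bin/env bash
set -e
# Demonstrate the rebase flow in a throwaway repository.
tmp=$(mktemp -d) && cd "$tmp"
git init -q -b main
git config user.email ci@example.com && git config user.name ci

echo base > file && git add file && git commit -qm "base"

# PR branch with one commit
git checkout -qb ci_xpu
echo xpu > workflow && git add workflow && git commit -qm "Add XPU workflow"

# Meanwhile, main moves forward (e.g. the test-infra fixes land)
git checkout -q main
echo update >> file && git commit -qam "upstream change"

# Replay the PR commits on top of the updated main
git checkout -q ci_xpu
git rebase -q main

git rev-list --count HEAD   # 3 commits: base, upstream change, Add XPU workflow
```

In the real PR the equivalent is `git fetch` of the upstream, `git rebase` onto its main branch, then a force-push of the branch.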

@scotts
Contributor Author

scotts commented Mar 23, 2026

@chuanqi129, done, rebased and pushed. We're getting different failures now, during "Setup XPU": https://github.com/pytorch/kineto/actions/runs/23438522997/job/68182791603?pr=1302.

@chuanqi129
Contributor

> @chuanqi129, done, rebased and pushed. We're getting different failures now, during "Setup XPU": https://github.com/pytorch/kineto/actions/runs/23438522997/job/68182791603?pr=1302.

Hi @scotts, I have submitted a new PR, pytorch/pytorch#178143, to address this cross-repo issue. Please help review it.

@scotts
Contributor Author

scotts commented Mar 24, 2026

@chuanqi129, we're getting an error when trying to pull the docker image: https://github.com/pytorch/kineto/actions/runs/23438522997/job/68423941333?pr=1302

Comment thread .github/workflows/linux_xpu.yml
@chuanqi129
Contributor

> @chuanqi129, we're getting an error when trying to pull the docker image: https://github.com/pytorch/kineto/actions/runs/23438522997/job/68423941333?pr=1302

Hi @scotts, we don't enable the pytorch/almalinux-builder:xpu docker image build in pytorch, so this failure is expected. Could you please try the modification suggested above again?

@chuanqi129
Contributor

chuanqi129 commented Mar 25, 2026

Hi @scotts, I have checked the latest xpu workflow failure and created another PR, pytorch/pytorch#178380, to address it. Please help review it again.

As for the kineto/pytorch build for xpu, I think some extra steps are needed. I will get back to you later.

Comment thread .github/scripts/config_xpu.sh
scotts and others added 2 commits March 27, 2026 10:55
Co-authored-by: Wang, Chuanqi <chuanqi.wang@intel.com>
@scotts
Contributor Author

scotts commented Mar 27, 2026

@chuanqi129, progress! But now we're getting certificate issues on the host when trying to use conda to update packages: https://github.com/pytorch/kineto/actions/runs/23652712326/job/68936825671?pr=1302#step:16:66

@chuanqi129
Contributor

chuanqi129 commented Mar 30, 2026

> @chuanqi129, progress! But now we're getting certificate issues on the host when trying to use conda to update packages: https://github.com/pytorch/kineto/actions/runs/23652712326/job/68936825671?pr=1302#step:16:66

Thanks @scotts, this should be caused by the anaconda default channel, which can't be used on Intel-owned machines; we can use the conda-forge channel instead. Let me check how to fix it.

@scotts
Contributor Author

scotts commented Mar 31, 2026

@chuanqi129, I was curious if it would make a difference if I used conda-forge before you made any changes, and it still fails in the same way. Let me know if there's anything I can do to help!

@chuanqi129
Contributor

chuanqi129 commented Apr 1, 2026

Hi @scotts, sorry for the late reply. I have tried it locally; the cause is that the default channel also comes from anaconda. I can resolve this issue with the workaround below:

# show current channels
conda config --show channels
# remove defaults channel
conda config --remove channels defaults
# add conda-forge
conda config --add channels conda-forge
conda install -y 'cmake>=3.27'

Could you please try it again? If it resolves the issue, maybe we can consider adding this workaround directly into https://github.com/pytorch/pytorch/blob/main/.ci/docker/common/install_conda.sh

@chuanqi129
Contributor

Thanks @scotts, we got a new failure now: https://github.com/pytorch/kineto/actions/runs/23869014674/job/69595758258?pr=1302#step:16:1692. This failure is caused by the xpu env setup scripts containing unbound variables, so for now we need to run `set +u` before sourcing the xpu environment.
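To illustrate the mechanism: under `set -u` (common in CI scripts), sourcing a script that expands an unset variable aborts the run, so strict mode is relaxed just around the `source`. The env-script contents below are a hypothetical stand-in, not the actual Intel setup script:

```shell
#!/usr/bin/env bash
set -u   # CI scripts often treat unbound variables as errors

# Hypothetical stand-in for the XPU env script: it expands a variable
# that may be unset, which would abort the run under `set -u`.
cat > /tmp/fake_xpu_env.sh <<'EOF'
EXTRA_PATH="$XPU_EXTRA_PATH"
export PATH="$PATH:$EXTRA_PATH"
EOF

set +u                       # temporarily tolerate unbound variables
source /tmp/fake_xpu_env.sh
set -u                       # restore strict mode afterwards
echo "env sourced"
```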

@chuanqi129
Contributor

Thanks @scotts for the progress update. Regarding the linker error, I have invited our developer @moksiuc to help double-check. Since we can build kineto xpu successfully as part of the pytorch xpu build, there is no reason we can't build kineto xpu standalone. Let's check it.

@chuanqi129
Contributor

chuanqi129 commented Apr 6, 2026

Hi @scotts, for the linker error, I have submitted PR #1349 to fix it (with Copilot's help) and verified the fix in a local env. Please help review it. CC @moksiuc

@scotts
Contributor Author

scotts commented Apr 9, 2026

@chuanqi129, the build succeeded and we're running tests! 🎉 It looks like we have two potential issues remaining:

  1. A warning is flooding the logs: warning: Double arithmetic operation is not supported on this platform with FP64 conversion emulation mode (poison FP64 kernels is enabled). I don't see that message anywhere in the PyTorch or Kineto repos, so I suspect the hardware itself is emitting it. I found FP64 Emulation Support is Broken; Cannot Run Own Scripts intel/intel-extension-for-pytorch#257. Is there a setting we need in the environment scripts?
  2. There's an infra error uploading results files: https://github.com/pytorch/kineto/actions/runs/24159003509/job/70505315113?pr=1302#step:19:55

@chuanqi129
Contributor

Hi @scotts, thanks for the update! For issue 1, the warning is caused by the torch xpu default AOT target list including older Intel GPU platforms that don't support FP64; it's not a real issue. Let's limit the default AOT build here https://github.com/pytorch/kineto/blob/main/.github/scripts/pytorch_build_test.sh#L36 by adding the lines below.

elif [[ "${GPU_ARCH}" == "xpu" ]]; then
  export TORCH_XPU_ARCH_LIST=pvc

For issue 2, let me check it.

@scotts
Contributor Author

scotts commented Apr 13, 2026

@chuanqi129 success! I'm going to clean this up a little and then we should be good to go. (I suspect the infra error was caused by the logs being too big because of all of the warnings.)

@chuanqi129
Contributor

That's great! Thanks @scotts

@scotts scotts marked this pull request as ready for review April 14, 2026 02:12
Contributor

@chuanqi129 chuanqi129 left a comment


LGTM. Thanks @scotts

@scotts
Contributor Author

scotts commented Apr 14, 2026

@chuanqi129, it looks like we're actually still getting that failure: https://github.com/pytorch/kineto/actions/runs/24377248598/job/71193215400?pr=1302

@meta-codesync

meta-codesync Bot commented Apr 14, 2026

@scotts has imported this pull request. If you are a Meta employee, you can view this in D100734710.

@scotts
Contributor Author

scotts commented Apr 14, 2026

Submitted pytorch/test-infra#7966 to fix.

atalman pushed a commit to pytorch/test-infra that referenced this pull request Apr 14, 2026
The `linux.idc.*` runners (e.g. XPU) don't have the ECR alpine image
cached and lack ECR credentials at the chown-directory cleanup stage,
causing post-test failures even when all tests pass. Use the public
Docker Hub alpine image for these runners, matching the existing pattern
for `linux.arm64` runners.

I hit this when trying to get XPU workflows going on Kineto:
pytorch/kineto#1302
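The runner-based image selection that commit describes could be sketched as follows. The runner labels come from the commit message above, but the variable names and the ECR registry path are placeholders, not the real workflow code:

```shell
#!/usr/bin/env bash
# Pick the alpine image used for the chown cleanup step based on runner type.
RUNNER_TYPE="linux.idc.xpu.1"   # example value; set by the workflow in practice

case "$RUNNER_TYPE" in
  linux.arm64.*|linux.idc.*)
    # These runners lack ECR credentials and the cached ECR image:
    # fall back to the public Docker Hub alpine image.
    ALPINE_IMAGE="alpine"
    ;;
  *)
    ALPINE_IMAGE="<ecr-registry>/alpine"   # placeholder for the ECR-hosted image
    ;;
esac

echo "$ALPINE_IMAGE"
```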
@meta-codesync meta-codesync Bot closed this in 11cc1e0 Apr 15, 2026
@meta-codesync

meta-codesync Bot commented Apr 15, 2026

@scotts merged this pull request in 11cc1e0.

ryanzhang22 added a commit to ryanzhang22/pytorch that referenced this pull request Apr 15, 2026
Includes the following commits:

- Stop allocating CUPTI buffers after exceeding max buffer count (pytorch/kineto#1362) 666f62c
- Add XPU workflow (pytorch/kineto#1302) 11cc1e0
ryanzhang22 added a commit to ryanzhang22/pytorch that referenced this pull request Apr 15, 2026
Includes the following commits:

- Stop allocating CUPTI buffers after exceeding max buffer count (pytorch/kineto#1362) 666f62c
- Add XPU workflow (pytorch/kineto#1302) 11cc1e0
- Remove RocprofActivity.h/RoctracerActivity.h from RocmActivityProfiler.h (pytorch/kineto#1357) 896068d
- Split ActivityProfilerController into Sync and Async Handlers (pytorch/kineto#1269) 6d7f045
- Add priority field to kernel metadata (pytorch/kineto#1361) f2a7423
- Add kineto-release skill (pytorch/kineto#1360) 675b6cd
scotts added a commit to scotts/pytorch that referenced this pull request Apr 16, 2026
Includes the following commits:

- Fix stream wait events referencing future correlation IDs (pytorch/kineto#1339) 23b5bb5
- Remove kineto tb_plugin directory entirely (pytorch/kineto#1368) 9497960
- Move Stream Sync events to a new row in JSON trace export (pytorch/kineto#1356) 041e7ce
- Expose isGpuCollectionStopped() through Kineto's public API (pytorch/kineto#1367) 17708f5
- Fix toggle test (pytorch/kineto#1369) ee2103c
- Link to correct fmt repo (pytorch/kineto#1345) 3447834
- Fix data race on CuptiActivityApi::externalCorrelationEnabled_ (pytorch/kineto#1365) 0e86499
- Stop allocating CUPTI buffers after exceeding max buffer count (pytorch/kineto#1362) 666f62c
- Add XPU workflow (pytorch/kineto#1302) 11cc1e0
- Remove RocprofActivity.h/RoctracerActivity.h from RocmActivityProfiler.h (pytorch/kineto#1357) 896068d
- Split ActivityProfilerController into Sync and Async Handlers (pytorch/kineto#1269) 6d7f045
- Add priority field to kernel metadata (pytorch/kineto#1361) f2a7423
- Add kineto-release skill (pytorch/kineto#1360) 675b6cd

Authored with Claude.
pytorchmergebot pushed a commit to pytorch/pytorch that referenced this pull request Apr 17, 2026
Includes the following commits:

- Fix stream wait events referencing future correlation IDs (pytorch/kineto#1339) 23b5bb5
- Remove kineto tb_plugin directory entirely (pytorch/kineto#1368) 9497960
- Move Stream Sync events to a new row in JSON trace export (pytorch/kineto#1356) 041e7ce
- Expose isGpuCollectionStopped() through Kineto's public API (pytorch/kineto#1367) 17708f5
- Fix toggle test (pytorch/kineto#1369) ee2103c
- Link to correct fmt repo (pytorch/kineto#1345) 3447834
- Fix data race on CuptiActivityApi::externalCorrelationEnabled_ (pytorch/kineto#1365) 0e86499
- Stop allocating CUPTI buffers after exceeding max buffer count (pytorch/kineto#1362) 666f62c
- Add XPU workflow (pytorch/kineto#1302) 11cc1e0
- Remove RocprofActivity.h/RoctracerActivity.h from RocmActivityProfiler.h (pytorch/kineto#1357) 896068d
- Split ActivityProfilerController into Sync and Async Handlers (pytorch/kineto#1269) 6d7f045
- Add priority field to kernel metadata (pytorch/kineto#1361) f2a7423
- Add kineto-release skill (pytorch/kineto#1360) 675b6cd

Authored with Claude.
Pull Request resolved: #180606
Approved by: https://github.com/ryanzhang22, https://github.com/Skylion007