Warning: Unknown label
Please add the new label to .github/pytorch-probot.yml
```yaml
jobs:
  pr-test:
    uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
```
Because all XPU runners under the pytorch org are currently self-hosted runners maintained directly by Intel, they need ECR permissions configured. I have submitted two PRs, pytorch/test-infra#7853 and pytorch/pytorch#177831, to enable this.
@chuanqi129, we're still getting the failures. Are there any other steps we need to take?
Hi @scotts, I have double-checked the failure log, and it's very strange: according to the log, it seems PRs pytorch/test-infra#7853 and pytorch/test-infra#7860 are not working as expected. Could you please try rebasing your PR instead of rerunning the failed job?
@chuanqi129, done, rebased and pushed. We're getting different failures now, during "Setup XPU": https://github.com/pytorch/kineto/actions/runs/23438522997/job/68182791603?pr=1302. |
Hi @scotts, I have submitted a new PR to address this cross-repo issue; please help review it: pytorch/pytorch#178143
@chuanqi129, we're getting an error when trying to pull the docker image: https://github.com/pytorch/kineto/actions/runs/23438522997/job/68423941333?pr=1302 |
Hi @scotts, we don't enable
Hi @scotts, I have checked the latest XPU workflow failure and created another PR to address it, pytorch/pytorch#178380; please help review it again. As for the kineto/pytorch build for XPU, I think we need some extra steps; I will get back to you later.
Co-authored-by: Wang, Chuanqi <chuanqi.wang@intel.com>
@chuanqi129, progress! But now we're getting certificate issues on the host when trying to use
Thanks @scotts, it is likely caused by the Anaconda default channel, which can't be used on Intel-owned machines; we can use the conda-forge channel instead. Let me check how to fix it.
@chuanqi129, I was curious whether using conda-forge would make a difference before you made any changes, and it still fails in the same way. Let me know if there's anything I can do to help!
Hi @scotts, sorry for the late reply. I have tried it locally; the cause is that the default channel comes from Anaconda too. I can resolve the issue with the workaround (WA) below. Could you please try it again? If it resolves the issue, maybe we can consider adding this WA directly into https://github.com/pytorch/pytorch/blob/main/.ci/docker/common/install_conda.sh
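The WA itself isn't quoted in the thread; as a hedged sketch of what switching the image's conda from the Anaconda default channel to conda-forge might look like (exact flags assume a standard conda install, and the guard makes it a no-op where conda is absent):

```shell
# Hypothetical workaround sketch: drop the unreachable Anaconda "defaults"
# channel and use conda-forge instead. Adjust for the actual CI image.
if command -v conda >/dev/null 2>&1; then
  conda config --remove channels defaults 2>/dev/null || true
  conda config --add channels conda-forge
  conda config --set channel_priority strict
fi
channel_setup="done"   # marker so a caller can confirm the step ran
```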
Thanks @scotts, we have a new failure now: https://github.com/pytorch/kineto/actions/runs/23869014674/job/69595758258?pr=1302#step:16:1692. This failure is caused by the XPU env source scripts having unbound variables, so we need to use
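The truncated sentence above presumably refers to relaxing bash's nounset option around the `source`; a minimal sketch of that pattern (the setvars.sh path is an assumption, and the unset-variable line stands in for the real script's unbound references):

```shell
# Under `set -u` the shell aborts when an unset variable is expanded, and the
# oneAPI env scripts reference variables they never set. Relax nounset just
# around the `source`, then restore it.
set -eu
set +u
# source /opt/intel/oneapi/setvars.sh   # assumed path to the real XPU env script
: "$A_VAR_SETVARS_LEAVES_UNSET"         # simulates the unbound reference; safe with +u
set -u
xpu_env="sourced"
```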
@chuanqi129, the build succeeded and we're running tests! 🎉 It looks like we have two potential issues remaining:
Hi @scotts, thanks for the update! For issue 1, the warning is caused by the default torch XPU AOT target list including older Intel GPU platforms that don't support FP64 by default; it's not a real issue. Let's limit the default AOT build here https://github.com/pytorch/kineto/blob/main/.github/scripts/pytorch_build_test.sh#L36 by adding the lines below. For issue 2, let me check it.
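The exact lines to add aren't quoted in the thread; one plausible sketch, assuming PyTorch's `TORCH_XPU_ARCH_LIST` build variable controls the AOT device targets (the device name is illustrative, not necessarily what was suggested):

```shell
# Hypothetical sketch: restrict the XPU ahead-of-time (AOT) target list so the
# build skips older iGPU platforms without native FP64, silencing the warning.
# "pvc" (Ponte Vecchio, i.e. Intel Data Center GPU Max) is an illustrative pick.
export TORCH_XPU_ARCH_LIST="pvc"
```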
@chuanqi129 success! I'm going to clean this up a little and then we should be good to go. (I suspect the infra error was caused by the logs being too big because of all of the warnings.) |
That's great! Thanks @scotts |
@chuanqi129, it looks like we're actually still getting that failure: https://github.com/pytorch/kineto/actions/runs/24377248598/job/71193215400?pr=1302 |
@scotts has imported this pull request. If you are a Meta employee, you can view this in D100734710. |
Submitted pytorch/test-infra#7966 to fix. |
The `linux.idc.*` runners (e.g. XPU) don't have the ECR alpine image cached and lack ECR credentials at the chown-directory cleanup stage, causing post-test failures even when all tests pass. Use the public Docker Hub alpine image for these runners, matching the existing pattern for `linux.arm64` runners. I hit this when trying to get XPU workflows going on Kineto: pytorch/kineto#1302
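The fix pattern described can be sketched as a runner-label dispatch (variable names and the registry placeholder are illustrative, not the actual test-infra workflow code):

```shell
# Pick the alpine image used for the chown cleanup step based on runner family:
# self-hosted families without ECR credentials (arm64, Intel idc) get the
# public Docker Hub image; everything else keeps the ECR-cached copy.
runner_type="linux.idc.xpu"                  # e.g. from the job's runner label
case "$runner_type" in
  linux.arm64*|linux.idc.*)
    alpine_image="alpine:3"                  # public Docker Hub
    ;;
  *)
    alpine_image="<ecr-registry>/alpine"     # placeholder for the real ECR host
    ;;
esac
```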
Includes the following commits:
- Stop allocating CUPTI buffers after exceeding max buffer count (pytorch/kineto#1362) 666f62c
- Add XPU workflow (pytorch/kineto#1302) 11cc1e0
Includes the following commits:
- Stop allocating CUPTI buffers after exceeding max buffer count (pytorch/kineto#1362) 666f62c
- Add XPU workflow (pytorch/kineto#1302) 11cc1e0
- Remove RocprofActivity.h/RoctracerActivity.h from RocmActivityProfiler.h (pytorch/kineto#1357) 896068d
- Split ActivityProfilerController into Sync and Async Handlers (pytorch/kineto#1269) 6d7f045
- Add priority field to kernel metadata (pytorch/kineto#1361) f2a7423
- Add kineto-release skill (pytorch/kineto#1360) 675b6cd
Includes the following commits:
- Fix stream wait events referencing future correlation IDs (pytorch/kineto#1339) 23b5bb5
- Remove kineto tb_plugin directory entirely (pytorch/kineto#1368) 9497960
- Move Stream Sync events to a new row in JSON trace export (pytorch/kineto#1356) 041e7ce
- Expose isGpuCollectionStopped() through Kineto's public API (pytorch/kineto#1367) 17708f5
- Fix toggle test (pytorch/kineto#1369) ee2103c
- Link to correct fmt repo (pytorch/kineto#1345) 3447834
- Fix data race on CuptiActivityApi::externalCorrelationEnabled_ (pytorch/kineto#1365) 0e86499
- Stop allocating CUPTI buffers after exceeding max buffer count (pytorch/kineto#1362) 666f62c
- Add XPU workflow (pytorch/kineto#1302) 11cc1e0
- Remove RocprofActivity.h/RoctracerActivity.h from RocmActivityProfiler.h (pytorch/kineto#1357) 896068d
- Split ActivityProfilerController into Sync and Async Handlers (pytorch/kineto#1269) 6d7f045
- Add priority field to kernel metadata (pytorch/kineto#1361) f2a7423
- Add kineto-release skill (pytorch/kineto#1360) 675b6cd

Authored with Claude.
Includes the following commits:
- Fix stream wait events referencing future correlation IDs (pytorch/kineto#1339) 23b5bb5
- Remove kineto tb_plugin directory entirely (pytorch/kineto#1368) 9497960
- Move Stream Sync events to a new row in JSON trace export (pytorch/kineto#1356) 041e7ce
- Expose isGpuCollectionStopped() through Kineto's public API (pytorch/kineto#1367) 17708f5
- Fix toggle test (pytorch/kineto#1369) ee2103c
- Link to correct fmt repo (pytorch/kineto#1345) 3447834
- Fix data race on CuptiActivityApi::externalCorrelationEnabled_ (pytorch/kineto#1365) 0e86499
- Stop allocating CUPTI buffers after exceeding max buffer count (pytorch/kineto#1362) 666f62c
- Add XPU workflow (pytorch/kineto#1302) 11cc1e0
- Remove RocprofActivity.h/RoctracerActivity.h from RocmActivityProfiler.h (pytorch/kineto#1357) 896068d
- Split ActivityProfilerController into Sync and Async Handlers (pytorch/kineto#1269) 6d7f045
- Add priority field to kernel metadata (pytorch/kineto#1361) f2a7423
- Add kineto-release skill (pytorch/kineto#1360) 675b6cd

Authored with Claude.

Pull Request resolved: #180606
Approved by: https://github.com/ryanzhang22, https://github.com/Skylion007
Adds an XPU workflow to CI. Mirrors the existing CUDA and ROCm workflows.