ci: allocate idle CI devices dynamically #1

Draft
zhangyue207 wants to merge 1 commit into codex/prune-unused-ci-artifacts from ci/ascend-dynamic-npu-allocation

zhangyue207 (Collaborator) commented May 19, 2026

Summary

  • Parse Ascend npu-smi info process table entries into process_count and process_pids.
  • Parse Iluvatar ixsmi process table entries into process_count and process_pids.
  • Exclude Ascend and Iluvatar devices that have running processes from automatic allocation, and fail allocation when no process-free device is available.
  • Remove the platform-level hard-coded /dev/davinci0 mount so selected Ascend devices are mounted dynamically by run.py.
  • Increase queued GitHub CI watchdog timeout from 10 minutes to 30 minutes in both CI workflows.
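
The allocation rule described in the summary can be sketched as follows. This is a minimal illustration, not the code in ci_resource.py: the DeviceStatus record, the NoIdleDeviceError exception, and the pick_idle_device function are hypothetical names, and choosing the lowest-numbered idle device is an assumed tie-break policy. Only the process_count and process_pids fields come from the PR description.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DeviceStatus:
    """Hypothetical per-device record; the two process fields follow the PR summary."""

    device_id: int
    process_count: int = 0
    process_pids: list[int] = field(default_factory=list)


class NoIdleDeviceError(RuntimeError):
    """Raised when every candidate device has at least one running process."""


def pick_idle_device(devices: list[DeviceStatus]) -> int:
    """Return the ID of a process-free device, or fail if none exists.

    Assumption: the lowest-numbered idle device wins. The real wrapper
    may apply a different ordering.
    """
    idle = [d for d in devices if d.process_count == 0]
    if not idle:
        # Surface the PIDs so the failure message is useful for debugging.
        busy = {d.device_id: d.process_pids for d in devices}
        raise NoIdleDeviceError(
            f"no process-free device available; busy devices: {busy}"
        )
    return min(d.device_id for d in idle)
```

With devices 0 and 2 busy and device 1 idle, pick_idle_device returns 1. If every device reports a nonzero process_count, it raises instead of falling back to a busy device, which matches the "fail allocation" behavior in the summary.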

Motivation

CI jobs using automatic allocation could still enter containers on devices with leftover processes. The wrapper should choose a process-free device before container launch and expose process metadata in status output for debugging.

The GitHub queued-job watchdog also needs more time for self-hosted runner scheduling, so this PR extends that watchdog from 600 seconds to 1800 seconds.

Scope

This PR targets codex/prune-unused-ci-artifacts, which contains the .ci submodule commit currently pinned by InfiniOps upstream. This keeps the review diff limited to dynamic device allocation and CI watchdog timeout fixes.

Test Plan

| Check | Result |
| --- | --- |
| pytest tests/test_resource.py tests/test_run.py tests/test_workflow.py tests/test_shadow_workflow.py -v | 98 passed |
| RUFF_CACHE_DIR=/tmp/infini-ci-ruff-cache ruff check ci_resource.py tests/test_resource.py tests/test_run.py tests/test_workflow.py tests/test_shadow_workflow.py | passed |
| RUFF_CACHE_DIR=/tmp/infini-ci-ruff-cache ruff format --check ci_resource.py tests/test_resource.py tests/test_run.py tests/test_workflow.py tests/test_shadow_workflow.py | passed |
| git diff --check | passed |
| python run.py --config config.yml --job ascend_npu --dry-run --gpu-id 3 | emitted --device=/dev/davinci3:/dev/davinci0; no platform-level /dev/davinci0 |
| python run.py --config config.yml --job ascend_npu --dry-run | auto selected NPU 1 on the local Ascend host |
| python run.py --config config.yml --job iluvatar_gpu --dry-run --gpu-id 2 | emitted CUDA_VISIBLE_DEVICES=2 |

Notes

  • The existing Ascend container mapping is unchanged: physical device IDs are used outside the container, and the selected device is mapped to logical davinci0 inside the container.
  • Iluvatar keeps the existing /dev passthrough and CUDA_VISIBLE_DEVICES={id} container behavior; this PR changes only which device is selected automatically.
  • Local Iluvatar hardware is not available in this environment, so automatic ixsmi selection is covered by parser and allocation unit tests.
  • InfiniOps will need a follow-up submodule pointer update after this CI change is accepted.
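
The two container mappings described in the notes can be illustrated with a small sketch. The function names below are hypothetical and are not taken from run.py. The sketch only reproduces the --device and CUDA_VISIBLE_DEVICES values shown in the test plan, and omits any other mounts or environment variables the real wrapper adds.

```python
from __future__ import annotations


def ascend_device_args(physical_id: int) -> list[str]:
    """Map the selected physical NPU to logical davinci0 inside the container."""
    return [f"--device=/dev/davinci{physical_id}:/dev/davinci0"]


def iluvatar_env(physical_id: int) -> dict[str, str]:
    """Expose the selected Iluvatar device through CUDA_VISIBLE_DEVICES.

    The /dev passthrough mentioned in the notes is assumed to be handled
    elsewhere and is not modeled here.
    """
    return {"CUDA_VISIBLE_DEVICES": str(physical_id)}
```

For example, ascend_device_args(3) yields the same --device=/dev/davinci3:/dev/davinci0 argument as the dry run with --gpu-id 3, and iluvatar_env(2) yields CUDA_VISIBLE_DEVICES=2.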

zhangyue207 force-pushed the ci/ascend-dynamic-npu-allocation branch from dc7bb04 to fff2f5e on May 19, 2026 07:31
zhangyue207 force-pushed the ci/ascend-dynamic-npu-allocation branch from fff2f5e to dd9bdb9 on May 19, 2026 08:32
zhangyue207 changed the title from "ci(ascend): allocate idle NPU dynamically" to "ci: allocate idle CI devices dynamically" on May 19, 2026