ci: update CI submodule for dynamic device allocation #615

Draft
zhangyue207 wants to merge 1 commit into InfiniTensor:master from zhangyue207:ci/ascend-dynamic-npu-allocation

zhangyue207 (Collaborator) commented May 19, 2026

Summary

  • Update the .ci submodule to include dynamic CI device allocation from InfiniTensor/ci#1.
  • The CI wrapper now skips Ascend and Iluvatar devices with running processes.
  • The platform-level hard-coded /dev/davinci0 mount for Ascend is removed; the wrapper now maps the dynamically selected device instead.
  • The queued GitHub CI watchdog timeout is increased from 10 minutes to 30 minutes.

Motivation

CI jobs using gpu_ids: auto should not blindly enter containers on devices with leftover processes. The CI submodule change makes the wrapper choose a process-free device before container launch and expose process metadata for debugging.
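
The selection rule described above can be sketched as follows. This is a minimal illustration, not the code in InfiniTensor/ci#1: the function name `pick_free_device` and the shape of the process table (device id mapped to a list of PIDs) are assumptions, and parsing of the `npu-smi info` or `ixsmi` output is deliberately left out.

```python
from typing import Mapping, Optional, Sequence


def pick_free_device(
    processes_by_device: Mapping[int, Sequence[int]],
) -> Optional[int]:
    """Return the lowest device id with no running processes, or `None`."""
    for device_id in sorted(processes_by_device):
        if not processes_by_device[device_id]:
            return device_id

    return None


# Hypothetical process table: device id -> PIDs reported by the vendor tool.
table = {0: [4121], 1: [], 2: [977, 978], 3: []}
print(pick_free_device(table))  # prints 1
```

Returning `None` when every device is busy lets the caller decide whether to wait, fail, or report the process metadata. Which of these the real wrapper does is not stated here.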

The queued-job watchdog also needs more time for self-hosted runner scheduling before failing queued jobs.
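
As a rough sketch of the timeout check, assuming the watchdog compares a job's queue time against the current time: the names `QUEUED_TIMEOUT_SECONDS` and `queued_too_long` are hypothetical, and treating exactly 1800 seconds as expired is an assumption about the boundary.

```python
from datetime import datetime, timedelta, timezone

# 30 minutes, as stated in this PR (previously 10 minutes).
QUEUED_TIMEOUT_SECONDS = 1800


def queued_too_long(
    queued_at: datetime,
    now: datetime,
    timeout_seconds: int = QUEUED_TIMEOUT_SECONDS,
) -> bool:
    """Return whether a job has been queued for at least `timeout_seconds`."""
    return (now - queued_at).total_seconds() >= timeout_seconds


start = datetime(2026, 5, 19, 8, 0, tzinfo=timezone.utc)
print(queued_too_long(start, start + timedelta(minutes=10)))  # prints False
print(queued_too_long(start, start + timedelta(minutes=31)))  # prints True
```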

Closes N/A

Type of Change

  • feat — new feature / new operator / new platform
  • fix — bug fix
  • perf — performance improvement (no behavioral change)
  • refactor — code restructuring without behavior change
  • test — adding or fixing tests only
  • docs — documentation only
  • build / ci — build system or CI configuration
  • chore — tooling, formatting, or other non-code changes
  • Breaking change (requires a ! in the Conventional Commits prefix or a BREAKING CHANGE: footer)

Platforms Affected

  • CPU (WITH_CPU)
  • NVIDIA (WITH_NVIDIA)
  • Iluvatar (WITH_ILUVATAR)
  • MetaX (WITH_METAX)
  • Cambricon (WITH_CAMBRICON)
  • Moore (WITH_MOORE)
  • Ascend (WITH_ASCEND)
  • PyTorch C++ bindings (WITH_TORCH)
  • Build system / CMake / CI
  • Python bindings / user-facing API

Test Results on Supported Platforms

| Platform | Built | pytest Result | Notes / Hardware |
| --- | --- | --- | --- |
| NVIDIA | N/A | Not run | Not affected by this device allocation change; workflow timeout behavior is covered by CI tool tests. |
| Iluvatar | N/A | `pytest tests/test_resource.py tests/test_run.py tests/test_workflow.py tests/test_shadow_workflow.py -v`: 98 passed | Parser/allocation covered by mocked ixsmi tests; no local Iluvatar hardware available. |
| MetaX | N/A | Not run | Not affected by this device allocation change; workflow timeout behavior is covered by CI tool tests. |
| Cambricon | N/A | Not run | Not affected by this device allocation change; workflow timeout behavior is covered by CI tool tests. |
| Moore | N/A | Not run | Not affected by this device allocation change; workflow timeout behavior is covered by CI tool tests. |
| Ascend | N/A | `pytest tests/test_resource.py tests/test_run.py tests/test_workflow.py tests/test_shadow_workflow.py -v`: 98 passed | Local Ascend host; dry-run verified explicit `--gpu-id 3` and auto selection. |

Full `pytest` output (optional)

98 passed in 0.38s

Benchmark / Performance Impact

N/A

Notes for Reviewers

  • The code change is in InfiniTensor/ci#1; this PR updates the InfiniOps .ci submodule pointer from c6bf369 to dd9bdb9.
  • Existing Ascend container behavior maps a selected physical /dev/davinci{id} to logical /dev/davinci0 inside the container.
  • Iluvatar keeps the existing /dev passthrough and CUDA_VISIBLE_DEVICES={id} container behavior.
  • With gpu_ids: auto, a device counts as busy if it has any entry in the Ascend npu-smi info or Iluvatar ixsmi process table.
  • The queued-job watchdog now waits 1800 seconds before failing jobs that remain queued.
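
The per-platform container behavior in the notes above could look roughly like this. The helper name `device_args` and the platform strings are hypothetical; the sketch assumes the wrapper builds `docker run` arguments, using Docker's `--device HOST:CONTAINER` and `-e NAME=VALUE` flags. The Iluvatar /dev passthrough is assumed to be configured elsewhere and is not shown.

```python
from typing import List


def device_args(platform: str, device_id: int) -> List[str]:
    """Return the extra `docker run` arguments for the selected device."""
    if platform == "ascend":
        # Map the selected physical device to logical device 0 in the container.
        return ["--device", f"/dev/davinci{device_id}:/dev/davinci0"]

    if platform == "iluvatar":
        # Only restrict visibility; the /dev passthrough is set up elsewhere.
        return ["-e", f"CUDA_VISIBLE_DEVICES={device_id}"]

    return []


print(device_args("ascend", 3))
# prints ['--device', '/dev/davinci3:/dev/davinci0']
```

Because the Ascend container always sees /dev/davinci0 under this mapping, code inside the container does not need to know which physical device was selected.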

Checklist

Title, Branch, and Commits

  • PR title follows Conventional Commits (e.g. feat(nvidia): …, fix(cuda/gemm): …).
  • Branch name follows <type>/xxx-yyyy-zzzz where <type> matches the PR title's Conventional Commits type and words are joined with hyphens (see CONTRIBUTING.md §Branches).
  • Each commit message follows Conventional Commits.
  • Small PR is a single squashable commit; or, for a large PR, every commit is meaningful, well-formed, and independently reviewable (see CONTRIBUTING.md §Pull Requests).
  • No stray merge commits from master — the branch is rebased cleanly on top of the current master.
  • No fixup! / squash! / wip commits remain.

Scope and Design

  • Changes are minimal — nothing unrelated to the stated motivation was added (CONTRIBUTING.md §Code/General).
  • No dead code, commented-out blocks, debug prints, printf/std::cout/print(...) left behind, or TODO without an owner and issue link.
  • No unrelated formatting churn that would obscure the diff.
  • N/A Public API changes: this PR updates only the .ci submodule pointer.

General Code Hygiene (applies to all languages)

  • The code is self-explanatory; comments were added only where the why is non-obvious (CONTRIBUTING.md §Code/General).
  • Every modified or added file ends with a single trailing newline (CONTRIBUTING.md §Code/General).
  • No trailing whitespace, tab/space mixing, or stray BOMs.
  • Identifiers in comments and error messages are wrapped in backticks (e.g. the `seqlens_k` tensor) (CONTRIBUTING.md §Code/General).
  • All comments and error messages are in English (CONTRIBUTING.md §Code/General).
  • Comments and error messages are complete sentences — capitalized first letter, terminal punctuation — unless the language/framework convention says otherwise (CONTRIBUTING.md §Code/General; §Python).

C++ Specific (if C++ files changed)

N/A No C++ files changed.

Python Specific (if Python files changed)

  • Code is PEP 8 compliant; ruff check passes cleanly on CI (see .github/workflows/ruff.yml).
  • ruff format --check passes cleanly — if not, run ruff format and commit the result.
  • Comments are complete English sentences, starting with a capital letter and ending with punctuation; Markdown backticks are used for code references (CONTRIBUTING.md §Python).
  • Framework-specific conventions (e.g. lowercase pytest.skip messages without terminal period) are honored where applicable (CONTRIBUTING.md §Python).
  • No blank line between the function signature and the body when there is no docstring or comment (CONTRIBUTING.md §Python).
  • A blank line is present before and after if, for, and similar control-flow statements (CONTRIBUTING.md §Python).
  • A blank line appears before each return, except when it directly follows a control-flow statement like if or for.
  • Docstrings (if any) follow PEP 257 (CONTRIBUTING.md §Python).
  • Type hints are added / kept consistent with the surrounding code.

Testing

  • pytest was run locally on every supported platform that this PR can affect, and the results are recorded in the "Test Results" table above (CONTRIBUTING.md §Pull Requests).
  • For any platform that could not be tested, an explicit reason is given in the table and a reviewer with access has been tagged.
  • New functionality has matching tests under tests/ following tests/test_add.py / tests/test_gemm.py patterns (CONTRIBUTING.md §Adding an Operator).
  • N/A Operator-specific parametrization rules do not apply to this CI wrapper change.
  • N/A pytest.mark.auto_act_and_assert does not apply to this CI wrapper change.
  • N/A Default operator dtype / device parameterization does not apply to this CI wrapper change.
  • Any new test that is flaky under parallelism is marked so, or documented to require pytest -n 1.
  • For bug fixes: a regression test has been added that fails on master and passes with this PR.

Build, CI, and Tooling

  • N/A pip install .[dev] was not run because this PR only updates the CI submodule pointer; the relevant CI wrapper tests were run in the submodule.
  • N/A compile_commands.json is not affected.
  • N/A No new backend or device CMake auto-detection was added.
  • N/A CUDA-like GPU backend selection is not affected.
  • Both CI workflows (clang-format.yml, ruff.yml) are green locally (or expected to be green on CI).
  • No new runtime dependency was added without updating pyproject.toml's [project.optional-dependencies] (or justified in the PR description).

Documentation

  • N/A No user-facing documentation update is required for this CI allocation fix.
  • N/A No new operators, dispatch helpers, or public utilities were added.
  • N/A No user-visible breaking change.

Security and Safety

  • No secrets, access tokens, internal URLs, customer data, or personal hardware identifiers have been committed.
  • Third-party code is license-compatible and attributed.
  • No unsafe pointer arithmetic, uninitialized reads, or missing bounds checks were introduced.

zhangyue207 force-pushed the ci/ascend-dynamic-npu-allocation branch from 5026670 to 2cc45ac on May 19, 2026 07:32
zhangyue207 force-pushed the ci/ascend-dynamic-npu-allocation branch from 2cc45ac to 2cfb15a on May 19, 2026 08:33
zhangyue207 changed the title from "ci(ascend): update CI submodule for dynamic NPU allocation" to "ci: update CI submodule for dynamic device allocation" on May 19, 2026