
Fix minion state queue=True and process_count_max queues#68694

Merged
dwoz merged 16 commits into saltstack:3006.x from dwoz:queue on Feb 11, 2026

Contributor

@dwoz dwoz commented Feb 7, 2026

Summary

Fixes critical race conditions, resource exhaustion ("Too many open files"), and infinite loops in state.apply queueing. Also resolves CI/CD build failures across platforms.

Key Changes

  • Resource Protection: Added _has_fd_headroom to queue jobs when file descriptors are low, preventing crashes.

  • Process Queue: Fixed process_count_max accounting and added stale lock cleanup on startup.

  • State Queue:

    • Fixed infinite loop in state.apply(queue=True).
    • Refactored queue processing to be fully async/non-blocking.
    • Improved FIFO ordering stability.
  • CI/Tests:

    • Fixed cheroot build failures (pinned >=10.0.1).
    • Fixed pkg_resources and import errors in tests.

Fixes: #68703

@dwoz dwoz requested a review from a team as a code owner February 7, 2026 00:06
@dwoz dwoz changed the base branch from master to 3006.x February 7, 2026 00:09
@dwoz dwoz added the test:full Run the full test suite label Feb 7, 2026
dwoz added 16 commits February 11, 2026 00:42
Use filelock to serialize queue checks and validate PIDs in state_queue
files to prevent race conditions and handle stale locks during
state.apply(queue=True).
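
To illustrate the PID validation mentioned in this commit message, here is a minimal sketch of how a lock or queue file that records its owner's PID might be checked for staleness. The helper names `pid_is_alive` and `lock_is_stale` and the one-PID-per-file layout are assumptions for illustration, not Salt's actual code, and the `os.kill(pid, 0)` probe is POSIX-only.

```python
import os


def pid_is_alive(pid):
    """Return True if a process with this PID appears to exist (POSIX only)."""
    if not isinstance(pid, int) or pid <= 0:
        return False
    try:
        # Signal 0 performs the existence/permission check without
        # delivering a signal.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


def lock_is_stale(lock_path):
    """Treat a lock as stale if it is unreadable, malformed, or its owner is gone.

    Assumes (hypothetically) that the file contains only the owner's PID.
    """
    try:
        with open(lock_path) as fh:
            pid = int(fh.read().strip())
    except (OSError, ValueError):
        return True
    return not pid_is_alive(pid)
```

A caller holding the queue lock could remove a file for which `lock_is_stale` returns `True` before deciding whether a job must wait. PID reuse can still produce false negatives, so this check is a heuristic rather than a guarantee.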
…ate handling.

- Update `salt/minion.py` to use `subprocess_list` for active process counting, replacing the racy filesystem-based check.
- Implement `__ignore_process_count_max` handling to allow queued jobs to bypass limits and deduplication upon re-submission.
- Fix `jid_queue` logic to allow re-processing of bypassed JIDs.
- Update `tests/pytests/unit/test_minion.py` to correctly mock process creation and verification.
- Add `tests/pytests/integration/minion/test_process_queue.py` with comprehensive tests covering basic queueing, `state.apply` interaction, and high-contention handoff scenarios.
- Remove `filelock` library dependency in favor of `salt.utils.files.wait_lock`.
- Update `salt.utils.state.acquire_queue_lock` to use `wait_lock` with a long timeout.
- Implement `await_lock` in `salt.utils.files` for async locking support.
- Refactor `salt.minion.process_state_queue` to use `await_lock` to avoid blocking the main loop.
- Fix circular import in `salt.utils.files` by moving `salt.modules.selinux` import.
- Fix unit tests in `tests/pytests/unit/modules/state/test_state.py` to properly mock `cachedir`, `saltutil.is_running`, and locking mechanisms.
- Update `Minion._handle_payload` to correctly schedule `_handle_decoded_payload` using `spawn_callback`, as `_handle_decoded_payload` is now an async coroutine.
- Update `Minion.process_process_queue` to use `spawn_callback` for the same reason.
- Ensure `salt/minion.py` properly handles async execution of payload processing to prevent silent failures where jobs are received but not executed.
- Verified fix with `tests/pytests/integration/modules/state/test_state_queue.py` passing.
- Fix a regression where `state.apply(queue=True)` caused an infinite loop of re-queueing itself because it detected its own placeholder process as a conflict.
- Update `salt.modules.state.running` and `_prior_running_states` to explicitly ignore the current job's JID when checking for running states.
- Refactor `salt.minion.process_state_queue` and `process_process_queue` to use proper async/await patterns with `spawn_callback` to avoid blocking the main loop.
- Convert `salt.utils.files.await_lock` to a native `asynccontextmanager`.
- Update documentation and tests to reflect the new queue behavior and fix `test_transactional_update.py` expectations.
- Update `salt.minion.process_state_queue` to sort queued files by JID instead of filename/timestamp. This ensures that the processing order matches the dependency check order in `_prior_running_states` (which uses `jid < current_jid`). This prevents a race condition where out-of-order timestamps could cause a job to block itself or a newer job, leading to an infinite re-queueing loop.
- Add regression test `tests/pytests/integration/modules/state/test_state_queue_loop.py` to verify that `state.apply(queue=True)` does not loop.
- (Verified `salt/modules/state.py` syntax is correct).
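
The ordering argument in the commits above (process the queue in JID order, and let a job wait only on strictly older JIDs) can be sketched as follows. The function names and the dictionary layout are hypothetical; the only Salt fact relied on is that JIDs are fixed-width timestamp strings, so plain string comparison orders them chronologically.

```python
def sort_by_jid(entries):
    """Order queued entries by JID rather than by enqueue timestamp."""
    return sorted(entries, key=lambda entry: entry["jid"])


def prior_jids(active_jids, current_jid):
    """Return the JIDs a job must wait for: strictly older than its own.

    The strict comparison also excludes the job's own JID, so a job
    never counts its own placeholder as a conflict.
    """
    return sorted(jid for jid in active_jids if jid < current_jid)


# Two queued jobs whose enqueue timestamps disagree with their JID order.
queued = [
    {"jid": "20260211000002000000", "queued_at": 1.0},
    {"jid": "20260211000001000000", "queued_at": 2.0},
]
ordered = sort_by_jid(queued)
```

Because both the processing order and the blocking check use the same key, the job at the head of the queue never has an older queued job to wait on, which is the property the re-queueing loop fix depends on.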
…minion.py

- Remove top-level `import resource` which caused `ModuleNotFoundError` on Windows.
- Update `_has_fd_headroom` to check `HAS_RESOURCE` flag before accessing `resource` module.
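
As a rough sketch of the guarded check described in this commit, the following shows one way a file-descriptor headroom test could degrade gracefully where the `resource` module is unavailable. The `reserve` parameter, the `/proc/self/fd` counting method, and the function names are assumptions for illustration; the PR does not show how `_has_fd_headroom` is actually implemented.

```python
import os

try:
    import resource  # Not available on Windows.

    HAS_RESOURCE = True
except ImportError:
    HAS_RESOURCE = False


def headroom_ok(soft_limit, in_use, reserve):
    """Pure check: are at least `reserve` descriptors still free?"""
    return soft_limit - in_use >= reserve


def has_fd_headroom(reserve=64):
    """Return False when the process is close to its open-file limit.

    Falls back to True whenever the limit or the current usage cannot
    be determined, so the check never blocks work on platforms that
    lack the needed facilities.
    """
    if not HAS_RESOURCE:
        return True
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return True
    try:
        # Linux-specific: one directory entry per open descriptor.
        in_use = len(os.listdir("/proc/self/fd"))
    except OSError:
        return True
    return headroom_ok(soft, in_use, reserve)
```

A caller would queue an incoming job instead of starting it when `has_fd_headroom()` returns `False`, which matches the "queue jobs when file descriptors are low" behavior in the summary.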
…tmanager'

- Added backport of asynccontextmanager for Python < 3.7.
- Updated await_lock to use the local asynccontextmanager.
- Resolves integration test failures for salt-ssh on older Python versions.
- Fix test_transactional_update.py expectations for queued jobs
- Updated expected output to include '__no_return__: True' for queued jobs in 'single' function tests.
- Mock salt.utils.files.await_lock instead of wait_lock, as the code uses the async context manager.
- Import contextlib to support the async mock.
- Create `var/tb/2/bootstrap.py` dynamically in the test's temporary directory instead of relying on a static file in `tests/integration/files`.
- This avoids keeping a file with invalid Python code ("foo") in the repository, which causes linting errors.
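
For readers unfamiliar with the pattern behind `await_lock`, here is a minimal sketch of an async lock-file context manager that yields to the event loop while waiting instead of blocking it. It uses `asyncio` and the standard-library `contextlib.asynccontextmanager` (Python 3.7+); the name `await_lock_sketch`, its parameters, and the lock-file protocol are assumptions, not the implementation in `salt.utils.files`.

```python
import asyncio
import contextlib
import os
import time


@contextlib.asynccontextmanager
async def await_lock_sketch(lock_path, timeout=5.0, poll=0.05):
    """Acquire an exclusive lock file without blocking the event loop."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_EXCL makes creation atomic: exactly one contender succeeds.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not acquire {lock_path}")
            # Let other coroutines run while we wait for the holder.
            await asyncio.sleep(poll)
    try:
        os.write(fd, str(os.getpid()).encode())
    finally:
        os.close(fd)
    try:
        yield lock_path
    finally:
        with contextlib.suppress(FileNotFoundError):
            os.remove(lock_path)
```

In a test, such a helper can be replaced with a trivial async context manager that yields immediately, which is the kind of mock the commit describes when it swaps `wait_lock` for `await_lock`.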
Account for the fact that we do not send 'queued' notifications back to
the master. This preserves existing behavior.
- Enhance salt/minion.py with process_count_max enforcement and disk-based job queuing.
- Implement FIFO ordering for queued state jobs by checking prior JIDs in salt/utils/state.py.
- Introduce placeholder proc files to eliminate job tracking race conditions (Invisible Gap).
- Add robust cross-process/thread state queue locking utilities.
- Update scenario tests to verify concurrency control and lock recovery logic.
- Reduce job count and sleep duration in test_queue_load.py to prevent timeouts while maintaining queue logic verification.
- Improve event listener robustness in scenario tests using after_time.
- Fix TypeError in test_minion.py by using string JIDs for os.path.join compatibility.
- Set salt_minion fixture scope to function in queue scenarios to support startup recovery tests.
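
The `process_count_max` enforcement with FIFO queueing described in these commits can be modeled in memory as below. This is a toy sketch: the `ProcessGate` class is hypothetical, it keeps the queue in a `deque` rather than on disk, and treating a non-positive limit as "unlimited" is an assumption of the sketch.

```python
from collections import deque


class ProcessGate:
    """Start jobs up to a limit and queue the rest in arrival order."""

    def __init__(self, process_count_max):
        self.process_count_max = process_count_max
        self.active = set()
        self.queued = deque()

    def _has_slot(self):
        # Assumption: a non-positive limit disables enforcement.
        if self.process_count_max <= 0:
            return True
        return len(self.active) < self.process_count_max

    def submit(self, jid):
        """Start the job if a slot is free, otherwise queue it."""
        if self._has_slot():
            self.active.add(jid)
            return "started"
        self.queued.append(jid)
        return "queued"

    def finish(self, jid):
        """Release a slot and start queued jobs in FIFO order."""
        self.active.discard(jid)
        started = []
        while self.queued and self._has_slot():
            nxt = self.queued.popleft()
            self.active.add(nxt)
            started.append(nxt)
        return started
```

Counting from a single in-process record of active jobs, rather than scanning the filesystem, is the same idea as the switch to `subprocess_list` mentioned earlier in the commit list: the count and the decision to start happen in one place, so they cannot disagree.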
