Fix minion state queue=True and process_count_max queues #68694
Merged
dwoz merged 16 commits into saltstack:3006.x on Feb 11, 2026
Use filelock to serialize queue checks and validate PIDs in state_queue files to prevent race conditions and handle stale locks during state.apply(queue=True).
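The PID validation mentioned above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the helper names `pid_is_alive` and `lock_is_stale`, the assumption that the lock file holds a bare PID as text, and the POSIX-only `os.kill(pid, 0)` probe are all assumptions made for the example.

```python
import os


def pid_is_alive(pid):
    """Return True if a process with this PID appears to exist (POSIX only)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)  # signal 0 only performs the existence/permission check
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True  # process exists but belongs to another user
    except OverflowError:
        return False  # value too large to be a real PID
    return True


def lock_is_stale(lock_path):
    """Treat a lock as stale if it is missing, malformed, or names a dead PID.

    Assumption: the lock file contains nothing but the owner's PID as text.
    """
    try:
        with open(lock_path, encoding="utf-8") as fh:
            pid = int(fh.read().strip())
    except (OSError, ValueError):
        return True
    return not pid_is_alive(pid)
```

A caller would typically remove a stale lock and retry acquisition, while leaving locks owned by live processes alone.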
…ate handling.
- Update `salt/minion.py` to use `subprocess_list` for active process counting, replacing the racy filesystem-based check.
- Implement `__ignore_process_count_max` handling to allow queued jobs to bypass limits and deduplication upon re-submission.
- Fix `jid_queue` logic to allow re-processing of bypassed JIDs.
- Update `tests/pytests/unit/test_minion.py` to correctly mock process creation and verification.
- Add `tests/pytests/integration/minion/test_process_queue.py` with comprehensive tests covering basic queueing, `state.apply` interaction, and high-contingency handoff scenarios.
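The admission logic described in this commit can be pictured with the sketch below. It is only an illustration under stated assumptions: the function name `admit`, its return values, and the use of a plain set for seen JIDs are hypothetical, and treating a non-positive `process_count_max` as "no limit" is an assumption of this example. Only the `__ignore_process_count_max` key and the idea of bypassing both the limit and deduplication come from the commit message.

```python
def admit(data, active_count, process_count_max, seen_jids):
    """Decide what to do with an incoming job payload (hypothetical helper).

    Returns "duplicate", "queue", or "run".
    """
    # Jobs re-submitted from the queue carry this flag and skip both checks.
    bypass = bool(data.get("__ignore_process_count_max"))

    if data["jid"] in seen_jids and not bypass:
        return "duplicate"
    seen_jids.add(data["jid"])

    # Assumption: a non-positive limit means "unlimited".
    if not bypass and 0 < process_count_max <= active_count:
        return "queue"
    return "run"
```

For example, a job that was queued because the limit was reached is recognised as already seen when it comes back, but the bypass flag lets it run instead of being dropped as a duplicate.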
- Remove `filelock` library dependency in favor of `salt.utils.files.wait_lock`.
- Update `salt.utils.state.acquire_queue_lock` to use `wait_lock` with a long timeout.
- Implement `await_lock` in `salt.utils.files` for async locking support.
- Refactor `salt.minion.process_state_queue` to use `await_lock` to avoid blocking the main loop.
- Fix circular import in `salt.utils.files` by moving `salt.modules.selinux` import.
- Fix unit tests in `tests/pytests/unit/modules/state/test_state.py` to properly mock `cachedir`, `saltutil.is_running`, and locking mechanisms.
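To illustrate why an awaitable lock avoids blocking the main loop, here is a minimal sketch of a polling file lock written as an async context manager. It is not Salt's `await_lock`: the name `await_lock_sketch`, its parameters, the exclusive-create lock file, and the use of `asyncio` (rather than the minion's own event loop) are all assumptions made for the example.

```python
import asyncio
import contextlib
import os
import time


@contextlib.asynccontextmanager
async def await_lock_sketch(lock_path, timeout=5.0, poll=0.05):
    """Acquire an exclusive lock file without blocking the event loop."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # O_EXCL makes creation fail if another holder already exists.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() >= deadline:
                raise TimeoutError(f"could not lock {lock_path}")
            # Yield to other coroutines instead of sleeping synchronously.
            await asyncio.sleep(poll)
    try:
        with os.fdopen(fd, "w") as fh:
            fh.write(str(os.getpid()))  # record the owner for stale-lock checks
        yield
    finally:
        with contextlib.suppress(FileNotFoundError):
            os.unlink(lock_path)
```

While one coroutine waits in `asyncio.sleep`, the loop stays free to handle other work, which is the property the refactor above is after.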
- Update `Minion._handle_payload` to correctly schedule `_handle_decoded_payload` using `spawn_callback`, as `_handle_decoded_payload` is now an async coroutine.
- Update `Minion.process_process_queue` to use `spawn_callback` for the same reason.
- Ensure `salt/minion.py` properly handles async execution of payload processing to prevent silent failures where jobs are received but not executed.
- Verified fix with `tests/pytests/integration/modules/state/test_state_queue.py` passing.
- Fix a regression where `state.apply(queue=True)` caused an infinite loop of re-queueing itself because it detected its own placeholder process as a conflict.
- Update `salt.modules.state.running` and `_prior_running_states` to explicitly ignore the current job's JID when checking for running states.
- Refactor `salt.minion.process_state_queue` and `process_process_queue` to use proper async/await patterns with `spawn_callback` to avoid blocking the main loop.
- Convert `salt.utils.files.await_lock` to a native `asynccontextmanager`.
- Update documentation and tests to reflect the new queue behavior and fix `test_transactional_update.py` expectations.
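The self-conflict regression can be illustrated with a small sketch. The function name `conflicting_states`, the shape of the job dictionaries, and the `STATE_FUNS` set are hypothetical; the only point taken from the commit is that the current job's own JID must be excluded from the conflict check.

```python
# Assumed, non-exhaustive set of state functions for the example.
STATE_FUNS = {"state.apply", "state.highstate", "state.sls"}


def conflicting_states(active_jobs, current_jid):
    """Return running state jobs that conflict with the current one.

    The current job's own entry (for instance its placeholder proc file)
    is skipped, otherwise the job would see itself as a conflict and
    re-queue forever.
    """
    return [
        job
        for job in active_jobs
        if job["fun"] in STATE_FUNS and job["jid"] != current_jid
    ]
```

With only its own placeholder present, a job sees an empty conflict list and proceeds instead of queueing itself again.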
- Update `salt.minion.process_state_queue` to sort queued files by JID instead of filename/timestamp. This ensures that the processing order matches the dependency check order in `_prior_running_states` (which uses `jid < current_jid`). This prevents a race condition where out-of-order timestamps could cause a job to block itself or a newer job, leading to an infinite re-queueing loop.
- Add regression test `tests/pytests/integration/modules/state/test_state_queue_loop.py` to verify that `state.apply(queue=True)` does not loop.
- (Verified `salt/modules/state.py` syntax is correct).
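The ordering mismatch described above can be shown with a toy example. The entry layout (`jid` and `queued_at` keys) and the helper names are assumptions for illustration; the `jid < current_jid` comparison is the one named in the commit message, and JIDs are treated here as numeric digit strings.

```python
def order_by_jid(entries):
    """Sort queued entries by numeric JID, the order the prior-job check uses."""
    return sorted(entries, key=lambda e: int(e["jid"]))


def prior_jobs(entries, current_jid):
    """Return JIDs of queued jobs that the check treats as 'ahead' of this one."""
    return [e["jid"] for e in entries if int(e["jid"]) < int(current_jid)]


# Two jobs whose enqueue timestamps disagree with their JID order.
queue = [
    {"jid": "20260211100000000002", "queued_at": 1.0},
    {"jid": "20260211100000000001", "queued_at": 2.0},
]

by_time = sorted(queue, key=lambda e: e["queued_at"])
by_jid = order_by_jid(queue)

# Timestamp order picks ...02 first, but ...02 still sees ...01 as prior,
# so it would be re-queued while ...01 waits behind it.
blocked_head = prior_jobs(queue, by_time[0]["jid"])

# JID order picks ...01 first, which has nothing ahead of it.
free_head = prior_jobs(queue, by_jid[0]["jid"])
```

Sorting by the same key the dependency check compares guarantees the head of the queue is always runnable.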
…minion.py
- Remove top-level `import resource` which caused `ModuleNotFoundError` on Windows.
- Update `_has_fd_headroom` to check `HAS_RESOURCE` flag before accessing `resource` module.
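The guarded-import pattern this commit describes might look like the sketch below. The body of `has_fd_headroom`, its `reserve` parameter, and counting open descriptors through `/proc/self/fd` (Linux only) are assumptions for the example, not the PR's implementation; only the `HAS_RESOURCE` flag and the idea of checking it before touching `resource` come from the commit message.

```python
import os

try:
    import resource  # not available on Windows

    HAS_RESOURCE = True
except ImportError:
    resource = None
    HAS_RESOURCE = False


def has_fd_headroom(reserve=32):
    """Return True if at least `reserve` file descriptors appear to be free.

    Falls back to True whenever the limit or the usage cannot be determined.
    """
    if not HAS_RESOURCE:
        return True
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return True
    try:
        # Assumption: Linux-style /proc; other platforms take the fallback.
        in_use = len(os.listdir("/proc/self/fd"))
    except OSError:
        return True
    return soft - in_use >= reserve
```

A caller could queue an incoming job instead of spawning a process when this returns False, which matches the intent of avoiding "Too many open files" errors.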
…tmanager'
- Added backport of asynccontextmanager for Python < 3.7.
- Updated await_lock to use the local asynccontextmanager.
- Resolves integration test failures for salt-ssh on older Python versions.
- Fix test_transactional_update.py expectations for queued jobs
- Updated expected output to include '__no_return__: True' for queued jobs in 'single' function tests.
- Mock salt.utils.files.await_lock instead of wait_lock, as the code uses the async context manager.
- Import contextlib to support the async mock.
- Create `var/tb/2/bootstrap.py` dynamically in the test's temporary directory instead of relying on a static file in `tests/integration/files`.
- This avoids keeping a file with invalid Python code ("foo") in the repository, which causes linting errors.
Account for the fact that we do not send 'queued' notifications back to the master. This preserves existing behavior.
- Enhance salt/minion.py with process_count_max enforcement and disk-based job queuing.
- Implement FIFO ordering for queued state jobs by checking prior JIDs in salt/utils/state.py.
- Introduce placeholder proc files to eliminate job tracking race conditions (Invisible Gap).
- Add robust cross-process/thread state queue locking utilities.
- Update scenario tests to verify concurrency control and lock recovery logic.
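The placeholder idea can be sketched as follows: record the job on disk before its worker starts, so a concurrent check never sees a moment where the job is accepted but untracked. The helper names, the JSON payload, and the `.tmp` suffix are assumptions for illustration and do not reflect the on-disk format Salt actually uses.

```python
import json
import os


def write_placeholder(proc_dir, jid, fun):
    """Publish a placeholder entry for a job before its worker is spawned."""
    os.makedirs(proc_dir, exist_ok=True)
    path = os.path.join(proc_dir, str(jid))
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump(
            {"jid": str(jid), "fun": fun, "pid": os.getpid(), "placeholder": True},
            fh,
        )
    # Atomic rename: readers see either no entry or a complete one.
    os.replace(tmp, path)
    return path


def visible_jids(proc_dir):
    """List JIDs that a concurrent job-tracking check would observe."""
    try:
        names = os.listdir(proc_dir)
    except FileNotFoundError:
        return []
    return sorted(n for n in names if not n.endswith(".tmp"))
```

The real worker would later overwrite or remove the placeholder; the sketch only shows that the job becomes visible before any process is started.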
- Reduce job count and sleep duration in test_queue_load.py to prevent timeouts while maintaining queue logic verification.
- Improve event listener robustness in scenario tests using after_time.
- Fix TypeError in test_minion.py by using string JIDs for os.path.join compatibility.
- Set salt_minion fixture scope to function in queue scenarios to support startup recovery tests.
Summary
Fixes critical race conditions, resource exhaustion ("Too many open files"), and infinite loops in `state.apply` queueing. Also resolves CI/CD build failures across platforms.
Key Changes
- Resource Protection: Added `_has_fd_headroom` to queue jobs when file descriptors are low, preventing crashes.
- Process Queue: Fixed `process_count_max` accounting and added stale lock cleanup on startup.
- State Queue: `state.apply(queue=True)`.
- CI/Tests: `cheroot` build failures (pinned `>=10.0.1`). `pkg_resources` and import errors in test
Fixes: #68703