fix(workflow): make FileLock and git snapshot robust to the CWD#2254
Open
zhaow-de wants to merge 3 commits into
Open
fix(workflow): make FileLock and git snapshot robust to the CWD#2254zhaow-de wants to merge 3 commits into
zhaow-de wants to merge 3 commits into
Conversation
`_get_or_create_exp` built the lock path with
`os.path.join(pr.netloc, pr.path.lstrip("/"), "filelock")`. For an absolute
`file://` URI (including the default `QSettings` URI) `urlparse` yields an
empty netloc and an absolute path, so `lstrip("/")` turned it into a
CWD-relative path. The FileLock and its parent dirs were then created under
whatever directory the process happened to run in, leaving stray trees and an
ineffective lock that no longer serialized concurrent experiment creation.
Use `url2pathname(pr.path)` so the lock resolves to the location the URI
actually names (also correct for Windows drive-letter URIs), and ensure the
directory exists before locking.
Refs microsoft#2252
`_log_uncommitted_code` ran `git diff/status/diff --cached` via `subprocess.check_output(cmd, shell=True)` without capturing stderr. When the CWD was not a git repository (containers, CI sandboxes, a tempdir) git wrote a multi-line "usage: git diff --no-index ..." banner straight to the parent's stderr, bypassing qlib's logger, and each failure emitted a noisy INFO record. Probe `git rev-parse --is-inside-work-tree` first and skip silently (DEBUG) when there is no work tree or git is unavailable. Drop `shell=True` in favor of argument lists, capture/suppress stderr, and demote the per-command failure log to DEBUG. This hook is opportunistic reproducibility metadata, not a precondition for running an experiment. Refs microsoft#2252
Author
|
@microsoft-github-policy-service agree |
Regression tests for the two fixes in this branch: - MLflowExpManager places its FileLock at the absolute file:// path the URI names and leaves an unrelated working directory clean. - MLflowRecorder._log_uncommitted_code stays silent outside a git work tree (no git banner leaked to stderr, no artifacts logged) and still logs code_diff/status/cached inside one. Both CWD-failure tests fail on the pre-fix code and pass with the fixes. Refs microsoft#2252
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
Two related defects in
qlib.workflow, both rooted in assumptions about the caller's current working directory.MLflowExpManager._get_or_create_exp(qlib/workflow/expm.py) built itsFileLockpath withos.path.join(pr.netloc, pr.path.lstrip("/"), "filelock"). For an absolutefile://URI (including the defaultQSettingsURI)urlparseyields an emptynetlocand an absolutepath, solstrip("/")produced a CWD-relative path. The lock and its parent dirs were then created under wherever the process happened to be running, leaving stray directory trees and an ineffective lock that no longer serialized concurrent experiment creation. Fixed by resolving the path withurl2pathname(pr.path)(also correct for Windows drive-letter URIs) and ensuring the directory exists before locking.MLflowRecorder._log_uncommitted_code(qlib/workflow/recorder.py) rangit diff/git status/git diff --cachedviasubprocess.check_output(cmd, shell=True)without capturing stderr. Outside a git work tree (containers, CI sandboxes, a tempdir) git's multi-lineusage: git diff --no-index ...banner leaked straight to the parent's stderr — bypassing qlib's logger — and each failure also emitted a noisyINFOrecord. Fixed by probinggit rev-parse --is-inside-work-treefirst and skipping silently atDEBUG; also droppedshell=Truein favor of argument lists, captured/suppressed stderr, and demoted the per-command failure log toDEBUG. This hook is opportunistic reproducibility metadata, not a precondition for running an experiment.tests/test_workflow_cwd.pywith regression tests for both behaviors (they fail on the pre-fix code and pass with the fixes).Motivation and Context
Closes #2252.
Both defects surface whenever qlib is driven from a working directory other than the one implied by the experiment URI — e.g. a wrapper CLI that runs qlib inside a
tempfile.TemporaryDirectory(), a container, or a CI sandbox. The FileLock bug silently scattersmlruns/lock directories across the filesystem and weakens the lock that guards concurrent experiment creation; the git-snapshot bug corrupts any structured-stderr contract (JSON logs, CI log parsers) with git's usage banner that the user cannot suppress through qlib's log-level controls.How Has This Been Tested?
pytest qlib/tests/test_all_pipeline.pyunder upper directory ofqlib.Verified locally on macOS / Python 3.12 against an editable install (Cython extensions built;
numpy<2.0.0to match the CIrlpin):make black pylint flake8 mypy— all green (black: 337 files unchanged;flake8: clean;pylint: rated 10.00/10 for bothqlibandscripts;mypy:Success: no issues found in 55 source files).cd tests && python -m pytest test_workflow.py test_workflow_cwd.py— 4 passed.Screenshots of Test Results (if appropriate):
(The two CWD-failure tests in
test_workflow_cwd.pywere also confirmed to fail on the pre-fix commit, so they genuinely guard the regressions.)Types of changes