Skip to content

fix(workflow): make FileLock and git snapshot robust to the CWD#2254

Open
zhaow-de wants to merge 3 commits into
microsoft:mainfrom
zhaow-de:fix/mlflow-cwd-handling
Open

fix(workflow): make FileLock and git snapshot robust to the CWD#2254
zhaow-de wants to merge 3 commits into
microsoft:mainfrom
zhaow-de:fix/mlflow-cwd-handling

Conversation

@zhaow-de

@zhaow-de zhaow-de commented Jun 9, 2026

Copy link
Copy Markdown

Description

Two related defects in qlib.workflow, both rooted in assumptions about the caller's current working directory.

  • MLflowExpManager._get_or_create_exp (qlib/workflow/expm.py) built its FileLock path with os.path.join(pr.netloc, pr.path.lstrip("/"), "filelock"). For an absolute file:// URI (including the default QSettings URI) urlparse yields an empty netloc and an absolute path, so lstrip("/") produced a CWD-relative path. The lock and its parent dirs were then created under wherever the process happened to be running, leaving stray directory trees and an ineffective lock that no longer serialized concurrent experiment creation. Fixed by resolving the path with url2pathname(pr.path) (also correct for Windows drive-letter URIs) and ensuring the directory exists before locking.
  • MLflowRecorder._log_uncommitted_code (qlib/workflow/recorder.py) ran git diff / git status / git diff --cached via subprocess.check_output(cmd, shell=True) without capturing stderr. Outside a git work tree (containers, CI sandboxes, a tempdir) git's multi-line usage: git diff --no-index ... banner leaked straight to the parent's stderr — bypassing qlib's logger — and each failure also emitted a noisy INFO record. Fixed by probing git rev-parse --is-inside-work-tree first and skipping silently at DEBUG; also dropped shell=True in favor of argument lists, captured/suppressed stderr, and demoted the per-command failure log to DEBUG. This hook is opportunistic reproducibility metadata, not a precondition for running an experiment.
  • Added tests/test_workflow_cwd.py with regression tests for both behaviors (they fail on the pre-fix code and pass with the fixes).

Motivation and Context

Closes #2252.

Both defects surface whenever qlib is driven from a working directory other than the one implied by the experiment URI — e.g. a wrapper CLI that runs qlib inside a tempfile.TemporaryDirectory(), a container, or a CI sandbox. The FileLock bug silently scatters mlruns/lock directories across the filesystem and weakens the lock that guards concurrent experiment creation; the git-snapshot bug corrupts any structured-stderr contract (JSON logs, CI log parsers) with git's usage banner that the user cannot suppress through qlib's log-level controls.

How Has This Been Tested?

  • Pass the test by running: pytest qlib/tests/test_all_pipeline.py under upper directory of qlib.
  • If you are adding a new feature, test on your own test scripts.

Verified locally on macOS / Python 3.12 against an editable install (Cython extensions built; numpy<2.0.0 to match the CI rl pin):

  • make black pylint flake8 mypy — all green (black: 337 files unchanged; flake8: clean; pylint: rated 10.00/10 for both qlib and scripts; mypy: Success: no issues found in 55 source files).
  • cd tests && python -m pytest test_workflow.py test_workflow_cwd.py — 4 passed.

Screenshots of Test Results (if appropriate):

  1. Pipeline test:
$ cd tests && python -m pytest test_all_pipeline.py -v
test_all_pipeline.py::TestAllFlow::test_0_train PASSED                   [ 33%]
test_all_pipeline.py::TestAllFlow::test_1_backtest PASSED                [ 66%]
test_all_pipeline.py::TestAllFlow::test_2_expmanager PASSED              [100%]
================== 3 passed, 3 warnings in 473.27s (0:07:53) ===================
  1. Your own tests:
$ cd tests && python -m pytest test_workflow.py test_workflow_cwd.py -v
test_workflow.py::WorkflowTest::test_get_local_dir PASSED                                  [ 25%]
test_workflow_cwd.py::MLflowExpManagerCWDTest::test_filelock_resolves_to_absolute_uri_not_cwd PASSED [ 50%]
test_workflow_cwd.py::MLflowRecorderGitSnapshotTest::test_git_work_tree_logs_code_snapshot PASSED   [ 75%]
test_workflow_cwd.py::MLflowRecorderGitSnapshotTest::test_non_git_cwd_is_silent_and_logs_no_artifacts PASSED [100%]
============================== 4 passed in 7.55s ===============================

(The two CWD-failure tests in test_workflow_cwd.py were also confirmed to fail on the pre-fix commit, so they genuinely guard the regressions.)

Types of changes

  • Fix bugs

zhaow-de added 2 commits June 9, 2026 13:12
`_get_or_create_exp` built the lock path with
`os.path.join(pr.netloc, pr.path.lstrip("/"), "filelock")`. For an absolute
`file://` URI (including the default `QSettings` URI) `urlparse` yields an
empty netloc and an absolute path, so `lstrip("/")` turned it into a
CWD-relative path. The FileLock and its parent dirs were then created under
whatever directory the process happened to run in, leaving stray trees and an
ineffective lock that no longer serialized concurrent experiment creation.

Use `url2pathname(pr.path)` so the lock resolves to the location the URI
actually names (also correct for Windows drive-letter URIs), and ensure the
directory exists before locking.

Refs microsoft#2252
`_log_uncommitted_code` ran `git diff/status/diff --cached` via
`subprocess.check_output(cmd, shell=True)` without capturing stderr. When the
CWD was not a git repository (containers, CI sandboxes, a tempdir) git wrote a
multi-line "usage: git diff --no-index ..." banner straight to the parent's
stderr, bypassing qlib's logger, and each failure emitted a noisy INFO record.

Probe `git rev-parse --is-inside-work-tree` first and skip silently (DEBUG)
when there is no work tree or git is unavailable. Drop `shell=True` in favor of
argument lists, capture/suppress stderr, and demote the per-command failure log
to DEBUG. This hook is opportunistic reproducibility metadata, not a
precondition for running an experiment.

Refs microsoft#2252
@zhaow-de

zhaow-de commented Jun 9, 2026

Copy link
Copy Markdown
Author

@microsoft-github-policy-service agree

Regression tests for the two fixes in this branch:
- MLflowExpManager places its FileLock at the absolute file:// path the URI
  names and leaves an unrelated working directory clean.
- MLflowRecorder._log_uncommitted_code stays silent outside a git work tree
  (no git banner leaked to stderr, no artifacts logged) and still logs
  code_diff/status/cached inside one.

Both CWD-failure tests fail on the pre-fix code and pass with the fixes.

Refs microsoft#2252
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

qlib.workflow: FileLock and _log_uncommitted_code both unsafe vs CWD

1 participant