fix(workflow): make FileLock and git snapshot robust to the CWD by zhaow-de · Pull Request #2254 · microsoft/qlib

zhaow-de · 2026-06-09T11:18:34Z

Description

Two related defects in qlib.workflow, both rooted in assumptions about the caller's current working directory.

MLflowExpManager._get_or_create_exp (qlib/workflow/expm.py) built its FileLock path with os.path.join(pr.netloc, pr.path.lstrip("/"), "filelock"). For an absolute file:// URI (including the default QSettings URI) urlparse yields an empty netloc and an absolute path, so lstrip("/") produced a CWD-relative path. The lock and its parent dirs were then created under wherever the process happened to be running, leaving stray directory trees and an ineffective lock that no longer serialized concurrent experiment creation. Fixed by resolving the path with url2pathname(pr.path) (also correct for Windows drive-letter URIs) and ensuring the directory exists before locking.
MLflowRecorder._log_uncommitted_code (qlib/workflow/recorder.py) ran git diff / git status / git diff --cached via subprocess.check_output(cmd, shell=True) without capturing stderr. Outside a git work tree (containers, CI sandboxes, a tempdir) git's multi-line usage: git diff --no-index ... banner leaked straight to the parent's stderr — bypassing qlib's logger — and each failure also emitted a noisy INFO record. Fixed by probing git rev-parse --is-inside-work-tree first and skipping silently at DEBUG; also dropped shell=True in favor of argument lists, captured/suppressed stderr, and demoted the per-command failure log to DEBUG. This hook is opportunistic reproducibility metadata, not a precondition for running an experiment.
Added tests/test_workflow_cwd.py with regression tests for both behaviors (they fail on the pre-fix code and pass with the fixes).

Motivation and Context

Closes #2252.

Both defects surface whenever qlib is driven from a working directory other than the one implied by the experiment URI — e.g. a wrapper CLI that runs qlib inside a tempfile.TemporaryDirectory(), a container, or a CI sandbox. The FileLock bug silently scatters mlruns/lock directories across the filesystem and weakens the lock that guards concurrent experiment creation; the git-snapshot bug corrupts any structured-stderr contract (JSON logs, CI log parsers) with git's usage banner that the user cannot suppress through qlib's log-level controls.

How Has This Been Tested?

Pass the test by running: pytest qlib/tests/test_all_pipeline.py under upper directory of qlib.
If you are adding a new feature, test on your own test scripts.

Verified locally on macOS / Python 3.12 against an editable install (Cython extensions built; numpy<2.0.0 to match the CI rl pin):

make black pylint flake8 mypy — all green (black: 337 files unchanged; flake8: clean; pylint: rated 10.00/10 for both qlib and scripts; mypy: Success: no issues found in 55 source files).
cd tests && python -m pytest test_workflow.py test_workflow_cwd.py — 4 passed.

Screenshots of Test Results (if appropriate):

Pipeline test:

$ cd tests && python -m pytest test_all_pipeline.py -v
test_all_pipeline.py::TestAllFlow::test_0_train PASSED                   [ 33%]
test_all_pipeline.py::TestAllFlow::test_1_backtest PASSED                [ 66%]
test_all_pipeline.py::TestAllFlow::test_2_expmanager PASSED              [100%]
================== 3 passed, 3 warnings in 473.27s (0:07:53) ===================

Your own tests:

$ cd tests && python -m pytest test_workflow.py test_workflow_cwd.py -v
test_workflow.py::WorkflowTest::test_get_local_dir PASSED                                  [ 25%]
test_workflow_cwd.py::MLflowExpManagerCWDTest::test_filelock_resolves_to_absolute_uri_not_cwd PASSED [ 50%]
test_workflow_cwd.py::MLflowRecorderGitSnapshotTest::test_git_work_tree_logs_code_snapshot PASSED   [ 75%]
test_workflow_cwd.py::MLflowRecorderGitSnapshotTest::test_non_git_cwd_is_silent_and_logs_no_artifacts PASSED [100%]
============================== 4 passed in 7.55s ===============================

(The two CWD-failure tests in test_workflow_cwd.py were also confirmed to fail on the pre-fix commit, so they genuinely guard the regressions.)

Types of changes

Fix bugs

`_get_or_create_exp` built the lock path with `os.path.join(pr.netloc, pr.path.lstrip("/"), "filelock")`. For an absolute `file://` URI (including the default `QSettings` URI) `urlparse` yields an empty netloc and an absolute path, so `lstrip("/")` turned it into a CWD-relative path. The FileLock and its parent dirs were then created under whatever directory the process happened to run in, leaving stray trees and an ineffective lock that no longer serialized concurrent experiment creation. Use `url2pathname(pr.path)` so the lock resolves to the location the URI actually names (also correct for Windows drive-letter URIs), and ensure the directory exists before locking. Refs microsoft#2252

`_log_uncommitted_code` ran `git diff/status/diff --cached` via `subprocess.check_output(cmd, shell=True)` without capturing stderr. When the CWD was not a git repository (containers, CI sandboxes, a tempdir) git wrote a multi-line "usage: git diff --no-index ..." banner straight to the parent's stderr, bypassing qlib's logger, and each failure emitted a noisy INFO record. Probe `git rev-parse --is-inside-work-tree` first and skip silently (DEBUG) when there is no work tree or git is unavailable. Drop `shell=True` in favor of argument lists, capture/suppress stderr, and demote the per-command failure log to DEBUG. This hook is opportunistic reproducibility metadata, not a precondition for running an experiment. Refs microsoft#2252

zhaow-de · 2026-06-09T11:21:06Z

@microsoft-github-policy-service agree

Regression tests for the two fixes in this branch: - MLflowExpManager places its FileLock at the absolute file:// path the URI names and leaves an unrelated working directory clean. - MLflowRecorder._log_uncommitted_code stays silent outside a git work tree (no git banner leaked to stderr, no artifacts logged) and still logs code_diff/status/cached inside one. Both CWD-failure tests fail on the pre-fix code and pass with the fixes. Refs microsoft#2252

zhaow-de added 2 commits June 9, 2026 13:12

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(workflow): make FileLock and git snapshot robust to the CWD#2254

fix(workflow): make FileLock and git snapshot robust to the CWD#2254
zhaow-de wants to merge 3 commits into
microsoft:mainfrom
zhaow-de:fix/mlflow-cwd-handling

zhaow-de commented Jun 9, 2026 •

edited

Loading

Uh oh!

zhaow-de commented Jun 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

zhaow-de commented Jun 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Motivation and Context

How Has This Been Tested?

Screenshots of Test Results (if appropriate):

Types of changes

Uh oh!

zhaow-de commented Jun 9, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

zhaow-de commented Jun 9, 2026 •

edited

Loading