Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 4 additions & 1 deletion pyproject.toml
Original file line number Diff line number Diff line change
Expand Up @@ -28,7 +28,7 @@ dependencies = [
"click>=8.0",
"dffs>=0.1.1",
"dvc>=3.50",
"dvc-data>=3.18.3",
"dvc-data @ git+https://github.com/runsascoded/dvc-data.git@9d3ef02c2c1e6752f897d2885f467d12fe934646",
]

[project.optional-dependencies]
Expand Down Expand Up @@ -67,6 +67,9 @@ dvx = "dvx.cli:main"
Homepage = "https://github.com/runsascoded/dvx"
Repository = "https://github.com/runsascoded/dvx"

[tool.hatch.metadata]
allow-direct-references = true

[tool.hatch.version]
source = "vcs"

Expand Down
130 changes: 130 additions & 0 deletions specs/done/fix-run-external-deps.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,130 @@
# Fix `dvx run` circular dependency error for external deps + git dep support

## Problem

`dvx run` reports "Circular dependency detected" when `.dvc` files have deps on files that aren't DVX-tracked (e.g. git-tracked notebooks, PDFs).

## Root Cause

In `src/dvx/run/executor.py:run()` (lines 511-516), when building the artifact graph from `.dvc` files, dependencies are only added to the graph if they have a corresponding `.dvc` file:

```python
if artifact.computation:
for dep in artifact.computation.deps:
dep_path = dep.path if isinstance(dep, Artifact) else str(dep)
dvc_file = Path(str(dep_path) + ".dvc")
if dvc_file.exists() and dep_path not in artifacts:
pending.append(dvc_file)
```

External dependencies (git-tracked files like `.pdf`, `.ipynb`) are silently dropped from the graph. Then in `_group_into_levels()`, when checking if all deps are "done", these missing dep paths are never in the `done` set, so the artifact is permanently "not ready". Since nothing is ever ready, it raises "Circular dependency detected".

## Immediate Fix (unblock `dvx run`)

In `run()`, when a dep has no `.dvc` file, add it as a leaf `Artifact` (no computation):

```python
if artifact.computation:
for dep in artifact.computation.deps:
dep_path = dep.path if isinstance(dep, Artifact) else str(dep)
if dep_path not in artifacts:
dvc_file = Path(str(dep_path) + ".dvc")
if dvc_file.exists():
pending.append(dvc_file)
else:
# External dependency (no .dvc file) - add as leaf
artifacts[dep_path] = Artifact(path=dep_path)
```

This way `_group_into_levels` sees leaf artifacts (computation=None) for external deps and immediately adds them to `done` at line 78.

## Design: `git_deps` for git-tracked dependencies

Currently deps use content MD5 hashes:
```yaml
deps:
data/2025-PATH-Monthly-Ridership-Report.pdf: 22c8cb757209390928b525c6fc5b37f5
monthly.ipynb: 48c56466c13cd724fe771ab7631fff57
```

But for git-tracked files, using git blob SHAs would be better:
- Freshness checks via `git ls-tree HEAD` (already cached in `_get_blob_cache()` in `dvc_files.py`)
- No need to read file content / compute MD5 for freshness
- Clear semantic distinction: DVX deps (have `.dvc` files, MD5) vs git deps (git-tracked, SHA-1)

### Proposed format

Add a separate `git_deps` key in the computation block:

```yaml
meta:
computation:
cmd: papermill monthly.ipynb out/monthly-2025.ipynb -p year 2025
deps:
data/2025.pqt: 278da34972b360e3a54aade6cdaf0384
git_deps:
data/2025-PATH-Monthly-Ridership-Report.pdf: <git-blob-sha>
monthly.ipynb: <git-blob-sha>
```

- `deps`: DVX-tracked artifacts (have `.dvc` files). Hash is content MD5.
- `git_deps`: Git-tracked files. Hash is git blob SHA-1.

### Implementation

**`DVCFileInfo`** (`dvc_files.py`): add `git_deps: dict[str, str]` field, populate from `computation.get("git_deps") or {}`.

**`Computation`** (`artifact.py`): add `git_deps: list[Artifact | str | Path]` field.

**`Artifact.from_dvc`** (`artifact.py`): populate `git_deps` from `info.git_deps`.

**`is_output_fresh`** (`dvc_files.py`): for `git_deps`, compare recorded blob SHA against `get_git_blob_sha(dep_path, "HEAD")` (uses the already-cached `_get_blob_cache`).

**`run()`** (`executor.py`): when building graph, git_deps are always leaf nodes (no `.dvc` file). Add them as leaf Artifacts.

**`_group_into_levels`**: no change needed — git deps become leaf artifacts, immediately "done".

**`write_dvc_file`** (`dvc_files.py`): accept `git_deps` param, write into `meta.computation.git_deps`.

**`Computation.get_dep_hashes`**: for git_deps, use `get_git_blob_sha()` instead of `compute_md5()`.

### Script to populate git_deps

The `add-dvc-computations.py` script (or a new one) should:
1. For each `.dvc` file, determine which deps are git-tracked vs DVX-tracked
2. For git-tracked deps, compute `get_git_blob_sha(path, "HEAD")` and put in `git_deps`
3. For DVX-tracked deps, keep in `deps` with content MD5

## Reproduction

In the `hccs/path` repo:
```bash
dvx run -n -v data/2025.pqt.dvc
# Error: Circular dependency detected
```

The `data/2025.pqt.dvc` file has:
```yaml
meta:
computation:
cmd: papermill monthly.ipynb out/monthly-2025.ipynb -p year 2025
deps:
data/2025-PATH-Monthly-Ridership-Report.pdf: 22c8cb757209390928b525c6fc5b37f5
monthly.ipynb: 48c56466c13cd724fe771ab7631fff57
```

Neither dep has a `.dvc` file (both are git-tracked).

## Testing

After the fix:
```bash
dvx run -n -v data/2025.pqt.dvc
# Should show: "would run" or "skip (up-to-date)" instead of circular dep error
```

Also test with multi-target:
```bash
dvx run -n -v data/*.dvc www/public/*.dvc
# Should show execution plan with levels
```
2 changes: 2 additions & 0 deletions src/dvx/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,8 @@
repo.push()
"""

import dvx._compat # noqa: F401 # patch DVC schemas before any validation

from dvx.repo import Repo

__all__ = ["Repo"]
39 changes: 39 additions & 0 deletions src/dvx/_compat.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,39 @@
"""Patch DVC to support HTTP `Last-Modified` as `mtime` in `.dvc` files.

dvc-data's `Meta.from_info()` captures HTTP `Last-Modified` into `mtime`, but:
1. DVC's voluptuous schema doesn't allow `mtime` in deps/outs
2. `Meta.to_dict()` doesn't serialize `mtime` (doing so breaks local files)

This module patches both: adds `mtime` to the schema, and extends `to_dict()`
to include `mtime` when `checksum` is set (HTTP sources use `checksum` for
ETag/Content-MD5, while local/S3/GCS use `etag`/`md5`).
"""


def _patch():
from dvc.dependency import SCHEMA as DEP_SCHEMA
from dvc.output import ARTIFACT_SCHEMA, META_SCHEMA
from dvc_data.hashfile.meta import Meta

# 1. Allow `mtime` in DVC's .dvc file schema validation
for schema in (META_SCHEMA, ARTIFACT_SCHEMA, DEP_SCHEMA):
if "mtime" not in schema:
schema["mtime"] = float

# 2. Extend Meta.to_dict() to serialize mtime for HTTP sources.
# Local filesystems always have inode set; HTTP sources never do.
# Use this to distinguish HTTP mtime (from Last-Modified) from
# local filesystem mtime (which DVC uses internally but shouldn't
# persist to .dvc files).
_orig_to_dict = Meta.to_dict

def _to_dict_with_mtime(self):
ret = _orig_to_dict(self)
if self.mtime is not None and self.inode is None:
ret["mtime"] = self.mtime
return ret

Meta.to_dict = _to_dict_with_mtime


_patch()