fix(fsspec): handle zero-byte files in __len__#3353
Conversation
rambleraptor
left a comment
There was a problem hiding this comment.
This looks fine to me code-wise. Love the test.
Personal curiosity, but any reason why this would be useful? Zero-byte files are an interesting edge case
|
The main point is that FileIO.len should faithfully reflect the filesystem metadata. A zero-byte object is valid object-store metadata, but the previous truthiness check treated 0 the same as a missing key and raised instead of returning 0. Even if PyIceberg does not intentionally write zero-byte data files, the fsspec FileIO can be used against general object-store paths, including marker files or user-created empty objects. This keeps the boundary behavior consistent with the underlying filesystem. |
Fokko
left a comment
There was a problem hiding this comment.
Oof, great catch @fallintoplace Thanks for fixing this. @rambleraptor thanks for the review 👍
…tion (#3412) Inspired by the walrus issue in #3353 ## Summary Several `if x := dict.get(k):` checks treated legitimate falsy values as missing: - `lower_bounds.get(field_id)` / `upper_bounds.get(field_id)` return `bytes`. `b""` is a valid serialization of an empty string and was being skipped, causing metrics-based row filtering to return `ROWS_MIGHT_MATCH` when it should have been `ROWS_CANNOT_MATCH` (and vice versa for strict eval). - `accessors[...].get(file.partition)` can return `0` or `""` for an `IdentityTransform` partition. The walrus dropped it, so projected partition columns were filled with `null` instead of the actual value. - `inspect.py` `lower_bound` / `upper_bound` rendering had the same `b""` issue, showing `None` instead of `""` in `readable_metrics`. All conditions are switched to explicit `is not None` checks. A small `_readable_bound` helper deduplicates the inspect rendering. ## Changes - `pyiceberg/expressions/visitors.py` — `_InclusiveMetricsEvaluator` and `_StrictMetricsEvaluator` bound lookups (21 sites). - `pyiceberg/io/pyarrow.py` — `_get_column_projection_values` and `ArrowProjectionVisitor` missing-field handling. - `pyiceberg/table/inspect.py` — extract `_readable_bound`, use it in both the `entries` and `_get_files_from_manifest` rendering paths. ## Tests - `tests/expressions/test_evaluator.py` — inclusive and strict evaluators with empty-string bounds, covering lower-only, upper-only, and both-bounds branches. - `tests/io/test_pyarrow.py` — parametrized identity-transform projection with falsy partition values (`0`, `""`, `None`). - `tests/table/test_inspect.py` — `_readable_bound` helper plus integration tests via `tbl.inspect.entries()` and `tbl.inspect.files()` for empty-string and null bounds.
Inspired by the walrus issue in #3353 ## Problem `PartitionField` and `SortField` accept the v3 `source-ids` list (added in [#1554](#1554)) and map it to the legacy singular `source-id`. Both validators try to reject an empty list: ```python if "source-id" not in data and (source_ids := data["source-ids"]): if isinstance(source_ids, list): if len(source_ids) == 0: raise ValueError("Empty source-ids is not allowed") ... data["source-id"] = source_ids[0] ``` The walrus uses truthiness, and `[]` is falsy — so the `len(source_ids) == 0` branch is unreachable. Passing `{"source-ids": []}` silently skips the mapping, and Pydantic then reports a generic "field required" error instead of the intended message. A missing `source-ids` key also raises `KeyError` instead of being handled cleanly. ## Fix Replace the walrus with an explicit key check in both validators: ```python if "source-id" not in data and "source-ids" in data: source_ids = data["source-ids"] ... ``` This makes the empty-list validation reachable and avoids the `KeyError`. ## Tests Added regression tests that deserialize `{"source-ids": []}` and assert `ValueError("Empty source-ids is not allowed")` is raised, for both `PartitionField` and `SortField`.
Summary
Fix
FsspecInputFile.__len__andFsspecOutputFile.__len__so zero-byte files return0instead of being treated as missing metadata.Both methods previously used truthiness checks on
object_info.get(...), which caused valid sizes like0to fall through to the runtime error path.Changes
Sizekey presence explicitlysizekey presence explicitlyVerification
python -m pytest tests/io/test_fsspec.py -k zero_length_of_file -q