fix(fsspec): handle zero-byte files in __len__ by fallintoplace · Pull Request #3353 · apache/iceberg-python

fallintoplace · 2026-05-12T22:36:55Z

Summary

Fix FsspecInputFile.__len__ and FsspecOutputFile.__len__ so zero-byte files return 0 instead of being treated as missing metadata.

Both methods previously used truthiness checks on object_info.get(...), which caused valid sizes like 0 to fall through to the runtime error path.

Changes

check for Size key presence explicitly
check for size key presence explicitly
add a regression test covering zero-byte lengths for both metadata key variants

Verification

python -m pytest tests/io/test_fsspec.py -k zero_length_of_file -q

rambleraptor

This looks fine to me code-wise. Love the test.

Personal curiosity, but any reason why this would be useful? Zero-byte files are an interesting edge case

fallintoplace · 2026-05-12T23:03:02Z

The main point is that FileIO.len should faithfully reflect the filesystem metadata. A zero-byte object is valid object-store metadata, but the previous truthiness check treated 0 the same as a missing key and raised instead of returning 0.

Even if PyIceberg does not intentionally write zero-byte data files, the fsspec FileIO can be used against general object-store paths, including marker files or user-created empty objects. This keeps the boundary behavior consistent with the underlying filesystem.

Fokko

Oof, great catch @fallintoplace Thanks for fixing this. @rambleraptor thanks for the review 👍

…tion (#3412) Inspired by the walrus issue in #3353 ## Summary Several `if x := dict.get(k):` checks treated legitimate falsy values as missing: - `lower_bounds.get(field_id)` / `upper_bounds.get(field_id)` return `bytes`. `b""` is a valid serialization of an empty string and was being skipped, causing metrics-based row filtering to return `ROWS_MIGHT_MATCH` when it should have been `ROWS_CANNOT_MATCH` (and vice versa for strict eval). - `accessors[...].get(file.partition)` can return `0` or `""` for an `IdentityTransform` partition. The walrus dropped it, so projected partition columns were filled with `null` instead of the actual value. - `inspect.py` `lower_bound` / `upper_bound` rendering had the same `b""` issue, showing `None` instead of `""` in `readable_metrics`. All conditions are switched to explicit `is not None` checks. A small `_readable_bound` helper deduplicates the inspect rendering. ## Changes - `pyiceberg/expressions/visitors.py` — `_InclusiveMetricsEvaluator` and `_StrictMetricsEvaluator` bound lookups (21 sites). - `pyiceberg/io/pyarrow.py` — `_get_column_projection_values` and `ArrowProjectionVisitor` missing-field handling. - `pyiceberg/table/inspect.py` — extract `_readable_bound`, use it in both the `entries` and `_get_files_from_manifest` rendering paths. ## Tests - `tests/expressions/test_evaluator.py` — inclusive and strict evaluators with empty-string bounds, covering lower-only, upper-only, and both-bounds branches. - `tests/io/test_pyarrow.py` — parametrized identity-transform projection with falsy partition values (`0`, `""`, `None`). - `tests/table/test_inspect.py` — `_readable_bound` helper plus integration tests via `tbl.inspect.entries()` and `tbl.inspect.files()` for empty-string and null bounds.

Inspired by the walrus issue in #3353 ## Problem `PartitionField` and `SortField` accept the v3 `source-ids` list (added in [#1554](#1554)) and map it to the legacy singular `source-id`. Both validators try to reject an empty list: ```python if "source-id" not in data and (source_ids := data["source-ids"]): if isinstance(source_ids, list): if len(source_ids) == 0: raise ValueError("Empty source-ids is not allowed") ... data["source-id"] = source_ids[0] ``` The walrus uses truthiness, and `[]` is falsy — so the `len(source_ids) == 0` branch is unreachable. Passing `{"source-ids": []}` silently skips the mapping, and Pydantic then reports a generic "field required" error instead of the intended message. A missing `source-ids` key also raises `KeyError` instead of being handled cleanly. ## Fix Replace the walrus with an explicit key check in both validators: ```python if "source-id" not in data and "source-ids" in data: source_ids = data["source-ids"] ... ``` This makes the empty-list validation reachable and avoids the `KeyError`. ## Tests Added regression tests that deserialize `{"source-ids": []}` and assert `ValueError("Empty source-ids is not allowed")` is raised, for both `PartitionField` and `SortField`.

fix(fsspec): handle zero-byte files in __len__

f58d686

rambleraptor approved these changes May 12, 2026

View reviewed changes

Fokko approved these changes May 15, 2026

View reviewed changes

Fokko merged commit a99320b into apache:main May 15, 2026
16 checks passed

This was referenced May 24, 2026

Reject empty source-ids in PartitionField / SortField #3411

Merged

Fix walrus truthiness on metrics bounds and identity-partition projection #3412

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(fsspec): handle zero-byte files in len#3353

fix(fsspec): handle zero-byte files in len#3353
Fokko merged 1 commit into
apache:mainfrom
fallintoplace:fix-fsspec-zero-byte-len

fallintoplace commented May 12, 2026

Uh oh!

rambleraptor left a comment

Uh oh!

fallintoplace commented May 12, 2026

Uh oh!

Fokko left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

fallintoplace commented May 12, 2026

Summary

Changes

Verification

Uh oh!

rambleraptor left a comment

Choose a reason for hiding this comment

Uh oh!

fallintoplace commented May 12, 2026

Uh oh!

Fokko left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants