Skip to content

fix(fsspec): handle zero-byte files in __len__#3353

Merged
Fokko merged 1 commit into
apache:mainfrom
fallintoplace:fix-fsspec-zero-byte-len
May 15, 2026
Merged

fix(fsspec): handle zero-byte files in __len__#3353
Fokko merged 1 commit into
apache:mainfrom
fallintoplace:fix-fsspec-zero-byte-len

Conversation

@fallintoplace
Copy link
Copy Markdown
Contributor

Summary

Fix FsspecInputFile.__len__ and FsspecOutputFile.__len__ so zero-byte files return 0 instead of being treated as missing metadata.

Both methods previously used truthiness checks on object_info.get(...), which caused valid sizes like 0 to fall through to the runtime error path.

Changes

  • check for Size key presence explicitly
  • check for size key presence explicitly
  • add a regression test covering zero-byte lengths for both metadata key variants

Verification

  • python -m pytest tests/io/test_fsspec.py -k zero_length_of_file -q

Copy link
Copy Markdown
Contributor

@rambleraptor rambleraptor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks fine to me code-wise. Love the test.

Personal curiosity, but any reason why this would be useful? Zero-byte files are an interesting edge case

@fallintoplace
Copy link
Copy Markdown
Contributor Author

The main point is that FileIO.len should faithfully reflect the filesystem metadata. A zero-byte object is valid object-store metadata, but the previous truthiness check treated 0 the same as a missing key and raised instead of returning 0.

Even if PyIceberg does not intentionally write zero-byte data files, the fsspec FileIO can be used against general object-store paths, including marker files or user-created empty objects. This keeps the boundary behavior consistent with the underlying filesystem.

Copy link
Copy Markdown
Contributor

@Fokko Fokko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oof, great catch @fallintoplace Thanks for fixing this. @rambleraptor thanks for the review 👍

@Fokko Fokko merged commit a99320b into apache:main May 15, 2026
16 checks passed
Fokko pushed a commit that referenced this pull request May 25, 2026
…tion (#3412)

Inspired by the walrus issue in #3353

## Summary

Several `if x := dict.get(k):` checks treated legitimate falsy values as
missing:

- `lower_bounds.get(field_id)` / `upper_bounds.get(field_id)` return
`bytes`. `b""` is a valid serialization of an empty string and was being
skipped, causing metrics-based row filtering to return
`ROWS_MIGHT_MATCH` when it should have been `ROWS_CANNOT_MATCH` (and
vice versa for strict eval).
- `accessors[...].get(file.partition)` can return `0` or `""` for an
`IdentityTransform` partition. The walrus dropped it, so projected
partition columns were filled with `null` instead of the actual value.
- `inspect.py` `lower_bound` / `upper_bound` rendering had the same
`b""` issue, showing `None` instead of `""` in `readable_metrics`.

All conditions are switched to explicit `is not None` checks. A small
`_readable_bound` helper deduplicates the inspect rendering.

## Changes

- `pyiceberg/expressions/visitors.py` — `_InclusiveMetricsEvaluator` and
`_StrictMetricsEvaluator` bound lookups (21 sites).
- `pyiceberg/io/pyarrow.py` — `_get_column_projection_values` and
`ArrowProjectionVisitor` missing-field handling.
- `pyiceberg/table/inspect.py` — extract `_readable_bound`, use it in
both the `entries` and `_get_files_from_manifest` rendering paths.

## Tests

- `tests/expressions/test_evaluator.py` — inclusive and strict
evaluators with empty-string bounds, covering lower-only, upper-only,
and both-bounds branches.
- `tests/io/test_pyarrow.py` — parametrized identity-transform
projection with falsy partition values (`0`, `""`, `None`).
- `tests/table/test_inspect.py` — `_readable_bound` helper plus
integration tests via `tbl.inspect.entries()` and `tbl.inspect.files()`
for empty-string and null bounds.
Fokko pushed a commit that referenced this pull request May 25, 2026
Inspired by the walrus issue in #3353

## Problem

`PartitionField` and `SortField` accept the v3 `source-ids` list (added
in [#1554](#1554)) and map
it to the legacy singular `source-id`. Both validators try to reject an
empty list:

```python
if "source-id" not in data and (source_ids := data["source-ids"]):
    if isinstance(source_ids, list):
        if len(source_ids) == 0:
            raise ValueError("Empty source-ids is not allowed")
        ...
        data["source-id"] = source_ids[0]
```

The walrus uses truthiness, and `[]` is falsy — so the `len(source_ids)
== 0` branch is unreachable. Passing `{"source-ids": []}` silently skips
the mapping, and Pydantic then reports a generic "field required" error
instead of the intended message. A missing `source-ids` key also raises
`KeyError` instead of being handled cleanly.

## Fix

Replace the walrus with an explicit key check in both validators:

```python
if "source-id" not in data and "source-ids" in data:
    source_ids = data["source-ids"]
    ...
```

This makes the empty-list validation reachable and avoids the
`KeyError`.

## Tests

Added regression tests that deserialize `{"source-ids": []}` and assert
`ValueError("Empty source-ids is not allowed")` is raised, for both
`PartitionField` and `SortField`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants