feat: support bf16 from pytorch dataset#6342

Open
eddyxu wants to merge 13 commits intomainfrom
lei/torch_bf16

Conversation


@eddyxu eddyxu commented Mar 30, 2026

Summary

Support round-tripping bf16 data to and from PyTorch.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

@github-actions bot added the `enhancement` (New feature or request) and `python` labels on Mar 30, 2026
@github-actions
Contributor

PR Review

P0 Bug: uint64_as_int64 conversion broken for numeric types

The insertion of the `elif _is_bfloat16_type(arr.type):` branch has accidentally moved the `uint64_as_int64` guard into the wrong branch.

Before (base):

        elif (
            pa.types.is_integer(arr.type)
            or pa.types.is_floating(arr.type)
            or pa.types.is_boolean(arr.type)
        ):
            tensor = torch.from_numpy(arr.to_numpy(zero_copy_only=False))

            if uint64_as_int64 and tensor.dtype == torch.uint64:  # ← inside numeric branch ✓
                tensor = tensor.to(torch.int64)

After (this PR):

        elif (
            pa.types.is_integer(arr.type)
            ...
        ):
            tensor = torch.from_numpy(arr.to_numpy(zero_copy_only=False))
        elif _is_bfloat16_type(arr.type):
            tensor = _bf16_to_tensor(arr)

            if uint64_as_int64 and tensor.dtype == torch.uint64:  # ← now inside bf16 branch ✗
                tensor = tensor.to(torch.int64)

The `uint64_as_int64` conversion no longer runs for numeric uint64 columns (a regression), and it is dead code inside the bf16 branch (bf16 tensors never have `torch.uint64` dtype).

Fix: Move the `if uint64_as_int64` block back into (or after) the numeric `elif`, e.g. by dedenting it to run after all branches, or by keeping it inside the numeric branch:

        ):
            tensor = torch.from_numpy(arr.to_numpy(zero_copy_only=False))

            if uint64_as_int64 and tensor.dtype == torch.uint64:
                tensor = tensor.to(torch.int64)
        elif _is_bfloat16_type(arr.type):
            tensor = _bf16_to_tensor(arr)
        elif hf_converter is not None:

Minor: No null handling in _bf16_to_tensor

If a bf16 array contains nulls, _bf16_to_tensor will silently produce garbage values for null slots. This is consistent with the existing numeric path (to_numpy(zero_copy_only=False) also fills nulls with 0), so not blocking — but worth a brief doc comment noting the assumption.


Overall the approach (reinterpreting uint16 storage bytes as bfloat16) is sound and the test coverage for the happy path is good.
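The soundness of the bit-level reinterpretation can be sanity-checked without torch: a bfloat16 value is exactly the upper 16 bits of the corresponding IEEE-754 float32, so widening the stored uint16 pattern into the high half of a uint32 recovers the value losslessly. A minimal numpy sketch (illustrative only, not the PR's code):

```python
import numpy as np

def bf16_bits_to_float32(bits: np.ndarray) -> np.ndarray:
    """Expand bfloat16 bit patterns (held as uint16) into float32.

    bfloat16 is the upper half of an IEEE-754 float32, so shifting each
    16-bit pattern into the high half of a uint32 and reinterpreting the
    bytes recovers the exact value -- no rounding is involved.
    """
    return (bits.astype(np.uint32) << 16).view(np.float32)

# 1.0 is 0x3F800000 as float32, so its bfloat16 bit pattern is 0x3F80.
vals = bf16_bits_to_float32(np.array([0x3F80, 0x0000, 0xC000], dtype=np.uint16))
# vals is [1.0, 0.0, -2.0]
```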

@eddyxu eddyxu requested a review from westonpace March 30, 2026 16:22

@westonpace westonpace left a comment


Just some nits

    Null values are replaced with NaN.
    """
    storage = arr.storage if isinstance(arr.type, pa.ExtensionType) else arr
    buf = storage.buffers()[1]

Should we do a sanity check that the data type of storage is a 16-bit type at this point?

    buf = storage.buffers()[1]
    offset = storage.offset * 2  # 2 bytes per bf16 value
    np_uint16 = np.frombuffer(buf, dtype=np.uint16, count=len(storage), offset=offset)
    tensor = torch.from_numpy(np_uint16.copy()).view(torch.bfloat16)

Is the copy here so that the resulting buffer can be mutable?
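One likely reason for the `.copy()` (an assumption, not confirmed in this thread): `np.frombuffer` over an immutable buffer yields a read-only array, and `torch.from_numpy` shares memory with its input and warns on non-writable arrays, so copying both detaches from Arrow's buffer and produces mutable memory. A quick check of the numpy side:

```python
import numpy as np

buf = bytes(8)  # immutable Python bytes, standing in for an Arrow buffer view
view = np.frombuffer(buf, dtype=np.uint16)
assert not view.flags.writeable  # zero-copy view over immutable memory is read-only

owned = view.copy()
assert owned.flags.writeable     # the copy owns fresh, mutable memory
```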

    np_uint16 = np.frombuffer(buf, dtype=np.uint16, count=len(storage), offset=offset)
    tensor = torch.from_numpy(np_uint16.copy()).view(torch.bfloat16)
    if arr.null_count > 0:
        null_mask = torch.from_numpy(arr.is_null().to_numpy(zero_copy_only=False))

Seems like there should be a way to do this without a copy but maybe not.

    if uint64_as_int64 and tensor.dtype == torch.uint64:
    if (
        uint64_as_int64 and tensor.dtype == torch.uint64
    ):  # ← inside numeric branch ✓

Kind of a strange comment. I'm not really sure what it means.

Lance Release Bot and others added 10 commits March 31, 2026 13:12
## Summary

- Converts `MAX_MINIBLOCK_VALUES` from a compile-time constant to a
`LazyLock<u64>` that reads from the `LANCE_MINIBLOCK_MAX_VALUES`
environment variable (default `4096`)
- Updates all 6 usage sites across the encoding crate to dereference the
`LazyLock`
- Adds documentation in `docs/src/format/file/encoding.md` explaining
the tuning knob and when it's useful

Closes #6140

## Test plan

- [x] All existing miniblock tests pass (10 tests)
- [x] All existing RLE tests pass (22 tests)
- [x] `cargo clippy -p lance-encoding --tests -- -D warnings` clean
- [ ] ~~Verify with a custom `LANCE_MINIBLOCK_MAX_VALUES` value that
smaller mini-blocks are produced~~

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
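The lazy, read-once behavior described above can be sketched in Python (the actual code is Rust's `LazyLock<u64>`; the helper name below is hypothetical):

```python
import functools
import os

DEFAULT_MINIBLOCK_MAX_VALUES = 4096  # default named in the commit message

@functools.lru_cache(maxsize=None)
def miniblock_max_values() -> int:
    # Python analogue of a Rust LazyLock<u64>: the environment variable is
    # consulted once, on first use, and the result is cached thereafter.
    raw = os.environ.get("LANCE_MINIBLOCK_MAX_VALUES")
    return int(raw) if raw else DEFAULT_MINIBLOCK_MAX_VALUES
```

As with `LazyLock`, later changes to the environment variable have no effect once the value has been read.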
…#6316)

## Summary
- Cloud providers (AWS, Azure, GCP) constructed `StorageOptions` using
the raw tuple struct constructor, bypassing `StorageOptions::new()`
which reads `OBJECT_STORE_CLIENT_MAX_RETRIES` and
`OBJECT_STORE_CLIENT_RETRY_TIMEOUT` from environment variables
- This meant those env vars were silently ignored — the default (10) was
always used regardless of what was set in the environment

## Test plan
- [ ] Set `OBJECT_STORE_CLIENT_MAX_RETRIES=1` and verify the object
store client respects it
- [ ] Verify `OBJECT_STORE_CLIENT_RETRY_TIMEOUT` is also picked up

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
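The shape of this bug can be illustrated in Python (the actual code is Rust; the class below is a hypothetical stand-in, not Lance's API): env-reading logic that lives only in one constructor is silently skipped when callers build the value directly.

```python
import os

class StorageOptions:
    """Sketch: env reading lives only in the `new` path."""

    def __init__(self, max_retries: int):
        # Raw construction: takes whatever it is given, never looks at env.
        self.max_retries = max_retries

    @classmethod
    def new(cls) -> "StorageOptions":
        # Only this constructor consults the environment variable.
        return cls(int(os.environ.get("OBJECT_STORE_CLIENT_MAX_RETRIES", "10")))

os.environ["OBJECT_STORE_CLIENT_MAX_RETRIES"] = "1"
assert StorageOptions.new().max_retries == 1  # env respected
assert StorageOptions(10).max_retries == 10   # raw path: env silently ignored
```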
…dices (#6351)

Problem:
  take() with sorted-but-duplicate indices (e.g., ML sliding window
  sampling) degrades from O(n) to O(n²). Benchmarked: 2.4M indices
  with 90% duplicates takes 57 seconds instead of 120ms.

Root cause (two bugs):
1. check_row_addrs() uses strict `>` to detect sorted order. Duplicate
adjacent values (addr == last_offset) are misclassified as "unsorted",
routing to the slow fallback path.
2. The unsorted fallback remaps via `.position()` linear scan — O(N*M)
where N=original indices, M=deduplicated rows. For 2.4M × 249K = 308
billion comparisons ≈ 57 seconds.

Fix:
1. Change `>` to `>=` in check_row_addrs so sorted-with-duplicates
correctly takes the fast "sorted" branch.
2. Replace `.position()` linear scan with HashMap lookup (O(1) per
element) as defense-in-depth for truly unsorted input.

Benchmark (1024 sliding windows, 2410 rows each, stride 241):
  Before: 57,000ms
  After:    120ms  (475× speedup)

The sorted fast path already handles duplicates correctly via
fragment-level take_as_batch() which has its own dedup logic.

Co-authored-by: YSBF <noreply@users.noreply.github.com>
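The second fix (replacing a per-element linear scan with a hash-map lookup) can be sketched in Python; this is illustrative only, the PR's code is Rust:

```python
def remap_positions_slow(indices, deduped):
    # O(N*M): a fresh linear scan per index -- the pre-fix fallback path.
    return [deduped.index(i) for i in indices]

def remap_positions_fast(indices, deduped):
    # O(N + M): build the position map once, then one O(1) lookup per index.
    pos = {row: p for p, row in enumerate(deduped)}
    return [pos[i] for i in indices]

indices = [10, 10, 20, 20, 30]  # sorted with duplicates, sliding-window style
deduped = sorted(set(indices))  # [10, 20, 30]
assert remap_positions_slow(indices, deduped) == remap_positions_fast(indices, deduped)
```

With 2.4M indices against ~249K deduplicated rows, the slow version does hundreds of billions of comparisons while the fast version does one dictionary probe per index, which matches the 57s → 120ms benchmark above.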
This removes the `ROW_ID` sort from distributed vector segment
finalization. The merge path now preserves incoming batch order instead
of canonicalizing partition contents by `ROW_ID`, which avoids the
`FixedSizeList` `take` overflow that was hitting very large IVF_PQ
partitions.

A focused Rust regression test now covers large `FixedSizeList`
partition batches and verifies we preserve input order without
triggering the Arrow overflow.
This fixes an OOM in vector index training when `fragment_ids` are
provided for a nullable FixedSizeList vector column. Instead of
materializing all selected training rows and then sampling, the nullable
fragment-filtered path now samples in chunks and filters nulls batch by
batch, keeping memory bounded by the requested sample size.
This fixes a bug in distributed vector index builds where any worker
that fell back to builder-local training sampled training data from the
entire dataset instead of the worker's selected fragments. The change
threads `fragment_filter` through IVF and quantizer training, and adds
an `IvfSq` regression test that verifies worker-local SQ bounds are
derived only from the shard fragments.
Expose the stable row ID manifest flag to Java via
`Dataset.usesStableRowIds()`.

Downstream consumers (e.g. lance-spark) need to check whether a dataset
was created with stable row IDs enabled. The Rust manifest tracks this
flag internally, but it was not accessible from the Java API.

The only workaround was to mirror the flag into `Dataset.getConfig()`
via an explicit `UpdateConfig` commit after table creation. This had two
consequences:

1. **Unnecessary version bump.** Every table creation with stable row
IDs produced two commits instead of one: the initial `CREATE`, followed
by a separate `UpdateConfig` commit just to persist the flag in the
config map. This means the table starts at version 2 instead of version
1.

2. **CDF (Change Data Feed) breakage.** The extra commit also breaks CDF:
CDF tracks row-level changes across dataset versions, and tests assert exact
version numbers for created/updated rows (e.g. "row inserted at version 1,
updated at version 3"). The extra version bump shifts every subsequent version
by +1, breaking all CDF assertions across the test suite.

With `Dataset.usesStableRowIds()`, consumers can read the manifest flag
directly: no config map mirror needed, no extra commit, no version
inflation.

A companion PR on lance-spark
(lance-format/lance-spark#351) adds the rest of
the needed changes.

**NOTE:** The changes are related but can be merged independently; no
synchronization is needed, and each can exist without the other.

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Will Jones <willjones127@gmail.com>
This PR wires IVF_RQ into the distributed vector index segment build
path so workers can build and commit shard-local RQ segments through the
existing distributed indexing workflow. It also adds the missing RQ
handling in the distributed auxiliary metadata plumbing and regression
coverage for distributed RQ segment build plus auxiliary merge.

Current limitation: IVF_RQ still does not have a pre-trained/shared RQ
metadata contract. Different RQ segments can therefore carry different
rotations and centroids, so merging multiple IVF_RQ segments into a
single physical segment is not a supported workflow yet. I plan to
follow up by adding shared/pre-trained RQ metadata so segment merge can
be supported safely.
@github-actions bot added the `java` label on Mar 31, 2026