feat: support bf16 from pytorch dataset#6342

Open
eddyxu wants to merge 13 commits intomainfrom
lei/torch_bf16

Conversation


@eddyxu eddyxu commented Mar 30, 2026

Summary

Support round-tripping bf16 data to and from PyTorch.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

@github-actions bot added the `enhancement` (New feature or request) and `python` labels on Mar 30, 2026
@github-actions
Contributor

PR Review

P0 Bug: uint64_as_int64 conversion broken for numeric types

The insertion of the `elif _is_bfloat16_type(arr.type):` branch has accidentally moved the `uint64_as_int64` guard into the wrong branch.

Before (base):

        elif (
            pa.types.is_integer(arr.type)
            or pa.types.is_floating(arr.type)
            or pa.types.is_boolean(arr.type)
        ):
            tensor = torch.from_numpy(arr.to_numpy(zero_copy_only=False))

            if uint64_as_int64 and tensor.dtype == torch.uint64:  # ← inside numeric branch ✓
                tensor = tensor.to(torch.int64)

After (this PR):

        elif (
            pa.types.is_integer(arr.type)
            ...
        ):
            tensor = torch.from_numpy(arr.to_numpy(zero_copy_only=False))
        elif _is_bfloat16_type(arr.type):
            tensor = _bf16_to_tensor(arr)

            if uint64_as_int64 and tensor.dtype == torch.uint64:  # ← now inside bf16 branch ✗
                tensor = tensor.to(torch.int64)

The `uint64_as_int64` conversion no longer runs for numeric uint64 columns (a regression), and it is dead code inside the bf16 branch (bf16 tensors never have `torch.uint64` dtype).

Fix: Move the `if uint64_as_int64` block back into (or after) the numeric `elif`, e.g. by dedenting it to run after all branches, or by keeping it inside the numeric branch:

        ):
            tensor = torch.from_numpy(arr.to_numpy(zero_copy_only=False))

            if uint64_as_int64 and tensor.dtype == torch.uint64:
                tensor = tensor.to(torch.int64)
        elif _is_bfloat16_type(arr.type):
            tensor = _bf16_to_tensor(arr)
        elif hf_converter is not None:

Minor: No null handling in _bf16_to_tensor

If a bf16 array contains nulls, _bf16_to_tensor will silently produce garbage values for null slots. This is consistent with the existing numeric path (to_numpy(zero_copy_only=False) also fills nulls with 0), so not blocking — but worth a brief doc comment noting the assumption.


Overall the approach (reinterpreting uint16 storage bytes as bfloat16) is sound and the test coverage for the happy path is good.
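The soundness of the bit-level reinterpretation can be sanity-checked without torch: a bfloat16 value is exactly the upper 16 bits of the corresponding IEEE-754 float32, so widening the stored uint16 pattern into the high half of a uint32 recovers the value losslessly. A minimal numpy sketch (illustrative only, not the PR's code):

```python
import numpy as np

def bf16_bits_to_float32(bits: np.ndarray) -> np.ndarray:
    """Expand bfloat16 bit patterns (held as uint16) into float32.

    bfloat16 is the upper half of an IEEE-754 float32, so shifting each
    16-bit pattern into the high half of a uint32 and reinterpreting the
    bytes recovers the exact value -- no rounding is involved.
    """
    return (bits.astype(np.uint32) << 16).view(np.float32)

# 1.0 is 0x3F800000 as float32, so its bfloat16 bit pattern is 0x3F80.
vals = bf16_bits_to_float32(np.array([0x3F80, 0x0000, 0xC000], dtype=np.uint16))
# vals is [1.0, 0.0, -2.0]
```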

@eddyxu eddyxu requested a review from westonpace March 30, 2026 16:22

@westonpace westonpace left a comment


Just some nits

    Null values are replaced with NaN.
    """
    storage = arr.storage if isinstance(arr.type, pa.ExtensionType) else arr
    buf = storage.buffers()[1]

Should we do a sanity check that the data type of storage is a 16-bit type at this point?

    buf = storage.buffers()[1]
    offset = storage.offset * 2  # 2 bytes per bf16 value
    np_uint16 = np.frombuffer(buf, dtype=np.uint16, count=len(storage), offset=offset)
    tensor = torch.from_numpy(np_uint16.copy()).view(torch.bfloat16)

Is the copy here so that the resulting buffer can be mutable?
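One likely reason for the `.copy()` (an assumption, not confirmed in this thread): `np.frombuffer` over an immutable buffer yields a read-only array, and `torch.from_numpy` shares memory with its input and warns on non-writable arrays, so copying both detaches from Arrow's buffer and produces mutable memory. A quick check of the numpy side:

```python
import numpy as np

buf = bytes(8)  # immutable Python bytes, standing in for an Arrow buffer view
view = np.frombuffer(buf, dtype=np.uint16)
assert not view.flags.writeable  # zero-copy view over immutable memory is read-only

owned = view.copy()
assert owned.flags.writeable     # the copy owns fresh, mutable memory
```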

    np_uint16 = np.frombuffer(buf, dtype=np.uint16, count=len(storage), offset=offset)
    tensor = torch.from_numpy(np_uint16.copy()).view(torch.bfloat16)
    if arr.null_count > 0:
        null_mask = torch.from_numpy(arr.is_null().to_numpy(zero_copy_only=False))

Seems like there should be a way to do this without a copy but maybe not.

    if uint64_as_int64 and tensor.dtype == torch.uint64:
    if (
        uint64_as_int64 and tensor.dtype == torch.uint64
    ):  # ← inside numeric branch ✓

Kind of a strange comment. I'm not really sure what it means.

Lance Release Bot and others added 10 commits March 31, 2026 13:12
## Summary

- Converts `MAX_MINIBLOCK_VALUES` from a compile-time constant to a
`LazyLock<u64>` that reads from the `LANCE_MINIBLOCK_MAX_VALUES`
environment variable (default `4096`)
- Updates all 6 usage sites across the encoding crate to dereference the
`LazyLock`
- Adds documentation in `docs/src/format/file/encoding.md` explaining
the tuning knob and when it's useful

Closes #6140

## Test plan

- [x] All existing miniblock tests pass (10 tests)
- [x] All existing RLE tests pass (22 tests)
- [x] `cargo clippy -p lance-encoding --tests -- -D warnings` clean
- [ ] ~~Verify with a custom `LANCE_MINIBLOCK_MAX_VALUES` value that
smaller mini-blocks are produced~~

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
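The lazy, read-once behavior described above can be sketched in Python (the actual code is Rust's `LazyLock<u64>`; the helper name below is hypothetical):

```python
import functools
import os

DEFAULT_MINIBLOCK_MAX_VALUES = 4096  # default named in the commit message

@functools.lru_cache(maxsize=None)
def miniblock_max_values() -> int:
    # Python analogue of a Rust LazyLock<u64>: the environment variable is
    # consulted once, on first use, and the result is cached thereafter.
    raw = os.environ.get("LANCE_MINIBLOCK_MAX_VALUES")
    return int(raw) if raw else DEFAULT_MINIBLOCK_MAX_VALUES
```

As with `LazyLock`, later changes to the environment variable have no effect once the value has been read.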
…#6316)

## Summary
- Cloud providers (AWS, Azure, GCP) constructed `StorageOptions` using
the raw tuple struct constructor, bypassing `StorageOptions::new()`
which reads `OBJECT_STORE_CLIENT_MAX_RETRIES` and
`OBJECT_STORE_CLIENT_RETRY_TIMEOUT` from environment variables
- This meant those env vars were silently ignored — the default (10) was
always used regardless of what was set in the environment

## Test plan
- [ ] Set `OBJECT_STORE_CLIENT_MAX_RETRIES=1` and verify the object
store client respects it
- [ ] Verify `OBJECT_STORE_CLIENT_RETRY_TIMEOUT` is also picked up

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
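The shape of this bug can be illustrated in Python (the actual code is Rust; the class below is a hypothetical stand-in, not Lance's API): env-reading logic that lives only in one constructor is silently skipped when callers build the value directly.

```python
import os

class StorageOptions:
    """Sketch: env reading lives only in the `new` path."""

    def __init__(self, max_retries: int):
        # Raw construction: takes whatever it is given, never looks at env.
        self.max_retries = max_retries

    @classmethod
    def new(cls) -> "StorageOptions":
        # Only this constructor consults the environment variable.
        return cls(int(os.environ.get("OBJECT_STORE_CLIENT_MAX_RETRIES", "10")))

os.environ["OBJECT_STORE_CLIENT_MAX_RETRIES"] = "1"
assert StorageOptions.new().max_retries == 1  # env respected
assert StorageOptions(10).max_retries == 10   # raw path: env silently ignored
```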
…dices (#6351)

Problem:
  take() with sorted-but-duplicate indices (e.g., ML sliding window
  sampling) degrades from O(n) to O(n²). Benchmarked: 2.4M indices
  with 90% duplicates takes 57 seconds instead of 120ms.

Root cause (two bugs):
1. check_row_addrs() uses strict `>` to detect sorted order. Duplicate
adjacent values (addr == last_offset) are misclassified as "unsorted",
routing to the slow fallback path.
2. The unsorted fallback remaps via `.position()` linear scan — O(N*M)
where N=original indices, M=deduplicated rows. For 2.4M × 249K = 308
billion comparisons ≈ 57 seconds.

Fix:
1. Change `>` to `>=` in check_row_addrs so sorted-with-duplicates
correctly takes the fast "sorted" branch.
2. Replace `.position()` linear scan with HashMap lookup (O(1) per
element) as defense-in-depth for truly unsorted input.

Benchmark (1024 sliding windows, 2410 rows each, stride 241):
  Before: 57,000ms
  After:    120ms  (475× speedup)

The sorted fast path already handles duplicates correctly via
fragment-level take_as_batch() which has its own dedup logic.

Co-authored-by: YSBF <noreply@users.noreply.github.com>
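The second fix (replacing a per-element linear scan with a hash-map lookup) can be sketched in Python; this is illustrative only, the PR's code is Rust:

```python
def remap_positions_slow(indices, deduped):
    # O(N*M): a fresh linear scan per index -- the pre-fix fallback path.
    return [deduped.index(i) for i in indices]

def remap_positions_fast(indices, deduped):
    # O(N + M): build the position map once, then one O(1) lookup per index.
    pos = {row: p for p, row in enumerate(deduped)}
    return [pos[i] for i in indices]

indices = [10, 10, 20, 20, 30]  # sorted with duplicates, sliding-window style
deduped = sorted(set(indices))  # [10, 20, 30]
assert remap_positions_slow(indices, deduped) == remap_positions_fast(indices, deduped)
```

With 2.4M indices against ~249K deduplicated rows, the slow version does hundreds of billions of comparisons while the fast version does one dictionary probe per index, which matches the 57s → 120ms benchmark above.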
This removes the `ROW_ID` sort from distributed vector segment
finalization. The merge path now preserves incoming batch order instead
of canonicalizing partition contents by `ROW_ID`, which avoids the
`FixedSizeList` `take` overflow that was hitting very large IVF_PQ
partitions.

A focused Rust regression test now covers large `FixedSizeList`
partition batches and verifies we preserve input order without
triggering the Arrow overflow.
This fixes an OOM in vector index training when `fragment_ids` are
provided for a nullable FixedSizeList vector column. Instead of
materializing all selected training rows and then sampling, the nullable
fragment-filtered path now samples in chunks and filters nulls batch by
batch, keeping memory bounded by the requested sample size.
This fixes a bug in distributed vector index builds where any worker
that fell back to builder-local training sampled training data from the
entire dataset instead of the worker's selected fragments. The change
threads `fragment_filter` through IVF and quantizer training, and adds
an `IvfSq` regression test that verifies worker-local SQ bounds are
derived only from the shard fragments.
Expose the stable row ID manifest flag to Java via
`Dataset.usesStableRowIds()`.

Downstream consumers (e.g. lance-spark) need to check whether a dataset
was created with stable row IDs enabled. The Rust manifest tracks this
flag internally, but it was not accessible from the Java API.

The only workaround was to mirror the flag into `Dataset.getConfig()`
via an explicit `UpdateConfig` commit after table creation. This had two
consequences:

1. **Unnecessary version bump.** Every table creation with stable row
IDs produced two commits instead of one: the initial `CREATE`, followed
by a separate `UpdateConfig` commit just to persist the flag in the
config map. This means the table starts at version 2 instead of version
1.

2. **CDF (Change Data Feed) breakage.** The extra commit also breaks CDF:
CDF tracks row-level changes across dataset versions, and tests assert exact
version numbers for created/updated rows (e.g. "row inserted at version 1,
updated at version 3"). The extra version bump shifts every subsequent version
by +1, breaking all CDF assertions across the test suite.

With `Dataset.usesStableRowIds()`, consumers can read the manifest flag
directly: no config map mirror needed, no extra commit, no version
inflation.

A companion PR on lance-spark
(lance-format/lance-spark#351) adds the rest of
the needed changes.

**NOTE:** The changes are related but can be merged independently; no
synchronization is needed, and each can exist without the other.

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Will Jones <willjones127@gmail.com>
This PR wires IVF_RQ into the distributed vector index segment build
path so workers can build and commit shard-local RQ segments through the
existing distributed indexing workflow. It also adds the missing RQ
handling in the distributed auxiliary metadata plumbing and regression
coverage for distributed RQ segment build plus auxiliary merge.

Current limitation: IVF_RQ still does not have a pre-trained/shared RQ
metadata contract. Different RQ segments can therefore carry different
rotations and centroids, so merging multiple IVF_RQ segments into a
single physical segment is not a supported workflow yet. I plan to
follow up by adding shared/pre-trained RQ metadata so segment merge can
be supported safely.
@github-actions bot added the `java` label on Mar 31, 2026