Conversation

PR Review (P0 Bug)
```python
    Null values are replaced with NaN.
    """
    storage = arr.storage if isinstance(arr.type, pa.ExtensionType) else arr
    buf = storage.buffers()[1]
```
Should we do a sanity check that the data type of storage is a 16-bit type at this point?
```python
    buf = storage.buffers()[1]
    offset = storage.offset * 2  # 2 bytes per bf16 value
    np_uint16 = np.frombuffer(buf, dtype=np.uint16, count=len(storage), offset=offset)
    tensor = torch.from_numpy(np_uint16.copy()).view(torch.bfloat16)
```
Is the copy here so that the resulting buffer can be mutable?
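Most likely, yes: `np.frombuffer` over an immutable buffer yields a read-only array, and `torch.from_numpy` expects a writable one. The copy also detaches the tensor's lifetime from the Arrow buffer. A NumPy-only sketch of the read-only behavior:

```python
import numpy as np

buf = bytes(4)  # an immutable buffer, like an Arrow data buffer
a = np.frombuffer(buf, dtype=np.uint16)
# frombuffer over an immutable object yields a read-only view...
assert not a.flags.writeable
# ...so a copy is needed before handing it to torch.from_numpy, which
# wants a writable array; the copy also decouples the resulting tensor
# from the Arrow buffer's lifetime.
b = a.copy()
assert b.flags.writeable
```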
```python
    np_uint16 = np.frombuffer(buf, dtype=np.uint16, count=len(storage), offset=offset)
    tensor = torch.from_numpy(np_uint16.copy()).view(torch.bfloat16)
    if arr.null_count > 0:
        null_mask = torch.from_numpy(arr.is_null().to_numpy(zero_copy_only=False))
```
Seems like there should be a way to do this without a copy but maybe not.
```diff
-    if uint64_as_int64 and tensor.dtype == torch.uint64:
+    if (
+        uint64_as_int64 and tensor.dtype == torch.uint64
+    ):  # ← inside numeric branch ✓
```
Kind of a strange comment. I'm not really sure what it means.
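For context, the flag appears to control bit-reinterpreting uint64 values as int64 (values at or above 2**63 wrap negative). A NumPy sketch of that reinterpretation, not the PR's code:

```python
import numpy as np

a = np.array([1, 2**63], dtype=np.uint64)
# View the same 64 bits as signed: values >= 2**63 wrap to negative.
b = a.view(np.int64)
```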
## Summary

- Converts `MAX_MINIBLOCK_VALUES` from a compile-time constant to a `LazyLock<u64>` that reads from the `LANCE_MINIBLOCK_MAX_VALUES` environment variable (default `4096`)
- Updates all 6 usage sites across the encoding crate to dereference the `LazyLock`
- Adds documentation in `docs/src/format/file/encoding.md` explaining the tuning knob and when it's useful

Closes #6140

## Test plan

- [x] All existing miniblock tests pass (10 tests)
- [x] All existing RLE tests pass (22 tests)
- [x] `cargo clippy -p lance-encoding --tests -- -D warnings` clean
- [ ] ~~Verify with a custom `LANCE_MINIBLOCK_MAX_VALUES` value that smaller mini-blocks are produced~~

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
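A Python analogue of the read-once tuning knob described above (the actual implementation is a Rust `LazyLock<u64>`; `lru_cache` plays the same role of caching the first read):

```python
import functools
import os

@functools.lru_cache(maxsize=None)
def max_miniblock_values() -> int:
    # Read the LANCE_MINIBLOCK_MAX_VALUES tuning knob once, defaulting to
    # 4096; later calls return the cached value, like LazyLock in Rust.
    return int(os.environ.get("LANCE_MINIBLOCK_MAX_VALUES", "4096"))
```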
…#6316)

## Summary

- Cloud providers (AWS, Azure, GCP) constructed `StorageOptions` using the raw tuple struct constructor, bypassing `StorageOptions::new()`, which reads `OBJECT_STORE_CLIENT_MAX_RETRIES` and `OBJECT_STORE_CLIENT_RETRY_TIMEOUT` from environment variables
- This meant those env vars were silently ignored: the default (10) was always used regardless of what was set in the environment

## Test plan

- [ ] Set `OBJECT_STORE_CLIENT_MAX_RETRIES=1` and verify the object store client respects it
- [ ] Verify `OBJECT_STORE_CLIENT_RETRY_TIMEOUT` is also picked up

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
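A sketch of the intended lookup behavior (not the Rust code): read the retry count from the environment, falling back to the default of 10 only when the variable is unset. Constructing the tuple struct directly skipped exactly this step.

```python
import os

def client_max_retries(default: int = 10) -> int:
    # Mirrors what StorageOptions::new() is supposed to do: honor
    # OBJECT_STORE_CLIENT_MAX_RETRIES when set, else use the default.
    raw = os.environ.get("OBJECT_STORE_CLIENT_MAX_RETRIES")
    return int(raw) if raw is not None else default
```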
…dices (#6351)

Problem: take() with sorted-but-duplicate indices (e.g., ML sliding-window sampling) degrades from O(n) to O(n²). Benchmarked: 2.4M indices with 90% duplicates takes 57 seconds instead of 120ms.

Root cause (two bugs):

1. check_row_addrs() uses strict `>` to detect sorted order. Duplicate adjacent values (addr == last_offset) are misclassified as "unsorted", routing to the slow fallback path.
2. The unsorted fallback remaps via a `.position()` linear scan: O(N*M), where N = original indices and M = deduplicated rows. For 2.4M × 249K = 308 billion comparisons ≈ 57 seconds.

Fix:

1. Change `>` to `>=` in check_row_addrs so sorted-with-duplicates correctly takes the fast "sorted" branch.
2. Replace the `.position()` linear scan with a HashMap lookup (O(1) per element) as defense-in-depth for truly unsorted input.

Benchmark (1024 sliding windows, 2410 rows each, stride 241):

- Before: 57,000ms
- After: 120ms (475× speedup)

The sorted fast path already handles duplicates correctly via fragment-level take_as_batch(), which has its own dedup logic.

Co-authored-by: YSBF <noreply@users.noreply.github.com>
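Both fixes can be sketched in a few lines of Python (illustrative only; the real code is Rust):

```python
def is_sorted_strict(addrs):
    # Bug: strict '>' treats duplicate adjacent addresses as "unsorted".
    return all(b > a for a, b in zip(addrs, addrs[1:]))

def is_sorted(addrs):
    # Fix: '>=' keeps sorted-with-duplicates on the fast path.
    return all(b >= a for a, b in zip(addrs, addrs[1:]))

def remap(indices, dedup):
    # Defense-in-depth for truly unsorted input: a hash map from
    # deduplicated row address to position makes remapping O(1) per
    # element instead of an O(N*M) linear scan.
    pos = {v: i for i, v in enumerate(dedup)}
    return [pos[v] for v in indices]
```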
This removes the `ROW_ID` sort from distributed vector segment finalization. The merge path now preserves incoming batch order instead of canonicalizing partition contents by `ROW_ID`, which avoids the `FixedSizeList` `take` overflow that was hitting very large IVF_PQ partitions. A focused Rust regression test now covers large `FixedSizeList` partition batches and verifies we preserve input order without triggering the Arrow overflow.
This fixes an OOM in vector index training when `fragment_ids` are provided for a nullable FixedSizeList vector column. Instead of materializing all selected training rows and then sampling, the nullable fragment-filtered path now samples in chunks and filters nulls batch by batch, keeping memory bounded by the requested sample size.
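The bounded-memory idea can be sketched with reservoir sampling over batches (a simplification, not Lance's implementation; the real path works on Arrow batches of vectors):

```python
import random

def sample_rows(batches, k, seed=0):
    # Stream batches, filter nulls batch by batch, and keep at most k rows
    # in memory (Algorithm R), instead of materializing every selected
    # training row before sampling.
    rng = random.Random(seed)
    reservoir, seen = [], 0
    for batch in batches:
        for row in batch:
            if row is None:  # drop null vectors as they stream past
                continue
            seen += 1
            if len(reservoir) < k:
                reservoir.append(row)
            else:
                j = rng.randrange(seen)
                if j < k:
                    reservoir[j] = row
    return reservoir
```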
This fixes a bug in distributed vector index builds where any worker that fell back to builder-local training sampled training data from the entire dataset instead of the worker's selected fragments. The change threads `fragment_filter` through IVF and quantizer training, and adds an `IvfSq` regression test that verifies worker-local SQ bounds are derived only from the shard fragments.
Expose the stable row ID manifest flag to Java via `Dataset.usesStableRowIds()`.

Downstream consumers (e.g. lance-spark) need to check whether a dataset was created with stable row IDs enabled. The Rust manifest tracks this flag internally, but it was not accessible from the Java API. The only workaround was to mirror the flag into `Dataset.getConfig()` via an explicit `UpdateConfig` commit after table creation. This had two consequences:

1. **Unnecessary version bump.** Every table creation with stable row IDs produced two commits instead of one: the initial `CREATE`, followed by a separate `UpdateConfig` commit just to persist the flag in the config map. This means the table starts at version 2 instead of version 1.
2. **CDF (Change Data Feed) breakage.** CDF tracks row-level changes across dataset versions, and tests assert exact version numbers for created/updated rows (e.g. "row inserted at version 1, updated at version 3"). The extra version bump shifts every subsequent version by +1, breaking all CDF assertions across the test suite.

With `Dataset.usesStableRowIds()`, consumers can read the manifest flag directly: no config map mirror needed, no extra commit, no version inflation.

A companion PR on lance-spark (lance-format/lance-spark#351) adds the rest of the needed changes.

**NOTE:** The changes are related but can be merged independently; no synchronization is needed, and each can exist without the other.

---------

Co-authored-by: Will Jones <willjones127@gmail.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Will Jones <willjones127@gmail.com>
This PR wires IVF_RQ into the distributed vector index segment build path so workers can build and commit shard-local RQ segments through the existing distributed indexing workflow. It also adds the missing RQ handling in the distributed auxiliary metadata plumbing and regression coverage for distributed RQ segment build plus auxiliary merge. Current limitation: IVF_RQ still does not have a pre-trained/shared RQ metadata contract. Different RQ segments can therefore carry different rotations and centroids, so merging multiple IVF_RQ segments into a single physical segment is not a supported workflow yet. I plan to follow up by adding shared/pre-trained RQ metadata so segment merge can be supported safely.
Summary
Support round-tripping bf16 arrays to and from PyTorch.

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
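As a rough illustration of the round trip (NumPy-only; the PR itself goes through `torch.bfloat16`): bf16 is the upper 16 bits of an IEEE float32, so truncating and re-widening the bits round-trips any value that is exactly representable in bf16.

```python
import numpy as np

def f32_to_bf16_bits(x):
    # Keep the top 16 bits of each float32 (truncation, no rounding).
    return (x.astype(np.float32).view(np.uint32) >> 16).astype(np.uint16)

def bf16_bits_to_f32(b):
    # Re-widen: shift the bf16 bit pattern back into the float32 high half.
    return (b.astype(np.uint32) << 16).view(np.float32)

x = np.array([1.0, -2.5], dtype=np.float32)  # both exact in bf16
bits = f32_to_bf16_bits(x)
y = bf16_bits_to_f32(bits)
```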