perf(encoding): inline mini-block chunk metadata into page layout proto#6851
perf(encoding): inline mini-block chunk metadata into page layout proto#6851westonpace wants to merge 1 commit into
Conversation
The per-chunk "Words" metadata buffer (a few KB per page) is now also embedded in the MiniBlockLayout protobuf via a new optional field `inline_chunk_meta`. ColumnMetadata is read in a single coalesced I/O at file open, so new readers can populate the chunk metadata from the proto and skip the per-column metadata fetch in MiniBlockScheduler::initialize. When the page has no dictionary and no repetition index, initialize issues zero I/O — addressing the per-column read on the cold search-cache path called out in #4888. The writer continues to emit the metadata buffer at buffer_offsets_and_sizes[0], so older readers (which ignore the unknown proto field) keep decoding correctly. No on-disk format version bump required.
There was a problem hiding this comment.
Claude Code Review
This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.
Tip: disable this comment in your organization's Code Review settings.
|
@claude review once |
| let serialized = | ||
| Self::serialize_miniblocks(compressed_data, rep_data, def_data, support_large_chunk)?; | ||
|
|
||
| // Copy the per-chunk "Words" metadata so we can also inline it into the page layout | ||
| // proto. New readers can then skip the per-column metadata I/O at initialize time; | ||
| // old readers still consume the on-disk copy at buffer_offsets_and_sizes[0]. | ||
| let inline_chunk_meta = serialized.metadata.as_ref().to_vec(); | ||
|
|
||
| // Metadata, Data, Dictionary, (maybe) Repetition Index | ||
| let mut data = Vec::with_capacity(4); | ||
| data.push(serialized.metadata); |
There was a problem hiding this comment.
🟡 Line 4630 does serialized.metadata.as_ref().to_vec(), which allocates a fresh Vec<u8> and memcpys the entire metadata buffer on every mini-block page write — even though serialized.metadata is a refcount-cloneable LanceBuffer that is then moved into data unchanged on line 4634. Mini-block metadata is only a few KB per page, so the overhead is small, but it can be avoided by restructuring serialize_miniblocks to build the metadata as bytes::Bytes first and then cloning it cheaply into both data[0] (via LanceBuffer::from_bytes) and the proto field.
Extended reasoning...
The avoidable copy. At primitive.rs:4630 the writer does:
let inline_chunk_meta = serialized.metadata.as_ref().to_vec();serialized.metadata is a LanceBuffer wrapping arrow_buffer::Buffer (buffer.rs:22), which is itself Arc-based and cheap to clone. as_ref() returns &[u8], and .to_vec() then performs an unconditional O(metadata_size) heap allocation + memcpy. On the next line (4634) the original serialized.metadata is pushed into data and consumed — so the program ends up holding two distinct copies of the same bytes: one as a LanceBuffer in data[0], and one as a Vec<u8> (later converted to bytes::Bytes zero-copy via Bytes::from(vec) in format.rs:568).
Step-by-step proof. For a single mini-block page write with N bytes of chunk metadata:
serialize_miniblocksreturnsserialized.metadata: LanceBuffer— backed by an Arc'd allocation A of N bytes.serialized.metadata.as_ref()borrows A →&[u8]of length N. No copy yet..to_vec()allocates a fresh heap buffer B of N bytes andmemcpys A → B. First copy.data.push(serialized.metadata)moves the original LanceBuffer (still pointing to A) into the page data.- Later,
inline_chunk_meta.map(bytes::Bytes::from)(format.rs:568) converts B into abytes::Byteszero-copy — no second memcpy, but B was already allocated in step 3.
So every page write now pays one extra N-byte allocation and memcpy that did not exist before this PR.
Why the existing code can't avoid it directly. The on-the-wire proto field is bytes::Bytes (prost-generated, schema-fixed), while LanceBuffer wraps arrow_buffer::Buffer with its own Arc scheme. There is currently no LanceBuffer → bytes::Bytes zero-copy path in the codebase (only the reverse, LanceBuffer::from_bytes at buffer.rs:91).
Addressing the 'not actionable' objection. One refutation argues that any conversion between arrow_buffer::Buffer and bytes::Bytes requires a memcpy, so this is unfixable. That overstates the case: the fix is to invert the construction order inside serialize_miniblocks — build the Words buffer into a Vec<u8> (or directly into a bytes::Bytes via BytesMut), then:
- Convert to
bytes::BytesviaBytes::from(vec)(zero-copy). - Use
LanceBuffer::from_bytes(bytes.clone(), alignment)fordata[0](refcount bump). - Clone the same
Bytesinto the proto field (refcount bump).
Both downstream consumers then share the same allocation; only one copy of the metadata bytes is alive. This is the same pattern already used elsewhere in the codebase. bytes::Bytes::from_owner (bytes ≥ 1.7) is an alternative if the metadata must be built directly into an arrow_buffer::Buffer.
Impact. Metadata is a few KB per page (one u16/u32 word per chunk), so the absolute regression per page is small and dominated by I/O and compression. The PR is otherwise a clear win on the read path. Flagging as a nit — worth a follow-up but not blocking.
The per-chunk "Words" metadata buffer (a few KB per page) is now also
embedded in the MiniBlockLayout protobuf via a new optional field
inline_chunk_meta. ColumnMetadata is read in a single coalesced I/O atfile open, so new readers can populate the chunk metadata from the proto
and skip the per-column metadata fetch in MiniBlockScheduler::initialize.
When the page has no dictionary and no repetition index, initialize
issues zero I/O — addressing the per-column read on the cold
search-cache path called out in #4888.
The writer continues to emit the metadata buffer at
buffer_offsets_and_sizes[0], so older readers (which ignore the unknown
proto field) keep decoding correctly. No on-disk format version bump
required.