Skip to content

perf(encoding): inline mini-block chunk metadata into page layout proto#6851

Open
westonpace wants to merge 1 commit into
mainfrom
claude/refine-local-plan-KIhA8
Open

perf(encoding): inline mini-block chunk metadata into page layout proto#6851
westonpace wants to merge 1 commit into
mainfrom
claude/refine-local-plan-KIhA8

Conversation

@westonpace
Copy link
Copy Markdown
Member

The per-chunk "Words" metadata buffer (a few KB per page) is now also
embedded in the MiniBlockLayout protobuf via a new optional field
inline_chunk_meta. ColumnMetadata is read in a single coalesced I/O at
file open, so new readers can populate the chunk metadata from the proto
and skip the per-column metadata fetch in MiniBlockScheduler::initialize.
When the page has no dictionary and no repetition index, initialize
issues zero I/O — addressing the per-column read on the cold
search-cache path called out in #4888.

The writer continues to emit the metadata buffer at
buffer_offsets_and_sizes[0], so older readers (which ignore the unknown
proto field) keep decoding correctly. No on-disk format version bump
required.

The per-chunk "Words" metadata buffer (a few KB per page) is now also
embedded in the MiniBlockLayout protobuf via a new optional field
`inline_chunk_meta`. ColumnMetadata is read in a single coalesced I/O at
file open, so new readers can populate the chunk metadata from the proto
and skip the per-column metadata fetch in MiniBlockScheduler::initialize.
When the page has no dictionary and no repetition index, initialize
issues zero I/O — addressing the per-column read on the cold
search-cache path called out in #4888.

The writer continues to emit the metadata buffer at
buffer_offsets_and_sizes[0], so older readers (which ignore the unknown
proto field) keep decoding correctly. No on-disk format version bump
required.
Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.

Tip: disable this comment in your organization's Code Review settings.

@westonpace
Copy link
Copy Markdown
Member Author

@claude review once

Comment on lines 4624 to 4634
let serialized =
Self::serialize_miniblocks(compressed_data, rep_data, def_data, support_large_chunk)?;

// Copy the per-chunk "Words" metadata so we can also inline it into the page layout
// proto. New readers can then skip the per-column metadata I/O at initialize time;
// old readers still consume the on-disk copy at buffer_offsets_and_sizes[0].
let inline_chunk_meta = serialized.metadata.as_ref().to_vec();

// Metadata, Data, Dictionary, (maybe) Repetition Index
let mut data = Vec::with_capacity(4);
data.push(serialized.metadata);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Line 4630 does serialized.metadata.as_ref().to_vec(), which allocates a fresh Vec<u8> and memcpys the entire metadata buffer on every mini-block page write — even though serialized.metadata is a refcount-cloneable LanceBuffer that is then moved into data unchanged on line 4634. Mini-block metadata is only a few KB per page, so the overhead is small, but it can be avoided by restructuring serialize_miniblocks to build the metadata as bytes::Bytes first and then cloning it cheaply into both data[0] (via LanceBuffer::from_bytes) and the proto field.

Extended reasoning...

The avoidable copy. At primitive.rs:4630 the writer does:

let inline_chunk_meta = serialized.metadata.as_ref().to_vec();

serialized.metadata is a LanceBuffer wrapping arrow_buffer::Buffer (buffer.rs:22), which is itself Arc-based and cheap to clone. as_ref() returns &[u8], and .to_vec() then performs an unconditional O(metadata_size) heap allocation + memcpy. On the next line (4634) the original serialized.metadata is pushed into data and consumed — so the program ends up holding two distinct copies of the same bytes: one as a LanceBuffer in data[0], and one as a Vec<u8> (later converted to bytes::Bytes zero-copy via Bytes::from(vec) in format.rs:568).

Step-by-step proof. For a single mini-block page write with N bytes of chunk metadata:

  1. serialize_miniblocks returns serialized.metadata: LanceBuffer — backed by an Arc'd allocation A of N bytes.
  2. serialized.metadata.as_ref() borrows A → &[u8] of length N. No copy yet.
  3. .to_vec() allocates a fresh heap buffer B of N bytes and memcpys A → B. First copy.
  4. data.push(serialized.metadata) moves the original LanceBuffer (still pointing to A) into the page data.
  5. Later, inline_chunk_meta.map(bytes::Bytes::from) (format.rs:568) converts B into a bytes::Bytes zero-copy — no second memcpy, but B was already allocated in step 3.

So every page write now pays one extra N-byte allocation and memcpy that did not exist before this PR.

Why the existing code can't avoid it directly. The on-the-wire proto field is bytes::Bytes (prost-generated, schema-fixed), while LanceBuffer wraps arrow_buffer::Buffer with its own Arc scheme. There is currently no LanceBuffer → bytes::Bytes zero-copy path in the codebase (only the reverse, LanceBuffer::from_bytes at buffer.rs:91).

Addressing the 'not actionable' objection. One refutation argues that any conversion between arrow_buffer::Buffer and bytes::Bytes requires a memcpy, so this is unfixable. That overstates the case: the fix is to invert the construction order inside serialize_miniblocks — build the Words buffer into a Vec<u8> (or directly into a bytes::Bytes via BytesMut), then:

  • Convert to bytes::Bytes via Bytes::from(vec) (zero-copy).
  • Use LanceBuffer::from_bytes(bytes.clone(), alignment) for data[0] (refcount bump).
  • Clone the same Bytes into the proto field (refcount bump).

Both downstream consumers then share the same allocation; only one copy of the metadata bytes is alive. This is the same pattern already used elsewhere in the codebase. bytes::Bytes::from_owner (bytes ≥ 1.7) is an alternative if the metadata must be built directly into an arrow_buffer::Buffer.

Impact. Metadata is a few KB per page (one u16/u32 word per chunk), so the absolute regression per page is small and dominated by I/O and compression. The PR is otherwise a clear win on the read path. Flagging as a nit — worth a follow-up but not blocking.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants