Store non-full pages in a BTreeSet, not a Vec#5071
Open
gefjon wants to merge 2 commits into
Open
Conversation
Reviving a previous patch I wrote during our (internal) TPCC experimentation. This has become important because, in addition to its performance implications, it makes row insertion locations deterministic regardless of datastore restarts, which previously they were not. Previously, restarting the datastore would re-order the `non_full_pages` list (i.e. sort it by increasing `PageIndex`, where normally it was not sorted), meaning that which page a new row would be inserted into depended on when the datastore was last restarted. With this patch, that is not the case: the `non_full_pages` are always kept in a deterministic order, so which page a new row goes into is also deterministic. Original commit message follows: And sort them by number of available var-len granules. This prevents an accidentally quadratic behavior where, for a table where the average row contains many var-len granules, after inserting a large number of rows, there would be a large number of pages in `non_full_pages` each of which had enough space for at least one fixed-len row part, but insufficient space for an actual row in practice due to insufficient var-len granules. Each insertion would then do a linear scan over `non_full_pages` before either inserting into the last page or allocating a new page which went to the end. Now, non-full pages are stored in a `BTreeSet` sorted by the number of free var-len granules, and the search for a useable page is done with a `BTreeSet::range` iterator for only the pages with enough granules. I think there may still be an off-by-one-ish bug here, where a page may have enough bytes in the gap that it could either store the fixed-len part or the var-len granules, but not both, but this fix hopefully will suffice for now.
Contributor
joshua-spacetime
left a comment
There was a problem hiding this comment.
I tested this on the keynote-2 benchmark, and I didn't observe a decrease in overall throughput fwiw.
During review, Joshua noticed that I had completely forgotten to update `non_full_pages` in the delete path. D'oh! Fixing this, it made the most sense to do a minor refactor to how `non_full_pages` was updated: rather than conditionalizing inclusion there on `page.is_full(fixed_row_size)`, we now check `page.available_var_len_granules() != 0`. This means we have to pass around `fixed_row_size` less, and reduces dedundancy. It did, however, mean rewriting and/or replacing a few methods on `Pages` more than the original patch had included.
Comment on lines
+343
to
+344
| if available_granules != 0 { | ||
| self.non_full_pages.insert((available_granules, page_index)); |
Contributor
There was a problem hiding this comment.
What about deletes that don't include any var_len_granules? If I'm reading this correctly, it won't record the page as non_full, right?
kim
reviewed
May 20, 2026
|
|
||
| impl<T: MemoryUsage> MemoryUsage for std::collections::BTreeSet<T> { | ||
| fn heap_usage(&self) -> usize { | ||
| // NB: this is best-effort, since we don't have a `capacity()` method on `BTreeMap`. |
Contributor
There was a problem hiding this comment.
Isn't this how B-trees work? Like, there isn't any excess, preallocated capacity 🤔
| // but we'd have to do some amount of reasoning to demonstrate it was correct | ||
| // based on the definition of `Page::clear`, | ||
| // and why bother? | ||
| .map(|idx| (self.pages[idx].available_var_len_granules(), PageIndex(idx as u64))) |
Contributor
There was a problem hiding this comment.
For my own education: why is it that the number of granules isn't known statically at this point (we just cleared all pages)? Does the size of a granule depend on the table schema or something?
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description of Changes
Reviving a previous patch I wrote during our (internal) TPCC experimentation. This has become important because, in addition to its performance implications, it makes row insertion locations deterministic regardless of datastore restarts, which previously they were not.
Previously, restarting the datastore would re-order the
non_full_pageslist (i.e. sort it by increasingPageIndex, where normally it was not sorted), meaning that which page a new row would be inserted into depended on when the datastore was last restarted.With this patch, that is not the case: the
non_full_pagesare always kept in a deterministic order, so which page a new row goes into is also deterministic.Original commit message follows:
And sort them by number of available var-len granules. This prevents an accidentally quadratic behavior where, for a table where the average row contains many var-len granules, after inserting a large number of rows, there would be a large number of pages in
non_full_pageseach of which had enough space for at least one fixed-len row part, but insufficient space for an actual row in practice due to insufficient var-len granules. Each insertion would then do a linear scan overnon_full_pagesbefore either inserting into the last page or allocating a new page which went to the end.Now, non-full pages are stored in a
BTreeSetsorted by the number of free var-len granules, and the search for a useable page is done with aBTreeSet::rangeiterator for only the pages with enough granules. I think there may still be an off-by-one-ish bug here, where a page may have enough bytes in the gap that it could either store the fixed-len part or the var-len granules, but not both, but this fix hopefully will suffice for now.API and ABI breaking changes
N/a
Expected complexity level and risk
2? Table code is a bit fiddly, and this path is performance-sensitive when inserting rows.
Testing