perf(fts): lazy-load PostingListReader metadata by wkalt · Pull Request #6873 · lance-format/lance

wkalt · 2026-05-20T14:36:25Z

To compute BM25 IDF, FtsStatsExec asks each segment "how many docs contain term X?" That question goes through InvertedPartition::load, which eagerly read two columns (length, max score) for every unique token in the partition before answering. A one-term query paid 8 * num_unique_tokens bytes per partition to get back a single LENGTH_COL row.

Modern v2 max_scores and lengths now live in tokio::sync::OnceCells. The stats path reads one LENGTH_COL row per (term, partition) via the new posting_len_for_token. The scoring path calls ensure_metadata_loaded once on first use, so warm-cache behavior is unchanged. Legacy v1 is untouched.

bm25_stats_for_terms and build_global_bm25_scorer are now async; four call sites await them. New rstest test_bm25_stats_for_terms_is_lazy (N=10/100/1000) asserts one posting-file row read per stats call.

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

BubbleCal · 2026-05-20T15:18:10Z

+            .iter()
+            .map(|part| {
+                let part = part.clone();
+                async move { part.inverted_list.ensure_metadata_loaded().await }


Say for a cold query, we will finally still need to call ensure_metadata_loaded here which would load the all lengths and max_scores, so we don't actually reduce IO, is this right?

codecov · 2026-05-20T16:09:07Z

Codecov Report

❌ Patch coverage is 70.05348% with 112 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
rust/lance-index/src/scalar/inverted/index.rs	69.94%	92 Missing and 15 partials ⚠️
rust/lance/src/io/exec/fts.rs	80.00%	0 Missing and 3 partials ⚠️
rust/lance-index/src/scalar/inverted.rs	33.33%	0 Missing and 2 partials ⚠️

📢 Thoughts on this report? Let us know!

BubbleCal · 2026-05-20T16:40:48Z

            base_scorer
        } else {
-            local_scorer = IndexBM25Scorer::new(self.partitions.iter().map(|part| part.as_ref()));
+            local_scorer = self.bm25_base_scorer(tokens.as_ref()).await?;


It seems this doesn't handle fuzzy query case, we will get 0 score for expanded tokens

BubbleCal · 2026-05-20T16:45:57Z

+            } => offsets.deep_size_of_children(context) + max_scores.deep_size_of_children(context),
+            PostingMetadata::V2 { metadata } => metadata
+                .get()
+                .map(|loaded| {


It looks like we would add the posting list into cache before filling the metadata, then max_scores and lengths won't be counted in the cache's stats

To compute BM25 IDF, FtsStatsExec asks each segment "how many docs contain term X?" That question goes through InvertedPartition::load, which eagerly read two columns (length, max score) for every unique token in the partition before answering. A one-term query paid 8 * num_unique_tokens bytes per partition to get back a single LENGTH_COL row. Modern v2 max_scores and lengths now live in tokio::sync::OnceCells. The stats path reads one LENGTH_COL row per (term, partition) via the new posting_len_for_token. The scoring path calls ensure_metadata_loaded once on first use, so warm-cache behavior is unchanged. Legacy v1 is untouched. bm25_stats_for_terms and build_global_bm25_scorer are now async; four call sites await them. New rstest test_bm25_stats_for_terms_is_lazy (N=10/100/1000) asserts one posting-file row read per stats call.

The first commit deferred the v2 bulk metadata load out of try_new but then forced it back via ensure_metadata_loaded at the top of bm25_search and inside posting_list. So the scoring path still paid the 8 * num_unique_tokens bulk read on cold first query. Per-token now goes all the way through. posting_list issues posting_batch and a new single-row posting_metadata_for_token in parallel, so a cold v2 partition resolves a K-token query with ~K rows of metadata instead of N. The outer bm25_search local-scorer fallback uses an async bm25_base_scorer built from df_for_term per term, removing its need for bulk metadata as well. ensure_metadata_loaded is still used by prewarm and read_all/into_builder where every posting list is touched anyway. test_posting_list_metadata_reads_scale_with_query_size asserts K posting_list calls produce K metadata rows on a 32-token partition.

claude Bot reviewed May 20, 2026

View reviewed changes

github-actions Bot added the performance label May 20, 2026

wkalt force-pushed the ticket/ent-1324/fts-stats-lazy-posting-metadata branch from 34b91de to 2c5af00 Compare May 20, 2026 15:14

BubbleCal reviewed May 20, 2026

View reviewed changes

wkalt force-pushed the ticket/ent-1324/fts-stats-lazy-posting-metadata branch from b680130 to 0331f4f Compare May 21, 2026 19:17

wkalt added 2 commits May 21, 2026 14:23

wkalt force-pushed the ticket/ent-1324/fts-stats-lazy-posting-metadata branch from 0331f4f to e1ca4cf Compare May 21, 2026 21:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf(fts): lazy-load PostingListReader metadata#6873

perf(fts): lazy-load PostingListReader metadata#6873
wkalt wants to merge 2 commits into
lance-format:mainfrom
wkalt:ticket/ent-1324/fts-stats-lazy-posting-metadata

wkalt commented May 20, 2026

Uh oh!

claude Bot left a comment

Uh oh!

BubbleCal May 20, 2026

Uh oh!

codecov Bot commented May 20, 2026 •

edited

Loading

Uh oh!

BubbleCal May 20, 2026

Uh oh!

BubbleCal May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

wkalt commented May 20, 2026

Uh oh!

claude Bot left a comment

Choose a reason for hiding this comment

Claude Code Review

Uh oh!

BubbleCal May 20, 2026

Choose a reason for hiding this comment

Uh oh!

codecov Bot commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

BubbleCal May 20, 2026

Choose a reason for hiding this comment

Uh oh!

BubbleCal May 20, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

codecov Bot commented May 20, 2026 •

edited

Loading