Skip to content

perf(fts): lazy-load PostingListReader metadata#6873

Open
wkalt wants to merge 2 commits into
lance-format:mainfrom
wkalt:ticket/ent-1324/fts-stats-lazy-posting-metadata
Open

perf(fts): lazy-load PostingListReader metadata#6873
wkalt wants to merge 2 commits into
lance-format:mainfrom
wkalt:ticket/ent-1324/fts-stats-lazy-posting-metadata

Conversation

@wkalt
Copy link
Copy Markdown
Contributor

@wkalt wkalt commented May 20, 2026

To compute BM25 IDF, FtsStatsExec asks each segment "how many docs contain term X?" That question goes through InvertedPartition::load, which eagerly read two columns (length, max score) for every unique token in the partition before answering. A one-term query paid 8 * num_unique_tokens bytes per partition to get back a single LENGTH_COL row.

Modern v2 max_scores and lengths now live in tokio::sync::OnceCells. The stats path reads one LENGTH_COL row per (term, partition) via the new posting_len_for_token. The scoring path calls ensure_metadata_loaded once on first use, so warm-cache behavior is unchanged. Legacy v1 is untouched.

bm25_stats_for_terms and build_global_bm25_scorer are now async; four call sites await them. New rstest test_bm25_stats_for_terms_is_lazy (N=10/100/1000) asserts one posting-file row read per stats call.

Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@wkalt wkalt force-pushed the ticket/ent-1324/fts-stats-lazy-posting-metadata branch from 34b91de to 2c5af00 Compare May 20, 2026 15:14
.iter()
.map(|part| {
let part = part.clone();
async move { part.inverted_list.ensure_metadata_loaded().await }
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Say for a cold query, we will finally still need to call ensure_metadata_loaded here which would load the all lengths and max_scores, so we don't actually reduce IO, is this right?

@codecov
Copy link
Copy Markdown

codecov Bot commented May 20, 2026

Codecov Report

❌ Patch coverage is 70.05348% with 112 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
rust/lance-index/src/scalar/inverted/index.rs 69.94% 92 Missing and 15 partials ⚠️
rust/lance/src/io/exec/fts.rs 80.00% 0 Missing and 3 partials ⚠️
rust/lance-index/src/scalar/inverted.rs 33.33% 0 Missing and 2 partials ⚠️

📢 Thoughts on this report? Let us know!

base_scorer
} else {
local_scorer = IndexBM25Scorer::new(self.partitions.iter().map(|part| part.as_ref()));
local_scorer = self.bm25_base_scorer(tokens.as_ref()).await?;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems this doesn't handle fuzzy query case, we will get 0 score for expanded tokens

} => offsets.deep_size_of_children(context) + max_scores.deep_size_of_children(context),
PostingMetadata::V2 { metadata } => metadata
.get()
.map(|loaded| {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like we would add the posting list into cache before filling the metadata, then max_scores and lengths won't be counted in the cache's stats

@wkalt wkalt force-pushed the ticket/ent-1324/fts-stats-lazy-posting-metadata branch from b680130 to 0331f4f Compare May 21, 2026 19:17
wkalt added 2 commits May 21, 2026 14:23
To compute BM25 IDF, FtsStatsExec asks each segment "how many docs
contain term X?" That question goes through InvertedPartition::load,
which eagerly read two columns (length, max score) for every unique
token in the partition before answering. A one-term query paid
8 * num_unique_tokens bytes per partition to get back a single
LENGTH_COL row.

Modern v2 max_scores and lengths now live in tokio::sync::OnceCells.
The stats path reads one LENGTH_COL row per (term, partition) via the
new posting_len_for_token. The scoring path calls ensure_metadata_loaded
once on first use, so warm-cache behavior is unchanged. Legacy v1 is
untouched.

bm25_stats_for_terms and build_global_bm25_scorer are now async; four
call sites await them. New rstest test_bm25_stats_for_terms_is_lazy
(N=10/100/1000) asserts one posting-file row read per stats call.
The first commit deferred the v2 bulk metadata load out of try_new but
then forced it back via ensure_metadata_loaded at the top of bm25_search
and inside posting_list. So the scoring path still paid the
8 * num_unique_tokens bulk read on cold first query.

Per-token now goes all the way through. posting_list issues posting_batch
and a new single-row posting_metadata_for_token in parallel, so a cold
v2 partition resolves a K-token query with ~K rows of metadata instead
of N. The outer bm25_search local-scorer fallback uses an async
bm25_base_scorer built from df_for_term per term, removing its need
for bulk metadata as well.

ensure_metadata_loaded is still used by prewarm and read_all/into_builder
where every posting list is touched anyway.

test_posting_list_metadata_reads_scale_with_query_size asserts K
posting_list calls produce K metadata rows on a 32-token partition.
@wkalt wkalt force-pushed the ticket/ent-1324/fts-stats-lazy-posting-metadata branch from 0331f4f to e1ca4cf Compare May 21, 2026 21:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants