Skip to content

refactor!: align distributed index build around segments#6313

Open
Xuanwo wants to merge 6 commits intomainfrom
xuanwo/distributed-index-uuid-flow
Open

refactor!: align distributed index build around segments#6313
Xuanwo wants to merge 6 commits intomainfrom
xuanwo/distributed-index-uuid-flow

Conversation

@Xuanwo
Copy link
Copy Markdown
Collaborator

@Xuanwo Xuanwo commented Mar 27, 2026

This refactor removes the old IndexSegment / segment-builder workflow from distributed vector indexing and aligns the public flow around IndexMetadata returned from fragment-level builds. Worker builds now produce uncommitted index metadata, callers optionally merge caller-defined metadata groups, and the final metadata set is committed as one logical index.

# Build uncommitted index metadata
segment_0 = dataset.create_index_uncommitted(...)
segment_1 = dataset.create_index_uncommitted(...)

# Merge segments if needed
merged_segment = dataset.merge_existing_index_segments([
    segment_0,
    segment_1,
])

# Commit segments
dataset.commit_existing_index_segments("vector_idx", "vector", [merged_segment])

@Xuanwo Xuanwo marked this pull request as ready for review March 27, 2026 12:21
@Xuanwo Xuanwo changed the title refactor: align distributed index build around segment uuids refactor!: align distributed index build around segment uuids Mar 27, 2026
@Xuanwo Xuanwo force-pushed the xuanwo/distributed-index-uuid-flow branch from b6410e0 to bfc8dad Compare March 27, 2026 12:27
@github-actions github-actions bot added the java label Mar 27, 2026
@codecov
Copy link
Copy Markdown

codecov bot commented Mar 27, 2026

Codecov Report

❌ Patch coverage is 94.09283% with 14 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
rust/lance/src/index.rs 93.57% 6 Missing and 1 partial ⚠️
rust/lance/src/index/vector/ivf.rs 80.64% 1 Missing and 5 partials ⚠️
rust/lance/src/dataset.rs 0.00% 1 Missing ⚠️

📢 Thoughts on this report? Let us know!

@wjones127
Copy link
Copy Markdown
Contributor

# Build index segments
segment_uuid_0 = dataset.create_index_uncommitted(...)
segment_uuid_1 = dataset.create_index_uncommitted(...)

# Merge index segments if needed
merged_uuid = dataset.merge_existing_index_segments([
    segment_uuid_0,
    segment_uuid_1,
])

# Commit index segments
dataset.commit_existing_index_segments("vector_idx", "vector", [merged_uuid])

issue: this creates a new commit workflow (though commit_existing_index_segments) which I worry clutters the API. Have you considered just reusing the existing dataset.commit() API?

For example, here's an alternative API for writing two index segments that uses the existing commit workflow:

segment_0: Transaction = dataset.create_index("vector", "IVF_PQ", fragments=[0, 1], committed=False)
segment_1: Transaction = dataset.create_index("vector", "IVF_PQ", fragments=[2, 3], committed=False)

operation = lance.LanceOperation.CreateIndex(...)
transaction = Transaction(segment_0.read_version, operation)

dataset.commit(transaction)

@Xuanwo
Copy link
Copy Markdown
Collaborator Author

Xuanwo commented Mar 27, 2026

operation = lance.LanceOperation.CreateIndex(...)
transaction = Transaction(segment_0.read_version, operation)

dataset.commit(transaction)

Thank you @wjones127! I think this API makes sense. Do you also suggest that we remove create_index_uncommitted at the same time?

Comment on lines +459 to +469
let sidecar_path = index_dir.child("_index_metadata.pb");
ds.object_store
.put(
&sidecar_path,
&table_pb::IndexMetadata::from(&sidecar).encode_to_vec(),
)
.await
.infer_error()?;
ds.commit_existing_index_segments(index_name, column, vec![index_id])
.await
.infer_error()?;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: is this written out as a method of inter-process communication? If so, this is inconsistent with how we handle that elsewhere. In distributed fragment writes, we return the metadata directly and leave it to the caller to choose how to send the data. They caller can write to object store like this, or they can send via another IPC mechanism. Is there a reason that couldn't be done here too?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, we used to return IndexMetadata directly but our internal users are not happy with that. They don't want to handle the details and perfer to have APIs more like our old ones where they just call create_index_uncommitted() -> commit_exising_index().

I'm open to provider both ways: we can return IndexMetdata and write the sidecar files at the sametime and users can choose how to handle them.

Do you feel that looks better to you?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm open to provider both ways: we can return IndexMetdata and write the sidecar files at the sametime

I'd rather not make them pay the IO cost to create that file and then have to clean it up later if the caller isn't going to use it. So I wouldn't want both.

My preference is to keep the API low-level and agnostic to the distributed framework.

We should try to make it easy to use these APIs. If there's something missing for making it easy to write IndexMetadata out, then perhaps we can add that. Maybe it would be good if we could make it just two lines:

metadata = create_index_uncommitted()
metadata.write_to("s3://bucket/path")

@wjones127
Copy link
Copy Markdown
Contributor

Do you also suggest that we remove create_index_uncommitted at the same time?

I'm not as sure about that. We definitely aren't consistent with how we handle uncommitted. For example, we use LanceFragment.create() for append:

fragment = lance.fragment.LanceFragment.create(base_dir, table)

But we use .execute_uncommitted for merge insert:

transaction, stats = (
dataset.merge_insert(on="id")
.when_matched_update_all()
.execute_uncommitted(updates)
)

I think most important would be to re-use Dataset.commit() as the main entry point to commit transactions.

@Xuanwo
Copy link
Copy Markdown
Collaborator Author

Xuanwo commented Mar 27, 2026

I think most important would be to re-use Dataset.commit() as the main entry point to commit transactions.

I agree with that.

We are luck of some utils that building Transaction from index segments. Do you think it's a good idea for this refactor to move on with commit_existing_index_segments (this API replaced the old commit_existing_index and not an entire new API).

I'm happy to work as a follow up #6317 to refactor this part and unify them all in the Dataset.commit().

@wjones127
Copy link
Copy Markdown
Contributor

We are luck of some utils that building Transaction from index segments. Do you think it's a good idea for this refactor to move on with commit_existing_index_segments (this API replaced the old commit_existing_index and not an entire new API).

Yeah, I agree we can do that refactor in a follow up.

@Xuanwo Xuanwo force-pushed the xuanwo/distributed-index-uuid-flow branch from 4bb7286 to cdb91ad Compare March 28, 2026 08:10
@Xuanwo Xuanwo requested a review from wjones127 March 31, 2026 05:20
@Xuanwo Xuanwo changed the title refactor!: align distributed index build around segment uuids refactor!: align distributed index build around segments Mar 31, 2026
),
IndexMetadata {
uuid: first_segment_uuid,
fragment_bitmap: Some(std::iter::once(target_fragments[0].id() as u32).collect()),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can this work?

Suggested change
fragment_bitmap: Some(std::iter::once(target_fragments[0].id() as u32).collect()),
fragment_bitmap: Some(vec![target_fragments[0].id() as u32]),

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants