refactor!: align distributed index build around segments#6313
refactor!: align distributed index build around segments#6313
Conversation
b6410e0 to
bfc8dad
Compare
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
issue: this creates a new commit workflow (though For example, here's an alternative API for writing two index segments that uses the existing commit workflow: segment_0: Transaction = dataset.create_index("vector", "IVF_PQ", fragments=[0, 1], committed=False)
segment_1: Transaction = dataset.create_index("vector", "IVF_PQ", fragments=[2, 3], committed=False)
operation = lance.LanceOperation.CreateIndex(...)
transaction = Transaction(segment_0.read_version, operation)
dataset.commit(transaction) |
Thank you @wjones127! I think this API makes sense. Do you also suggest that we remove |
python/src/indices.rs
Outdated
| let sidecar_path = index_dir.child("_index_metadata.pb"); | ||
| ds.object_store | ||
| .put( | ||
| &sidecar_path, | ||
| &table_pb::IndexMetadata::from(&sidecar).encode_to_vec(), | ||
| ) | ||
| .await | ||
| .infer_error()?; | ||
| ds.commit_existing_index_segments(index_name, column, vec![index_id]) | ||
| .await | ||
| .infer_error()?; |
There was a problem hiding this comment.
question: is this written out as a method of inter-process communication? If so, this is inconsistent with how we handle that elsewhere. In distributed fragment writes, we return the metadata directly and leave it to the caller to choose how to send the data. They caller can write to object store like this, or they can send via another IPC mechanism. Is there a reason that couldn't be done here too?
There was a problem hiding this comment.
Yep, we used to return IndexMetadata directly but our internal users are not happy with that. They don't want to handle the details and perfer to have APIs more like our old ones where they just call create_index_uncommitted() -> commit_exising_index().
I'm open to provider both ways: we can return IndexMetdata and write the sidecar files at the sametime and users can choose how to handle them.
Do you feel that looks better to you?
There was a problem hiding this comment.
I'm open to provider both ways: we can return
IndexMetdataand write the sidecar files at the sametime
I'd rather not make them pay the IO cost to create that file and then have to clean it up later if the caller isn't going to use it. So I wouldn't want both.
My preference is to keep the API low-level and agnostic to the distributed framework.
We should try to make it easy to use these APIs. If there's something missing for making it easy to write IndexMetadata out, then perhaps we can add that. Maybe it would be good if we could make it just two lines:
metadata = create_index_uncommitted()
metadata.write_to("s3://bucket/path")
I'm not as sure about that. We definitely aren't consistent with how we handle uncommitted. For example, we use lance/python/python/tests/test_dataset.py Line 1602 in 36e8b2d But we use lance/python/python/tests/test_dataset.py Lines 1697 to 1701 in 36e8b2d I think most important would be to re-use |
I agree with that. We are luck of some utils that building I'm happy to work as a follow up #6317 to refactor this part and unify them all in the |
Yeah, I agree we can do that refactor in a follow up. |
4bb7286 to
cdb91ad
Compare
| ), | ||
| IndexMetadata { | ||
| uuid: first_segment_uuid, | ||
| fragment_bitmap: Some(std::iter::once(target_fragments[0].id() as u32).collect()), |
There was a problem hiding this comment.
can this work?
| fragment_bitmap: Some(std::iter::once(target_fragments[0].id() as u32).collect()), | |
| fragment_bitmap: Some(vec![target_fragments[0].id() as u32]), |
This refactor removes the old
IndexSegment/ segment-builder workflow from distributed vector indexing and aligns the public flow aroundIndexMetadatareturned from fragment-level builds. Worker builds now produce uncommitted index metadata, callers optionally merge caller-defined metadata groups, and the final metadata set is committed as one logical index.