feat: add MemWAL sharding evaluator #6854
Lift bucket sharding initialization to persist the configured shard field independently from primary-key metadata.
Remove deprecated Region compatibility aliases from the Python MemWAL API and align raw bindings with Shard naming.
Xuanwo
left a comment
I think the Python sharding spec round-trip needs one fix before this lands.
field_id: get_py_value(field, "field_id")?.extract::<String>()?,
source_ids: get_py_value(field, "source_ids")?.extract::<Vec<i32>>()?,
transform: optional_string(get_py_value(field, "transform")?)?,
expression: optional_string(get_py_value(field, "expression")?)?,
This makes dict specs returned by Dataset.mem_wal_index_details() unusable with the new evaluator.
mem_wal_index_details() currently serializes each sharding field with field_id, source_ids, transform, result_type, and parameters, but it does not include expression. Since this parser now requires expression to be present, the natural flow below fails with Missing sharding spec field 'expression':
spec = ds.mem_wal_index_details()["sharding_specs"][0]
evaluate_sharding_spec(batch, spec, LanceSchema.from_pyarrow(batch.schema))
Could we either include expression in the dict returned by mem_wal_index_details() or treat a missing expression key as None here? I think adding it to mem_wal_index_details() is cleaner because that keeps the exported spec shape complete and round-trippable.
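For illustration only, the second option (treating a missing expression key as None) could look roughly like this on the Python side. This is a minimal sketch under assumptions: normalize_sharding_field is a hypothetical helper and is not part of the Lance API, it works on a single sharding field dict, and the sample values are invented. Only the key names come from the comment above.

```python
from typing import Any, Dict

# Keys that, per the review comment, mem_wal_index_details() currently
# serializes for each sharding field. "expression" is the one left out.
EXPORTED_KEYS = ("field_id", "source_ids", "transform", "result_type", "parameters")


def normalize_sharding_field(field: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: return a copy of a sharding field dict in which
    a missing 'expression' key is filled in as None.

    An 'expression' value that is already present is left unchanged, and
    the input dict is not mutated.
    """
    normalized = dict(field)
    normalized.setdefault("expression", None)
    return normalized


# Invented sample values, shaped like the exported keys listed above.
exported_field = {
    "field_id": "f0",
    "source_ids": [1],
    "transform": "bucket",
    "result_type": "int32",
    "parameters": {},
}

round_trippable = normalize_sharding_field(exported_field)
assert "expression" in round_trippable
```

A caller-side shim like this only papers over the gap; including expression in the dict returned by mem_wal_index_details(), as suggested above, would make it unnecessary.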
Submitted as changes requested by mistake; intended as a non-blocking review comment.
Adds an Arrow-native MemWAL sharding evaluator and exposes it through the Java API/JNI.
This is needed by lance-spark to route writes using Lance's sharding semantics instead of duplicating Spark-side bucket logic.