Skip to content

feat: add MemWAL sharding evaluator#6854

Open
jackye1995 wants to merge 4 commits into
lance-format:mainfrom
jackye1995:jack/arrow-native-memwal-sharding
Open

feat: add MemWAL sharding evaluator#6854
jackye1995 wants to merge 4 commits into
lance-format:mainfrom
jackye1995:jack/arrow-native-memwal-sharding

Conversation

@jackye1995
Copy link
Copy Markdown
Contributor

Adds an Arrow-native MemWAL sharding evaluator and exposes it through the Java API/JNI.

  • Evaluates MemWAL sharding specs against Arrow RecordBatch values for bucket, identity, and unsharded fields.
  • Resolves sharding source IDs through a Java-provided source-id-to-column map.
  • Adds Java-facing ShardingEvaluator returning an Arrow reader for the evaluated sharding key batch.

This is needed by lance-spark to route writes using Lance's sharding semantics instead of duplicating Spark-side bucket logic.

Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@github-actions github-actions Bot added enhancement New feature or request java labels May 20, 2026
Lift bucket sharding initialization to persist the configured shard field independently from primary-key metadata.
Remove deprecated Region compatibility aliases from the Python MemWAL API and align raw bindings with Shard naming.
@codecov
Copy link
Copy Markdown

codecov Bot commented May 20, 2026

Codecov Report

❌ Patch coverage is 78.84131% with 84 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
rust/lance/src/dataset/mem_wal/sharding.rs 79.16% 68 Missing and 12 partials ⚠️
rust/lance/src/dataset/mem_wal/api.rs 50.00% 3 Missing and 1 partial ⚠️

📢 Thoughts on this report? Let us know!

Xuanwo
Xuanwo previously requested changes May 20, 2026
Copy link
Copy Markdown
Collaborator

@Xuanwo Xuanwo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the Python sharding spec round-trip needs one fix before this lands.

Comment thread python/src/mem_wal.rs
field_id: get_py_value(field, "field_id")?.extract::<String>()?,
source_ids: get_py_value(field, "source_ids")?.extract::<Vec<i32>>()?,
transform: optional_string(get_py_value(field, "transform")?)?,
expression: optional_string(get_py_value(field, "expression")?)?,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes dict specs returned by Dataset.mem_wal_index_details() unusable with the new evaluator.

mem_wal_index_details() currently serializes each sharding field with field_id, source_ids, transform, result_type, and parameters, but it does not include expression. Since this parser now requires expression to be present, the natural flow below fails with Missing sharding spec field 'expression':

spec = ds.mem_wal_index_details()["sharding_specs"][0]
evaluate_sharding_spec(batch, spec, LanceSchema.from_pyarrow(batch.schema))

Could we either include expression in the dict returned by mem_wal_index_details() or treat a missing expression key as None here? I think adding it to mem_wal_index_details() is cleaner because that keeps the exported spec shape complete and round-trippable.

@Xuanwo Xuanwo dismissed their stale review May 20, 2026 07:19

Submitted as changes requested by mistake; intended as a non-blocking review comment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request java python

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants