feat: support Lance metadata stats #18901
|
Pushed. Root cause: the functional test still asserted that Lance tables must not initialize the column/partition stats metadata partitions, but this branch now adds Lance column stats support, and the common metadata test expects those partitions to be enabled where applicable. Change:
Local verification:
|
hudi-agent
left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for the contribution! This PR adds Lance support to readColumnStatsFromMetadata and removes the gating on the COLUMN_STATS / PARTITION_STATS metadata partitions for Lance tables. The implementation correctly handles null/empty column lists, missing columns, and nested fields, and it follows the same record-iteration pattern used for log files. No correctness issues found. A few naming, style, and simplification suggestions are in the inline comments. Please take a look, and this should be ready for a Hudi committer or PMC member to take it from here.
cc @yihua
|
@rahil-c Can you please take a look at this?
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@              Coverage Diff               @@
##             master    #18901       +/-   ##
=============================================
- Coverage     68.81%    44.84%    -23.97%
+ Complexity    29160      8572    -20588
=============================================
  Files          2520      1203     -1317
  Lines        140056     63019    -77037
  Branches      17209      6862    -10347
=============================================
- Hits          96373     28263    -68110
+ Misses        35909     31627     -4282
+ Partials       7774      3129     -4645
|
e734a41 to b84e6ab
|
@hudi-bot run azure
hudi-agent
left a comment
🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.
Thanks for working on this! The PR enables Lance metadata column stats and partition stats by implementing LanceUtils.readColumnStatsFromMetadata via a record-iteration approach (similar to the log file path), routing .lance files in readColumnRangeMetadataFrom, and removing the gates in MetadataPartitionType. One architectural question is worth surfacing: the inline comment asks about the performance trade-off of reading all records. Please take a look at the inline comment, and this should be ready for a Hudi committer or PMC member to take it from here. A couple of small readability suggestions below.
    return Collections.emptyList();
  }

  List<String> projectedColumns = fieldsToIndex.stream()
🤖 This path iterates every record in the file to compute stats. Since the Lance writer doesn't populate writeStat.columnStats, this full-file re-read will be triggered for every newly-written Lance base file on every commit (and for partition-stats aggregation on top of that). For Parquet, the equivalent path is cheap because it reads block-level statistics from the footer. @yihua have you considered (a) populating column stats on the writer side in HoodieBaseLanceWriter like Parquet does, or (b) exposing Lance's native per-fragment column statistics via the Java binding? Either would avoid the per-commit full-scan. Is the intent to land this read-based path now and optimize in a follow-up?
- AI-generated; verify before applying. React 👍/👎 to flag quality.
This patch keeps the read-based path so Lance column and partition stats can be generated now. The writer-side/native Lance stats path looks like larger follow-up work: the current Lance writer path only writes file data plus bloom/vector metadata, and I do not see an existing HoodieWriteStat column-stats hook to reuse there.
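For context on option (a) raised in the review, here is a minimal sketch of what write-time stats tracking could look like: the writer updates min/max as each value passes through, so no second pass over the file is needed at commit time. Everything here is hypothetical. `StatsTrackingWriterSketch` and its methods are illustrative names, the in-memory list stands in for the real file, and none of this reflects HoodieBaseLanceWriter or any Lance API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: illustrative names only, not Hudi or Lance APIs.
final class StatsTrackingWriterSketch {
  // Stand-in for the underlying file; a real writer would append to storage.
  private final List<Long> sink = new ArrayList<>();
  private long min = Long.MAX_VALUE;
  private long max = Long.MIN_VALUE;
  private long count;

  /** Writes one value and folds it into the running stats in the same pass. */
  void write(long value) {
    sink.add(value);
    min = Math.min(min, value);
    max = Math.max(max, value);
    count++;
  }

  /**
   * Returns {min, max, count}. With zero rows written, min and max are still
   * the sentinel values, so a caller should check count first.
   */
  long[] closeAndGetStats() {
    return new long[] {min, max, count};
  }
}
```

The trade-off in this sketch is a small per-record cost on the write path in exchange for skipping the per-commit re-read; whether that fits the real writer is exactly the follow-up question discussed above.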
|
@hudi-bot run azure
Describe the issue this Pull Request addresses
Closes #18758.
Summary and Changelog
Adds Lance support for metadata column statistics so Lance base files can contribute column ranges to the metadata table.
Changes:
- Implements LanceUtils.readColumnStatsFromMetadata by reading projected Lance columns and collecting HoodieColumnRangeMetadata with the existing metadata utility.
- Routes .lance base files through the column-range metadata reader.
No code was copied.
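To illustrate the read-based approach in general terms, the following is a rough sketch of collecting per-column ranges by iterating records. It is not the code in this PR: `ColumnRangeSketch`, `Range`, and `collect` are hypothetical names, records are modeled as plain maps rather than Lance rows, and treating a column that is absent from a record as a null value is an assumption made for the sketch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these names are Hudi or Lance APIs.
final class ColumnRangeSketch {

  /** Running min, max, null count, and value count for one column. */
  static final class Range {
    final String column;
    Comparable<Object> min;
    Comparable<Object> max;
    long nullCount;
    long valueCount;

    Range(String column) {
      this.column = column;
    }

    @SuppressWarnings("unchecked")
    void accept(Object value) {
      valueCount++;
      if (value == null) {
        nullCount++;
        return;
      }
      // Assumes all non-null values of a column share one Comparable type.
      Comparable<Object> v = (Comparable<Object>) value;
      if (min == null || v.compareTo(min) < 0) {
        min = v;
      }
      if (max == null || v.compareTo(max) > 0) {
        max = v;
      }
    }
  }

  /** Scans every record once and returns one Range per requested column. */
  static List<Range> collect(Iterable<Map<String, Object>> records, List<String> columns) {
    // Nothing to index: return early without touching the records.
    if (columns == null || columns.isEmpty()) {
      return Collections.emptyList();
    }
    Map<String, Range> ranges = new LinkedHashMap<>();
    for (String column : columns) {
      ranges.put(column, new Range(column));
    }
    for (Map<String, Object> record : records) {
      for (Range range : ranges.values()) {
        // Assumption: a column missing from the record counts as null.
        range.accept(record.get(range.column));
      }
    }
    return new ArrayList<>(ranges.values());
  }
}
```

Because every record is visited, the cost of this pattern grows with file size, which is the trade-off raised in the review discussion.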
Impact
Lance base files can now populate metadata column stats. Partition stats can also be enabled for partitioned Lance tables because they aggregate per-file column stats. No public API, storage format, or config key changes are introduced.
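The statement that partition stats aggregate per-file column stats can be pictured with a small sketch: per-file ranges for the same column are merged by taking the smallest min, the largest max, and the sum of null counts. This is only an illustration under simplifying assumptions. `PartitionStatsSketch` and `FileColumnStats` are hypothetical names, values are restricted to `long`, and Hudi's actual aggregation logic is not reproduced here.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: illustrative names, not Hudi's partition-stats code.
final class PartitionStatsSketch {

  /** Column range for a single file (or, after merging, for a partition). */
  static final class FileColumnStats {
    final String column;
    final long min;
    final long max;
    final long nullCount;

    FileColumnStats(String column, long min, long max, long nullCount) {
      this.column = column;
      this.min = min;
      this.max = max;
      this.nullCount = nullCount;
    }
  }

  /** Merges two ranges of the same column into one covering both. */
  static FileColumnStats merge(FileColumnStats a, FileColumnStats b) {
    return new FileColumnStats(
        a.column,
        Math.min(a.min, b.min),
        Math.max(a.max, b.max),
        a.nullCount + b.nullCount);
  }

  /** Folds the per-file stats of one partition into one entry per column. */
  static Map<String, FileColumnStats> aggregate(List<FileColumnStats> perFileStats) {
    Map<String, FileColumnStats> byColumn = new LinkedHashMap<>();
    for (FileColumnStats stats : perFileStats) {
      byColumn.merge(stats.column, stats, PartitionStatsSketch::merge);
    }
    return byColumn;
  }
}
```

In this picture, a file format that produces no per-file column stats has nothing to aggregate, which is why enabling Lance column stats is what makes partition stats possible for Lance tables.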
Risk Level
medium
This changes metadata-table behavior for Lance tables by enabling existing stats indexes for that file format. Verification covers the Lance stats reader path and the metadata partition enablement gate.
Documentation Update
none
No new config is added; this enables existing column/partition stats behavior for Lance.
Contributor's checklist
Local verification:
mvn test -q -Punit-tests -pl hudi-spark-datasource/hudi-spark -am -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false -Dtest=TestHoodieSparkLanceReader#testReadColumnStatsFromMetadata -DwildcardSuites=abc -Dspark3.5 -Dlance.skip.tests=false -Dmaven.repo.local=/private/tmp/hudi-18758-m2
mvn test -q -Punit-tests -pl hudi-common -am -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false -Dtest=TestMetadataPartitionType#testColumnAndPartitionStatsEnabledForLanceTables -DwildcardSuites=abc -Dmaven.repo.local=/private/tmp/hudi-18758-m2
git diff --check