feat: support Lance metadata stats #18901

Open
puneetdixit200 wants to merge 4 commits into apache:master from puneetdixit200:fix/18758-lance-stats

@puneetdixit200

Describe the issue this Pull Request addresses

Closes #18758.

Summary and Changelog

Adds Lance support for metadata column statistics so Lance base files can contribute column ranges to the metadata table.

Changes:

  • Implement LanceUtils.readColumnStatsFromMetadata by reading projected Lance columns and collecting HoodieColumnRangeMetadata with the existing metadata utility.
  • Route .lance base files through the column-range metadata reader.
  • Enable column stats and partition stats metadata partitions for Lance tables now that Lance column ranges can be produced.
  • Add focused regression coverage for Lance column range metadata and Lance metadata partition enablement.
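As a rough illustration of the read-based approach in the first bullet, the sketch below scans records once and tracks min, max, null count, and value count for the projected columns only. `ColumnRange` and `ColumnRangeScan` are hypothetical stand-ins, not the real `HoodieColumnRangeMetadata` or `LanceUtils` code, and counting a missing column as null is an assumption made for this sketch.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for HoodieColumnRangeMetadata; not the real Hudi class.
final class ColumnRange {
  final String column;
  Object min;
  Object max;
  long nullCount;
  long valueCount;

  ColumnRange(String column) {
    this.column = column;
  }

  // Folds one value into the running range; nulls only affect the counters.
  @SuppressWarnings("unchecked")
  void accept(Object value) {
    valueCount++;
    if (value == null) {
      nullCount++;
      return;
    }
    Comparable<Object> comparable = (Comparable<Object>) value;
    if (min == null || comparable.compareTo(min) < 0) {
      min = value;
    }
    if (max == null || comparable.compareTo(max) > 0) {
      max = value;
    }
  }
}

// Hypothetical helper; records are modeled as plain maps from column name to value.
final class ColumnRangeScan {
  // Scans every record once and tracks only the requested columns.
  static Map<String, ColumnRange> collect(Iterable<Map<String, Object>> records, List<String> columns) {
    Map<String, ColumnRange> ranges = new LinkedHashMap<>();
    if (columns == null || columns.isEmpty()) {
      return ranges; // nothing to index
    }
    for (String column : columns) {
      ranges.put(column, new ColumnRange(column));
    }
    for (Map<String, Object> record : records) {
      for (ColumnRange range : ranges.values()) {
        // Assumption: a column absent from the record is counted as null.
        range.accept(record.get(range.column));
      }
    }
    return ranges;
  }
}
```

The cost of this pattern grows with the number of records in the file, since no precomputed footer statistics are consulted.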

No code was copied.

Impact

Lance base files can now populate metadata column stats. Partition stats can also be enabled for partitioned Lance tables because they aggregate per-file column stats. No public API, storage format, or config key changes are introduced.
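To make the "aggregate per-file column stats" point concrete, here is a minimal sketch of folding per-file ranges into one range per column for a partition. `FileRange` and `PartitionRangeAggregator` are hypothetical names, and the sketch assumes long-typed columns for brevity; the real Hudi aggregation is not shown in this PR description.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical per-file range for a single long-typed column; not the real Hudi class.
final class FileRange {
  final String column;
  final long min;
  final long max;
  final long nullCount;
  final long valueCount;

  FileRange(String column, long min, long max, long nullCount, long valueCount) {
    this.column = column;
    this.min = min;
    this.max = max;
    this.nullCount = nullCount;
    this.valueCount = valueCount;
  }

  // Widens the range and sums the counters; both inputs must describe the same column.
  FileRange merge(FileRange other) {
    return new FileRange(
        column,
        Math.min(min, other.min),
        Math.max(max, other.max),
        nullCount + other.nullCount,
        valueCount + other.valueCount);
  }
}

// Hypothetical helper that folds the per-file ranges of one partition.
final class PartitionRangeAggregator {
  static Map<String, FileRange> aggregate(List<FileRange> perFileRanges) {
    Map<String, FileRange> byColumn = new LinkedHashMap<>();
    for (FileRange range : perFileRanges) {
      // First range for a column is stored as-is; later ones are merged into it.
      byColumn.merge(range.column, range, FileRange::merge);
    }
    return byColumn;
  }
}
```

Because the partition-level range is derived purely from file-level ranges, it can only exist once every base file format in the table can produce those file-level ranges.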

Risk Level

medium

This changes metadata-table behavior for Lance tables by enabling existing stats indexes for that file format. Verification covers the Lance stats reader path and the metadata partition enablement gate.

Documentation Update

none

No new config is added; this enables existing column/partition stats behavior for Lance.

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

Local verification:

  • mvn test -q -Punit-tests -pl hudi-spark-datasource/hudi-spark -am -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false -Dtest=TestHoodieSparkLanceReader#testReadColumnStatsFromMetadata -DwildcardSuites=abc -Dspark3.5 -Dlance.skip.tests=false -Dmaven.repo.local=/private/tmp/hudi-18758-m2
  • mvn test -q -Punit-tests -pl hudi-common -am -DfailIfNoTests=false -Dsurefire.failIfNoSpecifiedTests=false -Dtest=TestMetadataPartitionType#testColumnAndPartitionStatsEnabledForLanceTables -DwildcardSuites=abc -Dmaven.repo.local=/private/tmp/hudi-18758-m2
  • git diff --check

@github-actions (bot) added the size:M label (PR with lines of changes in (100, 300]) on Jun 2, 2026
@puneetdixit200
Author

Pushed e734a41b1 for the Azure failure on TestLanceDataSource.testSqlCommands.

Root cause: the functional test still asserted that Lance tables must not initialize column/partition stats metadata, but this branch now adds Lance column stats support and the common metadata test expects those partitions to be enabled where applicable.

Change:

  • COLUMN_STATS is now asserted available for Lance tables.
  • PARTITION_STATS availability is asserted to match whether the table is partitioned.
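The updated expectation can be summarized with a small sketch. `StatsIndex` and `LanceStatsExpectation` are hypothetical names for illustration only; the actual assertions live in TestLanceDataSource and use Hudi's own metadata partition types.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical enum mirroring the two metadata partitions discussed here.
enum StatsIndex { COLUMN_STATS, PARTITION_STATS }

// Hypothetical helper expressing what the functional test now expects for Lance tables.
final class LanceStatsExpectation {
  // COLUMN_STATS is always expected; PARTITION_STATS only when the table is partitioned.
  static Set<StatsIndex> expectedForLance(boolean partitioned) {
    return partitioned
        ? EnumSet.of(StatsIndex.COLUMN_STATS, StatsIndex.PARTITION_STATS)
        : EnumSet.of(StatsIndex.COLUMN_STATS);
  }
}
```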

Local verification:

  • git diff --check passed.
  • I attempted a focused Maven test-compile, but this Mac has no JDK usable for the Hudi build: Maven on Java 25 fails on Lombok-generated members in upstream hudi-io, and the installed Java 8 runtime is a JRE without tools.jar. The pushed CI run should give the real signal for the Spark functional test.

Contributor

@hudi-agent hudi-agent left a comment

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for the contribution! This PR adds Lance support to readColumnStatsFromMetadata and removes the gating on the COLUMN_STATS / PARTITION_STATS metadata partitions for Lance tables. The implementation correctly handles null/empty column lists, missing columns, and nested fields, and it follows the same record-iteration pattern used for log files. No correctness issues found. The inline comments carry a few style, naming, and simplification suggestions. Please take a look; after that, this should be ready for a Hudi committer or PMC member to take it from here.

cc @yihua

@voonhous
Member

voonhous commented Jun 3, 2026

@rahil-c Can you please take a look at this?

@voonhous voonhous requested a review from rahil-c June 3, 2026 11:26
@codecov-commenter

Codecov Report

❌ Patch coverage is 6.06061% with 31 lines in your changes missing coverage. Please review.
✅ Project coverage is 44.84%. Comparing base (b7adecc) to head (e734a41).
⚠️ Report is 5 commits behind head on master.

Files with missing lines                                  Patch %  Lines
...n/java/org/apache/hudi/common/util/LanceUtils.java     0.00%    27 Missing ⚠️
.../apache/hudi/metadata/HoodieTableMetadataUtil.java     0.00%    3 Missing and 1 partial ⚠️

❗ There is a different number of reports uploaded between BASE (b7adecc) and HEAD (e734a41). Click for more details.

HEAD has 30 uploads less than BASE
Flag                        BASE (b7adecc)  HEAD (e734a41)
spark-scala-tests           12              0
spark-client-hadoop-common  1               0
spark-java-tests            15              0
utilities                   1               0
common-and-other-modules    1               0
Additional details and impacted files
@@              Coverage Diff              @@
##             master   #18901       +/-   ##
=============================================
- Coverage     68.81%   44.84%   -23.97%     
+ Complexity    29160     8572    -20588     
=============================================
  Files          2520     1203     -1317     
  Lines        140056    63019    -77037     
  Branches      17209     6862    -10347     
=============================================
- Hits          96373    28263    -68110     
+ Misses        35909    31627     -4282     
+ Partials       7774     3129     -4645     
Flag                        Coverage Δ
common-and-other-modules    ?
hadoop-mr-java-client       44.84% <6.06%> (-0.04%) ⬇️
spark-client-hadoop-common  ?
spark-java-tests            ?
spark-scala-tests           ?
utilities                   ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines                                  Coverage Δ
...rg/apache/hudi/metadata/MetadataPartitionType.java     51.70% <100.00%> (-31.45%) ⬇️
.../apache/hudi/metadata/HoodieTableMetadataUtil.java     43.95% <0.00%> (-38.48%) ⬇️
...n/java/org/apache/hudi/common/util/LanceUtils.java     0.00% <0.00%> (-41.18%) ⬇️

... and 2116 files with indirect coverage changes

@puneetdixit200 puneetdixit200 force-pushed the fix/18758-lance-stats branch from e734a41 to b84e6ab Compare June 3, 2026 13:25
@puneetdixit200
Author

@hudi-bot run azure

Contributor

@hudi-agent hudi-agent left a comment

🤖 This review was generated by an AI agent and may contain mistakes. Please verify any suggestions before applying.

Thanks for working on this! The PR enables Lance metadata column stats and partition stats by implementing LanceUtils.readColumnStatsFromMetadata via a record-iteration approach (similar to the log file path), routing .lance files in readColumnRangeMetadataFrom, and removing the gates in MetadataPartitionType. There is one architectural question, raised in the inline comment, about the performance trade-off of reading all records, plus a couple of small readability suggestions. Please take a look at the inline comments; after that, this should be ready for a Hudi committer or PMC member to take it from here.

return Collections.emptyList();
}

List<String> projectedColumns = fieldsToIndex.stream()
Contributor

🤖 This path iterates every record in the file to compute stats. Since the Lance writer doesn't populate writeStat.columnStats, this full-file re-read will be triggered for every newly-written Lance base file on every commit (and for partition-stats aggregation on top of that). For Parquet, the equivalent path is cheap because it reads block-level statistics from the footer. @yihua have you considered (a) populating column stats on the writer side in HoodieBaseLanceWriter like Parquet does, or (b) exposing Lance's native per-fragment column statistics via the Java binding? Either would avoid the per-commit full-scan. Is the intent to land this read-based path now and optimize in a follow-up?

- AI-generated; verify before applying. React 👍/👎 to flag quality.

Author

This patch keeps the read-based path so Lance column and partition stats can be generated now. The writer-side/native Lance stats path looks like larger follow-up work: the current Lance writer path only writes file data plus bloom/vector metadata, and I do not see an existing HoodieWriteStat column-stats hook to reuse there.

Comment thread hudi-common/src/test/java/org/apache/hudi/metadata/TestMetadataPartitionType.java Outdated
@puneetdixit200
Author

@hudi-bot run azure

@hudi-bot
Collaborator

hudi-bot commented Jun 3, 2026

CI report:

Bot commands @hudi-bot supports the following commands:
  • @hudi-bot run azure re-run the last Azure build

Labels

size:M PR with lines of changes in (100, 300]

Development

Successfully merging this pull request may close these issues.

Support column and partition stats with Lance file format

5 participants