feat(spark): refresh parquet tools clustering strategy for current master #18409

Open
suryaprasanna wants to merge 2 commits into apache:master from suryaprasanna:parqute-tools-refresh

@suryaprasanna
Contributor

Describe the issue this Pull Request addresses

This PR refreshes the parquet-tools-based clustering strategy from the older parquet-tools branch so it can be proposed against current apache/master.

The original implementation had drifted from current Hudi internals and test APIs. This refresh keeps the shape of the existing simple rewrite hook while aligning the implementation with current clustering and storage behavior.

Summary and Changelog

Refresh the parquet-tools clustering strategy and its supporting tests for current master.

  • keep the ParquetToolsExecutionStrategy API simple with the existing file-to-file rewrite hook
  • generate a new output file id for clustering rewrites instead of reusing the source file id
  • migrate helper code to current StoragePath / HoodieStorage based APIs
  • replace brittle previous-commit extraction with FSUtils.getCommitTime(...)
  • update write-status generation to use current parquet/storage utilities
  • refresh the related tests to match current writer, meta client, and clustering strategy APIs
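
As a rough illustration of the second and fourth items above, here is a minimal, self-contained sketch. It does not use Hudi classes. It assumes the base file name layout `fileId_writeToken_instantTime.parquet`, and `BaseFileNameSketch`, `commitTimeOf`, `fileIdOf`, and `rewrittenFileName` are hypothetical names chosen for this example. The actual change relies on `FSUtils.getCommitTime(...)` rather than hand-rolled parsing.

```java
import java.util.UUID;

/** Illustrative only: mimics an assumed base file naming layout, not Hudi's actual classes. */
public class BaseFileNameSketch {

  /** Reads the instant time from a name shaped like fileId_writeToken_instantTime.ext. */
  public static String commitTimeOf(String baseFileName) {
    String[] parts = baseFileName.split("_");
    if (parts.length < 3) {
      throw new IllegalArgumentException("Unexpected base file name: " + baseFileName);
    }
    // The third segment is the instant time followed by the file extension.
    return parts[2].split("\\.", 2)[0];
  }

  /** Reads the file id, which is the segment before the first underscore. */
  public static String fileIdOf(String baseFileName) {
    return baseFileName.split("_", 2)[0];
  }

  /** Builds the output name for a rewrite under a caller-supplied file id. */
  public static String rewrittenFileName(String newFileId, String writeToken, String instantTime) {
    return newFileId + "_" + writeToken + "_" + instantTime + ".parquet";
  }

  public static void main(String[] args) {
    // Made-up source file name used only for this sketch.
    String source = "a1b2c3d4-0_1-0-1_20260328101500000.parquet";
    String previousCommit = commitTimeOf(source);

    // A fresh file id for the clustering output instead of reusing the source file id.
    String newFileId = UUID.randomUUID() + "-0";
    String target = rewrittenFileName(newFileId, "0-0-1", "20260328120000000");

    System.out.println("previous commit: " + previousCommit);
    System.out.println("source file id:  " + fileIdOf(source));
    System.out.println("rewritten file:  " + target);
  }
}
```

The point of the sketch is that the rewritten file carries a different file id from its source, while the source's commit time is still recoverable from its name alone.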

Impact

No public API change intended.

This keeps the existing parquet-tools rewrite extension point, but makes it compatible with current Hudi master and current clustering output semantics.

Risk Level

low
The change is localized to the parquet-tools rewrite path and related test scaffolding.

Documentation Update

none

Contributor's checklist

  • Read through contributor's guide
  • Enough context is provided in the sections above
  • Adequate tests were added if applicable

@github-actions bot added the "size:L" label (PR with lines of changes in (300, 1000]) on Mar 28, 2026
@hudi-bot
Collaborator

CI report:

Bot commands

@hudi-bot supports the following commands:
  • @hudi-bot run azure: re-run the last Azure build

@codecov-commenter

Codecov Report

❌ Patch coverage is 44.92754% with 38 lines in your changes missing coverage. Please review.
✅ Project coverage is 44.32%. Comparing base (1eb97b3) to head (0df6e53).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...java/org/apache/hudi/io/HoodieFileWriteHandle.java | 0.00% | 22 Missing ⚠️ |
| ...ng/run/strategy/ParquetToolsExecutionStrategy.java | 0.00% | 15 Missing ⚠️ |
| ...ecution/ParquetFileMetaToWriteStatusConvertor.java | 96.87% | 0 Missing and 1 partial ⚠️ |

❗ There is a different number of reports uploaded between BASE (1eb97b3) and HEAD (0df6e53).

HEAD has 28 uploads less than BASE

| Flag | BASE (1eb97b3) | HEAD (0df6e53) |
|---|---|---|
| hadoop-mr-java-client | 1 | 0 |
| spark-scala-tests | 10 | 0 |
| spark-client-hadoop-common | 1 | 0 |
| spark-java-tests | 15 | 0 |
| utilities | 1 | 0 |

Additional details and impacted files
@@              Coverage Diff              @@
##             master   #18409       +/-   ##
=============================================
- Coverage     68.21%   44.32%   -23.89%     
+ Complexity    27709    16852    -10857     
=============================================
  Files          2440     2380       -60     
  Lines        134249   127070     -7179     
  Branches      16179    14562     -1617     
=============================================
- Hits          91578    56328    -35250     
- Misses        35565    65910    +30345     
+ Partials       7106     4832     -2274     

| Flag | Coverage Δ | |
|---|---|---|
| common-and-other-modules | 44.32% <44.92%> (+<0.01%) | ⬆️ |
| hadoop-mr-java-client | ? | |
| spark-client-hadoop-common | ? | |
| spark-java-tests | ? | |
| spark-scala-tests | ? | |
| utilities | ? | |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
|---|---|
| ...ecution/ParquetFileMetaToWriteStatusConvertor.java | 96.87% <96.87%> (ø) |
| ...ng/run/strategy/ParquetToolsExecutionStrategy.java | 0.00% <0.00%> (ø) |
| ...java/org/apache/hudi/io/HoodieFileWriteHandle.java | 0.00% <0.00%> (ø) |

... and 1198 files with indirect coverage changes

@suryaprasanna
Contributor Author

@nsivabalan Executing any parquet-tools operation requires a special jar to be included in the runtime, so I am not adding the column-nullifying parquet-tools execution strategy. Let us just keep the test class itself.
