[IO] Implement DeltaIO reader and add Delta Lake perf tests #38750
durgaprasadml wants to merge 2 commits
Summary of Changes
Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a new Delta Lake source reader for Apache Beam, leveraging the Delta Kernel API to provide a robust and scalable way to read Delta tables. The implementation includes support for parallelized Parquet file reads, automatic mapping of Delta types to Beam schemas, and full integration testing to ensure performance and correctness across various table configurations.
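As a rough illustration of what mapping Delta types to Beam schema types can look like, here is a minimal plain-Java sketch. It is not taken from the PR: the class name, the `toBeamTypeName` helper, and the exact pairing of types are assumptions, and the Beam side is represented only by the names of `Schema.FieldType` constants so that the example runs without Beam or Delta Kernel on the classpath.

```java
import java.util.Map;

public class DeltaTypeMappingSketch {
    // Assumed correspondence between Delta primitive type names and the
    // names of Beam Schema.FieldType constants. Illustrative only; the
    // PR's actual mapping may differ.
    static final Map<String, String> DELTA_TO_BEAM = Map.of(
        "string", "STRING",
        "long", "INT64",
        "integer", "INT32",
        "short", "INT16",
        "byte", "BYTE",
        "float", "FLOAT",
        "double", "DOUBLE",
        "boolean", "BOOLEAN",
        "binary", "BYTES");

    // Hypothetical helper: fail loudly on types the sketch does not cover
    // instead of silently dropping the column.
    static String toBeamTypeName(String deltaType) {
        String beamType = DELTA_TO_BEAM.get(deltaType);
        if (beamType == null) {
            throw new IllegalArgumentException("Unsupported Delta type: " + deltaType);
        }
        return beamType;
    }

    public static void main(String[] args) {
        System.out.println(toBeamTypeName("long")); // prints INT64
    }
}
```

Rejecting unknown types is a design choice for the sketch; a real reader would also need to handle nested, decimal, and temporal types.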
|
Code Review
This pull request implements the read path for Delta Lake tables in DeltaIO using the Delta Kernel API, including schema mapping, row conversion, and integration tests. While the implementation is a great step forward, there are several critical architectural and performance issues that need to be addressed. Specifically, planning the table scan during pipeline construction freezes the file list and requires client-side access, while re-scanning the entire file list in ReadFileFn for every file descriptor results in
|
Assigning reviewers: R: @Abacn for label java.
The PR bot will only process comments in the main thread (not review comments). |
|
CC: @chamikaramj |
|
Thanks for the PR, but please note that this is ongoing work, and a more complete source with SDF support is being implemented here: #38706. Could you please comment there instead of re-implementing, so that we don't repeat work? Feel free to take any unassigned sub-tasks here: #21100. For more context, see the design: https://s.apache.org/beam-delta-lake-source and the dev-list thread: https://lists.apache.org/thread/8wqox64s68o2mbqmpr1mlcg30pq3r91k |
|
Thanks for the clarification and for pointing me to #38706. I wasn’t aware that a more complete SDF-based implementation was already in progress. I’ll avoid duplicating the work here and will continue the discussion/contributions on #38706 instead. I appreciate the references to the design doc and dev-list thread as well — I’ll review those for better alignment with the ongoing implementation effort. Thanks again for the guidance. |
|
Great! Thanks. As mentioned, please feel free to take any unassigned sub-tasks from #21100 to contribute (for example, perf testing). You can add them once the initial source is in. Also, any reviews or comments on the existing source PR or the design doc are welcome, since you already have the context :) |
Description:
What does this PR do?
This PR implements the Delta Lake source reader using the Delta Kernel API and adds performance/integration tests for Delta Lake reads.
The implementation introduces a parallelized read path for Delta tables by planning scans on the coordinator and distributing Parquet file reads across Beam workers.
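The plan-then-distribute shape described above can be sketched in plain Java. This is an analogy under stated assumptions, not the PR's code: `FileTask`, `planScan`, `readFile`, and `readAll` are hypothetical names, the file list is hard-coded, and a thread pool stands in for Beam workers. In a real pipeline the planning step would consult the Delta transaction log and the per-file step would read Parquet.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PlanThenDistribute {
    // Hypothetical stand-in for one Parquet data file selected by the scan.
    static final class FileTask {
        final String path;
        final long rowCount;
        FileTask(String path, long rowCount) {
            this.path = path;
            this.rowCount = rowCount;
        }
    }

    // "Coordinator" step: runs once and yields the files to read.
    static List<FileTask> planScan() {
        return List.of(
            new FileTask("part-0000.parquet", 100),
            new FileTask("part-0001.parquet", 250),
            new FileTask("part-0002.parquet", 75));
    }

    // "Worker" step: handles one file. Here it only reports the row count.
    static long readFile(FileTask task) {
        return task.rowCount;
    }

    // Fans the planned files out to a pool and sums the per-file results.
    static long readAll(int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Long>> pending = new ArrayList<>();
            for (FileTask task : planScan()) {
                Callable<Long> job = () -> readFile(task);
                pending.add(pool.submit(job));
            }
            long total = 0;
            for (Future<Long> result : pending) {
                total += result.get();
            }
            return total;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(readAll(2)); // prints 425
    }
}
```

Because each task carries everything a worker needs, workers never have to consult the full plan again; the total is the same regardless of the pool size.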
Changes Included
DeltaIO Reader Implementation
Performance / Integration Tests
Added:
Test scenarios:
The tests:
Build Updates
Updated sdks/java/io/delta/build.gradle with required integration test dependencies and Hadoop runtime dependencies required by Delta Kernel.
Verification
Executed:
```bash
./gradlew :sdks:java:io:delta:compileJava
./gradlew :sdks:java:io:delta:compileTestJava
./gradlew :sdks:java:io:delta:test --tests org.apache.beam.sdk.io.delta.DeltaIOIT
```
Fixes #38559