Hi all,
I am working with the Apache Arrow C++ Dataset API to scan multiple Parquet files. My goal is to process RecordBatches in parallel using a callback function without materializing the entire table at once.
Following the Dataset Tutorial, I am using the Scan method. According to the documentation:
However, in my implementation, the visitor function is called strictly in order (one call begins only after the previous one finishes), even though `use_threads` is set to `true`. I have also tried `ScanBatchesUnordered`, but I am seeing similar serial behavior.
Minimal Working Example:
Observed Behavior
When running this against 6 Parquet files (approx. 5 GiB total), the timestamps in the output show a perfect 5-second gap between batches.
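Evenly spaced timestamps suggest serial execution but do not prove it. A more direct check is to count how many visitor calls are in flight at the same moment. The following is a minimal sketch using only the C++ standard library; `OverlapTracker` and `Scope` are hypothetical helper names for illustration and are not part of the Arrow API.

```cpp
#include <atomic>
#include <thread>

// Hypothetical helper (not an Arrow class): tracks how many callback
// invocations are running concurrently and remembers the peak value.
class OverlapTracker {
 public:
  // RAII guard: construct one as the first statement of the callback body.
  class Scope {
   public:
    explicit Scope(OverlapTracker& tracker) : tracker_(tracker) { tracker_.Enter(); }
    ~Scope() { tracker_.Leave(); }
    Scope(const Scope&) = delete;
    Scope& operator=(const Scope&) = delete;

   private:
    OverlapTracker& tracker_;
  };

  // Highest number of simultaneously active callbacks seen so far.
  int max_in_flight() const { return max_in_flight_.load(); }

 private:
  void Enter() {
    const int now = in_flight_.fetch_add(1) + 1;
    // Raise the recorded peak if this call pushed the count higher.
    int prev = max_in_flight_.load();
    while (prev < now && !max_in_flight_.compare_exchange_weak(prev, now)) {
    }
  }

  void Leave() { in_flight_.fetch_sub(1); }

  std::atomic<int> in_flight_{0};
  std::atomic<int> max_in_flight_{0};
};
```

With an `OverlapTracker::Scope` created at the top of the visitor, `max_in_flight()` read after the scan completes stays at 1 if the calls were strictly serial and exceeds 1 if any two calls overlapped.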
Version: Apache Arrow 22
Am I missing a configuration step in ScanOptions or ScannerBuilder to actually trigger parallel execution of the visitor? Is there a preferred way to handle parallel callbacks in the Dataset API?
Thanks for your help