Arrow-Datasets C++: Scanner::Scan visitor is executing serially despite use_threads=true #49568

moba15 · 2026-03-20T09:36:00Z

Hi all,
I am working with the Apache Arrow C++ Dataset API to scan multiple Parquet files. My goal is to process RecordBatches in parallel using a callback function without materializing the entire table at once.

Following the Dataset Tutorial, I am using the Scan method. According to the documentation:

If multiple threads are used (via use_threads), the visitor will be invoked from those threads and is responsible for any synchronization.

However, in my implementation the visitor function is called strictly serially: each call begins only after the previous one finishes, even though use_threads is set to true. I have also tried ScanBatchesUnordered and see similar serial behavior.

Minimal Working Example:

#include <chrono>
#include <iostream>
#include <memory>
#include <string>
#include <thread>

#include <arrow/api.h>
#include <arrow/compute/api.h>
#include <arrow/dataset/api.h>
#include <arrow/filesystem/api.h>

arrow::Status ProcessBatch(const arrow::dataset::TaggedRecordBatch &tagged_batch) {
    std::cerr << "ThreadId " << std::this_thread::get_id() << " got batch with "
            << tagged_batch.record_batch->num_rows() << " rows at "
            << std::chrono::system_clock::now() << "\n";
    // Simulate processing time
    std::this_thread::sleep_for(std::chrono::seconds(5));
    return arrow::Status::OK();
}

arrow::Status ScanWholeDataset(
    const std::shared_ptr<arrow::fs::FileSystem> &filesystem,
    const std::shared_ptr<arrow::dataset::FileFormat> &format, const std::string &base_dir) {
    // Create custom scan options
    auto customOption = std::make_shared<arrow::dataset::ScanOptions>();
    customOption->use_threads = true;
    customOption->fragment_readahead = 20;

    arrow::fs::FileSelector selector;
    selector.base_dir = base_dir;
    selector.recursive = true;

    ARROW_ASSIGN_OR_RAISE(
        auto factory,
        arrow::dataset::FileSystemDatasetFactory::Make(filesystem, selector, format, arrow::dataset::
            FileSystemFactoryOptions()));

    ARROW_ASSIGN_OR_RAISE(auto dataset, factory->Finish());
    arrow::dataset::ScannerBuilder scan_builder(dataset, customOption);

    ARROW_ASSIGN_OR_RAISE(auto scanner, scan_builder.Finish());
    // Call Scan with the callback and propagate any error it returns
    return scanner->Scan(ProcessBatch);
}

arrow::Status Test() {
    ARROW_RETURN_NOT_OK(arrow::compute::Initialize());
    
    std::string base_path = "xxx";
    std::string root_path;
    std::string uri = "file://xxx";
    ARROW_ASSIGN_OR_RAISE(auto fs, arrow::fs::FileSystemFromUri(uri, &root_path));
    auto format = std::make_shared<arrow::dataset::ParquetFileFormat>();

    ARROW_RETURN_NOT_OK(ScanWholeDataset(fs, format, base_path));

    return arrow::Status::OK();
}

int main() {
    auto status = Test();
    if (!status.ok()) {
        std::cerr << "Error: " << status.message() << std::endl;
        return 1;
    }
    return 0;
}

Observed Behavior

When running this against 6 Parquet files (approx. 5 GiB total), the timestamps in the output show a perfect 5-second gap between batches.

ThreadId 133083219080896 got batch with 122880 rows at 2026-03-20 09:25:04.164492317 
ThreadId 133083219080896 got batch with 122880 rows at 2026-03-20 09:25:09.165123649 
ThreadId 133083219080896 got batch with 122880 rows at 2026-03-20 09:25:14.167036139

Apache Arrow version: 22

Am I missing a configuration step in ScanOptions or ScannerBuilder to actually trigger parallel execution of the visitor? Is there a preferred way to handle parallel callbacks in the Dataset API?

Thanks for your help

KOKOSde · 2026-03-24T05:29:58Z

use_threads=true does not mean the visitor body runs in parallel the way tasks on a custom thread pool would. Arrow can parallelize file reading and decoding, but if the visitor does blocking work, the scan pipeline will still look serial, because each callback has to finish before the next batch is delivered. The practical fix is to keep the visitor tiny, push each RecordBatch into your own queue, and do the heavy processing in a separate thread pool. If you want direct control, ScanBatchesAsync is a better fit than doing real work inside the visitor.
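
The hand-off pattern described above (tiny visitor, own queue, separate workers) could look roughly like the sketch below. It is only an illustration under stated assumptions: Batch is a hypothetical stand-in for arrow::dataset::TaggedRecordBatch so that the code builds without Arrow, and BoundedQueue and ScanAndProcess are made-up names, not part of any Arrow API. The loop that calls the visitor simulates a scan that delivers batches one at a time.

```cpp
#include <atomic>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for arrow::dataset::TaggedRecordBatch so that the
// sketch compiles without Arrow.
struct Batch {
  long long num_rows;
};

// Bounded blocking queue. Push blocks while the queue is full, so slow
// workers slow the producer down instead of letting batches pile up.
template <typename T>
class BoundedQueue {
 public:
  explicit BoundedQueue(std::size_t capacity) : capacity_(capacity) {}

  void Push(T item) {
    std::unique_lock<std::mutex> lock(mutex_);
    not_full_.wait(lock, [this] { return items_.size() < capacity_; });
    items_.push(std::move(item));
    not_empty_.notify_one();
  }

  // Returns std::nullopt once the queue has been closed and drained.
  std::optional<T> Pop() {
    std::unique_lock<std::mutex> lock(mutex_);
    not_empty_.wait(lock, [this] { return closed_ || !items_.empty(); });
    if (items_.empty()) return std::nullopt;
    T item = std::move(items_.front());
    items_.pop();
    not_full_.notify_one();
    return item;
  }

  void Close() {
    std::lock_guard<std::mutex> lock(mutex_);
    closed_ = true;
    not_empty_.notify_all();
  }

 private:
  std::mutex mutex_;
  std::condition_variable not_empty_;
  std::condition_variable not_full_;
  std::queue<T> items_;
  std::size_t capacity_;
  bool closed_ = false;
};

// Delivers num_batches batches one at a time to a cheap visitor (as a serial
// scan would) while num_workers threads (must be >= 1) do the per-batch work.
// Returns the total number of rows the workers processed.
long long ScanAndProcess(int num_batches, std::size_t num_workers) {
  BoundedQueue<Batch> queue(2 * num_workers);
  std::atomic<long long> total_rows{0};

  std::vector<std::thread> workers;
  for (std::size_t i = 0; i < num_workers; ++i) {
    workers.emplace_back([&queue, &total_rows] {
      while (auto batch = queue.Pop()) {
        // Heavy per-batch processing would go here.
        total_rows += batch->num_rows;
      }
    });
  }

  // The visitor only enqueues. With Arrow, a lambda like this (returning
  // arrow::Status::OK()) would be what gets passed to scanner->Scan(...).
  auto visitor = [&queue](Batch batch) { queue.Push(std::move(batch)); };
  for (int i = 0; i < num_batches; ++i) visitor(Batch{122880});

  queue.Close();  // no more batches: let the workers drain and exit
  for (auto& worker : workers) worker.join();
  return total_rows.load();
}
```

Calling ScanAndProcess(10, 4) hands ten simulated batches to four workers. The queue is bounded on purpose: if the workers fall behind, Push blocks inside the visitor, which throttles the scan rather than buffering the whole dataset in memory.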
