Skip to content

docs: align Spark batch_size docs with 8192 default#323

Open
harsh-ande wants to merge 3 commits intolance-format:mainfrom
harsh-ande:docs-batch-size-default-8192
Open

docs: align Spark batch_size docs with 8192 default#323
harsh-ande wants to merge 3 commits intolance-format:mainfrom
harsh-ande:docs-batch-size-default-8192

Conversation

@harsh-ande
Copy link
Copy Markdown

@harsh-ande harsh-ande commented Mar 21, 2026

This PR addresses #153.

Spark currently hardcodes read batch_size in lance-spark, which keeps the connector docs tied to a Spark-owned default even though lance-core already provides the underlying default behavior.

This update changes the read path so Spark only forwards batch_size when the user explicitly sets it. When batch_size is omitted, the connector now inherits the lance-core default.

Changes:

  • make read batch_size optional in LanceSparkReadOptions
  • only call ScanOptions.batchSize(...) when batch_size is explicitly provided
  • update Spark read docs to describe inherited-default behavior instead of pinning Spark to 8192
  • keep the tuning examples for explicit larger batch sizes
  • replace the default read batch_size regression with unset-vs-explicit-override coverage
  • remove the unrelated write-side doc/test changes from this PR

Verification:

  • ran JAVA_HOME=/opt/homebrew/opt/openjdk/libexec/openjdk.jdk/Contents/Home PATH=/opt/homebrew/opt/openjdk/bin:$PATH mvn -pl lance-spark-base_2.12 -Dspotless.skip=true -Dtest=LanceSparkReadOptionsJsonTest,LanceDatasetReadTest test
  • result: BUILD SUCCESS

Notes:

  • the issue title mentions block_size, but the scan-throughput setting here is batch_size
  • this PR intentionally stays read-path only

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Mar 21, 2026
Copy link
Copy Markdown
Collaborator

@hamersaw hamersaw left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than setting this to 8192 to coincide with the lance-core default, would it make sense to us a Optional value for it? If unset, then we let it fallback to the lance-core default. This way if the lance-core default ever changes, we do not need to make updates through this repo.

@harsh-ande
Copy link
Copy Markdown
Author

Thanks for your response @hamersaw, that makes a lot of sense. I’ll update the Spark read path so batch_size is only forwarded when explicitly set, and otherwise it will inherit the lance-core default. I’ll also adjust the docs/tests to reflect that instead of pinning Spark to 8192.

@harsh-ande
Copy link
Copy Markdown
Author

@hamersaw I updated the PR so Spark no longer hardcodes the read batch_size default. If batch_size is unset, we now leave it unset in lance-spark and let lance-core apply its default. I also updated the read docs to describe the inherited-default behavior and changed the regression coverage to test unset vs explicit override.

Requesting you to please take a look when you get time. Thank you.

@harsh-ande harsh-ande requested a review from hamersaw March 24, 2026 22:10
@hamersaw
Copy link
Copy Markdown
Collaborator

@harsh-ande looks like some of the CI checks are failing. Lets get those fixed up and then I'll make a final pass.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants