[SPARK-55754][PYTHON][TEST] Add ASV microbenchmarks for scalar Arrow UDF eval types #54555
Draft: Yicong-Huang wants to merge 16 commits into apache:master from
Depends on #54533.
- Interleave each `time_*` with its `peakmem_*` counterpart so related benchmarks are co-located in the file.
- Add sys.path injection in `__init__.py` so `asv run --python=same` works directly from the PySpark source tree without a prior `pip install`.

Benchmark results (`asv run --python=same --quick`, Apple M4 Max):

```
GroupedMapPandasUDFBench (eval type 201):
  time_small_groups_few_cols     (1k rows/group, 5 cols, 1500 groups):    723 ms
  peakmem_small_groups_few_cols                                         : 1.96 G
  time_small_groups_many_cols    (1k rows/group, 50 cols, 200 groups):    735 ms
  peakmem_small_groups_many_cols                                        : 1.99 G
  time_large_groups_few_cols     (100k rows/group, 5 cols, 350 groups):   574 ms
  peakmem_large_groups_few_cols                                         : 3.19 G
  time_large_groups_many_cols    (100k rows/group, 50 cols, 40 groups):   625 ms
  peakmem_large_groups_many_cols                                        : 3.72 G
  time_mixed_types               (mixed cols, 1-arg UDF, 1300 groups):    714 ms
  peakmem_mixed_types                                                   : 1.90 G
  time_mixed_types_two_args      (mixed cols, 2-arg UDF, 1600 groups):    658 ms
  peakmem_mixed_types_two_args                                          : 1.90 G
```
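The sys.path injection described above can be sketched as follows. The helper name and directory layout here are assumptions for illustration, not the PR's actual `__init__.py` contents:

```python
import os
import sys


def inject_source_root(benchmarks_dir):
    """Prepend the PySpark source root (assumed to sit one level above
    the benchmarks directory) to sys.path, so `import pyspark` resolves
    from the source tree without a prior `pip install`.

    Hypothetical helper; the real __init__.py in the PR may differ.
    """
    root = os.path.abspath(os.path.join(benchmarks_dir, os.pardir))
    if root not in sys.path:
        sys.path.insert(0, root)
    return root
```

In the actual PR this logic would presumably run at import time in `python/benchmarks/__init__.py`, keyed off `__file__`.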
…ecordsPerBatch

Pass `spark.sql.execution.arrow.maxRecordsPerBatch` (default 10000) through RunnerConf and split each large group's Arrow data into sub-batches of that size inside the IPC stream, mirroring what the JVM does before sending data to the Python worker.

Small groups (1k rows/group) are unaffected. Large groups (100k rows/group) are now split into 10 batches of 10k rows each, which exercises the `pandas.concat` path in the worker and reflects real-world Spark behaviour.

Benchmark results (`asv run --python=same --quick`, Apple M4 Max):

```
time_small_groups_few_cols     (1k rows/group, 5 cols, 1500 groups):    871 ms
peakmem_small_groups_few_cols                                         : 1.96 G
time_small_groups_many_cols    (1k rows/group, 50 cols, 200 groups):    732 ms
peakmem_small_groups_many_cols                                        : 1.98 G
time_large_groups_few_cols     (100k rows/group, 5 cols, 350 groups):   672 ms
peakmem_large_groups_few_cols                                         : 3.19 G
time_large_groups_many_cols    (100k rows/group, 50 cols, 40 groups):   701 ms
peakmem_large_groups_many_cols                                        : 3.72 G
time_mixed_types               (mixed cols, 1-arg UDF, 1300 groups):    868 ms
peakmem_mixed_types                                                   : 1.91 G
time_mixed_types_two_args      (mixed cols, 2-arg UDF, 1600 groups):    825 ms
peakmem_mixed_types_two_args                                          : 1.90 G
```
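The splitting logic described in this commit can be illustrated with a small pure-Python sketch. The function name is an assumption, and the real change presumably slices Arrow record batches rather than plain row offsets:

```python
def split_into_batches(num_rows, max_records_per_batch=10000):
    """Yield (offset, length) pairs covering num_rows rows, each slice
    holding at most max_records_per_batch rows, mirroring how a large
    group's Arrow data is cut into sub-batches in the IPC stream."""
    offset = 0
    while offset < num_rows:
        length = min(max_records_per_batch, num_rows - offset)
        yield (offset, length)
        offset += length


# A 100k-row group yields 10 sub-batches of 10k rows each; a 1k-row
# group stays in a single batch, matching the behaviour described above.
large = list(split_into_batches(100_000))
small = list(split_into_batches(1_000))
```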
### What changes were proposed in this pull request?

Add ASV microbenchmarks for two scalar Arrow UDF eval types in `python/benchmarks/bench_eval_type.py`:

- `ScalarArrowUDFBench` (`SQL_SCALAR_ARROW_UDF`)
- `ScalarArrowIterUDFBench` (`SQL_SCALAR_ARROW_ITER_UDF`)

### Why are the changes needed?

This is part of SPARK-55724.
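For illustration, a minimal benchmark class following ASV's `time_*`/`peakmem_*` naming convention might look like the sketch below. It uses a plain Python workload instead of a PySpark Arrow UDF so it stays self-contained, and all names are assumptions rather than the PR's actual code:

```python
class ScalarUDFBenchSketch:
    """Hypothetical ASV benchmark class. ASV discovers methods by
    prefix: time_* measures wall-clock time and peakmem_* measures
    peak memory of the same workload."""

    def setup(self):
        # Called by ASV before each benchmark method runs.
        self.values = list(range(100_000))

    def _apply_udf(self):
        # Stand-in for evaluating a scalar UDF over a column.
        return [v * 2 for v in self.values]

    def time_scalar_double(self):
        self._apply_udf()

    def peakmem_scalar_double(self):
        self._apply_udf()
```

Interleaving each `time_*` method with its `peakmem_*` counterpart, as this PR does, keeps the two measurements of the same workload adjacent in the file.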
### Does this PR introduce any user-facing change?

No. This only adds benchmark files under `python/benchmarks/`.

### How was this patch tested?
ASV benchmark runs (`asv run --python=same`):

- ScalarArrowUDFBench (`SQL_SCALAR_ARROW_UDF`)
- ScalarArrowIterUDFBench (`SQL_SCALAR_ARROW_ITER_UDF`)

### Was this patch authored or co-authored using generative AI tooling?
No