
[SPARK-55754][PYTHON][TEST] Add ASV microbenchmarks for scalar Arrow UDF eval types#54555

Draft
Yicong-Huang wants to merge 16 commits into apache:master from Yicong-Huang:SPARK-55754/benchmark/scalar-arrow-udf

Conversation


@Yicong-Huang (Contributor) commented on Feb 28, 2026

What changes were proposed in this pull request?

Add ASV microbenchmarks for two scalar Arrow UDF eval types in `python/benchmarks/bench_eval_type.py`:

  • ScalarArrowUDFBench (SQL_SCALAR_ARROW_UDF)
  • ScalarArrowIterUDFBench (SQL_SCALAR_ARROW_ITER_UDF)
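The benchmark file itself is not shown in this conversation, but ASV discovers benchmarks by naming convention: `setup` builds the workload once, `time_*` methods are wall-clock timed, and `peakmem_*` methods are measured for peak memory. A minimal, Spark-free sketch of that shape (class and method names here are illustrative, not the PR's actual code):

```python
import random


class ScalarUDFBenchSketch:
    """Illustrative ASV-style benchmark class.

    ASV calls `setup` before each benchmark method, times every method
    whose name starts with `time_`, and records peak resident memory
    for every method whose name starts with `peakmem_`.
    """

    def setup(self):
        # Stand-in for building the input a scalar Arrow UDF would process.
        self.data = [random.random() for _ in range(100_000)]

    def time_compute(self):
        # The body of a `time_*` method is what ASV measures.
        sum(x * 2 for x in self.data)

    def peakmem_compute(self):
        # ASV reports the peak memory reached while this method runs.
        doubled = [x * 2 for x in self.data]
        return len(doubled)
```

In the real benchmarks, `setup` would create a SparkSession and register the UDF for the eval type under test (`SQL_SCALAR_ARROW_UDF` or `SQL_SCALAR_ARROW_ITER_UDF`), and the `time_*` bodies would trigger an action on the transformed DataFrame.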

Why are the changes needed?

This is part of SPARK-55724.

Does this PR introduce any user-facing change?

No. This only adds benchmark files under `python/benchmarks/`.

How was this patch tested?

ASV benchmark runs (`asv run --python=same`):

ScalarArrowUDFBench (SQL_SCALAR_ARROW_UDF):

   ScalarArrowUDFBench.time_compute                          71.3±0.2ms
   ScalarArrowUDFBench.time_large_batches_few_cols              367±3ms
   ScalarArrowUDFBench.time_large_batches_many_cols           164±0.9ms
   ScalarArrowUDFBench.time_mixed_types                      27.7±0.1ms
   ScalarArrowUDFBench.time_small_batches_few_cols            46.3±0.5ms
   ScalarArrowUDFBench.time_small_batches_many_cols           18.3±0.2ms
   ScalarArrowUDFBench.peakmem_compute                            2.89G
   ScalarArrowUDFBench.peakmem_large_batches_few_cols             2.99G
   ScalarArrowUDFBench.peakmem_large_batches_many_cols            2.87G
   ScalarArrowUDFBench.peakmem_mixed_types                        2.85G
   ScalarArrowUDFBench.peakmem_small_batches_few_cols             2.85G
   ScalarArrowUDFBench.peakmem_small_batches_many_cols            2.85G

ScalarArrowIterUDFBench (SQL_SCALAR_ARROW_ITER_UDF):

   ScalarArrowIterUDFBench.time_compute                        69.7±2ms
   ScalarArrowIterUDFBench.time_large_batches_few_cols          357±3ms
   ScalarArrowIterUDFBench.time_large_batches_many_cols         170±2ms
   ScalarArrowIterUDFBench.time_mixed_types                  25.7±0.2ms
   ScalarArrowIterUDFBench.time_small_batches_few_cols        43.2±0.3ms
   ScalarArrowIterUDFBench.time_small_batches_many_cols       18.3±0.3ms
   ScalarArrowIterUDFBench.peakmem_compute                        2.89G
   ScalarArrowIterUDFBench.peakmem_large_batches_few_cols         2.99G
   ScalarArrowIterUDFBench.peakmem_large_batches_many_cols        2.87G
   ScalarArrowIterUDFBench.peakmem_mixed_types                    2.85G
   ScalarArrowIterUDFBench.peakmem_small_batches_few_cols         2.85G
   ScalarArrowIterUDFBench.peakmem_small_batches_many_cols        2.85G

Was this patch authored or co-authored using generative AI tooling?

No

@Yicong-Huang changed the title from "[SPARK-55754][PYTHON][TESTS] Add ASV microbenchmarks for scalar Arrow UDF eval types" to "[SPARK-55754][PYTHON][TEST] Add ASV microbenchmarks for scalar Arrow UDF eval types" on Feb 28, 2026
@Yicong-Huang (Contributor, Author) commented:

Depends on #54533.

@Yicong-Huang marked this pull request as draft on February 28, 2026 06:27
- Interleave each `time_*` with its `peakmem_*` counterpart so related
  benchmarks are co-located in the file.
- Add sys.path injection in `__init__.py` so `asv run --python=same`
  works directly from the PySpark source tree without a prior
  `pip install`.
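The `sys.path` injection described above can be sketched as a small helper. This is not the PR's actual `__init__.py`; it is a hypothetical illustration of the technique: from `python/benchmarks/__init__.py`, the PySpark source root is one directory up, and prepending it to `sys.path` lets `import pyspark` resolve without a prior `pip install`:

```python
import os
import sys


def inject_source_root(init_file):
    """Prepend the PySpark source root to sys.path.

    `init_file` is the path of python/benchmarks/__init__.py; the
    source root (the directory containing the `pyspark` package) is
    one level above the `benchmarks` directory.
    """
    root = os.path.abspath(os.path.join(os.path.dirname(init_file), ".."))
    if root not in sys.path:
        # Insert at the front so the source tree wins over any
        # previously pip-installed pyspark.
        sys.path.insert(0, root)
    return root
```

Inside the real `__init__.py` the call would simply be `inject_source_root(__file__)` (or the equivalent inline code), executed at import time when ASV loads the benchmark package.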

Benchmark results (asv run --python=same --quick, Apple M4 Max):

```
GroupedMapPandasUDFBench (eval type 201):
  time_small_groups_few_cols     (1k rows/group,   5 cols, 1500 groups): 723 ms
  peakmem_small_groups_few_cols                                        : 1.96 G
  time_small_groups_many_cols    (1k rows/group,  50 cols,  200 groups): 735 ms
  peakmem_small_groups_many_cols                                       : 1.99 G
  time_large_groups_few_cols     (100k rows/group, 5 cols,  350 groups): 574 ms
  peakmem_large_groups_few_cols                                        : 3.19 G
  time_large_groups_many_cols    (100k rows/group,50 cols,   40 groups): 625 ms
  peakmem_large_groups_many_cols                                       : 3.72 G
  time_mixed_types               (mixed cols, 1-arg UDF,   1300 groups): 714 ms
  peakmem_mixed_types                                                   : 1.90 G
  time_mixed_types_two_args      (mixed cols, 2-arg UDF,   1600 groups): 658 ms
  peakmem_mixed_types_two_args                                         : 1.90 G
```
…ecordsPerBatch

Pass `spark.sql.execution.arrow.maxRecordsPerBatch` (default 10000) through
RunnerConf and split each large group's Arrow data into sub-batches of that
size inside the IPC stream, mirroring what the JVM does before sending data
to the Python worker.

Small groups (1k rows/group) are unaffected. Large groups (100k rows/group)
are now split into 10 batches of 10k rows each, which exercises the
pandas.concat path in the worker and reflects real-world Spark behaviour.
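The splitting logic described above reduces to slicing each group's rows into chunks of at most `maxRecordsPerBatch`. A minimal sketch of that arithmetic (function name and signature are illustrative; the PR presumably operates on Arrow data rather than bare offsets):

```python
def split_into_batches(num_rows, max_records_per_batch=10_000):
    """Yield (start, length) slices covering `num_rows` rows, mirroring
    how rows are split into Arrow record batches of at most
    `maxRecordsPerBatch` rows (default 10000) before being written to
    the IPC stream for the Python worker."""
    start = 0
    while start < num_rows:
        length = min(max_records_per_batch, num_rows - start)
        yield (start, length)
        start += length
```

With the defaults this reproduces the behaviour described above: a 1k-row group stays a single batch, while a 100k-row group is split into 10 batches of 10k rows, so the worker's `pandas.concat` path is actually exercised.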

Benchmark results (asv run --python=same --quick, Apple M4 Max):

```
  time_small_groups_few_cols     (1k rows/group,   5 cols, 1500 groups):  871 ms
  peakmem_small_groups_few_cols                                         : 1.96 G
  time_small_groups_many_cols    (1k rows/group,  50 cols,  200 groups):  732 ms
  peakmem_small_groups_many_cols                                        : 1.98 G
  time_large_groups_few_cols     (100k rows/group, 5 cols,  350 groups):  672 ms
  peakmem_large_groups_few_cols                                         : 3.19 G
  time_large_groups_many_cols    (100k rows/group,50 cols,   40 groups):  701 ms
  peakmem_large_groups_many_cols                                        : 3.72 G
  time_mixed_types               (mixed cols, 1-arg UDF,   1300 groups):  868 ms
  peakmem_mixed_types                                                   : 1.91 G
  time_mixed_types_two_args      (mixed cols, 2-arg UDF,   1600 groups):  825 ms
  peakmem_mixed_types_two_args                                          : 1.90 G
```
@Yicong-Huang force-pushed the SPARK-55754/benchmark/scalar-arrow-udf branch from ad370f2 to 66e621f on February 28, 2026 07:07