
Commit f362c7e

Yicong-Huang authored and HyukjinKwon committed
[SPARK-55059][PYTHON] Remove empty table workaround in toPandas
### What changes were proposed in this pull request?

Remove the SPARK-51112 workaround in `_convert_arrow_table_to_pandas()` that bypassed PyArrow's `to_pandas()` for empty tables.

### Why are the changes needed?

The workaround was added because arrow-java's `ListVector.getBufferSizeFor(0)` returned 0, causing the offset buffer to be omitted for empty nested arrays in IPC serialization, which led to a segmentation fault in PyArrow. This has been fixed upstream in arrow-java 19.0.0 ([apache/arrow-java#343](apache/arrow-java#343)), which Spark adopted in SPARK-56000 (PR #54820). The workaround is no longer necessary.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing test `test_to_pandas_for_empty_df_with_nested_array_columns` passes.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #53824 from Yicong-Huang/SPARK-55059/refactor/remove-empty-table-workaround.

Authored-by: Yicong Huang <17627829+Yicong-Huang@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
1 parent 6debd47 commit f362c7e

1 file changed

Lines changed: 2 additions & 10 deletions


python/pyspark/sql/pandas/conversion.py

```diff
@@ -254,16 +254,8 @@ def _convert_arrow_table_to_pandas(
         error_on_duplicated_field_names = True
         struct_handling_mode = "dict"

-        # SPARK-51112: If the table is empty, we avoid using pyarrow to_pandas to create the
-        # DataFrame, as it may fail with a segmentation fault.
-        if arrow_table.num_rows == 0:
-            # For empty tables, create empty Series to preserve dtypes
-            column_data = (
-                pd.Series([], name=temp_col_names[i], dtype="object") for i in range(len(schema.fields))
-            )
-        else:
-            # For non-empty tables, convert arrow columns directly
-            column_data = (arrow_col.to_pandas(**pandas_options) for arrow_col in arrow_table.columns)
+        # Convert arrow columns to pandas Series
+        column_data = (arrow_col.to_pandas(**pandas_options) for arrow_col in arrow_table.columns)

         # Apply Spark-specific type converters to each column
         pdf = pd.concat(
```
