
Commit f362c7e

Yicong-Huang authored and HyukjinKwon committed
[SPARK-55059][PYTHON] Remove empty table workaround in toPandas
### What changes were proposed in this pull request?

Remove the SPARK-51112 workaround in `_convert_arrow_table_to_pandas()` that bypassed PyArrow's `to_pandas()` for empty tables.

### Why are the changes needed?

The workaround was added because arrow-java's `ListVector.getBufferSizeFor(0)` returned 0, causing the offset buffer to be omitted for empty nested arrays in IPC serialization, which led to a segmentation fault in PyArrow. This has been fixed upstream in arrow-java 19.0.0 ([apache/arrow-java#343](apache/arrow-java#343)), which Spark adopted in SPARK-56000 (PR #54820). The workaround is no longer necessary.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing test `test_to_pandas_for_empty_df_with_nested_array_columns` passes.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #53824 from Yicong-Huang/SPARK-55059/refactor/remove-empty-table-workaround.

Authored-by: Yicong Huang <17627829+Yicong-Huang@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
1 parent 6debd47 commit f362c7e

1 file changed

Lines changed: 2 additions & 10 deletions


python/pyspark/sql/pandas/conversion.py

```diff
@@ -254,16 +254,8 @@ def _convert_arrow_table_to_pandas(
         error_on_duplicated_field_names = True
         struct_handling_mode = "dict"

-        # SPARK-51112: If the table is empty, we avoid using pyarrow to_pandas to create the
-        # DataFrame, as it may fail with a segmentation fault.
-        if arrow_table.num_rows == 0:
-            # For empty tables, create empty Series to preserve dtypes
-            column_data = (
-                pd.Series([], name=temp_col_names[i], dtype="object") for i in range(len(schema.fields))
-            )
-        else:
-            # For non-empty tables, convert arrow columns directly
-            column_data = (arrow_col.to_pandas(**pandas_options) for arrow_col in arrow_table.columns)
+        # Convert arrow columns to pandas Series
+        column_data = (arrow_col.to_pandas(**pandas_options) for arrow_col in arrow_table.columns)

         # Apply Spark-specific type converters to each column
         pdf = pd.concat(
```
