[SPARK-55294][PYTHON] Add support for PyArrow-backed dtypes in pandas API #55783

Open

adith-os wants to merge 1 commit into apache:master from adith-os:SPARK-55294-pyarrow-dtypes

Conversation


@adith-os adith-os commented May 9, 2026

What changes were proposed in this pull request?

This PR adds support for PyArrow-backed dtypes (e.g., bool[pyarrow]) in PySpark's pandas API. The changes include:

  • Type detection and conversion: Added an is_pyarrow_backed_dtype() function to detect PyArrow-backed dtypes and enhanced as_spark_type() to convert them to the appropriate Spark types (a sketch follows this list).
  • Dtype preservation: Enhanced spark_type_to_pandas_dtype() with a use_arrow_dtypes parameter to preserve PyArrow dtypes when converting from Spark types back to pandas dtypes.
  • Propagation through operations: Updated comparison operators in data_type_ops/base.py to preserve PyArrow dtypes in results, and propagated the use_arrow_dtypes parameter through InternalField.from_struct_field() and related code paths.
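
Taken together, a minimal sketch of how the detection and conversion pieces could fit. The names follow the PR description, but the body is illustrative rather than the actual diff, and it assumes pandas >= 2.0 for pd.ArrowDtype:

import pandas as pd
import pyarrow as pa
from pyspark.sql import types

def is_pyarrow_backed_dtype(dtype) -> bool:
    # pd.ArrowDtype wraps a pyarrow.DataType, e.g. bool[pyarrow].
    return isinstance(dtype, pd.ArrowDtype)

def as_spark_type(tpe) -> types.DataType:
    # Illustrative subset: map a PyArrow-backed pandas dtype to a Spark type.
    if is_pyarrow_backed_dtype(tpe):
        pa_type = tpe.pyarrow_dtype
        if pa.types.is_boolean(pa_type):
            return types.BooleanType()
        if pa.types.is_string(pa_type) or pa.types.is_large_string(pa_type):
            return types.StringType()
    raise TypeError(f"unsupported dtype: {tpe}")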

Why are the changes needed?

Starting with pandas 3.0, when PyArrow is installed, pandas automatically uses PyArrow-backed dtypes for string columns. Without this PR, PySpark's pandas API loses that dtype information when working with PyArrow-backed dtypes.
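
For illustration, this is what the pandas 3.0 default looks like (assumes pandas >= 3.0 with PyArrow installed; the exact output may vary):

>>> import pandas as pd
>>> s = pd.Series(['a', 'b', 'c'])
>>> print(s.dtype)  # the new default string dtype, PyArrow-backed when available
str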

Does this PR introduce any user-facing change?

Yes. Users working with PyArrow-backed dtypes will now see their dtypes preserved through PySpark operations.
Before the PR:

>>> import warnings
>>> warnings.filterwarnings('ignore')
>>> 
>>> import pandas as pd
>>> print(f"pandas: {pd.__version__}")
pandas: 3.0.2
>>> 
>>> import pyspark.pandas as ps
>>> s1 = ps.Series(['a', 'b', 'c'], dtype='string')
>>> s2 = ps.Series(['a', 'x', 'c'], dtype='string')
>>> result = s1 == s2
>>> print(f"result dtype: {result.dtype}")
result dtype: boolean
>>> print(result)
0     True
1    False
2     True
dtype: boolean
>>> 

After the PR:

>>> import warnings
>>> warnings.filterwarnings('ignore')
>>>
>>> import pandas as pd
>>> print(f"pandas: {pd.__version__}")
pandas: 3.0.2
>>>
>>> import pyspark.pandas as ps
>>> s1 = ps.Series(['a', 'b', 'c'], dtype='string')
>>> s2 = ps.Series(['a', 'x', 'c'], dtype='string')
>>> result = s1 == s2
>>> print(f"result dtype: {result.dtype}")
result dtype: bool[pyarrow]
>>> print(result)
0     True
1    False
2     True
dtype: bool[pyarrow]
>>> 

How was this patch tested?

New tests in python/pyspark/pandas/tests/test_typedef.py:

  • test_as_spark_type_pyarrow_dtypes() - Tests conversion to Spark types
  • test_spark_type_to_pandas_dtype_with_arrow_flag() - Tests conversion back to pandas dtypes
  • test_is_str_dtype_with_pyarrow() - Tests string dtype detection
  • test_is_pyarrow_backed_dtype() - Tests PyArrow dtype detection

Integration test in python/pyspark/pandas/tests/data_type_ops/test_string_ops.py:

  • test_pyarrow_backed_string_comparisons() - Tests dtype preservation in comparison operations (sketched below)
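
A minimal sketch of what that integration test could look like; it mirrors the before/after demo above (pandas 3.x with PyArrow) and is illustrative, not the PR's actual test body:

import pyspark.pandas as ps

def test_pyarrow_backed_string_comparisons():
    # Comparing PyArrow-backed string Series should yield a
    # PyArrow-backed boolean result rather than plain `boolean`.
    s1 = ps.Series(['a', 'b', 'c'], dtype='string')
    s2 = ps.Series(['a', 'x', 'c'], dtype='string')
    result = s1 == s2
    assert str(result.dtype) == 'bool[pyarrow]'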

Was this patch authored or co-authored using generative AI tooling?

Yes. Generated-by: GPT 5.5

adith-os (Author) commented May 9, 2026

Hi @gaogaotiantian, could you please review the PR when you get a chance and let me know whether this approach matches your expectations, or if you had a different direction in mind? Thanks!

HyukjinKwon (Member) commented

cc @devin-petersohn too

devin-petersohn (Contributor) left a comment

This feels both incomplete and too verbose. ne and eq were the only APIs updated. There's some copy/pasted code as well. I left some high-level comments.

I also haven't fully checked, but I think this will be a problem for pandas 3 and the string dtype.

return types.DoubleType()
if extension_arrow_dtypes_available and isinstance(tpe, ArrowDtype):
pyarrow_type = tpe.pyarrow_dtype
if pa.types.is_string(pyarrow_type) or pa.types.is_large_string(pyarrow_type):
devin-petersohn: Can we just reuse existing functionality?

def from_arrow_type(
at: "pa.DataType",
prefer_timestamp_ntz: bool = True,
) -> DataType:
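
A hedged sketch of the suggested reuse, delegating the ArrowDtype branch to the existing converter instead of hand-matching PyArrow types (assumes from_arrow_type from pyspark.sql.pandas.types, as quoted above; the helper name is invented here):

from pandas import ArrowDtype
from pyspark.sql.pandas.types import from_arrow_type
from pyspark.sql.types import DataType

def arrow_dtype_to_spark_type(tpe: ArrowDtype) -> DataType:
    # Reuse the existing PyArrow -> Spark converter rather than
    # enumerating pa.types.is_* checks for each supported type.
    return from_arrow_type(tpe.pyarrow_dtype)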

Comment on lines +391 to +392
storage = getattr(tpe, "storage", None)
return isinstance(storage, str) and storage.startswith("pyarrow") and tpe.na_value is pd.NA
devin-petersohn: StringDtype.storage is part of the API. This feels overly defensive.
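
If the defensive getattr is dropped per that comment, one possible simplification using only documented attributes (whether the na_value check is still needed depends on which string dtypes the caller wants to match):

import pandas as pd

def is_pyarrow_backed_string_dtype(dtype) -> bool:
    # StringDtype.storage is public API, so no getattr fallback is needed.
    return isinstance(dtype, pd.StringDtype) and dtype.storage == 'pyarrow'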

Comment on lines +547 to +562
use_extension_dtypes = handle_dtype_as_extension_dtype(left.dtype) or (
isinstance(right, IndexOpsMixin) and handle_dtype_as_extension_dtype(right.dtype)
)
use_arrow_dtypes = is_pyarrow_backed_dtype(left.dtype) or (
isinstance(right, IndexOpsMixin) and is_pyarrow_backed_dtype(right.dtype)
)
field = left._internal.data_fields[0].copy(
dtype=spark_type_to_pandas_dtype(
BooleanType(),
use_extension_dtypes=use_extension_dtypes,
use_arrow_dtypes=use_arrow_dtypes,
),
spark_type=BooleanType(),
nullable=False,
)
left_scol = left._with_new_scol(F.lit(True), field=field)
devin-petersohn: This looks identical to the code block above. Should be a helper if it's this many lines, IMO.
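
One way the duplication could be factored out, per that comment: a hypothetical helper (the name is invented here; the body reuses the quoted snippet and assumes its surrounding imports, e.g. BooleanType, IndexOpsMixin, spark_type_to_pandas_dtype):

def _boolean_result_field(left, right):
    # Shared construction of the boolean result field for comparison ops.
    use_extension_dtypes = handle_dtype_as_extension_dtype(left.dtype) or (
        isinstance(right, IndexOpsMixin) and handle_dtype_as_extension_dtype(right.dtype)
    )
    use_arrow_dtypes = is_pyarrow_backed_dtype(left.dtype) or (
        isinstance(right, IndexOpsMixin) and is_pyarrow_backed_dtype(right.dtype)
    )
    return left._internal.data_fields[0].copy(
        dtype=spark_type_to_pandas_dtype(
            BooleanType(),
            use_extension_dtypes=use_extension_dtypes,
            use_arrow_dtypes=use_arrow_dtypes,
        ),
        spark_type=BooleanType(),
        nullable=False,
    )

Each call site would then reduce to something like left._with_new_scol(F.lit(True), field=_boolean_result_field(left, right)).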

gaogaotiantian (Contributor) commented

Honestly, I'm a bit conservative when a new contributor makes complex changes with an LLM to support pandas 3. Our CI does not check pandas 3 (the nightly CI simply fails and cannot be used against a specific PR). I've had a few cases where people just fed my description to an LLM and generated a patch that of course passed the CI, because we do not test pandas 3, while the code may not even have worked properly.

At this point, it's very difficult and time-consuming to validate the patch and the approach. We have very limited resources spread across many different matters. I would say we deprioritize supporting pandas 3, especially patches from new contributors using LLMs, so that we can keep everything else going. In the future, when we have enough resources to deal with it, we can put more effort into it.
