GH-48978: [Python] test failures on pandas 3.0 for fastparquet and for zoneinfo w/o pytz #48979
tadeja wants to merge 12 commits into apache:main
Failures replicated here on CI job AMD64 Conda Python 3.13 Pandas latest
With the proposed fixes, the previously failing tests now pass; you can see the run is with pandas 3.0, with
I backported this to the conda-forge feedstock in conda-forge/pyarrow-feedstock#169, and can confirm that it works! Thanks!
Would it make sense to use integers for categories until fastparquet fully supports pandas 3.0? That way we can also check the dtype roundtrip here.
Ok, @AlenkaF here's the new approach with
"f": pd.Categorical([5, 6, 7]),
This fails the test at
tm.assert_frame_equal(table_fp.to_pandas(), df_for_fp, check_dtype=False)
pyarrow/tests/parquet/test_basic.py:769:
...
...
E AssertionError: Categorical Expected type <class 'pandas.Categorical'>, found <class 'numpy.ndarray'> instead
so we have to add another check_categorical=False on line 769
as already done before at line 756 for the same reason.
@AlenkaF, @rok this now includes comments and `# TODO: once fastparquet supports pandas 3.0 dtypes revert string/categorical test`, also according to #48978 (comment)
Co-authored-by: Alenka Frim <AlenkaF@users.noreply.github.com>
Co-authored-by: Rok Mihevc <rok@mihevc.org>
bdf182c to f6c9301
rok left a comment
Timezone part looks good. Suggesting a more readable fastparquet workaround. Proposed changes are not tested.
# TODO: once fastparquet supports pandas 3 dtypes revert string and categorical
# tests by removing `check_dtype=False` and `check_categorical=False` so type
# equality is asserted again
tm.assert_frame_equal(df, df_fp, check_dtype=False, check_categorical=False)
Would this work instead? Since the underlying type of the categorical is now int, it feels like it should.
- # TODO: once fastparquet supports pandas 3 dtypes revert string and categorical
- # tests by removing `check_dtype=False` and `check_categorical=False` so type
- # equality is asserted again
- tm.assert_frame_equal(df, df_fp, check_dtype=False, check_categorical=False)
+ # TODO: once fastparquet supports pandas 3's pd.StringDtype() remove casting
+ expected_types = {"a": pd.StringDtype()}
+ tm.assert_frame_equal(df, df_fp.astype(expected_types))
# fastparquet can't write pandas 3.0 StringDtype
df_for_fp = df.copy()
df_for_fp['a'] = df_for_fp['a'].astype(object)
fp.write(file_fastparquet, df_for_fp)
Perhaps we can avoid creating a new dataframe and only non-destructively cast at write-time?
- # fastparquet can't write pandas 3.0 StringDtype
- df_for_fp = df.copy()
- df_for_fp['a'] = df_for_fp['a'].astype(object)
- fp.write(file_fastparquet, df_for_fp)
+ # fastparquet can't write pandas 3.0 StringDtype
+ fp_compatible_types = {"a": object}
+ fp.write(file_fastparquet, df.astype(fp_compatible_types))
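The write-time cast suggested above relies on `DataFrame.astype` returning a new frame rather than modifying `df`. A minimal sketch of that behaviour, assuming a hypothetical one-column frame in place of the test's `df` and leaving out the actual `fp.write` call:

```python
import pandas as pd

# Hypothetical stand-in for the test's `df`; only column "a" matters here.
df = pd.DataFrame({"a": pd.Series(["foo", "bar"], dtype="string")})

# Same mapping as in the suggestion: cast "a" to object only for the writer.
fp_compatible_types = {"a": object}
df_for_writer = df.astype(fp_compatible_types)

# The frame handed to the writer has an object column ...
writer_dtype = df_for_writer["a"].dtype
# ... while the original frame keeps its string dtype for later assertions.
original_is_string = isinstance(df["a"].dtype, pd.StringDtype)
```

Because `df` is left untouched, later assertions can still compare against the original dtypes without keeping a separate `df_for_fp` copy around.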
tm.assert_frame_equal(table_fp.to_pandas(), df_for_fp, check_dtype=False,
                      check_categorical=False)
Maybe:
- tm.assert_frame_equal(table_fp.to_pandas(), df_for_fp, check_dtype=False,
-                       check_categorical=False)
+ expected_types = {"a": pd.StringDtype(), "f": object}
+ tm.assert_frame_equal(table_fp.to_pandas(), df.astype(expected_types))
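The idea behind these suggestions, casting one side to the expected dtypes and then comparing strictly instead of passing `check_dtype=False`, can be sketched as follows. The frames are hypothetical stand-ins (no Parquet file is written or read), and `pandas.testing` is imported as `tm` here only to mirror the alias used in the test:

```python
import pandas as pd
import pandas.testing as tm

# Hypothetical stand-ins: `expected` plays the role of `df`, and `roundtrip`
# the frame read back with the string column degraded to object.
expected = pd.DataFrame({"a": pd.Series(["foo", "bar"], dtype="string"), "b": [1, 2]})
roundtrip = pd.DataFrame({"a": pd.Series(["foo", "bar"], dtype=object), "b": [1, 2]})

# A strict comparison notices the dtype difference.
try:
    tm.assert_frame_equal(roundtrip, expected)
    strict_check_failed = False
except AssertionError:
    strict_check_failed = True

# Casting only the known-problematic column keeps every other dtype check active.
expected_types = {"a": pd.StringDtype()}
tm.assert_frame_equal(roundtrip.astype(expected_types), expected)
```

Compared with `check_dtype=False`, this keeps the dtype assertion for column "b" and documents exactly which column is being worked around.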
Co-authored-by: Rok Mihevc <rok@mihevc.org>
Rationale for this change
Closes #48978
What changes are included in this PR?
Update to parquet/test_basic.py test_fastparquet_cross_compatibility for fastparquet string and categorical dtype differences causing failure Attribute "dtype" are different
Update to test_pandas.py test_timestamp_as_object_non_nanosecond for failure ValueError: fromutc: dt.tzinfo is not self.
Are these changes tested?
Yes. Initially tested locally with pandas upgraded to 3.0 as CI was still running with pandas 2.3.3 cached.
Are there any user-facing changes?
No.
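For context on the `ValueError: fromutc: dt.tzinfo is not self` message quoted above: `tzinfo.fromutc` checks object identity, not equality, so two equal timezone objects are not interchangeable. The following is a standard-library sketch of that check only; it is an illustration and does not reproduce the code path in test_timestamp_as_object_non_nanosecond, and all names in it are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Two timezone objects that compare equal but are distinct instances.
tz_a = timezone(timedelta(hours=1))
tz_b = timezone(timedelta(hours=1))

# A datetime bound to tz_b, not tz_a.
dt = datetime(2024, 1, 1, 12, 0, tzinfo=tz_b)


def fromutc_error(tz, value):
    """Return the ValueError message raised by tz.fromutc(value), or None."""
    try:
        tz.fromutc(value)
    except ValueError as exc:
        return str(exc)
    return None


# fromutc() requires `value.tzinfo is tz`, so the equal-but-distinct tz_a fails.
mismatch_message = fromutc_error(tz_a, dt)

# Re-binding the datetime to the same object that performs the conversion works.
rebound = dt.replace(tzinfo=tz_a)
converted = tz_a.fromutc(rebound)
```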