[SYNPY-1749] Allow quote, apostrophe and ellipsis in store_row_async #1316
Conversation
@danlu1 Is this still WIP, or are you looking for reviews?
@andrewelamb Sorry, I should have marked this as a draft.
…ctly when upload data from a dataframe
…ger output json string
The integration test failures are in the recordset and submission modules and do not appear to be related to my changes.
linglp left a comment
I think overall it looks good. The tests could be consolidated to cover all the edge cases in fewer integration tests, which would improve performance, and the docstring should be updated to reflect the current state of the code now that json.dumps() has been removed. There is also some redundant logic that can be simplified, where sample_values is created but never actually used. Finally, the function name could be more descriptive of what it actually does now (sanitizing special values rather than just converting dtypes).
    return obj

    cleaned_x = _reformat_special_values(x)
    # return json.dumps(cleaned_x, ensure_ascii=False)
Nit: is this old code that needs to be removed?
@@ -141,6 +141,8 @@ def row_labels_from_rows(rows: List[Row]) -> List[Row]:
def convert_dtypes_to_json_serializable(df):
Nit: I think the name of the function is a bit misleading. In the past, the output of this function seems to have been a JSON string, but now it no longer converts to JSON strings. Should we change the function name to something like normalize_dataframe_for_json?
Originally, the function was created to replace NA with None and convert dtypes for some columns. I tried generating the JSON string here to fix the bug, but that is no longer needed. The new changes align with what the function is about, so I'd rather not rename it.
def convert_dtypes_to_json_serializable(df):
    """
    Convert the dtypes of the int64 and float64 columns to object columns which are JSON serializable types.
    Convert the list and dict columns to JSON strings which are JSON serializable types.
I think the docstring here is also outdated. Based on my understanding, maybe this is what the function is currently doing:
Prepare a DataFrame for JSON serialization by cleaning special values and normalizing dtypes.
Replaces Ellipsis (...) with the string "..." and pandas NA with None within nested structures.
Converts ROW_ID, ROW_VERSION, and ROW_ID.1 columns to int type.
Converts other columns to object dtype after cleaning.
Yes, I haven't updated it after the final change. Thanks for catching this.
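To pin down the behavior the suggested docstring describes, here is a minimal, hypothetical sketch. The helper name _reformat_special_values comes from the diff above, but the recursion details and the use of map are assumptions for illustration, not the PR's actual implementation:

```python
import pandas as pd


def _reformat_special_values(value):
    """Recursively replace Ellipsis with "..." and pandas NA with None."""
    if value is Ellipsis:
        return "..."
    if value is pd.NA:
        return None
    if isinstance(value, list):
        return [_reformat_special_values(v) for v in value]
    if isinstance(value, dict):
        return {k: _reformat_special_values(v) for k, v in value.items()}
    return value


def convert_dtypes_to_json_serializable(df: pd.DataFrame) -> pd.DataFrame:
    """Clean special values and normalize dtypes before JSON serialization."""
    for col in df.columns:
        if col in ("ROW_ID", "ROW_VERSION", "ROW_ID.1"):
            # Row-identifier columns become plain ints.
            df[col] = df[col].astype(int)
        else:
            # Everything else becomes object dtype with special values cleaned.
            df[col] = df[col].astype(object).map(_reformat_special_values)
    return df
```

With this shape, nested lists and dicts come out containing only JSON-serializable values, which is what the downstream serialization step needs.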
df[col] = (
    df[col].replace({pd.NA: None}).astype(object)
)  # this will convert the int64 and float64 columns to object columns
if df[col].notna().any():
I am curious what the reason is for adding this check here, since _serialize_json_value seems to be able to handle null?
)  # this will convert the int64 and float64 columns to object columns
if df[col].notna().any():
    sample_values = df[col].dropna()
    if len(sample_values):
I think even though we have if df[col].notna().any():, this line might not be needed, because sample_values = df[col].dropna() already drops all the null values, so len(sample_values) is definitely greater than 0. This check is not necessary IMO.
Yes, this can be simplified, since df[col].notna().any() and len(sample_values) check the same thing.
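The redundancy noted above can be demonstrated directly: once notna().any() is true, dropna() can never return an empty Series, so the inner length check is always true.

```python
import pandas as pd

# A column with some nulls: the two checks agree.
s = pd.Series([pd.NA, "a", None], dtype=object)
assert bool(s.notna().any()) == (len(s.dropna()) > 0)

# An all-null column: both checks are false, so the guarded
# branch is never entered and the inner check never runs.
empty = pd.Series([pd.NA, None], dtype=object)
assert not empty.notna().any()
assert len(empty.dropna()) == 0
```

So the outer `if df[col].notna().any():` guard alone is sufficient.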
# AND The spy should have been called in multiple batches
assert spy_send_job.call_count == 1


async def test_store_rows_with_json_dict_columns_and_quotes(
I think currently each test has to create a new table, store rows, query back, and be torn down, and all of these are slow API calls. Should we consolidate all the integration tests into 1-2 tests that verify end-to-end behavior with ALL edge cases?
The isolated test cases align with the existing setup and were not changed in this PR. I'm not sure we want to make an overhaul here. If so, I'm happy to tweak all the tests in this script. @BryanFauble what are your thoughts?
Based on the contribution guide, I think you should use scope="function" for test isolation. However, creating multiple tables with scope="function" is expensive, so consolidating these integration tests might be a better approach.
Problem:
A JSON serialization issue occurs when a DataFrame passed to store_row_async contains a list or dictionary with strings that include both double quotes and apostrophes.
Solution:
Restrict to_csv_kwargs in get_partial_dataframe_chunk, which is the final step before uploading the data to Synapse to obtain the file handle.
Testing:
Unit tests and integration tests have been added.
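To illustrate why the CSV step matters, here is a small sketch showing that pandas' default quoting (csv.QUOTE_MINIMAL with doublequote=True) round-trips strings containing double quotes, apostrophes, and ellipses. The assumption, based on the PR description, is that restricting the kwargs passed through to to_csv preserves this default behavior:

```python
import csv
import io

import pandas as pd

df = pd.DataFrame({"col": ['he said "hi"', "it's fine", "wait ..."]})

# With QUOTE_MINIMAL and doublequote=True, fields containing quotes are
# wrapped and embedded quotes are doubled, so parsing recovers them exactly.
buf = io.StringIO()
df.to_csv(buf, index=False, quoting=csv.QUOTE_MINIMAL, doublequote=True)

round_trip = pd.read_csv(io.StringIO(buf.getvalue()))
assert round_trip["col"].tolist() == df["col"].tolist()
```

This is the behavior the upload path needs: allowing callers to override these quoting kwargs is what could corrupt strings mixing quotes and apostrophes.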