CASSANDRA-21381: CSV COPY TO corrupts control characters (newline, null byte, etc.) in text values#4819
Conversation
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
pylib/cqlshlib/formatting.py:595
- In
format_value_utype,escape_control_charsis propagated to field values but not to field names (format_field_namecallsformat_value_textwithout forwarding the flag). This makesescape_control_chars=Falseonly partially effective and can lead to inconsistent output for quoted identifiers. Consider passingescape_control_chars=escape_control_charswhen formatting the field name as well (or document why field names are intentionally always escaped).
def format_field_name(name):
return format_value_text(name, encoding=encoding, colormap=colormap, quote=False)
| formatted = formatter(val, cqltype=cqltype, | ||
| encoding=self.encoding, colormap=NO_COLOR_MAP, date_time_format=self.date_time_format, | ||
| float_precision=cqltype.precision, nullval=self.nullval, quote=False, | ||
| decimal_sep=self.decimal_sep, thousands_sep=self.thousands_sep, | ||
| boolean_styles=self.boolean_styles) | ||
| boolean_styles=self.boolean_styles, | ||
| escape_control_chars=False) |
…e unit tests for control characters
|
Hi @bschoening , I have address the Copilot review , Please review it whenever you have a time. |
|
@arvindKandpal-ksolves could you see my comment on PR 4813 and see if the approach of simply skipping format_value for text works here as well. |
Code reviewFound 1 issue:
In the PR's cassandra/pylib/cqlshlib/formatting.py Lines 481 to 487 in fb55b2e 🤖 Generated with Claude Code - If this code review was useful, please react with 👍. Otherwise, react with 👎. |
|
@Jens-G I was suggesting a different approach, bypassing any formatting for text in format_value() Also, when making a proposed change, you need to run the existing python unit tests -- plus any new ones and confirm the results when submitting the PR on the Jira page. |
…ext and update regression tests
|
Hi @bschoening and @Jens-G, Thanks for the suggestions! I investigated the alternative approach of bypassing Technical Analysis:
To maintain architecture consistency, explicit kwarg propagation mirrors exactly how context-of-rendering parameters ( Latest Updates in this PR:
(Note: This architectural deep-dive and regression analysis was conducted with the assistance of Claude Code / Anthropic). Please review the updated changes whenever you have time. I look forward to your thoughts! |
What this PR does / why we need it:
This PR fixes CASSANDRA-21381, where exporting data via
COPY TOcorrupts control characters (like newlines, null bytes) by replacing them with their Pythonrepr()notation (e.g.,\n).Root Cause:
format_value_textis shared between terminal display (SELECT) and CSV export (COPY TO). TheUNICODE_CONTROLCHARS_REsubstitution was running unconditionally, corrupting multi-line data in CSVs.Fix:
Introduced an
escape_control_charsparameter (defaulting toTruefor terminal display) toformat_value_textand all related collection formatters (format_value_list,format_value_set,format_value_map,format_value_tuple,format_value_utype). During CSV export incopyutil.py(ExportProcess.format_value), we now passescape_control_chars=Falseto preserve actual control characters. This approach is analogous to theescape_backslashfix introduced in CASSANDRA-21131.Steps to reproduce the issue / verify the fix:
You can verify the fix using
cqlshwith the following schema containing normal text and collections: