Skip to content

CASSANDRA-21381: CSV COPY TO corrupts control characters (newline, null byte, etc.) in text values#4819

Open
arvindKandpal-ksolves wants to merge 3 commits into
apache:trunkfrom
arvindKandpal-ksolves:CASSANDRA-21381
Open

CASSANDRA-21381: CSV COPY TO corrupts control characters (newline, null byte, etc.) in text values#4819
arvindKandpal-ksolves wants to merge 3 commits into
apache:trunkfrom
arvindKandpal-ksolves:CASSANDRA-21381

Conversation

@arvindKandpal-ksolves
Copy link
Copy Markdown
Contributor

What this PR does / why we need it:

This PR fixes CASSANDRA-21381, where exporting data via COPY TO corrupts control characters (like newlines, null bytes) by replacing them with their Python repr() notation (e.g., \n).

Root Cause: format_value_text is shared between terminal display (SELECT) and CSV export (COPY TO). The UNICODE_CONTROLCHARS_RE substitution was running unconditionally, corrupting multi-line data in CSVs.

Fix:
Introduced an escape_control_chars parameter (defaulting to True for terminal display) to format_value_text and all related collection formatters (format_value_list, format_value_set, format_value_map, format_value_tuple, format_value_utype). During CSV export in copyutil.py (ExportProcess.format_value), we now pass escape_control_chars=False to preserve actual control characters. This approach is analogous to the escape_backslash fix introduced in CASSANDRA-21131.

Steps to reproduce the issue / verify the fix:

You can verify the fix using cqlsh with the following schema containing normal text and collections:

CREATE KEYSPACE IF NOT EXISTS test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
USE test;

CREATE TYPE IF NOT EXISTS my_type (a int, b text);

CREATE TABLE IF NOT EXISTS test_collections (
    id int PRIMARY KEY,
    my_list list<text>,
    my_set set<text>,
    my_map map<text, text>,
    my_tuple tuple<int, text>,
    my_udt frozen<my_type>
);

INSERT INTO test_collections (id, my_list, my_set, my_map, my_tuple, my_udt) 
VALUES (
    1, 
    ['list
item'], 
    {'set
item'}, 
    {'map_key': 'map
value'}, 
    (10, 'tuple
item'), 
    {a: 100, b: 'udt
item'}
);

COPY test_collections TO 'test_collections.csv';

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

pylib/cqlshlib/formatting.py:595

  • In format_value_utype, escape_control_chars is propagated to field values but not to field names (format_field_name calls format_value_text without forwarding the flag). This makes escape_control_chars=False only partially effective and can lead to inconsistent output for quoted identifiers. Consider passing escape_control_chars=escape_control_chars when formatting the field name as well (or document why field names are intentionally always escaped).

    def format_field_name(name):
        return format_value_text(name, encoding=encoding, colormap=colormap, quote=False)

Comment on lines 1752 to +1757
formatted = formatter(val, cqltype=cqltype,
encoding=self.encoding, colormap=NO_COLOR_MAP, date_time_format=self.date_time_format,
float_precision=cqltype.precision, nullval=self.nullval, quote=False,
decimal_sep=self.decimal_sep, thousands_sep=self.thousands_sep,
boolean_styles=self.boolean_styles)
boolean_styles=self.boolean_styles,
escape_control_chars=False)
@arvindKandpal-ksolves
Copy link
Copy Markdown
Contributor Author

Hi @bschoening , I have address the Copilot review , Please review it whenever you have a time.

@bschoening
Copy link
Copy Markdown
Contributor

@arvindKandpal-ksolves could you see my comment on PR 4813 and see if the approach of simply skipping format_value for text works here as well.

@Jens-G
Copy link
Copy Markdown
Member

Jens-G commented May 19, 2026

Code review

Found 1 issue:

  1. Backslash doubling in format_value_text is unconditional, reintroducing the CASSANDRA-21131 bug for CSV export

In the PR's format_value_text, the line escapedval = val.replace('\\', '\\\\') runs unconditionally before the if escape_control_chars: guard (line 485). When copyutil.py calls with escape_control_chars=False during CSV export, literal backslashes in string data are still doubled — e.g. C:\Users\alice becomes C:\\Users\\alice in the CSV. The new tests in test_formatting.py don't cover this because no test includes a literal \ when testing with escape_control_chars=False.

def format_value_text(val, encoding, colormap, quote=False, escape_control_chars=True, **_):
escapedval = val.replace('\\', '\\\\')
if quote:
escapedval = escapedval.replace("'", "''")
if escape_control_chars:
escapedval = UNICODE_CONTROLCHARS_RE.sub(_show_control_chars, escapedval)
bval = escapedval

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

@bschoening
Copy link
Copy Markdown
Contributor

bschoening commented May 20, 2026

@Jens-G I was suggesting a different approach, bypassing any formatting for text in format_value()

Also, when making a proposed change, you need to run the existing python unit tests -- plus any new ones and confirm the results when submitting the PR on the Jira page.

@arvindKandpal-ksolves
Copy link
Copy Markdown
Contributor Author

Hi @bschoening and @Jens-G,

Thanks for the suggestions! I investigated the alternative approach of bypassing format_value for text/varchar/ascii columns directly within copyutil.py. While it cleanly handles top-level text columns, it unfortunately breaks for text embedded inside collections (List, Map, Set, Tuple, UDT).

Technical Analysis:

  1. The Collection Path: For types like list<text>, ExportProcess.format_value only sees the outer type (list) and falls through to format_value_list -> format_simple_collection. The recursive descent happens entirely inside formatting.py and routes the elements back through the dispatch table straight to format_value_text. The bypass in copyutil.py never gets a second chance, causing nested strings to be re-corrupted (\n\\n and \\\\\).
  2. The UDT Path: format_value_utype explicitly invokes format_value_text(name, ...) by name for field identifiers rather than type dispatch. A bypass or generic formatter-swap misses this path entirely.

To maintain architecture consistency, explicit kwarg propagation mirrors exactly how context-of-rendering parameters (colormap, nullval, encoding, decimal_sep, etc.) are already plumbed through these layers.

Latest Updates in this PR:

  • Jens-G's Feedback Resolved: I have fixed the unconditional backslash substitution. The replacement val.replace('\\', '\\\\') is now correctly wrapped inside a standard if escape_control_chars: block so that CSV export mode leaves backslashes completely untouched.
  • Expanded Test Coverage: The unit tests in test_formatting.py have been updated with a mixed string ("C:\\Users\\alice\nHello\x00") and strict regression assertions to prevent backslash doubling from ever creeping back. All 10 tests are passing green.

(Note: This architectural deep-dive and regression analysis was conducted with the assistance of Claude Code / Anthropic).

Please review the updated changes whenever you have time. I look forward to your thoughts!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants