CASSANDRA-21381: CSV COPY TO corrupts control characters (newline, null byte, etc.) in text values by arvindKandpal-ksolves · Pull Request #4819 · apache/cassandra

arvindKandpal-ksolves · 2026-05-18T09:27:44Z

What this PR does / why we need it:

This PR fixes CASSANDRA-21381, where exporting data via COPY TO corrupts control characters (like newlines, null bytes) by replacing them with their Python repr() notation (e.g., \n).

Root Cause: format_value_text is shared between terminal display (SELECT) and CSV export (COPY TO). The UNICODE_CONTROLCHARS_RE substitution was running unconditionally, corrupting multi-line data in CSVs.

Fix:
Introduced an escape_control_chars parameter (defaulting to True for terminal display) to format_value_text and all related collection formatters (format_value_list, format_value_set, format_value_map, format_value_tuple, format_value_utype). During CSV export in copyutil.py (ExportProcess.format_value), we now pass escape_control_chars=False to preserve actual control characters. This approach is analogous to the escape_backslash fix introduced in CASSANDRA-21131.

Steps to reproduce the issue / verify the fix:

You can verify the fix using cqlsh with the following schema containing normal text and collections:

CREATE KEYSPACE IF NOT EXISTS test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
USE test;

CREATE TYPE IF NOT EXISTS my_type (a int, b text);

CREATE TABLE IF NOT EXISTS test_collections (
    id int PRIMARY KEY,
    my_list list<text>,
    my_set set<text>,
    my_map map<text, text>,
    my_tuple tuple<int, text>,
    my_udt frozen<my_type>
);

INSERT INTO test_collections (id, my_list, my_set, my_map, my_tuple, my_udt) 
VALUES (
    1, 
    ['list
item'], 
    {'set
item'}, 
    {'map_key': 'map
value'}, 
    (10, 'tuple
item'), 
    {a: 100, b: 'udt
item'}
);

COPY test_collections TO 'test_collections.csv';

…n cqlsh

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

pylib/cqlshlib/formatting.py:595

In format_value_utype, escape_control_chars is propagated to field values but not to field names (format_field_name calls format_value_text without forwarding the flag). This makes escape_control_chars=False only partially effective and can lead to inconsistent output for quoted identifiers. Consider passing escape_control_chars=escape_control_chars when formatting the field name as well (or document why field names are intentionally always escaped).


    def format_field_name(name):
        return format_value_text(name, encoding=encoding, colormap=colormap, quote=False)

        formatted = formatter(val, cqltype=cqltype,
                              encoding=self.encoding, colormap=NO_COLOR_MAP, date_time_format=self.date_time_format,
                              float_precision=cqltype.precision, nullval=self.nullval, quote=False,
                              decimal_sep=self.decimal_sep, thousands_sep=self.thousands_sep,
-                              boolean_styles=self.boolean_styles)
+                              boolean_styles=self.boolean_styles,
+                              escape_control_chars=False)


…e unit tests for control characters

arvindKandpal-ksolves · 2026-05-19T05:41:12Z

Hi @bschoening , I have address the Copilot review , Please review it whenever you have a time.

bschoening · 2026-05-19T13:53:04Z

@arvindKandpal-ksolves could you see my comment on PR 4813 and see if the approach of simply skipping format_value for text works here as well.

Jens-G · 2026-05-19T23:27:39Z

Code review

Found 1 issue:

Backslash doubling in format_value_text is unconditional, reintroducing the CASSANDRA-21131 bug for CSV export

In the PR's format_value_text, the line escapedval = val.replace('\\', '\\\\') runs unconditionally before the if escape_control_chars: guard (line 485). When copyutil.py calls with escape_control_chars=False during CSV export, literal backslashes in string data are still doubled — e.g. C:\Users\alice becomes C:\\Users\\alice in the CSV. The new tests in test_formatting.py don't cover this because no test includes a literal \ when testing with escape_control_chars=False.

cassandra/pylib/cqlshlib/formatting.py

Lines 481 to 487 in fb55b2e

    
           def format_value_text(val, encoding, colormap, quote=False, escape_control_chars=True, **_): 
        
               escapedval = val.replace('\\', '\\\\') 
        
               if quote: 
        
                   escapedval = escapedval.replace("'", "''") 
        
               if escape_control_chars: 
        
                   escapedval = UNICODE_CONTROLCHARS_RE.sub(_show_control_chars, escapedval) 
        
               bval = escapedval

🤖 Generated with Claude Code

_{- If this code review was useful, please react with 👍. Otherwise, react with 👎.}

bschoening · 2026-05-20T00:58:57Z

@Jens-G I was suggesting a different approach, bypassing any formatting for text in format_value()

Also, when making a proposed change, you need to run the existing python unit tests -- plus any new ones and confirm the results when submitting the PR on the Jira page.

…ext and update regression tests

arvindKandpal-ksolves · 2026-05-20T07:15:01Z

Hi @bschoening and @Jens-G,

Thanks for the suggestions! I investigated the alternative approach of bypassing format_value for text/varchar/ascii columns directly within copyutil.py. While it cleanly handles top-level text columns, it unfortunately breaks for text embedded inside collections (List, Map, Set, Tuple, UDT).

Technical Analysis:

The Collection Path: For types like list<text>, ExportProcess.format_value only sees the outer type (list) and falls through to format_value_list -> format_simple_collection. The recursive descent happens entirely inside formatting.py and routes the elements back through the dispatch table straight to format_value_text. The bypass in copyutil.py never gets a second chance, causing nested strings to be re-corrupted (\n → \\n and \ → \\\\).
The UDT Path: format_value_utype explicitly invokes format_value_text(name, ...) by name for field identifiers rather than type dispatch. A bypass or generic formatter-swap misses this path entirely.

To maintain architecture consistency, explicit kwarg propagation mirrors exactly how context-of-rendering parameters (colormap, nullval, encoding, decimal_sep, etc.) are already plumbed through these layers.

Latest Updates in this PR:

Jens-G's Feedback Resolved: I have fixed the unconditional backslash substitution. The replacement val.replace('\\', '\\\\') is now correctly wrapped inside a standard if escape_control_chars: block so that CSV export mode leaves backslashes completely untouched.
Expanded Test Coverage: The unit tests in test_formatting.py have been updated with a mixed string ("C:\\Users\\alice\nHello\x00") and strict regression assertions to prevent backslash doubling from ever creeping back. All 10 tests are passing green.

(Note: This architectural deep-dive and regression analysis was conducted with the assistance of Claude Code / Anthropic).

Please review the updated changes whenever you have time. I look forward to your thoughts!

CASSANDRA-21381: Fix control character corruption during CSV export i…

36aa880

…n cqlsh

bschoening requested a review from Copilot May 18, 2026 13:28

Copilot started reviewing on behalf of bschoening May 18, 2026 13:28 View session

Copilot AI reviewed May 18, 2026

View reviewed changes

bschoening requested a review from Copilot May 18, 2026 14:51

Copilot started reviewing on behalf of bschoening May 18, 2026 14:51 View session

Copilot AI reviewed May 18, 2026

View reviewed changes

bschoening requested a review from Copilot May 19, 2026 02:51

Copilot started reviewing on behalf of bschoening May 19, 2026 02:51 View session

Copilot AI reviewed May 19, 2026

View reviewed changes

Address PR review: fix UDT field name formatting and add comprehensiv…

fb55b2e

…e unit tests for control characters

CASSANDRA-21381: Fix conditional backslash doubling in format_value_t…

7599bb0

…ext and update regression tests

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CASSANDRA-21381: CSV COPY TO corrupts control characters (newline, null byte, etc.) in text values#4819

CASSANDRA-21381: CSV COPY TO corrupts control characters (newline, null byte, etc.) in text values#4819
arvindKandpal-ksolves wants to merge 3 commits into
apache:trunkfrom
arvindKandpal-ksolves:CASSANDRA-21381

arvindKandpal-ksolves commented May 18, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

arvindKandpal-ksolves commented May 19, 2026

Uh oh!

bschoening commented May 19, 2026

Uh oh!

Jens-G commented May 19, 2026

Uh oh!

bschoening commented May 20, 2026 •

edited

Loading

Uh oh!

arvindKandpal-ksolves commented May 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

arvindKandpal-ksolves commented May 18, 2026

What this PR does / why we need it:

Steps to reproduce the issue / verify the fix:

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

arvindKandpal-ksolves commented May 19, 2026

Uh oh!

bschoening commented May 19, 2026

Uh oh!

Jens-G commented May 19, 2026

Code review

Uh oh!

bschoening commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

arvindKandpal-ksolves commented May 20, 2026

Technical Analysis:

Latest Updates in this PR:

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

bschoening commented May 20, 2026 •

edited

Loading