feat: support TSV/other delimiters and fix Markdown table collisions #2061

Open
trippinganymess wants to merge 5 commits into microsoft:main from trippinganymess:TSV-and-other-delimiter-support

@trippinganymess trippinganymess commented Jun 3, 2026

Problem Statement

The current CSV converter lacks support for alternative delimiter formats (like .tsv, .psv, .ssv) and fails to safely escape structural characters (like pipes and newlines) within cell data, leading to corrupted Markdown table rendering.

Proposed Solution

  1. Dynamic Delimiter Resolution -
  • Lets users explicitly override the delimiter via kwargs.
  • If no delimiter is provided, it uses csv.Sniffer() to detect the delimiter from the file content (handling edge cases where the actual data doesn't match the file extension).
  • If the sniffer fails (csv.Error), it falls back to a default delimiter based on the file's extension or MIME type.
  • Added text/tsv even though it is not in the official IANA list, because many legacy systems still use it.
  2. Iterative Cell Sanitization -
  • Routes all cell data through a new sanitize_cell() helper during string construction to preserve Markdown table integrity.
  • Escapes rogue pipes (| becomes \|) to prevent column layout collisions.
  • Flattens newlines (\n becomes a space) and removes carriage returns (\r becomes "") so that each row stays on a single line.
  • Rows are sanitized iteratively, because loading the whole content (all rows at once) could cause a heap memory spike.
  3. Minor Chores -
  • Fixed a small text duplication in the documentation of _base_converter.py.
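
For readers following along, the resolution order described in item 1 (explicit kwarg, then sniffer, then extension/MIME fallback) could be sketched roughly as follows. This is not the code from this PR: `resolve_delimiter`, the two lookup tables, and the default sample size are hypothetical names and values used only for illustration, and mapping `.ssv` to a semicolon is an assumption.

```python
import csv
import os

# Hypothetical fallback tables; the PR's actual mappings are not shown in this thread.
# Assumption: .ssv means semicolon-separated and .psv means pipe-separated.
EXTENSION_DELIMITERS = {".csv": ",", ".tsv": "\t", ".psv": "|", ".ssv": ";"}
MIMETYPE_DELIMITERS = {
    "text/csv": ",",
    "text/tab-separated-values": "\t",
    "text/tsv": "\t",  # not IANA-registered, but seen in legacy systems
}


def resolve_delimiter(text, filename=None, mimetype=None, delimiter=None,
                      sample_size=8192):
    """Return a delimiter using: explicit kwarg -> sniffer -> extension/MIME."""
    # 1. An explicit delimiter always wins.
    if delimiter is not None:
        return delimiter

    # 2. Sniff a bounded sample of the content, restricted to likely candidates.
    try:
        return csv.Sniffer().sniff(text[:sample_size], delimiters=",\t|;").delimiter
    except csv.Error:
        pass  # the sniffer could not decide; fall through to the fallback

    # 3. Fall back to the file extension, then the MIME type.
    if filename:
        ext = os.path.splitext(filename)[1].lower()
        if ext in EXTENSION_DELIMITERS:
            return EXTENSION_DELIMITERS[ext]
    if mimetype:
        base_type = mimetype.split(";")[0].strip().lower()
        if base_type in MIMETYPE_DELIMITERS:
            return MIMETYPE_DELIMITERS[base_type]

    # Last resort: plain CSV.
    return ","
```

In this sketch, tab-separated content stored in a file named `data.csv` still resolves to a tab, because the sniffer runs before the extension is consulted.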

resolves #2019 and #2022

…ssv, .psv)

Refactored the converter to extend its capabilities to convert .tsv, .psv, and .ssv files into Markdown. The implementation now lets the user explicitly specify the delimiter used in the file. If the delimiter is not specified, a sniffer function is used to determine it. If that fails with csv.Error, we fall back to checking the extension and MIME type to determine the delimiter.
…e, carriage return character

This is applied iteratively during row construction to prevent Markdown layout collisions without spiking heap memory.
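
A minimal sketch of what per-row sanitization during table construction might look like is given below. It only assumes the behavior stated in the PR description (pipes escaped, newlines flattened to spaces, carriage returns removed). The generator `iter_markdown_rows` and the padding of short rows are hypothetical illustration choices, not the PR's actual code.

```python
import csv
import io


def sanitize_cell(value: str) -> str:
    """Make a cell value safe to place inside a Markdown table row."""
    return (
        value.replace("|", "\\|")   # escape pipes so they don't split columns
             .replace("\r", "")     # drop carriage returns
             .replace("\n", " ")    # flatten newlines so the row stays on one line
    )


def iter_markdown_rows(stream, delimiter=","):
    """Yield Markdown table lines one at a time instead of building a full list."""
    reader = csv.reader(stream, delimiter=delimiter)
    header = next(reader, None)
    if header is None:
        return  # empty input: no table
    yield "| " + " | ".join(sanitize_cell(c) for c in header) + " |"
    yield "| " + " | ".join("---" for _ in header) + " |"
    for row in reader:
        # Assumption: pad or truncate rows so every line has the header's width.
        cells = (row + [""] * len(header))[: len(header)]
        yield "| " + " | ".join(sanitize_cell(c) for c in cells) + " |"


if __name__ == "__main__":
    data = io.StringIO('name;note\nA;"x|y"\n')
    print("\n".join(iter_markdown_rows(data, delimiter=";")))
```

Because the rows are produced by a generator, only one row is held in memory at a time, which is the point of the iterative approach.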
@trippinganymess

Author

@microsoft-github-policy-service agree

@trippinganymess

Author

I know the sniffer() function is not robust and can produce false positives or false negatives, but it seemed better than the default option. To address the robustness problem, maybe we could use dialect voting, which could provide greater reliability.

Here is what I am thinking:

  1. Get a dialect prediction from a sample at the beginning of the file.
  2. Get a dialect prediction from a sample at the end of the file.
  3. If both dialects match, we pass that prediction as the delimiter.
  4. If they don't match, we pick a third sample from the file and take a majority vote across all three samples.

If that seems like a better approach, please let me know and I will implement it.
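
The voting steps above could be sketched roughly like this. It is only an illustration of the idea, not an implementation from this PR: `vote_delimiter`, `_window`, and `_sniff` are hypothetical helpers, the third sample is assumed to come from the middle of the text, and returning `None` when no delimiter gets at least two votes is an assumed design choice.

```python
import csv
from collections import Counter


def _sniff(sample, delimiters=",\t|;"):
    """Return the sniffed delimiter, or None if the sniffer cannot decide."""
    try:
        return csv.Sniffer().sniff(sample, delimiters=delimiters).delimiter
    except csv.Error:
        return None


def _window(text, start, size):
    """Take up to `size` characters, skipping ahead to a line start if mid-file."""
    if start > 0:
        newline = text.find("\n", start)
        if newline != -1:
            start = newline + 1
    return text[start:start + size]


def vote_delimiter(text, sample_size=8192):
    """Sniff the head and tail; on disagreement, add a middle sample and vote."""
    head = _sniff(_window(text, 0, sample_size))
    tail = _sniff(_window(text, max(0, len(text) - sample_size), sample_size))
    if head is not None and head == tail:
        return head  # steps 1-3: both samples agree

    # Step 4: take a third sample (assumed: the middle) and use a majority vote.
    middle_start = max(0, len(text) // 2 - sample_size // 2)
    middle = _sniff(_window(text, middle_start, sample_size))
    votes = Counter(v for v in (head, tail, middle) if v is not None)
    if not votes:
        return None
    winner, count = votes.most_common(1)[0]
    return winner if count >= 2 else None  # no majority: let the caller fall back
```

When the function returns `None`, the caller would presumably continue to the extension or MIME type fallback, the same way it does today when the sniffer raises csv.Error.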

@noezhiya-dot


This is a well-structured enhancement that solves real problems. The three-layer approach (explicit kwarg -> sniffer -> extension fallback) is the right priority order.

A few observations:

  1. The delimiter resolution chain is solid. csv.Sniffer with explicit delimiters (",\t|;") covers the common cases, and the extension-based fallback handles edge cases where the sniffer can't determine the dialect.

  2. The sanitize_cell method is important — pipes and newlines in cell values have been a source of broken tables since the CSV converter was first written. The approach of escaping pipes and flattening newlines is standard for Markdown table generation.

  3. One concern: the sniffer can be slow on very large files since it reads the entire sample chunk. The 8192-byte cap is reasonable, but consider that csv.Sniffer().sniff() internally reads the full sample you pass it, so the cap is doing the right thing.

  4. The test vectors are good — they cover pipe escaping in both SSV and TSV formats, which are the two most likely collision scenarios.

  5. Minor: the docstring for CsvConverter has a formatting issue ("Param : delimiter" should be ":param delimiter:" for proper Sphinx/rst docs). Consider fixing in a follow-up.

Overall, this is a meaningful improvement to the CSV converter. LGTM.


Successfully merging this pull request may close these issues.

[BUG]: CsvConverter produces broken Markdown tables when cell values contain pipe characters (|)

2 participants