feat: support TSV/other delimiters and fix Markdown table collisions by trippinganymess · Pull Request #2061 · microsoft/markitdown

trippinganymess · 2026-06-03T06:21:42Z

Problem Statement

The current CSV converter lacks support for alternative delimiter formats (like .tsv, .psv, .ssv) and fails to safely escape structural characters (like pipes and newlines) within cell data, leading to corrupted Markdown table rendering.

Proposed Solution

Dynamic Delimiter Resolution -

Lets users override detection by passing an explicit delimiter via kwargs.
If no delimiter is provided, csv.Sniffer() infers it from the file content (this handles edge cases where the content doesn't match the file extension).
If the sniffer fails (csv.Error), the converter falls back to a default delimiter based on the file's extension or MIME type.
Added text/tsv even though it is not in the official IANA list (the registered type is text/tab-separated-values), because many legacy systems still use it.
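
The resolution order above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `resolve_delimiter`, the two lookup tables, and the reading of `.ssv` as semicolon-separated are hypothetical, while the candidate set `",\t|;"` and the 8192-byte sample cap are taken from the review discussion.

```python
import csv
from typing import Optional

# Hypothetical fallback tables; the PR's actual mappings may differ.
EXTENSION_DELIMITERS = {".csv": ",", ".tsv": "\t", ".psv": "|", ".ssv": ";"}
MIME_DELIMITERS = {
    "text/csv": ",",
    "text/tab-separated-values": "\t",  # IANA-registered type
    "text/tsv": "\t",                   # unregistered alias used by legacy systems
}
SNIFF_CANDIDATES = ",\t|;"
SAMPLE_SIZE = 8192  # only the first characters of the content are sniffed


def resolve_delimiter(
    text: str,
    extension: Optional[str] = None,
    mimetype: Optional[str] = None,
    delimiter: Optional[str] = None,
) -> str:
    """Pick a delimiter: explicit value, then sniffer, then extension/MIME."""
    # 1. An explicit delimiter always wins.
    if delimiter is not None:
        return delimiter
    # 2. Ask csv.Sniffer, restricted to the candidate delimiters.
    try:
        sample = text[:SAMPLE_SIZE]
        return csv.Sniffer().sniff(sample, delimiters=SNIFF_CANDIDATES).delimiter
    except csv.Error:
        pass  # the sniffer could not decide; use the fallbacks below
    # 3. Fall back to the extension, then the MIME type, then a comma.
    if extension and extension.lower() in EXTENSION_DELIMITERS:
        return EXTENSION_DELIMITERS[extension.lower()]
    if mimetype and mimetype.lower() in MIME_DELIMITERS:
        return MIME_DELIMITERS[mimetype.lower()]
    return ","
```

Because the sniffer runs before the extension lookup, a tab-delimited file that was saved with a `.csv` extension would still resolve to a tab in this sketch.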

Iterative Cell Sanitization -

Routes all cell data through a new sanitize_cell() helper during string construction to preserve Markdown table integrity.
Escapes rogue pipes (| becomes \|) to prevent column layout collisions.
Flattens newlines (\n becomes a space) and removes carriage returns (\r becomes "") so that each row stays on a single line.
Builds rows iteratively, because loading the whole content (all rows at once) could cause a heap memory spike.
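
A minimal sketch of the sanitization and row-by-row construction described above. Only the name `sanitize_cell` comes from this PR; its body here, and the `markdown_rows` generator, are assumptions made for illustration.

```python
import csv
import io
from typing import Iterable, Iterator, List


def sanitize_cell(cell: str) -> str:
    """Make one cell safe to place inside a Markdown table row."""
    # Escape pipes, drop carriage returns, and flatten newlines to spaces.
    return cell.replace("|", "\\|").replace("\r", "").replace("\n", " ")


def markdown_rows(rows: Iterable[List[str]]) -> Iterator[str]:
    """Yield Markdown table lines one at a time instead of buffering all rows."""
    iterator = iter(rows)
    header = next(iterator, None)
    if header is None:
        return  # empty input produces no table
    yield "| " + " | ".join(sanitize_cell(c) for c in header) + " |"
    yield "| " + " | ".join("---" for _ in header) + " |"
    for row in iterator:
        yield "| " + " | ".join(sanitize_cell(c) for c in row) + " |"


# Usage: csv.reader is itself lazy, so rows are read and emitted as needed.
reader = csv.reader(io.StringIO('name;note\n"a|b";plain\n'), delimiter=";")
table = "\n".join(markdown_rows(reader))
```

Removing `\r` before replacing `\n` means a Windows-style `\r\n` inside a cell collapses to a single space rather than leaving a stray character behind.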

Minor Chores
Fixed a small text duplication in the documentation of _base_converter.py.

resolves #2019 and #2022

…ssv, .psv) Refactored the CSV converter to extend its capabilities so that .tsv, .psv, and .ssv files are converted to Markdown. The implementation now lets the user explicitly specify the delimiter used in the file. If the delimiter is not specified, a sniffer function is used to determine it. If that fails with csv.Error, we fall back to the extension and MIME type to determine the delimiter.

trippinganymess · 2026-06-03T06:22:22Z

@microsoft-github-policy-service agree

trippinganymess · 2026-06-04T12:50:25Z

I know the sniffer (csv.Sniffer) is not robust and can produce false positives or false negatives, but it seemed better than the default option. To address the robustness problem, maybe we can use dialect voting, which could provide greater reliability.

Here is what I am thinking:

Get a dialect prediction from a sample at the beginning of the file.
Get a dialect prediction from a sample at the end of the file.
If both predictions match, use that prediction as the delimiter.
If they don't match, pick a third sample from the file and take a majority vote among all three samples.

If this seems like a better approach, please let me know and I will implement it.
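
To make the proposal concrete, here is a rough sketch of the voting idea. Nothing in it is implemented in this PR: `vote_delimiter`, `_window`, and `_sniff` are hypothetical names, the third sample is taken from the middle of the text as one possible choice, and returning `None` when there is no majority is an assumption.

```python
import csv
from collections import Counter
from typing import Optional


def _window(text: str, start: int, size: int) -> str:
    """Return up to `size` characters from `start`, skipping a partial first row."""
    chunk = text[start:start + size]
    if start > 0:
        newline = chunk.find("\n")
        if newline != -1:
            chunk = chunk[newline + 1:]
    return chunk


def _sniff(sample: str, candidates: str) -> Optional[str]:
    """Return the sniffed delimiter, or None if csv.Sniffer cannot decide."""
    try:
        return csv.Sniffer().sniff(sample, delimiters=candidates).delimiter
    except csv.Error:
        return None


def vote_delimiter(
    text: str, sample_size: int = 8192, candidates: str = ",\t|;"
) -> Optional[str]:
    """Sniff the head and tail; if they disagree, add a middle sample and vote."""
    head = _sniff(_window(text, 0, sample_size), candidates)
    tail = _sniff(_window(text, max(0, len(text) - sample_size), sample_size), candidates)
    if head is not None and head == tail:
        return head
    mid_start = max(0, len(text) // 2 - sample_size // 2)
    middle = _sniff(_window(text, mid_start, sample_size), candidates)
    votes = Counter(v for v in (head, tail, middle) if v is not None)
    if not votes:
        return None
    winner, count = votes.most_common(1)[0]
    # Require agreement from at least two samples.
    return winner if count >= 2 else None
```

One caveat with this sketch: skipping to the first newline assumes rows do not contain quoted fields with embedded newlines, so a sample taken from the middle of such a file could start inside a cell.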

noezhiya-dot · 2026-06-08T23:19:38Z

This is a well-structured enhancement that solves real problems. The three-layer approach (explicit kwarg -> sniffer -> extension fallback) is the right priority order.

A few observations:

The delimiter resolution chain is solid. csv.Sniffer with explicit delimiters (",\t|;") covers the common cases, and the extension-based fallback handles edge cases where the sniffer can't determine the dialect.
The sanitize_cell method is important — pipes and newlines in cell values have been a source of broken tables since the CSV converter was first written. The approach of escaping pipes and flattening newlines is standard for Markdown table generation.
One concern: the sniffer can be slow on very large inputs, because csv.Sniffer().sniff() processes the full sample it is given. The 8192-byte cap on the sample addresses this, so the cap is doing the right thing.
The test vectors are good — they cover pipe escaping in both SSV and TSV formats, which are the two most likely collision scenarios.
Minor: the docstring for CsvConverter has a formatting issue ("Param : delimiter" should be ":param delimiter:" for proper Sphinx/rst docs). Consider fixing in a follow-up.
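
For reference, a Sphinx/reST-style field for the delimiter could look like the sketch below. The class name and the docstring wording are placeholders for illustration, not the actual CsvConverter docstring.

```python
class CsvConverterDocSketch:
    """Convert delimited text (CSV, TSV, and similar) to a Markdown table.

    :param delimiter: Optional field separator. When omitted, the delimiter
        is sniffed from the content, then inferred from the file extension
        or MIME type.
    """
```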

Overall, this is a meaningful improvement to the CSV converter. LGTM.

trippinganymess added 4 commits June 3, 2026 09:03

DOCS : corrected duplication mistake in the DocumentConverter

59e7bdb

BUG : implemented a sanitize_cell helper to escape rogue pipe, newlin…

3d9fef4

…e, carriage return character This is applied iteratively during row construction to prevent Markdown layout collisions without spiking heap memory.

BUG : added the pre-commit changes

7785b90

DOCS : updated the _csv_converter.py docs to reflect the changes

0e19780

trippinganymess mentioned this pull request Jun 4, 2026

fix: escape pipe characters in CSV table cells #2066

Open

trippinganymess wants to merge 5 commits into microsoft:main from trippinganymess:TSV-and-other-delimiter-support
