feat: support TSV/other delimiters and fix Markdown table collisions #2061
trippinganymess wants to merge 5 commits
…ssv, .psv) Refactored the CSV converter to extend its capabilities so that .tsv, .psv, and .ssv files are converted to Markdown. The implementation now lets the user override the delimiter used in the file. If the delimiter is not specified, a sniffer function is used to determine it. If sniffing fails with csv.Error, we fall back to the file extension and MIME type to determine the delimiter.
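The resolution order described in this commit message (explicit override, then sniffer, then extension/MIME fallback) could look roughly like the sketch below. It is not the PR's code: `resolve_delimiter`, the two lookup tables, and the candidate delimiter set are hypothetical names chosen for illustration, and treating `.ssv` as semicolon-separated is an assumption. Only `csv.Sniffer().sniff()` and `csv.Error` come from the standard library.

```python
import csv
from typing import Optional

# Hypothetical fallback tables; the PR's actual mappings may differ.
# Assumption: ".ssv" is treated as semicolon-separated here.
EXTENSION_DELIMITERS = {".csv": ",", ".tsv": "\t", ".psv": "|", ".ssv": ";"}
MIME_DELIMITERS = {
    "text/csv": ",",
    "text/tab-separated-values": "\t",
    "text/tsv": "\t",  # not IANA-registered, but seen in the wild
}


def resolve_delimiter(
    sample: str,
    extension: str = "",
    mimetype: str = "",
    delimiter: Optional[str] = None,
) -> str:
    """Pick a delimiter: explicit kwarg, then sniffer, then extension/MIME."""
    # 1. An explicit override always wins.
    if delimiter is not None:
        return delimiter

    # 2. Ask csv.Sniffer to infer the delimiter from the content itself.
    try:
        return csv.Sniffer().sniff(sample, delimiters=",\t|;").delimiter
    except csv.Error:
        pass  # sniffing failed; fall through to the metadata-based guess

    # 3. Fall back to the extension, then the MIME type, then a plain comma.
    ext = extension.lower()
    if ext in EXTENSION_DELIMITERS:
        return EXTENSION_DELIMITERS[ext]
    return MIME_DELIMITERS.get(mimetype.lower(), ",")
```

A single-column file is a typical case where the sniffer raises `csv.Error`, so the extension or MIME type ends up deciding the delimiter.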
…e, carriage return character. This is applied iteratively during row construction to prevent Markdown layout collisions without spiking heap memory.
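As a rough illustration of escaping cells while rows are being built, the sketch below streams rows through a generator so that no second, fully escaped copy of the table is created. The helper names (`escape_cell`, `markdown_rows`, `csv_to_markdown`) are hypothetical, and the specific replacements (a backslash before each pipe, `<br>` for line breaks) are assumptions rather than what the PR necessarily does.

```python
import csv
import io
from typing import Iterable, Iterator, List


def escape_cell(cell: str) -> str:
    """Neutralize characters that would break a Markdown table row."""
    # Assumption: pipes are backslash-escaped, line breaks become <br>.
    cell = cell.replace("|", "\\|")
    return cell.replace("\r\n", "<br>").replace("\n", "<br>").replace("\r", "<br>")


def markdown_rows(rows: Iterable[List[str]]) -> Iterator[str]:
    """Yield one Markdown table line per row, escaping cells as we go."""
    for row in rows:
        yield "| " + " | ".join(escape_cell(c) for c in row) + " |"


def csv_to_markdown(text: str, delimiter: str = ",") -> str:
    """Convert delimited text to a Markdown table (first row is the header)."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    try:
        header = next(reader)
    except StopIteration:
        return ""  # empty input, nothing to render
    lines = list(markdown_rows([header]))
    lines.append("| " + " | ".join("---" for _ in header) + " |")
    lines.extend(markdown_rows(reader))  # rows are escaped one at a time
    return "\n".join(lines)
```

For example, a quoted cell containing `a|b` followed by a newline and `c` would come out as `a\|b<br>c` inside a single table cell instead of splitting the row.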
@microsoft-github-policy-service agree
I know the sniffer() function is not robust and can lead to false positives or false negatives, but it seemed better than the default option. To address the robustness problem, maybe we could use dialect voting, which could provide greater reliability. Here is what I am thinking:
If that seems like a better approach, please let me know and I will implement it.
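For readers unfamiliar with the idea, one possible reading of dialect voting is sketched below; it is an illustration under stated assumptions, not the code proposed in this thread. Each candidate delimiter parses the sample, every parsed row "votes" for its column count, and the candidate whose most common column count (of at least two columns) covers the largest share of rows wins. `vote_delimiter` and the candidate list are hypothetical.

```python
import csv
import io
from collections import Counter
from typing import Optional, Sequence


def vote_delimiter(
    sample: str,
    candidates: Sequence[str] = (",", "\t", "|", ";"),
) -> Optional[str]:
    """Return the candidate that splits rows most consistently, or None."""
    best: Optional[str] = None
    best_score = 0.0
    for cand in candidates:
        # Parse the sample with this candidate, skipping blank lines.
        rows = [r for r in csv.reader(io.StringIO(sample), delimiter=cand) if r]
        if not rows:
            continue
        # Each row votes for its own column count.
        width, votes = Counter(len(r) for r in rows).most_common(1)[0]
        if width < 2:
            continue  # this candidate mostly fails to split rows at all
        score = votes / len(rows)  # share of rows agreeing on the width
        if score > best_score:  # on ties, the earlier candidate is kept
            best, best_score = cand, score
    return best
```

A `None` result means no candidate produced a consistent multi-column split, in which case a caller could fall back to the extension or MIME type.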
This is a well-structured enhancement that solves real problems. The three-layer approach (explicit kwarg -> sniffer -> extension fallback) is the right priority order. A few observations:
Overall, this is a meaningful improvement to the CSV converter. LGTM.
Problem Statement
The current CSV converter lacks support for alternative delimiter formats (like .tsv, .psv, .ssv) and fails to safely escape structural characters (like pipes and newlines) within cell data, leading to corrupted Markdown table rendering.
Proposed Solution
- … kwargs.
- … csv.Sniffer() to dynamically determine the internal delimiter from the file content (handling edge cases where internal data doesn't match the file extension).
- … csv.Error), it safely falls back to a generic delimiter based on the file's extension or MIME type.
- … text/tsv even though it is not in the official IANA list, because many legacy systems still use it.
Fixed a small text duplication in the documentation of _base_converter.py.
Resolves #2019 and #2022