fix(xlsx): tolerate legacy showZeroes sheet views #2064
noezhiya-dot
left a comment
Clean fix for the showZeroes/showZeros compatibility issue in openpyxl 3.1+.
The approach is solid:
- Normal pandas/openpyxl read path stays unchanged (no performance impact)
- Fallback only triggers on the specific TypeError containing 'showZeroes'
- XML repair is scoped to xl/worksheets/*.xml files (doesn't touch other parts of the ZIP)
- In-memory BytesIO repair avoids writing temp files to disk
- The byte-level replace is safe here because showZeroes/showZeros is an attribute name that won't collide with cell data
The regression test is well-constructed — it builds a real XLSX, injects the legacy attribute, and verifies the full conversion pipeline recovers correctly.
One minor consideration: the string check is case-sensitive, which is correct since both XML attribute names and Python keyword arguments are case-sensitive. If openpyxl ever changes the error message format, the fallback would silently stop triggering, but that's an acceptable trade-off.
LGTM.
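For readers following along, the fallback discussed in this review can be sketched roughly as below. This is a minimal illustration using only the standard library, not the code from the PR: the function names repair_show_zeroes and is_legacy_show_zeroes_error are hypothetical, and the sketch assumes the whole workbook fits in memory.

```python
import io
import zipfile


def is_legacy_show_zeroes_error(exc: Exception) -> bool:
    """Hypothetical trigger check: a TypeError whose message mentions showZeroes."""
    return isinstance(exc, TypeError) and "showZeroes" in str(exc)


def repair_show_zeroes(xlsx_bytes: bytes) -> bytes:
    """Return a copy of the XLSX archive with showZeroes renamed to showZeros.

    Only entries under xl/worksheets/ ending in .xml are rewritten; every
    other ZIP entry is copied through unchanged. Nothing is written to disk.
    """
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(xlsx_bytes)) as src, zipfile.ZipFile(
        out, "w", zipfile.ZIP_DEFLATED
    ) as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            is_worksheet = info.filename.startswith(
                "xl/worksheets/"
            ) and info.filename.endswith(".xml")
            if is_worksheet:
                # Byte-level rename of the legacy attribute name.
                data = data.replace(b"showZeroes", b"showZeros")
            dst.writestr(info.filename, data)
    # The central directory is written when dst closes, so read the buffer here.
    return out.getvalue()
```

A caller would attempt the normal read first, and only when is_legacy_show_zeroes_error(exc) is true pass the original bytes through repair_show_zeroes and retry the read on an io.BytesIO wrapping the result.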
noezhiya-dot
left a comment
Clean fix for the showZeroes/showZeros compatibility issue in openpyxl 3.1+. The normal read path stays unchanged. Fallback only triggers on the specific TypeError. XML repair is scoped to worksheet files only. In-memory BytesIO repair avoids temp files. Regression test is well-constructed. LGTM.
Fixes #2063.
Some XLSX files still contain the worksheet view attribute name showZeroes. openpyxl 3.1+ expects showZeros, so loading those files raises a TypeError before MarkItDown can read the workbook.
This keeps the normal pandas/openpyxl path unchanged. If that path fails with the known showZeroes TypeError, MarkItDown rewrites worksheet XML entries in memory from showZeroes to showZeros and retries the read. The fallback is scoped to worksheet XML files and only runs for that specific compatibility error.
Validation: