fix(xlsx): tolerate legacy showZeroes sheet views#2064

Open
he-yufeng wants to merge 1 commit into microsoft:main from he-yufeng:fix/xlsx-sheetview-showzeroes

@he-yufeng

Fixes #2063.

Some XLSX files still contain the worksheet view attribute name showZeroes. openpyxl 3.1+ expects showZeros, so loading those files raises a TypeError before MarkItDown can read the workbook.

This keeps the normal pandas/openpyxl path unchanged. If that path fails with the known showZeroes TypeError, MarkItDown rewrites worksheet XML entries in memory from showZeroes to showZeros and retries the read. The fallback is scoped to worksheet XML files and only runs for that specific compatibility error.
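
The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: the names `repair_show_zeroes` and `read_with_fallback` are hypothetical, and the reader is passed in as a callable so the sketch depends only on the standard library.

```python
import io
import zipfile
from typing import BinaryIO, Callable, TypeVar

T = TypeVar("T")


def repair_show_zeroes(xlsx_bytes: bytes) -> bytes:
    """Return a copy of the archive with showZeroes renamed in worksheet XML only."""
    repaired = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(xlsx_bytes)) as src:
        with zipfile.ZipFile(repaired, "w", zipfile.ZIP_DEFLATED) as dst:
            for name in src.namelist():
                data = src.read(name)
                # Scope the rewrite to xl/worksheets/*.xml; copy every other part verbatim.
                if name.startswith("xl/worksheets/") and name.endswith(".xml"):
                    data = data.replace(b"showZeroes", b"showZeros")
                dst.writestr(name, data)
    return repaired.getvalue()


def read_with_fallback(xlsx_bytes: bytes, reader: Callable[[BinaryIO], T]) -> T:
    """Try the normal read first; repair and retry only on the known TypeError."""
    try:
        return reader(io.BytesIO(xlsx_bytes))
    except TypeError as exc:
        if "showZeroes" not in str(exc):
            raise  # unrelated TypeError: do not mask it
        return reader(io.BytesIO(repair_show_zeroes(xlsx_bytes)))
```

In practice the reader would presumably be something like `lambda f: pd.read_excel(f, sheet_name=None, engine="openpyxl")`; that wiring is an assumption here, not a quote from the diff.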

Validation:

  • .venv\Scripts\python.exe -m pytest packages\markitdown\tests\test_module_misc.py -q -k "xlsx_legacy_show_zeroes"
  • .venv\Scripts\python.exe -m pytest packages\markitdown\tests\test_module_vectors.py::test_convert_local -q
  • .venv\Scripts\python.exe -m py_compile packages\markitdown\src\markitdown\converters\_xlsx_converter.py packages\markitdown\tests\test_module_misc.py
  • .venv\Scripts\python.exe -m ruff check packages\markitdown\src\markitdown\converters\_xlsx_converter.py packages\markitdown\tests\test_module_misc.py
  • git diff --check

@noezhiya-dot left a comment

Clean fix for the showZeroes/showZeros compatibility issue in openpyxl 3.1+.

The approach is solid:

  • Normal pandas/openpyxl read path stays unchanged (no performance impact)
  • Fallback only triggers on the specific TypeError containing 'showZeroes'
  • XML repair is scoped to xl/worksheets/*.xml files (doesn't touch other parts of the ZIP)
  • In-memory BytesIO repair avoids writing temp files to disk
  • The byte-level replace is safe here because showZeroes/showZeros is an attribute name that won't collide with cell data
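
On the last point: worksheet parts can also hold inline string cells (`t="inlineStr"`), so a cell whose text is literally `showZeroes` would be rewritten by a plain byte replace. If that edge case ever mattered, a stricter variant could restrict the rename to the attribute inside a `sheetView` tag. The sketch below is an illustration only; `rename_show_zeroes_attr` is a hypothetical helper and not part of this PR.

```python
import re

# Match showZeroes only when it appears as an attribute name inside a
# <sheetView ...> start tag (optionally namespace-prefixed).
_SHEET_VIEW_ATTR = re.compile(
    rb"(<(?:\w+:)?sheetView\b[^>]*?\s)showZeroes(?=\s*=)"
)


def rename_show_zeroes_attr(worksheet_xml: bytes) -> bytes:
    """Rename the legacy attribute without touching element text."""
    return _SHEET_VIEW_ATTR.sub(rb"\1showZeros", worksheet_xml)
```

The trade-off is a slightly more fragile pattern in exchange for leaving cell text alone; for most real files the plain replace and this variant should produce the same result.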

The regression test is well-constructed — it builds a real XLSX, injects the legacy attribute, and verifies the full conversion pipeline recovers correctly.
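
For readers who want to reproduce the failure, a fixture along these lines would inject the legacy attribute into an otherwise valid workbook. This is a sketch under assumptions, not the test from the PR: `inject_legacy_show_zeroes` is a hypothetical helper, and it assumes the writer emitted `<sheetView ` followed by at least one attribute.

```python
import io
import zipfile


def inject_legacy_show_zeroes(xlsx_bytes: bytes) -> bytes:
    """Return a copy of a valid XLSX whose sheet views carry showZeroes="1"."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(xlsx_bytes)) as src:
        with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
            for name in src.namelist():
                data = src.read(name)
                if name.startswith("xl/worksheets/") and name.endswith(".xml"):
                    # The trailing space keeps <sheetViews> from matching.
                    data = data.replace(
                        b"<sheetView ", b'<sheetView showZeroes="1" '
                    )
                dst.writestr(name, data)
    return out.getvalue()
```

A regression test could then save a small workbook to a `BytesIO`, pass the bytes through this helper, and assert that conversion still returns the expected cell values.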

One minor consideration: the string check is case-sensitive, which is correct since the attribute reaches openpyxl as a Python keyword argument and keyword names are case-sensitive. If openpyxl ever changes the error message format, the fallback would silently stop triggering, but that's an acceptable trade-off.

LGTM.

@noezhiya-dot left a comment

Clean fix for the showZeroes/showZeros compatibility issue in openpyxl 3.1+. The normal read path stays unchanged. Fallback only triggers on the specific TypeError. XML repair is scoped to worksheet files only. In-memory BytesIO repair avoids temp files. Regression test is well-constructed. LGTM.

Development

Successfully merging this pull request may close these issues.

XLSX Conversion Fails: SheetView parameter 'showZeroes' incompatible with openpyxl 3.1.0+
