Don't abort PDF conversion on a single malformed page#2078
Open
assinscreedFC wants to merge 1 commit into
Open
Don't abort PDF conversion on a single malformed page#2078assinscreedFC wants to merge 1 commit into
assinscreedFC wants to merge 1 commit into
Conversation
PdfConverter ran the whole per-page pdfplumber loop in one try and, on any page error, fell back to an unguarded whole-document pdfminer.high_level.extract_text call. A single malformed page (e.g. a form XObject with an invalid /BBox that makes the PDF backend raise "not enough values to unpack (expected 4, got 2)") therefore aborted the entire conversion with FileConversionException. Per-page extraction is now wrapped so a bad page is skipped (logged at debug level) while the text from the other pages is kept, and the pdfminer fallback calls are guarded, degrading to the already-collected pdfplumber text instead of raising. Closes microsoft#2071 Co-authored-by: Claude <noreply@anthropic.com>
Author
|
@microsoft-github-policy-service agree |
|
This is exactly the kind of robustness fix that a document converter needs. PDFs in the wild are messy, and one malformed page shouldn't tank the whole conversion. The changes are well-structured:
One suggestion: consider logging the page-level skip at warning level instead of debug. In production use, a user might want to know that pages were skipped without having to enable debug logging. But that's a minor preference. The test is thorough — building a malicious PDF in-memory with an invalid /BBox is clever and self-contained. LGTM. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes #2071.
PdfConverter.convertran the entire per-pagepdfplumberloop inside a singletry, and on any page error it abandoned the whole loop and fell back to an unguarded whole-documentpdfminer.high_level.extract_textcall. A single malformed page — for example a form XObject whose/BBoxis not a four-number rectangle, which makes the PDF backend raiseValueError: not enough values to unpack (expected 4, got 2)— therefore aborted the entire conversion withFileConversionException, even when every other page was fine.Change
try/exceptso one malformed page is skipped (logged atdebuglevel via a module logger) while the text recovered from the other pages is kept.page.close()moved to afinallyso cached page data is always freed.form_page_count == 0whole-documentpdfminercall is now guarded and falls back to the per-pagepdfplumbertext already collected.extract_textfallbacks are guarded so they degrade to""instead of raising.The deeper root cause is a robustness gap in
pdfminer.six(it unpacks/BBoxwithout validating its length); that is being fixed upstream separately. This PR makesmarkitdowndegrade gracefully regardless of the backend version.Tests
Added
packages/markitdown/tests/test_pdf_invalid_bbox.py(self-contained, building the PDF in-memory): a two-page document whose second page has a malformed form/BBoxconverts without raising and still recovers the first page's text.pytest packages/markitdown/tests/test_pdf_invalid_bbox.py packages/markitdown/tests/test_pdf_masterformat.py→ passes;black --checkclean.AI assistance (Claude) was used; I have reviewed every line and run the tests.