Skip to content

Don't abort PDF conversion on a single malformed page#2078

Open
assinscreedFC wants to merge 1 commit into
microsoft:mainfrom
assinscreedFC:fix/pdf-resilient-malformed-page
Open

Don't abort PDF conversion on a single malformed page#2078
assinscreedFC wants to merge 1 commit into
microsoft:mainfrom
assinscreedFC:fix/pdf-resilient-malformed-page

Conversation

@assinscreedFC

Copy link
Copy Markdown

Summary

Closes #2071.

PdfConverter.convert ran the entire per-page pdfplumber loop inside a single try, and on any page error it abandoned the whole loop and fell back to an unguarded whole-document pdfminer.high_level.extract_text call. A single malformed page — for example a form XObject whose /BBox is not a four-number rectangle, which makes the PDF backend raise ValueError: not enough values to unpack (expected 4, got 2) — therefore aborted the entire conversion with FileConversionException, even when every other page was fine.

Change

  • Each page's extraction is wrapped in try/except so one malformed page is skipped (logged at debug level via a module logger) while the text recovered from the other pages is kept. page.close() moved to a finally so cached page data is always freed.
  • The form_page_count == 0 whole-document pdfminer call is now guarded and falls back to the per-page pdfplumber text already collected.
  • The two remaining extract_text fallbacks are guarded so they degrade to "" instead of raising.

The deeper root cause is a robustness gap in pdfminer.six (it unpacks /BBox without validating its length); that is being fixed upstream separately. This PR makes markitdown degrade gracefully regardless of the backend version.

Tests

Added packages/markitdown/tests/test_pdf_invalid_bbox.py (self-contained, building the PDF in-memory): a two-page document whose second page has a malformed form /BBox converts without raising and still recovers the first page's text.

pytest packages/markitdown/tests/test_pdf_invalid_bbox.py packages/markitdown/tests/test_pdf_masterformat.py → passes; black --check clean.


AI assistance (Claude) was used; I have reviewed every line and run the tests.

PdfConverter ran the whole per-page pdfplumber loop in one try and, on any
page error, fell back to an unguarded whole-document
pdfminer.high_level.extract_text call. A single malformed page (e.g. a form
XObject with an invalid /BBox that makes the PDF backend raise
"not enough values to unpack (expected 4, got 2)") therefore aborted the
entire conversion with FileConversionException.

Per-page extraction is now wrapped so a bad page is skipped (logged at debug
level) while the text from the other pages is kept, and the pdfminer fallback
calls are guarded, degrading to the already-collected pdfplumber text instead
of raising.

Closes microsoft#2071

Co-authored-by: Claude <noreply@anthropic.com>
@assinscreedFC

Copy link
Copy Markdown
Author

@microsoft-github-policy-service agree

@noezhiya-dot

Copy link
Copy Markdown

This is exactly the kind of robustness fix that a document converter needs. PDFs in the wild are messy, and one malformed page shouldn't tank the whole conversion.

The changes are well-structured:

  • Per-page try/except with debug logging is the right granularity — you get visibility into failures without losing good data.
  • Moving page.close() to finally is a good cleanup, prevents cached page data from leaking.
  • Guarding the pdfminer fallback calls is defensive but necessary — if pdfplumber couldn't handle it, pdfminer might not either.
  • The fallback chain (pdfplumber per-page -> pdfminer whole doc -> already collected chunks -> empty string) degrades gracefully at every level.

One suggestion: consider logging the page-level skip at warning level instead of debug. In production use, a user might want to know that pages were skipped without having to enable debug logging. But that's a minor preference.

The test is thorough — building a malicious PDF in-memory with an invalid /BBox is clever and self-contained. LGTM.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

PdfConverter fails on valid PDF: ValueError not enough values to unpack (expected 4, got 2)

2 participants