Don't abort PDF conversion on a single malformed page by assinscreedFC · Pull Request #2078 · microsoft/markitdown

assinscreedFC · 2026-06-05T16:04:44Z

Summary

Closes #2071.

PdfConverter.convert ran the entire per-page pdfplumber loop inside a single try, and on any page error it abandoned the whole loop and fell back to an unguarded whole-document pdfminer.high_level.extract_text call. A single malformed page — for example a form XObject whose /BBox is not a four-number rectangle, which makes the PDF backend raise ValueError: not enough values to unpack (expected 4, got 2) — therefore aborted the entire conversion with FileConversionException, even when every other page was fine.

Change

Each page's extraction is wrapped in try/except so one malformed page is skipped (logged at debug level via a module logger) while the text recovered from the other pages is kept. page.close() moved to a finally so cached page data is always freed.
The form_page_count == 0 whole-document pdfminer call is now guarded and falls back to the per-page pdfplumber text already collected.
The two remaining extract_text fallbacks are guarded so they degrade to "" instead of raising.

The deeper root cause is a robustness gap in pdfminer.six (it unpacks /BBox without validating its length); that is being fixed upstream separately. This PR makes markitdown degrade gracefully regardless of the backend version.

Tests

Added packages/markitdown/tests/test_pdf_invalid_bbox.py (self-contained, building the PDF in-memory): a two-page document whose second page has a malformed form /BBox converts without raising and still recovers the first page's text.

pytest packages/markitdown/tests/test_pdf_invalid_bbox.py packages/markitdown/tests/test_pdf_masterformat.py → passes; black --check clean.

AI assistance (Claude) was used; I have reviewed every line and run the tests.

PdfConverter ran the whole per-page pdfplumber loop in one try and, on any page error, fell back to an unguarded whole-document pdfminer.high_level.extract_text call. A single malformed page (e.g. a form XObject with an invalid /BBox that makes the PDF backend raise "not enough values to unpack (expected 4, got 2)") therefore aborted the entire conversion with FileConversionException. Per-page extraction is now wrapped so a bad page is skipped (logged at debug level) while the text from the other pages is kept, and the pdfminer fallback calls are guarded, degrading to the already-collected pdfplumber text instead of raising. Closes microsoft#2071 Co-authored-by: Claude <noreply@anthropic.com>

assinscreedFC · 2026-06-05T16:14:34Z

@microsoft-github-policy-service agree

noezhiya-dot · 2026-06-08T23:10:55Z

This is exactly the kind of robustness fix that a document converter needs. PDFs in the wild are messy, and one malformed page shouldn't tank the whole conversion.

The changes are well-structured:

Per-page try/except with debug logging is the right granularity — you get visibility into failures without losing good data.
Moving page.close() to finally is a good cleanup, prevents cached page data from leaking.
Guarding the pdfminer fallback calls is defensive but necessary — if pdfplumber couldn't handle it, pdfminer might not either.
The fallback chain (pdfplumber per-page -> pdfminer whole doc -> already collected chunks -> empty string) degrades gracefully at every level.

One suggestion: consider logging the page-level skip at warning level instead of debug. In production use, a user might want to know that pages were skipped without having to enable debug logging. But that's a minor preference.

The test is thorough — building a malicious PDF in-memory with an invalid /BBox is clever and self-contained. LGTM.

assinscreedFC mentioned this pull request Jun 5, 2026

PdfConverter fails on valid PDF: ValueError not enough values to unpack (expected 4, got 2) #2071

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Don't abort PDF conversion on a single malformed page#2078

Don't abort PDF conversion on a single malformed page#2078
assinscreedFC wants to merge 1 commit into
microsoft:mainfrom
assinscreedFC:fix/pdf-resilient-malformed-page

assinscreedFC commented Jun 5, 2026

Uh oh!

assinscreedFC commented Jun 5, 2026

Uh oh!

noezhiya-dot commented Jun 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

assinscreedFC commented Jun 5, 2026

Summary

Change

Tests

Uh oh!

assinscreedFC commented Jun 5, 2026

Uh oh!

noezhiya-dot commented Jun 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants