Recover PDF text truncated by inline images using PyMuPDF fallback#2092
Recover PDF text truncated by inline images using PyMuPDF fallback#2092Muhtasim-Munif-Fahim wants to merge 1 commit into
Conversation
|
@Muhtasim-Munif-Fahim please read the following Contributor License Agreement(CLA). If you agree with the CLA, please reply with the following information.
Contributor License AgreementContribution License AgreementThis Contribution License Agreement (“Agreement”) is agreed to by the party signing below (“You”),
|
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds an optional PyMuPDF-based text extraction fallback to improve PDF conversion when primary extractors (pdfplumber/pdfminer) appear to return truncated text, particularly around inline images.
Changes:
- Introduces an optional PyMuPDF extraction path and heuristics to prefer it when primary output looks truncated.
- Adds a regression test to validate choosing PyMuPDF when primary extraction misses post-image text.
- Documents and packages PyMuPDF as an optional extra (and includes it in the
allextra).
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| packages/markitdown/tests/test_module_misc.py | Adds a test that simulates truncation and asserts the PyMuPDF path is preferred. |
| packages/markitdown/src/markitdown/converters/_pdf_converter.py | Adds PyMuPDF extraction helper and selection heuristics based on images + output length. |
| packages/markitdown/pyproject.toml | Adds pymupdf optional dependency (extra + included in all). |
| packages/markitdown/README.md | Documents how to install the optional PyMuPDF extra. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| monkeypatch.setattr(pdf_converter_module.pdfplumber, "open", lambda _: _FakePdf()) | ||
| monkeypatch.setattr( | ||
| pdf_converter_module.pdfminer.high_level, | ||
| "extract_text", | ||
| lambda _: "BEFORE_IMAGE: this text should be extracted", | ||
| ) | ||
| monkeypatch.setattr( | ||
| pdf_converter_module.fitz, | ||
| "open", | ||
| lambda *args, **kwargs: _FakePyMuPdfDoc(), | ||
| ) |
| if fitz is not None and has_images and markdown and len(markdown) < 2048: | ||
| try: | ||
| pymupdf_markdown = _extract_with_pymupdf(pdf_bytes) | ||
| except Exception: | ||
| pymupdf_markdown = None | ||
| else: | ||
| if pymupdf_markdown is not None: | ||
| primary_length = len(markdown.strip()) | ||
| pymupdf_length = len(pymupdf_markdown.strip()) | ||
| if pymupdf_length > primary_length and ( | ||
| primary_length == 0 | ||
| or pymupdf_length >= primary_length * 1.5 | ||
| or pymupdf_length - primary_length >= 200 | ||
| ): |
| pdf_bytes.seek(0) | ||
| chunks: list[str] = [] | ||
| with fitz.open(stream=pdf_bytes.read(), filetype="pdf") as doc: |
|
This is a well-structured approach to the PDF truncation problem. A few thoughts:
Overall solid contribution. The optional dependency approach via extras is the right pattern for this project. |
|
Nice approach to the inline-image truncation problem. The fallback logic is well-scoped — it only kicks in when:
A few thoughts:
Overall, this is a well-engineered fix for a real problem. The code is clean and the tests are thorough. |
noezhiya-dot
left a comment
There was a problem hiding this comment.
Code Review: Recover PDF text truncated by inline images using PyMuPDF fallback
Solid approach to a real problem. The fallback heuristic is well-designed.
Strengths
- Smart heuristic: requires (a) images present, (b) primary output < 2048 chars, AND (c) PyMuPDF output is 1.5x+ longer or 200+ chars longer. Prevents unnecessary switching on already-good output.
- Clean separation with _extract_with_pymupdf() as a standalone function.
- Good test coverage with monkeypatched fakes.
- README and pyproject.toml extras properly updated.
Suggestions
-
Threshold magic numbers: The 2048-char threshold and 1.5x/200-char ratios are reasonable but hardcoded. Consider making them module-level constants so they are documented and tunable.
-
pdf_bytes position clarity: The fallback section relies on _extract_with_pymupdf seeking internally. An explicit pdf_bytes.seek(0) before the fallback block would improve readability.
-
Form pages edge case: When form_page_count > 0, the PyMuPDF fallback does not trigger since it is inside the pdfminer path. Form-style PDFs with image truncation on non-form pages won't get the fallback. Worth documenting this behavior in a comment.
-
Test coverage: Consider adding a test for the case where form pages exist but the PyMuPDF fallback should NOT trigger, to document that boundary.
Approved with minor suggestions. The core logic is sound and the fallback conditions are conservative enough to avoid regressions.
Fixes #1870.
Some PDFs with inline images cause pdfplumber/pdfminer to silently drop text that appears after the image. When the primary extraction path returns a much shorter body on image-bearing pages, this change optionally retries with PyMuPDF and prefers that output if it is substantially longer.
Also documents the optional pymupdf extra and adds a regression test that simulates the truncation path.