pdf_rewrite_images() segfaults with shared image xrefs across many pages (buffer overflow) #4918

@alam00000

Description of the bug

Calling doc.rewrite_images() on a PDF where the same image xref is referenced from many pages causes a segmentation fault due to a buffer overflow in MuPDF's underlying C function pdf_rewrite_images.

The PDF attached has ~99 total image references across 39 pages, with a single image xref being reused (shared) on multiple pages. This appears to overflow an internal MuPDF buffer. The crash is deterministic and reproducible.
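
To check whether another PDF matches the condition described above (one image xref referenced from many pages), a small counting helper can be used. This is only a sketch: the helper names `count_image_refs`, `shared_xrefs`, and `image_xrefs_per_page` are made up for illustration, and it assumes `page.get_images(full=True)` yields tuples with the image xref at index 0, as the workaround further down also does.

```python
from collections import Counter


def count_image_refs(pages_xrefs):
    """Count image references from a list of per-page xref lists.

    Returns (refs, pages): total references per xref, and the number of
    distinct pages each xref appears on.
    """
    refs = Counter()
    pages = Counter()
    for xrefs in pages_xrefs:
        refs.update(xrefs)        # every reference counts
        pages.update(set(xrefs))  # each page counts once per xref
    return refs, pages


def shared_xrefs(pages_xrefs):
    """Map each xref used on more than one page to its page count."""
    _, pages = count_image_refs(pages_xrefs)
    return {xref: n for xref, n in pages.items() if n > 1}


def image_xrefs_per_page(path):
    """Collect the image xrefs of every page (hypothetical helper)."""
    import pymupdf  # imported lazily so the helpers above work without it

    doc = pymupdf.open(path)
    try:
        return [[img[0] for img in page.get_images(full=True)] for page in doc]
    finally:
        doc.close()
```

For example, `shared_xrefs(image_xrefs_per_page("shared_image_xref.pdf"))` would list the xrefs that are reused across pages, without calling `rewrite_images()` at all.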

How to reproduce the bug

import pymupdf
# Open any PDF where the same image xref is shared across many pages
# (e.g., a logo or watermark repeated on every page)
# The test PDF has ~99 image references across 39 pages.
doc = pymupdf.open("shared_image_xref.pdf")
# This segfaults:
doc.rewrite_images(dpi_threshold=150, dpi_target=100, quality=50)

Expected behavior: Images are rewritten/compressed without crashing.
Actual behavior: Segmentation fault (SIGSEGV) / memory corruption.
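
Because the crash takes the whole interpreter down, it can be easier to reproduce it in a child process and inspect the exit status. The following is a minimal sketch using only the standard library; `run_isolated`, `describe`, and `CHILD` are illustrative names, and the "negative return code means killed by a signal" convention applies to POSIX systems such as macOS and Linux.

```python
import subprocess
import sys

# Script run in the child; it repeats the call from the report above.
CHILD = """
import sys, pymupdf
doc = pymupdf.open(sys.argv[1])
doc.rewrite_images(dpi_threshold=150, dpi_target=100, quality=50)
print("ok")
"""


def run_isolated(code, *args, timeout=120):
    """Run `code` in a fresh interpreter and return (returncode, stdout)."""
    proc = subprocess.run(
        [sys.executable, "-c", code, *args],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stdout


def describe(returncode):
    """Turn a subprocess return code into a short description."""
    if returncode == 0:
        return "completed"
    if returncode < 0:
        # On POSIX, subprocess reports death by signal N as -N.
        return f"killed by signal {-returncode}"
    return f"exited with status {returncode}"
```

Calling `describe(run_isolated(CHILD, "shared_image_xref.pdf")[0])` should then report a signal (SIGSEGV is signal 11 on macOS and Linux) if the crash reproduces, while the parent process keeps running.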

Workaround
Currently I bypass doc.rewrite_images() entirely and implement image rewriting per xref using lower-level PyMuPDF APIs, but this is probably not ideal.

import sys
import math
import pymupdf

def safe_rewrite_images(doc, dpi_target=None, dpi_threshold=None, quality=None, set_to_gray=False):
    """Workaround for segfault in doc.rewrite_images() with shared image xrefs."""
    if not (dpi_target or quality is not None or set_to_gray):
        return

    # Collect unique image xrefs and their smask info
    xref_info = {}
    for page in doc:
        for img in page.get_images(full=True):
            xref, smask = img[0], img[1]
            if xref > 0:
                xref_info.setdefault(xref, {"smask": smask, "min_dpi": float("inf")})

    # Calculate effective DPI for each xref across all page usages
    for page in doc:
        for info in page.get_image_info(hashes=False, xrefs=True):
            xref = info.get("xref", 0)
            if xref not in xref_info:
                continue
            bbox = info.get("bbox")
            w, h = info.get("width", 0), info.get("height", 0)
            if bbox and w > 0 and h > 0:
                disp_w = abs(bbox[2] - bbox[0])
                disp_h = abs(bbox[3] - bbox[1])
                if disp_w > 0 and disp_h > 0:
                    dpi = min(w / disp_w * 72, h / disp_h * 72)
                    if dpi < xref_info[xref]["min_dpi"]:
                        xref_info[xref]["min_dpi"] = dpi

    effective_threshold = max(dpi_threshold or 0, (dpi_target or 0) + 10) if dpi_target else None

    # Rewrite each image xref individually
    for xref, meta in xref_info.items():
        min_dpi = meta["min_dpi"]
        smask_xref = meta["smask"]

        needs_downscale = bool(
            dpi_target and effective_threshold
            and min_dpi != float("inf")
            and min_dpi > effective_threshold
        )
        if not needs_downscale and quality is None and not set_to_gray:
            continue

        try:
            pix = pymupdf.Pixmap(doc, xref)

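            if set_to_gray and pix.colorspace and pix.colorspace.n > 1:
                pix = pymupdf.Pixmap(pymupdf.csGRAY, pix)
            elif pix.colorspace and pix.colorspace.n > 3:
                # e.g. CMYK: convert so the /DeviceRGB entry below matches the data
                pix = pymupdf.Pixmap(pymupdf.csRGB, pix)
            if pix.alpha:
                # JPEG has no alpha channel; drop it before encoding
                pix = pymupdf.Pixmap(pix, 0)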

            if needs_downscale:
                ratio = min_dpi / dpi_target
                shrink_n = max(0, min(7, int(math.log2(ratio))))
                if shrink_n > 0:
                    pix.shrink(shrink_n)

            q = quality if quality is not None else 85
            jpeg_bytes = pix.tobytes("jpeg", jpg_quality=q)

            cs_name = "/DeviceGray" if pix.colorspace and pix.colorspace.n == 1 else "/DeviceRGB"
            smask_entry = f"/SMask {smask_xref} 0 R " if smask_xref else ""
            new_obj = (
                f"<</Type /XObject /Subtype /Image /BitsPerComponent 8"
                f" /ColorSpace {cs_name} /Filter /DCTDecode"
                f" /Height {pix.height} /Width {pix.width}"
                f" {smask_entry}>>"
            )
            doc.update_object(xref, new_obj)
            doc.update_stream(xref, jpeg_bytes, compress=0)
            pix = None

        except Exception as e:
            sys.stderr.write(f"[pymupdf] safe_rewrite_images xref {xref}: {e}\n")
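
The DPI arithmetic in the workaround above can be looked at in isolation. The sketch below restates it as two pure functions (`effective_dpi` and `shrink_steps` are illustrative names, not PyMuPDF APIs): the effective DPI is pixels divided by displayed size in inches (points / 72), and the shrink count is the largest power of two that does not push the image below the target.

```python
import math


def effective_dpi(px_w, px_h, bbox):
    """Effective DPI of an image of px_w x px_h pixels shown in bbox (points)."""
    disp_w = abs(bbox[2] - bbox[0])
    disp_h = abs(bbox[3] - bbox[1])
    if disp_w <= 0 or disp_h <= 0:
        return None  # degenerate placement, no meaningful DPI
    # 72 points per inch; take the smaller of the two axes.
    return min(px_w / disp_w * 72, px_h / disp_h * 72)


def shrink_steps(dpi, dpi_target):
    """Number of halvings n (0..7) so that dpi / 2**n stays >= dpi_target."""
    if dpi <= dpi_target:
        return 0
    return max(0, min(7, int(math.log2(dpi / dpi_target))))
```

A 1200 x 900 pixel image placed in a 288 x 216 pt box comes out at about 300 DPI; with a target of 100 that gives one halving, so the result is about 150 DPI. Since the pixmap is only halved a whole number of times, the final resolution lands between the target and twice the target, and an image whose DPI is less than twice the target is re-encoded without being downscaled.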

PDF used:

PyMuPDF version

1.27.1

Operating system

macOS

Python version

3.14

Labels

upstream bug (bug outside this package)