diff --git a/docs/about.rst b/docs/about.rst index 7af0df22e..f5e8b5b24 100644 --- a/docs/about.rst +++ b/docs/about.rst @@ -73,8 +73,8 @@ PyMuPDF Product Suite **Additional products** in the |PyMuPDF| product suite are: - |PyMuPDF Pro| adds support for Office document formats. -- |PyMuPDF4LLM| is optimized for large language model (LLM) applications, providing enhanced text extraction and processing capabilities. -- |PyMuPDF Layout| focuses on layout analysis and semantic understanding, ideal for document conversion and formatting tasks with enhanced results. +- |PyMuPDF4LLM| is optimized for large language model (LLM) applications, providing enhanced text extraction and processing capabilities. + It focuses on layout analysis and semantic understanding, ideal for document conversion and formatting tasks with enhanced results. .. note:: All of the products above depend on the same core product - |PyMuPDF| and therefore have full access to all of its features. @@ -89,64 +89,51 @@ PyMuPDF Products Comparison The following table illustrates what features the products offer: .. 
list-table:: PyMuPDF Products Comparison - :widths: 8 23 23 23 23 + :widths: 10 30 30 30 :header-rows: 1 * - - PyMuPDF - PyMuPDF Pro - PyMuPDF4LLM - - PyMuPDF Layout * - **Input Documents** - `PDF`, `XPS`, `EPUB`, `CBZ`, `MOBI`, `FB2`, `SVG`, `TXT`, Images (*standard document types*) - *as PyMuPDF* and: `DOC`/`DOCX`, `XLS`/`XLSX`, `PPT`/`PPTX`, `HWP`/`HWPX` - *as PyMuPDF* - - *as PyMuPDF* * - **Output Documents** - Can convert any input document to `PDF`, `SVG` or Image - *as PyMuPDF* - - *as PyMuPDF* and: - Markdown (`MD`) - - *as PyMuPDF4LLM* and: - `JSON` or `TXT` + - *as PyMuPDF* and: + Markdown (`MD`), `JSON` or `TXT` * - **Page Analysis** - Basic page analysis to return document structure - *as PyMuPDF* - - *as PyMuPDF* - Advanced Page Analysis with trained data for enhanced results * - **Data extraction** - Basic data extraction with structured layout information and bounding box data - *as PyMuPDF* - - Advanced data extraction with structure tags such as headings, lists, tables - - Advanced layout analysis and semantic understanding + - Advanced data extraction including layout analysis with semantic understanding and enhanced bounding box data * - **Table extraction** - Basic table extraction as part of text extraction - *as PyMuPDF* - - Advanced table extraction with cell structure and data types - - Superior table detection + - Advanced table extraction with cell structure, including support for merged cells and complex layouts * - **Image extraction** - Basic image extraction - *as PyMuPDF* - Advanced detection and rendering of image areas on page saving them to disk or embedding in MD output - - Superior detection of "picture" areas * - **Vector extraction** - Vector extraction and clustering - *as PyMuPDF* - - *as PyMuPDF* - - Superior detection of "picture" areas + - Superior detection of "picture" areas * - **Popular RAG Integrations** - - Langchane, LlamaIndex + - Langchain, LlamaIndex - *as PyMuPDF* - - *as PyMuPDF* and with some addiotnal help 
methods for RAG workflows - - *as PyMuPDF4LLM* + - *as PyMuPDF* and with some additional helper methods for RAG workflows * - **OCR** - - On-demand invocation of built-in Tesseract for text detection on pages or images. + - On-demand invocation of built-in Tesseract for text detection on pages or images - *as PyMuPDF* - - *as PyMuPDF* - - Automatic OCR based on page content analysis. - - + - Automatic OCR based on page content analysis. OCR adapters for popular OCR engines are available ---- diff --git a/docs/index.rst b/docs/index.rst index cf9adcf7b..2420f84da 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -41,7 +41,6 @@ This documentation covers all versions up to |version|. :maxdepth: 1 about.rst - pymupdf-layout/index.rst pymupdf4llm/index.rst pymupdf-pro/index.rst @@ -73,6 +72,7 @@ This documentation covers all versions up to |version|. module.rst classes.rst + pymupdf4llm/api.rst algebra.rst lowlevel.rst glossary.rst diff --git a/docs/page.rst b/docs/page.rst index c842c3aaa..b5e63d8f6 100644 --- a/docs/page.rst +++ b/docs/page.rst @@ -332,15 +332,15 @@ In a nutshell, this is what you can do with PyMuPDF: |history_end| - .. method:: apply_redactions(images=PDF_REDACT_IMAGE_PIXELS|2, graphics=PDF_REDACT_LINE_ART_REMOVE_IF_TOUCHED|2, text=PDF_REDACT_TEXT_REMOVE|0) + .. method:: apply_redactions(images=PDF_REDACT_IMAGE_PIXELS|2, graphics=PDF_REDACT_LINE_ART_REMOVE_IF_COVERED|1, text=PDF_REDACT_TEXT_REMOVE|0) **PDF only**: Remove all **content** contained in any redaction rectangle on the page. **This method applies and then deletes all redactions from the page.** - :arg int images: How to redact overlapping images. The default (2) blanks out overlapping pixels. `PDF_REDACT_IMAGE_NONE | 0` ignores, and `PDF_REDACT_IMAGE_REMOVE | 1` completely removes images overlapping any redaction annotation. Option `PDF_REDACT_IMAGE_REMOVE_UNLESS_INVISIBLE | 3` only removes images that are actually visible. + :arg int images: How to redact overlapping images. 
The default `PDF_REDACT_IMAGE_PIXELS | 2` blanks out overlapping pixels. `PDF_REDACT_IMAGE_NONE | 0` ignores, and `PDF_REDACT_IMAGE_REMOVE | 1` completely removes images overlapping any redaction annotation. Option `PDF_REDACT_IMAGE_REMOVE_UNLESS_INVISIBLE | 3` only removes images that are actually visible. - :arg int graphics: How to redact overlapping vector graphics (also called "line-art" or "drawings"). The default (2) removes any overlapping vector graphics. `PDF_REDACT_LINE_ART_NONE | 0` ignores, and `PDF_REDACT_LINE_ART_REMOVE_IF_COVERED | 1` removes graphics fully contained in a redaction annotation. When removing line-art, please be aware that **stroked** vector graphics (i.e. type "s" or "sf") have a **larger wrapping rectangle** than one might expect: first of all, at least 50% of the path's line width have to be added in each direction to truly include all of the drawing. If a so-called "miter limit" is provided (see page 121 of the PDF specification), the enlarging value is `miter * width / 2`. So, when letting everything default (width = 1, miter = 10), the redaction rectangle should be at least 5 points larger in every direction. + :arg int graphics: How to redact overlapping vector graphics (also called "line-art" or "drawings"). The default `PDF_REDACT_LINE_ART_REMOVE_IF_COVERED | 1` removes graphics fully contained in a redaction annotation. `PDF_REDACT_LINE_ART_NONE | 0` ignores, and `PDF_REDACT_LINE_ART_REMOVE_IF_TOUCHED | 2` removes any overlapping vector graphics. When removing line-art, please be aware that **stroked** vector graphics (i.e. type "s" or "sf") have a **larger wrapping rectangle** than one might expect: first of all, at least 50% of the path's line width has to be added in each direction to truly include all of the drawing. If a so-called "miter limit" is provided (see page 121 of the PDF specification), the enlarging value is `miter * width / 2`. 
So, when letting everything default (width = 1, miter = 10), the redaction rectangle should be at least 5 points larger in every direction. :arg int text: Whether to redact overlapping text. The default `PDF_REDACT_TEXT_REMOVE | 0` removes all characters whose boundary box overlaps any redaction rectangle. This complies with the original legal / data protection intentions of redaction annotations. Other use cases however may require to **keep text** while redacting vector graphics or images. This can be achieved by setting `text=True|PDF_REDACT_TEXT_NONE | 1`. This does **not comply** with the data protection intentions of redaction annotations. **Do so at your own risk.** diff --git a/docs/pymupdf-layout/index.rst b/docs/pymupdf-layout/index.rst deleted file mode 100644 index 39218d437..000000000 --- a/docs/pymupdf-layout/index.rst +++ /dev/null @@ -1,229 +0,0 @@ - -.. include:: ../header.rst - -.. _pymupdf-layout: - -.. raw:: html - - - - - -PyMuPDF Layout -=========================================================================== - - -|PyMuPDF Layout| is a lightweight layout analysis extension for |PyMuPDF| that turns PDFs into clean, structured data with minimal setup. It’s fast, accurate, and efficient without any GPU requirement. - -It is an optional, but recommended, addition to the |PyMuPDF| library especially if you are required to more accurately extract structured data with better semantic information. - - -.. raw:: html - - -

- - -Installing ----------------------------------- - -Install from |PyPI| with:: - - - pip install pymupdf-layout - - -.. _pymupdf_layout_using: - -Using ----------------------------------- - - -In nutshell, |PyMuPDF Layout| detects the layout to extract, but we need |PyMuPDF4LLM| for the API interface. This provides us with options to extract document content as |Markdown|, |JSON| or |TXT|. - -Let's set up the Python coding environment to get started and open a PDF then we'll move on to the semantic data extraction. - -Register packages and open a PDF -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -First up let's import the libraries and open a sample document:: - - import pymupdf.layout - import pymupdf4llm - doc = pymupdf.open("sample.pdf") - -Note, in the above code, that |PyMuPDF Layout| must be imported as shown and before importing |PyMuPDF4LLM| to activate |PyMuPDF|'s layout feature and make it available to |PyMuPDF4LLM|. - -Omitting the first line would cause execution of standard |PyMuPDF4LLM| - without the layout feature! - -Extract the structured data -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -We've activated the |PyMuPDF Layout| library and we've loaded a document, next let's extract the structured data. This is now like a super-charged version of standard |PyMuPDF4LLM| with ``Layout`` working behind the scenes combining heuristics with machine learning - for better extraction results. - -Extract as Markdown -"""""""""""""""""""""""" - -.. code-block:: python - - md = pymupdf4llm.to_markdown(doc) - - -Extract as JSON -""""""""""""""""" - -.. code-block:: python - - json = pymupdf4llm.to_json(doc) - - -Extract as TXT -""""""""""""""""" - -.. code-block:: python - - txt = pymupdf4llm.to_text(doc) - -.. note:: - - Please refer top the full :ref:`PyMuPDF4LLM API ` for more. 
- -Finally we can save the output to an external file as follows:: - - from pathlib import Path - suffix = ".md" # or ".json" or ".txt" - Path(doc.name).with_suffix(suffix).write_bytes(md.encode()) - - -Headers & Footers -~~~~~~~~~~~~~~~~~~~~~~~ - - -Many documents will have header and footer information on each page of a PDF which you may or may not want to include. This information can be repetitive and simply not needed ( e.g. the same logo and document title or page number information is not always really important when it comes to extracting the document content ). - -|PyMuPDF Layout| is trained in detecting these typical document elements and able to omit them. - -So in this case we can adjust our API calls to ignore these elements as follows:: - - md = pymupdf4llm.to_markdown(doc, header=False, footer=False) - txt = pymupdf4llm.to_text(doc, header=False, footer=False) - - -.. note:: - - Please note that page ``header`` / ``footer`` exclusion is not applicable to JSON output as it aims to always represent all data for the included pages. Please refer to the full :ref:`PyMuPDF4LLM API ` for more. - -Extending Capability ----------------------------------- - -Using with Pro -~~~~~~~~~~~~~~~~~ - -We are able to extend |PyMuPDF Layout| to work with |PyMuPDF Pro| and thus increase our capability by allowing Office documents to be provided as input files. In this case all we have to do is to add the import for |PyMuPDF Pro| and unlock it:: - - import pymupdf.layout - import pymupdf4llm - import pymupdf.pro - pymupdf.pro.unlock() - -Now we can happily load Office files and convert them as follows:: - - md = pymupdf4llm.to_markdown("sample.docx") - - -.. _pymupdf_layout_ocr_support: - -OCR support -~~~~~~~~~~~~~~~~~ - -The new layout-sensitive |PyMuPDF4LLM| version also evaluates whether a page would benefit from applying OCR to it. If its heuristics come to this conclusion, the built-in Tesseract-OCR module is automatically invoked. 
Its results are then handled like normal page content. - -If a page contains (roughly) no text at all, but is covered with images or many character-sized vectors, a check is made using `OpenCV `_ whether text is *probably* detectable on the page at all. This is done to tell apart image-based text from ordinary pictures (like photographs). - -If the page does contain text but too many characters are unreadable (like "�����"), OCR is also executed, but **for the affected text areas only** -- not the full page. This way, we avoid losing already existing text and other content like images and vectors. - -For these heuristics to work we need both, an existing :ref:`Tesseract installation ` and the availability of `OpenCV `_ in the Python environment. If either is missing, no OCR is attempted at all. - -The decision tree for whether OCR is actually used or not depends on the following: - -1. :ref:`PyMuPDF Layout is imported ` - -2. In the :ref:`PyMuPDF4LLM API ` you have `use_ocr` enabled (this is set to `True` by default) - -3. :ref:`Tesseract is correctly installed ` - -4. `OpenCV `_ is available in your Python environment - - -.. image:: ../images/layout-ocr-flow.png - - -.. _pymupdf_layout_ocr_engines: - -OCR engines -~~~~~~~~~~~~~~~~~~~~~~~ - -Tesseract -"""""""""""""""""""""""""""""""""" - -Tesseract is the default OCR engine used by |PyMuPDF4LLM| when the above criteria are met. It is a widely used open-source OCR engine that supports multiple languages and is known for its accuracy. - - -.. _pymupdf_layout_rapid_ocr: - -RapidOCR -"""""""""""""""""""""""""""""""""" - -If you want to use an OCR engine other than Tesseract, you can do so by providing your own OCR function via the `ocr_function` parameter of the :ref:`PyMuPDF4LLM API `. - -If `RapidOCR `_ and the RapidOCR ONNX Runtime are available, you can use a pre-made callable OCR function for it, which is provided in the `pymupdf4llm.ocr` module as `rapidocr_api.exec_ocr`. 
- - -Example -'''''''''''''''''''''''''''' -:: - - from pymupdf4llm.ocr import rapidocr_api - - md = pymupdf4llm.to_markdown( - doc, - ocr_function=rapidocr_api.exec_ocr, - force_ocr=True - ) - -In this way RapidOCR can be used as an alternative OCR engine to Tesseract for all pages (if `force_ocr=True`) or just for those pages which meet the default criteria for applying OCR (if `force_ocr=False` or omitted). - - -RapidOCR & Tesseract side-by-side -"""""""""""""""""""""""""""""""""" - -If you want to use both OCR engines side-by-side, you can do so by implementing a custom OCR function which calls both OCR engines - one for bbox recognition (RapidOCR) and the other for text recognition (Tesseract) - and then combines their results. - -This pre-made callable OCR function can be found in the `pymupdf4llm.ocr` module as `rapidtess_api.exec_ocr`. - -Example -'''''''''''''''''''''''''''' -:: - - from pymupdf4llm.ocr import rapidtess_api - - md = pymupdf4llm.to_markdown( - doc, - ocr_function=rapidtess_api.exec_ocr, - force_ocr=True - ) - - - ----- - -.. _pymupdf_layout_and_pymupdf4llm_api: - -|PyMuPDF Layout| and |PyMuPDF4LLM| parameter caveats ------------------------------------------------------ - -If you have imported ``pymupdf.layout``, |PyMuPDF4LLM| changes its behavior in various areas quite significantly. New methods become available and also some features are no longer supported. Please visit `this site `_ for a detailed description of the changes. That web site is being kept up to date while we continue to work on improvements. - -.. include:: ../footer.rst diff --git a/docs/pymupdf4llm/api.rst b/docs/pymupdf4llm/api.rst index 11e2015c7..db46ea408 100644 --- a/docs/pymupdf4llm/api.rst +++ b/docs/pymupdf4llm/api.rst @@ -1,58 +1,45 @@ .. include:: ../header.rst -.. raw:: html - - - .. |PyMuPDFLayoutMode_Ignored| raw:: html - Ignored by PyMuPDF Layout + use_layout() must be False .. |PyMuPDFLayoutMode_Valid| raw:: html - Only valid with PyMuPDF Layout + .. 
|PyMuPDFLayoutMode_EmptyList| raw:: html - Empty list with PyMuPDF Layout - + Only if use_layout() is False .. |PyMuPDFLayoutMode_Unavailable| raw:: html - Unavailable in PyMuPDF Layout + Only if use_layout() is False .. _pymupdf4llm-api: -API +The PyMuPDF4LLM API =========================================================================== -The |PyMuPDF4LLM| API --------------------------- - .. property:: version Prints the version of the library. + .. method:: to_markdown(doc: pymupdf.Document | str, *, \ detect_bg_color: bool = True, \ dpi: int = 150, \ - use_ocr: bool = True, \ - ocr_language: str = "eng", \ - ocr_dpi: int = 300, \ - ocr_function: callable = None, \ - force_ocr: bool = False, \ embed_images: bool = False, \ extract_words: bool = False, \ filename: str | None = None, \ fontsize_limit: float = 3, \ footer: bool = True, \ + force_ocr: bool = False, \ force_text: bool = True, \ graphics_limit: int = None, \ hdr_info: Any = None, \ @@ -64,7 +51,10 @@ The |PyMuPDF4LLM| API image_format: str = "png", \ image_path: str = "", \ image_size_limit: float = 0.05, \ - margins: int = 0, \ + margins: float | list = 0, \ + ocr_dpi: int = 300, \ + ocr_function: callable = None, \ + ocr_language: str = "eng", \ page_chunks: bool = False, \ page_height: float = None, \ page_separators: bool = False, \ @@ -73,24 +63,26 @@ The |PyMuPDF4LLM| API show_progress: bool = False, \ table_strategy: str = "lines_strict", \ use_glyphs: bool = False, \ + use_ocr: bool = True, \ write_images: bool = False) -> str | list[dict] Reads the pages of the file and outputs the text of its pages in |Markdown| format. How this should happen in detail can be influenced by a number of parameters. Please note that **support for building page chunks** from the |Markdown| text is supported. - :arg Document,str doc: the file, to be specified either as a file path string, or as a |PyMuPDF| :class:`Document` (created via `pymupdf.open`). 
In order to use `pathlib.Path` specifications, Python file-like objects, documents in memory etc. you **must** use a |PyMuPDF| :class:`Document`. :arg bool detect_bg_color: |PyMuPDFLayoutMode_Ignored| does a simple check for the general background color of the pages (default is ``True``). If any text or vector has this color it will be ignored. May increase detection accuracy. :arg int dpi: specify the desired image resolution in dots per inch. Relevant only if `write_images=True` or `embed_images=True`. Default value is 150. - :arg bool use_ocr: |PyMuPDFLayoutMode_Valid| use :ref:`OCR capability ` to help analyse the page. This will OCR pages as determined by the default criteria. + :arg bool embed_images: like `write_images`, but images will be included in the markdown text as base64-encoded strings. Mutually exclusive with `write_images` and ignores `image_path`. This may drastically increase the size of your markdown text. - :arg str ocr_language: |PyMuPDFLayoutMode_Valid| specify the language to be used by the Tesseract OCR engine. Default is "eng" (English). Make sure that the respective language data files are installed. Remember to use correct Tesseract language codes. Multiple languages can be specified by concatenating the respective codes with a plus sign "+", for example "eng+deu" for English and German. + :arg bool extract_words: |PyMuPDFLayoutMode_Ignored| a value of `True` enforces `page_chunks=True` and adds key "words" to each page dictionary. Its value is a list of words as delivered by PyMuPDF's `Page` method `get_text("words")`. The sequence of the words in this list is the same as the extracted text. - :arg int ocr_dpi: |PyMuPDFLayoutMode_Valid| specify the desired image resolution in dots per inch for applying OCR to the intermediate image of the page. Default value is 300. Only relevant if the page has been determined to profit from OCR (no or few text, most of the page covered by images or character-like vectors, etc.). 
Large values may increase the OCR precision but increase memory requirements and processing time. There also is a risk of over-sharpening the image which may decrease OCR precision. So the default value should probably be sufficiently high. + :arg str filename: Overwrites or sets the desired image file name of written images. Useful when the document is provided as a memory object (which has no inherent file name). - :arg callable ocr_function: |PyMuPDFLayoutMode_Valid| if you want to provide your own :ref:`OCR function `, specify it here. If omitted (`None`), the built-in Tesseract OCR engine will be used. + :arg float fontsize_limit: |PyMuPDFLayoutMode_Ignored| limit the font size to consider for text extraction. If the font size is lower than what is set then the text won't be considered for extraction. Default is `3`, meaning only text with a font size `>= 3` will be considered for extraction. + + :arg bool footer: |PyMuPDFLayoutMode_Valid| boolean to switch on/off page footer content. This parameter controls whether to include or omit footer text from all the document pages. Useful if the document has repetitive footer content which doesn't add any value to the overall extraction data. Default is `True` meaning that footer content will be considered. :arg bool force_ocr: |PyMuPDFLayoutMode_Valid| if `True`, OCR will be applied to all pages regardless of their content. @@ -101,16 +93,6 @@ The |PyMuPDF4LLM| API .. warning:: Requires `ocr_function` to be specified otherwise an exception will be raised. - :arg bool embed_images: like `write_images`, but images will be included in the markdown text as base64-encoded strings. Mutually exclusive with `write_images` and ignores `image_path`. This may drastically increase the size of your markdown text. - - :arg bool extract_words: |PyMuPDFLayoutMode_Ignored| a value of `True` enforces `page_chunks=True` and adds key "words" to each page dictionary. 
Its value is a list of words as delivered by PyMuPDF's `Page` method `get_text("words")`. The sequence of the words in this list is the same as the extracted text. - - :arg str filename: Overwrites or sets the desired image file name of written images. Useful when the document is provided as a memory object (which has no inherent file name). - - :arg float fontsize_limit: |PyMuPDFLayoutMode_Ignored| limit the font size to consider for text extraction. If the font size is lower than what is set then the text won't be considered for extraction. Default is `3`, meaning only text with a font size `>= 3` will be considered for extraction. - - :arg bool footer: |PyMuPDFLayoutMode_Valid| boolean to switch on/off page footer content. This parameter controls whether to include or omit footer text from all the document pages. Useful if the document has repetitive footer content which doesn't add any value to the overall extraction data. Default is `True` meaning that footer content will be considered. - :arg bool force_text: generate text output even when overlapping images / graphics. This text then appears after the respective image. :arg int graphics_limit: |PyMuPDFLayoutMode_Ignored| use this to limit dealing with excess amounts of vector graphics elements. Scientific documents, or pages simulating text via graphics commands may contain tens of thousands of these objects. As vector graphics are analyzed for multiple purposes, runtime may quickly become intolerable. With this parameter, all vector graphics will be ignored if their count exceeds the threshold. @@ -139,6 +121,12 @@ The |PyMuPDF4LLM| API * `(top, bottom)` yields `(0, top, 0, bottom)`. * To always read full pages **(default)**, use `margins=0`. + :arg int ocr_dpi: |PyMuPDFLayoutMode_Valid| specify the desired image resolution in dots per inch for applying OCR to the intermediate image of the page. Default value is 300. 
Only relevant if the page has been determined to profit from OCR (no or few text, most of the page covered by images or character-like vectors, etc.). Large values may increase the OCR precision but increase memory requirements and processing time. There also is a risk of over-sharpening the image which may decrease OCR precision. So the default value should probably be sufficiently high. + + :arg callable ocr_function: |PyMuPDFLayoutMode_Valid| if you want to provide your own :ref:`OCR function `, specify it here. If omitted (`None`), the built-in Tesseract OCR engine will be used. + + :arg str ocr_language: |PyMuPDFLayoutMode_Valid| specify the language to be used by the Tesseract OCR engine. Default is "eng" (English). Make sure that the respective language data files are installed. Remember to use correct Tesseract language codes. Multiple languages can be specified by concatenating the respective codes with a plus sign "+", for example "eng+deu" for English and German. + :arg bool page_chunks: if `True` the output will be a list of `Document.page_count` dictionaries (one per page). Each dictionary has the following structure: - **"metadata"** - a dictionary consisting of the document's metadata :attr:`Document.metadata`, enriched with additional keys **"file_path"** (the file name), **"page_count"** (number of pages in document), and **"page_number"** (1-based page number). @@ -180,6 +168,8 @@ The |PyMuPDF4LLM| API :arg bool use_glyphs: |PyMuPDFLayoutMode_Ignored| (New in v.0.0.19) Default is `False`. A value of `True` will use the glyph number of the characters instead of the character itself if the font does not store the Unicode value. + :arg bool use_ocr: |PyMuPDFLayoutMode_Valid| use :ref:`OCR capability ` to help analyse the page. This will OCR pages as determined by the default criteria. + :arg bool write_images: when encountering images or vector graphics, images will be created from the respective page area and stored in the specified folder. 
|Markdown| references will be generated pointing to these images. Any text contained in these areas will not be included in the text output (but appear as part of the images). Therefore, if for instance your document has text written on full page images, make sure to set this parameter to `False`. If using :ref:`PyMuPDF Layout `, boundary boxes that are classified as "picture" by the layout module will be treated as images - independent from the mixture of text, images or vector graphics they may be covering. If `force_text=True` is used, text will still be extracted from these areas and included in the output after the respective image reference. @@ -189,10 +179,7 @@ The |PyMuPDF4LLM| API .. method:: to_text(doc: pymupdf.Document | str, *, **kwargs) -> str - - Reads the pages of the file and outputs the text of its pages in |TXT| format. - - .. important:: |PyMuPDFLayoutMode_Valid|. This method is only available with PyMuPDF Layout. + Reads the pages of the file and outputs the text of its pages in plain text (|TXT|) format. :arg Document,str doc: the file, to be specified either as a file path string, or as a |PyMuPDF| :class:`Document` (created via `pymupdf.open`). In order to use `pathlib.Path` specifications, Python file-like objects, documents in memory etc. you **must** use a |PyMuPDF| :class:`Document`. @@ -240,8 +227,6 @@ The |PyMuPDF4LLM| API .. method:: to_json(doc: pymupdf.Document | str, *, **kwargs) -> str Parses the document and the specified pages and converts the result into a |JSON|-formatted string. - - .. important:: |PyMuPDFLayoutMode_Valid|. This method is only available with PyMuPDF Layout. :arg Document,str doc: the file, to be specified either as a file path string, or as a |PyMuPDF| :class:`Document` (created via `pymupdf.open`). In order to use `pathlib.Path` specifications, Python file-like objects, documents in memory etc. you **must** use a |PyMuPDF| :class:`Document`. 
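The `doc` rule above (a file path string is accepted directly, while `pathlib.Path` objects, file-like objects and in-memory data must first become a |PyMuPDF| :class:`Document`) can be sketched as a small wrapper. This is only a minimal sketch under stated assumptions, not part of the |PyMuPDF4LLM| API: the helper names `normalize_source` and `to_json_any` are hypothetical, file-like objects are assumed to be opened in binary mode, and `pymupdf` and `pymupdf4llm` are assumed to be installed.

```python
import io
from pathlib import Path


def normalize_source(source):
    """Hypothetical helper: classify an input as ("path", str) or ("stream", bytes)."""
    if isinstance(source, (str, Path)):
        return "path", str(source)
    if isinstance(source, (bytes, bytearray)):
        return "stream", bytes(source)
    if hasattr(source, "read"):
        # File-like object; assumed to be opened in binary mode.
        return "stream", source.read()
    raise TypeError(f"unsupported source type: {type(source).__name__}")


def to_json_any(source, filetype="pdf"):
    """Hypothetical wrapper: open `source` as a Document, then call to_json()."""
    # Imported lazily so normalize_source() stays usable without the libraries.
    import pymupdf
    import pymupdf4llm

    kind, value = normalize_source(source)
    if kind == "path":
        doc = pymupdf.open(value)
    else:
        # In-memory data carries no file name, so the type is stated explicitly.
        doc = pymupdf.open(stream=value, filetype=filetype)
    try:
        return pymupdf4llm.to_json(doc)
    finally:
        doc.close()
```

With this in place, `to_json_any(Path("input.pdf"))` and `to_json_any(uploaded_bytes)` would both go through a :class:`Document`; for in-memory formats other than PDF, pass a matching `filetype`.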
@@ -268,13 +253,23 @@ The |PyMuPDF4LLM| API :arg list pages: optional, the pages to consider for output (caution: specify 0-based page numbers). If omitted (`None`) all pages are processed. Specify any valid Python sequence containing integers between `0` and `page_count - 1`. +.. _pymupdf4llm-api-layout: + + +.. method:: use_layout(yes: bool = True) + + Switch on/off the use of the :ref:`PyMuPDF Layout module `. + + If `yes=True` (default), the layout module will be used for page analysis for optimal results. If `yes=False`, the layout module will not be used. + + .. method:: get_key_values(doc: pymupdf.Document | str) -> list[dict] Parse the document if it is a **Form PDF** and extract key-value pairs from all form fields (widgets). Please note that this method is only relevant for PDF documents that contain widgets. Otherwise, an empty list will be returned. - The function is always available -- independently of whether you are using |PyMuPDF Layout | or not. + The function is always available -- independently of whether you are using the PyMuPDF Layout module or not. Each dictionary item has the following structure:: @@ -489,7 +484,7 @@ This is a version of previous **example 2** that uses :class:`TocHeaders` for he ----- -For a list of changes, please see file `CHANGES.md `_. +For a list of changes, please see file `CHANGES.md `_. .. 
rubric:: Footnotes @@ -505,4 +500,35 @@ For a list of changes, please see file `CHANGES.md + document.getElementById("headerSearchWidget").action = '../search.html'; + const params = document.querySelectorAll('.sig-param') + + params.forEach((param, index) => { + const next = param.nextSibling; + if (next && next.nodeType === 3) { // 3 = text node + const span = document.createElement('span'); + if (index === params.length - 1) { + span.className = 'sig-comma-last'; + } else { + param.classList.add('has-comma'); + span.className = 'sig-comma'; + } + span.textContent = next.textContent; + next.replaceWith(span); + } + }); + + diff --git a/docs/pymupdf4llm/index.rst b/docs/pymupdf4llm/index.rst index f27919021..62ae2a007 100644 --- a/docs/pymupdf4llm/index.rst +++ b/docs/pymupdf4llm/index.rst @@ -14,37 +14,30 @@ PyMuPDF4LLM =========================================================================== -|PyMuPDF4LLM| is aimed to make it easier to extract |PDF| content in the format you need for **LLM** & **RAG** environments. It supports :ref:`Markdown extraction ` as well as :ref:`LlamaIndex document output `. +|PyMuPDF4LLM| is a lightweight extension for |PyMuPDF| that turns PDFs into clean, structured data with minimal setup. It includes layout analysis *without* any GPU requirement. + + +|PyMuPDF4LLM| is aimed to make it easier to extract document content in the format you need for **LLM** & **RAG** environments. It supports :ref:`Markdown `, :ref:`JSON ` and :ref:`TXT ` extraction, as well as :ref:`LlamaIndex ` and :ref:`LangChain ` integration. -When using |PyMuPDF4LLM| with PyMuPDF Layout, page layout detection will be greatly improved. This is true for table detection, but also for the detection of page headers and footers, footnotes, list items and text paragraphs. In addition two new methods become available, `to_json()` and `to_text()`. .. 
important:: - You can extend the supported file types to also include **Office** document formats (DOC/DOCX, XLS/XLSX, PPT/PPTX, HWP/HWPX) by :ref:`using PyMuPDF Pro with PyMuPDF4LLM `. + You can also extend the supported file types to include **Office** document formats (DOC/DOCX, XLS/XLSX, PPT/PPTX, HWP/HWPX) by :ref:`using PyMuPDF Pro with PyMuPDF4LLM `. Features ------------------------------- - - Support for multi-column pages - - Support for image and vector graphics extraction (and inclusion of references in the MD text) + - Support for Markdown, JSON and plain text output formats. + - Support for multi-column pages. + - Support for image and vector graphics extraction. + - Layout analysis for better semantic understanding of document structure. - Support for page chunking output. - - Direct support for output as :ref:`LlamaIndex Documents `. - - When used with :ref:`PyMuPDF Layout ` : Support for plain text output similar to Markdown - - When used with :ref:`PyMuPDF Layout ` : Support for JSON output - - -Functionality --------------------- - -- This package converts the pages of a file to plain text or in **Markdown** format using |PyMuPDF|. - -- Standard text and tables are detected, brought in the right reading sequence and then together converted to **GitHub**-compatible **Markdown** text. Tables in plain text output mode are rendered using the `tabulate `_ package. + - Integration with :ref:`LlamaIndex ` & :ref:`LangChain `. -- Header lines are identified via the font size and appropriately prefixed with one or more `#` tags. When using the package together with :ref:`PyMuPDF Layout `, titles, section headers and page headers and footers are detected. - -- Bold, italic, mono-spaced text and code blocks are detected and formatted accordingly. Similar applies to ordered and unordered lists. +API +------- -- By default, all document pages are processed. If desired, a subset of pages can be specified by providing a list of `0`-based page numbers. 
+See: :doc:`api`. Installation @@ -59,62 +52,115 @@ Install the package via **pip** with: pip install pymupdf4llm +Extracting +------------------------------- + + .. _extracting_as_md: -Extracting a file as **Markdown** --------------------------------------------------------------- +As **Markdown** +~~~~~~~~~~~~~~~~~~~~~~~ + +To retrieve your document content in **Markdown** use the :meth:`to_markdown` method as follows: + +.. code-block:: python + + import pymupdf4llm + md = pymupdf4llm.to_markdown("input.pdf") + + + +.. _extracting_as_json: + +As **JSON** +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +To retrieve your document content in **JSON** use the :meth:`to_json` method as follows: + +.. code-block:: python + + import pymupdf4llm + json = pymupdf4llm.to_json("input.pdf") -To retrieve your document content in **Markdown** simply install the package and then use a couple of lines of **Python** code to get results. +The JSON export will give you bounding box information and layout data for each element on the page. This can be used to create your own custom output formats or to simply have more detailed information about the document structure for RAG workflows & LLM integrations. +.. _extracting_as_txt: -Then in your **Python** script do: +As **TXT** +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +To retrieve your document content in **TXT** use the :meth:`to_text` method as follows: .. code-block:: python import pymupdf4llm - md_text = pymupdf4llm.to_markdown("input.pdf") + txt = pymupdf4llm.to_text("input.pdf") + +---- + .. note:: + Instead of using filename strings as above, one can also provide a :ref:`PyMuPDF Document `. - Instead of the filename string as above, one can also provide a :ref:`PyMuPDF Document `. A second parameter may be a list of `0`-based page numbers, e.g. `[0, 1]` would just select the first and second pages of the document. 
+ Finally we can save the output to an external file as follows:: + from pathlib import Path + suffix = ".md" # or ".json" or ".txt" + Path(doc.name).with_suffix(suffix).write_bytes(md.encode()) -If you want to store your **Markdown** file, e.g. store as a UTF8-encoded file, then do: +Headers & Footers +~~~~~~~~~~~~~~~~~~~~~~~ -.. code-block:: python +Many documents carry header and footer information on each page, which you may or may not want to include. This information can be repetitive and is often not needed (e.g. the same logo, document title or page number on every page rarely matters when extracting the document content). - import pathlib - pathlib.Path("output.md").write_bytes(md_text.encode()) +|PyMuPDF4LLM| is trained to detect these typical document elements and can omit them. +In this case we can adjust our API calls to ignore these elements as follows:: + md = pymupdf4llm.to_markdown(doc, header=False, footer=False) -.. _extracting_as_llamaindex: -Extracting a file as a **LlamaIndex** document --------------------------------------------------------------- +.. note:: + + Page ``header`` / ``footer`` exclusion does not apply to JSON output, which always represents all data for the included pages. Please refer to :doc:`api` for more. + + +Integrations +------------------------------- + +.. _integration_with_llamaindex: -|PyMuPDF4LLM| supports direct conversion to a **LLamaIndex** document. A document is first converted into **Markdown** format and then a **LlamaIndex** document is returned as follows: +With **LlamaIndex** +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +|PyMuPDF4LLM| supports direct conversion to a **LlamaIndex** document. A document is first converted into **Markdown** format and then a **LlamaIndex** document is returned as follows: .. 
code-block:: python import pymupdf4llm llama_reader = pymupdf4llm.LlamaMarkdownReader() - llama_docs = llama_reader.load_data("input.pdf") + llama_docs = llama_reader.load_data("input.pdf") -.. _using_pymupdf4llm_withpymupdfpro: +.. _integration_with_langchain: + +With **LangChain** +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +|PyMuPDF4LLM| also supports **LangChain** integration; see the `PyMuPDF4LLM Document Loader`_ for more details. + + +.. _using_pymupdf4llm_with_pymupdfpro: Using with |PyMuPDF Pro| --------------------------- -For **Office** document support, |PyMuPDF4LLM| works seamlessly with |PyMuPDF Pro|. Assuming you have :doc:`../pymupdf-pro` installed you will be able to work with **Office** documents as expected: +For **Office** document support, |PyMuPDF4LLM| works seamlessly with |PyMuPDF Pro|. Assuming you have :doc:`../pymupdf-pro/index` installed you will be able to work with **Office** documents as expected: .. code-block:: python @@ -122,17 +168,16 @@ For **Office** document support, |PyMuPDF4LLM| works seamlessly with |PyMuPDF Pr import pymupdf4llm import pymupdf.pro pymupdf.pro.unlock() - md_text = pymupdf4llm.to_markdown("sample.doc") + md = pymupdf4llm.to_markdown("sample.doc") -As you can see |PyMuPDF Pro| functionality will be available within the |PyMuPDF4LLM| context! +.. _pymupdf4llm_and_layout: +PyMuPDF4LLM & PyMuPDF Layout +----------------------------------- +By default |PyMuPDF4LLM| includes a `layout analysis module`_ to enhance output results. To disable this module, call the :meth:`use_layout` method. -API -------- - -See :ref:`the PyMuPDF4LLM API `. Further Resources ------------------- @@ -142,7 +187,7 @@ Sample code ~~~~~~~~~~~~~~~ - `Command line RAG Chatbot with PyMuPDF `_ -- `Example of a Browser Application using Langchain and PyMuPDF `_ +- `Example of a Browser Application using LangChain and PyMuPDF `_ Blogs @@ -154,3 +199,10 @@ Blogs - `RAG/LLM and PDF: Conversion to Markdown Text with PyMuPDF `_ .. 
include:: ../footer.rst + + +PyMuPDF4LLM Document Loader + +.. _PyMuPDF4LLM Document Loader: https://docs.langchain.com/oss/python/integrations/providers/pymupdf4llm/ + +.. _layout analysis module: https://pypi.org/project/pymupdf-layout/
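
The save-to-file step that the new "Extracting" text describes (derive the output name from the source file, swap the suffix, write UTF-8 bytes) can be sketched with the standard library alone. This is a minimal illustration, not part of the package: ``save_output`` is a hypothetical helper, ``input.pdf`` is a placeholder name, and ``md`` is a stand-in string for whatever ``pymupdf4llm.to_markdown(...)`` would return.

```python
from pathlib import Path
import tempfile


def save_output(source_name: str, text: str, suffix: str = ".md", out_dir: str = ".") -> Path:
    """Write extracted text to a file named after the source, with a new suffix.

    Hypothetical helper for illustration; it is not part of pymupdf4llm.
    """
    # Keep only the file name of the source and replace its extension.
    target = Path(out_dir) / Path(source_name).with_suffix(suffix).name
    # str.encode() defaults to UTF-8, matching the snippet in the docs.
    target.write_bytes(text.encode())
    return target


# Placeholder standing in for the string returned by to_markdown().
md = "# Title\n\nBody text.\n"

with tempfile.TemporaryDirectory() as tmp:
    saved = save_output("input.pdf", md, ".md", tmp)
    print(saved.name)  # input.md
```

Passing the source name explicitly avoids relying on a ``doc`` object, so the same helper works whether the extraction call was given a filename string or a PyMuPDF Document.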