37 changes: 12 additions & 25 deletions docs/about.rst
@@ -73,8 +73,8 @@ PyMuPDF Product Suite
**Additional products** in the |PyMuPDF| product suite are:

- |PyMuPDF Pro| adds support for Office document formats.
- |PyMuPDF4LLM| is optimized for large language model (LLM) applications, providing enhanced text extraction and processing capabilities.
- |PyMuPDF Layout| focuses on layout analysis and semantic understanding, ideal for document conversion and formatting tasks with enhanced results.
- |PyMuPDF4LLM| is optimized for large language model (LLM) applications, providing enhanced text extraction and processing capabilities.
It focuses on layout analysis and semantic understanding, ideal for document conversion and formatting tasks with enhanced results.

.. note::
All of the products above depend on the same core product - |PyMuPDF| and therefore have full access to all of its features.
@@ -89,64 +89,51 @@ PyMuPDF Products Comparison
The following table illustrates what features the products offer:

.. list-table:: PyMuPDF Products Comparison
:widths: 8 23 23 23 23
:widths: 10 30 30 30
:header-rows: 1

* -
- PyMuPDF
- PyMuPDF Pro
- PyMuPDF4LLM
- PyMuPDF Layout
* - **Input Documents**
- `PDF`, `XPS`, `EPUB`, `CBZ`, `MOBI`, `FB2`, `SVG`, `TXT`, Images (*standard document types*)
- *as PyMuPDF* and:
`DOC`/`DOCX`, `XLS`/`XLSX`, `PPT`/`PPTX`, `HWP`/`HWPX`
- *as PyMuPDF*
- *as PyMuPDF*
* - **Output Documents**
- Can convert any input document to `PDF`, `SVG` or Image
- *as PyMuPDF*
- *as PyMuPDF* and:
Markdown (`MD`)
- *as PyMuPDF4LLM* and:
`JSON` or `TXT`
- *as PyMuPDF* and:
Markdown (`MD`), `JSON` or `TXT`
* - **Page Analysis**
- Basic page analysis to return document structure
- *as PyMuPDF*
- *as PyMuPDF*
- Advanced Page Analysis with trained data for enhanced results
* - **Data extraction**
- Basic data extraction with structured layout information and bounding box data
- *as PyMuPDF*
- Advanced data extraction with structure tags such as headings, lists, tables
- Advanced layout analysis and semantic understanding
- Advanced data extraction including layout analysis with semantic understanding and enhanced bounding box data
* - **Table extraction**
- Basic table extraction as part of text extraction
- *as PyMuPDF*
- Advanced table extraction with cell structure and data types
- Superior table detection
- Advanced table extraction with cell structure, including support for merged cells and complex layouts
* - **Image extraction**
- Basic image extraction
- *as PyMuPDF*
- Advanced detection and rendering of image areas on page saving them to disk or embedding in MD output
- Superior detection of "picture" areas
* - **Vector extraction**
- Vector extraction and clustering
- *as PyMuPDF*
- *as PyMuPDF*
- Superior detection of "picture" areas
- Superior detection of "picture" areas
* - **Popular RAG Integrations**
- Langchane, LlamaIndex
- Langchain, LlamaIndex
- *as PyMuPDF*
- *as PyMuPDF* and with some addiotnal help methods for RAG workflows
- *as PyMuPDF4LLM*
- *as PyMuPDF* and with some additional help methods for RAG workflows
* - **OCR**
- On-demand invocation of built-in Tesseract for text detection on pages or images.
- On-demand invocation of built-in Tesseract for text detection on pages or images
- *as PyMuPDF*
- *as PyMuPDF*
- Automatic OCR based on page content analysis.


- Automatic OCR based on page content analysis. OCR adapters for popular OCR engines available

----

2 changes: 1 addition & 1 deletion docs/index.rst
@@ -41,7 +41,6 @@ This documentation covers all versions up to |version|.
:maxdepth: 1

about.rst
pymupdf-layout/index.rst
pymupdf4llm/index.rst
pymupdf-pro/index.rst

@@ -73,6 +72,7 @@ This documentation covers all versions up to |version|.

module.rst
classes.rst
pymupdf4llm/api.rst
algebra.rst
lowlevel.rst
glossary.rst
6 changes: 3 additions & 3 deletions docs/page.rst
@@ -332,15 +332,15 @@ In a nutshell, this is what you can do with PyMuPDF:
|history_end|


.. method:: apply_redactions(images=PDF_REDACT_IMAGE_PIXELS|2, graphics=PDF_REDACT_LINE_ART_REMOVE_IF_TOUCHED|2, text=PDF_REDACT_TEXT_REMOVE|0)
.. method:: apply_redactions(images=PDF_REDACT_IMAGE_PIXELS|2, graphics=PDF_REDACT_LINE_ART_REMOVE_IF_COVERED|1, text=PDF_REDACT_TEXT_REMOVE|0)

**PDF only**: Remove all **content** contained in any redaction rectangle on the page.

**This method applies and then deletes all redactions from the page.**

:arg int images: How to redact overlapping images. The default (2) blanks out overlapping pixels. `PDF_REDACT_IMAGE_NONE | 0` ignores, and `PDF_REDACT_IMAGE_REMOVE | 1` completely removes images overlapping any redaction annotation. Option `PDF_REDACT_IMAGE_REMOVE_UNLESS_INVISIBLE | 3` only removes images that are actually visible.
:arg int images: How to redact overlapping images. The default `PDF_REDACT_IMAGE_PIXELS | 2` blanks out overlapping pixels. `PDF_REDACT_IMAGE_NONE | 0` ignores, and `PDF_REDACT_IMAGE_REMOVE | 1` completely removes images overlapping any redaction annotation. Option `PDF_REDACT_IMAGE_REMOVE_UNLESS_INVISIBLE | 3` only removes images that are actually visible.

:arg int graphics: How to redact overlapping vector graphics (also called "line-art" or "drawings"). The default (2) removes any overlapping vector graphics. `PDF_REDACT_LINE_ART_NONE | 0` ignores, and `PDF_REDACT_LINE_ART_REMOVE_IF_COVERED | 1` removes graphics fully contained in a redaction annotation. When removing line-art, please be aware that **stroked** vector graphics (i.e. type "s" or "sf") have a **larger wrapping rectangle** than one might expect: first of all, at least 50% of the path's line width have to be added in each direction to truly include all of the drawing. If a so-called "miter limit" is provided (see page 121 of the PDF specification), the enlarging value is `miter * width / 2`. So, when letting everything default (width = 1, miter = 10), the redaction rectangle should be at least 5 points larger in every direction.
:arg int graphics: How to redact overlapping vector graphics (also called "line-art" or "drawings"). The default `PDF_REDACT_LINE_ART_REMOVE_IF_COVERED | 1` removes graphics fully contained in a redaction annotation. `PDF_REDACT_LINE_ART_NONE | 0` ignores, and `PDF_REDACT_LINE_ART_REMOVE_IF_TOUCHED | 2` removes any overlapping vector graphics. When removing line-art, please be aware that **stroked** vector graphics (i.e. type "s" or "sf") have a **larger wrapping rectangle** than one might expect: first of all, at least 50% of the path's line width has to be added in each direction to truly include all of the drawing. If a so-called "miter limit" is provided (see page 121 of the PDF specification), the enlarging value is `miter * width / 2`. So, when letting everything default (width = 1, miter = 10), the redaction rectangle should be at least 5 points larger in every direction.

:arg int text: Whether to redact overlapping text. The default `PDF_REDACT_TEXT_REMOVE | 0` removes all characters whose bounding box overlaps any redaction rectangle. This complies with the original legal / data protection intentions of redaction annotations. Other use cases, however, may require **keeping text** while redacting vector graphics or images. This can be achieved by setting `text=True|PDF_REDACT_TEXT_NONE | 1`. This does **not comply** with the data protection intentions of redaction annotations. **Do so at your own risk.**
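
The sizing rule for stroked line-art given under `graphics` can be checked with a few lines of plain Python. This is only a geometric sketch and does not call PyMuPDF: `Rect`, `stroke_margin`, `enlarge`, `covers` and `touches` are hypothetical helpers introduced for illustration. It also assumes that "covered" means the drawing's wrapping rectangle lies fully inside the redaction rectangle and "touched" means the two overlap at all, reading the constant names literally. How MuPDF decides this internally is not shown here.

```python
from typing import NamedTuple, Optional


class Rect(NamedTuple):
    """Axis-aligned rectangle in PDF points (illustrative helper, not a PyMuPDF class)."""
    x0: float
    y0: float
    x1: float
    y1: float


def stroke_margin(width: float = 1.0, miter: Optional[float] = 10.0) -> float:
    """Return how far a stroked path may extend beyond its outline.

    Without a miter limit this is half the line width; with one it is
    miter * width / 2, following the rule quoted in the text.
    """
    if miter is None:
        return width / 2
    return miter * width / 2


def enlarge(rect: Rect, margin: float) -> Rect:
    """Grow a rectangle by `margin` points in every direction."""
    return Rect(rect.x0 - margin, rect.y0 - margin,
                rect.x1 + margin, rect.y1 + margin)


def covers(outer: Rect, inner: Rect) -> bool:
    """True if `inner` lies entirely within `outer` (assumed meaning of "covered")."""
    return (outer.x0 <= inner.x0 and outer.y0 <= inner.y0
            and outer.x1 >= inner.x1 and outer.y1 >= inner.y1)


def touches(a: Rect, b: Rect) -> bool:
    """True if the rectangles share any interior area (assumed meaning of "touched")."""
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1


outline = Rect(100, 100, 200, 150)    # geometric outline of a stroked path
margin = stroke_margin()              # defaults: width = 1, miter = 10
wrapper = enlarge(outline, margin)    # wrapping rectangle of the stroked path

tight_redaction = outline             # redaction drawn exactly on the outline
padded_redaction = enlarge(outline, margin)

print(margin)
print(touches(tight_redaction, wrapper), covers(tight_redaction, wrapper))
print(covers(padded_redaction, wrapper))
```

With the default width and miter limit the margin comes out as 5 points. Under the assumptions above, a redaction rectangle drawn exactly on the path outline touches the wrapping rectangle but does not cover it, while one padded by the margin covers it.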

229 changes: 0 additions & 229 deletions docs/pymupdf-layout/index.rst

This file was deleted.
