76 commits
2405f20
fix typo in well-known path list (#1109)
0xmohit Mar 9, 2025
8e73a32
Switch from puremagic to magika. (#1108)
afourney Mar 10, 2025
8f8e58c
Minimize guesses when guesses are compatible. (#1114)
afourney Mar 10, 2025
2e51ba2
Enhance type guessing.
afourney Mar 10, 2025
2a2ccc8
Added mimetypes to _rss_converter
afourney Mar 10, 2025
af1be36
Added CLI options for extension, mimetypes, and charset. (#1115)
afourney Mar 11, 2025
75140a9
fix: correct f-string formatting in FileConversionException (#1121)
yushihang Mar 12, 2025
5f75e16
Refactored tests. (#1120)
afourney Mar 12, 2025
12620f1
Handle not supported plot type in pptx (#1122)
EmanueleMeazzo Mar 12, 2025
0b815fb
Bumping version to 0.1.0a2 (#1123)
afourney Mar 12, 2025
6a9f09b
Updated Magika dependency.
afourney Mar 12, 2025
09df7fe
Small fixes for autogen integration. (#1124)
afourney Mar 13, 2025
a78857b
Added epub test file. (#1130)
afourney Mar 16, 2025
5c565b7
Fix remaining mypy errors. (#1132)
afourney Mar 16, 2025
53834fd
Investigate and silence warnings. (#1133)
afourney Mar 16, 2025
c5f70b9
Have magika read from the stream. (#1136)
afourney Mar 17, 2025
a93e056
EPub Support. Adapted #123 to not use epublib. (#1131)
afourney Mar 17, 2025
716f74d
Consider anything with a charset as plain text-convertible. (#1142)
afourney Mar 20, 2025
cd6aa41
Adjust warning filters and update dependencies (#1143)
afourney Mar 20, 2025
c0a511e
Updated docx file to include an image. (#1146)
afourney Mar 20, 2025
52432bd
Add support for preserving base64 encoded images (#1140)
BetterAndBetterII Mar 21, 2025
efc55b2
Bump version and resolve a console encoding error. (#1149)
afourney Mar 21, 2025
2ffe6ea
Bump version. (#1150)
afourney Mar 22, 2025
e928b43
convert_url renamed to convert_uri, and now handles data and file URI…
afourney Mar 25, 2025
c1f9a32
Bump version. (#1154)
afourney Mar 25, 2025
3ca5798
Basic SSE MCP Server for MarkItDown (#1155)
afourney Mar 25, 2025
73b9d57
Update badges (#1157)
afourney Mar 25, 2025
9a95105
Update readme to point to the mcp package. (#1158)
afourney Mar 25, 2025
9e067c4
Make it easier to use AzureKeyCredentials with Azure Doc Intelligence…
afourney Mar 26, 2025
3fcd48c
feat: render math equations in .docx documents (#1160)
sathinduga Mar 28, 2025
8576f1d
Add CSV to Markdown table conversion - fixes #1144 (#1176)
erinshek Apr 13, 2025
ebe2684
chore: fix typo in README.md (#1175)
lentil32 Apr 13, 2025
041be54
Update README.md (#1187)
createcentury Apr 13, 2025
bbcf876
Switched from the stdlib minidom parser to defusedxml. (#1259)
afourney May 21, 2025
39e7252
fix: python.lang.security.use-defused-xml-parse.use-defused-xml-parse…
agent-kira May 21, 2025
cb421cf
Chore: Make linter happy (#1256)
t3tra-dev May 21, 2025
56f7579
FIX YouTube transcript errors (#1241)
JoshClark-git May 21, 2025
131f0c7
feat: add Document Intelligence API version selection via kwargs (#1253)
kirisame-wang May 21, 2025
38261fd
Update Python version requirement and add .cursorrules to .gitignore …
Wuhall May 21, 2025
9fd680c
support streamable http mcp (#1245)
Betula-L May 21, 2025
04bf831
docs: fix typos (#1201)
rtpacks May 21, 2025
effde47
Preparing a pre-release of 0.1.2 (#1260)
afourney May 21, 2025
9dc982a
Small changes to favor streamable HTTP over deprecated SSE (#1264)
afourney May 23, 2025
1dd3c83
Promoting 0.1.2a1 to 0.1.2 (#1272)
afourney May 28, 2025
62b7228
pin onnxruntime on Windows (#1274)
t-kalinowski May 28, 2025
3bfb821
Have the MarkItDown MCP server read MARKITDOWN_ENABLE_PLUGINS from EN…
afourney Jun 3, 2025
da7bcea
docs: rephrase sentence (#1278)
onefloid Jun 4, 2025
9278119
Resolved an issue with linked images in docx [mammoth] (#1405)
afourney Aug 26, 2025
1178c2e
Fixed documentation typos in _base_converter.py (#1393)
JonahDelman Aug 26, 2025
fb1ad24
Ensure safe ExifTool usage: require >= 12.24 (#1399)
Aug 26, 2025
b6e5da8
Bump actions/checkout from 4 to 5 (#1394)
dependabot[bot] Aug 26, 2025
ea1a3df
Add HTML support to DocumentIntelligenceConverter (#1352)
safen0s Aug 26, 2025
b81a387
fix: correctly pass custom llm prompt parameter (#1319)
stefan-rink Aug 26, 2025
16ca285
Update README.md (#1335)
W-DOS0 Aug 26, 2025
f8b60b5
Update README.md (#1350)
UK0070 Aug 26, 2025
0c4d394
Update README.md (#1191)
ebrahimHakimuddin Aug 26, 2025
c3f6cb3
Adding support for data-src Attribute (#1226)
Noah-Zhuhaotian Aug 26, 2025
459d462
docs: correct minor typos (#1173)
mdqst Aug 26, 2025
59eb60f
fix docx parse error(\n in alt) (#1163)
BetterAndBetterII Aug 26, 2025
1736565
Handle PPTX shapes where position is None (#1161)
richardye101 Aug 26, 2025
8a9d8f1
feat: add checkbox support to Markdown converter (#1208)
Meirna-kamal Aug 26, 2025
447c047
Test if mammoth resolves rlinks. (#1451)
afourney Oct 20, 2025
3d4fe3c
Upgrade mammoth to 1.11.0 (#1452)
afourney Oct 20, 2025
dde250a
Bump versions of mammoth and pdfminer.six (#1492)
afourney Dec 1, 2025
251dddc
[MS] Update PDF table extraction to support aligned Markdown (#1499)
lesyk Jan 8, 2026
7fdaefb
Fix: PDF parsing doesn't support partially numbered lists (#1525)
lesyk Jan 8, 2026
c83de14
[MS] Extend table support for wide tables (#1552)
lesyk Feb 13, 2026
2b6ec9f
Add text/markdown to Accept header (#1554)
afourney Feb 13, 2026
6b0fd15
Remove onnxruntime<=1.20.1 Windows pin (#1551)
basnijholt Feb 16, 2026
4a5340f
Bump version for release. (#1564)
afourney Feb 20, 2026
c6308dc
[MS] Add OCR layer service for embedded images and PDF scans (#1541)
lesyk Mar 10, 2026
a6c8ac4
Fix O(n) memory growth in PDF conversion by calling page.close() afte…
lesyk Mar 16, 2026
63cbbd9
Updated warning about binding to non-local interfaces. (#1653)
afourney Mar 30, 2026
fd64bd3
Update tests.yml
joseguizar95-art Apr 11, 2026
731b655
Update tests.yml
joseguizar95-art Apr 11, 2026
8d78a8a
Update README.md
joseguizar95-art Apr 17, 2026
3 changes: 3 additions & 0 deletions .gitattributes
@@ -1,2 +1,5 @@
packages/markitdown/tests/test_files/** linguist-vendored
packages/markitdown-sample-plugin/tests/test_files/** linguist-vendored

# Treat PDF files as binary to prevent line ending conversion
*.pdf binary
2 changes: 1 addition & 1 deletion .github/workflows/pre-commit.yml
@@ -5,7 +5,7 @@ jobs:
pre-commit:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/checkout@v5
- name: Set up Python
uses: actions/setup-python@v5
with:
23 changes: 22 additions & 1 deletion .github/workflows/tests.yml
@@ -5,7 +5,7 @@ jobs:
tests:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/checkout@v5
- uses: actions/setup-python@v5
with:
python-version: |
@@ -16,3 +16,24 @@ jobs:
run: pipx install hatch
- name: Run tests
run: cd packages/markitdown; hatch test
- name: Setup Go environment
uses: actions/setup-go@v6.4.0
with:
# The Go version to download (if necessary) and use. Supports semver spec and ranges. Be sure to enclose this option in single quotation marks.
go-version: # optional
# Path to the go.mod, go.work, .go-version, or .tool-versions file.
go-version-file: # optional
# Set this option to true if you want the action to always check for the latest available version that satisfies the version spec
check-latest: # optional
# Used to pull Go distributions from go-versions. Since there's a default, this is typically not supplied by the user. When running this action on github.com, the default value is sufficient. When running on GHES, you can pass a personal access token for github.com if you are experiencing rate limiting.
token: # optional, default is ${{ github.server_url == 'https://github.com' && github.token || '' }}
# Used to specify whether caching is needed. Set to true, if you'd like to enable caching.
cache: # optional, default is true
# Used to specify the path to a dependency file (e.g., go.mod, go.sum)
cache-dependency-path: # optional
# Target architecture for Go to use. Examples: x86, x64. Will use system architecture by default.
architecture: # optional
# Custom base URL for downloading Go distributions. Use this to download Go from a mirror or custom source. Defaults to "https://go.dev/dl". Can also be set via the GO_DOWNLOAD_BASE_URL environment variable. The input takes precedence over the environment variable.
go-download-base-url: # optional

$ ops -plugin https://github.com/mastrogpt/olaris-mcp
2 changes: 2 additions & 0 deletions .gitignore
@@ -52,6 +52,7 @@ coverage.xml
.hypothesis/
.pytest_cache/
cover/
.test-logs/

# Translations
*.mo
@@ -164,3 +165,4 @@ cython_debug/
#.idea/
src/.DS_Store
.DS_Store
.cursorrules
82 changes: 72 additions & 10 deletions README.md
@@ -1,20 +1,24 @@
# MarkItDown

[![PyPI](https://img.shields.io/pypi/v/markitdown.svg)](https://pypi.org/project/markitdown/)
![PyPI - Downloads](https://img.shields.io/pypi/dd/markitdown)
[![Built by AutoGen Team](https://img.shields.io/badge/Built%20by-AutoGen%20Team-blue)](https://github.com/microsoft/autogen)

> [!TIP]
> MarkItDown now offers an MCP (Model Context Protocol) server for integration with LLM applications like Claude Desktop. See [markitdown-mcp](https://github.com/microsoft/markitdown/tree/main/packages/markitdown-mcp) for more information.

> [!IMPORTANT]
> Breaking changes from 0.0.1 to 0.1.0:
> * Dependencies are now organized into optional feature-groups (further details below). Use `pip install 'markitdown[all]~=0.1.0a1'` to have backward-compatible behavior.
> * Dependencies are now organized into optional feature-groups (further details below). Use `pip install 'markitdown[all]'` to have backward-compatible behavior.
> * convert\_stream() now requires a binary file-like object (e.g., a file opened in binary mode, or an io.BytesIO object). This is a breaking change from the previous version, which also accepted text file-like objects, like io.StringIO.
> * The DocumentConverter class interface has changed to read from file-like streams rather than file paths. *No temporary files are created anymore*. If you are the maintainer of a plugin, or custom DocumentConverter, you likely need to update your code. Otherwise, if only using the MarkItDown class or CLI (as in these examples), you should not need to change anything.
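
As a minimal sketch of the binary-stream requirement above, using only the standard library: text held in an `io.StringIO` can be re-wrapped as an `io.BytesIO` before it is handed to `convert_stream()`. The helper name `as_binary_stream` is hypothetical and is not part of MarkItDown.

```python
import io


def as_binary_stream(stream, encoding="utf-8"):
    """Return a binary file-like object for `stream` (hypothetical helper).

    Binary streams are returned unchanged; text streams such as io.StringIO
    are read fully and re-encoded into an io.BytesIO.
    """
    if isinstance(stream, io.TextIOBase):
        return io.BytesIO(stream.read().encode(encoding))
    return stream


binary = as_binary_stream(io.StringIO("# Title\n\nSome text."))
# `binary` is now suitable for something like MarkItDown().convert_stream(binary)
```

Reading the whole text stream into memory is the simplest approach and is likely fine for small inputs; very large inputs may call for a different strategy.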

MarkItDown is a lightweight Python utility for converting various files to Markdown for use with LLMs and related text analysis pipelines. To this end, it is most comparable to [textract](https://github.com/deanmalmgren/textract), but with a focus on preserving important document structure and content as Markdown (including headings, lists, tables, links, etc.). While the output is often reasonably presentable and human-friendly, it is meant to be consumed by text analysis tools -- and may not be the best option for high-fidelity document conversions for human consumption.

At present, MarkItDown supports:
MarkItDown currently supports the conversion from:

- PDF
- PowerPoint (reading in top-to-bottom, left-to-right order)
- PowerPoint
- Word
- Excel
- Images (EXIF metadata and OCR)
@@ -23,6 +27,7 @@ At present, MarkItDown supports:
- Text-based formats (CSV, JSON, XML)
- ZIP files (iterates over contents)
- YouTube URLs
- EPubs
- ... and more!

## Why Markdown?
@@ -34,14 +39,39 @@ responses unprompted. This suggests that they have been trained on vast amounts
Markdown-formatted text, and understand it well. As a side benefit, Markdown conventions
are also highly token-efficient.

## Prerequisites
MarkItDown requires Python 3.10 or higher. It is recommended to use a virtual environment to avoid dependency conflicts.

With the standard Python installation, you can create and activate a virtual environment using the following commands:

```bash
python -m venv .venv
source .venv/bin/activate
```

If using `uv`, you can create a virtual environment with:

```bash
uv venv --python=3.12 .venv
source .venv/bin/activate
# NOTE: Be sure to use 'uv pip install' rather than just 'pip install' to install packages in this virtual environment
```

If you are using Anaconda, you can create a virtual environment with:

```bash
conda create -n markitdown python=3.12
conda activate markitdown
```

## Installation

To install MarkItDown, use pip: `pip install 'markitdown[all]~=0.1.0a1'`. Alternatively, you can install it from the source:
To install MarkItDown, use pip: `pip install 'markitdown[all]'`. Alternatively, you can install it from the source:

```bash
git clone git@github.com:microsoft/markitdown.git
cd markitdown
pip install -e packages/markitdown[all]
pip install -e 'packages/markitdown[all]'
```

## Usage
@@ -68,7 +98,7 @@ cat path-to-file.pdf | markitdown
MarkItDown has optional dependencies for activating various file formats. Earlier in this document, we installed all optional dependencies with the `[all]` option. However, you can also install them individually for more control. For example:

```bash
pip install markitdown[pdf, docx, pptx]
pip install 'markitdown[pdf, docx, pptx]'
```

will install only the dependencies for PDF, DOCX, and PPTX files.
@@ -102,6 +132,38 @@ markitdown --use-plugins path-to-file.pdf

To find available plugins, search GitHub for the hashtag `#markitdown-plugin`. To develop a plugin, see `packages/markitdown-sample-plugin`.

#### markitdown-ocr Plugin

The `markitdown-ocr` plugin adds OCR support to PDF, DOCX, PPTX, and XLSX converters, extracting text from embedded images using LLM Vision — the same `llm_client` / `llm_model` pattern that MarkItDown already uses for image descriptions. No new ML libraries or binary dependencies required.

**Installation:**

```bash
pip install markitdown-ocr
pip install openai # or any OpenAI-compatible client
```

**Usage:**

Pass the same `llm_client` and `llm_model` you would use for image descriptions:

```python
from markitdown import MarkItDown
from openai import OpenAI

md = MarkItDown(
    enable_plugins=True,
    llm_client=OpenAI(),
    llm_model="gpt-4o",
)
result = md.convert("document_with_images.pdf")
print(result.text_content)
```

If no `llm_client` is provided, the plugin still loads, but OCR is silently skipped and the standard built-in converter is used instead.

See [`packages/markitdown-ocr/README.md`](packages/markitdown-ocr/README.md) for detailed documentation.

### Azure Document Intelligence

To use Microsoft Document Intelligence for conversion:
@@ -134,14 +196,14 @@ result = md.convert("test.pdf")
print(result.text_content)
```

To use Large Language Models for image descriptions, provide `llm_client` and `llm_model`:
To use Large Language Models for image descriptions (currently only for pptx and image files), provide `llm_client` and `llm_model`:

```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(llm_client=client, llm_model="gpt-4o")
md = MarkItDown(llm_client=client, llm_model="gpt-4o", llm_prompt="optional custom prompt")
result = md.convert("example.jpg")
print(result.text_content)
```
@@ -169,7 +231,7 @@ contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additio

### How to Contribute

You can help by looking at issues or helping review PRs. Any issue or PR is welcome, but we have also marked some as 'open for contribution' and 'open for reviewing' to help facilitate community contributions. These are ofcourse just suggestions and you are welcome to contribute in any way you like.
You can help by looking at issues or helping review PRs. Any issue or PR is welcome, but we have also marked some as 'open for contribution' and 'open for reviewing' to help facilitate community contributions. These are of course just suggestions and you are welcome to contribute in any way you like.

<div align="center">

28 changes: 28 additions & 0 deletions packages/markitdown-mcp/Dockerfile
@@ -0,0 +1,28 @@
FROM python:3.13-slim-bullseye

ENV DEBIAN_FRONTEND=noninteractive
ENV EXIFTOOL_PATH=/usr/bin/exiftool
ENV FFMPEG_PATH=/usr/bin/ffmpeg
ENV MARKITDOWN_ENABLE_PLUGINS=True

# Runtime dependencies
# NOTE: Add any additional MarkItDown plugins here
# Clean up the apt lists in the same layer so they do not persist in the image
RUN apt-get update && apt-get install -y --no-install-recommends \
    ffmpeg \
    exiftool \
    && rm -rf /var/lib/apt/lists/*

COPY . /app
RUN pip --no-cache-dir install /app

WORKDIR /workdir

# Default USERID and GROUPID
ARG USERID=nobody
ARG GROUPID=nogroup

USER $USERID:$GROUPID

ENTRYPOINT [ "markitdown-mcp" ]
142 changes: 142 additions & 0 deletions packages/markitdown-mcp/README.md
@@ -0,0 +1,142 @@
# MarkItDown-MCP

> [!IMPORTANT]
> The MarkItDown-MCP package is meant for **local use**, with local trusted agents. In particular, when running the MCP server with Streamable HTTP or SSE, it binds to `localhost` by default, and is not exposed to other machines on the network or Internet. In this configuration, it is meant to be a direct alternative to the STDIO transport, which may be more convenient in some cases. DO NOT bind the server to other interfaces unless you understand the [security implications](#security-considerations) of doing so.


[![PyPI](https://img.shields.io/pypi/v/markitdown-mcp.svg)](https://pypi.org/project/markitdown-mcp/)
![PyPI - Downloads](https://img.shields.io/pypi/dd/markitdown-mcp)
[![Built by AutoGen Team](https://img.shields.io/badge/Built%20by-AutoGen%20Team-blue)](https://github.com/microsoft/autogen)

The `markitdown-mcp` package provides a lightweight STDIO, Streamable HTTP, and SSE MCP server for calling MarkItDown.

It exposes one tool: `convert_to_markdown(uri)`, where uri can be any `http:`, `https:`, `file:`, or `data:` URI.
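
As a rough illustration of the URI forms listed above, the following sketch builds a `file:` URI and a `data:` URI with the standard library. The helper names `file_uri` and `data_uri` are hypothetical; they only show what a caller might pass as the `uri` argument and are not part of `markitdown-mcp`.

```python
import base64
from pathlib import Path


def file_uri(path):
    """Build a file: URI from a local path (made absolute first)."""
    return Path(path).resolve().as_uri()


def data_uri(payload, mime_type="text/plain"):
    """Build a base64-encoded data: URI from raw bytes."""
    encoded = base64.b64encode(payload).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


print(file_uri("example.txt"))
print(data_uri(b"Hello, MarkItDown!"))
```

Base64 is used for the `data:` URI here because it works for arbitrary bytes; plain percent-encoded data URIs are another valid form.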

## Installation

To install the package, use pip:

```bash
pip install markitdown-mcp
```

## Usage

To run the MCP server using STDIO (the default), use the following command:


```bash
markitdown-mcp
```

To run the MCP server using Streamable HTTP and SSE, use the following command:

```bash
markitdown-mcp --http --host 127.0.0.1 --port 3001
```

## Running in Docker

To run `markitdown-mcp` in Docker, build the Docker image using the provided Dockerfile:
```bash
docker build -t markitdown-mcp:latest .
```

And run it using:
```bash
docker run -it --rm markitdown-mcp:latest
```
This will be sufficient for remote URIs. To access local files, you need to mount the local directory into the container. For example, if you want to access files in `/home/user/data`, you can run:

```bash
docker run -it --rm -v /home/user/data:/workdir markitdown-mcp:latest
```

Once mounted, all files under `/home/user/data` will be accessible under `/workdir` in the container. For example, if you have a file `example.txt` in `/home/user/data`, it will be accessible in the container at `/workdir/example.txt`.
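
The path mapping described above can be sketched as a small helper that translates a host path into the `file:` URI the containerized server would see. This is an illustrative sketch only: the function name `container_uri` is hypothetical, and the default directories simply mirror the `-v /home/user/data:/workdir` example.

```python
from pathlib import PurePosixPath
from urllib.parse import quote


def container_uri(host_path, host_dir="/home/user/data", mount_point="/workdir"):
    """Translate a host path under `host_dir` into a file: URI inside the container."""
    # relative_to raises ValueError if host_path is not inside host_dir
    relative = PurePosixPath(host_path).relative_to(host_dir)
    inside = PurePosixPath(mount_point) / relative
    return "file://" + quote(str(inside))


print(container_uri("/home/user/data/example.txt"))
```

Paths outside the mounted directory raise `ValueError`, which matches the fact that the container cannot see them.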

## Accessing from Claude Desktop

It is recommended to use the Docker image when running the MCP server for Claude Desktop.

Follow [these instructions](https://modelcontextprotocol.io/quickstart/user#for-claude-desktop-users) to access Claude's `claude_desktop_config.json` file.

Edit it to include the following JSON entry:

```json
{
  "mcpServers": {
    "markitdown": {
      "command": "docker",
      "args": [
        "run",
        "--rm",
        "-i",
        "markitdown-mcp:latest"
      ]
    }
  }
}
```

If you want to mount a directory, adjust it accordingly:

```json
{
  "mcpServers": {
    "markitdown": {
      "command": "docker",
      "args": [
        "run",
        "--rm",
        "-i",
        "-v",
        "/home/user/data:/workdir",
        "markitdown-mcp:latest"
      ]
    }
  }
}
```
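
If the entry is generated by a script rather than typed by hand, the two variants above can be produced from one function. The sketch below is an assumption-laden illustration: `docker_server_entry` is a hypothetical helper, and it only prints the JSON rather than writing to `claude_desktop_config.json`.

```python
import json


def docker_server_entry(image="markitdown-mcp:latest", mount=None):
    """Build the `markitdown` server entry.

    `mount` is an optional (host_dir, container_dir) pair for a -v bind mount.
    """
    args = ["run", "--rm", "-i"]
    if mount is not None:
        host_dir, container_dir = mount
        args += ["-v", f"{host_dir}:{container_dir}"]
    args.append(image)
    return {"mcpServers": {"markitdown": {"command": "docker", "args": args}}}


entry = docker_server_entry(mount=("/home/user/data", "/workdir"))
print(json.dumps(entry, indent=2))
```

When merging into an existing config file, any other keys already present under `mcpServers` would need to be preserved.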

## Debugging

To debug the MCP server you can use the `MCP Inspector` tool.

```bash
npx @modelcontextprotocol/inspector
```

You can then connect to the inspector through the specified host and port (e.g., `http://localhost:5173/`).

If using STDIO:
* select `STDIO` as the transport type,
* input `markitdown-mcp` as the command, and
* click `Connect`

If using Streamable HTTP:
* select `Streamable HTTP` as the transport type,
* input `http://127.0.0.1:3001/mcp` as the URL, and
* click `Connect`

If using SSE:
* select `SSE` as the transport type,
* input `http://127.0.0.1:3001/sse` as the URL, and
* click `Connect`

Finally:
* click the `Tools` tab,
* click `List Tools`,
* click `convert_to_markdown`, and
* run the tool on any valid URI.

## Security Considerations

The server does not support authentication, and runs with the privileges of the user running it. For this reason, when running in SSE or Streamable HTTP mode, the server binds by default to `localhost`. Even so, it is important to recognize that the server can be accessed by any process or user on the same local machine, and that the `convert_to_markdown` tool can be used to read any file that the server's user has access to, or any data from the network. If you require additional security, consider running the server in a sandboxed environment, such as a virtual machine or container, and ensure that the user permissions are properly configured to limit access to sensitive files and network segments. Above all, DO NOT bind the server to other interfaces (non-localhost) unless you understand the security implications of doing so.
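
A wrapper script that launches the server could guard against accidental non-local binding before passing a value to `--host`. The following is a minimal sketch using the standard library; `is_loopback_host` is a hypothetical helper and does not reflect how `markitdown-mcp` itself validates the host.

```python
import ipaddress


def is_loopback_host(host):
    """Return True if `host` is `localhost` or a loopback IP literal."""
    if host.lower() == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal (for example, another hostname): treat as non-local.
        return False


for candidate in ("127.0.0.1", "::1", "0.0.0.0", "192.168.1.10"):
    print(candidate, is_loopback_host(candidate))
```

Hostnames other than `localhost` are deliberately treated as non-local here, since resolving them would add a network dependency to the check.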

## Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft
trademarks or logos is subject to and must follow
[Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/en-us/legal/intellectualproperty/trademarks/usage/general).
Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.
Any use of third-party trademarks or logos are subject to those third-party's policies.