feat: add DocConverter for legacy .doc (Word 97-2003) files by AviArora02-commits · Pull Request #1710 · microsoft/markitdown

AviArora02-commits · 2026-04-11T15:03:21Z

Closes #23

Add support for legacy `.doc` (Word 97-2003) files — Closes #23

Summary

This PR adds a new DocConverter that converts legacy Microsoft Word .doc files
(Word 97–2003 OLE Compound Document format) to Markdown, closing issue #23.

# All of these now work.

# Command line:
markitdown document.doc

# Python API (f is a file opened in binary mode):
from markitdown import MarkItDown
md = MarkItDown()

md.convert("document.doc")                                    # file path
md.convert_stream(f, extension=".doc")                        # stream + ext hint
md.convert_stream(f, mimetype="application/msword")           # stream + mime hint

What changed

| File | Change |
| --- | --- |
| `converters/_doc_converter.py` | New file – `DocConverter` class |
| `converters/__init__.py` | Export `DocConverter` |
| `_markitdown.py` | Import and register `DocConverter` |
| `pyproject.toml` | Add `doc = ["olefile"]` optional dependency group |
| `tests/_test_vectors.py` | Add `FileTestVector` for `test.doc` |
| `tests/test_files/test.doc` | Binary test fixture with known UUIDs |

Technical approach

The legacy .doc format is an OLE Compound Document (the same container used
by Outlook .msg files, which markitdown already handles via olefile).

No new third-party dependencies are introduced — olefile is already in
pyproject.toml as an optional dependency for OutlookMsgConverter.
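Because `.doc` and `.msg` share the same container, a converter cannot tell them apart from the container signature alone. The following is a minimal sketch, not the code in this PR, of a signature check using only the standard library. The helper name `looks_like_ole` is hypothetical. In a real converter one would presumably also confirm that the container holds a `WordDocument` stream (for example with `olefile`'s `OleFileIO.exists`) before accepting the file.

```python
import io

# Eight-byte signature at the start of every OLE Compound Document.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_like_ole(stream) -> bool:
    """Return True if the stream starts with the OLE signature.

    Hypothetical helper: the stream position is restored afterwards so
    that a later converter can read the stream from the same place.
    """
    pos = stream.tell()
    try:
        return stream.read(len(OLE_MAGIC)) == OLE_MAGIC
    finally:
        stream.seek(pos)


ole_like = io.BytesIO(OLE_MAGIC + b"\x00" * 504)
zip_like = io.BytesIO(b"PK\x03\x04 not an OLE file")
print(looks_like_ole(ole_like), looks_like_ole(zip_like))
```

The signature only shows that the file is an OLE container, so it is a cheap pre-filter rather than proof that the file is a Word document.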

The converter follows the Word97-2007 Binary File Format spec:

1. Open the OLE container with olefile.OleFileIO.
2. Read the FIB (File Information Block) from the WordDocument stream
   to locate the CLX (piece table container) in either 0Table or 1Table.
3. Parse the CLX to find the PlcPcd (piece table).
4. For each piece: decode as CP1252 (single-byte compressed) or UTF-16LE
   (two bytes/char), preserving paragraph breaks (\r → \n).
5. Strip non-printable control characters and normalize blank lines.
6. Fall back gracefully to a UTF-16 brute-scan on any parse error
   (handles corrupted or fast-saved documents).
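The FIB, CLX and piece-decoding steps above can be sketched on raw bytes as follows. This is an illustration under stated assumptions, not the `DocConverter` implementation: the function names are hypothetical, and the offsets (flags at `0x0A`, `fcClx`/`lcbClx` at `0x01A2`/`0x01A6`, the `fCompressed` bit at `0x40000000`) reflect my reading of the [MS-DOC] layout and should be checked against the spec.

```python
import struct


def locate_clx(fib: bytes):
    """Return (table_stream_name, fc_clx, lcb_clx) read from the FIB."""
    (flags,) = struct.unpack_from("<H", fib, 0x0A)
    # Assumed: bit 0x0200 (fWhichTblStm) selects 1Table over 0Table.
    table = "1Table" if flags & 0x0200 else "0Table"
    fc_clx, lcb_clx = struct.unpack_from("<II", fib, 0x01A2)
    return table, fc_clx, lcb_clx


def parse_plcpcd(clx: bytes):
    """Return a list of (cp_start, cp_end, fc_raw) from a CLX blob."""
    pos = 0
    # Skip Prc entries: type byte 0x01, 2-byte size, then that many bytes.
    while pos < len(clx) and clx[pos] == 0x01:
        (size,) = struct.unpack_from("<H", clx, pos + 1)
        pos += 3 + size
    if pos >= len(clx) or clx[pos] != 0x02:
        raise ValueError("no Pcdt found in CLX")
    (lcb,) = struct.unpack_from("<I", clx, pos + 1)
    plc = clx[pos + 5 : pos + 5 + lcb]
    # PlcPcd holds n+1 character positions (4 bytes each), then n Pcds (8 bytes each).
    n = (len(plc) - 4) // 12
    cps = struct.unpack_from(f"<{n + 1}I", plc, 0)
    pieces = []
    for i in range(n):
        # Pcd layout: 2 flag bytes, 4-byte fc value, 2-byte prm.
        (fc_raw,) = struct.unpack_from("<I", plc, 4 * (n + 1) + 8 * i + 2)
        pieces.append((cps[i], cps[i + 1], fc_raw))
    return pieces


def decode_piece(word_stream: bytes, cp_start: int, cp_end: int, fc_raw: int) -> str:
    """Decode one piece of text from the WordDocument stream bytes."""
    n_chars = cp_end - cp_start
    fc = fc_raw & 0x3FFFFFFF
    if fc_raw & 0x40000000:
        # Compressed piece: one byte per character, stored at fc / 2.
        start = fc // 2
        text = word_stream[start : start + n_chars].decode("cp1252", errors="replace")
    else:
        # Uncompressed piece: UTF-16LE, two bytes per character, stored at fc.
        text = word_stream[fc : fc + 2 * n_chars].decode("utf-16-le", errors="replace")
    return text.replace("\r", "\n")
```

In use, the bytes passed as `clx` would be the `lcb_clx` bytes at offset `fc_clx` of the table stream named by `locate_clx`, and the full text is the concatenation of `decode_piece` over all pieces in order.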

Limitations (documented in the class docstring)

Rich formatting (bold, italic, heading styles, tables) is not preserved.
Reconstructing it from the legacy binary format would require a full Word
parser. Plain text extraction is sufficient for most LLM pipelines, and it
keeps the implementation minimal, with no dependency beyond olefile.

Users who need rich formatting from modern Word files should use .docx
(already supported by DocxConverter via mammoth).

Testing

packages/markitdown/tests/test_module_vectors.py  — all local-file tests PASS

The new test.doc fixture contains two known UUIDs that the test vector
checks for, following the same pattern as the existing .docx test.
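As a rough illustration of that pattern, the sketch below checks converted text against a list of required strings. `TextVector` and `check_vector` are hypothetical stand-ins rather than the project's `FileTestVector`, and the UUID values are placeholders, not the ones stored in `test.doc`.

```python
from dataclasses import dataclass, field


@dataclass
class TextVector:
    """Hypothetical stand-in for a test vector: a file plus expected strings."""
    filename: str
    must_include: list = field(default_factory=list)
    must_not_include: list = field(default_factory=list)


def check_vector(vector: TextVector, markdown: str):
    """Return (missing, unexpected) strings for the converted text."""
    missing = [s for s in vector.must_include if s not in markdown]
    unexpected = [s for s in vector.must_not_include if s in markdown]
    return missing, unexpected


# Placeholder UUIDs for illustration only.
vector = TextVector(
    filename="test.doc",
    must_include=[
        "00000000-0000-0000-0000-000000000001",
        "00000000-0000-0000-0000-000000000002",
    ],
)
converted = "Intro 00000000-0000-0000-0000-000000000001 and nothing else"
missing, unexpected = check_vector(vector, converted)
print(missing, unexpected)
```

Embedding unique identifiers in the fixture makes the check robust: a UUID is very unlikely to appear in the output by accident, so its presence indicates that the document body was really extracted.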

Install

pip install 'markitdown[doc]'   # just .doc support (olefile only)
pip install 'markitdown[all]'   # everything (already includes olefile)

pr_description.md

Closes microsoft#23 Adds DocConverter using olefile to parse OLE Compound Document format. Reads FIB from WordDocument stream, locates PlcPcd piece table in the table stream, decodes each piece as CP1252 or UTF-16LE. Falls back to UTF-16 brute-scan for corrupted documents. No new dependencies - olefile is already used by OutlookMsgConverter.
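The brute-scan fallback and the text cleanup could look roughly like the sketch below. It is an assumption about one reasonable approach, not the code in this PR: both function names are hypothetical, and the scan is deliberately limited to printable ASCII stored as UTF-16LE, so non-ASCII text would be missed.

```python
import re


def brute_scan_utf16(data: bytes, min_chars: int = 4) -> str:
    """Collect runs of printable ASCII stored as UTF-16LE code units.

    Simplification: only ASCII characters (low byte printable, high byte
    zero) are recognized, and runs shorter than min_chars are ignored.
    """
    pattern = re.compile(rb"(?:[\x20-\x7e\r\n\t]\x00){%d,}" % min_chars)
    runs = [m.group().decode("utf-16-le") for m in pattern.finditer(data)]
    return "\n".join(runs)


def clean_text(text: str) -> str:
    """Normalize paragraph marks, drop control characters, tidy blank lines."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Remove control characters except newline and tab.
    text = re.sub(r"[\x00-\x08\x0b-\x1f\x7f]", "", text)
    # Collapse runs of three or more newlines into one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


raw = b"\xff\xfe\x01" + "Hello world".encode("utf-16-le") + b"\x00\x01\x02\x03"
print(brute_scan_utf16(raw))
print(repr(clean_text("Title\r\rBody\x07\x13 text\r\r\r\rEnd\r")))
```

The minimum run length is a trade-off: a low value picks up more stray binary data that happens to look like text, while a high value drops short headings and labels.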

AviArora02-commits · 2026-04-11T15:19:30Z

@microsoft-github-policy-service agree

AviArora02-commits wants to merge 1 commit into microsoft:main from AviArora02-commits:feature/doc-extension-support-issue-23