feat: add DocConverter for legacy .doc (Word 97-2003) files #1710

Open
AviArora02-commits wants to merge 1 commit into microsoft:main from AviArora02-commits:feature/doc-extension-support-issue-23

@AviArora02-commits

Closes #23

Add support for legacy .doc (Word 97-2003) files — Closes #23

Summary

This PR adds a new DocConverter that converts legacy Microsoft Word .doc files
(Word 97–2003 OLE Compound Document format) to Markdown, closing issue #23.

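# All of these now work.

# Command line:
markitdown document.doc

# Python API:
from markitdown import MarkItDown
md = MarkItDown()

md.convert("document.doc")                                    # file path
with open("document.doc", "rb") as f:                         # streams must be binary
    md.convert_stream(f, extension=".doc")                    # stream + ext hint
    f.seek(0)                                                 # rewind before reusing the stream
    md.convert_stream(f, mimetype="application/msword")       # stream + mime hint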

What changed

| File | Change |
| --- | --- |
| converters/_doc_converter.py | New file: DocConverter class |
| converters/__init__.py | Export DocConverter |
| _markitdown.py | Import and register DocConverter |
| pyproject.toml | Add doc = ["olefile"] optional dependency group |
| tests/_test_vectors.py | Add FileTestVector for test.doc |
| tests/test_files/test.doc | Binary test fixture with known UUIDs |

Technical approach

The legacy .doc format is an OLE Compound Document (the same container used
by Outlook .msg files, which markitdown already handles via olefile).
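
As a rough illustration of what "OLE Compound Document" means at the byte level, the sketch below checks a stream for the 8-byte compound-file signature using only the standard library. The helper name looks_like_ole is hypothetical and is not taken from this PR; olefile ships its own isOleFile helper for a similar check. The signature alone does not distinguish .doc from other OLE files such as .msg or .xls, so a real converter would still need to confirm that a WordDocument stream exists.

```python
import io

# Signature found at the start of every OLE Compound Document.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_like_ole(stream) -> bool:
    """Return True if the binary stream starts with the OLE signature.

    Hypothetical helper for illustration; the stream position is restored
    so the caller can keep reading from where it was.
    """
    pos = stream.tell()
    try:
        return stream.read(len(OLE_MAGIC)) == OLE_MAGIC
    finally:
        stream.seek(pos)


if __name__ == "__main__":
    sample = io.BytesIO(OLE_MAGIC + b"\x00" * 504)
    print(looks_like_ole(sample))
```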

No new third-party dependencies are introduced — olefile is already in
pyproject.toml as an optional dependency for OutlookMsgConverter.

The converter follows the Word97-2007 Binary File Format spec:

  1. Open the OLE container with olefile.OleFileIO.
  2. Read the FIB (File Information Block) from the WordDocument stream
    to locate the CLX (piece table container) in either 0Table or 1Table.
  3. Parse the CLX to find the PlcPcd (piece table).
  4. For each piece: decode as CP1252 (single-byte compressed) or UTF-16LE
    (two bytes/char), preserving paragraph breaks (\r\n).
  5. Strip non-printable control characters and normalize blank lines.
  6. Fall back gracefully to a UTF-16 brute-scan on any parse error
    (handles corrupted or fast-saved documents).
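
Steps 2 to 5 can be sketched roughly as follows. This is a minimal illustration written against the published .doc binary layout, not the code in this PR: the function names are hypothetical, the FIB offsets (flags at 0x000A, fcClx/lcbClx at 0x01A2) assume a Word 97 or later FIB, and the cleanup rules in clean_text are one plausible reading of step 5. It operates on raw bytes, so the streams are assumed to have already been read out of the OLE container.

```python
import re
import struct


def locate_clx(word_stream: bytes):
    """Read the FIB: which table stream to use, and where the CLX lives in it."""
    (flags,) = struct.unpack_from("<H", word_stream, 0x000A)
    table_name = "1Table" if flags & 0x0200 else "0Table"  # fWhichTblStm bit
    fc_clx, lcb_clx = struct.unpack_from("<II", word_stream, 0x01A2)
    return table_name, fc_clx, lcb_clx


def parse_piece_table(clx: bytes):
    """Return a list of (cp_start, cp_end, fc, compressed) from a CLX blob."""
    pos = 0
    # Skip optional Prc entries: type byte 0x01, 2-byte size, then data.
    while pos < len(clx) and clx[pos] == 0x01:
        (size,) = struct.unpack_from("<H", clx, pos + 1)
        pos += 3 + size
    if pos >= len(clx) or clx[pos] != 0x02:
        raise ValueError("Pcdt marker (0x02) not found in CLX")
    (lcb,) = struct.unpack_from("<I", clx, pos + 1)
    plc = clx[pos + 5 : pos + 5 + lcb]
    # PlcPcd holds n+1 character positions (4 bytes) and n Pcds (8 bytes).
    n = (len(plc) - 4) // 12
    if n < 1:
        raise ValueError("empty piece table")
    cps = struct.unpack_from(f"<{n + 1}I", plc, 0)
    pieces = []
    for i in range(n):
        # The 4-byte FcCompressed field follows 2 flag bytes in each Pcd.
        (fc_raw,) = struct.unpack_from("<I", plc, 4 * (n + 1) + 8 * i + 2)
        compressed = bool(fc_raw & 0x40000000)
        pieces.append((cps[i], cps[i + 1], fc_raw & 0x3FFFFFFF, compressed))
    return pieces


def decode_pieces(word_stream: bytes, pieces) -> str:
    """Decode each piece as CP1252 (compressed) or UTF-16LE."""
    out = []
    for cp_start, cp_end, fc, compressed in pieces:
        count = cp_end - cp_start
        if compressed:
            start = fc // 2  # compressed pieces store twice the byte offset
            raw = word_stream[start : start + count]
            out.append(raw.decode("cp1252", errors="replace"))
        else:
            raw = word_stream[fc : fc + 2 * count]
            out.append(raw.decode("utf-16-le", errors="replace"))
    return "".join(out)


def clean_text(text: str) -> str:
    """Assumed cleanup: unify line breaks, drop control chars, squeeze blanks."""
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = re.sub(r"[\x00-\x08\x0b-\x1f\x7f]", "", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```

In a real converter the two byte strings would come from olefile, for example by reading the WordDocument stream and then the table stream named by locate_clx, and slicing the CLX out of the table stream at fcClx with length lcbClx.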

Limitations (documented in the class docstring)

Rich formatting (bold, italic, heading styles, tables) is not preserved.
Reconstructing it from the legacy binary format would require a full Word
parser. Plain-text extraction covers the common LLM-pipeline use case and
keeps the implementation minimal, with no dependency beyond olefile.

Users who need rich formatting from modern Word files should use .docx
(already supported by DocxConverter via mammoth).


Testing

packages/markitdown/tests/test_module_vectors.py  — all local-file tests PASS

The new test.doc fixture contains two known UUIDs that the test vector
checks for, following the same pattern as the existing .docx test.
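
The idea behind this kind of test vector can be sketched as below: a record lists strings that must (or must not) appear in the converted output, and a checker reports any violations. The TextVector class and check_vector function are hypothetical stand-ins, not the project's FileTestVector, and the UUID shown is a placeholder rather than a value from the real fixture.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class TextVector:
    """Hypothetical, simplified stand-in for a file test vector."""

    filename: str
    must_include: List[str] = field(default_factory=list)
    must_not_include: List[str] = field(default_factory=list)


def check_vector(vector: TextVector, markdown: str) -> List[str]:
    """Return a list of human-readable problems; empty means the check passed."""
    problems = []
    for needle in vector.must_include:
        if needle not in markdown:
            problems.append(f"missing: {needle}")
    for needle in vector.must_not_include:
        if needle in markdown:
            problems.append(f"unexpected: {needle}")
    return problems


# Placeholder UUID for illustration only.
EXAMPLE = TextVector(
    filename="test.doc",
    must_include=["00000000-0000-0000-0000-000000000001"],
    must_not_include=["\x00"],
)
```

Known UUIDs make good markers because they are very unlikely to appear in the output by accident, so finding them is strong evidence that the document body was actually extracted.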


Install

pip install 'markitdown[doc]'   # just .doc support (olefile only)
pip install 'markitdown[all]'   # everything (already includes olefile)
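
Because olefile sits in an optional dependency group, code that relies on it usually has to cope with it being absent. The sketch below shows one generic way to do that with the standard library. The helper names are hypothetical and this is not the guard used in the PR or in markitdown itself; only the install hint is taken from the commands above.

```python
import importlib.util


def olefile_available() -> bool:
    """Report whether the optional olefile package can be imported."""
    return importlib.util.find_spec("olefile") is not None


def require_olefile() -> None:
    """Raise a helpful error if .doc support was not installed."""
    if not olefile_available():
        raise ImportError(
            "Converting .doc files needs the optional 'olefile' package. "
            "Install it with: pip install 'markitdown[doc]'"
        )
```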

pr_description.md

Closes microsoft#23

Adds DocConverter using olefile to parse OLE Compound Document format.
Reads FIB from WordDocument stream, locates PlcPcd piece table in the
table stream, decodes each piece as CP1252 or UTF-16LE. Falls back to
UTF-16 brute-scan for corrupted documents. No new dependencies - olefile
is already used by OutlookMsgConverter.
@AviArora02-commits
Author

@microsoft-github-policy-service agree

Development

Successfully merging this pull request may close these issues.

Support for .doc extensions
