Update tests.yml by joseguizar95-art · Pull Request #1699 · microsoft/markitdown

joseguizar95-art · 2026-04-11T00:48:31Z

No description provided.

…ft#1499)

* Added PDF table extraction feature with aligned Markdown (microsoft#1419)
* Add PDF test files and enhance extraction tests
  - Added a medical report scan PDF for testing scanned PDF handling.
  - Included a retail purchase receipt PDF to validate receipt extraction functionality.
  - Introduced a multipage invoice PDF to test extraction of complex invoice structures.
  - Added a borderless table PDF for testing inventory reconciliation report extraction.
  - Implemented comprehensive tests for PDF table extraction, ensuring proper structure and data integrity.
  - Enhanced existing tests to validate the order and presence of extracted content across various PDF types.
* fix: update dependencies for PDF processing and improve table extraction logic
* Bumped version of pdfminer.six

Authored-by: Ashok <ashh010101@gmail.com>

…1525)

* Fix: PDF parsing doesn't support partially numbered lists
* Refactor: Move import of PARTIAL_NUMBERING_PATTERN to the top of the test file
* Refactor: Improve assertion formatting in partial numbering tests

* feat: enhance PDF table extraction to support complex forms and add new test cases
* feat: enhance PDF table extraction with adaptive column clustering and add comprehensive test cases
* fix: correct formatting and improve assertions in PDF table tests

…ft#1541)

* Add OCR test data and implement tests for various document formats
  - Created HTML file with multiple images for testing OCR extraction.
  - Added several PDF files with different layouts and image placements to validate OCR functionality.
  - Introduced PPTX files with complex layouts and images at various positions for comprehensive testing.
  - Included XLSX files with multiple images and complex layouts to ensure accurate OCR extraction.
  - Implemented a new test suite in `test_ocr.py` to validate OCR functionality across all document types, ensuring context preservation and accuracy.
* Enhance OCR functionality and validation in document converters
  - Refactor image extraction and processing in PDF, PPTX, and XLSX converters for improved readability and consistency.
  - Implement detailed validation for OCR text positioning relative to surrounding text in test cases.
  - Introduce comprehensive tests for expected OCR results across various document types, ensuring no base64 images are present.
  - Improve error handling and logging for better debugging during OCR extraction.
* Add support for scanned PDFs with full-page OCR fallback and implement tests
* Bump version to 0.1.6b1 in __about__.py
* Refactor OCR services to support LLM Vision, update README and tests accordingly
* Add OCR-enabled converters and ensure consistent OCR format across document types
* Refactor converters to improve import organization and enhance OCR functionality across DOCX, PDF, PPTX, and XLSX converters
* Refactor exception imports for consistency across converters and tests
* Fix OCR tests to match MockOCRService output and fix cross-platform file URI handling
* Bump version to 0.1.6b1 in __about__.py
* Skip DOCX/XLSX/PPTX OCR tests when optional dependencies are missing
* Add comprehensive OCR test suite for various document formats
  - Introduced multiple test documents for PDF, DOCX, XLSX, and PPTX formats, covering scenarios with images at the start, middle, and end.
  - Implemented tests for complex layouts, multi-page documents, and documents with multiple images.
  - Created a new test script `test_ocr.py` to validate OCR functionality, ensuring context preservation and accurate text extraction.
  - Added expected OCR results for validation against ground truth.
  - Included tests for scanned documents to verify OCR fallback mechanisms.
* Remove obsolete HTML test files and refactor test cases for file URIs and OCR format consistency
  - Deleted `html_image_start.html` and `html_multiple_images.html` as they are no longer needed.
  - Updated `test_file_uris` in `test_module_misc.py` to simplify assertions by removing unnecessary `url2pathname` usage.
  - Removed `test_ocr_format_consistency.py` as it is no longer relevant to the current testing framework.
* Refactor OCR processing in PdfConverterWithOCR and enhance unit tests for multipage PDFs
* Revert
* Revert
* Update REDMEs
* Refactor import statements for consistency and improve formatting in converter and test files

microsoft#1612)

* Fix O(n) memory growth in PDF conversion by calling page.close() after each page
* Refactor PDF memory optimization tests for improved readability and consistency
* Add memory benchmarking tests for PDF conversion with page.close() fix
* Remove unnecessary blank lines in PDF memory optimization tests for cleaner code
* Bump version to 0.1.6b2 in __about__.py
* Update PDF conversion tests to include mimetype in StreamInfo

microsoft-github-policy-service · 2026-04-12T01:05:46Z

@joseguizar95-art please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information.

@microsoft-github-policy-service agree [company="{your company}"]

Options:

* (default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
  @microsoft-github-policy-service agree
* (when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
  @microsoft-github-policy-service agree company="Microsoft"

Contributor License Agreement

Contribution License Agreement

This Contribution License Agreement (“Agreement”) is agreed to by the party signing below (“You”),
and conveys certain license rights to Microsoft Corporation and its affiliates (“Microsoft”) for Your
contributions to Microsoft open source projects. This Agreement is effective as of the latest signature
date below.

1. Definitions.
“Code” means the computer software code, whether in human-readable or machine-executable form,
that is delivered by You to Microsoft under this Agreement.
“Project” means any of the projects owned or managed by Microsoft and offered under a license
approved by the Open Source Initiative (www.opensource.org).
“Submit” is the act of uploading, submitting, transmitting, or distributing code or other content to any
Project, including but not limited to communication on electronic mailing lists, source code control
systems, and issue tracking systems that are managed by, or on behalf of, the Project for the purpose of
discussing and improving that Project, but excluding communication that is conspicuously marked or
otherwise designated in writing by You as “Not a Submission.”
“Submission” means the Code and any other copyrightable material Submitted by You, including any
associated comments and documentation.
2. Your Submission. You must agree to the terms of this Agreement before making a Submission to any
Project. This Agreement covers any and all Submissions that You, now or in the future (except as
described in Section 4 below), Submit to any Project.
3. Originality of Work. You represent that each of Your Submissions is entirely Your original work.
Should You wish to Submit materials that are not Your original work, You may Submit them separately
to the Project if You (a) retain all copyright and license information that was in the materials as You
received them, (b) in the description accompanying Your Submission, include the phrase “Submission
containing materials of a third party:” followed by the names of the third party and any licenses or other
restrictions of which You are aware, and (c) follow any other instructions in the Project’s written
guidelines concerning Submissions.
4. Your Employer. References to “employer” in this Agreement include Your employer or anyone else
for whom You are acting in making Your Submission, e.g. as a contractor, vendor, or agent. If Your
Submission is made in the course of Your work for an employer or Your employer has intellectual
property rights in Your Submission by contract or applicable law, You must secure permission from Your
employer to make the Submission before signing this Agreement. In that case, the term “You” in this
Agreement will refer to You and the employer collectively. If You change employers in the future and
desire to Submit additional Submissions for the new employer, then You agree to sign a new Agreement
and secure permission from the new employer before Submitting those Submissions.
5. Licenses.

a. Copyright License. You grant Microsoft, and those who receive the Submission directly or
indirectly from Microsoft, a perpetual, worldwide, non-exclusive, royalty-free, irrevocable license in the
Submission to reproduce, prepare derivative works of, publicly display, publicly perform, and distribute
the Submission and such derivative works, and to sublicense any or all of the foregoing rights to third
parties.
b. Patent License. You grant Microsoft, and those who receive the Submission directly or
indirectly from Microsoft, a perpetual, worldwide, non-exclusive, royalty-free, irrevocable license under
Your patent claims that are necessarily infringed by the Submission or the combination of the
Submission with the Project to which it was Submitted to make, have made, use, offer to sell, sell and
import or otherwise dispose of the Submission alone or with the Project.
c. Other Rights Reserved. Each party reserves all rights not expressly granted in this Agreement.
No additional licenses or rights whatsoever (including, without limitation, any implied licenses) are
granted by implication, exhaustion, estoppel or otherwise.

6. Representations and Warranties. You represent that You are legally entitled to grant the above
licenses. You represent that each of Your Submissions is entirely Your original work (except as You may
have disclosed under Section 3). You represent that You have secured permission from Your employer to
make the Submission in cases where Your Submission is made in the course of Your work for Your
employer or Your employer has intellectual property rights in Your Submission by contract or applicable
law. If You are signing this Agreement on behalf of Your employer, You represent and warrant that You
have the necessary authority to bind the listed employer to the obligations contained in this Agreement.
You are not expected to provide support for Your Submission, unless You choose to do so. UNLESS
REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING, AND EXCEPT FOR THE WARRANTIES
EXPRESSLY STATED IN SECTIONS 3, 4, AND 6, THE SUBMISSION PROVIDED UNDER THIS AGREEMENT IS
PROVIDED WITHOUT WARRANTY OF ANY KIND, INCLUDING, BUT NOT LIMITED TO, ANY WARRANTY OF
NONINFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE.
7. Notice to Microsoft. You agree to notify Microsoft in writing of any facts or circumstances of which
You later become aware that would make Your representations in this Agreement inaccurate in any
respect.
8. Information about Submissions. You agree that contributions to Projects and information about
contributions may be maintained indefinitely and disclosed publicly, including Your name and other
information that You submit with Your Submission.
9. Governing Law/Jurisdiction. This Agreement is governed by the laws of the State of Washington, and
the parties consent to exclusive jurisdiction and venue in the federal courts sitting in King County,
Washington, unless no federal subject matter jurisdiction exists, in which case the parties consent to
exclusive jurisdiction and venue in the Superior Court of King County, Washington. The parties waive all
defenses of lack of personal jurisdiction and forum non-conveniens.
10. Entire Agreement/Assignment. This Agreement is the entire agreement between the parties, and
supersedes any and all prior agreements, understandings or communications, written or oral, between
the parties relating to the subject matter hereof. This Agreement may be assigned by Microsoft.

0xmohit and others added 30 commits March 8, 2025 19:32

fix typo in well-known path list (microsoft#1109)

2405f20

Switch from puremagic to magika. (microsoft#1108)

8e73a32

Minimize guesses when guesses are compatible. (microsoft#1114)

8f8e58c

* Minimize guesses when guesses are compatible.

Enhance type guessing.

2e51ba2

Added mimetypes to _rss_converter

2a2ccc8

Added CLI options for extension, mimetypes, and charset. (microsoft#1115)

af1be36

fix: correct f-string formatting in FileConversionException (microsoft#1121)

75140a9

Refactored tests. (microsoft#1120)

5f75e16

* Refactored tests.
* Fixed CI errors, and included misc tests.
* Omit mskanji from streaminfo test.
* Omit mskanji from no hints test.
* Log results of debugging in comments (linked to Magika issue)
* Added docs as to when to use misc tests.

Handle not supported plot type in pptx (microsoft#1122)

12620f1

* Handle not supported plot type in pptx
* Fixed formatting.

Bumping version to 0.1.0a2 (microsoft#1123)

0b815fb

Updated Magika dependency.

6a9f09b

Small fixes for autogen integration. (microsoft#1124)

09df7fe

Added epub test file. (microsoft#1130)

a78857b

Fix remaining mypy errors. (microsoft#1132)

5c565b7

Investigate and silence warnings. (microsoft#1133)

53834fd

Have magika read from the stream. (microsoft#1136)

c5f70b9

EPub Support. Adapted microsoft#123 to not use epublib. (microsoft#1131)

a93e056

* Adapted microsoft#123 to not use epublib.
* Updated README.md

Consider anything with a charset as plain text-convertible. (microsoft#1142)

716f74d

Adjust warning filters and update dependencies (microsoft#1143)

cd6aa41

Adjusts warning filters to be more contextual
Updates dependencies for magika and youtube-transcript-api
Updates the version to 0.1.0a5 in __about__.py

Updated docx file to include an image. (microsoft#1146)

c0a511e

Bump version and resolve a console encoding error. (microsoft#1149)

efc55b2

Bump version. (microsoft#1150)

2ffe6ea

convert_url renamed to convert_uri, and now handles data and file URIs (microsoft#1153)

e928b43
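
To make the "data URIs" part of this commit title concrete, here is a minimal standard-library sketch of decoding a `data:` URI into a media type and raw bytes. It is not MarkItDown's code: the helper name `decode_data_uri` is hypothetical, and the parsing follows the general `data:[<mediatype>][;base64],<data>` shape rather than anything confirmed by this pull request.

```python
import base64
from urllib.parse import unquote_to_bytes


def decode_data_uri(uri: str) -> tuple[str, bytes]:
    """Split a data URI into (media type, decoded payload bytes).

    Hypothetical helper for illustration only.
    """
    if not uri.startswith("data:"):
        raise ValueError("not a data URI")

    # Everything before the first comma is the header; the rest is the payload.
    header, sep, payload = uri[len("data:"):].partition(",")
    if not sep:
        raise ValueError("data URI has no comma separator")

    params = header.split(";")
    # An empty media type falls back to text/plain.
    mimetype = params[0] or "text/plain"

    if "base64" in params[1:]:
        return mimetype, base64.b64decode(payload)
    # Without the base64 marker, the payload is percent-encoded.
    return mimetype, unquote_to_bytes(payload)
```

For example, `decode_data_uri("data:text/plain;base64,SGVsbG8=")` yields `("text/plain", b"Hello")`.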

Bump version. (microsoft#1154)

c1f9a32

Basic SSE MCP Server for MarkItDown (microsoft#1155)

3ca5798

* Added an initial minimal MCP server for MarkItDown
* Added STDIO default option.
* Added a Dockerfile, and updated the README accordingly. Also added instructions for Claude Desktop
* Pin mcp version.

Update badges (microsoft#1157)

73b9d57

* Update badges in subpackages.

Update readme to point to the mcp package. (microsoft#1158)

9a95105

* Updated readme with link to the MCP package.

Make it easier to use AzureKeyCredentials with Azure Doc Intelligence (microsoft#1151)

9e067c4

* Make it easier to use AzureKeyCredentials with Azure Doc Intelligence
* Fixed mypy type error.
* Added more fine-grained options over types.
* Pass doc intel options further up the stack.

feat: render math equations in .docx documents (microsoft#1160)

3fcd48c

* feat: math equation rendering in .docx files
* fix: import fix on .docx pre processing
* test: add test cases for docx equation rendering
* docs: add ThirdPartyNotices.md
* refactor: reformatted with black

JonahDelman and others added 27 commits August 26, 2025 14:23

Fixed documentation typos in _base_converter.py (microsoft#1393)

1178c2e

Ensure safe ExifTool usage: require >= 12.24 (microsoft#1399)

fb1ad24

* feat: add version verification for ExifTool to ensure security compliance
* fix: improve ExifTool version verification
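
The ">= 12.24" requirement in this commit can be sketched as a simple version comparison. This is not the code from the commit: the names `parse_version`, `MIN_EXIFTOOL`, and `is_safe_exiftool` are hypothetical. The sketch assumes the version string has the dotted numeric form that `exiftool -ver` prints, such as `12.24`, and that the minor part is always written with two digits.

```python
# Minimum version named in the commit title.
MIN_EXIFTOOL = (12, 24)


def parse_version(text: str) -> tuple[int, ...]:
    """Turn a dotted version string like '12.24' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def is_safe_exiftool(version_output: str) -> bool:
    """Return True only if the reported version is at least MIN_EXIFTOOL.

    Unparseable output is treated as unsafe rather than raising.
    """
    try:
        return parse_version(version_output) >= MIN_EXIFTOOL
    except ValueError:
        return False
```

Treating unparseable output as unsafe is a deliberate choice here: when the version cannot be established, refusing to run the tool is the more cautious default.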

Bump actions/checkout from 4 to 5 (microsoft#1394)

b6e5da8

Bumps [actions/checkout](https://github.com/actions/checkout) from 4 to 5.

Add HTML support to DocumentIntelligenceConverter (microsoft#1352)

ea1a3df

fix: correctly pass custom llm prompt parameter (microsoft#1319)

b81a387

* fix: correctly pass custom llm prompt parameter

Update README.md (microsoft#1335)

16ca285

Fix typo in README.md

Update README.md (microsoft#1350)

f8b60b5

ISSUE microsoft#1339

Update README.md (microsoft#1191)

0c4d394

Fix: Subtle spelling mistake fixed.

Adding support for data-src Attribute (microsoft#1226)

c3f6cb3

* supportfordata-src

docs: correct minor typos (microsoft#1173)

459d462

fix docx parse error(\n in alt) (microsoft#1163)

59eb60f

Handle PPTX shapes where position is None (microsoft#1161)

1736565

* Handle shapes where position is None
* Fixed recursion error, and place no-coord shapes at front

feat: add checkbox support to Markdown converter (microsoft#1208)

8a9d8f1

This change introduces functionality to convert HTML checkbox input elements (<input type=checkbox>) into Markdown checkbox syntax ([ ] or [x]).

Co-authored-by: Meirna Kamal <meirna.kamal@vodafone.com>
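
The mapping described above can be illustrated with a small standard-library sketch. It is not the converter from this commit: the class `CheckboxToMarkdown` and the function `checkboxes_to_markdown` are hypothetical names, and the sketch passes through text only, ignoring all other HTML elements.

```python
from html.parser import HTMLParser


class CheckboxToMarkdown(HTMLParser):
    """Collect text, rendering <input type=checkbox> as "[ ] " or "[x] "."""

    def __init__(self) -> None:
        super().__init__()
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "input" and (attr_map.get("type") or "").lower() == "checkbox":
            # A bare `checked` attribute is reported with value None,
            # so test for the key rather than its value.
            self.parts.append("[x] " if "checked" in attr_map else "[ ] ")

    def handle_data(self, data):
        self.parts.append(data)


def checkboxes_to_markdown(html: str) -> str:
    """Return the text of `html` with checkbox inputs as Markdown markers."""
    parser = CheckboxToMarkdown()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)
```

For example, `checkboxes_to_markdown('<input type=checkbox checked>done')` returns `'[x] done'`.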

Test if mammoth resolves rlinks. (microsoft#1451)

447c047

Upgrade mammoth to 1.11.0 (microsoft#1452)

3d4fe3c

Bump versions of mammoth and pdfminer.six (microsoft#1492)

dde250a

* Updated pyproject to require a minimum version of pdfminer.six to ensure CVE-2025-64512 is patched.

Add text/markdown to Accept header (microsoft#1554)

2b6ec9f

Remove onnxruntime<=1.20.1 Windows pin (microsoft#1551)

6b0fd15

Bump version for release. (microsoft#1564)

4a5340f

Updated warning about binding to non-local interfaces. (microsoft#1653)

63cbbd9

Update tests.yml

fd64bd3

Update tests.yml

731b655

joseguizar95-art changed the base branch from main to zip_formats April 12, 2026 01:05

joseguizar95-art wants to merge 75 commits into microsoft:zip_formats from joseguizar95-art:patch-1

20 participants