Fix: Update ZipConverter docstring and optimize memory handling for large entries by KhudaBuxMagsi · Pull Request #2079 · microsoft/markitdown

KhudaBuxMagsi · 2026-06-05T20:46:34Z

Docstring Correction: Updated the class docstring to accurately reflect that files are processed in-memory rather than extracted to a temporary directory on disk.

Memory Optimization: Safely structured the file reading process inside the zipObj loop using zipObj.open to ensure better stream handling, reducing the risk of unexpected memory usage spikes.

KhudaBuxMagsi · 2026-06-05T20:48:20Z

@microsoft-github-policy-service agree

kervin-carlos · 2026-06-07T15:38:48Z

Looks good to me.

noezhiya-dot · 2026-06-08T23:07:20Z

Good improvement. Switching from extracting all files at once to streaming them one at a time is the right call for memory efficiency, especially with large ZIPs.

A few observations:

The docstring update is accurate — processing in-memory without temp dirs is cleaner.
The directory skip (name.endswith('/')) is a good addition that prevents attempting to convert directory entries.
One concern: member_file.read() still reads the entire file into memory. For truly large files within a ZIP, you might want to consider a streaming approach or a size limit. That said, this is still better than the previous approach of reading all files upfront.
The indentation change on the if result is not None block makes the nesting clearer.
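
The size-limit idea raised above can be sketched as follows. This is a minimal illustration, not the converter's actual code: `iter_small_members` and `MAX_MEMBER_BYTES` are hypothetical names, and the cap value is an arbitrary assumption. It relies only on the standard-library `zipfile` module, using `ZipInfo.is_dir()` and `ZipInfo.file_size` to skip directories and oversized entries before any data is read.

```python
import io
import zipfile
from typing import Iterator, Tuple

# Hypothetical cap chosen for illustration; not a value from the PR.
MAX_MEMBER_BYTES = 50 * 1024 * 1024


def iter_small_members(
    zip_bytes: bytes, max_bytes: int = MAX_MEMBER_BYTES
) -> Iterator[Tuple[str, bytes]]:
    """Yield (name, data) for regular entries no larger than max_bytes."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zip_obj:
        for info in zip_obj.infolist():
            # Directory entries have names ending in "/" and carry no content.
            if info.is_dir():
                continue
            # file_size is the uncompressed size declared in the archive.
            if info.file_size > max_bytes:
                continue
            with zip_obj.open(info) as member_file:
                yield info.filename, member_file.read()
```

Entries over the cap are skipped silently here; a real converter might prefer to log them or emit a placeholder in the output.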

LGTM, nice cleanup.

noezhiya-dot · 2026-06-09T08:23:26Z

Good memory optimization for the ZipConverter. Processing files in-memory via zipObj.open() instead of extracting to disk is cleaner and avoids OOM on large ZIPs.

A few observations:

Memory: While member_file.read() still loads the full file into memory, this is indeed better than the old temp-dir approach since it avoids disk I/O and cleanup issues. For truly large internal files, a streaming approach would be even better, but that depends on whether downstream converters support streaming — probably a future improvement.
Directory filtering: The name.endswith('/') check is a good addition. Previously, directories in the ZIP would have been passed to converters which could cause errors or empty output.
Error handling: The existing try/except still wraps the per-file processing, so a single bad file won't kill the whole ZIP conversion. That's the right trade-off.
Docstring: The updated docstring is more accurate now since temp dirs are no longer used.
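
The per-file error handling described above might look roughly like this. It is a sketch under assumptions: `convert_members` is a hypothetical helper, and `convert` stands in for whatever downstream conversion the real class performs. The point is only that the `try`/`except` sits inside the loop, so one failing entry is recorded and the rest still run.

```python
import io
import zipfile
from typing import Callable, Dict, List, Tuple


def convert_members(
    zip_bytes: bytes, convert: Callable[[str, bytes], str]
) -> Tuple[Dict[str, str], List[Tuple[str, str]]]:
    """Convert each entry independently; collect failures instead of raising."""
    results: Dict[str, str] = {}
    failures: List[Tuple[str, str]] = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zip_obj:
        for name in zip_obj.namelist():
            if name.endswith("/"):
                continue  # directory entry, nothing to convert
            try:
                with zip_obj.open(name) as member_file:
                    results[name] = convert(name, member_file.read())
            except Exception as exc:
                # One bad entry should not abort the whole archive.
                failures.append((name, repr(exc)))
    return results, failures
```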

Minor: there is trailing whitespace on the empty line after the directory check (the line containing only spaces in the if name.endswith('/'): block). Easy lint fix.

Otherwise, solid improvement. Approve.

noezhiya-dot

Code Review: Fix: Update ZipConverter docstring and optimize memory handling for large entries

I have concerns about this PR.

Issue: The memory optimization claim is misleading

The PR title and comment say "optimize memory handling" and "prevent high memory usage (OOM) on large files", but the code still calls member_file.read() which reads the entire file contents into memory. Then io.BytesIO(member_file.read()) wraps it in another buffer. This does NOT reduce memory usage compared to the original zipObj.read(name) — both approaches load the full file into memory.

The actual changes are:

Docstring update — accurate, the old docstring mentioned temp directories which no longer applies.
Directory skip — the if name.endswith('/'): continue addition is a good fix.
Using zipObj.open() instead of zipObj.read() — functionally equivalent for memory since .read() is called immediately.
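
To make the equivalence concrete, here is a small self-contained comparison. The helper names (`load_with_read`, `load_with_open`, `build_sample_zip`) are made up for this illustration. Both paths return one `bytes` object holding the whole uncompressed entry, so peak memory for a single entry is roughly the same either way.

```python
import io
import zipfile


def load_with_read(zip_obj: zipfile.ZipFile, name: str) -> bytes:
    # Reads the full uncompressed entry in one call.
    return zip_obj.read(name)


def load_with_open(zip_obj: zipfile.ZipFile, name: str) -> bytes:
    # Opens a stream, then immediately drains it: same end result.
    with zip_obj.open(name) as member_file:
        return member_file.read()


def build_sample_zip() -> bytes:
    # Small in-memory archive used only for this demonstration.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("report.txt", b"a" * 4096)
    return buf.getvalue()
```

Wrapping either result in `io.BytesIO(...)` adds a further copy of the data, which is why the `open()` variant offers no memory benefit unless the stream is consumed incrementally.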

Suggestions

Remove the misleading "prevent high memory usage (OOM)" comment. The directory skip is the real fix.
If the goal is truly to handle large files, the converter would need to process chunks or use temp files.
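
One way the chunked, temp-file-backed approach could look is sketched below. It assumes the downstream converter can accept a seekable binary stream, which the PR does not establish. `spool_member` and its default sizes are hypothetical. It uses `tempfile.SpooledTemporaryFile`, which keeps data in memory up to `max_size` and spills to disk beyond that, and `shutil.copyfileobj` to copy in fixed-size chunks.

```python
import io
import shutil
import tempfile
import zipfile
from typing import IO


def spool_member(
    zip_obj: zipfile.ZipFile,
    name: str,
    max_in_memory: int = 1024 * 1024,  # hypothetical threshold
    chunk_size: int = 64 * 1024,       # hypothetical chunk size
) -> IO[bytes]:
    """Copy one entry into a spooled temp file without one large read()."""
    spooled = tempfile.SpooledTemporaryFile(max_size=max_in_memory)
    with zip_obj.open(name) as member_file:
        # Only chunk_size bytes are read from the archive at a time.
        shutil.copyfileobj(member_file, spooled, chunk_size)
    spooled.seek(0)  # rewind so the caller reads from the start
    return spooled
```

The caller is responsible for closing the returned file, for example with a `with` block, which also removes any on-disk backing file.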

The directory skip fix is good. The docstring update is fine. But the PR should not claim memory optimization that is not implemented.

Fix: Update ZipConverter docstring and optimize memory handling for l…

076e8e2

…arge entries

KhudaBuxMagsi changed the title ~~Fix: Update ZipConverter docstring and optimize memory handling for l…~~ Fix: Update ZipConverter docstring and optimize memory handling for large entries Jun 5, 2026

noezhiya-dot suggested changes Jun 9, 2026

KhudaBuxMagsi wants to merge 1 commit into microsoft:main from KhudaBuxMagsi:main

3 participants