Skip to content

Fix: Update ZipConverter docstring and optimize memory handling for large entries#2079

Open
KhudaBuxMagsi wants to merge 1 commit into
microsoft:mainfrom
KhudaBuxMagsi:main
Open

Fix: Update ZipConverter docstring and optimize memory handling for large entries#2079
KhudaBuxMagsi wants to merge 1 commit into
microsoft:mainfrom
KhudaBuxMagsi:main

Conversation

@KhudaBuxMagsi

Copy link
Copy Markdown

Docstring Correction: Updated the class docstring to accurately reflect that files are processed in-memory rather than extracted to a temporary directory on disk.

Memory Optimization: Safely structured the file reading process inside the zipObj loop using zipObj.open to ensure better stream handling, reducing the risk of unexpected memory usage spikes.

@KhudaBuxMagsi

Copy link
Copy Markdown
Author

@microsoft-github-policy-service agree

@KhudaBuxMagsi KhudaBuxMagsi changed the title Fix: Update ZipConverter docstring and optimize memory handling for l… Fix: Update ZipConverter docstring and optimize memory handling for large entries Jun 5, 2026
@kervin-carlos

Copy link
Copy Markdown

Looks good to me.

@noezhiya-dot

Copy link
Copy Markdown

Good improvement. Switching from extracting all files at once to streaming them one at a time is the right call for memory efficiency, especially with large ZIPs.

A few observations:

  1. The docstring update is accurate — processing in-memory without temp dirs is cleaner.

  2. The directory skip (name.endswith('/')) is a good addition that prevents attempting to convert directory entries.

  3. One concern: member_file.read() still reads the entire file into memory. For truly large files within a ZIP, you might want to consider a streaming approach or a size limit. That said, this is still better than the previous approach of reading all files upfront.

  4. The indentation change on the if result is not None block makes the nesting clearer.

LGTM, nice cleanup.

@noezhiya-dot

Copy link
Copy Markdown

Good memory optimization for the ZipConverter. Processing files in-memory via zipObj.open() instead of extracting to disk is cleaner and avoids OOM on large ZIPs.

A few observations:

  1. Memory: While member_file.read() still loads the full file into memory, this is indeed better than the old temp-dir approach since it avoids disk I/O and cleanup issues. For truly large internal files, a streaming approach would be even better, but that depends on whether downstream converters support streaming — probably a future improvement.

  2. Directory filtering: The name.endswith('/') check is a good addition. Previously, directories in the ZIP would have been passed to converters which could cause errors or empty output.

  3. Error handling: The existing try/except still wraps the per-file processing, so a single bad file won't kill the whole ZIP conversion. That's the right trade-off.

  4. Docstring: The updated docstring is more accurate now since temp dirs are no longer used.

Minor: there's a trailing whitespace on the empty line after the directory check (line with just spaces in the if name.endswith('/): block). Easy lint fix.

Otherwise, solid improvement. Approve.

@noezhiya-dot noezhiya-dot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review: Fix: Update ZipConverter docstring and optimize memory handling for large entries

I have concerns about this PR.

Issue: The memory optimization claim is misleading

The PR title and comment say "optimize memory handling" and "prevent high memory usage (OOM) on large files", but the code still calls member_file.read() which reads the entire file contents into memory. Then io.BytesIO(member_file.read()) wraps it in another buffer. This does NOT reduce memory usage compared to the original zipObj.read(name) — both approaches load the full file into memory.

The actual changes are:

  1. Docstring update — accurate, the old docstring mentioned temp directories which no longer applies.
  2. Directory skip — the if name.endswith('/'): continue addition is a good fix.
  3. Using zipObj.open() instead of zipObj.read() — functionally equivalent for memory since .read() is called immediately.

Suggestions

  1. Remove the misleading "prevent high memory usage (OOM)" comment. The directory skip is the real fix.
  2. If the goal is truly to handle large files, the converter would need to process chunks or use temp files.

The directory skip fix is good. The docstring update is fine. But the PR should not claim memory optimization that is not implemented.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants