Fix progress bar against non-UTF-8 encodings (#439)#742
Open
LeSingh1 wants to merge 1 commit into
Open
Conversation
For simonw#439. `file_progress()` sets the progress bar length to the file size in bytes (`os.path.getsize(file.name)`), but `UpdateWrapper.__iter__` called `update(len(line))`, which is the decoded character count. With UTF-16-LE input every character is 2 bytes, so the bar capped at 50%; UTF-32 capped at 25%; etc. Simon noted in the issue that the obvious fix (calling `.tell()` on the wrapped text stream) doesn't work because text mode disables it during iteration. The underlying binary buffer doesn't have that restriction though, so this tracks progress against `TextIOWrapper.buffer.tell()` when the wrapped object exposes one. For raw binary streams (no `.buffer` attribute) we keep the old behaviour, which was already byte-accurate. Added six regression tests in tests/test_utils.py covering UTF-8, UTF-16-LE, BOM-prefixed UTF-16, the sniff-style `BufferedReader` chain, a raw binary fallback, and the `.read()` path used by the JSON loader. Each asserts that the sum of update() calls equals the on-disk file size, which is what `click.progressbar` needs to reach 100%.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
For #439.
file_progress()sets the progress bar length to the file size in bytes viaos.path.getsize(file.name), butUpdateWrapper.__iter__calledupdate(len(line)), which is the decoded character count. For UTF-16-LE each character is 2 bytes, so the bar capped at 50%. UTF-32 capped at 25%. ASCII / UTF-8 happened to work because one char usually equals one byte.Repro is the one from the issue:
I tried the
.tell()approach you sketched in the thread. Same result you got:OSError: telling position disabled by next() call. Text mode disables.tell()mid-iteration because of internal decoding buffers. The underlying binary buffer doesn't have that restriction though, so this PR tracks progress againstTextIOWrapper.buffer.tell()when the wrapped object exposes a.bufferattribute. The binary buffer reads ahead in chunks, so the bar jumps a bit at each chunk boundary instead of advancing per line, but it does reach 100%, which is what matters for the regression.For raw binary streams (no
.bufferattribute) we keep the originallen(line)path, which was already byte-accurate.UpdateWrapper.read()had the same shape of bug; I routed it through the same buffer-position tracking.Six regression tests in
tests/test_utils.py:test_updatewrapper_utf8_reports_byte_lengths: baseline (ASCII / UTF-8 still reaches 100%)test_updatewrapper_utf16le_reports_byte_lengths: the Misleading progress bar against utf-16-le CSV input #439 casetest_updatewrapper_utf16le_with_bom_reaches_total_bytes: BOM-prefixed UTF-16 (the BOM byte is consumed before iteration; we should still account for it via the buffer position)test_updatewrapper_through_buffered_reader: the--sniffpath that wraps the raw file inio.BufferedReaderbefore theTextIOWrappertest_updatewrapper_binary_file_unchanged: no.bufferattribute -> fall back to the original behaviourtest_updatewrapper_read_path_utf16le:.read()path used by the JSON loaderEach asserts that the sum of update() calls equals the on-disk file size, which is what
click.progressbarneeds to reach 100%.Ran
pytest tests/test_cli.py tests/test_cli_bulk.py tests/test_cli_convert.py tests/test_utils.py: 249 passed, 3 skipped (Python 3.13.3, macOS). Also verified end-to-end by inserting a 66KB UTF-16-LE CSV with 2000 rows via the CLI; row count came through correctly and the bar reached 100% instead of stopping at 49%.📚 Documentation preview 📚: https://sqlite-utils--742.org.readthedocs.build/en/742/