Fix progress bar against non-UTF-8 encodings (#439) by LeSingh1 · Pull Request #742 · simonw/sqlite-utils

LeSingh1 · 2026-05-18T23:30:16Z

file_progress() sets the progress bar length to the file size in bytes via os.path.getsize(file.name), but UpdateWrapper.__iter__ called update(len(line)), which is the decoded character count. For UTF-16-LE each character is 2 bytes, so the bar capped at 50%. UTF-32 capped at 25%. ASCII / UTF-8 happened to work because one char usually equals one byte.

Repro is the one from the issue:

sqlite-utils insert --csv --delimiter ";" --encoding "utf-16-le" test.db t csv
  [#################-------------------]   49%  00:00:01

I tried the .tell() approach you sketched in the thread. Same result you got: OSError: telling position disabled by next() call. Text mode disables .tell() mid-iteration because of internal decoding buffers. The underlying binary buffer doesn't have that restriction though, so this PR tracks progress against TextIOWrapper.buffer.tell() when the wrapped object exposes a .buffer attribute. The binary buffer reads ahead in chunks, so the bar jumps a bit at each chunk boundary instead of advancing per line, but it does reach 100%, which is what matters for the regression.

For raw binary streams (no .buffer attribute) we keep the original len(line) path, which was already byte-accurate.

UpdateWrapper.read() had the same shape of bug; I routed it through the same buffer-position tracking.

Six regression tests in tests/test_utils.py:

test_updatewrapper_utf8_reports_byte_lengths: baseline (ASCII / UTF-8 still reaches 100%)
test_updatewrapper_utf16le_reports_byte_lengths: the Misleading progress bar against utf-16-le CSV input #439 case
test_updatewrapper_utf16le_with_bom_reaches_total_bytes: BOM-prefixed UTF-16 (the BOM byte is consumed before iteration; we should still account for it via the buffer position)
test_updatewrapper_through_buffered_reader: the --sniff path that wraps the raw file in io.BufferedReader before the TextIOWrapper
test_updatewrapper_binary_file_unchanged: no .buffer attribute -> fall back to the original behaviour
test_updatewrapper_read_path_utf16le: .read() path used by the JSON loader

Each asserts that the sum of update() calls equals the on-disk file size, which is what click.progressbar needs to reach 100%.

Ran pytest tests/test_cli.py tests/test_cli_bulk.py tests/test_cli_convert.py tests/test_utils.py: 249 passed, 3 skipped (Python 3.13.3, macOS). Also verified end-to-end by inserting a 66KB UTF-16-LE CSV with 2000 rows via the CLI; row count came through correctly and the bar reached 100% instead of stopping at 49%.

📚 Documentation preview 📚: https://sqlite-utils--742.org.readthedocs.build/en/742/

For simonw#439. `file_progress()` sets the progress bar length to the file size in bytes (`os.path.getsize(file.name)`), but `UpdateWrapper.__iter__` called `update(len(line))`, which is the decoded character count. With UTF-16-LE input every character is 2 bytes, so the bar capped at 50%; UTF-32 capped at 25%; etc. Simon noted in the issue that the obvious fix (calling `.tell()` on the wrapped text stream) doesn't work because text mode disables it during iteration. The underlying binary buffer doesn't have that restriction though, so this tracks progress against `TextIOWrapper.buffer.tell()` when the wrapped object exposes one. For raw binary streams (no `.buffer` attribute) we keep the old behaviour, which was already byte-accurate. Added six regression tests in tests/test_utils.py covering UTF-8, UTF-16-LE, BOM-prefixed UTF-16, the sniff-style `BufferedReader` chain, a raw binary fallback, and the `.read()` path used by the JSON loader. Each asserts that the sum of update() calls equals the on-disk file size, which is what `click.progressbar` needs to reach 100%.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix progress bar against non-UTF-8 encodings (#439)#742

Fix progress bar against non-UTF-8 encodings (#439)#742
LeSingh1 wants to merge 1 commit into
simonw:mainfrom
LeSingh1:fix-progress-bar-non-utf8-encodings

LeSingh1 commented May 18, 2026 •

edited by github-actions Bot

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

LeSingh1 commented May 18, 2026 • edited by github-actions Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

LeSingh1 commented May 18, 2026 •

edited by github-actions Bot

Loading