perf: push the interpreted hot loops further + add CI by SethMorrowSoftware · Pull Request #1 · SethMorrowSoftware/xtQRdecoder

SethMorrowSoftware · 2026-06-13T22:01:38Z

What & why

The decode cost centre (docs/spec.md §11) is the interpreted, single-threaded per-pixel / per-bit loops. The 0.1.0 pass did the arithmetic wins; this PR pulls the remaining LiveCode-specific levers across luminance, both binarizers, the detector scan, and the robust orchestration — all bit-identical, plus the CI workflow the README already advertises.

Performance changes (no change to decode output)

Area	Change	Lever
Luminance	`luminanceSource_newFromImageData` walks the raw plane with `repeat for each byte` + a 4-phase counter	Sequential iterator instead of 3 indexed `byte (o+k) of pRaw` reads/pixel — the biggest interpreted-loop lever in xTalk
Global binarizer	`ghb_getBlackMatrix` builds each 32-bit word from ≤32 pixels and writes it once (skips all-white words)	Per-word array write instead of a `bitMatrix_set` command call + array read/write per black pixel
Hybrid binarizer	`hb_thresholdBlock` inlines the black-pixel set	Drops the per-pixel command dispatch + `shl` call across thousands of 8×8 blocks
Detector scan	`finderPatternFinder` loads one row word per 32 columns and shifts 1 bit/pixel	Each pixel test is `bitAnd`/`div 2` instead of a `bitMatrix_get` + `uShr` call
Robust path	`qrDecodeResultRobust` builds the downsampled luminance source once per scale, reused across binarizers	Halves the downsample+greyscale (cost centre) when the global fallback runs

Correctness

Every change is bit-identical by construction and checked by simulating old-vs-new against the exact BitMatrix packing model:

Luminance regrouping — identical over 2000 random W×H buffers (incl. trailing padding).
Inlined / word-batched bit-set — identical to bitMatrix_set over 500 + 3000 random matrices/images; shl(1,k) == 2^k verified for all k in 0..31.
Finder row-scan word-cache — identical to bitMatrix_get over 3000 random rows (incl. widths that aren't multiples of 32).
Robust-path reuse — the luminance source is deterministic per (plane, scale) and read-only downstream (binaryBitmap copies it in; binarizers only read ["lum"]).

tools/lint_lcs.py is clean across all 53 files; lib/xtQRdecoder.livecodescript is regenerated and build_livecodescript.py --check passes.

Added — CI

.github/workflows/ci.yml runs the two static gates the docs describe (lint_lcs.py + build_livecodescript.py --check, pure Python stdlib) on push to main and on every PR. This is the workflow the README's CI badge points at.

⚠️ Needs a real-engine run before merge

The static gates can't run the xTalk engine. Please run qr/qr_tester.lc (399 unit tests) and qr/qr_golden.lc (5 golden fixtures) on a real engine to confirm — the changes are built to be bit-identical, so I expect both green.

Possible follow-ups (not in this PR)

Fuse luminanceSource_downsampleRaw to compute luminance in one strided pass (drop the intermediate 4-bpp string + second pass).
Store the luminance plane as a binary string to make the binarizer's per-pixel reads sequential too (larger, more invasive change touching the hybrid block access).

https://claude.ai/code/session_01DvJJcEwcoAVVRfM9iRdqa7

The decode cost centre (spec §11) is the interpreted per-pixel loops in luminance conversion and binarization. A prior pass did the arithmetic wins (index hoisting, flat-local accumulation, arrays-by-reference); this pass pulls the two remaining LiveCode levers, with output unchanged: * luminanceSource_newFromImageData now walks the raw pixel plane with a `repeat for each byte` sequential iterator + a 4-phase counter, instead of three indexed `byte (o+k) of pRaw` reads per pixel. Sequential iteration avoids re-resolving the byte chunk on every read — the single biggest interpreted-loop lever in xTalk. Also speeds the downsample path, which feeds the same handler. * The hybrid and global binarizers no longer call the bitMatrix_set command once per black pixel on their O(W·H) threshold loops; the bit-set is inlined, removing the per-pixel handler dispatch and the shl call (shl(1,k) == 2^k for k in 0..31). The word index and set bit are unchanged, so each BitMatrix is bit-identical. Output is bit-identical (verified by simulating old vs new logic over thousands of random buffers/matrices), so the 399 unit tests and 5 golden fixtures are unaffected. Combined library rebuilt; linter and build --check sync gate pass. https://claude.ai/code/session_01DvJJcEwcoAVVRfM9iRdqa7

…nce reuse Builds on the previous hot-loop pass with three more bit-identical speedups (each verified by simulating old vs new against the exact BitMatrix packing model over thousands of random inputs) plus the missing CI workflow. Performance (no change to decode output): * globalHistogramBinarizer: the whole-image threshold now builds each 32-bit BitMatrix word from up to 32 pixels and writes it once per word (skipping all-white words), instead of a bitMatrix_set command call + array read/write per black pixel. * finderPatternFinder: the main row scan loads one 32-bit row word per 32 columns and shifts it one bit per pixel, so each pixel test is a bitAnd/div 2 instead of a bitMatrix_get + uShr function call. Bit-identical on the sequential row walk. * qrDecodeResultRobust: builds the downsampled luminance source once per scale and reuses it across binarizers, instead of recomputing the downsample + greyscale (the cost centre) for every (binarizer, scale) strategy -- halves that work on photos that need the global fallback. The source is read-only downstream, so reuse is behaviour-preserving. Added: * .github/workflows/ci.yml -- runs tools/lint_lcs.py and build_livecodescript.py --check (the documented static gates; pure Python stdlib) on push to main and on PRs. This is the workflow the README CI badge already points at. Combined library rebuilt; linter clean across 53 files and build --check passes. https://claude.ai/code/session_01DvJJcEwcoAVVRfM9iRdqa7

STEP 2 (greyscale) dominates decode time on real phone photos because that path runs luminanceSource_downsampleRaw, which the earlier repeat-for-each win did NOT cover: it did a `byte (o+1) to (o+4) of pRaw` chunk read AND a `put after` append for every kept pixel (750k for a 1500x2000 -> step-2 image), built a multi-MB intermediate plane, then re-walked it in a second pass. Fuse it into a single pass: extract each KEPT source row with one chunk read, walk it with the fast `repeat for each byte` iterator, and compute the §7.3 greyscale inline. No intermediate reduced plane, no second pass, and one chunk read per row instead of per pixel. Output is bit-identical (verified by simulating the fused pass vs the old downsampleRaw + newFromImageData over 4000 random sizes/steps). The column-keep test uses a full if/end if (not an inline-if) to avoid the inline-if + else-if binding hazard (ARCHITECTURE §4.12). Combined library rebuilt; linter clean and build --check passes. https://claude.ai/code/session_01DvJJcEwcoAVVRfM9iRdqa7

Real-engine profiling showed the "fused single pass" downsample made grayscale SLOWER, not faster. The fix was based on a wrong hypothesis: it cut indexed chunk reads ~750x (per-row instead of per-pixel), yet got slower -- which proves chunk-read overhead was never the bottleneck. The actual cost is the per-byte work inside `repeat for each` (byteToNum + array write + branching), and the fused version walked ~2x the bytes (every column of each kept row, discarding the skipped ones) with a heavier loop body. Restore the original two-pass downsampleRaw, which only runs the heavy luminance compute on the kept pixels. Round-1's newFromImageData repeat-for-each is kept. Combined library rebuilt; linter clean and build --check passes. https://claude.ai/code/session_01DvJJcEwcoAVVRfM9iRdqa7

Profiling showed STEP 2 (grayscale/downsample) dominates on large photos and is near the floor for INTERPRETED xTalk: the cost is the per-kept-pixel work (byteToNum + array write) over hundreds of thousands of pixels, not byte-access overhead. The big remaining lever is to stop downsampling in interpreted code and let the engine do it in compiled C. Add luminanceSource_decodeResampled: decode the image, resample it to the target scale with the engine's compiled `resizeImage`, then read the already-small imageData -- skipping the interpreted per-pixel downsample entirely. It targets the same integer-step dimensions the interpreted path produces, so detector geometry is unchanged; only the resampling filter differs. Wire it into qrDecodeResultRobust behind the new ENGINE_RESAMPLE hint (cached per scale like the interpreted path; the costly full-res decode is skipped in this mode). Because the engine's resampler changes pixel values vs nearest- neighbour, it CHANGES decode behaviour and is therefore OFF BY DEFAULT -- the default path stays bit-identical, so the 399 tests and 5 golden fixtures are unaffected. Documented in the README hints table and CHANGELOG; re-verify qr_golden.lc on the target engine when enabling it. Also: removed the stale CHANGELOG bullet for the reverted fused-downsample. Combined library rebuilt; linter clean across 53 files and build --check passes. https://claude.ai/code/session_01DvJJcEwcoAVVRfM9iRdqa7

…esizeImage The previous commit used a bare `resizeImage` command. On a build that lacks it (e.g. the reporter's engine), that is a PARSE error -- and a parse error in one module takes down the whole combined library. Invoke resizeImage via `do` instead, so an absent command fails at RUNTIME (caught) rather than at compile time, and fall back to the interpreted downsample when it isn't available. Net: the library compiles on every engine again; ENGINE_RESAMPLE uses the compiled resampler where present and is a safe no-op (interpreted downsample, same as default) where it isn't. Linter clean, combined library rebuilt, build --check passes. https://claude.ai/code/session_01DvJJcEwcoAVVRfM9iRdqa7

claude added 2 commits June 13, 2026 22:01

SethMorrowSoftware changed the title ~~perf: speed up the interpreted pixel hot loops (bit-identical output)~~ perf: push the interpreted hot loops further + add CI Jun 13, 2026

claude added 4 commits June 13, 2026 22:36

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: push the interpreted hot loops further + add CI#1

perf: push the interpreted hot loops further + add CI#1
SethMorrowSoftware wants to merge 6 commits into
mainfrom
claude/elegant-gates-8etn3x

SethMorrowSoftware commented Jun 13, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

SethMorrowSoftware commented Jun 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What & why

Performance changes (no change to decode output)

Correctness

Added — CI

⚠️ Needs a real-engine run before merge

Possible follow-ups (not in this PR)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

SethMorrowSoftware commented Jun 13, 2026 •

edited

Loading