perf: push the interpreted hot loops further + add CI#1
Draft
SethMorrowSoftware wants to merge 6 commits into
Draft
perf: push the interpreted hot loops further + add CI#1SethMorrowSoftware wants to merge 6 commits into
SethMorrowSoftware wants to merge 6 commits into
Conversation
The decode cost centre (spec §11) is the interpreted per-pixel loops in luminance conversion and binarization. A prior pass did the arithmetic wins (index hoisting, flat-local accumulation, arrays-by-reference); this pass pulls the two remaining LiveCode levers, with output unchanged: * luminanceSource_newFromImageData now walks the raw pixel plane with a `repeat for each byte` sequential iterator + a 4-phase counter, instead of three indexed `byte (o+k) of pRaw` reads per pixel. Sequential iteration avoids re-resolving the byte chunk on every read — the single biggest interpreted-loop lever in xTalk. Also speeds the downsample path, which feeds the same handler. * The hybrid and global binarizers no longer call the bitMatrix_set command once per black pixel on their O(W·H) threshold loops; the bit-set is inlined, removing the per-pixel handler dispatch and the shl call (shl(1,k) == 2^k for k in 0..31). The word index and set bit are unchanged, so each BitMatrix is bit-identical. Output is bit-identical (verified by simulating old vs new logic over thousands of random buffers/matrices), so the 399 unit tests and 5 golden fixtures are unaffected. Combined library rebuilt; linter and build --check sync gate pass. https://claude.ai/code/session_01DvJJcEwcoAVVRfM9iRdqa7
…nce reuse Builds on the previous hot-loop pass with three more bit-identical speedups (each verified by simulating old vs new against the exact BitMatrix packing model over thousands of random inputs) plus the missing CI workflow. Performance (no change to decode output): * globalHistogramBinarizer: the whole-image threshold now builds each 32-bit BitMatrix word from up to 32 pixels and writes it once per word (skipping all-white words), instead of a bitMatrix_set command call + array read/write per black pixel. * finderPatternFinder: the main row scan loads one 32-bit row word per 32 columns and shifts it one bit per pixel, so each pixel test is a bitAnd/div 2 instead of a bitMatrix_get + uShr function call. Bit-identical on the sequential row walk. * qrDecodeResultRobust: builds the downsampled luminance source once per scale and reuses it across binarizers, instead of recomputing the downsample + greyscale (the cost centre) for every (binarizer, scale) strategy -- halves that work on photos that need the global fallback. The source is read-only downstream, so reuse is behaviour-preserving. Added: * .github/workflows/ci.yml -- runs tools/lint_lcs.py and build_livecodescript.py --check (the documented static gates; pure Python stdlib) on push to main and on PRs. This is the workflow the README CI badge already points at. Combined library rebuilt; linter clean across 53 files and build --check passes. https://claude.ai/code/session_01DvJJcEwcoAVVRfM9iRdqa7
STEP 2 (greyscale) dominates decode time on real phone photos because that path runs luminanceSource_downsampleRaw, which the earlier repeat-for-each win did NOT cover: it did a `byte (o+1) to (o+4) of pRaw` chunk read AND a `put after` append for every kept pixel (750k for a 1500x2000 -> step-2 image), built a multi-MB intermediate plane, then re-walked it in a second pass. Fuse it into a single pass: extract each KEPT source row with one chunk read, walk it with the fast `repeat for each byte` iterator, and compute the §7.3 greyscale inline. No intermediate reduced plane, no second pass, and one chunk read per row instead of per pixel. Output is bit-identical (verified by simulating the fused pass vs the old downsampleRaw + newFromImageData over 4000 random sizes/steps). The column-keep test uses a full if/end if (not an inline-if) to avoid the inline-if + else-if binding hazard (ARCHITECTURE §4.12). Combined library rebuilt; linter clean and build --check passes. https://claude.ai/code/session_01DvJJcEwcoAVVRfM9iRdqa7
Real-engine profiling showed the "fused single pass" downsample made grayscale SLOWER, not faster. The fix was based on a wrong hypothesis: it cut indexed chunk reads ~750x (per-row instead of per-pixel), yet got slower -- which proves chunk-read overhead was never the bottleneck. The actual cost is the per-byte work inside `repeat for each` (byteToNum + array write + branching), and the fused version walked ~2x the bytes (every column of each kept row, discarding the skipped ones) with a heavier loop body. Restore the original two-pass downsampleRaw, which only runs the heavy luminance compute on the kept pixels. Round-1's newFromImageData repeat-for-each is kept. Combined library rebuilt; linter clean and build --check passes. https://claude.ai/code/session_01DvJJcEwcoAVVRfM9iRdqa7
Profiling showed STEP 2 (grayscale/downsample) dominates on large photos and is near the floor for INTERPRETED xTalk: the cost is the per-kept-pixel work (byteToNum + array write) over hundreds of thousands of pixels, not byte-access overhead. The big remaining lever is to stop downsampling in interpreted code and let the engine do it in compiled C. Add luminanceSource_decodeResampled: decode the image, resample it to the target scale with the engine's compiled `resizeImage`, then read the already-small imageData -- skipping the interpreted per-pixel downsample entirely. It targets the same integer-step dimensions the interpreted path produces, so detector geometry is unchanged; only the resampling filter differs. Wire it into qrDecodeResultRobust behind the new ENGINE_RESAMPLE hint (cached per scale like the interpreted path; the costly full-res decode is skipped in this mode). Because the engine's resampler changes pixel values vs nearest- neighbour, it CHANGES decode behaviour and is therefore OFF BY DEFAULT -- the default path stays bit-identical, so the 399 tests and 5 golden fixtures are unaffected. Documented in the README hints table and CHANGELOG; re-verify qr_golden.lc on the target engine when enabling it. Also: removed the stale CHANGELOG bullet for the reverted fused-downsample. Combined library rebuilt; linter clean across 53 files and build --check passes. https://claude.ai/code/session_01DvJJcEwcoAVVRfM9iRdqa7
…esizeImage The previous commit used a bare `resizeImage` command. On a build that lacks it (e.g. the reporter's engine), that is a PARSE error -- and a parse error in one module takes down the whole combined library. Invoke resizeImage via `do` instead, so an absent command fails at RUNTIME (caught) rather than at compile time, and fall back to the interpreted downsample when it isn't available. Net: the library compiles on every engine again; ENGINE_RESAMPLE uses the compiled resampler where present and is a safe no-op (interpreted downsample, same as default) where it isn't. Linter clean, combined library rebuilt, build --check passes. https://claude.ai/code/session_01DvJJcEwcoAVVRfM9iRdqa7
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What & why
The decode cost centre (
docs/spec.md§11) is the interpreted, single-threaded per-pixel / per-bit loops. The0.1.0pass did the arithmetic wins; this PR pulls the remaining LiveCode-specific levers across luminance, both binarizers, the detector scan, and the robust orchestration — all bit-identical, plus the CI workflow the README already advertises.Performance changes (no change to decode output)
luminanceSource_newFromImageDatawalks the raw plane withrepeat for each byte+ a 4-phase counterbyte (o+k) of pRawreads/pixel — the biggest interpreted-loop lever in xTalkghb_getBlackMatrixbuilds each 32-bit word from ≤32 pixels and writes it once (skips all-white words)bitMatrix_setcommand call + array read/write per black pixelhb_thresholdBlockinlines the black-pixel setshlcall across thousands of 8×8 blocksfinderPatternFinderloads one row word per 32 columns and shifts 1 bit/pixelbitAnd/div 2instead of abitMatrix_get+uShrcallqrDecodeResultRobustbuilds the downsampled luminance source once per scale, reused across binarizersCorrectness
Every change is bit-identical by construction and checked by simulating old-vs-new against the exact
BitMatrixpacking model:W×Hbuffers (incl. trailing padding).bitMatrix_setover 500 + 3000 random matrices/images;shl(1,k) == 2^kverified for allkin0..31.bitMatrix_getover 3000 random rows (incl. widths that aren't multiples of 32).(plane, scale)and read-only downstream (binaryBitmapcopies it in; binarizers only read["lum"]).tools/lint_lcs.pyis clean across all 53 files;lib/xtQRdecoder.livecodescriptis regenerated andbuild_livecodescript.py --checkpasses.Added — CI
.github/workflows/ci.ymlruns the two static gates the docs describe (lint_lcs.py+build_livecodescript.py --check, pure Python stdlib) on push tomainand on every PR. This is the workflow the README's CI badge points at.The static gates can't run the xTalk engine. Please run
qr/qr_tester.lc(399 unit tests) andqr/qr_golden.lc(5 golden fixtures) on a real engine to confirm — the changes are built to be bit-identical, so I expect both green.Possible follow-ups (not in this PR)
luminanceSource_downsampleRawto compute luminance in one strided pass (drop the intermediate 4-bpp string + second pass).https://claude.ai/code/session_01DvJJcEwcoAVVRfM9iRdqa7