Skip to content

perf: push the interpreted hot loops further + add CI#1

Draft
SethMorrowSoftware wants to merge 6 commits into
mainfrom
claude/elegant-gates-8etn3x
Draft

perf: push the interpreted hot loops further + add CI#1
SethMorrowSoftware wants to merge 6 commits into
mainfrom
claude/elegant-gates-8etn3x

Conversation

@SethMorrowSoftware

@SethMorrowSoftware SethMorrowSoftware commented Jun 13, 2026

Copy link
Copy Markdown
Owner

What & why

The decode cost centre (docs/spec.md §11) is the interpreted, single-threaded per-pixel / per-bit loops. The 0.1.0 pass did the arithmetic wins; this PR pulls the remaining LiveCode-specific levers across luminance, both binarizers, the detector scan, and the robust orchestration — all bit-identical, plus the CI workflow the README already advertises.

Performance changes (no change to decode output)

Area Change Lever
Luminance luminanceSource_newFromImageData walks the raw plane with repeat for each byte + a 4-phase counter Sequential iterator instead of 3 indexed byte (o+k) of pRaw reads/pixel — the biggest interpreted-loop lever in xTalk
Global binarizer ghb_getBlackMatrix builds each 32-bit word from ≤32 pixels and writes it once (skips all-white words) Per-word array write instead of a bitMatrix_set command call + array read/write per black pixel
Hybrid binarizer hb_thresholdBlock inlines the black-pixel set Drops the per-pixel command dispatch + shl call across thousands of 8×8 blocks
Detector scan finderPatternFinder loads one row word per 32 columns and shifts 1 bit/pixel Each pixel test is bitAnd/div 2 instead of a bitMatrix_get + uShr call
Robust path qrDecodeResultRobust builds the downsampled luminance source once per scale, reused across binarizers Halves the downsample+greyscale (cost centre) when the global fallback runs

Correctness

Every change is bit-identical by construction and checked by simulating old-vs-new against the exact BitMatrix packing model:

  • Luminance regrouping — identical over 2000 random W×H buffers (incl. trailing padding).
  • Inlined / word-batched bit-set — identical to bitMatrix_set over 500 + 3000 random matrices/images; shl(1,k) == 2^k verified for all k in 0..31.
  • Finder row-scan word-cache — identical to bitMatrix_get over 3000 random rows (incl. widths that aren't multiples of 32).
  • Robust-path reuse — the luminance source is deterministic per (plane, scale) and read-only downstream (binaryBitmap copies it in; binarizers only read ["lum"]).

tools/lint_lcs.py is clean across all 53 files; lib/xtQRdecoder.livecodescript is regenerated and build_livecodescript.py --check passes.

Added — CI

.github/workflows/ci.yml runs the two static gates the docs describe (lint_lcs.py + build_livecodescript.py --check, pure Python stdlib) on push to main and on every PR. This is the workflow the README's CI badge points at.

⚠️ Needs a real-engine run before merge

The static gates can't run the xTalk engine. Please run qr/qr_tester.lc (399 unit tests) and qr/qr_golden.lc (5 golden fixtures) on a real engine to confirm — the changes are built to be bit-identical, so I expect both green.

Possible follow-ups (not in this PR)

  • Fuse luminanceSource_downsampleRaw to compute luminance in one strided pass (drop the intermediate 4-bpp string + second pass).
  • Store the luminance plane as a binary string to make the binarizer's per-pixel reads sequential too (larger, more invasive change touching the hybrid block access).

https://claude.ai/code/session_01DvJJcEwcoAVVRfM9iRdqa7

claude added 2 commits June 13, 2026 22:01
The decode cost centre (spec §11) is the interpreted per-pixel loops in
luminance conversion and binarization. A prior pass did the arithmetic
wins (index hoisting, flat-local accumulation, arrays-by-reference); this
pass pulls the two remaining LiveCode levers, with output unchanged:

* luminanceSource_newFromImageData now walks the raw pixel plane with a
  `repeat for each byte` sequential iterator + a 4-phase counter, instead
  of three indexed `byte (o+k) of pRaw` reads per pixel. Sequential
  iteration avoids re-resolving the byte chunk on every read — the single
  biggest interpreted-loop lever in xTalk. Also speeds the downsample path,
  which feeds the same handler.

* The hybrid and global binarizers no longer call the bitMatrix_set command
  once per black pixel on their O(W·H) threshold loops; the bit-set is
  inlined, removing the per-pixel handler dispatch and the shl call
  (shl(1,k) == 2^k for k in 0..31). The word index and set bit are
  unchanged, so each BitMatrix is bit-identical.

Output is bit-identical (verified by simulating old vs new logic over
thousands of random buffers/matrices), so the 399 unit tests and 5 golden
fixtures are unaffected. Combined library rebuilt; linter and build --check
sync gate pass.

https://claude.ai/code/session_01DvJJcEwcoAVVRfM9iRdqa7
…nce reuse

Builds on the previous hot-loop pass with three more bit-identical speedups
(each verified by simulating old vs new against the exact BitMatrix packing
model over thousands of random inputs) plus the missing CI workflow.

Performance (no change to decode output):
* globalHistogramBinarizer: the whole-image threshold now builds each 32-bit
  BitMatrix word from up to 32 pixels and writes it once per word (skipping
  all-white words), instead of a bitMatrix_set command call + array read/write
  per black pixel.
* finderPatternFinder: the main row scan loads one 32-bit row word per 32
  columns and shifts it one bit per pixel, so each pixel test is a bitAnd/div 2
  instead of a bitMatrix_get + uShr function call. Bit-identical on the
  sequential row walk.
* qrDecodeResultRobust: builds the downsampled luminance source once per scale
  and reuses it across binarizers, instead of recomputing the downsample +
  greyscale (the cost centre) for every (binarizer, scale) strategy -- halves
  that work on photos that need the global fallback. The source is read-only
  downstream, so reuse is behaviour-preserving.

Added:
* .github/workflows/ci.yml -- runs tools/lint_lcs.py and
  build_livecodescript.py --check (the documented static gates; pure Python
  stdlib) on push to main and on PRs. This is the workflow the README CI badge
  already points at.

Combined library rebuilt; linter clean across 53 files and build --check passes.

https://claude.ai/code/session_01DvJJcEwcoAVVRfM9iRdqa7
@SethMorrowSoftware SethMorrowSoftware changed the title perf: speed up the interpreted pixel hot loops (bit-identical output) perf: push the interpreted hot loops further + add CI Jun 13, 2026
claude added 4 commits June 13, 2026 22:36
STEP 2 (greyscale) dominates decode time on real phone photos because that
path runs luminanceSource_downsampleRaw, which the earlier repeat-for-each
win did NOT cover: it did a `byte (o+1) to (o+4) of pRaw` chunk read AND a
`put after` append for every kept pixel (750k for a 1500x2000 -> step-2
image), built a multi-MB intermediate plane, then re-walked it in a second
pass.

Fuse it into a single pass: extract each KEPT source row with one chunk read,
walk it with the fast `repeat for each byte` iterator, and compute the §7.3
greyscale inline. No intermediate reduced plane, no second pass, and one chunk
read per row instead of per pixel.

Output is bit-identical (verified by simulating the fused pass vs the old
downsampleRaw + newFromImageData over 4000 random sizes/steps). The column-keep
test uses a full if/end if (not an inline-if) to avoid the inline-if + else-if
binding hazard (ARCHITECTURE §4.12). Combined library rebuilt; linter clean and
build --check passes.

https://claude.ai/code/session_01DvJJcEwcoAVVRfM9iRdqa7
Real-engine profiling showed the "fused single pass" downsample made grayscale
SLOWER, not faster. The fix was based on a wrong hypothesis: it cut indexed
chunk reads ~750x (per-row instead of per-pixel), yet got slower -- which proves
chunk-read overhead was never the bottleneck. The actual cost is the per-byte
work inside `repeat for each` (byteToNum + array write + branching), and the
fused version walked ~2x the bytes (every column of each kept row, discarding
the skipped ones) with a heavier loop body.

Restore the original two-pass downsampleRaw, which only runs the heavy luminance
compute on the kept pixels. Round-1's newFromImageData repeat-for-each is kept.
Combined library rebuilt; linter clean and build --check passes.

https://claude.ai/code/session_01DvJJcEwcoAVVRfM9iRdqa7
Profiling showed STEP 2 (grayscale/downsample) dominates on large photos and
is near the floor for INTERPRETED xTalk: the cost is the per-kept-pixel work
(byteToNum + array write) over hundreds of thousands of pixels, not byte-access
overhead. The big remaining lever is to stop downsampling in interpreted code
and let the engine do it in compiled C.

Add luminanceSource_decodeResampled: decode the image, resample it to the target
scale with the engine's compiled `resizeImage`, then read the already-small
imageData -- skipping the interpreted per-pixel downsample entirely. It targets
the same integer-step dimensions the interpreted path produces, so detector
geometry is unchanged; only the resampling filter differs.

Wire it into qrDecodeResultRobust behind the new ENGINE_RESAMPLE hint (cached
per scale like the interpreted path; the costly full-res decode is skipped in
this mode). Because the engine's resampler changes pixel values vs nearest-
neighbour, it CHANGES decode behaviour and is therefore OFF BY DEFAULT -- the
default path stays bit-identical, so the 399 tests and 5 golden fixtures are
unaffected. Documented in the README hints table and CHANGELOG; re-verify
qr_golden.lc on the target engine when enabling it.

Also: removed the stale CHANGELOG bullet for the reverted fused-downsample.
Combined library rebuilt; linter clean across 53 files and build --check passes.

https://claude.ai/code/session_01DvJJcEwcoAVVRfM9iRdqa7
…esizeImage

The previous commit used a bare `resizeImage` command. On a build that lacks it
(e.g. the reporter's engine), that is a PARSE error -- and a parse error in one
module takes down the whole combined library. Invoke resizeImage via `do`
instead, so an absent command fails at RUNTIME (caught) rather than at compile
time, and fall back to the interpreted downsample when it isn't available.

Net: the library compiles on every engine again; ENGINE_RESAMPLE uses the
compiled resampler where present and is a safe no-op (interpreted downsample,
same as default) where it isn't. Linter clean, combined library rebuilt, build
--check passes.

https://claude.ai/code/session_01DvJJcEwcoAVVRfM9iRdqa7
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants