Vectorize small-offset overlapping copies on AArch64 with NEON tbl #3
Open
alexey-milovidov wants to merge 1 commit into
For small match offsets (overlap period < 16 bytes), `ZSTD_wildcopy` falls back to an 8-byte-per-iteration `COPY8` loop. On AArch64 this dominates decompression of integer and low-cardinality columns, whose matches frequently have offsets of 4 or 8 bytes (e.g. repeated fixed-width values).

Build the repeating pattern once in a NEON register and store 16 bytes per iteration, advancing the pattern with a single `vqtbl1q_u8` table lookup and no load inside the loop. The scalar `COPY8` path is kept for other targets.

On Graviton 4 (Neoverse-V2), ZSTD level 1 decompression of ClickBench `hits` columns speeds up: UserID 2.30x, AdvEngineID 1.46x, ClientIP 1.42x, RegionID 1.27x; string and incompressible columns are unchanged. Verified byte-exact and clean under ASan and UBSan.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Used by ClickHouse: ClickHouse/ClickHouse#108049
Vectorize small-offset overlapping copies on AArch64 with NEON `tbl`

For small match offsets (overlap period < 16 bytes), `ZSTD_wildcopy` falls back to an 8-byte-per-iteration `COPY8` loop. On AArch64 this loop dominates decompression of integer and low-cardinality data, whose matches frequently have offsets of 4 or 8 bytes (e.g. runs of repeated fixed-width values).

This builds the repeating pattern once in a NEON register and stores 16 bytes per iteration, advancing the pattern with a single `vqtbl1q_u8` table lookup and no load inside the loop:

- `init[diff][j] = j % diff` builds the period-`diff` pattern from a 16-byte load;
- `adv[diff][j] = (16 + j) % diff` shifts the pattern forward by 16 bytes each iteration, which composes correctly because `pattern[k] = src[k % diff]`.

The scalar `COPY8` path is kept unchanged for non-NEON targets.

Results
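On Graviton 4 (Neoverse-V2), ZSTD level 1 decompression of ClickBench `hits` columns (decompression throughput, best of several runs; speedups as reported in the commit message):

| Column | Speedup |
|---|---|
| UserID | 2.30x |
| AdvEngineID | 1.46x |
| ClientIP | 1.42x |
| RegionID | 1.27x |
| URL/Title/Referer | unchanged |
| WatchID | unchanged |

Correctness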
Round-trip decompression is byte-exact against the original input for all tested columns at block sizes from 4 KB to 1 MB, and clean under AddressSanitizer and UndefinedBehaviorSanitizer. The 16-byte stores overrun the output by at most 15 bytes, well within the existing `WILDCOPY_OVERLENGTH` (32) slack already required by the non-overlapping path.
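
The `init`/`adv` scheme and the 15-byte overrun bound described above can be illustrated with a minimal, portable sketch. This is not the code from this PR: `tbl16`, `overlap_copy_tbl`, and `check_all_periods` are hypothetical names, the table rows are computed on the fly rather than read from precomputed `init[diff]`/`adv[diff]` arrays, and the source bytes are staged through a zero-padded buffer instead of a direct 16-byte load. On AArch64 the lookup maps to `vqtbl1q_u8`; elsewhere it is emulated with a scalar loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* out[j] = tbl[idx[j]] for 16 lanes. All indices used below are < 16.
 * `out` may alias `tbl`: both inputs are fully read before the store. */
static void tbl16(uint8_t out[16], const uint8_t tbl[16], const uint8_t idx[16])
{
#if defined(__aarch64__)
    vst1q_u8(out, vqtbl1q_u8(vld1q_u8(tbl), vld1q_u8(idx)));
#else
    uint8_t tmp[16];
    for (int j = 0; j < 16; j++) tmp[j] = tbl[idx[j]];
    memcpy(out, tmp, 16);
#endif
}

/* Hypothetical helper: replicate the `diff` bytes that precede `op` over
 * `length` output bytes, for 1 <= diff < 16. Stores whole 16-byte blocks,
 * so it may write up to 15 bytes past op + length. */
void overlap_copy_tbl(uint8_t *op, size_t diff, size_t length)
{
    uint8_t init[16], adv[16], pat[16];
    uint8_t src[16] = {0};

    assert(diff >= 1 && diff < 16);
    for (size_t j = 0; j < 16; j++) {
        init[j] = (uint8_t)(j % diff);         /* pattern for bytes 0..15       */
        adv[j]  = (uint8_t)((16 + j) % diff);  /* pattern shifted by 16 bytes   */
    }

    /* Assumption for this sketch: copy only the `diff` valid history bytes. */
    memcpy(src, op - diff, diff);
    tbl16(pat, src, init);                     /* pat[j] = src[j % diff]        */

    for (size_t done = 0; done < length; done += 16) {
        memcpy(op + done, pat, 16);            /* one 16-byte store             */
        tbl16(pat, pat, adv);                  /* advance, no load from memory  */
    }
}

/* Compare against a byte-by-byte reference for every period 1..15.
 * Returns 1 if all outputs match, 0 otherwise. */
int check_all_periods(void)
{
    enum { LEN = 100, SLACK = 16 };            /* slack absorbs the overrun     */
    for (size_t diff = 1; diff < 16; diff++) {
        uint8_t got[16 + LEN + SLACK], ref[16 + LEN];
        for (size_t i = 0; i < 16; i++)        /* 16 bytes of prior history     */
            got[i] = ref[i] = (uint8_t)(i * 7 + 1);
        overlap_copy_tbl(got + 16, diff, LEN);
        for (size_t i = 0; i < LEN; i++)       /* reference: one byte at a time */
            ref[16 + i] = ref[16 + i - diff];
        if (memcmp(got, ref, 16 + LEN) != 0) return 0;
    }
    return 1;
}
```

Applying `adv` to a register that holds `src[(16n + j) % diff]` yields `src[(16n + (16 + j) % diff) % diff] = src[(16(n + 1) + j) % diff]`, which is why a single lookup per iteration is enough. The caller must leave at least 15 bytes of writable slack after the destination, as the sketch's `SLACK` buffer does.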