
Vectorize small-offset overlapping copies on AArch64 with NEON tbl #3

Open
alexey-milovidov wants to merge 1 commit into clickhouse from aarch64-neon-overlap-copy



@alexey-milovidov alexey-milovidov commented Jun 21, 2026


Used by ClickHouse: ClickHouse/ClickHouse#108049

Vectorize small-offset overlapping copies on AArch64 with NEON tbl

For small match offsets (overlap period < 16 bytes), ZSTD_wildcopy falls back to an 8-byte-per-iteration COPY8 loop. On AArch64 this loop dominates decompression of integer and low-cardinality data, whose matches frequently have offsets of 4 or 8 bytes (e.g. runs of repeated fixed-width values).

This builds the repeating pattern once in a NEON register and stores 16 bytes per iteration, advancing the pattern with a single vqtbl1q_u8 table lookup and no load inside the loop:

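  • init[diff][j] = j % diff builds the period-diff pattern from a 16-byte load, so that pattern[j] = src[j % diff];
  • adv[diff][j] = (16 + j) % diff shifts the pattern forward by 16 bytes each iteration. This composes correctly because pattern[k] = src[k % diff]: after i steps, lane j holds src[(16*i + j) % diff], and every index in adv[diff] is below diff, so it stays inside the 16-byte table.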
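
The two lookups above can be sketched in C as follows. This is an illustrative model, not the code in this PR: the names `vec16`, `overlap_copy_tbl` and `overlap_copy_tbl_matches` are hypothetical, the index tables are built at call time rather than taken from precomputed `init[diff]` / `adv[diff]` arrays, and a portable stand-in for `vqtbl1q_u8` is provided so the sketch also builds on non-NEON hosts.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
typedef uint8x16_t vec16;
static vec16 vec_load(const uint8_t* p)      { return vld1q_u8(p); }
static void  vec_store(uint8_t* p, vec16 v)  { vst1q_u8(p, v); }
static vec16 vec_tbl(vec16 table, vec16 idx) { return vqtbl1q_u8(table, idx); }
#else
/* Portable stand-in so the sketch also runs on non-NEON hosts. */
typedef struct { uint8_t b[16]; } vec16;
static vec16 vec_load(const uint8_t* p)      { vec16 v; memcpy(v.b, p, 16); return v; }
static void  vec_store(uint8_t* p, vec16 v)  { memcpy(p, v.b, 16); }
static vec16 vec_tbl(vec16 table, vec16 idx)
{
    vec16 r;
    for (int j = 0; j < 16; j++)  /* tbl yields 0 for out-of-range indices */
        r.b[j] = idx.b[j] < 16 ? table.b[idx.b[j]] : 0;
    return r;
}
#endif

/* Hypothetical helper (not the PR's code): copy `length` bytes to dst from
 * dst - diff, for 1 <= diff <= 15 and length >= 1. Reads 16 bytes starting
 * at dst - diff and may write up to 15 bytes past dst + length. */
static void overlap_copy_tbl(uint8_t* dst, size_t diff, size_t length)
{
    uint8_t init[16], adv[16];
    for (size_t j = 0; j < 16; j++) {
        init[j] = (uint8_t)(j % diff);         /* init[diff][j] = j % diff */
        adv[j]  = (uint8_t)((16 + j) % diff);  /* adv[diff][j] = (16 + j) % diff */
    }
    vec16 const step = vec_load(adv);
    /* pattern[j] = src[j % diff]; only the first diff loaded bytes are used. */
    vec16 pattern = vec_tbl(vec_load(dst - diff), vec_load(init));
    uint8_t* const end = dst + length;
    do {
        vec_store(dst, pattern);
        pattern = vec_tbl(pattern, step);      /* advance by 16 bytes, no load */
        dst += 16;
    } while (dst < end);
}

/* Compare against a byte-by-byte overlapping copy; returns 1 on match. */
static int overlap_copy_tbl_matches(size_t diff, size_t length)
{
    uint8_t got[15 + 512 + 16] = {0}, want[15 + 512 + 16] = {0};
    if (diff < 1 || diff > 15 || length < 1 || length > 512) return 0;
    for (size_t i = 0; i < diff; i++) got[i] = want[i] = (uint8_t)(i * 37 + 11);
    for (size_t k = 0; k < length; k++) want[diff + k] = want[k];
    overlap_copy_tbl(got + diff, diff, length);
    return memcmp(got, want, diff + length) == 0;
}
```

Calling `overlap_copy_tbl_matches(4, 100)` checks an offset-4 copy against the byte-by-byte reference. Only the first `diff + length` bytes are compared, since the bytes past the end are overrun that the caller's slack is expected to absorb.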

The scalar COPY8 path is kept unchanged for non-NEON targets.

Results

On Graviton 4 (Neoverse-V2), ZSTD level 1 decompression of ClickBench hits columns (decompression throughput, best of several runs):

| column | type | baseline MB/s | patched MB/s | speedup |
|---|---|---|---|---|
| UserID | Int64 | 2624 | 6053 | 2.30x |
| AdvEngineID | Int16 | 5990 | 8712 | 1.46x |
| ClientIP | Int32 | 1857 | 2607 | 1.42x |
| RegionID | Int32 | 2468 | 3127 | 1.27x |
| URL / Title / Referer | String | | | ~1.00x (unchanged) |
| WatchID | Int64 (incompressible) | | | ~1.00x (unchanged) |

Correctness

Round-trip decompression is byte-exact against the original input for all tested columns at block sizes from 4 KB to 1 MB, and clean under AddressSanitizer and UndefinedBehaviorSanitizer. The 16-byte stores overrun the output by at most 15 bytes, well within the existing WILDCOPY_OVERLENGTH (32) slack already required by the non-overlapping path.
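
The overrun bound can be checked with a few lines of arithmetic. The sketch below is illustrative only: `overrun_bytes` and `overrun_fits_slack` are hypothetical names, `WILDCOPY_OVERLENGTH` is redefined locally with the value quoted above, and it assumes a copy of `length >= 1` bytes emitted purely as whole 16-byte stores.

```c
#include <stddef.h>

#define WILDCOPY_OVERLENGTH 32  /* slack value quoted in the text above */

/* Bytes written past dst + length when `length` >= 1 bytes are emitted
 * as whole 16-byte stores. */
static size_t overrun_bytes(size_t length)
{
    size_t const stores = (length + 15) / 16;  /* number of 16-byte stores */
    return stores * 16 - length;
}

/* Returns 1 if the overrun is at most 15 bytes and inside the slack. */
static int overrun_fits_slack(size_t length)
{
    size_t const over = overrun_bytes(length);
    return over <= 15 && over < WILDCOPY_OVERLENGTH;
}
```

The worst case is a length one byte past a multiple of 16 (for example 17), where the final store writes 15 bytes beyond the requested end.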

For small match offsets (overlap period < 16 bytes), `ZSTD_wildcopy` falls
back to an 8-byte-per-iteration `COPY8` loop. On AArch64 this dominates
decompression of integer and low-cardinality columns, whose matches frequently
have offsets of 4 or 8 bytes (e.g. repeated fixed-width values).

Build the repeating pattern once in a NEON register and store 16 bytes per
iteration, advancing the pattern with a single `vqtbl1q_u8` table lookup and
no load inside the loop. The scalar `COPY8` path is kept for other targets.

On Graviton 4 (Neoverse-V2), ZSTD level 1 decompression of ClickBench `hits`
columns speeds up: UserID 2.30x, AdvEngineID 1.46x, ClientIP 1.42x,
RegionID 1.27x; string and incompressible columns are unchanged. Verified
byte-exact and clean under ASan and UBSan.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>