Rewrite varlen 32-byte block encoder with copy_nonoverlapping by joseph-isaacs · Pull Request #7999 · vortex-data/vortex

joseph-isaacs · 2026-05-18T16:05:24Z

Part 14 of 25 in the stacked PR series adding vortex-row.

This PR contains exactly one commit; review just that diff in isolation.

What this commit does

The existing byte-at-a-time XOR loop branches on every byte: it performs 32 conditional writes per block on both paths, even in the ascending (no-XOR) case, where the body reduces to exactly a memcpy(32) + stamp(1).
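For illustration only, the kind of per-byte loop described above might look like the sketch below. This is an assumed reconstruction, not the code removed by the commit: the name `encode_block_bytewise` and the `descending` flag are hypothetical, and the trailing stamp byte is left out.

```rust
const BLOCK: usize = 32;

/// Hypothetical per-byte encoder for the 32 payload bytes of one full block.
/// The sort direction is checked once per byte, so each of the 32 writes is
/// conditional, even when `descending` is false and the loop is a plain copy.
fn encode_block_bytewise(src: &[u8; BLOCK], dst: &mut [u8; BLOCK], descending: bool) {
    for i in 0..BLOCK {
        dst[i] = if descending { src[i] ^ 0xFF } else { src[i] };
    }
}
```

An optimizer may hoist the direction check out of such a loop in some cases, but the rewrite does not depend on that happening.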

Rewrites encode_varlen_value with two distinct fast paths:

- Ascending: copy_nonoverlapping(src, dst, 32) + a single 0xFF stamp. Because the length is a constant 32, the compiler can lower the copy to a SIMD memcpy.
- Descending: a xor_copy_block helper that XOR-copies 32 bytes via four u64 reads/writes; LLVM lowers it to SIMD on x86.
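The two fast paths can be sketched roughly as follows. This is a minimal illustration under assumptions, not the commit's implementation: `copy_block_ascending` is a hypothetical name, the body of `xor_copy_block` is a guess at the "four u64 reads/writes" shape, and the layout (32 payload bytes followed by one stamp byte) is assumed from the description.

```rust
use std::ptr;

const BLOCK: usize = 32;

/// Ascending path (hypothetical helper): copy one full 32-byte block, then
/// write the 0xFF stamp byte after it. The 33-byte layout is an assumption.
fn copy_block_ascending(src: &[u8; BLOCK], dst: &mut [u8; BLOCK + 1]) {
    // SAFETY: `src` and `dst` are distinct borrows, so they cannot overlap,
    // and both are at least BLOCK bytes long.
    unsafe { ptr::copy_nonoverlapping(src.as_ptr(), dst.as_mut_ptr(), BLOCK) };
    dst[BLOCK] = 0xFF;
}

/// Descending path: XOR-copy 32 bytes as four u64 words. Inverting every bit
/// is independent of byte order, so native-endian conversions are sufficient.
fn xor_copy_block(src: &[u8; BLOCK], dst: &mut [u8; BLOCK]) {
    for i in 0..4 {
        let lo = i * 8;
        let mut word = [0u8; 8];
        word.copy_from_slice(&src[lo..lo + 8]);
        let inverted = u64::from_ne_bytes(word) ^ u64::MAX;
        dst[lo..lo + 8].copy_from_slice(&inverted.to_ne_bytes());
    }
}
```

Both helpers have a fixed trip count and no data-dependent branches, which is what lets the optimizer treat each block as straight-line wide loads and stores.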

The partial-block tail uses write_bytes for the zero-padding instead of a per-byte loop.

Benchmark results: utf8 throughput improves from ~0.92 GB/s to ~1.39 GB/s (about +51%). struct_mixed: +35%.
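The zero-padded tail could be sketched as below for the ascending case. The helper name `encode_tail_ascending` is hypothetical, and the final byte holding the tail length is an assumption borrowed from common row-format designs; the PR description only states that `write_bytes` fills the padding.

```rust
use std::ptr;

const BLOCK: usize = 32;

/// Hypothetical tail encoder: copy the remaining `tail` bytes, zero-fill the
/// rest of the 32-byte block with a single `write_bytes` call, then write one
/// trailing byte (assumed here to be the tail length).
fn encode_tail_ascending(tail: &[u8], dst: &mut [u8; BLOCK + 1]) {
    let n = tail.len();
    assert!(n < BLOCK, "a tail is shorter than a full block");
    // SAFETY: `tail` and `dst` are distinct borrows; n < BLOCK, so both the
    // copy and the zero fill stay within the first BLOCK bytes of `dst`.
    unsafe {
        ptr::copy_nonoverlapping(tail.as_ptr(), dst.as_mut_ptr(), n);
        ptr::write_bytes(dst.as_mut_ptr().add(n), 0, BLOCK - n);
    }
    dst[BLOCK] = n as u8;
}
```

For example, encoding the 3-byte tail `b"abc"` into a buffer would leave bytes 3 through 31 zeroed and, under the stated assumption, the final byte set to 3.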

Stack

#	PR	Title	Branch
1	#7986	vortex-row: crate scaffolding	`claude/row-c01-crate-scaffolding`
2	#7987	vortex-row: add SortField and RowEncodeOptions	`claude/row-c02-sortfield-options`
3	#7988	vortex-row: codec for fixed-width canonical types	`claude/row-c03-codec-fixed-width`
4	#7989	vortex-row: codec for varlen canonical types	`claude/row-c04-codec-varlen`
5	#7990	vortex-row: codec for nested canonical types	`claude/row-c05-codec-nested`
6	#7991	vortex-row: compute_sizes helper and RowSize ScalarFn	`claude/row-c06-rowsize-scalarfn`
7	#7992	vortex-row: RowEncode ScalarFn	`claude/row-c07-rowencode-scalarfn`
8	#7993	vortex-row: convert_columns + tests + bench scaffolding	`claude/row-c08-convert-columns-tests-bench`
9	#7994	Skip ListView validation in row encoder output	`claude/row-c09-skip-listview-validation`
10	#7995	Add validity fast-path helper for the four pattern-matching encoders	`claude/row-c10-validity-fast-path`
11	#7996	Skip zero-init of output buffer	`claude/row-c11-skip-zero-init`
12	#7997	Auto-vectorize pure-fixed offsets construction	`claude/row-c12-vectorize-pure-fixed-offsets`
13	#7998	Auto-vectorize mixed-path offsets construction	`claude/row-c13-vectorize-mixed-offsets`
14	#7999	Rewrite varlen 32-byte block encoder with copy_nonoverlapping	`claude/row-c14-varlen-block-copy-nonoverlapping`
15	#8000	Walk VarBinView rows directly in row encoder hot loop	`claude/row-c15-walk-varbinview-directly`
16	#8001	Add arithmetic-write fast path for fixed-before-varlen columns	`claude/row-c16-arith-write-fast-path`
17	#8002	Specialize Constant for the arithmetic-write fast path	`claude/row-c17-specialize-constant-arith`
18	#8003	RowSizeKernel and RowEncodeKernel dispatch helpers	`claude/row-c18-kernel-dispatch-helpers`
19	#8004	Inventory-based registry for downstream encoding kernels	`claude/row-c19-inventory-registry`
20	#8005	Constant row-encode kernel	`claude/row-c20-constant-kernel`
21	#8006	Dict row-encode kernel	`claude/row-c21-dict-kernel`
22	#8007	Patched row-encode kernel	`claude/row-c22-patched-kernel`
23	#8008	RunEnd row-encode kernel (vortex-runend)	`claude/row-c23-runend-kernel`
24	#8009	BitPacked row-encode kernel (vortex-fastlanes)	`claude/row-c24-bitpacked-kernel`
25	#7985	FoR and Delta row-encode kernels (vortex-fastlanes)	`claude/row-pr3-kernels`

Base of this PR: #7998 (claude/row-c13-vectorize-mixed-offsets)
Next in stack: #8000 (claude/row-c15-walk-varbinview-directly)

Combined context

For the full design + rationale, see PR #7985 (top of stack).

The byte-at-a-time XOR loop is per-byte branch-heavy: 32 conditional writes per block on each path, even for the ascending (no-XOR) case where the body is exactly a `memcpy(32) + stamp(1)`.

Rewrite `encode_varlen_value` with two distinct fast paths:

- Ascending: `copy_nonoverlapping(src, dst, 32)` + a single 0xFF stamp. The compiler folds the loop into a SIMD memcpy.
- Descending: a `xor_copy_block` helper that XOR-copies 32 bytes via four u64 reads/writes; LLVM lowers it to SIMD on x86.

The partial-block tail uses `write_bytes` for the zero-padding instead of a per-byte loop.

utf8 throughput: ~0.92 GB/s → ~1.39 GB/s. struct_mixed: +35%.

Signed-off-by: Claude <noreply@anthropic.com>

joseph-isaacs closed this May 18, 2026

