Rewrite varlen 32-byte block encoder with copy_nonoverlapping#7999

Closed
joseph-isaacs wants to merge 1 commit into claude/row-c13-vectorize-mixed-offsets from claude/row-c14-varlen-block-copy-nonoverlapping

Conversation

@joseph-isaacs
Contributor

@joseph-isaacs joseph-isaacs commented May 18, 2026

Part 14 of 25 in the stacked PR series adding vortex-row.

This PR contains exactly one commit; review just that diff in isolation.

What this commit does

The byte-at-a-time XOR loop is per-byte branch-heavy: 32 conditional writes per block on each path, even for the ascending (no-XOR) case where the body is exactly a memcpy(32) + stamp(1).

Rewrites encode_varlen_value with two distinct fast paths:

  • Ascending: copy_nonoverlapping(src, dst, 32) + a single 0xFF stamp. With the length fixed at 32, the compiler can lower the copy to SIMD moves, so no per-byte loop remains.
  • Descending: a xor_copy_block helper that XOR-copies 32 bytes via four u64 reads/writes; LLVM lowers it to SIMD on x86.
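A minimal sketch of the two full-block paths, for illustration only. The function bodies, the `copy_block_asc` name, the `BLOCK` and `STAMP` constants, and the all-ones XOR mask are assumptions, not the code from this commit; only `copy_nonoverlapping`, the 0xFF stamp, and the `xor_copy_block` name come from the description above. How the descending path stamps its block is not stated here, so the sketch leaves that out.

```rust
use std::ptr;

const BLOCK: usize = 32;
const STAMP: u8 = 0xFF; // stamp byte written after a full ascending block

/// Ascending path (hypothetical name): one fixed-size copy plus one stamp byte.
fn copy_block_asc(src: &[u8; BLOCK], dst: &mut [u8; BLOCK + 1]) {
    // SAFETY: `src` has 32 readable bytes, `dst` has 33 writable bytes,
    // and the shared/exclusive borrows guarantee the regions do not overlap.
    unsafe {
        ptr::copy_nonoverlapping(src.as_ptr(), dst.as_mut_ptr(), BLOCK);
    }
    dst[BLOCK] = STAMP;
}

/// Descending path: XOR-copy 32 bytes as four unaligned u64 words.
/// The mask is assumed to be all ones (every byte is inverted).
fn xor_copy_block(src: &[u8; BLOCK], dst: &mut [u8; BLOCK]) {
    for i in 0..BLOCK / 8 {
        // SAFETY: i * 8 + 8 <= 32, so every access stays in bounds;
        // unaligned reads/writes are used because byte buffers carry no
        // u64 alignment guarantee.
        unsafe {
            let word = ptr::read_unaligned(src.as_ptr().add(i * 8).cast::<u64>());
            ptr::write_unaligned(dst.as_mut_ptr().add(i * 8).cast::<u64>(), word ^ u64::MAX);
        }
    }
}
```

Because both functions have a trip count known at compile time and no data-dependent branches, the optimizer is free to turn each into a handful of wide loads and stores.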

The partial-block tail uses write_bytes for the zero-padding instead of a per-byte loop. utf8 throughput: ~0.92 GB/s → ~1.39 GB/s. struct_mixed: +35%.
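The zero-padding of the partial-block tail can be sketched as follows. This is an illustration under assumptions: the `copy_tail_padded` name and its signature are hypothetical, and whatever byte the real encoder writes after the padded block is omitted because the description does not specify it.

```rust
use std::ptr;

const BLOCK: usize = 32;

/// Copy a tail of fewer than (or exactly) 32 bytes into a block and
/// zero-fill the rest with a single `write_bytes` call instead of a
/// per-byte loop. Hypothetical helper, not the commit's code.
fn copy_tail_padded(tail: &[u8], dst: &mut [u8; BLOCK]) {
    assert!(tail.len() <= BLOCK);
    let pad = BLOCK - tail.len();
    // SAFETY: tail.len() + pad == 32, so both writes stay inside `dst`;
    // `tail` and `dst` cannot overlap because `dst` is exclusively borrowed.
    unsafe {
        ptr::copy_nonoverlapping(tail.as_ptr(), dst.as_mut_ptr(), tail.len());
        ptr::write_bytes(dst.as_mut_ptr().add(tail.len()), 0, pad);
    }
}
```

`write_bytes` behaves like a `memset`, so the padding costs one call regardless of how short the tail is.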

Stack

| # | PR | Title | Branch |
|---|----|-------|--------|
| 1 | #7986 | vortex-row: crate scaffolding | claude/row-c01-crate-scaffolding |
| 2 | #7987 | vortex-row: add SortField and RowEncodeOptions | claude/row-c02-sortfield-options |
| 3 | #7988 | vortex-row: codec for fixed-width canonical types | claude/row-c03-codec-fixed-width |
| 4 | #7989 | vortex-row: codec for varlen canonical types | claude/row-c04-codec-varlen |
| 5 | #7990 | vortex-row: codec for nested canonical types | claude/row-c05-codec-nested |
| 6 | #7991 | vortex-row: compute_sizes helper and RowSize ScalarFn | claude/row-c06-rowsize-scalarfn |
| 7 | #7992 | vortex-row: RowEncode ScalarFn | claude/row-c07-rowencode-scalarfn |
| 8 | #7993 | vortex-row: convert_columns + tests + bench scaffolding | claude/row-c08-convert-columns-tests-bench |
| 9 | #7994 | Skip ListView validation in row encoder output | claude/row-c09-skip-listview-validation |
| 10 | #7995 | Add validity fast-path helper for the four pattern-matching encoders | claude/row-c10-validity-fast-path |
| 11 | #7996 | Skip zero-init of output buffer | claude/row-c11-skip-zero-init |
| 12 | #7997 | Auto-vectorize pure-fixed offsets construction | claude/row-c12-vectorize-pure-fixed-offsets |
| 13 | #7998 | Auto-vectorize mixed-path offsets construction | claude/row-c13-vectorize-mixed-offsets |
| 14 | #7999 | Rewrite varlen 32-byte block encoder with copy_nonoverlapping | claude/row-c14-varlen-block-copy-nonoverlapping |
| 15 | #8000 | Walk VarBinView rows directly in row encoder hot loop | claude/row-c15-walk-varbinview-directly |
| 16 | #8001 | Add arithmetic-write fast path for fixed-before-varlen columns | claude/row-c16-arith-write-fast-path |
| 17 | #8002 | Specialize Constant for the arithmetic-write fast path | claude/row-c17-specialize-constant-arith |
| 18 | #8003 | RowSizeKernel and RowEncodeKernel dispatch helpers | claude/row-c18-kernel-dispatch-helpers |
| 19 | #8004 | Inventory-based registry for downstream encoding kernels | claude/row-c19-inventory-registry |
| 20 | #8005 | Constant row-encode kernel | claude/row-c20-constant-kernel |
| 21 | #8006 | Dict row-encode kernel | claude/row-c21-dict-kernel |
| 22 | #8007 | Patched row-encode kernel | claude/row-c22-patched-kernel |
| 23 | #8008 | RunEnd row-encode kernel (vortex-runend) | claude/row-c23-runend-kernel |
| 24 | #8009 | BitPacked row-encode kernel (vortex-fastlanes) | claude/row-c24-bitpacked-kernel |
| 25 | #7985 | FoR and Delta row-encode kernels (vortex-fastlanes) | claude/row-pr3-kernels |

Base of this PR: #7998 (claude/row-c13-vectorize-mixed-offsets)
Next in stack: #8000 (claude/row-c15-walk-varbinview-directly)

Combined context

For the full design + rationale, see PR #7985 (top of stack).

The byte-at-a-time XOR loop is per-byte branch-heavy: 32 conditional
writes per block on each path, even for the ascending (no-XOR) case
where the body is exactly a `memcpy(32) + stamp(1)`.

Rewrite `encode_varlen_value` with two distinct fast paths:
- Ascending: `copy_nonoverlapping(src, dst, 32)` + a single 0xFF stamp.
  The compiler folds the loop into a SIMD memcpy.
- Descending: a `xor_copy_block` helper that XOR-copies 32 bytes via
  four u64 reads/writes; LLVM lowers it to SIMD on x86.

The partial-block tail uses `write_bytes` for the zero-padding instead
of a per-byte loop.

utf8 throughput: ~0.92 GB/s → ~1.39 GB/s.
struct_mixed: +35%.

Signed-off-by: Claude <noreply@anthropic.com>
2 participants