Conversation

lilith commented Jan 18, 2026

Summary

  • Replace platform-specific SSE2/NEON intrinsics in f_pixel::diff() with portable wide::f32x4
  • Remove HistSortTmp union in favor of plain u32 (union was unnecessary since PalIndex fits in u32)
  • Remove unnecessary get_unchecked in qsort_pivot (three bounds checks are negligible overhead)

Details

The diff() function had three separate implementations using unsafe intrinsics:

  • #[cfg(target_arch = "x86_64")] with SSE2
  • #[cfg(target_arch = "aarch64")] with NEON
  • Scalar fallback

Now there's a single portable implementation using wide::f32x4 that compiles to equivalent SIMD on both architectures. A scalar reference implementation is retained under #[cfg(test)] with a brute-force comparison test that verifies both implementations match across edge cases.
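
As an illustration of the pattern (not the crate's exact metric — the real diff() also folds in alpha compensation), here is a minimal sketch assuming a #[repr(C)] four-float pixel and bytemuck with its derive feature:

```rust
use wide::f32x4;

// Hypothetical stand-in for the crate's pixel struct; the real ARGBF
// lives in the library and already derives Pod/Zeroable.
#[repr(C)]
#[derive(Clone, Copy, bytemuck::Pod, bytemuck::Zeroable)]
struct ARGBF { a: f32, r: f32, g: f32, b: f32 }

fn diff(px: ARGBF, py: ARGBF) -> f32 {
    // Both types are Pod and 16 bytes, so bytemuck::cast reinterprets
    // by value with no unsafe and no per-arch intrinsics.
    let x: f32x4 = bytemuck::cast(px);
    let y: f32x4 = bytemuck::cast(py);
    let d = x - y;  // a single vector subtract on SSE2 and NEON alike
    let sq = d * d; // squared per-channel difference
    // Horizontal max over the four lanes.
    sq.to_array().into_iter().fold(0.0f32, f32::max)
}
```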

Test plan

  • All existing tests pass
  • New diff_simd_matches_scalar test verifies the SIMD path matches the scalar reference (see the sketch after this list)
  • cargo clippy passes
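
A sketch of what such a comparison test can look like, reusing the hypothetical ARGBF/diff() sketch above; the crate's real test sweeps its actual edge cases:

```rust
#[cfg(test)]
mod tests {
    use super::*; // assumes the ARGBF/diff() sketch above

    // Scalar reference: mirrors the SIMD math operation-for-operation,
    // so results are bit-identical (sub, mul, and max are exact IEEE ops).
    fn diff_scalar(px: ARGBF, py: ARGBF) -> f32 {
        [px.a - py.a, px.r - py.r, px.g - py.g, px.b - py.b]
            .into_iter()
            .map(|c| c * c)
            .fold(0.0f32, f32::max)
    }

    #[test]
    fn diff_simd_matches_scalar() {
        // Coarse brute force over edge-case channel values.
        let vals = [0.0f32, f32::MIN_POSITIVE, 0.25, 0.5, 1.0];
        for &a in &vals {
            for &r in &vals {
                for &g in &vals {
                    for &b in &vals {
                        let x = ARGBF { a, r, g, b };
                        let y = ARGBF { a: b, r: g, g: r, b: a }; // arbitrary pairing
                        assert_eq!(diff(x, y), diff_scalar(x, y));
                    }
                }
            }
        }
    }
}
```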

Replace hand-written SSE2 (x86_64) and NEON (aarch64) intrinsics in
f_pixel::diff() with safe, portable SIMD using the wide crate's f32x4.

- Direct bytemuck cast from ARGBF to f32x4 (both are Pod)
- Single implementation works on all platforms (x86, ARM, WASM, etc.)
- Includes scalar reference and brute-force comparison test
- ~70 lines removed; eliminates all unsafe in this function

The HistSortTmp union was used to reinterpret the same memory as either
mc_sort_value (u32) during median-cut sorting, or likely_palette_index
(PalIndex) afterwards.

Since PalIndex fits in u32, we can simply store u32 and cast on read.
This eliminates unsafe union field access with no runtime cost.
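
A minimal sketch of the before/after shape, assuming PalIndex is a narrow integer type like u8 (names here mirror this PR's description, not necessarily the crate's exact fields):

```rust
type PalIndex = u8; // assumption: any type that fits losslessly in u32

// Before (sketch): every read of either field required unsafe.
// union HistSortTmp {
//     mc_sort_value: u32,
//     likely_palette_index: PalIndex,
// }

// After: one plain u32, reused for both roles.
struct HistItemTmp {
    tmp: u32, // mc_sort_value during median cut, palette index afterwards
}

impl HistItemTmp {
    fn likely_palette_index(&self) -> PalIndex {
        self.tmp as PalIndex // lossless: the stored value is a PalIndex
    }
}
```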

qsort_pivot sorts three pivot indices and accesses them three times in
total. The bounds-check overhead is negligible compared to the
partitioning work that follows; safe indexing is simpler and equally fast.
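
For illustration, a generic median-of-three pivot in safe Rust; the sample positions and the generic signature are assumptions, not the crate's exact code:

```rust
// Returns the index of the median of three sampled elements.
// Three bounds-checked reads in total; the O(n) partition that
// follows dominates the cost.
fn qsort_pivot<T: PartialOrd>(items: &[T]) -> usize {
    let len = items.len();
    debug_assert!(len >= 3);
    let (a, b, c) = (0, len / 2, len - 1); // hypothetical sample points
    if items[a] < items[b] {
        if items[b] < items[c] { b }
        else if items[a] < items[c] { c }
        else { a }
    } else if items[a] < items[c] { a }
    else if items[b] < items[c] { c }
    else { b }
}
```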

lilith commented Jan 18, 2026

Follow-up: MaybeUninit removal feasibility

I've prototyped making MaybeUninit opt-in via an unsafe-uninit feature flag. The default path uses zero-initialized buffers instead, which eliminates most unsafe code for pure Rust users.

Benchmark results (512×512 image, 3 runs each)

| Mode | quantize | remap/dither | histogram |
| --- | --- | --- | --- |
| Safe (zero-init) | 35.9 / 35.7 / 36.9 ms | 11.6 / 11.9 / 11.5 ms | 5.5 / 6.4 / 5.3 ms |
| MaybeUninit | 35.5 / 35.2 / 35.9 ms | 11.9 / 11.2 / 11.5 ms | 5.7 / 5.9 / 5.9 ms |

The difference is within noise (~1-3%). The zero-initialization cost is negligible compared to the actual quantization work.

Changes required

  • Slot<T> type alias: MaybeUninit<T> or T based on feature (see the sketch after this list)
  • Helper functions that work in both modes
  • Image::new_fn gated behind unsafe-uninit (callback API inherently requires unsafe)
  • _internal_c_ffi implies unsafe-uninit (C users keep current behavior)
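
A minimal sketch of the alias and one helper under these assumptions (names are from this comment, not the crate; the real patch would thread Slot through the actual buffers):

```rust
#[cfg(feature = "unsafe-uninit")]
pub type Slot<T> = std::mem::MaybeUninit<T>;
#[cfg(not(feature = "unsafe-uninit"))]
pub type Slot<T> = T;

// Allocate a scratch buffer of `len` slots.
#[cfg(feature = "unsafe-uninit")]
pub fn make_buf<T>(len: usize) -> Vec<Slot<T>> {
    let mut v = Vec::with_capacity(len);
    // SAFETY: MaybeUninit<T> is valid in any state, including uninitialized.
    unsafe { v.set_len(len) };
    v
}
#[cfg(not(feature = "unsafe-uninit"))]
pub fn make_buf<T: Clone + Default>(len: usize) -> Vec<Slot<T>> {
    vec![T::default(); len] // zero-init; cost is within noise per the table above
}
```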

Would you be interested in this approach for the main repo? It would allow #![deny(unsafe_code)] for pure Rust builds while maintaining backward compatibility for C FFI users.
