Conversation

lilith commented Jan 18, 2026

Summary

  • Replace platform-specific SSE2/NEON intrinsics in f_pixel::diff() with portable wide::f32x4
  • Remove HistSortTmp union in favor of plain u32 (union was unnecessary since PalIndex fits in u32)
  • Remove unnecessary get_unchecked in qsort_pivot (three bounds checks are negligible overhead)

Details

The diff() function had three separate implementations using unsafe intrinsics:

  • #[cfg(target_arch = "x86_64")] with SSE2
  • #[cfg(target_arch = "aarch64")] with NEON
  • Scalar fallback

Now there's a single portable implementation using wide::f32x4 that compiles to equivalent SIMD on both architectures. A scalar reference implementation is retained under #[cfg(test)] with a brute-force comparison test that verifies both implementations match across edge cases.
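
As an illustration of the pattern (not the crate's exact metric — the real diff() also folds in alpha compensation), here is a minimal sketch assuming a #[repr(C)] four-float pixel and bytemuck with its derive feature:

```rust
use wide::f32x4;

// Hypothetical stand-in for the crate's pixel struct; the real ARGBF
// lives in the library and already derives Pod/Zeroable.
#[repr(C)]
#[derive(Clone, Copy, bytemuck::Pod, bytemuck::Zeroable)]
struct ARGBF { a: f32, r: f32, g: f32, b: f32 }

fn diff(px: ARGBF, py: ARGBF) -> f32 {
    // Both types are Pod and 16 bytes, so bytemuck::cast reinterprets
    // by value with no unsafe and no per-arch intrinsics.
    let x: f32x4 = bytemuck::cast(px);
    let y: f32x4 = bytemuck::cast(py);
    let d = x - y;  // a single vector subtract on SSE2 and NEON alike
    let sq = d * d; // squared per-channel difference
    // Horizontal max over the four lanes.
    sq.to_array().into_iter().fold(0.0f32, f32::max)
}
```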

Test plan

  • All existing tests pass
  • New diff_simd_matches_scalar test verifies the SIMD path matches the scalar reference (see the sketch after this list)
  • cargo clippy passes
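
A sketch of what such a comparison test can look like, reusing the hypothetical ARGBF/diff() sketch above; the crate's real test sweeps its actual edge cases:

```rust
#[cfg(test)]
mod tests {
    use super::*; // assumes the ARGBF/diff() sketch above

    // Scalar reference: mirrors the SIMD math operation-for-operation,
    // so results are bit-identical (sub, mul, and max are exact IEEE ops).
    fn diff_scalar(px: ARGBF, py: ARGBF) -> f32 {
        [px.a - py.a, px.r - py.r, px.g - py.g, px.b - py.b]
            .into_iter()
            .map(|c| c * c)
            .fold(0.0f32, f32::max)
    }

    #[test]
    fn diff_simd_matches_scalar() {
        // Coarse brute force over edge-case channel values.
        let vals = [0.0f32, f32::MIN_POSITIVE, 0.25, 0.5, 1.0];
        for &a in &vals {
            for &r in &vals {
                for &g in &vals {
                    for &b in &vals {
                        let x = ARGBF { a, r, g, b };
                        let y = ARGBF { a: b, r: g, g: r, b: a }; // arbitrary pairing
                        assert_eq!(diff(x, y), diff_scalar(x, y));
                    }
                }
            }
        }
    }
}
```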

Replace hand-written SSE2 (x86_64) and NEON (aarch64) intrinsics in
f_pixel::diff() with safe, portable SIMD using the wide crate's f32x4.

- Direct bytemuck cast from ARGBF to f32x4 (both are Pod)
- Single implementation works on all platforms (x86, ARM, WASM, etc.)
- Includes scalar reference and brute-force comparison test
- ~70 lines removed; eliminates all unsafe in this function

The HistSortTmp union was used to reinterpret the same memory as either
mc_sort_value (u32) during median-cut sorting, or likely_palette_index
(PalIndex) afterwards.

Since PalIndex fits in u32, we can simply store u32 and cast on read.
This eliminates unsafe union field access with no runtime cost.
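
A minimal sketch of the before/after shape, assuming PalIndex is a narrow integer type like u8 (names here mirror this PR's description, not necessarily the crate's exact fields):

```rust
type PalIndex = u8; // assumption: any type that fits losslessly in u32

// Before (sketch): every read of either field required unsafe.
// union HistSortTmp {
//     mc_sort_value: u32,
//     likely_palette_index: PalIndex,
// }

// After: one plain u32, reused for both roles.
struct HistItemTmp {
    tmp: u32, // mc_sort_value during median cut, palette index afterwards
}

impl HistItemTmp {
    fn likely_palette_index(&self) -> PalIndex {
        self.tmp as PalIndex // lossless: the stored value is a PalIndex
    }
}
```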

qsort_pivot sorts three pivot indices and accesses them three times in
total. The bounds-check overhead is negligible compared to the
partitioning work that follows; safe indexing is simpler and equally fast.
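
For illustration, a generic median-of-three pivot in safe Rust; the sample positions and the generic signature are assumptions, not the crate's exact code:

```rust
// Returns the index of the median of three sampled elements.
// Three bounds-checked reads in total; the O(n) partition that
// follows dominates the cost.
fn qsort_pivot<T: PartialOrd>(items: &[T]) -> usize {
    let len = items.len();
    debug_assert!(len >= 3);
    let (a, b, c) = (0, len / 2, len - 1); // hypothetical sample points
    if items[a] < items[b] {
        if items[b] < items[c] { b }
        else if items[a] < items[c] { c }
        else { a }
    } else if items[a] < items[c] { a }
    else if items[b] < items[c] { c }
    else { b }
}
```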

lilith commented Jan 18, 2026

Follow-up: MaybeUninit removal feasibility

I've prototyped making MaybeUninit opt-in via an unsafe-uninit feature flag. The default path uses zero-initialized buffers instead, which eliminates most unsafe code for pure Rust users.

Benchmark results (512×512 image, 3 runs each)

| Mode | quantize | remap/dither | histogram |
| --- | --- | --- | --- |
| Safe (zero-init) | 35.9 / 35.7 / 36.9 ms | 11.6 / 11.9 / 11.5 ms | 5.5 / 6.4 / 5.3 ms |
| MaybeUninit | 35.5 / 35.2 / 35.9 ms | 11.9 / 11.2 / 11.5 ms | 5.7 / 5.9 / 5.9 ms |

The difference is within noise (~1-3%). The zero-initialization cost is negligible compared to the actual quantization work.

Changes required

  • Slot<T> type alias: MaybeUninit<T> or T based on feature (see the sketch after this list)
  • Helper functions that work in both modes
  • Image::new_fn gated behind unsafe-uninit (callback API inherently requires unsafe)
  • _internal_c_ffi implies unsafe-uninit (C users keep current behavior)
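
A minimal sketch of the alias and one helper under these assumptions (names are from this comment, not the crate; the real patch would thread Slot through the actual buffers):

```rust
#[cfg(feature = "unsafe-uninit")]
pub type Slot<T> = std::mem::MaybeUninit<T>;
#[cfg(not(feature = "unsafe-uninit"))]
pub type Slot<T> = T;

// Allocate a scratch buffer of `len` slots.
#[cfg(feature = "unsafe-uninit")]
pub fn make_buf<T>(len: usize) -> Vec<Slot<T>> {
    let mut v = Vec::with_capacity(len);
    // SAFETY: MaybeUninit<T> is valid in any state, including uninitialized.
    unsafe { v.set_len(len) };
    v
}
#[cfg(not(feature = "unsafe-uninit"))]
pub fn make_buf<T: Clone + Default>(len: usize) -> Vec<Slot<T>> {
    vec![T::default(); len] // zero-init; cost is within noise per the table above
}
```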

Would you be interested in this approach for the main repo? It would allow #![deny(unsafe_code)] for pure Rust builds while maintaining backward compatibility for C FFI users.
