Native `get_bit_slice_raw`: add unit tests, consider making it generic and adopting for circuit construction

While auditing Pippenger, I found that the bit-slicing implementation it contains (`get_bit_slice_raw`) is considerably more efficient than converting to `uint256_t` and calling its custom slicing method; the latter path adds non-trivial overhead to the Pippenger benches. The native routine should get unit tests, and we can probably make it generic and adopt it more broadly, for example when computing witness values for bit slices in stdlib primitives.
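
To make the proposal concrete, here is a minimal sketch of what a generic limb-based slice could look like. It is not the existing `get_bit_slice_raw` code: the name `bit_slice_sketch`, its signature, the little-endian 64-bit limb layout, and the 32-bit cap on slice width are all assumptions made for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not the actual Pippenger routine).
// Reads `num_bits` bits (1..32) starting at bit `lo_bit` from a value stored
// as little-endian 64-bit limbs, without building a wide integer first.
template <std::size_t NumLimbs>
constexpr std::uint32_t bit_slice_sketch(const std::array<std::uint64_t, NumLimbs>& limbs,
                                         std::size_t lo_bit,
                                         std::size_t num_bits)
{
    assert(num_bits >= 1 && num_bits <= 32);
    assert(lo_bit < NumLimbs * 64);

    const std::size_t limb_idx = lo_bit / 64; // limb holding the lowest bit of the slice
    const std::size_t shift = lo_bit % 64;    // offset of that bit inside the limb

    std::uint64_t value = limbs[limb_idx] >> shift;

    // If the slice straddles a limb boundary, pull the missing high bits from
    // the next limb. Here shift > 0 is guaranteed, so (64 - shift) is in 1..63.
    if (shift + num_bits > 64 && limb_idx + 1 < NumLimbs) {
        value |= limbs[limb_idx + 1] << (64 - shift);
    }

    // num_bits <= 32, so the mask shift is well defined.
    const std::uint64_t mask = (std::uint64_t{ 1 } << num_bits) - 1;
    return static_cast<std::uint32_t>(value & mask);
}
```

A unit test for the real routine could compare its output against the `uint256_t` path across random scalars and slice positions, with particular attention to slices that cross a limb boundary and slices that run past the top bit.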