Following up from #184
We currently perform CLMUL with XMM registers, but we could use YMM registers with `_mm256_clmulepi64_epi128`, or ZMM registers with `_mm512_clmulepi64_epi128`. The latter would provide 1 CLMUL-per-block processing (amortized over 4 blocks processed in parallel).
Both instructions appear to be available on the same families of CPUs, so if we add an additional backend, it should probably be the AVX-512 one.
It could perhaps be `cfg`-gated like in the `aes` crate, both to preserve MSRV and to give us time to decide if it's actually a good idea.
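A rough sketch of what the gating could look like, to make the proposal concrete. The `polyval_avx512` cfg name, the `avx512` module, and the `clmul_lo_x4` helper are all hypothetical, not existing API in this crate. The portable `clmul64` is included only as a reference for what a single 64x64-bit lane computes.

```rust
/// Portable carry-less 64x64 -> 128-bit multiply. Shown only as a reference
/// for the product that each 128-bit lane of a CLMUL instruction computes.
pub fn clmul64(a: u64, b: u64) -> u128 {
    let mut acc = 0u128;
    for i in 0..64 {
        // XOR in a shifted copy of `a` for every set bit of `b` (no carries).
        if (b >> i) & 1 == 1 {
            acc ^= (a as u128) << i;
        }
    }
    acc
}

/// Hypothetical opt-in backend: compiled only when the crate is built with
/// `--cfg polyval_avx512` (an assumed flag name), so the default build and
/// its MSRV are unaffected.
#[cfg(all(polyval_avx512, target_arch = "x86_64"))]
pub mod avx512 {
    use core::arch::x86_64::{__m512i, _mm512_clmulepi64_epi128};

    /// Carry-less multiply of the low 64 bits of each of the four 128-bit
    /// lanes of `a` and `b` (selector 0x00): one instruction, four products.
    ///
    /// # Safety
    /// The caller must ensure the CPU supports `avx512f` and `vpclmulqdq`.
    #[target_feature(enable = "avx512f,vpclmulqdq")]
    pub unsafe fn clmul_lo_x4(a: __m512i, b: __m512i) -> __m512i {
        _mm512_clmulepi64_epi128::<0x00>(a, b)
    }
}
```

With the flag unset, only `clmul64` is compiled, so the sketch builds on any toolchain. With it set, callers would presumably still need a runtime check such as `is_x86_feature_detected!("vpclmulqdq")` and `is_x86_feature_detected!("avx512f")` before calling into the module.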