
take fast path if c2c transform does not need padding or trimming #283

Open
chillenb wants to merge 3 commits into IntelPython:master from
chillenb:faster


chillenb commented Mar 3, 2026

Thanks for creating and maintaining this package!

If you try to get MKL C-API performance out of this package, you will probably discover that fftn is very sensitive to the input arguments. Here is an example:

In [1]: import numpy as np
   ...: import mkl_fft
   ...: N = 200
   ...: A = np.random.random((1,N,N,N)).astype(np.complex128)
 
In [2]: %timeit mkl_fft.fftn(A, s=A.shape[1:], axes=(1,2,3))
164 ms ± 187 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [3]: %timeit mkl_fft.fftn(A, axes=(1,2,3))
6.56 ms ± 304 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [4]: %timeit mkl_fft.interfaces.numpy_fft.fftn(A, axes=(1,2,3))
165 ms ± 31.3 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

This is because mkl_fft.fftn always takes a slow path (_iter_fftnd) when s is not None. Furthermore, the NumPy and SciPy interfaces don't pass s=None through unchanged, so they are also forced onto this path.
This pull request allows fftn to detect when the input s argument is equivalent to s=None so it can use the faster function _iter_complementary.
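
For readers who want to see what "equivalent to s=None" means in practice, here is a minimal sketch of such a check. It is not the code from this pull request: the helper name `s_matches_shape` is hypothetical, and the handling of a missing `axes` argument assumes the NumPy convention that `s` then applies to the last `len(s)` axes.

```python
def s_matches_shape(shape, s=None, axes=None):
    """Return True when `s` asks for no padding or trimming along `axes`.

    Hypothetical helper for illustration only; not part of mkl_fft.
    """
    if s is None:
        return True
    ndim = len(shape)
    if axes is None:
        # Assumed NumPy convention: s applies to the last len(s) axes.
        if len(s) > ndim:
            raise ValueError("s has more entries than the array has axes")
        axes = range(ndim - len(s), ndim)
    axes = [ax % ndim for ax in axes]  # normalize negative axes
    if len(s) != len(axes):
        raise ValueError("s and axes must have the same length")
    # No padding or trimming is needed when every requested length
    # equals the existing length along the corresponding axis.
    return all(n == shape[ax] for n, ax in zip(s, axes))


shape = (1, 200, 200, 200)
print(s_matches_shape(shape, s=shape[1:], axes=(1, 2, 3)))       # True
print(s_matches_shape(shape, s=(200, 200, 256), axes=(1, 2, 3)))  # False
```

When the check returns True, a caller could treat the call as if `s=None` had been passed and dispatch to the faster routine; otherwise it would fall back to the padding/trimming path.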

After these code changes, performance aligns better with expectations:

In [1]: import numpy as np
   ...: import mkl_fft
   ...: N = 200
   ...: A = np.random.random((1,N,N,N)).astype(np.complex128)

In [2]: %timeit mkl_fft.interfaces.numpy_fft.fftn(A, axes=(1,2,3))
8.28 ms ± 551 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [3]: %timeit mkl_fft.fftn(A, s=A.shape[1:], axes=(1,2,3))
The slowest run took 4.49 times longer than the fastest. This could mean that an intermediate result is being cached.
9.92 ms ± 7.49 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [4]: %timeit mkl_fft.fftn(A, axes=(1,2,3))
6.49 ms ± 60.7 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Test system: dual-socket Xeon Platinum 8268 server.

@intel-python-devops

Can one of the admins verify this patch?


chillenb commented Mar 3, 2026

Oops, I didn't realize that the CI would have to be approved again after fixing the whitespace. Sorry!

The tests previously approved did run and pass, though.

@ndgrigorian (Collaborator) commented

> Oops, I didn't realize that the CI would have to be approved again after fixing the whitespace. Sorry!
>
> The tests previously approved did run and pass, though.

It's no problem, thanks for this contribution to the project. :)

I'm not sure if any of our tests currently cover this case and compare with e.g. numpy, so adding a test would be good too.
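
One possible shape for such a test is sketched below. It is only a suggestion, not an existing test in the repository: the function name `check_fftn_s_equals_shape` and the tolerances are assumptions, and the FFT implementation is passed in as a parameter so that the sketch runs with NumPy alone. In the real test suite one would presumably pass `mkl_fft.fftn` or `mkl_fft.interfaces.numpy_fft.fftn` instead.

```python
import numpy as np


def check_fftn_s_equals_shape(fftn, shape=(1, 6, 5, 4), axes=(1, 2, 3), seed=0):
    """Compare `fftn` against np.fft.fftn when s equals the input shape.

    Hypothetical helper: `fftn` is any callable accepting (a, s=..., axes=...).
    """
    rng = np.random.default_rng(seed)
    a = rng.random(shape) + 1j * rng.random(shape)
    # s repeats the existing lengths, so no padding or trimming is requested.
    s = tuple(a.shape[ax] for ax in axes)
    expected = np.fft.fftn(a, axes=axes)
    # Explicit s (the case this PR moves to the fast path) ...
    np.testing.assert_allclose(fftn(a, s=s, axes=axes), expected,
                               rtol=1e-10, atol=1e-10)
    # ... and s omitted must both agree with the NumPy reference.
    np.testing.assert_allclose(fftn(a, axes=axes), expected,
                               rtol=1e-10, atol=1e-10)
    return True


# Exercised with NumPy itself here so the sketch runs without mkl_fft installed.
check_fftn_s_equals_shape(np.fft.fftn)
```

Using small, unequal axis lengths (6, 5, 4) keeps the test fast and would also catch an implementation that mixes up the order of the axes.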
