Improving uarray performance

Published: 07/08/2019

What did you do this week?

I have been focused on my uarray PR (uarray#1780). uarray defines a protocol for dispatching function calls to multiple different backend implementations. In my PR, I've been reimplementing the core function dispatch mechanism in C++ using the Python C-API. This week I moved the backend registration system into C++, which means the protocol is now implemented 100% in C++. This has brought the overhead down from ~5 µs per function call to just ~700 ns, which is still about 10 times the cost of a normal Python function call. This overhead was one of the main blockers for the adoption of uarray, so it is very nice to see it come down.
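To make the idea of backend dispatch concrete, here is a toy pure-Python sketch. It is not uarray's actual API or implementation: the registry, `register_backend`, `multimethod`, and both backend classes are hypothetical names invented for illustration. The loop inside `dispatch` is the kind of per-call work that shows up as dispatch overhead.

```python
# Illustrative only: a toy dispatcher, not uarray's real protocol.
_backends = []  # hypothetical registry; most recently registered backend first


def register_backend(backend):
    """Add a backend; later registrations are tried first."""
    _backends.insert(0, backend)


def multimethod(name):
    """Create a function that forwards calls to the first backend implementing `name`."""
    def dispatch(*args, **kwargs):
        for backend in _backends:
            impl = getattr(backend, name, None)
            if impl is None:
                continue  # this backend does not provide the function at all
            result = impl(*args, **kwargs)
            if result is not NotImplemented:
                return result  # the backend accepted the call
        raise TypeError(f"no registered backend implements {name!r}")
    return dispatch


class ListBackend:
    """Hypothetical backend that only handles Python lists."""
    @staticmethod
    def total(values):
        if not isinstance(values, list):
            return NotImplemented
        return sum(values)


class TupleBackend:
    """Hypothetical backend that only handles tuples."""
    @staticmethod
    def total(values):
        if not isinstance(values, tuple):
            return NotImplemented
        return sum(values)


total = multimethod("total")
register_backend(ListBackend)
register_backend(TupleBackend)
```

Calling `total([1, 2, 3])` falls through `TupleBackend` and is handled by `ListBackend`, while `total((4, 5))` is handled by `TupleBackend` directly. Every call pays for the registry walk and the extra indirection, which is why the cost of the dispatch path matters.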
I've also updated the vendored version of pypocketfft in scipy#10393. This new version includes a small cache for the FFT "twiddle factors", which I helped implement. This improves benchmarks by ~20% in most cases and by as much as 60% for some input sizes.
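As a rough illustration of why caching twiddle factors helps, the sketch below memoizes the complex roots of unity for a given transform length and reuses them in a naive DFT. This is not pocketfft's implementation: the cache size of 16 and the function names are assumptions made for the example, and a real FFT would use a fast algorithm rather than this O(n^2) loop.

```python
import cmath
from functools import lru_cache


@lru_cache(maxsize=16)  # assumed cache size, chosen only for illustration
def twiddles(n):
    """Return the n roots of unity exp(-2j*pi*k/n) used by a length-n DFT."""
    return tuple(cmath.exp(-2j * cmath.pi * k / n) for k in range(n))


def dft(x):
    """Naive O(n^2) DFT that looks up cached twiddle factors."""
    n = len(x)
    w = twiddles(n)  # computed once per length, then served from the cache
    return [sum(x[j] * w[(j * k) % n] for j in range(n)) for k in range(n)]
```

Repeated transforms of the same length skip the trigonometric evaluation entirely, which is where the saving comes from when many same-sized FFTs are computed in a row.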

What is coming up next?

My uarray PR was merged over the weekend, so I can now update my scipy.fft code and the benchmarks there. I also plan on using the new version of pypocketfft to add support for Hermitian FFTs (like numpy's hfft). This would make scipy.fft a complete replacement for numpy.fft's functionality.
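For readers unfamiliar with Hermitian FFTs: the input is a signal with Hermitian symmetry (x[n - k] is the complex conjugate of x[k]), so only the first n//2 + 1 entries need to be stored and the transform comes out real. The pure-Python sketch below shows the idea with a naive DFT; the helper names are hypothetical, and it is meant only to illustrate the kind of transform numpy's hfft computes, not how pypocketfft or scipy.fft implement it.

```python
import cmath


def dft(x):
    """Naive O(n^2) forward DFT."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


def hermitian_extend(half, n):
    """Rebuild a length-n Hermitian-symmetric sequence from its first n//2 + 1 entries."""
    full = list(half[: n // 2 + 1])
    # The remaining entries follow from the symmetry x[n - k] == conj(x[k]).
    full += [half[n - k].conjugate() for k in range(n // 2 + 1, n)]
    return full


def naive_hfft(half, n):
    """Transform a Hermitian-symmetric signal given by its first half; the result is real."""
    return [value.real for value in dft(hermitian_extend(half, n))]


# Example: a length-4 signal described by its first 3 entries.
spectrum = naive_hfft([1 + 0j, 2 - 1j, 3 + 0j], 4)
```

The imaginary parts discarded in `naive_hfft` are zero up to rounding error precisely because of the symmetry, which is what lets a dedicated Hermitian transform do roughly half the work of a general complex FFT.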

I can also work on adding pre-planned transforms to the scipy.fft interface. This would also require making pocketfft's plan caching much more flexible, so I expect we can add user configuration options to our automatic plan caching.

Did you get stuck anywhere?

No significant blockers this week.