As earlier, I profiled the execution time of the code of non-rigid registration and denoising with local-PCA in DIPY, and found out that the most time-consuming parts are all nested for loops. Two of them are implemented in Cython, and the other one needs to be cythonized. Then we can try to improve their performance with OpenMP. In order to do this, I need to learn and do some experiments on Cython and OpenMP.
I followed Cython Tutorials and Users Guide to learn Cython, and mainly focused on how to make faster code.
First I learned how to compile Cython code and then I learned how to set annotate parameter to show the analysis of the compiled code. In the analysis, the white lines mean they don’t interact with Python, and you can safely release the GIL and implement them into multithreading algorithm. And yellow lines mean they are interacting with Python, the darker the more Python interactions. There are several ways to make these lines lighter, even make them white, and hence improve their performance.
First of all, you can static type the parameters of functions and the local variables. Second you can turn off security checks, like nonecheck, boundscheck, and wraparound, etc. Also, you can use efficient indexing or memory views for faster array lookups and assignments.
Then I did some experiments on cython.parallel module, which is a parallelism supporting OpenMP. I tried a simple example without write conflict. I used a double cores CPU with 4 threads to test it. It gives me 12% speed up.
Note that the second picture shows that with OpenMP, it took all 4 threads to executing the code. (See thread 2 and 4, there are more occupation when using OpenMP).
I also did some experiments on a more complicated example with some write conflict. In this situation we might need to use a locker to force some lines of code to be executed thread by thread, not concurrently. However, I still cannot show the reasonable result.
You can find my experiments on this repo.