Learning Cython and OpenMP

As earlier, I profiled the execution time of the code of non-rigid registration and denoising with local-PCA in DIPY, and found out that the most time-consuming parts are all nested for loops. Two of them are implemented in Cython, and the other one needs to be cythonized. Then we can try to improve their performance with OpenMP. In order to do this, I need to learn and do some experiments on Cython and OpenMP.

I followed Cython Tutorials and Users Guide to learn Cython, and mainly focused on how to make faster code.

First I learned how to compile Cython code and then I learned how to set annotate parameter to show the analysis of the compiled code. In the analysis, the white lines mean they don’t interact with Python, and you can safely release the GIL and implement them into multithreading algorithm. And yellow lines mean they are interacting with Python, the darker the more Python interactions. There are several ways to make these lines lighter, even make them white, and hence improve their performance.

First of all, you can static type the parameters of functions and the local variables. Second you can turn off security checks, like nonecheck, boundscheck, and wraparound, etc. Also, you can use efficient indexing or memory views for faster array lookups and assignments.

Then I did some experiments on cython.parallel module, which is a parallelism supporting OpenMP. I tried a simple example without write conflict. I used a double cores CPU with 4 threads to test it. It gives me 12% speed up.

Simple example without OpenMP
Simple example with OpenMP

Note that the second picture shows that with OpenMP, it took all 4 threads to executing the code. (See thread 2 and 4, there are more occupation when using OpenMP).

I also did some experiments on a more complicated example with some write conflict. In this situation we might need to use a locker to force some lines of code to be executed thread by thread, not concurrently. However, I still cannot show the reasonable result.

You can find my experiments on this repo.

Profiling run time of registration and local-PCA

My project is trying to speed up the execution time of registration algorithm. So at first, I need to find out which function is most time-consuming. In this week, I use cProfile and snakeviz to profile the run time.

  • At first, I profile the example of 3D affine registration from DIPY tutorial.
profile of affine registration – combine

This example contains three types of affine registration: translation, rigid transformation, and affine transformation. So, I divide it into three small pieces respectively.

profile of affine registration – translation
profile of affine registration – rigid transformation
profile of affine registration – affine tranformation

The cumulative and run time of them is:

cumulative time of affine registration
run time of affine registration

As we can see, there are three major time consuming functions: gradient(), _update_mutual_information(), and _update_histogram().

  1. For the first function – gradient(), it’s the most time consuming one, due to nested for loops in _gradient_2d() or _gradient_3d() in vector_fields.pyx. It’s already in Cython with nogil, but it still costs a lot time. So we need to speed it up with OpenMP.
  2. For the second function, we can see it’s fast when using only translation; while it’s slow when using rigid or affine transformation. So the runtime depends on the complexity of transformation matrix. And this is understandable, we can just leave it as it is.
  3. For the last function, cumulative time per optimization is constant. And run time per call is small. So no future action need to be done at this moment.

So, for affine registration, we can work on the nested loops of _gradient_2d() and _gradient_3d() in vector_fields.pyx to improve the execution time.

nested loops in _gradient_3d() and _gradient_2d()
  • Secondly, I profile the example of 2D and 3D diffeomorphic (SyN) registration from DIPY tutorial.
profile of 2D SyN registration
profile of 3D SyN registration

Agian, I separate the 2D example into 2 parts: Circle-To-C experiment and b0 image.

profile of 2D SyN registration – Circle-To-C
profile of 2D SyN registration – b0 image

The cumulative and run time of them is:

cumulative time of SyN registration
run time of SyN registration

Here we  can see, there are three major time consuming functions: median_otsu(), optimize(), and _find_and_load().

  1. The first function is for brain extraction. We use it to remove the skull from b0 volume. As shown in the profile, this is the most expensive function. But it’s out of my concern right now.
  2. For the second function – optimize(), it’s slow in 3D case, due to the nested loop in invert_vector_field_fixed_point_3d() in vector_fields.pyx. There is also nested loop in 2D case in invert_vector_field_fixed_point_2d(), but the speed is acceptable. Also it’s already in Cython with nogil, but it still takes time. As the same before, we need to speed it up with OpenMP.
  3. For the third function, the run time per call is small. So, it’s no need to optimize.

So again, for diffeomorphic (SyN) registration, we can work on the nested loop of invert_vector_field_fixed_point_3d() in vector_fields.pyx for optimization.

nested loop in invert_vector_field_fixed_point_3d()
  • At last, I profile the example of denoise by local-PCA algorithm from DIPY tutorial.
profile of denoising by local-PCA – using eig
profile of denoising by local-PCA – using svd

Both of these take time because nested loop in localpca() in localpca.py. As the same, we can work on this nested loop to improve the performance.

nested loops in localpca()

As a conclusion, there is nested loop in affine registration, diffeomorphic registration, and denoising by local-PCA, which increase the run time dramatically. So my major job is to implement the nested loop using OpenMP to increase the speed of execution.

My first try on implementing nested for loop using OpenMP is on this branch.

Intro to myself and my project

 

It’s really excited to get accepted into Google Summer of Code project. I will make contribution to DIPY (Diffusion MRI analysis) in Python. I’m a master student in computer science at Indiana University. Right now, I’m trying to merge my interest in both medical physics and computer science, more specifically to apply machine learning and other technology of computing to medical imaging and radiotherapy. So I think this project will be good experience for me.

My project is to implement multi-thread capacities for nonrigid registration in DIPY using OpenMP. In DIPY, it implements a broad range of algorithms for denoising, registration, reconstruction, fiber tracking, clustering, visualization, and statistical analysis of MRI data. For nonrigid image registration, the major goal is to calculate a deffeomorphic transformation to register two 2D images or 3D volumes. To achieve this, we need to do a lot of iterations to maximize the similarity metric, thus it slow down the optimization process.

To speed up the calculation, one solution is to convert to a parallel computing algorithm. The OpenMP API supports multi-platform shared-memory parallel programming in C/C++ and Fortran. Due to Global Interpreter Lock (GIL), there is no point to use threads for CPU intensive tasks in Python. Fortunately, with the help of Cython, we can release GIL during computations. So in this project, I will implement a multi-thread parallel algorithm using OpenMP and Cython to improve the performance of nonrigid image registration in DIPY.

It’s amazing that I just started to learn how to code in Python, and right now I have made some contributions to the open source software of DIPY in Python. See my pull requests before applying for this project. Also It’s really lucky for me to get my name (Ricci Woo) added into the contributors of DIPY release 0.14. In this bonding period, I learned basic principle in Diffusion MRI and image processing. Also I learned how to profile runtime of functions in Python to see which function is the most expensive one, then I can work on improving performance on it.

Oh, I forgot to tell my name. That is Xinquan Wu (Legal), or you can call me Ricci Woo (Nickname). You can contact me at email: ricciwoo@gmail.com. I will push my code to my fork of DIPY on GitHub.