Speedup per thread

Last week, I added a timer to the simple parallel example to measure the time spent in the multi-threaded part. It turned out that in this example, the multi-threaded section accounted for only around one third of the total time. I then ran the example on the server with 24 cores (48 threads) to investigate speedup and efficiency with various numbers of threads.

Speedup per thread

The table shows the speedup achieved by multi-threading. For example, with 2 threads it yields a 1.92x speedup over a single thread; with 18 threads, a 17.05x speedup; and with all 48 threads, only a 32.59x speedup. This is just what we expected. It is also good news that we can now be confident of getting substantial speedups by implementing a multi-threaded algorithm with OpenMP.

I also investigated the differences among static, dynamic, and guided scheduling, but they made no difference to the multi-threading performance in this example.
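For reference, the scheduling policy is just a keyword argument of prange; a minimal sketch (the function and data here are made up for illustration) looks like this:

```cython
# sketch.pyx -- hypothetical example of OpenMP scheduling in Cython
from cython.parallel import prange

def scaled_sum(double[:] data):
    cdef Py_ssize_t i
    cdef double total = 0.0
    # schedule can be 'static', 'dynamic', or 'guided'; it controls how
    # chunks of iterations are handed out to the OpenMP threads
    for i in prange(data.shape[0], nogil=True, schedule='static'):
        total += data[i] * 2.0  # inferred as a reduction by Cython
    return total
```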

We use a lot of memory views in the Cython code for affine registration, and I was wondering how they would behave under multi-threaded parallelism. So I investigated the C code that Cython generates for the simple example.

Cython code of memory view ‘out’
C code of memory view ‘out’

In different threads, the memory view is accessed through a different index. This causes a problem when parallelizing the affine registration code (line 3109).

Memory view of affine registration

Here we can see that multiple threads write to the same arrays ‘x’, ‘dx’, and ‘q’, which creates a write conflict among the threads.

As discussed with my mentors, we are considering two methods to solve this problem. One is to move the multi-threaded block into a function, so that each thread works on its own local variables. The other is to add a lock around this part.
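A rough sketch of the first method, with invented names and a toy computation: moving the loop body into a nogil helper function makes x, dx, and q local variables, so each thread gets its own private copies and the write conflict disappears:

```cython
# sketch.pyx -- hypothetical illustration, not the actual DIPY code
from cython.parallel import prange

cdef void process_row(double[:, :] out, Py_ssize_t i) nogil:
    # x, dx, q are locals here, so they are private to the calling thread
    cdef double x, dx, q
    cdef Py_ssize_t j
    for j in range(out.shape[1]):
        x = i * 1.0
        dx = j * 0.5
        q = x + dx
        out[i, j] = q  # each thread writes to a distinct row

def run(double[:, :] out):
    cdef Py_ssize_t i
    for i in prange(out.shape[0], nogil=True):
        process_row(out, i)
```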

This week, I will implement both methods and investigate which one works better.

Investigating performance of multithreads

Last week, to complete the experiments on OpenMP, I made an example with a write conflict among multiple threads and then solved the problem with a lock.

Then I started to investigate multi-threading performance. I ran the simple example on my MacBook Air with 2 cores (4 threads), which yielded only a 12% speedup, with each thread never exceeding 75% utilization; that might be due to some protective mechanism of the system. I then tested on other hardware to make sure I could run the example with 100% utilization on each thread: a MacBook Pro with 4 cores (8 threads), a workstation with 12 cores (24 threads), and the server with 24 cores (48 threads). All of these systems reached 100% utilization. However, before reaching full utilization, I had to wait a long time for the execution to ramp up, which slowed it down, even below the single-threaded version. My guess is that this is caused by OpenMP's scheduling. This week, I will try reducing the number of concurrent threads and investigate OpenMP's scheduling mechanism.

Also last week, I investigated the code of affine and diffeomorphic registration in DIPY. I realized that, to implement a multi-threaded algorithm for them, affine registration needs no lock, while diffeomorphic registration does need one. So I tried implementing multi-threading in affine registration and obtained a 36% speedup on my MacBook Air with 2 cores (4 threads). You can find it in this branch.

So this week, I will try to figure out how to get full utilization on each thread. I will also work on multi-threaded algorithms for affine and diffeomorphic registration.

Learning Cython and OpenMP

As mentioned earlier, I profiled the execution time of the non-rigid registration and local-PCA denoising code in DIPY and found that the most time-consuming parts are all nested for loops. Two of them are already implemented in Cython, and the third needs to be cythonized. Then we can try to improve their performance with OpenMP. To do this, I need to learn and experiment with Cython and OpenMP.

I followed the Cython Tutorials and Users Guide to learn Cython, focusing mainly on how to write faster code.

First I learned how to compile Cython code, and then how to set the annotate parameter to show an analysis of the compiled code. In the analysis, white lines do not interact with Python, so you can safely release the GIL and run them in a multi-threaded algorithm. Yellow lines interact with Python; the darker the shade, the more Python interaction. There are several ways to make these lines lighter, or even white, and hence improve their performance.

First of all, you can statically type function parameters and local variables. Second, you can turn off safety checks such as nonecheck, boundscheck, and wraparound. Also, you can use efficient indexing or memory views for faster array lookups and assignments.
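A small sketch combining these techniques (the function itself is invented for illustration) might look like this:

```cython
# cython: boundscheck=False, wraparound=False, nonecheck=False
# The directives above turn off bounds checking, negative-index
# wrapping, and None checks for the whole module.

def scale(double[:] a, double factor):
    cdef Py_ssize_t i          # statically typed loop index
    for i in range(a.shape[0]):
        a[i] = a[i] * factor   # memory-view indexing compiles to raw C
```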

Then I experimented with the cython.parallel module, which provides OpenMP-based parallelism. I tried a simple example without a write conflict, testing on a dual-core CPU with 4 threads, which gave me a 12% speedup.
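The example is roughly of this shape (a sketch, not the exact code): each iteration writes to a distinct element, so no conflict can arise:

```cython
# sketch.pyx -- hypothetical conflict-free example
from cython.parallel import prange

def square_inplace(double[:] a):
    cdef Py_ssize_t i
    # nogil=True releases the GIL so the OpenMP threads can run
    # concurrently; each i touches a distinct element of a
    for i in prange(a.shape[0], nogil=True):
        a[i] = a[i] * a[i]
```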

Simple example without OpenMP
Simple example with OpenMP

Note that the second picture shows that with OpenMP, all 4 threads are used to execute the code (see threads 2 and 4, which show higher utilization when OpenMP is used).

I also experimented with a more complicated example that has a write conflict. In this situation we might need a lock to force some lines of code to be executed thread by thread rather than concurrently. However, I have not yet obtained a reasonable result.
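For reference, here is a rough sketch of how such a lock could look, using the openmp declarations that Cython ships; the histogram-style update is an invented stand-in for the real conflicting code:

```cython
# sketch.pyx -- hypothetical example of guarding a write conflict
from cython.parallel import prange
cimport openmp

def count_labels(int[:] labels, long[:] counts):
    cdef openmp.omp_lock_t lock
    cdef Py_ssize_t i
    openmp.omp_init_lock(&lock)
    for i in prange(labels.shape[0], nogil=True):
        # several threads may hit the same bin, so serialize the update
        openmp.omp_set_lock(&lock)
        counts[labels[i]] += 1
        openmp.omp_unset_lock(&lock)
    openmp.omp_destroy_lock(&lock)
```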

You can find my experiments in this repo.

Profiling run time of registration and local-PCA

My project aims to speed up the execution of the registration algorithms, so first I need to find out which functions are the most time-consuming. This week, I used cProfile and snakeviz to profile the run time.
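The basic workflow is the standard cProfile one; here is a minimal sketch with made-up functions, sorting the report by cumulative time the way I do before handing the dump to snakeviz:

```python
# Minimal profiling sketch: cProfile collects the stats,
# pstats sorts them by cumulative time (functions are invented).
import cProfile
import io
import pstats

def inner(n):
    return sum(i * i for i in range(n))

def outer():
    return [inner(10_000) for _ in range(50)]

profiler = cProfile.Profile()
profiler.enable()
outer()
profiler.disable()

stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.sort_stats("cumulative").print_stats(5)  # top 5 by cumulative time
report = stream.getvalue()
```

The same stats can be saved with `stats.dump_stats("out.prof")` and opened in snakeviz for an interactive view.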

  • At first, I profiled the example of 3D affine registration from the DIPY tutorial.
profile of affine registration – combine

This example contains three types of affine registration: translation, rigid transformation, and affine transformation. So I divided it into three pieces accordingly.

profile of affine registration – translation
profile of affine registration – rigid transformation
profile of affine registration – affine transformation

The cumulative and run times are:

cumulative time of affine registration
run time of affine registration

As we can see, there are three major time-consuming functions: gradient(), _update_mutual_information(), and _update_histogram().

  1. The first function, gradient(), is the most time-consuming, due to the nested for loops in _gradient_2d() and _gradient_3d() in vector_fields.pyx. It is already in Cython with nogil, but it still costs a lot of time, so we need to speed it up with OpenMP.
  2. The second function is fast when using only translation, but slow with rigid or affine transformation, so its run time depends on the complexity of the transformation matrix. This is understandable, and we can leave it as it is.
  3. For the last function, the cumulative time per optimization is constant and the run time per call is small, so no action is needed at this moment.

So, for affine registration, we can work on the nested loops of _gradient_2d() and _gradient_3d() in vector_fields.pyx to improve execution time.

nested loops in _gradient_3d() and _gradient_2d()
  • Secondly, I profiled the example of 2D and 3D diffeomorphic (SyN) registration from the DIPY tutorial.
profile of 2D SyN registration
profile of 3D SyN registration

Again, I separated the 2D example into two parts: the Circle-To-C experiment and the b0 image.

profile of 2D SyN registration – Circle-To-C
profile of 2D SyN registration – b0 image

The cumulative and run times are:

cumulative time of SyN registration
run time of SyN registration

Here we can see three major time-consuming functions: median_otsu(), optimize(), and _find_and_load().

  1. The first function performs brain extraction; we use it to remove the skull from the b0 volume. As shown in the profile, it is the most expensive function, but it is outside my scope right now.
  2. The second function, optimize(), is slow in the 3D case, due to the nested loop in invert_vector_field_fixed_point_3d() in vector_fields.pyx. There is also a nested loop in the 2D case, in invert_vector_field_fixed_point_2d(), but its speed is acceptable. This code is also already in Cython with nogil, yet it still takes time, so as before, we need to speed it up with OpenMP.
  3. For the third function, the run time per call is small, so there is no need to optimize it.

So again, for diffeomorphic (SyN) registration, we can work on the nested loop of invert_vector_field_fixed_point_3d() in vector_fields.pyx for optimization.

nested loop in invert_vector_field_fixed_point_3d()
  • At last, I profiled the example of denoising with the local-PCA algorithm from the DIPY tutorial.
profile of denoising by local-PCA – using eig
profile of denoising by local-PCA – using svd

Both of these take time because of the nested loop in localpca() in localpca.py. As before, we can work on this nested loop to improve performance.

nested loops in localpca()

In conclusion, there are nested loops in affine registration, diffeomorphic registration, and local-PCA denoising that increase the run time dramatically. So my main job is to parallelize these nested loops with OpenMP to speed up execution.

My first try at implementing a nested for loop using OpenMP is on this branch.

Intro to myself and my project


It’s really exciting to be accepted into Google Summer of Code. I will be contributing to DIPY (Diffusion MRI analysis) in Python. I’m a master’s student in computer science at Indiana University. Right now, I’m trying to merge my interests in medical physics and computer science, more specifically to apply machine learning and other computing technologies to medical imaging and radiotherapy, so I think this project will be a good experience for me.

My project is to implement multi-threading for non-rigid registration in DIPY using OpenMP. DIPY implements a broad range of algorithms for denoising, registration, reconstruction, fiber tracking, clustering, visualization, and statistical analysis of MRI data. For non-rigid image registration, the main goal is to compute a diffeomorphic transformation that registers two 2D images or 3D volumes. This requires many iterations to maximize the similarity metric, which slows down the optimization process.

To speed up the calculation, one solution is to convert it to a parallel algorithm. The OpenMP API supports multi-platform shared-memory parallel programming in C/C++ and Fortran. Because of the Global Interpreter Lock (GIL), there is no point in using threads for CPU-intensive tasks in pure Python. Fortunately, with the help of Cython, we can release the GIL during computations. So in this project, I will implement a multi-threaded parallel algorithm using OpenMP and Cython to improve the performance of non-rigid image registration in DIPY.

It’s amazing that I only recently started learning Python, and I have already made some contributions to the open-source DIPY project. See my pull requests from before applying for this project. I was also lucky to have my name (Ricci Woo) added to the contributors of DIPY release 0.14. During the bonding period, I learned the basic principles of diffusion MRI and image processing. I also learned how to profile the run time of Python functions to see which one is the most expensive, so I can work on improving its performance.

Oh, I forgot to introduce myself. My legal name is Xinquan Wu, but you can call me Ricci Woo. You can contact me at ricciwoo@gmail.com. I will push my code to my fork of DIPY on GitHub.