Final report of GSoC 2018

Title of the project

Implementing multi-thread capacities for nonrigid registration in DIPY using OpenMP


The image registration code in DIPY is widely used. Nonrigid registration comes in two types: affine registration and diffeomorphic registration. The existing code is single-threaded, so parallelizing it with multithreading can help reduce execution time. Some of the functions are already efficient or easy to parallelize. However, some functions may write to the same address from different threads and therefore need locks, and hence cannot be sped up as much.

The state of the code before my work

For affine registration, running an example of 3D affine registration takes 66.1 s on a server with 24 cores (48 threads). Three functions take most of the time: 1. gradient takes 26.5 s (40%); 2. _update_mutual_information takes 22.6 s (34.2%); 3. _update_histogram takes 10.3 s (15.6%).

profile of original 3d affine registration

For diffeomorphic registration, running an example of 3D diffeomorphic registration takes 12.2 s on the same server with 24 cores (48 threads). Four functions take most of the time: 1. __invert_models takes 3.77 s (30.9%); 2. transform_inverse takes 1.14 s (9.3%); 3. update takes 0.833 s (6.8%); 4. initialize_iteration takes 0.821 s (6.7%).

profile of 3d diffeomorphic registration

Parallelized affine registration

Carefully investigating the functions called from those slow paths shows that the functions with nested for loops take most of the time. There are three such time-consuming functions: 1. _gradient_3d in vector_fields.pyx; 2. _compute_pdfs_dense_3d in parzenhist.pyx; 3. _joint_pdf_gradient_dense_3d in parzenhist.pyx.

To parallelize them, we first need to cythonize them, that is, make them cdef and nogil, and then parallelize them by replacing range in the for loop with prange from the cython.parallel module. I tried to parallelize each of them.
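The transformation is the same in each case; a minimal sketch of the pattern, using a hypothetical kernel in place of the real DIPY functions, might look like this (the function name and array are made up for illustration; compiling requires the OpenMP flags shown in the distutils comments):

```cython
# distutils: extra_compile_args=-fopenmp
# distutils: extra_link_args=-fopenmp
cimport cython
from cython.parallel import prange

@cython.boundscheck(False)
@cython.wraparound(False)
def scale(double[:] data, double factor):
    cdef Py_ssize_t i
    # prange replaces range and releases the GIL; iterations are
    # distributed across OpenMP threads. This is safe here because
    # each iteration writes to a distinct element of the array.
    for i in prange(data.shape[0], nogil=True):
        data[i] = data[i] * factor
```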

speedup of single call to the function in affine registration

The results show: 1. the parallelized _gradient_3d gives a 7.47 times speedup; 2. the parallelized _compute_pdfs_dense_3d actually slows down; 3. the parallelized _joint_pdf_gradient_dense_3d gives a 2.68 times speedup on a single call, but slows down over multiple calls and even failed some tests in DIPY's continuous integration. In the latter two functions, different threads may write to the same address, which can cause conflicts, so we need a lock to resolve the problem. However, the lock slows everything down.

So, for the best performance, I cythonized _gradient_* in vector_fields.pyx and _compute_pdfs_* in parzenhist.pyx, and only parallelized _gradient_3d; see pull request #1612.

profile of parallelized 3d affine registration

Here we can see that gradient drops from 26.5 s to 3.79 s, a 6.99 times speedup.

Parallelized diffeomorphic registration

For diffeomorphic registration, there are 5 time-consuming functions with nested for loops: 1. _compose_vector_fields_3d, 2. invert_vector_field_fixed_point_3d, 3. warp_3d, 4. reorient_vector_field_3d, all in vector_fields.pyx, and 5. precompute_cc_factors_3d in crosscorr.pyx. Again, I tried to parallelize each of them.

speedup of single call to the function in diffeomorphic registration

Here, two of the functions give us a speedup: _compose_vector_fields_3d and warp_3d. I parallelized only these two functions; see pull request #1613.

profile of parallelized 3d diffeomorphic registration

Here we can see there is no difference between the original and the parallelized version.

Challenges and experiences

1. how to use Cython and OpenMP to parallelize the code

I followed Cython's documentation to learn how to use Cython, especially the cython.parallel module, which provides native parallelism via OpenMP. I ran many experiments in this repository to investigate: the difference between def and cdef functions, the performance with different numbers of threads, the differences among the three scheduling strategies, and the use of a lock.
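For the lock experiments, the pattern I tested can be sketched as follows (a toy counter, not DIPY code; the function name and data are invented for illustration). A one-element C array is used as the shared accumulator, since C arrays and pointers stay shared across threads:

```cython
from cython.parallel import prange
cimport openmp

def count_evens(long[:] values):
    cdef Py_ssize_t i
    cdef long total[1]              # C array: shared among all threads
    cdef openmp.omp_lock_t lock
    total[0] = 0
    openmp.omp_init_lock(&lock)
    with nogil:
        for i in prange(values.shape[0]):
            if values[i] % 2 == 0:
                # Serialize the shared write; this serialization is
                # exactly what erases the speedup when such updates
                # dominate the loop body.
                openmp.omp_set_lock(&lock)
                total[0] += 1
                openmp.omp_unset_lock(&lock)
    openmp.omp_destroy_lock(&lock)
    return total[0]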

2. speedup performance with respect to different number of threads

The performance of this simple example shows we can get a 32.59 times speedup using 48 threads.

speedup of simple example in different # of threads

However, the speedup of _gradient_3d drops sharply as the number of threads grows.

speedup of _gradient_3d in different # of threads

We can see that even with 48 threads, we only get an 8.033 times speedup. This may be because launching a large number of threads carries significant overhead once the code becomes more complicated.
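One standard way to reason about this ceiling is Amdahl's law (a bound, not something measured here): if a fraction $p$ of the runtime is parallelizable, the speedup with $N$ threads is at most

```latex
S(N) = \frac{1}{(1 - p) + \frac{p}{N}},
\qquad \lim_{N \to \infty} S(N) = \frac{1}{1 - p}
```

For example, $p = 0.9$ caps the speedup at $10\times$ no matter how many threads are used, and an $8.033\times$ speedup at 48 threads is consistent with a parallel fraction of roughly 0.9, before even counting thread-launch overhead.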

3. no significant difference among static, dynamic and guided scheduling

different scheduling of simple example
different scheduling of _gradient_3d

The scheduling comparisons for the simple example and _gradient_3d show no significant difference among the three strategies. Static scheduling may be the worst choice; both dynamic and guided scheduling work well.

Conclusion and discussion

The idea that more threads yield more speedup is appealing. However, due to the complexity of the code, it may not deliver the speedup you expect. In the simple case, each iteration of the for loop is independent of the others; such loops are easily parallelized with prange and gain a large speedup. In more complicated situations, iterations affect one another, for example by writing to the same variable or the same address in memory. Such loops are not easy to parallelize: sometimes you need a lock to resolve the write conflict, and sometimes it is hard to keep every thread fully utilized. All of this can slow your code down even after you parallelize it. So multithreaded parallelism is not easy. To go further in this direction, my suggestion is to dig into OpenMP and try using it directly instead of through the cython.parallel module.

The Code of my work

1. PR for parallelized affine registration:

2. PR for parallelized diffeomorphic registration:

To test my code

1. install DIPY from my fork of DIPY repository:

git clone

cd dipy

git clean -fxd

pip install --user -e .

2. download code for testing

cd ..

git clone

cd registration

3. profile the original example of both 3d affine and diffeomorphic registration

python -m cProfile -o

snakeviz # this shows profile of original 3d affine registration

python # this generate data for 3d diffeomorphic registration

python -m cProfile -o

snakeviz # this shows profile of original 3d diffeomorphic registration

4. profile the parallelized version of 3d affine registration

cd ../dipy

git checkout --track origin/aff-par-cpdf-grad

pip install --user -e .

cd ../registration

python -m cProfile -o

snakeviz # this shows profile of parallelized 3d affine registration

5. profile the parallelized version of 3d diffeomorphic registration

cd ../dipy

git checkout --track origin/syn-par-comp-warp

pip install --user -e .

cd ../registration

python -m cProfile -o

snakeviz # this shows profile of parallelized 3d diffeomorphic registration

6. if you test on a headless cluster, replace and with and

Parallelized affine registration

In the last two weeks, I used line_profiler (kernprof) to profile the function _update_mutual_information() line by line while running the affine registration example.

We can see the most time-consuming functions are: 1. gradient() – 43.5%; 2. update_gradient_dense() – 38.5%; 3. _update_histogram() – 17.8%.

For the first function, the actual time-consuming part is inside _gradient_3d() in vector_fields.pyx. I cythonized it, making it cdef void and nogil, which gave a 1.47 times speedup. Then I parallelized it using prange and a local function, for a total speedup of 6.48 times.

For the second function, the actual time-consuming part is inside _joint_pdf_gradient_dense_3d() in parzenhist.pyx. I made it void and nogil, which gave a 3.05 times speedup. But parallelizing it with prange slowed it down, so I didn't do that.

For the third one, the real time-consuming part is in _compute_pdfs_dense_3d() in parzenhist.pyx. Again, I made it void and nogil; the runtime was almost the same as before. However, parallelizing it with prange again slowed it down, so, as with the second one, I didn't do this.

Then I went on to try to parallelize the diffeomorphic registration. I first profiled the example of diffeomorphic registration, then did the line profiling. The most time-consuming function is __invert_models() – 47%, and the actual function that takes the time is _compose_vector_fields_3d(). I tried to parallelize it, but the speedup is not yet clear. I will investigate this further.

try parallelize _compute_pdfs_dense_3d

In the last week, I reorganized my fork of the DIPY repo. The branch aff-par-grad-ori is the original one, that is, a clone of upstream/master. In the branch aff-par-grad-all, I parallelized _gradient_3d() and _sparse_gradient_3d() in vector_fields.pyx using dynamic allocation (malloc). In the branch aff-par-grad-fun, I parallelized the two functions using a local function. On top of these three branches, I added a timer around the parallel part of the function in the xxx-xxx-xxx-xxx-tim branches. The one using malloc gave a 4.45 times speedup on average, while the one with a local function gave a 6.59 times speedup on average, tested on the cluster with 24 cores (48 threads).

speedup of parallelized _gradient_3d()

I also profiled these three branches. Both methods gave a speedup, and the one with the local function was a little faster.

profile of

Then I tried to parallelize _compute_pdfs_dense_3d() in parzenhist.pyx using a local function. However, it actually slowed down the execution. Maybe I need to try some other method, like adding a lock or using dynamic allocation (malloc). If I could successfully speed up _compute_pdfs_dense_3d(), it would make _update_histogram faster in the profiling results.

Also, I tried to parallelize _joint_pdf_gradient_dense_3d() in parzenhist.pyx. I first tried dynamic allocation (malloc) and adding a lock (unfinished). Then I tried the local-function approach. That way, I need some local buffers whose dimensions are not defined beforehand, so I need dynamic allocation (malloc) for them. Also, to avoid a 'with gil' statement, I need to change _jacobian() to a version without memory views. I made these changes in the branch aff-par-jpdf-fun, and I need to do more testing on this. If I parallelize this, _update_mutual_information will become faster in the profiling results.

local function

In the last two weeks, I made a small example to test the implementation of using a local function to hide the variables, as Prof. Elef suggested. First, I moved the lock out of the inner for loop like this. Then I put the whole inner for loop into a local function like this. The original version runs in 9.0126 seconds; the other two take 1.6221 seconds and 1.6269 seconds respectively. The speedup comes from the reduction in the number of lock acquisitions.
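The local-function trick can be sketched as follows (a toy kernel, not the DIPY code; the names are invented for illustration). The inner loop moves into a nogil helper, so its scratch arrays become stack locals of each call, and each thread automatically gets its own copies without any lock:

```cython
from cython.parallel import prange
cimport cython

@cython.boundscheck(False)
cdef double process_row(double[:, :] grid, Py_ssize_t i) nogil:
    # Scratch variables live on this call's stack, so every thread
    # gets private copies for free.
    cdef Py_ssize_t j
    cdef double acc = 0.0
    cdef double x[3]
    for j in range(grid.shape[1]):
        x[0] = grid[i, j]
        x[1] = x[0] * x[0]
        x[2] = x[1] * x[0]
        acc += x[0] + x[1] + x[2]
    return acc

def row_sums(double[:, :] grid, double[:] out):
    cdef Py_ssize_t i
    for i in prange(grid.shape[0], nogil=True):
        out[i] = process_row(grid, i)
```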

Then I implemented the same algorithm in _gradient_3d() in the branch aff-par-grad-fun, and added a timer to it in the branch aff-par-grad-fun-tim. The local function looks like this (lines 3067-3120). Inside it we can declare local buffers as local C arrays, like x[3], dx[3], q[3], instead of using dynamic allocation inside the parallel block or a global memory view with a lock. I tested the runtime using this example: it takes 0.037685 seconds per run, versus 0.216153 seconds for the original, a 5.73 times speedup. The parallel version that uses dynamic allocation (malloc) takes 0.060911 seconds, so the local-function version is faster than before.

I also made an example to test local variables and a global counter. If we add a lock in this example, it takes 0.918798 seconds, compared with 0.070092 seconds for the original; it actually slows down. With a local function and no lock, it takes 0.036968 seconds, faster than the original.

Then I tried the same approach in _joint_pdf_gradient_dense_3d(), tested with this example. The original takes 0.319248 seconds, while the version using a local function without a lock runs in 0.109481 seconds.

At the same time, I tested the performance of the different scheduling strategies with respect to the number of threads. For the simple example:

scheduling w.r.t. number of threads of simple example

Then, for the example in _gradient_3d() (line 3168):

scheduling w.r.t. number of threads of _gradient_3d()

In these two examples, we can see that static scheduling is the worst choice, while dynamic and guided scheduling show little difference. So dynamic or guided are the better choices.

first parallel version of 3d affine registration

In the last two weeks, I made a small example of 3D affine registration to test the speedup from parallelization. This example calls directly into _gradient_3d() in vector_fields.pyx, where I used prange (line 3125) to implement the multithreaded algorithm. I also added a timer to both the original version (lines 3114, 3162) and the parallel one (lines 3113, 3174) to time and compare the parallel part. Then I tested it on a cluster with 24 cores (48 threads). The original took 0.221 seconds per run, while the parallel one took 0.027 seconds, an 8.18 times speedup. I also wrote the resulting gradient arrays to .npy files and compared them numerically to make sure I got the same result before and after parallelization.

Then I tested with specific numbers of threads, but the runtime fluctuated and the speedup was not as clear as the number of threads grew.

Speedup with specific number of threads

I also tested _gradient_2d() for 2D affine registration, but saw no speedup at all, so I reverted all changes to it.

I also implemented the multithreaded algorithm for the sparse cases in both 2D and 3D. The functions for calculating gradients in the sparse cases are _sparse_gradient_3d() and _sparse_gradient_2d(). The results showed a speedup in the 3D case but not in 2D, so I also reverted the change to the 2D sparse case.

Parallelizing _gradient_3d() requires some thread-local buffers. I followed the Cython documentation on Using Parallelism to implement thread-local arrays: I used "with parallel()" around prange, and the code before prange runs once per thread, allocating a local buffer for each thread, just as I did in 3d affine registration (lines 3116-3124).
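The pattern from the Cython documentation can be sketched like this (a toy moving-average kernel, not the DIPY code): the buffer is assigned inside the parallel() block, which makes it thread-private, so each thread allocates and frees its own copy:

```cython
from cython.parallel import parallel, prange
from libc.stdlib cimport malloc, free

def smooth(double[:] data, double[:] out):
    cdef Py_ssize_t i
    cdef double* buf
    with nogil, parallel():
        # Runs once per thread: buf is assigned here, so it is
        # thread-private, and each thread owns its own scratch space.
        buf = <double*> malloc(3 * sizeof(double))
        for i in prange(1, data.shape[0] - 1):
            buf[0] = data[i - 1]
            buf[1] = data[i]
            buf[2] = data[i + 1]
            out[i] = (buf[0] + buf[1] + buf[2]) / 3.0
        free(buf)   # also runs once per thread
```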

I also did some experiments to investigate how Cython treats variables in a prange block. 1. Variables that are only read become shared among all threads. 2. Variables that are assigned become thread-private (local), meaning each thread gets its own copy of the variable declared outside the prange block; Cython also initializes private floating-point values (both float and double) to NaN, and private integers to the value furthest from 0. 3. Pointers and memory views are shared among threads. 4. If you use an in-place operator (for example, i += 1), the variable becomes a reduction variable (i in this example); code inside the prange block after that operator can no longer read it, and the reduced value becomes available after the loop.
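These inference rules can be illustrated in one small sketch (an invented example, not DIPY code):

```cython
from cython.parallel import prange

def demo(double[:] data):
    cdef Py_ssize_t i
    cdef double scale = 2.0   # only read in the loop  -> shared
    cdef double tmp = 0.0     # assigned in the loop   -> thread-private
    cdef double total = 0.0   # in-place operator      -> reduction
    for i in prange(data.shape[0], nogil=True):
        tmp = data[i] * scale   # each thread has its own tmp
        total += tmp            # combined across threads at loop end
    return total                # the reduced value is visible here
```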

After this, I profiled the 3d affine registration example of DIPY in both the dense and sparse cases.

profile example of affine registration

We can see speedups in the 3d cases (both dense and sparse), but none at all in the 2d cases. We can also see that _update_histogram() in the 3d dense case takes rather a long time, so I tried to parallelize _joint_pdf_gradient_dense_3d() in parzenhist.pyx, the most time-consuming part of _update_histogram().

For this function's thread-local buffer, I tried two implementations: adding a local buffer for each thread in the branch aff-par-joint-pdf (line 1017), and adding a lock around the conflicting writes in the branch aff_par-hist-lock (lines 1031-1041). Both showed a speedup, with the lock version faster, and both gave numerically identical results. However, when I added a timer to the original version, it yielded a strange image, so I need to run more tests to make sure the parallelization is correct.

I also experimented with the different scheduling options. In Cython's prange, there are three types of scheduling: static is for situations where each iteration takes almost the same amount of time; dynamic is for situations where the runtime of each iteration is not known ahead of time; guided is an enhanced dynamic scheduling that avoids the last iterations taking a long time.
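Selecting a strategy is a one-argument change in prange; a minimal sketch (invented kernel, not DIPY code):

```cython
from cython.parallel import prange

def square_all(double[:] data):
    cdef Py_ssize_t i
    # schedule may be 'static', 'dynamic', or 'guided';
    # an optional chunksize argument controls the block size.
    for i in prange(data.shape[0], nogil=True, schedule='guided'):
        data[i] = data[i] * data[i]
```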


Here we can see that if you automatically use all available threads, dynamic scheduling is the best choice. If you specify fewer threads than the total, the three scheduling options show no difference.

Here you can find the pull request of my first parallel version of 3d affine registration.

Speedup per thread

Last week, I added a timer to the parallel part of the simple example to measure the multithreaded section. I found that, in this example, the multithreaded part accounts for only about one third of the total time. I then ran the example on the server with 24 cores (48 threads) to investigate the speedup and efficiency for various numbers of threads.

Speedup per thread

The table confirms the multithreaded speedup: with 2 threads we get a 1.92 times speedup over a single thread, with 18 threads 17.05 times, but with all 48 threads only 32.59 times. This is about what we expected, and it is good news to be sure we can get this much speedup by implementing a multithreaded algorithm with OpenMP.

I also investigated the difference among static, dynamic, and guided scheduling, but saw no difference in multithreaded performance.

We use a lot of memory views in the Cython code for affine registration, and I wondered how they would behave under multithreaded parallelism, so I examined the C code Cython generates for the simple example.

Cython code of memory view ‘out’
C code of memory view ‘out’

In different threads, the memory view is accessed through different indices. This causes a problem when parallelizing the affine registration code (line 3109).

Memory view of affine registration

Here we can see that multiple threads write to the same arrays 'x', 'dx', and 'q', which creates a write conflict among threads.

As discussed with my mentors, we considered two methods to solve this problem: putting the multithreaded block into a function, or adding a lock around this part.

This week, I will try to implement these two methods, and investigate which one is better.

Investigating performance of multithreads

Last week, to complete the experiments on OpenMP, I made an example with a write conflict among multiple threads, and then solved the problem with a lock.

Then I started to investigate multithreaded performance. Running the simple example on my MacBook Air with 2 cores (4 threads) yielded a 12% speedup, with no thread exceeding 75% utilization; that might be due to some throttling mechanism. I then tested on other hardware to make sure I could run the example at 100% utilization on each thread: a MacBook Pro with 4 cores (8 threads), a workstation with 12 cores (24 threads), and a server with 24 cores (48 threads) all reached 100%. But before reaching full utilization, I had to wait a long time for execution to start, which made the run slower, even slower than the version without multithreading. My guess is that this is due to OpenMP scheduling. This week, I will try reducing the number of concurrent threads and investigate OpenMP's scheduling mechanism.

Also last week, I investigated the affine and diffeomorphic registration code in DIPY. I realized that, to implement multithreading, affine registration needs no lock, while diffeomorphic registration does. So I tried implementing multithreading in the affine registration and got a 36% speedup on my MacBook Air with 2 cores (4 threads). You can find it in this branch.

So this week, I will try to figure out how to get full utilization on each thread, and continue implementing the multithreaded algorithms in affine and diffeomorphic registration.

Learning Cython and OpenMP

As before, I profiled the execution time of the nonrigid registration and local-PCA denoising code in DIPY, and found that the most time-consuming parts are all nested for loops. Two of them are implemented in Cython, and the other one needs to be cythonized. Then we can try to improve their performance with OpenMP. To do this, I need to learn and experiment with Cython and OpenMP.

I followed Cython Tutorials and Users Guide to learn Cython, and mainly focused on how to make faster code.

First I learned how to compile Cython code, and then how to set the annotate parameter to produce an analysis of the compiled code. In the analysis, white lines don't interact with Python: you can safely release the GIL and run them in multiple threads. Yellow lines interact with Python, and the darker the yellow, the more Python interaction. There are several ways to make these lines lighter, or even white, and hence improve performance.

First of all, you can statically type function parameters and local variables. Second, you can turn off safety checks such as nonecheck, boundscheck, and wraparound. You can also use efficient indexing or memory views for faster array lookups and assignments.
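Put together, those three steps look roughly like this (an invented dot-product kernel for illustration, with the checks disabled per function via decorators):

```cython
cimport cython

@cython.boundscheck(False)   # skip bounds checking on index access
@cython.wraparound(False)    # disable negative-index handling
@cython.nonecheck(False)     # assume memoryview arguments are not None
def dot(double[:] a, double[:] b):
    cdef Py_ssize_t i          # statically typed loop index
    cdef double total = 0.0    # statically typed accumulator
    for i in range(a.shape[0]):
        total += a[i] * b[i]
    return total
```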

Then I experimented with the cython.parallel module, which supports OpenMP-based parallelism. I tried a simple example without write conflicts on a dual-core CPU with 4 threads; it gave a 12% speedup.

Simple example without OpenMP
Simple example with OpenMP

Note that the second picture shows that with OpenMP, all 4 threads execute the code (see threads 2 and 4: there is more utilization when using OpenMP).

I also experimented with a more complicated example with a write conflict. In this situation we may need a lock to force some lines of code to execute thread by thread rather than concurrently. However, I have not yet obtained a reasonable result.

You can find my experiments on this repo.

Profiling run time of registration and local-PCA

My project aims to speed up the execution of the registration algorithms, so first I need to find out which functions are the most time-consuming. This week, I used cProfile and snakeviz to profile the runtime.
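The workflow can be sketched with a toy workload in place of the DIPY example (nested_loops here is a stand-in for a registration kernel):

```python
import cProfile
import io
import pstats

def nested_loops(n):
    # Stand-in for a registration kernel dominated by nested for loops.
    total = 0.0
    for i in range(n):
        for j in range(n):
            total += i * j
    return total

profiler = cProfile.Profile()
profiler.enable()
nested_loops(200)
profiler.disable()

# Dump stats to a file that snakeviz can visualize:
#   snakeviz profile.out
profiler.dump_stats("profile.out")

# Or inspect them directly with pstats, sorted by cumulative time.
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream).sort_stats("cumulative")
stats.print_stats("nested_loops")
print("nested_loops" in stream.getvalue())  # prints True
```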

  • At first, I profiled the example of 3D affine registration from the DIPY tutorial.
profile of affine registration – combine

This example contains three types of affine registration: translation, rigid transformation, and affine transformation. So, I divided it into three pieces accordingly.

profile of affine registration – translation
profile of affine registration – rigid transformation
profile of affine registration – affine transformation

The cumulative and run times are:

cumulative time of affine registration
run time of affine registration

As we can see, there are three major time-consuming functions: gradient(), _update_mutual_information(), and _update_histogram().

  1. The first function, gradient(), is the most time-consuming, due to the nested for loops in _gradient_2d() and _gradient_3d() in vector_fields.pyx. It's already in Cython with nogil, but it still costs a lot of time, so we need to speed it up with OpenMP.
  2. The second function is fast when using only translation but slow with rigid or affine transformations, so its runtime depends on the complexity of the transformation matrix. That is understandable, and we can leave it as it is.
  3. For the last function, the cumulative time per optimization is constant and the run time per call is small, so no action is needed at this moment.

So, for affine registration, we can work on the nested loops of _gradient_2d() and _gradient_3d() in vector_fields.pyx to improve the execution time.

nested loops in _gradient_3d() and _gradient_2d()
  • Secondly, I profiled the examples of 2D and 3D diffeomorphic (SyN) registration from the DIPY tutorial.
profile of 2D SyN registration
profile of 3D SyN registration

Again, I separated the 2D example into two parts: the Circle-To-C experiment and the b0 image.

profile of 2D SyN registration – Circle-To-C
profile of 2D SyN registration – b0 image

The cumulative and run times are:

cumulative time of SyN registration
run time of SyN registration

Here we can see three major time-consuming functions: median_otsu(), optimize(), and _find_and_load().

  1. The first function is for brain extraction; we use it to remove the skull from the b0 volume. As the profile shows, it is the most expensive function, but it is outside my scope right now.
  2. The second function, optimize(), is slow in the 3D case due to the nested loop in invert_vector_field_fixed_point_3d() in vector_fields.pyx. There is also a nested loop in the 2D case in invert_vector_field_fixed_point_2d(), but its speed is acceptable. It is already in Cython with nogil but still takes time, so, as before, we need to speed it up with OpenMP.
  3. For the third function, the run time per call is small, so there is no need to optimize it.

So again, for diffeomorphic (SyN) registration, we can work on the nested loop of invert_vector_field_fixed_point_3d() in vector_fields.pyx.

nested loop in invert_vector_field_fixed_point_3d()
  • At last, I profiled the example of denoising with the local-PCA algorithm from the DIPY tutorial.
profile of denoising by local-PCA – using eig
profile of denoising by local-PCA – using svd

Both of these take time because of the nested loop in localpca(). In the same way, we can work on this nested loop to improve the performance.

nested loops in localpca()

In conclusion, there are nested loops in affine registration, diffeomorphic registration, and local-PCA denoising that increase the run time dramatically. So my major job is to implement these nested loops with OpenMP to speed up execution.

My first try at implementing a nested for loop with OpenMP is on this branch.

Intro to myself and my project


It's really exciting to get accepted into Google Summer of Code. I will contribute to DIPY (Diffusion MRI analysis) in Python. I'm a master's student in computer science at Indiana University. Right now, I'm trying to merge my interests in medical physics and computer science, specifically by applying machine learning and other computing technologies to medical imaging and radiotherapy. So I think this project will be a good experience for me.

My project is to implement multithreaded capabilities for nonrigid registration in DIPY using OpenMP. DIPY implements a broad range of algorithms for denoising, registration, reconstruction, fiber tracking, clustering, visualization, and statistical analysis of MRI data. For nonrigid image registration, the major goal is to calculate a diffeomorphic transformation that registers two 2D images or 3D volumes. Achieving this takes many iterations to maximize the similarity metric, which slows down the optimization process.

To speed up the calculation, one solution is a parallel computing algorithm. The OpenMP API supports multi-platform shared-memory parallel programming in C/C++ and Fortran. Due to the Global Interpreter Lock (GIL), there is no point using threads for CPU-intensive tasks in pure Python. Fortunately, with the help of Cython, we can release the GIL during computations. So in this project, I will implement a multithreaded parallel algorithm using OpenMP and Cython to improve the performance of nonrigid image registration in DIPY.
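Releasing the GIL in Cython is a one-line change around the pure-C section of a function; a minimal sketch (invented summation kernel, not DIPY code):

```cython
def total(double[:] data):
    cdef Py_ssize_t i
    cdef double s = 0.0
    with nogil:          # GIL released: only C operations allowed here,
        for i in range(data.shape[0]):   # and OpenMP threads become useful
            s += data[i]
    return s
```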

It's amazing that I only recently started learning Python, and I have already made some contributions to the open-source DIPY project. See my pull requests from before I applied for this project. I was also lucky to have my name (Ricci Woo) added to the contributors of DIPY release 0.14. In the bonding period, I learned the basic principles of diffusion MRI and image processing. I also learned how to profile the runtime of functions in Python to see which function is the most expensive, so that I can work on improving its performance.

Oh, I forgot to tell you my name. It's Xinquan Wu (legal name), but you can call me Ricci Woo (nickname). You can contact me by email. I will push my code to my fork of DIPY on GitHub.