Final report of GSoC 2018

Title of the project

Implementing multi-threading capabilities for nonrigid registration in DIPY using OpenMP

Abstract

The image registration code in DIPY is widely used. It covers two types of nonrigid registration: affine registration and diffeomorphic registration. The existing code is single-threaded, so parallelizing it with multithreading can reduce execution time. Some of the functions are already efficient or are easy to parallelize. However, some functions may write to the same address from different threads and therefore need locks, so they cannot be sped up as much.

The state of the code before my work

For affine registration, running an example of 3D affine registration takes 66.1 s on a server with 24 cores (48 threads). Three functions take most of the time: 1. gradient takes 26.5 s (40%); 2. _update_mutual_information takes 22.6 s (34.2%); 3. _update_histogram takes 10.3 s (15.6%).

[Figure: profile of the original 3D affine registration]

For diffeomorphic registration, running an example of 3D diffeomorphic registration takes 12.2 s on the same server with 24 cores (48 threads). Four functions take most of the time: 1. __invert_models takes 3.77 s (30.9%); 2. transform_inverse takes 1.14 s (9.3%); 3. update takes 0.833 s (6.8%); 4. initialize_iteration takes 0.821 s (6.7%).

[Figure: profile of the original 3D diffeomorphic registration]

Parallelized affine registration

By carefully investigating the functions called from those slow spots, we can see that the functions with nested for loops take most of the time. There are three such time-consuming functions: 1. _gradient_3d in vector_fields.pyx; 2. _compute_pdfs_dense_3d in parzenhist.pyx; 3. _joint_pdf_gradient_dense_3d in parzenhist.pyx.

To parallelize them, we first need to cythonize them, that is, make them cdef and nogil functions, and then replace range in the for loop with prange from the cython.parallel module. I tried to parallelize each of them.
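As a rough illustration, here is a minimal sketch of the cythonize-then-prange pattern; the function and its loop body are made up for this example (not taken from vector_fields.pyx), and it has to be compiled with OpenMP enabled (e.g. -fopenmp):

# cython: boundscheck=False, wraparound=False
# Hypothetical sketch of the cythonize-then-prange pattern; the
# function and array names are illustrative, not DIPY code.
from cython.parallel import prange

cdef void _scale_volume_3d(double[:, :, :] vol, double factor) nogil:
    cdef Py_ssize_t i, j, k
    # The outer loop is split across OpenMP threads; each (i, j, k)
    # cell is written by exactly one thread, so no lock is needed.
    for i in prange(vol.shape[0]):
        for j in range(vol.shape[1]):
            for k in range(vol.shape[2]):
                vol[i, j, k] = vol[i, j, k] * factor

def scale_volume_3d(double[:, :, :] vol, double factor):
    # Python-visible wrapper around the cdef nogil worker.
    with nogil:
        _scale_volume_3d(vol, factor)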

[Figure: speedup of a single call to each function in affine registration]

The results show: 1. the parallelized _gradient_3d gives a 7.47x speedup; 2. the parallelized _compute_pdfs_dense_3d actually slows down; 3. the parallelized _joint_pdf_gradient_dense_3d gives a 2.68x speedup on a single call, but slows down over multiple calls and even failed some tests in DIPY's continuous integration. In the latter two functions, different threads may write to the same address, which can cause conflicts, so we need to add a lock to resolve the problem. However, the lock slows everything down.
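To make the write conflict concrete, here is a hypothetical histogram-style loop guarded by an OpenMP lock; the lock primitives come from Cython's openmp module, but the function and variable names are illustrative, not taken from parzenhist.pyx:

# Hypothetical sketch of a write-conflict loop guarded by an OpenMP
# lock; the names are illustrative, not taken from parzenhist.pyx.
from cython.parallel import prange
cimport openmp

def count_labels(int[:] labels, double[:] hist):
    cdef Py_ssize_t i, n = labels.shape[0]
    cdef openmp.omp_lock_t lock
    openmp.omp_init_lock(&lock)
    for i in prange(n, nogil=True):
        # Two threads may hit the same bin at once, so the
        # read-modify-write has to be serialized with the lock.
        openmp.omp_set_lock(&lock)
        hist[labels[i]] += 1.0
        openmp.omp_unset_lock(&lock)
    openmp.omp_destroy_lock(&lock)

With a single lock around every update, the threads spend most of their time waiting for each other, which matches the slowdown observed above; keeping a private partial histogram per thread and merging them at the end would be the usual way around this.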

So, for best performance, I cythonized _gradient_* in vector_fields.pyx and _compute_pdfs_* in parzenhist.pyx, and only parallelized _gradient_3d; see pull request #1612.

[Figure: profile of the parallelized 3D affine registration]

Here we can see that gradient drops from 26.5 s to 3.79 s, a 6.99x speedup.

Parallelized diffeomorphic registration

For diffeomorphic registration, there are five time-consuming functions with nested for loops: 1. _compose_vector_fields_3d, 2. invert_vector_field_fixed_point_3d, 3. warp_3d, 4. reorient_vector_field_3d, all in vector_fields.pyx; and 5. precompute_cc_factors_3d in crosscorr.pyx. Again, I tried to parallelize each of them.

[Figure: speedup of a single call to each function in diffeomorphic registration]

Here we can see that only two of the functions give a speedup: _compose_vector_fields_3d and warp_3d. So I only parallelized these two functions; see pull request #1613.

[Figure: profile of the parallelized 3D diffeomorphic registration]

Here we can see that there is no real difference between the original and the parallelized version.

Challenges and experiences

1. how to use Cython and OpenMP to parallelize the code

I followed Cython’s documentation to learn how to use Cython, especially the cython.parallel module, which provides native parallelism based on OpenMP. I did a lot of experiments in this repository to investigate: the difference between def and cdef functions, the performance with different numbers of threads, the differences among the three scheduling strategies, and the use of a lock.

2. speedup performance with respect to different numbers of threads

The performance of this simple example shows we can get a 32.59x speedup using 48 threads.

[Figure: speedup of the simple example for different numbers of threads]

However, the performance of _gradient_3d shows that its speedup drops off greatly as the number of threads grows.

[Figure: speedup of _gradient_3d for different numbers of threads]

We can see that even with 48 threads we only get an 8.033x speedup. This may be because the overhead of preparing and launching a large number of threads becomes significant once the code gets more complicated.
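prange accepts a num_threads keyword, so this kind of scaling can be measured by timing the same call at different thread counts. A minimal sketch, with a made-up reduction loop standing in for the real workload:

# Hypothetical scaling probe: the same prange loop run with a given
# thread count. sum_of_squares is illustrative, not DIPY code.
from cython.parallel import prange

def sum_of_squares(double[:] x, int num_threads):
    cdef Py_ssize_t i
    cdef double total = 0.0
    for i in prange(x.shape[0], nogil=True, num_threads=num_threads):
        total += x[i] * x[i]  # Cython turns this into an OpenMP reduction
    return total

Timing sum_of_squares from Python (for example with timeit) for num_threads in 1, 2, 4, ..., 48 gives the kind of scaling curve plotted above.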

3. no significant difference among static, dynamic and guided scheduling

[Figure: different scheduling strategies on the simple example]
[Figure: different scheduling strategies on _gradient_3d]

The performance of the simple example and of _gradient_3d under the different scheduling strategies shows no significant difference among the three. Static scheduling may be the worst choice, while both dynamic and guided scheduling are good options.
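In cython.parallel, the schedule is just a keyword argument to prange, and it has to be a compile-time constant, so comparing the three strategies means compiling one loop per schedule. A hypothetical sketch (the loop body is illustrative, not DIPY code):

# Hypothetical sketch: the same loop under the three OpenMP schedules.
# 'static' hands out fixed chunks up front; 'dynamic' and 'guided'
# assign chunks to threads as they become free.
from cython.parallel import prange

def double_all(double[:] x, str schedule_name):
    cdef Py_ssize_t i, n = x.shape[0]
    if schedule_name == 'static':
        for i in prange(n, nogil=True, schedule='static'):
            x[i] = 2.0 * x[i]
    elif schedule_name == 'dynamic':
        for i in prange(n, nogil=True, schedule='dynamic'):
            x[i] = 2.0 * x[i]
    else:
        for i in prange(n, nogil=True, schedule='guided'):
            x[i] = 2.0 * x[i]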

Conclusion and discussion

The idea that more threads will speed up execution further is appealing. However, due to the complexity of the code, it may not give the speedup you expect. In the simple case, each iteration of the for loop is independent of the others; such a loop is easily parallelized with prange and yields a large speedup. In more complicated situations, however, iterations affect one another, for example by writing to the same variable or the same address in memory. In such cases parallelization is not easy: sometimes you need a lock to resolve the write conflict, and sometimes it is hard to get full performance out of each thread. All of these issues can slow your code down even after you parallelize it. So multithreaded parallelism is not easy. To go further in this direction, my suggestion is to dig into OpenMP and try using it directly instead of going through the cython.parallel module.

The code of my work

1. PR for parallelized affine registration:

https://github.com/nipy/dipy/pull/1612

2. PR for parallelized diffeomorphic registration:

https://github.com/nipy/dipy/pull/1613

To test my code

1. install DIPY from my fork of DIPY repository:

git clone https://github.com/RicciWoo/dipy

cd dipy

git clean -fxd

pip install --user -e .

2. download code for testing

cd ..

git clone https://github.com/RicciWoo/registration

cd registration

3. profile the original examples of both 3D affine and diffeomorphic registration

python -m cProfile -o aff_reg_3d_ori.prof lc_aff_reg_3d.py

snakeviz aff_reg_3d_ori.prof # this shows the profile of the original 3D affine registration

python lc_syn_otsu.py # this generates the data for 3D diffeomorphic registration

python -m cProfile -o syn_reg_3d_ori.prof lc_syn_reg_3d.py

snakeviz syn_reg_3d_ori.prof # this shows the profile of the original 3D diffeomorphic registration

4. profile the parallelized version of 3D affine registration

cd ../dipy

git checkout --track origin/aff-par-cpdf-grad

pip install --user -e .

cd ../registration

python -m cProfile -o aff_reg_3d_par.prof lc_aff_reg_3d.py

snakeviz aff_reg_3d_par.prof # this shows the profile of the parallelized 3D affine registration

5. profile the parallelized version of 3D diffeomorphic registration

cd ../dipy

git checkout --track origin/syn-par-comp-warp

pip install --user -e .

cd ../registration

python -m cProfile -o syn_reg_3d_par.prof lc_syn_reg_3d.py

snakeviz syn_reg_3d_par.prof # this shows the profile of the parallelized 3D diffeomorphic registration

6. if you test on a headless cluster, replace lc_aff_reg_3d.py and lc_syn_reg_3d.py with sl_aff_reg_3d.py and sl_syn_reg_3d.py
