Title of the project
Implementing multi-thread capacities for nonrigid registration in DIPY using OpenMP
The image registration code in DIPY is widely used. It supports two types of registration: affine registration and diffeomorphic registration. The existing code is single threaded, so parallelizing it with multithreading can reduce execution time. Some of the functions are already efficient or easy to parallelize. However, other functions may write to the same address from different threads and therefore need locks, so they cannot be sped up as much.
The state of the code before my work
For affine registration, running an example of 3d affine registration takes 66.1 s on a server with 24 cores (48 threads). Three functions take most of the time: 1. gradient, 26.5 s (40%); 2. _update_mutual_information, 22.6 s (34.2%); 3. _update_histogram, 10.3 s (15.6%).
For diffeomorphic registration, running an example of 3d diffeomorphic registration takes 12.2 s on the same server. Four functions take most of the time: 1. __invert_models, 3.77 s (30.9%); 2. transform_inverse, 1.14 s (9.3%); 3. update, 0.833 s (6.8%); 4. initialize_iteration, 0.821 s (6.7%).
Parallelized affine registration
By carefully investigating the functions called from those slow functions, we can see that the functions with nested for loops take most of the time. There are 3 such time-consuming functions: 1. _gradient_3d in vector_fields.pyx; 2. _compute_pdfs_dense_3d in parzenhist.pyx; 3. _joint_pdf_gradient_dense_3d in parzenhist.pyx.
To parallelize them, we first need to cythonize them, that is, make them cdef and nogil, and then parallelize them by replacing range in the outer for loop with prange from the cython.parallel module. I tried to parallelize each of them.
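As a minimal sketch of that transformation (the function and variable names here are made up for illustration, not the actual DIPY code), a nested loop whose iterations are independent can be parallelized like this:

```cython
# cython: boundscheck=False, wraparound=False
from cython.parallel import prange

def scale_field_3d(double[:, :, :] field, double factor):
    cdef Py_ssize_t i, j, k
    # prange replaces range on the outermost loop; nogil=True releases
    # the GIL so the OpenMP threads can run the loop body concurrently.
    for i in prange(field.shape[0], nogil=True):
        for j in range(field.shape[1]):
            for k in range(field.shape[2]):
                # each (i, j, k) writes to a distinct element,
                # so no lock is needed here
                field[i, j, k] = field[i, j, k] * factor
```

Note that the extension module must be compiled with OpenMP enabled (e.g. `extra_compile_args=['-fopenmp']` and `extra_link_args=['-fopenmp']` in setup.py), otherwise prange silently runs serially.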
The results show: 1. the parallelized _gradient_3d gives a 7.47x speedup; 2. the parallelized _compute_pdfs_dense_3d actually slows down; 3. the parallelized _joint_pdf_gradient_dense_3d gives a 2.68x speedup, but slows down across multiple calls and even failed some tests in DIPY's continuous integration. In the latter two functions, different threads may write to the same address, which causes conflicts, so we need to add a lock to resolve the problem. However, the lock slows everything down.
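A sketch of the conflicting-write pattern and the lock that resolves it (again with made-up names, assuming an OpenMP lock accessed through Cython's openmp module, not the actual parzenhist.pyx code):

```cython
from cython.parallel import prange
cimport openmp

def count_bins(double[:] values, long[:] counts, double bin_width):
    cdef Py_ssize_t i
    cdef long b
    cdef openmp.omp_lock_t lock
    openmp.omp_init_lock(&lock)
    # Two threads can compute the same bin index b, so the increment
    # of counts[b] is a race and must be serialized with a lock --
    # and this serialization is what eats the parallel speedup.
    for i in prange(values.shape[0], nogil=True):
        b = <long>(values[i] / bin_width)
        openmp.omp_set_lock(&lock)
        counts[b] += 1
        openmp.omp_unset_lock(&lock)
    openmp.omp_destroy_lock(&lock)
```

When almost every iteration takes the lock, as here, the threads run the critical section one at a time, which is why the locked versions can end up slower than the serial code.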
So, for the best performance, I cythonized _gradient_* in vector_fields.pyx and _compute_pdfs_* in parzenhist.pyx, and only parallelized _gradient_3d; see pull request #1612.
Here we can see that gradient drops from 26.5 s to 3.79 s, a 6.99x speedup.
Parallelized diffeomorphic registration
For diffeomorphic registration, there are 5 time-consuming functions with nested for loops: 1. _compose_vector_fields_3d, 2. invert_vector_field_fixed_point_3d, 3. warp_3d, 4. reorient_vector_field_3d, all in vector_fields.pyx, and 5. precompute_cc_factors_3d in crosscorr.pyx. Again, I tried to parallelize each of them.
Here, we can see that two of the functions give a speedup: _compose_vector_fields_3d and warp_3d. I only parallelized these two functions; see pull request #1613.
Here we can see there is no noticeable difference in overall runtime between the original and the parallelized version.
Challenges and experiences
1. How to use Cython and OpenMP to parallelize the code
I followed Cython’s documentation to learn how to use Cython, especially the cython.parallel module, which provides native parallelism via OpenMP. I ran many experiments in this repository to investigate: the difference between def and cdef functions, the performance with different numbers of threads, the differences among the three scheduling strategies, and the use of a lock.
2. Speedup with respect to the number of threads
The performance of this simple example shows we can get a 32.59x speedup using 48 threads.
However, the speedup of _gradient_3d drops off sharply as the number of threads grows.
We can see that even with 48 threads, we only get an 8.033x speedup. This may be because the overhead of launching a large number of threads grows once the loop body becomes complicated.
3. No significant difference among static, dynamic and guided scheduling
The performance of the different schedules on the simple example and on _gradient_3d shows no significant difference among the three scheduling strategies. Static scheduling may be the worst choice; both dynamic and guided scheduling work well.
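The schedule is selected with prange's schedule argument; a sketch of the kind of loop used in these comparisons (toy function, assumed names):

```cython
from cython.parallel import prange

def sum_of_squares(double[:] x):
    cdef Py_ssize_t i
    cdef double total = 0.0
    # schedule may be 'static', 'dynamic' or 'guided'; num_threads
    # caps the thread count. In these experiments the three schedules
    # performed about the same, with static slightly behind.
    for i in prange(x.shape[0], nogil=True, schedule='dynamic',
                    num_threads=8):
        # Cython recognizes "total += ..." inside a prange body and
        # generates an OpenMP reduction for it automatically.
        total += x[i] * x[i]
    return total
```

Dynamic and guided scheduling hand out iterations at run time, which helps when iterations have uneven cost; for the fairly uniform loops here, that flexibility costs little and loses nothing.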
Conclusion and discussion
The idea that more threads may speed up execution further is appealing. However, due to the complexity of the code, it may not give the speedup you expect. In the simple case, each iteration of the for loop is independent of the others; such a loop can easily be parallelized using prange, and you gain a large speedup. In more complicated situations, iterations affect each other, for example by writing to the same variable or the same address in memory. Such code is not easy to parallelize: sometimes you need a lock to resolve the write conflict, and sometimes it is hard to keep every thread fully utilized. All of this can slow your code down even after you parallelize it. So multithreaded parallelism is not easy. To go further in this direction, my suggestion is to dig into OpenMP and try using it directly instead of through the cython.parallel module.
The code of my work
1. PR for parallelized affine registration:
2. PR for parallelized diffeomorphic registration:
To test my code
1. install DIPY from my fork of DIPY repository:
git clone https://github.com/RicciWoo/dipy
cd dipy
git clean -fxd
pip install --user -e .
2. download code for testing
git clone https://github.com/RicciWoo/registration
3. profile the original example of both 3d affine and diffeomorphic registration
python -m cProfile -o aff_reg_3d_ori.prof lc_aff_reg_3d.py
snakeviz aff_reg_3d_ori.prof # this shows profile of original 3d affine registration
python lc_syn_otsu.py # this generates data for 3d diffeomorphic registration
python -m cProfile -o syn_reg_3d_ori.prof lc_syn_reg_3d.py
snakeviz syn_reg_3d_ori.prof # this shows profile of original 3d diffeomorphic registration
4. profile the parallelized version of 3d affine registration
git checkout --track origin/aff-par-cpdf-grad
pip install --user -e .
python -m cProfile -o aff_reg_3d_par.prof lc_aff_reg_3d.py
snakeviz aff_reg_3d_par.prof # this shows profile of parallelized 3d affine registration
5. profile the parallelized version of 3d diffeomorphic registration
git checkout --track origin/syn-par-comp-warp
pip install --user -e .
python -m cProfile -o syn_reg_3d_par.prof lc_syn_reg_3d.py
snakeviz syn_reg_3d_par.prof # this shows profile of parallelized 3d diffeomorphic registration
6. if you test on a headless cluster, replace lc_aff_reg_3d.py and lc_syn_reg_3d.py with sl_aff_reg_3d.py and sl_syn_reg_3d.py