Profiling run time of registration and local-PCA

My project is trying to speed up the execution time of registration algorithm. So at first, I need to find out which function is most time-consuming. In this week, I use cProfile and snakeviz to profile the run time.

  • At first, I profile the example of 3D affine registration from DIPY tutorial.
profile of affine registration – combine

This example contains three types of affine registration: translation, rigid transformation, and affine transformation. So, I divide it into three small pieces respectively.

profile of affine registration – translation
profile of affine registration – rigid transformation
profile of affine registration – affine tranformation

The cumulative and run time of them is:

cumulative time of affine registration
run time of affine registration

As we can see, there are three major time consuming functions: gradient(), _update_mutual_information(), and _update_histogram().

  1. For the first function – gradient(), it’s the most time consuming one, due to nested for loops in _gradient_2d() or _gradient_3d() in vector_fields.pyx. It’s already in Cython with nogil, but it still costs a lot time. So we need to speed it up with OpenMP.
  2. For the second function, we can see it’s fast when using only translation; while it’s slow when using rigid or affine transformation. So the runtime depends on the complexity of transformation matrix. And this is understandable, we can just leave it as it is.
  3. For the last function, cumulative time per optimization is constant. And run time per call is small. So no future action need to be done at this moment.

So, for affine registration, we can work on the nested loops of _gradient_2d() and _gradient_3d() in vector_fields.pyx to improve the execution time.

nested loops in _gradient_3d() and _gradient_2d()
  • Secondly, I profile the example of 2D and 3D diffeomorphic (SyN) registration from DIPY tutorial.
profile of 2D SyN registration
profile of 3D SyN registration

Agian, I separate the 2D example into 2 parts: Circle-To-C experiment and b0 image.

profile of 2D SyN registration – Circle-To-C
profile of 2D SyN registration – b0 image

The cumulative and run time of them is:

cumulative time of SyN registration
run time of SyN registration

Here we  can see, there are three major time consuming functions: median_otsu(), optimize(), and _find_and_load().

  1. The first function is for brain extraction. We use it to remove the skull from b0 volume. As shown in the profile, this is the most expensive function. But it’s out of my concern right now.
  2. For the second function – optimize(), it’s slow in 3D case, due to the nested loop in invert_vector_field_fixed_point_3d() in vector_fields.pyx. There is also nested loop in 2D case in invert_vector_field_fixed_point_2d(), but the speed is acceptable. Also it’s already in Cython with nogil, but it still takes time. As the same before, we need to speed it up with OpenMP.
  3. For the third function, the run time per call is small. So, it’s no need to optimize.

So again, for diffeomorphic (SyN) registration, we can work on the nested loop of invert_vector_field_fixed_point_3d() in vector_fields.pyx for optimization.

nested loop in invert_vector_field_fixed_point_3d()
  • At last, I profile the example of denoise by local-PCA algorithm from DIPY tutorial.
profile of denoising by local-PCA – using eig
profile of denoising by local-PCA – using svd

Both of these take time because nested loop in localpca() in localpca.py. As the same, we can work on this nested loop to improve the performance.

nested loops in localpca()

As a conclusion, there is nested loop in affine registration, diffeomorphic registration, and denoising by local-PCA, which increase the run time dramatically. So my major job is to implement the nested loop using OpenMP to increase the speed of execution.

My first try on implementing nested for loop using OpenMP is on this branch.

Leave a Reply

Your email address will not be published. Required fields are marked *