# Profiling run time of registration and local-PCA

My project is trying to speed up the execution time of registration algorithm. So at first, I need to find out which function is most time-consuming. In this week, I use cProfile and snakeviz to profile the run time.

• At first, I profile the example of 3D affine registration from DIPY tutorial.

This example contains three types of affine registration: translation, rigid transformation, and affine transformation. So, I divide it into three small pieces respectively.

The cumulative and run time of them is:

As we can see, there are three major time consuming functions: gradient(), _update_mutual_information(), and _update_histogram().

1. For the first function – gradient(), it’s the most time consuming one, due to nested for loops in _gradient_2d() or _gradient_3d() in vector_fields.pyx. It’s already in Cython with nogil, but it still costs a lot time. So we need to speed it up with OpenMP.
2. For the second function, we can see it’s fast when using only translation; while it’s slow when using rigid or affine transformation. So the runtime depends on the complexity of transformation matrix. And this is understandable, we can just leave it as it is.
3. For the last function, cumulative time per optimization is constant. And run time per call is small. So no future action need to be done at this moment.

So, for affine registration, we can work on the nested loops of _gradient_2d() and _gradient_3d() in vector_fields.pyx to improve the execution time.

• Secondly, I profile the example of 2D and 3D diffeomorphic (SyN) registration from DIPY tutorial.

Agian, I separate the 2D example into 2 parts: Circle-To-C experiment and b0 image.

The cumulative and run time of them is:

Here we  can see, there are three major time consuming functions: median_otsu(), optimize(), and _find_and_load().

1. The first function is for brain extraction. We use it to remove the skull from b0 volume. As shown in the profile, this is the most expensive function. But it’s out of my concern right now.
2. For the second function – optimize(), it’s slow in 3D case, due to the nested loop in invert_vector_field_fixed_point_3d() in vector_fields.pyx. There is also nested loop in 2D case in invert_vector_field_fixed_point_2d(), but the speed is acceptable. Also it’s already in Cython with nogil, but it still takes time. As the same before, we need to speed it up with OpenMP.
3. For the third function, the run time per call is small. So, it’s no need to optimize.

So again, for diffeomorphic (SyN) registration, we can work on the nested loop of invert_vector_field_fixed_point_3d() in vector_fields.pyx for optimization.

• At last, I profile the example of denoise by local-PCA algorithm from DIPY tutorial.

Both of these take time because nested loop in localpca() in localpca.py. As the same, we can work on this nested loop to improve the performance.

As a conclusion, there is nested loop in affine registration, diffeomorphic registration, and denoising by local-PCA, which increase the run time dramatically. So my major job is to implement the nested loop using OpenMP to increase the speed of execution.

My first try on implementing nested for loop using OpenMP is on this branch.