Trying to parallelize _compute_pdfs_dense_3d()

In the last week, I reorganized my forks of the DIPY repo. The repository aff-par-grad-ori is the original one, i.e. a clone of upstream/master. In the repository aff-par-grad-all, I parallelized _gradient_3d() and _sparse_gradient_3d() in vector_fields.pyx using dynamic allocation (malloc). In the repository aff-par-grad-fun, I parallelized the two functions using a local function. For each of these three repositories, I also added a timer around the parallel part of the functions in the corresponding xxx-xxx-xxx-xxx-tim repositories. Tested on the cluster with 24 cores (48 threads), the malloc version gave a 4.45x speedup on average, while the local-function version gave a 6.59x speedup on average.
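The malloc-based version follows a common Cython prange pattern: each thread allocates its own scratch buffer inside a `with parallel():` block, so nothing is shared between threads. A minimal sketch of that pattern (the function name and loop body are simplified placeholders, not the real _gradient_3d code):

```cython
from cython.parallel cimport parallel, prange
from libc.stdlib cimport malloc, free

def sum_of_squares(double[:] data):
    cdef Py_ssize_t i, n = data.shape[0]
    cdef double total = 0.0
    cdef double *buf
    with nogil, parallel():
        # buf is assigned inside the parallel block, so every thread
        # gets its own private pointer and its own scratch buffer.
        buf = <double *> malloc(3 * sizeof(double))
        for i in prange(n):
            buf[0] = data[i] * data[i]   # thread-private scratch work
            total += buf[0]              # prange turns += into a reduction
        free(buf)
    return total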

speedup of parallelized _gradient_3d()

I also profiled these three repositories. Both methods gave a speedup, and the one using a local function was slightly faster.

profile of

Then I tried to parallelize _compute_pdfs_dense_3d() in parzenhist.pyx using a local function. However, it actually slowed the execution down. Maybe I need to try some other method, like adding a lock or using dynamic allocation (malloc). If I can successfully speed up _compute_pdfs_dense_3d(), it will make _update_histogram faster in the profile.
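The likely problem is that every thread is incrementing the same shared histogram. One lock-free alternative worth trying is to give each thread its own partial histogram and sum them afterwards. A hedged sketch of that idea (the row-per-thread layout and names are my assumption, not DIPY's actual code):

```cython
from cython.parallel cimport parallel, prange, threadid

def partial_hist(long[:] bins, double[:, :] partial):
    # `partial` has one row per thread (shape: num_threads x nbins);
    # the caller sums the rows afterwards. This avoids any lock on a
    # shared histogram at the cost of some extra memory.
    cdef Py_ssize_t i, n = bins.shape[0]
    cdef int tid
    with nogil, parallel():
        tid = threadid()
        for i in prange(n):
            partial[tid, bins[i]] += 1.0
```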

I also tried to parallelize _joint_pdf_gradient_dense_3d() in parzenhist.pyx. I first tried dynamic allocation (malloc) plus a lock (still unfinished). Then I tried the local-function approach. For that, I need some local buffers whose dimensions are not known beforehand, so I have to allocate them dynamically (malloc). Also, to avoid a 'with gil' statement, I need to change _jacobian() into a version that does not take memoryviews. I made these changes in the repository aff-par-jpdf-fun, and I still need to test them more. If I can parallelize this function, _update_mutual_information will become faster in the profile.
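The reason 'with gil' creeps in is that a helper taking memoryview arguments needs the GIL at the call boundary, while a pointer-based helper can stay fully nogil. A sketch of the combination described above, with made-up names and a trivial body standing in for the real Jacobian computation:

```cython
from cython.parallel cimport parallel, prange
from libc.stdlib cimport malloc, free

cdef void _jacobian_ptr(double *x, double *J) nogil:
    # Pointer-based stand-in for a memoryview-taking _jacobian():
    # no memoryviews, so it can be called inside a nogil region.
    J[0] = 1.0; J[1] = x[0]; J[2] = x[1]

def demo(double[:, :] pts):
    cdef Py_ssize_t i, n = pts.shape[0]
    cdef double s = 0.0
    cdef double *J
    with nogil, parallel():
        # In the real code the buffer size is only known at run time,
        # hence a per-thread malloc instead of a stack array.
        J = <double *> malloc(3 * sizeof(double))
        for i in prange(n):
            _jacobian_ptr(&pts[i, 0], J)
            s += J[0] + J[1] + J[2]   # reduction over threads
        free(J)
    return s
```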

local function

In the last two weeks, I made a small example to test the idea of using a local function to hide the variables, as Prof. Elef suggested. First, I moved the lock out of the inner for loop, like this. Then I put the whole inner for loop into a local function, like this. The original version runs in 9.0126 seconds; the other two take 1.6221 seconds and 1.6269 seconds respectively. The speedup comes from reducing the number of lock acquisitions.
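The local-function variant can be sketched like this (a toy reduction, not the code from the linked example): the whole inner loop lives in a cdef nogil helper, so its locals are per-call stack variables and therefore private to each thread, and no lock is needed at all.

```cython
from cython.parallel cimport prange

cdef double _row_sum(double *row, Py_ssize_t m) nogil:
    # `s` and `j` live on this call's stack, hence are thread-private.
    cdef Py_ssize_t j
    cdef double s = 0.0
    for j in range(m):
        s += row[j]
    return s

def total(double[:, :] a):
    cdef Py_ssize_t i, n = a.shape[0], m = a.shape[1]
    cdef double t = 0.0
    for i in prange(n, nogil=True):
        # One reduction per row instead of one lock per inner iteration.
        t += _row_sum(&a[i, 0], m)
    return t
```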

Then I implemented the same approach in _gradient_3d() in the repo aff-par-grad-fun, and added a timer to _gradient_3d() in the repo aff-par-grad-fun-tim. The local function looks like this (lines 3067-3120). Inside this function we can declare the local buffers as local C arrays, like x[3], dx[3], q[3], instead of using dynamic allocation in the parallel region or a global memoryview with a lock. I measured the runtime using this example: it takes 0.037685 seconds per run, while the original takes 0.216153 seconds, a 5.73x speedup. The parallel version that uses dynamic allocation (malloc) takes 0.060911 seconds, so the local-function version is faster than before.
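The key point is that arrays like x[3], dx[3], q[3] declared inside the helper are plain stack arrays, so each thread automatically gets its own copy. A sketch loosely modeled on that description (the arithmetic is a placeholder, not the real gradient computation):

```cython
from cython.parallel cimport prange

cdef void _process_point(double px, double py, double pz,
                         double *out) nogil:
    # Stack arrays: thread-private for free, no malloc and no lock.
    cdef double x[3]
    cdef double dx[3]
    cdef double q[3]
    x[0] = px; x[1] = py; x[2] = pz
    dx[0] = x[0] * 0.5; dx[1] = x[1] * 0.5; dx[2] = x[2] * 0.5
    q[0] = x[0] + dx[0]; q[1] = x[1] + dx[1]; q[2] = x[2] + dx[2]
    out[0] = q[0] + q[1] + q[2]

def demo(double[:, :] pts, double[:] out):
    cdef Py_ssize_t i, n = pts.shape[0]
    for i in prange(n, nogil=True):
        _process_point(pts[i, 0], pts[i, 1], pts[i, 2], &out[i])
```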

I also made an example to test a local variable versus a global counter. Adding a lock in this example takes 0.918798 seconds, compared to 0.070092 seconds for the original: it actually slows things down. Using a local function without a lock instead takes 0.036968 seconds, faster than the original.
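The contrast can be sketched as two counting variants (hypothetical names, not the linked example's code): the lock version serializes every update of a shared counter, while the lock-free version lets prange keep a private count per thread and combine them as a reduction.

```cython
from cython.parallel cimport prange
cimport openmp

def count_with_lock(long[:] data, long threshold, long[:] counter):
    # counter[0] is a shared counter guarded by one OpenMP lock;
    # every hit serializes on the lock, which is why this is slow.
    cdef Py_ssize_t i, n = data.shape[0]
    cdef openmp.omp_lock_t lock
    openmp.omp_init_lock(&lock)
    with nogil:
        for i in prange(n):
            if data[i] > threshold:
                openmp.omp_set_lock(&lock)
                counter[0] += 1
                openmp.omp_unset_lock(&lock)
    openmp.omp_destroy_lock(&lock)

def count_private(long[:] data, long threshold):
    # Per-thread counts combined by a prange reduction: no lock at all.
    cdef Py_ssize_t i, n = data.shape[0]
    cdef long count = 0
    for i in prange(n, nogil=True):
        if data[i] > threshold:
            count += 1
    return count
```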

Then I implemented the same approach in _joint_pdf_gradient_dense_3d() and tested it with this example. The original takes 0.319248 seconds, while the version using a local function without a lock runs in 0.109481 seconds.

At the same time, I tested the performance of the different scheduling options with respect to the number of threads. For the simple example:

scheduling w.r.t. number of threads of simple example

Then, for the example in _gradient_3d() (line 3168):

scheduling w.r.t. number of threads of _gradient_3d()

In these two examples, we can see that static scheduling is the worst choice, while dynamic and guided scheduling show little difference between them. So dynamic or guided is the better choice.
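In Cython, the schedule is just a keyword argument to prange, so switching between the options is a one-word change. A minimal sketch (the loop body is a placeholder):

```cython
from cython.parallel cimport prange

def sum_sq(double[:] data, int nthreads):
    cdef Py_ssize_t i, n = data.shape[0]
    cdef double s = 0.0
    # schedule accepts 'static', 'dynamic', 'guided' (and 'runtime');
    # in the measurements above, dynamic and guided were the faster ones.
    for i in prange(n, nogil=True, num_threads=nthreads,
                    schedule='dynamic'):
        s += data[i] * data[i]
    return s
```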