In the last week, I reorganized my fork of DIPY repo. The repository of aff-par-grad-ori is the original one, that is the clone of upstream/master. In the repository of aff-par-grad-all, I parallelized _gradient_3d() and _sparse_gradient_3d() in vector_fields.pyx using dynamic allocation (malloc). In the repository of aff-par-grad-fun, I parallelized the two function using local function. With respect to these three repositories, I added timer for timing the parallel part of the function in the repositories of xxx-xxx-xxx-xxx-tim. The one using malloc gave me 4.45 times speedup on average, while the one with local function gave me 6.59 times speedup on average, tested on sl_test_grad_3d.py and on the cluster with 24 cores (48 threads).
Also, I profiled sl_aff_reg_3d.py on these three repositories. Both two methods gave me speedup effect, and the one with local function was a little bit faster.
Then I tried to parallelize _compute_pdfs_dense_3d() in parzenhist.pyx using local function. However, it even slowed down the execution of sl_test_cpdf_3d.py. Maybe I need to try some other method, like adding a locker or using dynamic allocation (malloc). If I could successfully speedup the execution of _compute_pdfs_dense_3d(), this will make it faster on _update_histogram in the profiling of sl_aff_reg_3d.py.
Also, I tried to parallelize _joint_pdf_gradient_dense_3d() in parzenhist.pyx. I first tried using dynamic allocation (malloc) and adding a locker (but unfinished). Then I tried to do it using local function. In this way, I need to some local buffer with dimension undefined before hand. Then I need to use dynamic allocation (malloc) for this. Also, for avoiding using ‘with gil’ statement, I need to change _jacobian() to be the ones without memory views. I made these changes in the repository of aff-par-jpdf-fun. So I need to do more test on this. If I parallelize this, it will become faster on _update_mutual_information in the profiling of sl_aff_reg_3d.py.
In the last two week, I made a small example to test the implementation of using a local function to hide the variables as Prof. Elef suggested. First, I move the locker out of the inner for loop like this. Then I put the whole inner for loop into a local function like this. The original one runs 9.0126 second, the other two take 1.6221 second and 1.6269 second respectively. The speedup effect comes from the reduction of the number of lockers.
Then I implemented the same algorithm in _gradient_3d() in repo: aff-par-grad-fun. Also I added a timer in _gradient_3d() in repo: aff-par-grad-fun-tim. The local function is like this (line 3067-3120). Inside this function we can make local buffers as local c arrays, like x, dx, q, instead of using dynamic allocation in parallel or using global memory view with locker. I tested for the runtime using this example. It takes 0.037685 second per run. The original one is 0.216153second. 5.73 times speedup. The parallel one that uses dynamic allocation (malloc) takes 0.060911 second. We can see it’s faster then before.
I also make an example to test local variable and global counter. We can add a locker in this example, it takes 0.918798 second, compare to the original one 0.070092 second. It even slows down. Then I use a local function without locker, it takes 0.036968 second. We can see it’s faster than the original one.