Last week I reorganized my fork of the DIPY repo. The aff-par-grad-ori repository is the original code, that is, a clone of upstream/master. In the aff-par-grad-all repository, I parallelized _gradient_3d() and _sparse_gradient_3d() in vector_fields.pyx using dynamic allocation (malloc). In the aff-par-grad-fun repository, I parallelized the same two functions using a local function. For each of these three repositories, I also added a timer around the parallel part of the function, in the corresponding xxx-xxx-xxx-xxx-tim repositories. Tested with sl_test_grad_3d.py on the cluster with 24 cores (48 threads), the malloc version gave a 4.45x speedup on average, while the local-function version gave a 6.59x speedup on average.
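As a rough illustration of how such a timing comparison can be set up, here is a minimal sketch in plain Python rather than Cython. The names time_runs and average_speedup are hypothetical and are not part of DIPY or of the -tim repositories, and defining the average speedup as the ratio of mean run times is an assumption.

```python
import time
from statistics import mean


def time_runs(fn, repeats=5):
    """Return the wall-clock duration in seconds of each of `repeats` calls to fn()."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def average_speedup(baseline_times, parallel_times):
    """Mean baseline time divided by mean parallel time (assumed definition)."""
    return mean(baseline_times) / mean(parallel_times)
```

With this definition, a value above 1 means the parallel version is faster than the baseline, and a value below 1 means it is slower.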
I also profiled sl_aff_reg_3d.py on these three repositories. Both methods gave a speedup, and the local-function version was slightly faster.
Then I tried to parallelize _compute_pdfs_dense_3d() in parzenhist.pyx using a local function. However, this actually slowed down the execution of sl_test_cpdf_3d.py. I may need to try another method, such as adding a lock or using dynamic allocation (malloc). If I can speed up _compute_pdfs_dense_3d(), then _update_histogram should take less time in the profile of sl_aff_reg_3d.py.
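The two alternatives mentioned here, a lock around a shared histogram versus a separately allocated buffer per worker, can be sketched in plain Python as follows. This only illustrates how the data is shared and is not the Cython code in parzenhist.pyx. The function names are hypothetical, the inputs are assumed to be precomputed bin indices, and Python threads will not show the performance difference that OpenMP threads would.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def histogram_with_lock(bin_indices, nbins, nworkers=4):
    """All workers update one shared histogram, guarded by a lock."""
    hist = [0] * nbins
    lock = threading.Lock()

    def work(chunk):
        for b in chunk:
            with lock:  # every update is serialized through the lock
                hist[b] += 1

    chunks = [bin_indices[i::nworkers] for i in range(nworkers)]
    with ThreadPoolExecutor(max_workers=nworkers) as pool:
        list(pool.map(work, chunks))  # list() waits and re-raises worker errors
    return hist


def histogram_with_private_buffers(bin_indices, nbins, nworkers=4):
    """Each worker fills its own buffer; the buffers are summed at the end."""

    def work(chunk):
        local = [0] * nbins  # private buffer, so no lock is needed
        for b in chunk:
            local[b] += 1
        return local

    chunks = [bin_indices[i::nworkers] for i in range(nworkers)]
    with ThreadPoolExecutor(max_workers=nworkers) as pool:
        partials = list(pool.map(work, chunks))
    # Reduction step: add the per-worker buffers bin by bin.
    return [sum(column) for column in zip(*partials)]
```

The lock version keeps a single histogram but makes threads wait on each other for every update. The private-buffer version avoids that contention, at the cost of one extra buffer per worker and a final reduction.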
I also tried to parallelize _joint_pdf_gradient_dense_3d() in parzenhist.pyx. I first tried dynamic allocation (malloc) together with a lock, but did not finish that version. Then I tried a local function. With this approach, I need some local buffers whose dimensions are not known beforehand, so I have to allocate them dynamically (malloc). Also, to avoid the ‘with gil’ statement, I need to change _jacobian() to a version that does not use memory views. I made these changes in the aff-par-jpdf-fun repository, and they still need more testing. If this parallelization works, _update_mutual_information should take less time in the profile of sl_aff_reg_3d.py.