Over the last two weeks, I wrote a small example to test the idea of using a local function to hide the variables, as Prof. Elef suggested. First, I moved the lock out of the inner for loop, like this. Then I put the whole inner for loop into a local function, like this. The original version runs in 9.0126 seconds; the other two take 1.6221 seconds and 1.6269 seconds respectively. The speedup comes from the reduced number of lock acquisitions.
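The difference between the baseline and the local-function variant can be sketched in plain C with OpenMP. This is only an illustration, not the code from the example linked above: the function names, the shared `total` accumulator and the use of `#pragma omp critical` as a stand-in for the explicit lock are all assumptions.

```c
#include <stddef.h>

/* Hypothetical baseline: the critical section (standing in for the lock)
 * is entered once per inner iteration, i.e. rows * cols times. */
double sum_lock_inside(const double *a, size_t rows, size_t cols)
{
    double total = 0.0;
    #pragma omp parallel for
    for (long i = 0; i < (long)rows; i++) {
        for (size_t j = 0; j < cols; j++) {
            #pragma omp critical
            total += a[(size_t)i * cols + j];
        }
    }
    return total;
}

/* Hypothetical local function: the whole inner loop lives here, so its
 * accumulator is private to the calling thread and needs no lock. */
static double row_sum(const double *row, size_t cols)
{
    double acc = 0.0;
    for (size_t j = 0; j < cols; j++)
        acc += row[j];
    return acc;
}

/* The critical section is now entered only once per outer iteration. */
double sum_lock_outside(const double *a, size_t rows, size_t cols)
{
    double total = 0.0;
    #pragma omp parallel for
    for (long i = 0; i < (long)rows; i++) {
        double acc = row_sum(a + (size_t)i * cols, cols);
        #pragma omp critical
        total += acc;
    }
    return total;
}
```

Both functions return the same sum; only the number of times the protected region is entered differs (rows * cols versus rows).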

Then I implemented the same approach in _gradient_3d() in repo: aff-par-grad-fun. I also added a timer to _gradient_3d() in repo: aff-par-grad-fun-tim. The local function is like this (lines 3067-3120). Inside this function we can declare the local buffers as local C arrays, such as x[3], dx[3], q[3], instead of using dynamic allocation inside the parallel region or a global memory view protected by a lock. I measured the runtime using this example. It takes 0.037685 seconds per run, compared with 0.216153 seconds for the original, a 5.73 times speedup. The parallel version that uses dynamic allocation (malloc) takes 0.060911 seconds, so the local-function version is faster than that one as well.
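As a rough illustration of the local-buffer idea, the sketch below puts x[3], dx[3] and q[3] on the stack of a helper function that is called from a parallel loop. It is not the code from _gradient_3d(): the helper name, its arguments and the arithmetic inside it are placeholders chosen only to make the example self-contained.

```c
#include <stddef.h>

/* Hypothetical per-voxel helper. x, dx and q are stack arrays, so every
 * call (and therefore every thread) gets its own copy: no malloc/free
 * and no lock around a shared scratch buffer. */
static double voxel_value(const double *pts, const double *shift, size_t i)
{
    double x[3], dx[3], q[3];
    double out = 0.0;
    for (int k = 0; k < 3; k++) {
        x[k] = pts[3 * i + k];   /* coordinates of point i */
        dx[k] = shift[k];        /* placeholder displacement */
        q[k] = x[k] + dx[k];     /* displaced point */
        out += q[k] * q[k];      /* placeholder result: squared norm of q */
    }
    return out;
}

/* Each iteration writes only to out[i], so no synchronization is needed. */
void process_voxels(const double *pts, const double *shift,
                    double *out, size_t n)
{
    #pragma omp parallel for
    for (long i = 0; i < (long)n; i++)
        out[i] = voxel_value(pts, shift, (size_t)i);
}
```

Because the buffers have a fixed, small size, keeping them on the stack avoids both the per-iteration allocation cost and any contention on shared scratch memory.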

I also made an example to test a local variable together with a global counter. If we add a lock in this example, it takes 0.918798 seconds, compared with 0.070092 seconds for the original, so the lock actually slows it down. Then I used a local function without a lock, which takes 0.036968 seconds and is faster than the original.
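One way to keep a global counter without a lock is to count into a local variable inside a local function and let OpenMP combine the per-thread partial counts. The sketch below assumes a reduction is used for that last step; the test example may combine the counts differently, and the function names and the threshold test are hypothetical.

```c
#include <stddef.h>

/* Hypothetical local function: counts within one row using a variable
 * that is local to the call, so no other thread can touch it. */
static long count_row(const int *row, size_t cols, int threshold)
{
    long local = 0;
    for (size_t j = 0; j < cols; j++)
        if (row[j] > threshold)
            local++;
    return local;
}

/* The global counter is updated through an OpenMP reduction: each thread
 * accumulates into its own private copy, and the copies are summed once
 * at the end of the loop. */
long count_above(const int *a, size_t rows, size_t cols, int threshold)
{
    long counter = 0;
    #pragma omp parallel for reduction(+:counter)
    for (long i = 0; i < (long)rows; i++)
        counter += count_row(a + (size_t)i * cols, cols, threshold);
    return counter;
}
```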

Then I tried to implement the same approach in _joint_pdf_gradient_dense_3d(). I tested it using this example. The original takes 0.319248 seconds, while the version using a local function without a lock runs in 0.109481 seconds.

At the same time, I tested the performance of the different scheduling policies with respect to the number of threads. For the simple example:

Then, for the example in _gradient_3d() (line 3168):

In these two examples, we can see that static scheduling is the worst choice, and that there is not much difference between dynamic and guided scheduling. So dynamic or guided is the better choice.
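For reference, a comparison like this can be set up in plain C with OpenMP by leaving the policy to `schedule(runtime)`, which reads it from the `OMP_SCHEDULE` environment variable. The sketch below is an assumption about how such a test could look, not the benchmark used above; the workload in `work()` is a made-up placeholder whose cost grows with the iteration index.

```c
#include <stddef.h>

/* Hypothetical workload: iteration i performs i + 1 units of work,
 * so the iterations are deliberately uneven. */
static double work(long i)
{
    double s = 0.0;
    for (long k = 0; k <= i; k++)
        s += 1.0;
    return s;
}

double run_scheduled(long n)
{
    double total = 0.0;
    /* schedule(runtime) defers the choice of policy to OMP_SCHEDULE. */
    #pragma omp parallel for schedule(runtime) reduction(+:total)
    for (long i = 0; i < n; i++)
        total += work(i);
    return total;
}
```

When built with OpenMP enabled (for example `gcc -fopenmp`), setting `OMP_SCHEDULE` to `static`, `dynamic` or `guided` and `OMP_NUM_THREADS` to the desired thread count switches the configuration without recompiling.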