In the last two week, I made a small example of 3d affine registration to test the speedup performance of parallelization. This example called directly to the function _gradient_3d() in vector_fields.pyx. In this function, I used prange (line 3125) to implement the multi-threads parallel algorithm. And also, I added a timer in both the original version (line 3114, 3162) and parallel one (line 3113, 3174) to time and compare the parallel part. Then I tested it on a cluster with 24 cores (48 threads). The result showed that the original one took 0.221 second per run, while the parallel one took 0.027 second each run. So there was a 8.18 speedup. I also output the resulting gradient array as .npy files. Then I compared them numerically to make sure I got the same result before and after parallelization.
Then I tested it using specific number of threads. But the runtime became fluctuating. The speedup was not so clear as the number of threads grows.
I also tested on _gradient_2d() for 2d affine registration. But I cannot see any speedup at all. So I remove all the change on it.
I also implemented the multi-thread algorithm in sparse cases both in 2d and 3d. The function for calculating gradient in sparse cases are _sparse_gradient_3d() and _sparse_gradient_2d(). The results showed that there was speedup in 3d case, but not in 2d. So I also removed the change on 2d sparse case.
When I parallelized _gradient_3d(), it needs some thread local buffer. I followed Cython documentation on Using Parallelism to implement thread local arrays. I used “with parallel()” before prange. In this block before prange, it would run in parallel to allocate thread local buffer for each thread just as I did in 3d affine registration (line 3116-3124).
I also did some experiments to investigate how Cython implements variables. 1. If in a prange block, it only reads variables, then these variables become shared among all threads; 2. If in a prange block, it writes some values to variables, these variables become thread private (local). It means in each thread, there is a copy of the variable outside the prange block. Also, it will initialize all floating point values (including float and double) to NaN. All integers will be initialize as the furthest number from 0; 3. Pointers and memory views will be thread shared; 4. If you use an inplace operator (for example, i += 1), the variable will be reduced (i in this example). It means any code after this operator will be not able to see this variable.
After this, I profiled the 3d affine registration example of DIPY both in dense case and sparse case: aff_reg_3d.py, aff_reg_3d_spa.py, aff_reg_2d.py, and aff_reg_2d_spa.py.
We can see speedups in 3d cases (both dense and sparse), but no speedup at all in 2d cases. Also we can see _update_histogram() in 3d dense case takes a little bit long time for execution. So I tried to parallelize _joint_pdf_gradient_dense_3d() in parzenhist.pyx. This is the most time-consuming part of _update_histogram().
For the thread local buffer of this function, I tried two ways to implement it: one was to add a local buffer for each thread in repo: aff-par-joint-pdf (line 1017); the other was to add a lock for writing conflicting part in repo: aff_par-hist-lock (line 1031-1041). Both of them showed speedup in execution. And the one with a lock gave more speedup effect. And both these two gave the same numerically result. However, when I added a timer to the original version, it yielded strange image. So I need to do some test on this parallelization to make sure I did the right thing.
I also did some experiments on the difference between scheduling. In Cython, when using prange, we can have three type of scheduling: static is for situation that each iteration takes almost the same amount of time; dynamic is for some situation that I don’t know the run time of each iteration ahead; guided is to enhanced dynamic scheduling that avoid the last iteration to take a long time.
Here we can see, if you automatically use all threads that are available, dynamic scheduling is the best choice. If you specify the number of threads less than total number of threads, all three scheduling show no difference.
Here you can find the pull request of my first parallel version of 3d affine registration.