First parallel version of 3D affine registration

In the last two weeks, I made a small example of 3D affine registration to test the speedup from parallelization. The example calls the function _gradient_3d() in vector_fields.pyx directly. In this function, I used prange (line 3125) to implement the multi-threaded parallel algorithm. I also added a timer to both the original version (lines 3114, 3162) and the parallel one (lines 3113, 3174) to time and compare the parallel part. Then I tested it on a cluster with 24 cores (48 threads). The result showed that the original version took 0.221 seconds per run, while the parallel one took 0.027 seconds per run, an 8.18× speedup. I also saved the resulting gradient arrays as .npy files and compared them numerically to make sure I got the same result before and after parallelization.
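As a rough illustration of the kind of change involved, here is a minimal sketch of parallelizing the outer loop of a 3D gradient-like computation with prange. The function name and the toy central-difference gradient are my own; this is not the actual _gradient_3d() code.

```cython
# cython: language_level=3
cimport cython
from cython.parallel import prange

@cython.boundscheck(False)
@cython.wraparound(False)
def gradient_3d_parallel(double[:, :, :] vol, double[:, :, :, :] grad):
    # Toy central-difference gradient: the outer loop over i is shared among
    # threads; each thread writes to different slices of grad, so there is no
    # write conflict.
    cdef Py_ssize_t i, j, k
    cdef Py_ssize_t ni = vol.shape[0], nj = vol.shape[1], nk = vol.shape[2]
    for i in prange(1, ni - 1, nogil=True):
        for j in range(1, nj - 1):
            for k in range(1, nk - 1):
                grad[i, j, k, 0] = 0.5 * (vol[i + 1, j, k] - vol[i - 1, j, k])
                grad[i, j, k, 1] = 0.5 * (vol[i, j + 1, k] - vol[i, j - 1, k])
                grad[i, j, k, 2] = 0.5 * (vol[i, j, k + 1] - vol[i, j, k - 1])
```

The extension has to be compiled and linked with OpenMP flags (e.g. -fopenmp for GCC) for prange to actually run on multiple threads.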

Then I tested it with specific numbers of threads, but the runtime fluctuated and the speedup did not grow as clearly with the number of threads.

Speedup with specific numbers of threads

I also tested _gradient_2d() for 2D affine registration, but I could not see any speedup at all, so I removed all the changes to it.

I also implemented the multi-threaded algorithm for the sparse cases in both 2D and 3D. The functions that calculate the gradient in the sparse cases are _sparse_gradient_3d() and _sparse_gradient_2d(). The results showed a speedup in the 3D case but not in 2D, so I also removed the changes to the 2D sparse case.

When I parallelized _gradient_3d(), it needed some thread-local buffers. I followed the Cython documentation on Using Parallelism to implement thread-local arrays: I used "with parallel()" before prange. The code in this block before prange runs once per thread, so it can allocate a thread-local buffer for each thread, just as I did in the 3D affine registration (lines 3116-3124).
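Here is a minimal sketch of that pattern, following the Cython "Using Parallelism" documentation. The function and the toy row computation are hypothetical, not the actual DIPY code:

```cython
# cython: language_level=3
cimport cython
from cython.parallel import parallel, prange
from libc.stdlib cimport malloc, free, abort

@cython.boundscheck(False)
@cython.wraparound(False)
def row_norms(double[:, :] data, double[:] out):
    cdef Py_ssize_t i, j
    cdef Py_ssize_t nrows = data.shape[0], ncols = data.shape[1]
    cdef double acc
    cdef double *buf
    with nogil, parallel():
        # This block is executed once per thread, so each thread allocates
        # (and later frees) its own private scratch buffer.
        buf = <double *> malloc(ncols * sizeof(double))
        if buf is NULL:
            abort()
        for i in prange(nrows):
            acc = 0.0
            for j in range(ncols):
                buf[j] = data[i, j] * data[i, j]   # thread-local scratch
                acc = acc + buf[j]
            out[i] = acc
        free(buf)
```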

I also did some experiments to investigate how Cython treats variables inside prange (see the sketch below):

1. If a variable is only read inside a prange block, it is shared among all threads.
2. If a variable is written to inside a prange block, it becomes thread-private (local), i.e. each thread gets its own copy of the variable declared outside the prange block. Cython also initializes these private copies to invalid values: all floating-point values (both float and double) are set to NaN, and all integers are set to the value furthest from zero.
3. Pointers and memoryviews are shared among threads.
4. If you use an in-place operator (for example, i += 1), the variable becomes a reduction variable (i in this example), which means code after this operator inside the loop body can no longer read it; the reduced value is only available after the loop.
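A tiny sketch exercising these four behaviours (my own toy example, not DIPY code): scale is only read (shared), tmp is assigned (private), total uses += (reduction), and the memoryview data is shared:

```cython
# cython: language_level=3
cimport cython
from cython.parallel import prange

@cython.boundscheck(False)
@cython.wraparound(False)
def variable_semantics(double[:] data):
    cdef Py_ssize_t i, n = data.shape[0]
    cdef double scale = 2.0      # only read inside prange -> shared
    cdef double tmp              # assigned inside prange -> thread-private
    cdef double total = 0.0      # in-place += inside prange -> reduction
    for i in prange(n, nogil=True):
        tmp = data[i] * scale    # each thread has its own tmp
        data[i] = tmp            # the memoryview is shared; each thread uses a different i
        total += tmp             # reduction: cannot be read again inside the loop body
    return total                 # the reduced value is visible after the loop
```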

After this, I profiled the affine registration examples of DIPY in both the dense and sparse cases, in 2D and 3D: aff_reg_3d.py, aff_reg_3d_spa.py, aff_reg_2d.py, and aff_reg_2d_spa.py.

Profiles of the affine registration examples

We can see speedups in the 3D cases (both dense and sparse), but no speedup at all in the 2D cases. We can also see that _update_histogram() takes quite a long time in the 3D dense case, so I tried to parallelize _joint_pdf_gradient_dense_3d() in parzenhist.pyx, which is the most time-consuming part of _update_histogram().

For the thread-local buffer of this function, I tried two ways to implement it: one was to add a local buffer for each thread, in the repo aff-par-joint-pdf (line 1017); the other was to add a lock around the conflicting writes, in the repo aff_par-hist-lock (lines 1031-1041). Both of them showed a speedup, and the one with a lock gave a larger speedup. Both gave the same numerical result. However, when I added a timer to the original version, it produced a strange image, so I need to run some tests on this parallelization to make sure I did the right thing.
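To illustrate the lock-based approach, here is a hedged sketch on a toy weighted histogram (not the actual parzenhist.pyx code): several threads may update the same bin, so the conflicting write is guarded by an OpenMP lock exposed through Cython's openmp module.

```cython
# cython: language_level=3
cimport cython
from cython.parallel import prange
cimport openmp

@cython.boundscheck(False)
@cython.wraparound(False)
def weighted_histogram(long[:] bins, double[:] weights, double[:] hist):
    # bins[i] is assumed to be a valid, non-negative index into hist.
    cdef Py_ssize_t i, n = bins.shape[0]
    cdef openmp.omp_lock_t lock
    openmp.omp_init_lock(&lock)
    for i in prange(n, nogil=True):
        # Different threads may hit the same bin, so the update is serialized.
        openmp.omp_set_lock(&lock)
        hist[bins[i]] += weights[i]
        openmp.omp_unset_lock(&lock)
    openmp.omp_destroy_lock(&lock)
```

In the real function, only the conflicting update needs to sit inside the lock, so the rest of each iteration still runs in parallel.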

I also did some experiments on the differences between the schedules. In Cython, prange supports three types of scheduling: static is for situations where each iteration takes roughly the same amount of time; dynamic is for situations where the runtime of each iteration is not known in advance; guided is an enhanced dynamic scheduling that shrinks the chunk size over time, so the last iterations do not leave a single thread running for a long time.
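The schedule is chosen per loop via a prange keyword; in a toy loop like the one below (my own example), switching between 'static', 'dynamic' and 'guided' is a one-word change:

```cython
# cython: language_level=3
cimport cython
from cython.parallel import prange

@cython.boundscheck(False)
@cython.wraparound(False)
def scale_inplace(double[:] x, double alpha):
    # Swap schedule='dynamic' for 'static' or 'guided' to compare behaviours.
    cdef Py_ssize_t i
    for i in prange(x.shape[0], nogil=True, schedule='dynamic'):
        x[i] = x[i] * alpha
```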

Scheduling

Here we can see that, if you automatically use all available threads, dynamic scheduling is the best choice. If you specify a number of threads smaller than the total number of threads, the three schedules show no difference.

Here you can find the pull request for my first parallel version of 3D affine registration.

Speedup per thread

Last week, I added a timer to the simple parallel example to measure the time spent in the multi-threaded part. From this I found that, in this example, the multi-threaded part accounts for only around one third of the total runtime. Then I ran the example on the server with 24 cores (48 threads) to investigate the speedup and efficiency with various numbers of threads.
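Since only about a third of the runtime is spent in the multi-threaded part, the speedup of the whole example is bounded no matter how many threads we use. Assuming the usual Amdahl's-law model, with parallel fraction p ≈ 1/3 and speedup s of the parallel part:

$$ S_{\text{total}} = \frac{1}{(1 - p) + p/s} \le \frac{1}{1 - p} \approx \frac{1}{1 - 1/3} = 1.5 $$

So the per-thread speedups below refer to the multi-threaded part itself, not to the whole run.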

Speedup per thread

This table demonstrates the speedup from multi-threading. For example, with 2 threads we get a 1.92× speedup compared to a single thread; with 18 threads, 17.05×; and with all 48 threads, only 32.59×. This is just what we expected. It is also good news to be sure that we can get this much speedup by implementing multi-threaded algorithms with OpenMP.
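For reference, the parallel efficiency implied by these numbers is the speedup divided by the number of threads:

$$ E(N) = \frac{S(N)}{N}: \quad E(2) = \frac{1.92}{2} \approx 0.96, \quad E(18) = \frac{17.05}{18} \approx 0.95, \quad E(48) = \frac{32.59}{48} \approx 0.68 $$

The drop at 48 threads is likely because the server has only 24 physical cores, so the extra threads come from hyper-threading rather than additional cores.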

I also investigated the differences between static, dynamic, and guided scheduling, but they showed no difference in multi-threading performance.

We use a lot of memoryviews in the Cython code for affine registration, and I was wondering how they behave under multi-threaded parallelism. So I investigated the C code that Cython generates for the simple example.

Cython code of the memoryview 'out'
C code of the memoryview 'out'

In the simple example, different threads access the memoryview through different indices, so sharing it is safe. However, this sharing causes a problem when parallelizing the affine registration code (line 3109).

Memoryviews in affine registration

Here we can see that multiple threads write to the same arrays 'x', 'dx', and 'q', which creates a write conflict between threads.

As discussed with my mentors, we are considering two methods to solve this problem. One is to move the multi-threaded block into a function, so that each thread works on its own local copies of these arrays. The other is to add a lock around this part.
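As a sketch of the first method (toy names and a toy affine transform, not the actual DIPY code): the scratch array lives on the stack of the helper function, so every thread that calls it gets its own copy and the write conflict disappears.

```cython
# cython: language_level=3
cimport cython
from cython.parallel import prange

@cython.boundscheck(False)
@cython.wraparound(False)
cdef void _transform_point(double[:, :] pts, double[:, :] A,
                           double[:, :] out, Py_ssize_t i) nogil:
    # The scratch array x is local to this call, so each thread has its own.
    cdef double x[3]
    cdef Py_ssize_t r, c
    for r in range(3):
        x[r] = 0.0
        for c in range(3):
            x[r] += A[r, c] * pts[i, c]
    for r in range(3):
        out[i, r] = x[r]

@cython.boundscheck(False)
@cython.wraparound(False)
def transform_all(double[:, :] pts, double[:, :] A, double[:, :] out):
    # Each thread handles different rows i, calling the helper above.
    cdef Py_ssize_t i
    for i in prange(pts.shape[0], nogil=True):
        _transform_point(pts, A, out, i)
```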

This week, I will try to implement these two methods, and investigate which one is better.

Investigating multi-threading performance

Last week, to complete the experiments on OpenMP, I made an example with a write conflict among multiple threads, and then solved the problem with a lock.

Then I started to investigate multi-threading performance. I ran the simple example on my MacBook Air with 2 cores (4 threads), which gave a 12% speedup, with each thread not exceeding 75% utilization; that might be due to some protection mechanism. Then I tested other hardware to make sure I could run this example at 100% utilization on each thread: a MacBook Pro with 4 cores (8 threads), a workstation with 12 cores (24 threads), and a server with 24 cores (48 threads). All of these systems can reach 100% utilization, but before reaching full performance I had to wait a long time for the execution to get going. That slowed the run down, making it even slower than the version without multi-threading. My guess is that this is caused by the scheduling in OpenMP. This week, I will try to reduce the number of concurrent threads and investigate the scheduling mechanism of OpenMP.

Also last week, I investigated the code of affine and diffeomorphic registration in DIPY, and I realized that, to implement multi-threaded algorithms for them, affine registration needs no lock, while diffeomorphic registration needs a lock. So I tried to implement multi-threading in affine registration and got a 36% speedup on my MacBook Air with 2 cores (4 threads). You can find it in this branch.

So, this week, I will try to figure out how to reach full performance on each thread. I will also try to implement multi-threaded algorithms in affine and diffeomorphic registration.