Last week, I added a timer to the simple parallel example to measure the time spent in the multi-threaded part. It turned out that in this example the multi-threaded part accounts for only about one third of the total run time. I then ran the example on a server with 24 cores (48 threads) to investigate the speedup and efficiency for various numbers of threads.
The table shows the speedup from multi-threading. For example, 2 threads yield a speedup of 1.92 over a single thread, and 18 threads yield a speedup of 17.05. With all threads in use, however, the speedup is only 32.59. This is just what we expected. It is also good news: we can be confident of getting this much speedup by implementing a multi-threaded algorithm with OpenMP.
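To make the efficiency side of these numbers explicit, here is a small Python sketch that divides each reported speedup by its thread count. It uses only the three figures quoted above. Treating "all threads" as 48 is my assumption based on the server description, and the function name `efficiency` is just for illustration.

```python
# Speedups quoted in the text, keyed by thread count.
# Assumption: "all threads" means the 48 hardware threads of the server.
measurements = {2: 1.92, 18: 17.05, 48: 32.59}


def efficiency(speedup, n_threads):
    """Parallel efficiency: speedup divided by the number of threads."""
    return speedup / n_threads


for n_threads, speedup in measurements.items():
    eff = efficiency(speedup, n_threads)
    print(f"{n_threads:2d} threads: speedup {speedup:5.2f}, efficiency {eff:.2f}")
```

Efficiency stays around 0.95 up to 18 threads and falls to about 0.68 at 48 threads. That may reflect the server having only 24 physical cores, so the extra threads share cores.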
I also compared static, dynamic, and guided scheduling, but they showed no difference in multi-threading performance.
We use a lot of memory views in the Cython code for affine registration, and I was wondering how they are handled under multi-threaded parallelism. So I looked at the C code that Cython generates for the simple example.
Each thread accesses the memory view through a different index. This will cause a problem when parallelizing the affine registration code (line 3109).
Here we can see that multiple threads write to the same arrays ‘x’, ‘dx’, and ‘q’, which creates a write conflict between threads.
After discussing this with my mentors, we are considering two ways to solve the problem. One is to move the multi-threaded block into a function. The other is to put a lock around this part.
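The difference between the two options can be sketched outside Cython with Python's standard `threading` module. This is only an analogy under my own assumptions, not the affine registration code: the worker functions, the scratch buffer `x`, and the toy computation are all hypothetical. In the first variant the scratch buffer is local to the function, so each thread gets its own copy. In the second, one shared buffer is protected by a lock.

```python
import threading

N_THREADS = 4
N_ITEMS = 1000


def work_private(start, stop, out):
    # Option 1: the scratch buffer lives inside the function,
    # so every thread that calls it works on its own copy.
    x = [0.0, 0.0]
    for i in range(start, stop):
        x[0] = i
        x[1] = 2.0 * i
        out[i] = x[0] + x[1]  # each index is written by one thread only


def work_locked(start, stop, out, shared_x, lock):
    # Option 2: all threads reuse one scratch buffer, and a lock makes
    # the write-then-read sequence on it exclusive.
    for i in range(start, stop):
        with lock:
            shared_x[0] = i
            shared_x[1] = 2.0 * i
            out[i] = shared_x[0] + shared_x[1]


def run(target, extra_args=()):
    # Split the index range into one contiguous chunk per thread.
    out = [0.0] * N_ITEMS
    chunk = N_ITEMS // N_THREADS
    threads = []
    for t in range(N_THREADS):
        start = t * chunk
        stop = N_ITEMS if t == N_THREADS - 1 else start + chunk
        args = (start, stop, out) + tuple(extra_args)
        thread = threading.Thread(target=target, args=args)
        threads.append(thread)
        thread.start()
    for thread in threads:
        thread.join()
    return out


expected = [3.0 * i for i in range(N_ITEMS)]
private_result = run(work_private)
locked_result = run(work_locked, ([0.0, 0.0], threading.Lock()))
print(private_result == expected, locked_result == expected)
```

Both variants avoid the conflict. The lock serializes the protected section, though, so it is likely to cost some of the speedup, while the per-call buffer keeps the threads independent.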
This week, I will try to implement both methods and find out which one works better.