Last week, to complete my experiments with OpenMP, I built an example with a write conflict among multiple threads and then resolved the conflict with a lock.
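To illustrate the kind of write conflict I mean, here is a minimal sketch rather than my actual experiment code. The function name `count_with_lock` is made up for this post, and I use `#pragma omp critical` as a stand-in for the lock; an explicit `omp_lock_t` would serve the same purpose.

```c
/* Hypothetical example: every thread increments one shared counter.
   Compile with -fopenmp (GCC/Clang) to enable the pragmas; without that
   flag the pragmas are ignored and the loop simply runs serially. */
long count_with_lock(long n)
{
    long counter = 0;  /* declared outside the loop, so shared by all threads */

    #pragma omp parallel for
    for (long i = 0; i < n; i++) {
        /* Without this critical section, counter++ is a data race:
           two threads can read the same old value and one update is lost. */
        #pragma omp critical
        {
            counter++;
        }
    }
    return counter;
}
```

With the critical section in place, `count_with_lock(n)` returns `n` for any thread count. Removing the pragma on the increment typically gives a smaller, varying result when run with several threads.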
Then I started to investigate multithreaded performance. Running the simple example on my MacBook Air with 2 cores (4 threads) gave a 12% speedup, with utilization never exceeding 75% on any thread. That might be due to some security mechanism. I then tested on other hardware to check whether the example could reach 100% utilization on each thread: a MacBook Pro with 4 cores (8 threads), a workstation with 12 cores (24 threads), and a server with 24 cores (48 threads). All of these systems reached 100% utilization. However, there was a long wait before execution started running at full utilization, which made the whole run slower, even slower than the version without multithreading. My guess is that this is caused by OpenMP's scheduling. This week, I will try reducing the number of concurrent threads and investigate the scheduling mechanism of OpenMP.
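For reference, the two knobs I plan to experiment with are the thread count and the loop schedule. The sketch below is only an assumption about how such a test could look: `sum_squares` and its parameters are hypothetical names, and the `reduction` clause is used here instead of a lock so that the shared sum does not become the bottleneck.

```c
/* Hypothetical benchmark kernel: sum of squares of an array.
   n_threads must be a positive integer. Compile with -fopenmp to enable
   the pragma; without it the loop runs serially. */
double sum_squares(const double *x, long n, int n_threads)
{
    double total = 0.0;

    /* num_threads caps how many threads the region uses;
       schedule(static) splits the iterations into equal contiguous chunks;
       reduction(+:total) gives each thread a private partial sum
       and combines them at the end of the loop. */
    #pragma omp parallel for num_threads(n_threads) schedule(static) reduction(+:total)
    for (long i = 0; i < n; i++) {
        total += x[i] * x[i];
    }
    return total;
}
```

Swapping `schedule(static)` for `schedule(dynamic)` or `schedule(guided)`, and varying `n_threads`, should show whether the startup delay depends on the schedule or on the number of threads.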
Also last week, I studied the code for affine and diffeomorphic registration in DIPY. I realized that, to multithread them, affine registration needs no lock, while diffeomorphic registration does. So I tried implementing multithreading in affine registration first, which gave a 36% speedup on my MacBook Air with 2 cores (4 threads). You can find it in this branch.
So, this week, I will try to figure out how to reach full utilization on each thread. I will also try to implement multithreaded versions of affine and diffeomorphic registration.