Week #6

js94
Published: 07/12/2019

Now that the PCA implementation was working, I needed to improve its performance, as it was too slow to be put into use. Following my mentor's suggestions, I tried replacing the loading matrix with an identity matrix to avoid overhead and reducing the number of samples. Furthermore, I reconstructed the original data matrix at the merging stage and ran PCA on top of it to confirm the accuracy of the current implementation.
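To make that accuracy check concrete, here is a minimal sketch of the idea, assuming NumPy and made-up factor names (`U`, `S`, `Vt`) rather than the actual LiberTEM code: reconstruct the data from the merged SVD factors and run an exact PCA on the reconstruction.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 64))
data -= data.mean(axis=0)  # PCA operates on centered data

# Stand-in for the merged factors that the merging stage produces;
# in the real code these come from the incremental algorithm.
k = 9
U, S, Vt = np.linalg.svd(data, full_matrices=False)
U, S, Vt = U[:, :k], S[:k], Vt[:k]

# Reconstruct the (rank-k) data matrix at the merging stage ...
reconstructed = U @ np.diag(S) @ Vt

# ... and run PCA (here just an SVD of the centered matrix) on top
# of it to check the incrementally computed components.
_, _, Vt_exact = np.linalg.svd(reconstructed, full_matrices=False)

# The leading components should agree up to sign.
err = min(np.linalg.norm(Vt_exact[0] - Vt[0]),
          np.linalg.norm(Vt_exact[0] + Vt[0]))
print(f"leading-component error: {err:.2e}")
```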

What did I do this week?

I checked various ways to improve PCA performance.

Did I get stuck anywhere?

I was getting errors from the dimensions of the buffers that I had pre-defined.

What will I work on next week?

I will work on other ways of improving the performance of the algorithm.

Week #5

js94
Published: 07/07/2019

I continued working on PCA. I split the work into several different implementations and tested which performed best. I managed to bring the running time down to ~26 minutes, compared to 40 minutes last week. Furthermore, the difference in spectral norm error was within a tolerable bound.
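The post does not show how the spectral norm error was measured; here is a minimal sketch of one plausible check (my own illustration, assuming NumPy), where `ord=2` gives the spectral norm and a sign-alignment step accounts for singular vectors only being determined up to sign:

```python
import numpy as np

def spectral_norm_error(approx, exact):
    """Spectral-norm difference between two component matrices,
    after flipping row signs to best match, since singular vectors
    are only determined up to sign."""
    signs = np.sign(np.sum(approx * exact, axis=1))
    signs[signs == 0] = 1.0  # avoid zeroing out orthogonal rows
    return np.linalg.norm(approx * signs[:, None] - exact, ord=2)
```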

What did I do this week?

I worked on PCA: I implemented tests as well as several different implementations of PCA.

Did I get stuck anywhere?

I got stuck on a dimension problem in PCA. Given the current nature of the LiberTEM architecture, I need to pre-specify the dimensions of all objects, and that required quite a bit of scribbling notes before I got it to work; a sketch of what such a declaration looks like is below.
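Roughly, the constraint looks like this (a hypothetical sketch in the spirit of LiberTEM's UDF interface; the class name, buffer names, and sizes are mine, and the exact API may differ): every result buffer must be declared up front with a fixed shape and dtype.

```python
from libertem.udf import UDF


class PcaUDF(UDF):
    # Hypothetical fixed sizes: the number of components and the
    # flattened frame size must be known before any frame is seen.
    N_COMPONENTS = 9
    FRAME_SIZE = 128 * 128

    def get_result_buffers(self):
        return {
            'components': self.buffer(
                kind='single',
                extra_shape=(self.N_COMPONENTS, self.FRAME_SIZE),
                dtype='float32',
            ),
            'num_frames': self.buffer(
                kind='single', extra_shape=(1,), dtype='int64',
            ),
        }
```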


Week #4

js94
Published: 06/24/2019

I continued working on PCA. This week I made test cases and visualized the PCA results. So far it seems to work well, with one major caveat: it is far too slow. Currently PCA takes ~20 s on a toy dataset and over 40 minutes on an actual raw dataset. This is presumably due to memory pressure as well as the computational cost of performing an SVD at every step. I will be working on optimizing the performance in the coming weeks.
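To see why the per-step SVD hurts, note that a full SVD of an n×d matrix costs roughly O(n·d·min(n, d)), and paying that once per frame multiplies it by the number of frames. One hypothetical way to amortize the cost (my own illustration, not the current implementation) is to fold a whole batch of frames into the factorization with a single SVD:

```python
import numpy as np

def batched_svd_update(S, Vt, frames, k=9):
    """Fold a batch of flattened frames into an existing rank-k
    factorization with one SVD, instead of one SVD per frame.
    Rows scaled by singular values preserve the covariance
    structure of the data seen so far."""
    if Vt is None:
        stacked = frames
    else:
        stacked = np.vstack([np.diag(S) @ Vt, frames])
    _, S_new, Vt_new = np.linalg.svd(stacked, full_matrices=False)
    return S_new[:k], Vt_new[:k]

# Usage sketch: one SVD per batch of frames instead of one per frame.
rng = np.random.default_rng(0)
S, Vt = None, None
for batch in np.array_split(rng.normal(size=(1000, 256)), 10):
    S, Vt = batched_svd_update(S, Vt, batch)
```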

 

What did I do this week?

I continued working on PCA, made test cases, and visualized the results. I also merged a PR for documentation.

Did I get stuck anywhere?

Currently, PCA takes too long (~40 minutes on a large raw dataset). I need to cut that down significantly somehow.

What's coming up next week?

I will be working on optimizing PCA performance.


Week #3

js94
Published: 06/15/2019

This week, I continued working on Principal Component Analysis. To fit LiberTEM's architecture, I separated the algorithm into two parts: the first part handles the local computation of processing individual frames, and the second part handles the global computation of merging the outputs from the first part. For the first part, I used the algorithm from the "Candid Covariance-Free Incremental PCA" paper. For the second part, I used an algorithm from a Ph.D. thesis that introduces efficient merging of SVDs.

To test the result, I tried using the Frobenius norm error to measure the similarity between two eigenspace/eigenvector matrices: one approximated by the algorithms I used, and the other computed using full-batch PCA (i.e., the exact solution). One trouble I had was how to set a reasonable error bound. In other words, what is a reasonable Frobenius norm error bound for saying that the current method "well-approximates" the true solution? I opened an issue to document the research I have done on PCA and NNMF, as well as my current progress on the subjects. While working on this issue, I also worked a bit on documentation for the UDF interface and submitted a PR.
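For reference, here is a rough sketch of the per-frame update from the CCIPCA paper as I understand it (variable names are mine, and the paper's amnesic weighting is omitted for simplicity): each component estimate is blended with the new frame, and the frame is then deflated before the next component is estimated.

```python
import numpy as np

def ccipca_update(components, n, frame):
    """One CCIPCA step: `components` holds unnormalized eigenvector
    estimates (k x d), `n` is the number of frames seen so far, and
    `frame` is the new mean-subtracted, flattened frame (d,)."""
    u = frame.copy()
    for i in range(min(len(components), n + 1)):
        if i == n:
            # Too few frames seen yet: initialize from the residual.
            components[i] = u.copy()
        else:
            v = components[i]
            # Blend the old estimate with the new observation.
            components[i] = (n / (n + 1)) * v \
                + (1 / (n + 1)) * (u @ v) / np.linalg.norm(v) * u
            # Deflate: remove this direction so the next component
            # sees only the residual of the frame.
            v = components[i]
            u = u - (u @ v) / (v @ v) * v
    return components
```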

What did I do this week?

I worked on documentation for the UDF interface and continued researching and implementing Principal Component Analysis.

Did I get stuck anywhere?

I implemented PCA and it ran fine on a small toy dataset, but there were two major problems. First, it gave me a memory error on real data, implying that the matrix I'm using is probably too big. Second, I have yet to formulate a testing scheme for the algorithm.

What's coming up next week?

I will continue to work on PCA. I will also look into NNMF and whether it can be implemented more easily in LiberTEM.


Week #2

js94
Published: 06/08/2019

This week, I continued working on Principal Component Analysis for LiberTEM. Even though PCA is a well-studied algorithm with many modifications available for distributed computing, it was very hard to find a suitable algorithm that fits the LiberTEM architecture. LiberTEM's User-Defined Functions, under which the PCA functionality will be placed, can be viewed as a two-step action: first, independent computation over 'partitions' of multiple diffraction pattern images, and second, merging of these independent partition computations. Most online PCA algorithms have a well-established solution for the first part, e.g., incremental PCA, where each new observation can be added to an existing PCA (or SVD) solution. The major problem was the second part: as far as I could find, there was no efficient solution for 'merging' two PCA results. I managed to find a merging algorithm used in a package called gensim, which is used for topic modeling, but I am still not sure whether it is legitimate.

Currently, I've reorganized the code so that I use different algorithms for the 'partition'-level computation and the 'merging'-level computation. I tried running it on a sample toy dataset, but all the singular values came out zero. Also, when I ran it on actual data, I got a memory error, presumably due to the large covariance matrix that I store for the computation. I'm currently thinking about first reshaping the dimension of the data through some form of preprocessing so that the memory error does not occur.
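For concreteness, here is a minimal sketch of the kind of merge step that is needed (my own block-stacking illustration, not necessarily the gensim algorithm): each partition summarizes its data by singular values and right singular vectors, and the merge combines the two summaries with one small SVD.

```python
import numpy as np

def merge_svds(S1, Vt1, S2, Vt2, k):
    """Merge two partial SVD summaries (singular values and right
    singular vectors). Stacking the singular-value-weighted rows
    preserves the covariance structure of both partitions."""
    stacked = np.vstack([np.diag(S1) @ Vt1, np.diag(S2) @ Vt2])
    _, S, Vt = np.linalg.svd(stacked, full_matrices=False)
    return S[:k], Vt[:k]

# Toy check: merging the summaries of two halves of a centered
# matrix recovers the singular values of the full matrix.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 32))
X -= X.mean(axis=0)
parts = [np.linalg.svd(h, full_matrices=False)[1:] for h in np.split(X, 2)]
S, Vt = merge_svds(*parts[0], *parts[1], k=5)
print(np.allclose(S, np.linalg.svd(X, full_matrices=False)[1][:5]))
```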

What did I do this week? 

I built on the skeleton code for Principal Component Analysis that I made last week.

Did I get stuck anywhere?

I'm currently stuck on 1) forming a good test case, and 2) understanding whether my current algorithm is valid. Since online PCA algorithms are approximation methods, some even with nondeterministic output, it is hard to come up with a good test case. Also, many of the available algorithms cannot be fit into LiberTEM's architecture, so I had to combine two different algorithms; one idea for testing despite the nondeterminism is sketched below.
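One possible approach (my own suggestion, not something already in the code): compare the spanned subspaces rather than the raw eigenvector matrices, since projection matrices are invariant to the sign flips and rotations that a nondeterministic solver can produce.

```python
import numpy as np

def subspace_distance(V1, V2):
    """Distance between the subspaces spanned by the rows of V1 and
    V2 (each k x d with orthonormal rows). The projection matrices
    V.T @ V are unchanged by sign flips or rotations of the rows."""
    return np.linalg.norm(V1.T @ V1 - V2.T @ V2, ord=2)

# A test could then assert, for example:
# assert subspace_distance(V_approx, V_exact) < tol
```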

What's coming up next week?

Next week, I really hope to finalize PCA by forming test cases and checking that the algorithm is correct.
