Week #2
js94
Published: 06/08/2019
This week, I continued working on Principal Component Analysis (PCA) for LiberTEM. Even though PCA is a well-studied algorithm with many variants for distributed computing, it was surprisingly hard to find one that fits LiberTEM's architecture. LiberTEM's User-Defined Functions, under which the PCA functionality will live, can be viewed as a two-step process: first, independent computation on each 'partition' of multiple diffraction pattern images, and second, merging of these independent partition results. Most online PCA algorithms have a well-established solution for the first step, e.g., incremental PCA, where each new observation can be folded into an existing PCA (or SVD) solution. The major problem is the second step: as far as I could find, there is no established, efficient solution for 'merging' two PCA results. I did find a merging algorithm used in a package called gensim, which is used for topic modeling, but I am still not sure whether it is sound.

I have now reorganized the code so that different algorithms handle the partition-level and the merging-level computation. When I ran it on a toy dataset, all the singular values came out as zero. Also, when I ran it on actual data, I got a memory error, presumably due to the large covariance matrix I store for the computation. I am currently thinking about reducing the dimension of the data through some form of preprocessing so that the memory error does not occur.
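For the merging step, the direction I am exploring looks roughly like the sketch below: keep, per partition, the sample count, mean, singular values, and components, then merge two such results by stacking the scaled components together with a mean-correction term and re-running SVD. This follows the incremental-SVD merge idea; the function name and signature are made up for illustration and are not necessarily what will end up in the PR.

```python
import numpy as np

def merge_pca(n_a, mean_a, sv_a, comp_a, n_b, mean_b, sv_b, comp_b, n_components):
    """Hypothetical sketch: merge two partial PCA/SVD results.

    Each partition contributes (sample count, mean, singular values,
    components). Stack the scaled components of both partitions plus a
    mean-correction term, then re-run SVD on the small stacked matrix.
    """
    n = n_a + n_b
    mean = (n_a * mean_a + n_b * mean_b) / n
    # correction term accounting for the two partitions having different means
    mean_corr = np.sqrt(n_a * n_b / n) * (mean_a - mean_b)
    stacked = np.vstack([
        sv_a[:, None] * comp_a,
        sv_b[:, None] * comp_b,
        mean_corr[None, :],
    ])
    _, sv, comp = np.linalg.svd(stacked, full_matrices=False)
    return n, mean, sv[:n_components], comp[:n_components]
```

One nice property of this kind of merge is that it only ever operates on small (components + 1) x features matrices, which, if it works out, should avoid keeping a full covariance matrix in memory.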
What did I do this week?
I was building on the skeleton code for Principal Component Analysis that I made last week.
Did you get stuck anywhere?
I'm currently stuck on 1) forming a good test case, and 2) verifying that my current algorithm is valid. Since online PCA algorithms are approximation methods, some even with nondeterministic output, it is hard to come up with a good test case; one approach I am considering is sketched below. Also, many of the available algorithms do not fit into LiberTEM's architecture, so I had to combine two different algorithms.
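The testing idea: generate synthetic data with a known low-rank structure, compute a reference solution with an exact PCA (e.g. scikit-learn's), and compare subspaces rather than raw components, since online methods may flip signs or reorder components. A minimal sketch, where `my_online_pca` is just a made-up stand-in for the code under test:

```python
import numpy as np
from sklearn.decomposition import PCA

def make_low_rank_data(n_samples=500, n_features=64, rank=3, noise=1e-3, seed=0):
    """Synthetic data living (almost) exactly in a known low-rank subspace."""
    rng = np.random.RandomState(seed)
    basis = np.linalg.qr(rng.randn(n_features, rank))[0]           # true subspace
    coeffs = rng.randn(n_samples, rank) * np.linspace(10.0, 2.0, rank)
    return coeffs @ basis.T + noise * rng.randn(n_samples, n_features)

def subspace_angle(components, reference):
    """Largest principal angle (radians) between two component sets;
    insensitive to sign flips and component ordering."""
    q1 = np.linalg.qr(components.T)[0]
    q2 = np.linalg.qr(reference.T)[0]
    cosines = np.linalg.svd(q1.T @ q2, compute_uv=False)
    return np.arccos(np.clip(cosines.min(), -1.0, 1.0))

data = make_low_rank_data()
reference = PCA(n_components=3).fit(data).components_
# assert subspace_angle(my_online_pca(data, n_components=3), reference) < 0.05
```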
What's coming up next week?
Next week, I really hope to finalize PCA by forming test cases and checking that the algorithm is correct.
Week #1 Blog post (in detail)
js94
Published: 06/04/2019
Most of my time was spent coming up with a good implementation of PCA for LiberTEM. To fully utilize LiberTEM's architectural benefits, I needed to make sure that 1) the algorithm can be computed in parallel and 2) two disjoint PCA computation results can be merged into a single PCA result. I have been exploring the incremental PCA algorithm, which allows adding a batch of data to an already-computed PCA result, and distributed PCA, which computes disjoint sets of data in parallel and merges them by keeping only the first n components. Most such parallel algorithms are, by design, at best approximations of the true principal components, so I need to come up with a sound way of testing these functions.

At the same time, the dataset that LiberTEM deals with is a 4D array (or 3D, depending on the application), where the first two dimensions represent the scanned positions and the last two represent the diffraction pattern image data. Thus, I had to come up with some way of computing PCA over this dataset. One way is to 'flatten' each diffraction pattern into a 1D vector and stack them up so that each row constitutes a single 'observation'. Currently, I'm also trying to preprocess the data using radial binning so that PCA can be applied to a smaller number of variables; a small sketch of both steps follows.
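To make the shapes concrete, here is a rough sketch of the flattening step and a crude version of the radial binning idea; the array sizes are made up for illustration and this is not the actual preprocessing code:

```python
import numpy as np

# hypothetical 4D dataset: (scan_y, scan_x, detector_y, detector_x)
data_4d = np.random.rand(32, 32, 128, 128)
scan_y, scan_x, det_y, det_x = data_4d.shape

# flatten: each diffraction pattern becomes one row ('observation'),
# each detector pixel becomes one column ('variable')
data_2d = data_4d.reshape(scan_y * scan_x, det_y * det_x)   # (1024, 16384)

# crude radial binning: average detector pixels by integer distance from
# the pattern center, reducing ~16k variables to ~90
yy, xx = np.indices((det_y, det_x))
r = np.sqrt((yy - det_y / 2) ** 2 + (xx - det_x / 2) ** 2).astype(int).ravel()
n_bins = r.max() + 1
counts = np.maximum(np.bincount(r, minlength=n_bins), 1)
binned = np.stack([
    np.bincount(r, weights=frame, minlength=n_bins) / counts
    for frame in data_2d
])   # (1024, n_bins) -> much smaller input for PCA
```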
Week #1
js94
Published: 06/03/2019
What did I do this week?
In continuing what I did during the community bonding period, I researched more about online algorithms for dimensionality reduction techniques. I've implemented skeleton code for online Principal Component Analysis (PCA) using an incremental algorithm and submitted it as a PR (a minimal sketch of the incremental idea is shown below).
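For illustration only, the incremental idea looks roughly like this, using scikit-learn's IncrementalPCA as a stand-in rather than the actual PR code:

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

# fake stack of flattened diffraction frames; in LiberTEM these would
# arrive partition by partition rather than from one big array
frames = np.random.rand(1024, 4096).astype(np.float32)

ipca = IncrementalPCA(n_components=9)
for batch in np.array_split(frames, 16):   # feed the data in small batches
    ipca.partial_fit(batch)                # update the PCA estimate in place

components = ipca.components_              # (9, 4096) approximate principal axes
```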
What is coming up next?
I'm currently trying to construct good test cases for the PCA code. On the side, I have been writing code for parallel Non-negative Matrix Factorization (NNMF). I plan to submit a PR for this soon as well.
Did you get stuck anywhere?
At the moment, I'm very much stuck on constructing test cases. Since most online algorithms are approximations at best, some with nondeterministic output, it is quite hard to come up with a sound test case.
First blog post
js94
Published: 05/28/2019
Hi all. For the summer, I will be working on implementing various dimensionality reduction techniques in LiberTEM. During the community bonding period, I've been trying to understand the codebase better, as well as researching existing methods for distributed implementations of dimensionality reduction techniques, namely PCA, ICA and UMAP. As I enter the official coding period, the first part of the project will be a combination of implementing the interface for User-Defined Functions in LiberTEM and a distributed implementation of PCA. More about this in the upcoming blog.