Week #3

js94
Published: 06/15/2019

This week, I continued working on Principal Component Analysis. To cope with LiberTEM's architecture, I separated the algorithm into two parts, where the first part concerns the local computation of processing individual frames and the second part concerns the global computation of merging the outputs from the first part. For the first part, I used the algorithm from "Candid Covariance-Free Incremental PCA" paper. For the second part, I used the algorithm from a ph.d thesis that introduces efficient merging of SVDs. To test the result, I was trying to use frobenius norm error to measure the similarity between two eigenspace/eigenvector matrices, one that was approximated by the algorithms that I used and the other that was computed using full-batch PCA (i.e., exact solution). One trouble I had was how to set the reasonable error bound. In other words, what is a reasonable frobenius norm error bound to say that the current method "well-approximates" the true solution? I opened up an issue to document the researches that I have done related to PCA and NNMF, as well as my current progress on the subjects. While working on this issue, I also tried working a bit on documentation for UDF and submitted a PR. 

What did I do this week?

I worked on documentation for UDF interface and continued working on researching/implementing Principal Component Analysis

Did I get stuck anywhere?

I implemented PCA and it ran fine on a small toy data, but had two major problems. First, it gave me a memory error on a real data, implying that the matrix that I'm using is probably too big. Also, I have yet to formulate testing scheme for the algorithm

What's coming up next week?

I will continue to work on PCA. Also, I will research into NNMF and if it can be more easily implemented in LiberTEM

View Blog Post

Week #2

js94
Published: 06/08/2019

This week, I continued working on Principal Component Analysis for LiberTEM. Even though PCA is a well-studied algorithm with many available modifications for distributed computing, it was very hard to find a suitable algorithm that can be placed under LiberTEM architecture. LiberTEM's User-Defined Functions, under which the PCA functionality will be placed, can be viewed as a two-step action. First, independent computation of a 'partition' of multiple diffraction pattern images, and second, merging of these independent computations of partitions. Most online PCA algorithms had well-established solution for the first part, e.g., incremental PCA, where each new observation can be added to existing PCA (or SVD) solution. The major problem was the second part. As far as I could find, there was no efficient solution for 'merging' two PCA results. I managed to find a merging algorithm that is used in a packaged called gensim, which is used for topic modeling, but still am not sure if this is a legitimate one. Currently, I've reorganized the code so that I use different algorithms for 'partition'-level computation and 'merging'-level computation. I have tried running it on a sample toy dataset, but all the singular values became zero. Also, when I ran it on an actual data, I was getting memory error, presumably due to large covariance matrix that I'm storing for computation purpose. I'm currently thinking about first reshaping the dimension of the data through some form of preprocessing so that memory error does not occur.

What did I do this week? 

I was building on the skeleton code for Principal Component Analysis that I made last week. 

Did you get stuck anywhere?

I'm currently stuck in 1) forming a good test case, and 2) understanding whether my current algorithm is valid. Since online PCA algorithms are approximation methods, some even with nondeterministic output, it is hard to come up with a good test case. Also, a lot of the available algorithms cannot be fit into LiberTEM's architecture so I had to combine two different algorithms. 

What's coming up next week?

Next week, I really hope to finalize PCA by forming test cases and checking if algorithm is correct. 

View Blog Post

Week #1 Blog post (in detail)

js94
Published: 06/04/2019

Most of my time was spent on working on coming up with good implementations for PCA on LiberTEM. To fully utilize LiberTEM's architectural benefits, I needed to make sure that the algorithm can be 1) computed in parallel and 2) two disjoint computation results for PCA can be merged into a single PCA result. I have been exploring Incremental PCA algorith, which allows adding a batch of data to computed PCA results, and distributed PCA, which computes disjoint set of data in parallel and merges them by only keeping the first n components. Most of such parallel algorithms are approximation of the true PCA vectors at best by design, so I need to come up with sound way of testing these functions. At the same time, the dataset that LiberTEM is dealing with is 4D matrix (or 3D, depending on the application), where the first two represents the scanned position and the last two represents the diffraction pattern image data. Thus, I had to come up with some way of computing PCA over this dataset. One way is to 'flatten' each diffraction pattern into 1d vector, and stack them up so that each row would  constitute a single 'observation.' Currently, I'm trying to preprocessing the data using radial binning so that PCA can be applied on a smaller number of variables. 

View Blog Post

Week #1

js94
Published: 06/03/2019

What did I do this week? 

In continuing what I have done during the community bonding period, I was researching more about online algorithms for dimensionality reduction techniques. I've implemented the skeleton code for online Principal Component Analysis (PCA) using incremental algorithm and submitted as PR 

What is coming up next?

I'm currently trying to construct good test cases for PCA code. On the side, I have been writing code for parallel Non-negative matrix factorization (NNMF). I plan to submit PR for this soon as well. 

Did you get stuck anywhere?

At the moment, I'm very much stuck at constructing test cases. Since most online algorithms are approximations at best, some with nondeterministic output, it is quite hard to come up with a sound test case. 

View Blog Post

First blog post

js94
Published: 05/28/2019

Hi all. For the summer, I will be working on implementing various dimensionality reduction techniques on LiberTEM. During the community bonding period, I've been trying to understand the codebase better as well as researching on the existing methods for distributed implementation of dimensionality reduction techniques, namely PCA, ICA and UMAP. As I enter into the official coding period, the first part of the project will be somewhat of a combination of implementing the interface for User-Defined Functions in LiberTEM as well as distributed implementation of PCA. More about this in the upcoming blog. 

View Blog Post