Week #2

js94
Published: 06/08/2019

This week, I continued working on Principal Component Analysis for LiberTEM. Even though PCA is a well-studied algorithm with many available modifications for distributed computing, it was very hard to find a suitable algorithm that can be placed under LiberTEM architecture. LiberTEM's User-Defined Functions, under which the PCA functionality will be placed, can be viewed as a two-step action. First, independent computation of a 'partition' of multiple diffraction pattern images, and second, merging of these independent computations of partitions. Most online PCA algorithms had well-established solution for the first part, e.g., incremental PCA, where each new observation can be added to existing PCA (or SVD) solution. The major problem was the second part. As far as I could find, there was no efficient solution for 'merging' two PCA results. I managed to find a merging algorithm that is used in a packaged called gensim, which is used for topic modeling, but still am not sure if this is a legitimate one. Currently, I've reorganized the code so that I use different algorithms for 'partition'-level computation and 'merging'-level computation. I have tried running it on a sample toy dataset, but all the singular values became zero. Also, when I ran it on an actual data, I was getting memory error, presumably due to large covariance matrix that I'm storing for computation purpose. I'm currently thinking about first reshaping the dimension of the data through some form of preprocessing so that memory error does not occur.

What did I do this week? 

I was building on the skeleton code for Principal Component Analysis that I made last week. 

Did you get stuck anywhere?

I'm currently stuck in 1) forming a good test case, and 2) understanding whether my current algorithm is valid. Since online PCA algorithms are approximation methods, some even with nondeterministic output, it is hard to come up with a good test case. Also, a lot of the available algorithms cannot be fit into LiberTEM's architecture so I had to combine two different algorithms. 

What's coming up next week?

Next week, I really hope to finalize PCA by forming test cases and checking if algorithm is correct. 

DJDT

Versions

Time

Settings from gsoc.settings

Headers

Request

SQL queries from 1 connection

Static files (2312 found, 3 used)

Templates (11 rendered)

Cache calls from 1 backend

Signals

Log messages