js94's Blog

Week #7

js94
Published: 07/15/2019

I managed to reduce the run time from 26 minutes to ~100 seconds by fetching a bigger chunk of data at a time than was fetched before. Furthermore, I successfully implemented a hyperbox for the loading matrix, which is essentially a method of subsampling the loading matrix so that the data is well represented with fewer samples. Although this induced more loss, I'm inclined to conclude that the cost is manageable and that the speed-up outweighs it. In fact, since I'm applying an approximate PCA method, some degree of loss is expected anyway.
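To make the cost/benefit concrete, here is a minimal numpy sketch of the subsampling idea: keep only a subset of rows of the loading matrix, reconstruct from the reduced factors, and measure how much extra error that introduces. The random row selection and the sizes below are illustrative assumptions, not the actual hyperbox construction.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Toy stand-ins for the real quantities: X is the data matrix,
# loadings/components come from a rank-k factorization X ~ loadings @ components.
X = rng.normal(size=(1000, 64))
k = 16
U, S, Vt = np.linalg.svd(X, full_matrices=False)
loadings = U[:, :k] * S[:k]          # (n_samples, k) loading matrix
components = Vt[:k]                  # (k, n_features)

# Subsample rows of the loading matrix (hypothetical stand-in for the
# hyperbox selection): keep n_keep representative samples instead of all.
n_keep = 200
idx = rng.choice(loadings.shape[0], size=n_keep, replace=False)
loadings_sub = loadings[idx]
X_sub = X[idx]

# Loss induced by working with the reduced representation.
full_err = np.linalg.norm(X - loadings @ components)
sub_err = np.linalg.norm(X_sub - loadings_sub @ components)
print(f"relative error, full loadings: {full_err / np.linalg.norm(X):.3f}")
print(f"relative error, subsampled:    {sub_err / np.linalg.norm(X_sub):.3f}")
```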

What did I do this week?

Improved overall PCA performance

Did I get stuck anywhere?

I was getting errors because I miscalculated the dimensions needed for reconstruction

What will I work on next week?

It seems like the PCA work has almost come to an end. Once I clean up the code and produce some example notebooks for reference, then, with my mentor's approval, I will probably begin working on non-negative matrix factorization, if not this week then next week.


Week #6

js94
Published: 07/12/2019

Now that the PCA implementation was working, I needed to improve its performance, as it was too slow to be put to practical use. Following my mentor's suggestions, I tried replacing the loading matrix with an identity matrix to avoid overhead, and reducing the number of samples. Furthermore, I reconstructed the original data matrix at the merging stage and ran PCA on top of that to confirm the accuracy of the current implementation.
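As a rough illustration of that accuracy check (a toy numpy sketch, not the actual LiberTEM code), the idea is to reconstruct the data from the merged SVD factors and verify that exact PCA on the reconstruction spans essentially the same leading subspace as exact PCA on the original data:

```python
import numpy as np

rng = np.random.default_rng(seed=1)
X = rng.normal(size=(500, 32))
k = 8

# Pretend U, S, Vt are the merged factors produced at the merging stage.
U, S, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
U, S, Vt = U[:, :k], S[:k], Vt[:k]

# Reconstruct the (centered) data from the merged factors ...
X_rec = (U * S) @ Vt

# ... and run exact PCA on both the reconstruction and the original.
_, _, Vt_rec = np.linalg.svd(X_rec, full_matrices=False)
_, _, Vt_exact = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)

# Compare the leading eigenspaces via their projection matrices,
# which is insensitive to sign flips of individual components.
P_rec = Vt_rec[:k].T @ Vt_rec[:k]
P_exact = Vt_exact[:k].T @ Vt_exact[:k]
print("subspace difference:", np.linalg.norm(P_rec - P_exact))
```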

What did I do this week?

Checked various ways to improve PCA performance

Did I get stuck anywhere?

I was getting errors from the dimensions of the buffers that I had pre-defined.

What will I work on next week?

I will work on other ways of improving the performance of the algorithm


Week #5

js94
Published: 07/07/2019

I continued working on PCA. I split the work into several different implementations and tested which one performed best. I managed to bring the run time down to ~26 minutes, compared to 40 minutes last week. Furthermore, the difference in spectral norm error stayed within a tolerable bound.
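For reference, this is roughly what I mean by the spectral norm error; the helper below is an illustrative sketch with names of my own choosing, not the actual test code. It compares the projection matrices spanned by the approximate and exact components, so sign flips or rotations within the subspace don't inflate the error; swapping ord=2 for ord='fro' gives the Frobenius-norm variant.

```python
import numpy as np

def spectral_norm_error(comps_approx, comps_exact):
    """Spectral norm of the difference between the projection matrices
    spanned by two sets of components (rows = components, assumed orthonormal)."""
    p_approx = comps_approx.T @ comps_approx
    p_exact = comps_exact.T @ comps_exact
    return np.linalg.norm(p_approx - p_exact, ord=2)

# e.g. comparing the top-k right singular vectors of two runs:
# err = spectral_norm_error(Vt_fast[:k], Vt_exact[:k])
```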

What did I do this week?

I worked on PCA, implementing tests as well as several different implementations of the algorithm

Did I get stuck anywhere?

I was stuck on a dimension problem in PCA. Due to the nature of the current LiberTEM architecture, I need to pre-specify the dimensions of all objects, and that required quite a bit of scribbling in my notes before I made it work.
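For context, this is roughly what pre-specifying those dimensions looks like in a LiberTEM UDF: every result buffer's kind, extra shape and dtype have to be declared before any frame is processed. The class below is a bare skeleton with made-up sizes, not my actual PCA UDF.

```python
from libertem.udf import UDF


class PcaSketchUDF(UDF):
    """Skeleton UDF; only the buffer declaration is the point here."""

    def get_result_buffers(self):
        # Shapes must be fixed up front, e.g. a component matrix of
        # (n_components, n_pixels) -- both numbers are hypothetical.
        n_components = 8
        n_pixels = 128 * 128
        return {
            'components': self.buffer(
                kind='single',
                extra_shape=(n_components, n_pixels),
                dtype='float32',
            ),
        }

    def process_frame(self, frame):
        # per-frame (local) update of the factorization would go here
        pass

    def merge(self, dest, src):
        # combining partial results from different partitions would go here
        pass
```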


Week #4

js94
Published: 06/24/2019

I continued working on PCA. This week I made test cases and visualized the PCA results. So far it seems to work well, with one major caveat: it is way too slow. Currently, PCA takes ~20 seconds on a toy dataset and over 40 minutes on an actual raw dataset. This is presumably due to memory issues as well as the computational cost of performing an SVD at every step. I will be working on optimizing the performance in the coming weeks.


What did I do this week?

I continued working on PCA, made test cases, and visualized the results. I also merged a PR for documentation.

Did I get stuck anywhere?

Currently, PCA takes too long (~40 minutes on a large raw dataset). I need to cut that down significantly somehow.

What's coming up next week?

I will be working on optimizing PCA performance


Week #3

js94
Published: 06/15/2019

This week, I continued working on Principal Component Analysis. To fit LiberTEM's architecture, I separated the algorithm into two parts: the first part handles the local computation of processing individual frames, and the second part handles the global computation of merging the outputs from the first part. For the first part, I used the algorithm from the "Candid Covariance-Free Incremental PCA" paper. For the second part, I used an algorithm from a Ph.D. thesis that introduces efficient merging of SVDs. To test the result, I tried using the Frobenius norm error to measure the similarity between two eigenspace/eigenvector matrices: one approximated by the algorithms I used and the other computed using full-batch PCA (i.e., the exact solution). One trouble I had was how to set a reasonable error bound. In other words, what is a reasonable Frobenius norm error bound to say that the current method "well-approximates" the true solution? I opened an issue to document the research I have done on PCA and NNMF, as well as my current progress on the subjects. While working on this issue, I also worked a bit on documentation for the UDF interface and submitted a PR.
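For the curious, the local (per-frame) part boils down to the CCIPCA update rule from that paper: each new centered frame nudges the current eigenvector estimates and is then deflated before updating the next component. The snippet below is my own paraphrase of the published update (variable names and the initialization convention are mine), not the code as it lives in the branch.

```python
import numpy as np

def ccipca_update(v, x, n, amnesic=2.0):
    """One CCIPCA step (Weng et al.): update the unnormalized eigenvector
    estimates `v` (rows = components) with centered sample number `n`
    (counting from 1). `amnesic` is the paper's forgetting parameter."""
    x = x.copy()
    w_old = (n - 1 - amnesic) / n
    w_new = (1 + amnesic) / n
    for i in range(v.shape[0]):
        if i == n - 1:
            # fewer samples than components seen so far: initialize from x
            v[i] = x
            break
        unit = v[i] / np.linalg.norm(v[i])
        # pull the estimate towards the new sample, weighted by its projection
        v[i] = w_old * v[i] + w_new * (x @ unit) * x
        # deflate x so the next component only sees the residual
        unit = v[i] / np.linalg.norm(v[i])
        x = x - (x @ unit) * unit
    return v
```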

What did I do this week?

I worked on documentation for the UDF interface and continued researching and implementing Principal Component Analysis

Did I get stuck anywhere?

I implemented PCA and it ran fine on a small toy dataset, but there were two major problems. First, it gave me a memory error on real data, implying that the matrix I'm using is probably too big. Also, I have yet to formulate a testing scheme for the algorithm.

What's coming up next week?

I will continue to work on PCA. I will also look into NNMF and whether it can be more easily implemented in LiberTEM.
