# Articles on js94's Blog

https://blogs.python-gsoc.org · Updates on different articles published on js94's Blog

## Week #12

*js94@rice.edu (js94) · Sat, 24 Aug 2019 21:36:21 +0000 · https://blogs.python-gsoc.org/en/js94s-blog/week-12/*

I tried, unsuccessfully, to resolve the issue with dimensions in NMF. The main problem I faced was specifying the correct dimensions for the matrix operations within the NMF algorithm, and I have yet to identify the correct dimensions. Furthermore, unlike PCA, there is no incremental algorithm for NMF, so performing NMF becomes infeasible when the data is large. There are several heuristics to work around this scalability issue, notably parallel NMF, which divides the columns (i.e., features) of the data matrix into several subsets, performs NMF independently on each column subset, and then joins the results at the end. Unfortunately, this algorithm is not feasible for LiberTEM, because LiberTEM only has access to the rows of the data matrix (i.e., images), not to full columns (i.e., feature vectors), at each step. I also cleaned up and reorganized the Jupyter notebook.

**What did I do this week?**

Work on NMF

**What will I work on next week?**

Write documentation

## Week #11

*js94@rice.edu (js94) · Mon, 12 Aug 2019 13:09:15 +0000 · https://blogs.python-gsoc.org/en/js94s-blog/week-11/*

Again, I mostly focused on designing edge cases, and here are some of the ways I attempted to construct edge cases for PCA. First, I checked whether the component matrices returned by standard PCA and the implemented PCA are approximately the same. After adjusting for signs (in PCA, the signs of the components are non-deterministic unless extra measures are taken), I verified that the component matrices returned by both algorithms are almost identical, which adds credibility to the implemented PCA. I also checked the performance of the implemented PCA while varying the number of partitions, and on a synthetic data matrix generated from collinear vectors. In both cases, the implemented PCA performed on par with standard PCA. Then I tried different approaches related to the hyperbox method by designing special cases for the loading and component matrices. So far, no noticeable differences have surfaced, and in the coming week I need to clear up the conceptual confusion I still have on this issue.

**What did I do this week?**

Design test cases to differentiate between standard PCA and implemented PCA

**What will I work on next week?**

Further work on the testing framework for PCA. Continue implementing NNMF
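As a concrete illustration of the sign-adjustment check described in this post, here is a minimal sketch, assuming both implementations return components as rows of a matrix. This is illustrative code of my own, not the project's actual test suite:

```python
import numpy as np

def align_signs(reference, components):
    """Flip the sign of each row of `components` so that it correlates
    positively with the corresponding row of `reference`."""
    signs = np.sign(np.sum(reference * components, axis=1))
    signs[signs == 0] = 1.0  # guard against exactly orthogonal rows
    return components * signs[:, None]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))

# Reference solution from full-batch SVD; rows of Vt are the components.
_, _, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
reference = Vt[:5]

# Stand-in for the incremental result: the same subspace, but with
# arbitrary per-component signs, as PCA is free to produce.
candidate = reference * np.array([1.0, -1.0, 1.0, -1.0, -1.0])[:, None]

assert np.allclose(align_signs(reference, candidate), reference)
```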
## Week #10

*js94@rice.edu (js94) · Sun, 04 Aug 2019 23:47:59 +0000 · https://blogs.python-gsoc.org/en/js94s-blog/week-10/*

I continued working on developing test cases. I tried to write up an overview of a testing framework, since such a framework could potentially be useful for other methods beyond PCA. I still haven't found a test where the implemented PCA falls short of standard PCA, which is good in the sense that its performance is on par with standard PCA, but potentially harmful in that we don't know its vulnerabilities. Meanwhile, I'm in the process of implementing NMF, and I'm currently trying to use the hyperbox method that I used in PCA. Unlike PCA, however, it is generally not possible to apply NMF incrementally the way we did with PCA, so I'm reading papers to resolve this issue.

**What did I do this week?**

Design test cases for PCA. Partially implemented code for NNMF.

**What will I work on next week?**

Testing framework for PCA and other methods. Continue implementing NNMF

## Week #9

*js94@rice.edu (js94) · Sun, 04 Aug 2019 23:43:07 +0000 · https://blogs.python-gsoc.org/en/js94s-blog/week-9/*

I explored different ways in which PCA can be tested. More specifically, I tried to find testing schemes under which the PCA I developed fails while the standard PCA method works. This exercise is meant to help me understand the potential limitations of my PCA method. Unfortunately, I have yet to find a case where it fails; so far, standard PCA and the implemented PCA perform more or less at the same level. Currently, I'm trying to exploit the fact that the "hyperbox" method was used by drawing the loading matrix from different heavy-tailed distributions.

**What did I do this week?**

Design test cases for PCA

**What will I work on next week?**

Devising test cases for PCA. Implement NNMF

## Week #8

*js94@rice.edu (js94) · Mon, 22 Jul 2019 17:11:41 +0000 · https://blogs.python-gsoc.org/en/js94s-blog/week-8/*

For the most part, I spent time reading about parallel Non-negative Matrix Factorization, as this is most likely the next course of action. Unlike PCA, which has many scalable, parallel implementations available, NMF seems to be less well studied, and I could not find an implementation suitable as a reference. At the same time, I fixed some things, and the runtime went back up to ~15 minutes. I also pushed a prototype Jupyter notebook that replicates what the current PCA implementation in LiberTEM does, just with standard Python libraries (e.g., sklearn, fbpca), and obtained satisfactory reconstruction error. As per my mentor's suggestion, I will be looking into edge cases and whether the current PCA works under them.

**What did I do this week?**

Improved overall PCA performance

**What will I work on next week?**

Devising edge test cases for PCA. Study parallel NMF in depth
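For context on what a basic NMF iteration looks like before any parallelization, here is a minimal sketch of the classic Lee-Seung multiplicative update rules. This is my own illustration, not LiberTEM code; a parallel NMF would partition the data and merge per-partition factors on top of a core update like this:

```python
import numpy as np

def nmf(X, k, n_iter=200, eps=1e-9, seed=0):
    """Factor a non-negative matrix X (n x m) into W (n x k) @ H (k x m)
    using Lee-Seung multiplicative updates."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W = rng.random((n, k))
    H = rng.random((k, m))
    for _ in range(n_iter):
        # Multiplicative updates keep W and H non-negative by construction.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ (H @ H.T) + eps)
    return W, H

X = np.abs(np.random.default_rng(1).normal(size=(60, 40)))
W, H = nmf(X, k=5)
print(np.linalg.norm(X - W @ H) / np.linalg.norm(X))  # relative error
```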
## Week #7

*js94@rice.edu (js94) · Mon, 15 Jul 2019 14:28:42 +0000 · https://blogs.python-gsoc.org/en/js94s-blog/week-7/*

I managed to reduce the runtime from ~26 minutes to ~100 seconds by fetching a bigger chunk of data at a time than was fetched before. Furthermore, I successfully implemented the hyperbox for the loading matrix, which is essentially a method of subsampling the loading matrix so that the data is well represented with fewer samples. Although this introduced additional loss, I'm inclined to conclude that the cost is manageable and that the speed-up outweighs it. In fact, since I'm applying an approximate PCA method, some degree of loss is expected.

**What did I do this week?**

Improved overall PCA performance

**Did I get stuck anywhere?**

I was getting errors because I miscalculated the dimension needed for reconstruction

**What will I work on next week?**

It seems like PCA has almost come to an end (?). Once I clean up the code and produce some example notebooks for reference, with my mentor's approval I will probably begin working on non-negative matrix factorization, if not this week, then next week.

## Week #6

*js94@rice.edu (js94) · Fri, 12 Jul 2019 00:30:06 +0000 · https://blogs.python-gsoc.org/en/js94s-blog/week-6-2/*

Now that the PCA implementation was working, I needed to improve its performance, as it was too slow to be put into use. As per my mentor's suggestions, I tried replacing the loading matrix with an identity matrix to avoid overhead and to reduce the number of samples. Furthermore, I tried reconstructing the original data matrix at the merging stage and ran PCA on top of that to confirm the accuracy of the current implementation.

**What did I do this week?**

Check various ways to improve PCA performance

**Did I get stuck anywhere?**

I was getting errors from the dimensions of the buffers that I pre-defined.

**What will I work on next week?**

I will work on other ways of improving the performance of the algorithm

## Week #5

*js94@rice.edu (js94) · Sun, 07 Jul 2019 23:36:24 +0000 · https://blogs.python-gsoc.org/en/js94s-blog/week-5-1/*

I continued working on PCA. I split the work into several candidate PCA implementations and tested which performed best. I've managed to take the time down to ~26 minutes, compared to 40 minutes last week. Furthermore, the difference in spectral norm error was within a tolerable bound.

**What did I do this week?**

I worked on PCA, implemented tests as well as several different implementations of PCA

**Did I get stuck anywhere?**

I was stuck due to the dimension problem in PCA. By the current nature of the LiberTEM architecture, I need to pre-specify the dimensions of all objects, and that required quite a bit of scribbling notes before I made it work.
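To make the spectral norm comparison mentioned in this post concrete, here is a minimal sketch of how one might measure it, assuming a rank-k reconstruction from each implementation. This is my own illustration; the perturbed matrix below merely stands in for the output of the incremental PCA:

```python
import numpy as np

def rank_k_reconstruction(X, k):
    """Best rank-k approximation of X via truncated SVD."""
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k] @ np.diag(S[:k]) @ Vt[:k]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
k = 10

exact = rank_k_reconstruction(X, k)
# Stand-in for the approximate/incremental result; in a real test this
# would come from the implemented PCA.
approx = rank_k_reconstruction(X + 1e-6 * rng.normal(size=X.shape), k)

# ord=2 gives the spectral norm (largest singular value).
err = np.linalg.norm(exact - approx, ord=2) / np.linalg.norm(X, ord=2)
print(f"relative spectral norm error: {err:.2e}")
```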
## Week #4

*js94@rice.edu (js94) · Mon, 24 Jun 2019 08:27:10 +0000 · https://blogs.python-gsoc.org/en/js94s-blog/week-4-1/*

I continued working on PCA. This week I made test cases and visualized the PCA results. So far it seems to work well, with one major caveat: it is far too slow. Currently PCA takes ~20 seconds on a toy dataset and over 40 minutes on an actual raw dataset. This is presumably due to memory issues as well as the computational cost of performing an SVD at every step. I will be working on optimizing the performance in the coming weeks.

**What did I do this week?**

I continued working on PCA, made test cases and visualized the results. Merged PR for documentation

**Did I get stuck anywhere?**

Currently, PCA takes too long (~40 minutes on large raw data). I need to cut that down significantly somehow.

**What's coming up next week?**

I will be working on optimizing PCA performance

## Week #3

*js94@rice.edu (js94) · Sat, 15 Jun 2019 10:09:10 +0000 · https://blogs.python-gsoc.org/en/js94s-blog/week-3-1/*

This week, I continued working on Principal Component Analysis. To fit LiberTEM's architecture, I separated the algorithm into two parts: the first part handles the local computation of processing individual frames, and the second part handles the global computation of merging the outputs of the first part. For the first part, I used the algorithm from the "Candid Covariance-Free Incremental PCA" paper. For the second part, I used an algorithm from a Ph.D. thesis that introduces efficient merging of SVDs. To test the result, I tried using the Frobenius norm error to measure the similarity between two eigenspace/eigenvector matrices: one approximated by the algorithms I used, and one computed by full-batch PCA (i.e., the exact solution). One trouble I had was how to set a reasonable error bound. In other words, what Frobenius norm error bound is small enough to say that the current method "well-approximates" the true solution? I opened an issue to document the research I have done on PCA and NNMF, as well as my current progress. While working on this issue, I also worked a bit on documentation for UDF and submitted a PR.

**What did I do this week?**

I worked on documentation for the UDF interface and continued researching/implementing Principal Component Analysis

**Did I get stuck anywhere?**

I implemented PCA and it ran fine on small toy data, but there were two major problems. First, it gave me a memory error on real data, implying that the matrix I'm using is probably too big. Also, I have yet to formulate a testing scheme for the algorithm

**What's coming up next week?**

I will continue to work on PCA. Also, I will research NNMF and whether it can be more easily implemented in LiberTEM
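As a rough illustration of what merging two partial SVDs can look like, here is a sketch of one common approach: stack the singular-value-weighted right factors of the two blocks and re-decompose. This is my own simplified version for row-wise blocks, not necessarily the algorithm from the thesis mentioned above; with untruncated inputs it is exact, and truncation makes it an approximation:

```python
import numpy as np

def merge_svds(S1, Vt1, S2, Vt2, k):
    """Merge the SVDs of two row-wise blocks X1 and X2 into an (approximate)
    rank-k SVD of the stacked matrix [X1; X2], using only S and Vt."""
    # The Gram matrix of [X1; X2] is Vt1.T @ S1^2 @ Vt1 + Vt2.T @ S2^2 @ Vt2,
    # so stacking S * Vt preserves the singular values and right vectors.
    stacked = np.vstack([S1[:, None] * Vt1, S2[:, None] * Vt2])
    _, S, Vt = np.linalg.svd(stacked, full_matrices=False)
    return S[:k], Vt[:k]

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 40))
X1, X2 = X[:150], X[150:]

_, S1, Vt1 = np.linalg.svd(X1, full_matrices=False)
_, S2, Vt2 = np.linalg.svd(X2, full_matrices=False)
S_merged, _ = merge_svds(S1, Vt1, S2, Vt2, k=10)

# With untruncated inputs the merged singular values match the full ones.
_, S_full, _ = np.linalg.svd(X, full_matrices=False)
print(np.allclose(S_merged, S_full[:10]))  # True
```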
## Week #2

*js94@rice.edu (js94) · Sat, 08 Jun 2019 11:44:29 +0000 · https://blogs.python-gsoc.org/en/js94s-blog/week-2-1/*

This week, I continued working on Principal Component Analysis for LiberTEM. Even though PCA is a well-studied algorithm with many modifications available for distributed computing, it was very hard to find a suitable algorithm that fits the LiberTEM architecture. LiberTEM's User-Defined Functions, under which the PCA functionality will live, can be viewed as a two-step action: first, independent computation over a 'partition' of multiple diffraction pattern images, and second, merging of these independent partition computations. Most online PCA algorithms have a well-established solution for the first part, e.g., incremental PCA, where each new observation can be added to an existing PCA (or SVD) solution. The major problem was the second part: as far as I could find, there is no efficient solution for 'merging' two PCA results. I managed to find a merging algorithm used in a package called gensim, which is used for topic modeling, but I'm still not sure whether it is legitimate. Currently, I've reorganized the code so that I use different algorithms for the 'partition'-level computation and the 'merging'-level computation. I tried running it on a sample toy dataset, but all the singular values became zero. Also, when I ran it on actual data, I got a memory error, presumably due to the large covariance matrix I'm storing for the computation. I'm currently thinking about first reducing the dimension of the data through some form of preprocessing so that the memory error does not occur.

**What did I do this week?**

I was building on the skeleton code for Principal Component Analysis that I made last week.

**Did you get stuck anywhere?**

I'm currently stuck on 1) forming a good test case, and 2) understanding whether my current algorithm is valid. Since online PCA algorithms are approximation methods, some even with nondeterministic output, it is hard to come up with a good test case. Also, a lot of the available algorithms cannot be fit into LiberTEM's architecture, so I had to combine two different algorithms.

**What's coming up next week?**

Next week, I really hope to finalize PCA by forming test cases and checking that the algorithm is correct.

## Week #1 Blog post (in detail)

*js94@rice.edu (js94) · Tue, 04 Jun 2019 14:34:53 +0000 · https://blogs.python-gsoc.org/en/js94s-blog/week-1-blog-post-in-detail/*

Most of my time was spent on coming up with good implementations of PCA for LiberTEM. To fully utilize LiberTEM's architectural benefits, I needed to make sure that 1) the algorithm can be computed in parallel, and 2) two disjoint PCA computations can be merged into a single PCA result. I have been exploring the incremental PCA algorithm, which allows adding a batch of data to computed PCA results, and distributed PCA, which computes disjoint sets of data in parallel and merges them by keeping only the first n components. By design, most such parallel algorithms are at best approximations of the true PCA vectors, so I need to come up with a sound way of testing these functions. At the same time, the dataset LiberTEM deals with is a 4D matrix (or 3D, depending on the application), where the first two dimensions represent the scanned positions and the last two represent the diffraction pattern image data. Thus, I had to come up with some way of computing PCA over this dataset. One way is to 'flatten' each diffraction pattern into a 1D vector and stack them up, so that each row constitutes a single 'observation.' Currently, I'm trying to preprocess the data using radial binning so that PCA can be applied to a smaller number of variables.
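A minimal sketch of this flattening idea, combined with an incremental fit, might look like the following. The shapes are made up for illustration, and scikit-learn's IncrementalPCA stands in for the partition-wise algorithm discussed above:

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

# Hypothetical 4D dataset: 32 x 32 scan positions, 64 x 64 detector frames.
data = np.random.default_rng(0).random((32, 32, 64, 64))

# Flatten: each diffraction frame becomes one row (observation) of length 64*64.
X = data.reshape(-1, 64 * 64)  # shape (1024, 4096)

# Feed the data partition by partition; partial_fit updates the existing fit.
ipca = IncrementalPCA(n_components=9)
for chunk in np.array_split(X, 8):
    ipca.partial_fit(chunk)

print(ipca.components_.shape)  # (9, 4096)
```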
## Week #1

*js94@rice.edu (js94) · Mon, 03 Jun 2019 10:09:33 +0000 · https://blogs.python-gsoc.org/en/js94s-blog/week-1-2/*

**What did I do this week?**

Continuing what I did during the community bonding period, I researched more about online algorithms for dimensionality reduction techniques. I've implemented skeleton code for online Principal Component Analysis (PCA) using an incremental algorithm and submitted it as a PR.

**What is coming up next?**

I'm currently trying to construct good test cases for the PCA code. On the side, I have been writing code for parallel Non-negative Matrix Factorization (NNMF). I plan to submit a PR for this soon as well.

**Did you get stuck anywhere?**

At the moment, I'm very much stuck on constructing test cases. Since most online algorithms are at best approximations, some with nondeterministic output, it is quite hard to come up with a sound test case.

## First blog post

*js94@rice.edu (js94) · Tue, 28 May 2019 20:29:19 +0000 · https://blogs.python-gsoc.org/en/js94s-blog/first-blog-post/*

Hi all. For the summer, I will be working on implementing various dimensionality reduction techniques in LiberTEM. During the community bonding period, I've been trying to understand the codebase better, as well as researching existing methods for distributed implementations of dimensionality reduction techniques, namely PCA, ICA, and UMAP. As I enter the official coding period, the first part of the project will be a combination of implementing the interface for User-Defined Functions in LiberTEM and a distributed implementation of PCA. More about this in the upcoming blog.