GSoC Weekly Check-In #1 (June 7)

Published: 06/07/2021

Hi Python community members! My name is Rahul and I'm an incoming Master's student in Informatics at TU Munich. I am stoked about being accepted to GSoC with Activeloop!
The problem I'll be working on this GSoC is interesting and challenging. Datasets are often modified, and there is no efficient way to check whether two datasets are identical. This gets worse for large datasets that don't fit in memory (1TB+). My project aims to design a hashing technique for comparing such large-scale, out-of-core machine learning datasets.

What did I do this week?
Last week was the 'Community Bonding Period'. I met my mentors, Vinn and Abhinav, discussed my project in depth, and resolved the concerns I had with my proposal. I had proposed to implement Locality Sensitive Hashing (LSH) for comparing datasets, an algorithmic technique that hashes similar input items into the same "buckets" with high probability. LSH works great for finding similar images or text documents, but for checking whether a dataset has changed, md5 hashing the entire dataset could work fine. So, for the next few weeks, I'll work on implementing a sample-wise hashing algorithm for large datasets. The two most important factors are the time complexity of the algorithm (since some datasets can be petabytes in size) and the accuracy (how much similarity can be detected). A couple of interesting links I found during my research:

I've also been uploading datasets from Kaggle to Hub ahead of Hub 2.0's official release on June 19th. This has been a great way to get familiar with how Hub stores datasets. Check them out here!

What will I do next week?
This week I'll focus on implementing sample-wise hashing for datasets. An md5 hash will be generated for each sample in a dataset. The dataset it is being compared against will be hashed the same way. The hashes from the two datasets are then compared, and a 'similarity score' is computed based on how many hashes match. For testing, I'll start off with two identical datasets, for which the algorithm should report a similarity score of 100%. I'll then change some samples in one of the datasets and monitor how the 'similarity score' changes. Ideally, the algorithm detects the samples I changed. If this goes well, I'll play around with datasets of different sizes.
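
To make the plan concrete, here is a minimal sketch of the idea. It is not the final implementation: it assumes every sample is already available as raw bytes, and the names hash_samples and similarity_score are placeholders I made up for illustration. It also assumes one particular definition of the score (matching hashes divided by the size of the larger dataset, ignoring sample order), which may well change once I start experimenting.

```python
import hashlib
from collections import Counter
from typing import Iterable


def hash_samples(samples: Iterable[bytes]) -> Counter:
    """Return a multiset of md5 hex digests, one digest per sample."""
    # Only the digests are kept, so the samples themselves can be streamed.
    return Counter(hashlib.md5(sample).hexdigest() for sample in samples)


def similarity_score(samples_a: Iterable[bytes], samples_b: Iterable[bytes]) -> float:
    """Percentage of sample hashes shared by both datasets.

    Assumed definition: matched hashes / size of the larger dataset.
    Sample order is ignored.
    """
    hashes_a = hash_samples(samples_a)
    hashes_b = hash_samples(samples_b)
    total = max(sum(hashes_a.values()), sum(hashes_b.values()))
    if total == 0:
        return 100.0  # two empty datasets are treated as identical
    # Multiset intersection counts each duplicated sample at most as often
    # as it appears in both datasets.
    matched = sum((hashes_a & hashes_b).values())
    return 100.0 * matched / total


# Toy "datasets": four samples each, with one sample changed in the second.
original = [b"sample-0", b"sample-1", b"sample-2", b"sample-3"]
modified = [b"sample-0", b"sample-1", b"changed!", b"sample-3"]

print(similarity_score(original, original))  # 100.0
print(similarity_score(original, modified))  # 75.0
```

Because only exact byte-for-byte matches count, a single changed pixel makes a sample look completely different, which is fine for detecting that a dataset has changed but does not measure how similar two samples are.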

Did I get stuck anywhere?
The past week was spent learning more about Hub and the community, so I haven't been stuck anywhere yet.