Introduction
rahulbshrestha
Published: 06/20/2021
Hi Python community members! My name is Rahul and I'm an incoming Master's student in Informatics at TU Munich. I'm stoked to have been accepted to GSoC with Activeloop!
The problem I'll be working on this GSoC is interesting and challenging. Datasets are often modified, and there is no efficient way to check whether two datasets are identical. This gets worse for large datasets that don't fit in memory (1 TB+). My project aims to design a hashing technique for comparing such large-scale, out-of-core machine learning datasets.
GSoC Weekly Check-In #2 (June 14)
rahulbshrestha
Published: 06/15/2021
What did I do this week?
I started off by writing some Python code that compares two datasets and prints out a similarity score (in %). The steps are: generate an md5 hash for each file, append the hash and the file's location to a list, then compare this list with the other dataset's hash list. If both datasets are the same, the similarity score is 100%. I tested it out with several cases, for example renaming, deleting, and duplicating files. The similarity score changed correctly every time, and the differing files were pointed out.
My implementation:
https://github.com/rahulbshrestha/hash-dataset
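Here is a minimal sketch of that approach; the function names and the Jaccard-style score are my own shorthand, so see the repo above for the actual implementation:

```python
import hashlib
from pathlib import Path

def md5_of_file(path, chunk_size=1 << 20):
    """md5 of a file, read in 1 MiB chunks so large files never sit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def hash_list(root):
    """Set of (relative path, md5) pairs for every file under root."""
    return {(str(p.relative_to(root)), md5_of_file(p))
            for p in Path(root).rglob("*") if p.is_file()}

def similarity(root_a, root_b):
    """Percentage of (path, hash) pairs the two datasets share."""
    a, b = hash_list(root_a), hash_list(root_b)
    return 100.0 * len(a & b) / len(a | b) if (a | b) else 100.0

# Identical trees score 100%; renamed, deleted, or duplicated files lower the
# score, and the symmetric difference of the two sets points at the files
# that differ.
print(f"{similarity('dataset_a', 'dataset_b'):.1f}%")
```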
What will I do next week?
I'll be experimenting with larger datasets and monitoring the execution time. I'd also like to implement splitting a hash list into smaller subsets and checking whether a subset exists in another hash list. This could save time because, when a user uploads a new dataset, not every file would have to be hashed and compared.
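One way that subset check could look, sketched under my own assumptions (the `probably_contains` name and the sampling scheme are hypothetical):

```python
import random

def probably_contains(superset_hashes, candidate_hashes, sample_size=100, seed=0):
    """Test whether candidate_hashes is (probably) contained in superset_hashes
    by checking only a random sample of its entries. A single miss proves
    non-containment; all hits only make containment likely, with the error
    probability shrinking as sample_size grows."""
    rng = random.Random(seed)
    picks = rng.sample(sorted(candidate_hashes),
                       min(sample_size, len(candidate_hashes)))
    return all(h in superset_hashes for h in picks)

old = {f"h{i}" for i in range(100_000)}  # hashes already stored
new = old | {"h_extra"}                  # an upload adds one new sample
print(probably_contains(new, old))       # True: every old hash is in new
```

The reverse direction, `probably_contains(old, new)`, can return a false positive whenever the extra hash isn't sampled, which is exactly the speed-versus-accuracy trade-off to tune.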
Did I get stuck anywhere?
I learned that some Python libraries are rather inefficient in terms of speed, so I had to rewrite some parts of my code. The execution time adds up when working with large datasets.
GSoC Weekly Check-In #1 (June 7)
rahulbshrestha
Published: 06/07/2021
What did I do this week?
Last week was the 'Community Bonding Period'. I met my mentors, Vinn and Abhinav, discussed my project in depth, and resolved some concerns I had with my proposal. I had proposed implementing Locality Sensitive Hashing (LSH) for comparing datasets, an algorithmic technique that hashes similar input items into the same "buckets" with high probability. LSH works great for finding similar images or text documents, but for checking whether a dataset has changed at all, md5-hashing the entire dataset could work fine. So, for the next few weeks, I'll work on implementing a sample-wise hashing algorithm for large datasets. The two most important factors are the algorithm's time complexity (since some datasets can be petabytes large) and its accuracy (how much similarity can be detected). I also came across a couple of interesting links during my research.
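To give a feel for what LSH looks like in practice, here is a toy MinHash sketch; it's purely illustrative and not part of the project code. It estimates the Jaccard similarity of two sample sets from short signatures:

```python
import hashlib

def _hash64(token, seed):
    """Deterministic 64-bit hash of a token under a given seed."""
    digest = hashlib.md5(f"{seed}:{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def minhash_signature(items, num_perm=64):
    """One slot per simulated permutation: the minimum hash over all items."""
    return [min(_hash64(item, seed) for item in items) for seed in range(num_perm)]

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching slots estimates the true Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = {"img1.png", "img2.png", "img3.png"}
b = {"img1.png", "img2.png", "img4.png"}
print(estimated_jaccard(minhash_signature(a), minhash_signature(b)))  # roughly 0.5
```

Similar sets collide in most slots, which is what lets LSH bucket near-duplicates; for an exact "has anything changed?" check, though, plain md5 per sample is simpler and exact.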
I've also been uploading datasets to Hub from Kaggle before Hub 2.0's official release on June 19th. This has been a great way to get familiar with how Hub stores datasets. Check them out here!
What will I do next week?
This week I'll focus on implementing sample-wise hashing for datasets. A unique md5 hash will be generated for each sample in a dataset, and the dataset it is compared against will be hashed the same way. The hashes from the two datasets are then compared, and a 'similarity score' is produced depending on how many hashes match. For testing, I'll start off with two identical datasets, where the similarity score should be 100%. I'll then change some samples in one of the datasets and monitor how the 'similarity score' changes; ideally, the algorithm detects exactly the samples I changed. If this goes well, I'll play around with datasets of different sizes.
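In rough Python, that test plan could look like the following; the paths and helper names are made up for illustration:

```python
import hashlib
import shutil
from pathlib import Path

def sample_hashes(root):
    """md5 digest of every file (sample) under root, keyed by relative path."""
    return {str(p.relative_to(root)): hashlib.md5(p.read_bytes()).hexdigest()
            for p in Path(root).rglob("*") if p.is_file()}

shutil.copytree("mnist_a", "mnist_b")        # start from two identical copies
a, b = sample_hashes("mnist_a"), sample_hashes("mnist_b")
assert a == b                                # similarity score: 100%

changed = [p for p in Path("mnist_b").rglob("*") if p.is_file()][:5]
for p in changed:
    p.write_bytes(b"perturbed")              # modify five samples

b = sample_hashes("mnist_b")
flagged = {k for k in a if a[k] != b.get(k)}
print(f"similarity: {100 * (len(a) - len(flagged)) / len(a):.1f}%")
print("flagged samples:", sorted(flagged))   # ideally exactly the five changed
```

The flagged set should line up exactly with the perturbed files; doing the same for datasets that don't fit in memory is the interesting part.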
Did I get stuck anywhere?
The past week was spent on learning more about Hub and the community, so I haven't been stuck anywhere.