What did I do this week?
Last week was the 'Community Bonding Period'. I met my mentors, Vinn and Abhinav, discussed my project in depth, and resolved the concerns I had with my proposal. I had proposed to implement Locality-Sensitive Hashing (LSH) for comparing datasets, an algorithmic technique that hashes similar input items into the same "buckets" with high probability. LSH works great for finding similar images or text documents, but for simply checking whether a dataset has changed, md5-hashing the entire dataset could work fine. So, for the next few weeks, I'll work on implementing a sample-wise hashing algorithm for large datasets. The two most important factors are the time complexity of the algorithm (since some datasets can be petabytes in size) and its accuracy (how much similarity can be detected). A couple of interesting links I found during my research:
- Deduplication using Spark's MLlib
- How does one check huge files identity if hashing is CPU bound?
- Comparison between two big directories
- md5deep and hashdeep
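As a rough illustration of the whole-dataset md5 approach mentioned above, here's a minimal sketch (the chunked reading and the `md5_of_file` helper are my own illustration, not Hub's API), which avoids loading a huge file into memory at once:

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Compute the md5 digest of a file, reading it in 1 MB chunks
    so arbitrarily large files never need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

If two datasets serialize to the same bytes, their digests match; any single changed byte produces a completely different digest, which is exactly why this works for "did it change?" but says nothing about *how much* changed.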
I've also been uploading datasets to Hub from Kaggle before Hub 2.0's official release on June 19th. This has been a great way to get familiar with how Hub stores datasets. Check them out here!
What will I do next week?
This week I'll focus on implementing sample-wise hashing for datasets. A unique md5 hash will be generated for each sample in a dataset, and the dataset being compared against will be hashed the same way. The hashes from the two datasets are then compared, and a 'similarity score' is computed based on how many of them match. For testing, I'll start off with two identical datasets, for which the algorithm should report a similarity score of 100%. Then I'll change some samples in one of the datasets and monitor how the score changes; ideally, the algorithm detects exactly the images I changed. If this goes well, I'll play around with datasets of different sizes.
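The plan above can be sketched in a few lines of Python. This is just how I imagine it working, with samples represented as raw bytes and a position-wise comparison (both assumptions on my part, not a final design):

```python
import hashlib

def sample_hashes(samples):
    """Hash each sample (as bytes) individually with md5."""
    return [hashlib.md5(s).hexdigest() for s in samples]

def similarity_score(hashes_a, hashes_b):
    """Percentage of positions where the per-sample hashes match.
    Dividing by the longer list penalizes missing samples too."""
    matches = sum(a == b for a, b in zip(hashes_a, hashes_b))
    return 100.0 * matches / max(len(hashes_a), len(hashes_b))

# Two identical datasets should score 100%.
a = [b"img0", b"img1", b"img2", b"img3"]
b = [b"img0", b"img1", b"img2", b"img3"]
assert similarity_score(sample_hashes(a), sample_hashes(b)) == 100.0

# Changing one of four samples drops the score to 75%,
# and the mismatched position tells us which sample changed.
b[2] = b"img2-modified"
assert similarity_score(sample_hashes(a), sample_hashes(b)) == 75.0
```

A nice property of the sample-wise approach over hashing the whole dataset is that the positions of the mismatched hashes identify which samples changed, not just that something changed.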
Did I get stuck anywhere?
The past week was spent learning more about Hub and the community, so I haven't been stuck anywhere.