GSoC Weekly Check-In #7 (July 19)
rahulbshrestha
Published: 07/19/2021
What did I do this week?
This week, I started working with the Hub 2.0 codebase. I implemented hashing of the samples in a dataset using murmurhash3. Depending on which tensor is selected, hashes are generated for its samples and stored as a JSON file inside the Hub dataset.
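Roughly, the idea looks like this (a minimal sketch, not the actual Hub code; it assumes the `mmh3` package, and the sample data and tensor name are made up for illustration):

```python
import json
import mmh3
import numpy as np

def hash_samples(samples, tensor_name="images"):
    """Hash each sample's raw bytes with murmurhash3 (128-bit)."""
    hashes = []
    for sample in samples:
        # Convert the sample to bytes first; murmurhash3 operates on a
        # byte string, not on numpy arrays directly.
        digest = mmh3.hash128(np.ascontiguousarray(sample).tobytes())
        hashes.append(hex(digest))
    return {"tensor": tensor_name, "hashes": hashes}

def store_hashes(hash_info, path="hashes.json"):
    """Persist the hash list as a JSON file alongside the dataset."""
    with open(path, "w") as f:
        json.dump(hash_info, f)

# Example: hash a handful of random "image" samples
samples = [np.random.randint(0, 255, (28, 28), dtype=np.uint8) for _ in range(3)]
store_hashes(hash_samples(samples))
```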
What will I do next week?
Next week, I’ll implement a way to compare the hash list generated for the dataset being loaded against the hash lists already in Hub’s cloud storage. This will prevent dataset duplication: Hub users will know if the dataset they’re uploading already exists.
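The comparison itself could be as simple as set intersection over the stored hash lists. A rough sketch (the file paths and the overlap threshold below are illustrative, not the planned Hub API):

```python
import json

def load_hashes(path):
    """Load a previously stored hash list (see the JSON layout above)."""
    with open(path) as f:
        return set(json.load(f)["hashes"])

def is_duplicate(local_path, remote_paths, threshold=1.0):
    """Flag the upload as a duplicate if the overlap with any stored
    hash list reaches the threshold (1.0 = identical sample set)."""
    local = load_hashes(local_path)
    for remote_path in remote_paths:
        remote = load_hashes(remote_path)
        overlap = len(local & remote) / max(len(local), 1)
        if overlap >= threshold:
            return True
    return False
```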
Did I get stuck anywhere?
I had trouble figuring out how caching works in Hub. A call with my mentor (Abhinav) cleared everything up.
GSoC Weekly Check-In #6 (July 12)
rahulbshrestha
Published: 07/13/2021
What did I do this week?
This week, I had my mid-GSoC presentation, where I talked about my progress so far contributing to Hub. I also started working with the Hub 2.0 codebase by familiarising myself with its chunking and compression mechanism.
What will I do next week?
This week, I’ll start working on a fork of the Hub main branch. I will generate hashes when a dataset is being uploaded and use those hashes to compare against existing datasets. This will be an ‘opt-in’ mechanism, i.e. the user chooses whether or not to generate hashes. The eventual goal is to inform the user if a similar dataset already exists, to prevent re-uploads.
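As a sketch of what the opt-in flag might look like (the function name, flag, and upload flow below are hypothetical, not Hub’s actual API):

```python
import json
import mmh3

def upload_dataset(samples, generate_hashes=False, hash_path="hashes.json"):
    """Hypothetical upload entry point: hashing only runs when the
    caller opts in with generate_hashes=True."""
    if generate_hashes:
        hashes = [hex(mmh3.hash128(bytes(s))) for s in samples]
        with open(hash_path, "w") as f:
            json.dump({"hashes": hashes}, f)
        # At this point the hash list could be compared against the hash
        # lists of datasets that already exist in storage.
    # ...the normal upload path would continue here...

# The user opts in explicitly:
upload_dataset([b"sample-1", b"sample-2"], generate_hashes=True)
```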
Did I get stuck anywhere?
No.
GSoC Weekly Check-In #5 (July 5)
rahulbshrestha
Published: 07/07/2021
What did I do this week?
This week, I benchmarked the performance of different hashing algorithms across different dataset sizes. Based on the benchmark, I’ve concluded that the best hashing algorithms to use are murmurhash3 and xxhash.
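For reference, the benchmark boils down to timing each algorithm over the same set of samples. A simplified version (assuming the `mmh3` and `xxhash` packages; the sample sizes and counts are placeholders, not the ones I actually used):

```python
import hashlib
import os
import time

import mmh3
import xxhash

def bench(name, fn, samples):
    """Time one hashing function over all samples."""
    start = time.perf_counter()
    for s in samples:
        fn(s)
    elapsed = time.perf_counter() - start
    print(f"{name:10s} {elapsed:.3f}s for {len(samples)} samples")

# 10k fake "samples" of 64 KB each, standing in for dataset items
samples = [os.urandom(64 * 1024) for _ in range(10_000)]

bench("md5",      lambda s: hashlib.md5(s).hexdigest(), samples)
bench("sha256",   lambda s: hashlib.sha256(s).hexdigest(), samples)
bench("murmur3",  lambda s: mmh3.hash128(s), samples)
bench("xxhash64", lambda s: xxhash.xxh64(s).hexdigest(), samples)
```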
What will I do next week?
This week, I’ll start integrating my algorithm into Hub. I’ll also look into Bloom filters as a possible option for dealing with larger datasets.
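To illustrate the idea: a Bloom filter answers “have I possibly seen this hash before?” in constant memory, at the cost of occasional false positives. A minimal sketch (the size and number of seeds are arbitrary, and this is not an implementation from Hub):

```python
import mmh3

class BloomFilter:
    """Minimal Bloom filter: membership tests may give false positives
    but never false negatives, using a fixed-size bit array."""

    def __init__(self, size=10_000_000, num_hashes=7):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size // 8 + 1)

    def _positions(self, item):
        # Derive several bit positions from murmurhash3 with different seeds.
        return [mmh3.hash(item, seed) % self.size for seed in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))

bf = BloomFilter()
bf.add(b"sample-hash-1")
print(b"sample-hash-1" in bf)  # True
print(b"sample-hash-2" in bf)  # almost certainly False
```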
Did I get stuck anywhere?
No, this week’s task was straightforward. One thing I did find problematic, though, is that unzipping datasets with more than 1 million files takes 3-4 hours.
GSoC Weekly Check-In #4 (June 28)
rahulbshrestha
Published: 06/29/2021
What did I do this week?
This week, I set up an environment for experimenting with large datasets using an Amazon S3 bucket and an EC2 instance. Previously, I was testing my code on my local machine, but my hard drive only has 10 GB free, so it wasn’t possible to work with TB/PB-sized datasets.
What will I do next week?
- Benchmark the performance of the implemented hashing technique with datasets of different sizes.
- Monitor whether separating hashes by their first three digits spreads them evenly among the buckets.
- Check the probability of hash collisions for several hashing algorithms (md5, sha-256, murmurhash, etc.). We want to avoid false positives that could arise from collisions in large datasets. Also check whether summing up hashes is a viable quick way to tell if two datasets are identical (see the sketch after this list).
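For the hash-summing idea in the last point, a quick sketch (the sample data is made up; note that unrelated datasets could still collide on the sum, which is exactly the false-positive risk worth measuring):

```python
import mmh3

def dataset_fingerprint(samples):
    """Order-independent fingerprint: sum the per-sample murmurhash3
    values modulo 2**128. Identical sample sets give identical sums;
    different datasets could in principle collide, which is the
    false-positive risk mentioned above."""
    total = 0
    for s in samples:
        total = (total + mmh3.hash128(s)) % (1 << 128)
    return total

a = [b"cat.jpg", b"dog.jpg", b"bird.jpg"]
b = [b"dog.jpg", b"bird.jpg", b"cat.jpg"]  # same samples, different order
print(dataset_fingerprint(a) == dataset_fingerprint(b))  # True
```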
Did I get stuck anywhere?
I was having difficulty working with Boto3, the AWS software development kit for Python. I had to go through the documentation extensively to resolve my problems.
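For context, the Boto3 operations I needed were mostly simple S3 round trips, along these lines (the bucket name and object keys are placeholders):

```python
import boto3

s3 = boto3.client("s3")

# Upload a local dataset archive to the bucket.
s3.upload_file("dataset.zip", "my-gsoc-bucket", "datasets/dataset.zip")

# List what is already stored under the datasets/ prefix.
response = s3.list_objects_v2(Bucket="my-gsoc-bucket", Prefix="datasets/")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])

# Download the archive back onto the EC2 instance for processing.
s3.download_file("my-gsoc-bucket", "datasets/dataset.zip", "/tmp/dataset.zip")
```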
GSoC Weekly Check-In #3 (June 21)
rahulbshrestha
Published: 06/22/2021
What did I do this week?
I implemented separating hashed samples into separate buckets. This will enable a faster search, as not every hash needs to be checked during comparison.
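Roughly, the bucketing works like this (a simplified sketch; the sample data and prefix length are illustrative):

```python
from collections import defaultdict
import mmh3

def bucket_hashes(samples, prefix_len=3):
    """Group sample hashes by their first few hex digits so a lookup
    only scans one bucket instead of the full hash list."""
    buckets = defaultdict(list)
    for s in samples:
        h = format(mmh3.hash128(s), "032x")  # 128-bit hash as 32 hex digits
        buckets[h[:prefix_len]].append(h)
    return buckets

def contains(buckets, sample, prefix_len=3):
    """Check a single sample against only its own bucket."""
    h = format(mmh3.hash128(sample), "032x")
    return h in buckets.get(h[:prefix_len], [])

buckets = bucket_hashes([b"cat.jpg", b"dog.jpg", b"bird.jpg"])
print(contains(buckets, b"dog.jpg"))   # True
print(contains(buckets, b"fish.jpg"))  # False
```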
What will I do next week?
- Understand how Hub stores datasets. I don’t fully understand how Hub compresses datasets or how the chunking occurs, and I intend to learn about this.
- Learn how to work with an Amazon S3 bucket and EC2 instance. I haven’t used these before, but I’ll need them when working with >1 TB datasets.
- Replace the hash list with a hash table or Merkle tree. Right now, I’m storing hashes in a list, which doesn’t seem optimal. I’ll try to implement a hash table or a Merkle tree to store these hashes so they are quicker to find during lookup (see the sketch after this list).
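As a starting point for the Merkle tree idea in the last item, here is a minimal sketch of building a root hash from per-sample hashes (not an implementation from Hub; the sample data is made up):

```python
import hashlib

def merkle_root(leaf_hashes):
    """Build a Merkle tree bottom-up and return its root hash.
    Two datasets with the same samples in the same order share a root,
    so a single comparison can establish equality."""
    if not leaf_hashes:
        return hashlib.sha256(b"").hexdigest()
    level = list(leaf_hashes)
    while len(level) > 1:
        if len(level) % 2 == 1:  # duplicate the last node on odd levels
            level.append(level[-1])
        level = [
            hashlib.sha256((level[i] + level[i + 1]).encode()).hexdigest()
            for i in range(0, len(level), 2)
        ]
    return level[0]

leaves = [hashlib.sha256(s).hexdigest() for s in [b"cat.jpg", b"dog.jpg", b"bird.jpg"]]
print(merkle_root(leaves))
```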
Did I get stuck anywhere?
Separating hashes into buckets was a bit more complicated than I thought. The hashes aren’t always split evenly among different buckets.