<b>Articles on rahulbshrestha's Blog</b> <br>
<a href="https://blogs.python-gsoc.org">https://blogs.python-gsoc.org</a> <br>
Updates on the articles published on rahulbshrestha's Blog
<br>
<br>
<b><a href="https://blogs.python-gsoc.org/en/rahulbshresthas-blog/gsoc-weekly-check-in-12-august-23/">GSoC Weekly Check-In #12 (August 23)</a></b> <br>
Posted Tue, 24 Aug 2021
<br>
<br>
GSoC has ended! I've written about my experience <a href="https://loving-king-cc0.notion.site/GSoC-Activeloop-c0e34506535f4aae9624aebf2ffa6f54">here</a>.
<br>
<br>
<b><a href="https://blogs.python-gsoc.org/en/rahulbshresthas-blog/gsoc-weekly-check-in-11-august-16/">GSoC Weekly Check-In #11 (August 16)</a></b> <br>
Posted Thu, 19 Aug 2021
<br>
<br>
<b>What did I do this week?</b>
<ul>
<li>Had my pull request reviewed again and made some structural changes to my code.</li>
<li>Added "hidden tensors". These are intended to separate tensors created by the user from tensors created internally, e.g. the hashes tensor.</li>
</ul>
<b>What will I do next week?</b>
<ul>
<li>Benchmark my branch against Hub's main branch.</li>
<li>Add tests for hashing samples with transforms.</li>
<li>Add a brute-force method to list all datasets on Activeloop cloud and load their hashes tensor (if it exists).</li>
<li>Improve documentation and tests.</li>
<li>Present my project at Activeloop (final GSoC presentation).</li>
</ul>
<b>Did I get stuck anywhere?</b> <br>
Nope. Everything was pretty straightforward.
<br>
<br>
<b><a href="https://blogs.python-gsoc.org/en/rahulbshresthas-blog/gsoc-weekly-check-in-10-august-09/">GSoC Weekly Check-In #10 (August 09)</a></b> <br>
Posted Wed, 11 Aug 2021
<br>
<br>
<b>What did I do this week?</b>
<ul>
<li>Implemented "linked tensors". If a source tensor is linked to a destination tensor, any append to the source tensor is also applied to the destination tensor.</li>
<li>Implemented storing hashes as a separate tensor. For example, with <i>create_tensor(images, hash_samples=True)</i>, any sample appended to <i>images</i> is hashed and the hash is appended to a separate "hashes" tensor. A rough sketch of this mechanism follows right after this list.</li>
</ul>
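Here is a rough, self-contained sketch of the linked-tensor idea. The <i>Tensor</i> class and its methods below are toy stand-ins rather than Hub's actual implementation; only the mmh3 (murmurhash3) dependency matches what the project really uses.
<pre><code>
# Toy sketch of "linked tensors": appending to a source tensor also
# appends a (possibly transformed) copy of the sample to its linked tensors.
# Not Hub's implementation; class and method names are hypothetical.
import mmh3  # murmurhash3, the hash function used in this project


class Tensor:
    def __init__(self, name):
        self.name = name
        self.samples = []
        self.linked = []  # (destination tensor, transform) pairs

    def link(self, destination, transform=None):
        # Every future append to this tensor is mirrored on `destination`,
        # optionally passed through `transform` first (e.g. hashing).
        self.linked.append((destination, transform))

    def append(self, sample):
        self.samples.append(sample)
        for destination, transform in self.linked:
            destination.append(transform(sample) if transform else sample)


def hash_sample(sample):
    # 128-bit murmurhash3 digest of the raw sample bytes, hex-encoded
    return hex(mmh3.hash128(bytes(sample)))


images = Tensor("images")
hashes = Tensor("hashes")            # the internal, "hidden" hashes tensor
images.link(hashes, transform=hash_sample)

images.append(b"\x00\x01\x02")       # the hash lands in `hashes` automatically
print(hashes.samples)
</code></pre>
<br>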
<b>What will I do next week?</b>
<ul>
<li>Write test cases and documentation for my code.</li>
<li>Review any changes requested on my pull request.</li>
<li>Add a feature to account for compression type when hashing samples.</li>
</ul>
<b>Did I get stuck anywhere?</b> <br>
I faced a strange bug: data appended to one tensor's meta file was also being copied into every other tensor's meta file. I made a temporary fix but am still looking into what caused it.
<br>
<br>
<b><a href="https://blogs.python-gsoc.org/en/rahulbshresthas-blog/gsoc-weekly-check-in-9-august-02/">GSoC Weekly Check-In #9 (August 02)</a></b> <br>
Posted Wed, 04 Aug 2021
<br>
<br>
<b>What did I do this week?</b>
<ul>
<li>Made a <a href="https://github.com/activeloopai/Hub/pull/1092">pull request</a> to Hub's main branch. Received feedback from Abhinav and Dyllan and made the requested adjustments.</li>
<li>Wrote test cases and fixed comments.</li>
<li>Dyllan <a href="https://github.com/activeloopai/Hub/pull/1092#discussion_r681210268">proposed</a> switching my current approach of storing hashes in a meta file called "hashlist.json" to <a href="https://loving-king-cc0.notion.site/Linked-tensors-f42ca7352dad414da2fd1aa773e34065">"linked tensors"</a>: every time a sample is appended to a tensor, its linked tensors also get a sample appended. This lets us treat the hashes as a separate tensor (and gain the other tensor features as well). I've created a basic framework but will need to run it by members of the team.</li>
</ul>
<b>What will I do next week?</b>
<ul>
<li>Changing the architecture will take most of my time this week. I need to implement linked tensors and sort out the details. One advantage is that I won't need to adapt the hashes to transforms and version control separately, since the hashes will be treated as a tensor.</li>
</ul>
<b>Did I get stuck anywhere?</b> <br>
This was my first ever PR to Hub, so I had to learn about the style guide, code coverage tools, CI tests, etc. I had some trouble running the CI tests after making my PR: one of the libraries I was using, mmh3 (murmurhash3), hadn't been listed in the requirements.txt file.
<br>
<br>
<b><a href="https://blogs.python-gsoc.org/en/rahulbshresthas-blog/gsoc-weekly-check-in-8-july-26/">GSoC Weekly Check-In #8 (July 26)</a></b> <br>
Posted Tue, 27 Jul 2021
<br>
<br>
<b>What did I do this week?</b>
<ul>
<li>Merged my local forked repo with the main branch used by Hub. I had to make some fixes because the architecture had changed.</li>
<li>Added <i>hub.compare(dataset-1, dataset-2)</i>, which compares the hashlists of two datasets using the <a href="https://en.wikipedia.org/wiki/Jaccard_index">Jaccard similarity</a>. A sketch of the underlying computation follows right after this list.</li>
<li>Fixed the way samples are hashed to account for compression. Previously, samples were hashed after compression, so the compressed and uncompressed versions of the same sample produced two different hashes.</li>
</ul>
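The comparison behind a call like <i>hub.compare(dataset-1, dataset-2)</i> boils down to a Jaccard index over the two hashlists. Here is a minimal sketch of that computation, assuming each dataset's hashes are already available as a list of strings; it illustrates the metric rather than Hub's code.
<pre><code>
# Jaccard similarity of two hashlists: size of the intersection divided
# by the size of the union, reported as a percentage.

def jaccard_similarity(hashes_a, hashes_b):
    set_a, set_b = set(hashes_a), set(hashes_b)
    if not set_a and not set_b:
        return 100.0  # two empty datasets are trivially identical
    overlap = len(set_a.intersection(set_b))
    union = len(set_a.union(set_b))
    return 100.0 * overlap / union


hashes_1 = ["d2asdsdf", "asd223gk", "9f31ab02"]
hashes_2 = ["d2asdsdf", "asd223gk", "77cc01ee"]
print(f"similarity: {jaccard_similarity(hashes_1, hashes_2):.1f}%")  # 50.0%
</code></pre>
<br>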
<b>What will I do next week?</b>
<ul>
<li>Use transforms to generate hashes in a distributed manner.</li>
<li>Complete last week's task of comparing the hashes of datasets being uploaded against datasets already on Hub.</li>
<li>Write test cases and document the code.</li>
<li>Optimize the size of hashlists. Unnecessary quotation marks are currently stored with the hashes, e.g. hashes = ["d2asdsdf", "asd223gk"]; removing them will save space.</li>
</ul>
<b>Did I get stuck anywhere?</b> <br>
As mentioned above, merging with the main branch meant I had to make some fixes to my code, which took time.
<br>
<br>
<b><a href="https://blogs.python-gsoc.org/en/rahulbshresthas-blog/gsoc-weekly-check-in-7-july-19/">GSoC Weekly Check-In #7 (July 19)</a></b> <br>
Posted Mon, 19 Jul 2021
<br>
<br>
<b>What did I do this week?</b> <br>
This week, I started working with the Hub 2.0 codebase. I've implemented hashing the samples in a dataset using <a href="https://pypi.org/project/mmh3/">murmurhash3</a>. Depending on which tensor is selected, the hashes are generated and stored as a JSON file inside the Hub dataset; a sketch of this step is shown below.
<br>
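A minimal sketch of that hashing step follows: one murmurhash3 digest per sample, collected into a JSON hashlist. The file name and key names here are illustrative and not the exact format stored inside a Hub dataset.
<pre><code>
# Hash every sample of a tensor with murmurhash3 and dump the digests
# to a JSON "hashlist". Layout and key names are illustrative only.
import json

import mmh3
import numpy as np


def hash_samples(samples):
    # One 128-bit murmurhash3 digest per sample, computed on the raw bytes
    return [hex(mmh3.hash128(np.asarray(s).tobytes())) for s in samples]


samples = [np.zeros((4, 4), dtype=np.uint8), np.ones((4, 4), dtype=np.uint8)]
hashlist = {"tensor": "images", "hashes": hash_samples(samples)}

with open("hashlist.json", "w") as f:
    json.dump(hashlist, f)
</code></pre>
<br>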
<b>What will I do next week?</b> <br>
Next week, I'll implement a way to compare the hashlist generated for the dataset being loaded against the hashlists in Hub's cloud storage. This will prevent dataset duplication: Hub users will know if the dataset they're uploading already exists.
<br>
<br>
<b>Did I get stuck anywhere?</b> <br>
I had trouble figuring out how caching works in Hub. A call with my mentor (Abhinav) cleared everything up.
<br>
<br>
<b><a href="https://blogs.python-gsoc.org/en/rahulbshresthas-blog/gsoc-weekly-check-in-6-july-12/">GSoC Weekly Check-In #6 (July 12)</a></b> <br>
Posted Tue, 13 Jul 2021
<br>
<br>
<b>What did I do this week?</b> <br>
This week, I gave my mid-GSoC presentation, where I talked about my progress so far contributing to Hub. I also started working with the Hub 2.0 codebase by familiarising myself with its chunking and compression mechanism.
<br>
<br>
<b>What will I do next week?</b> <br>
This week, I'll start working on a fork of the Hub main branch. I will generate hashes while a dataset is being uploaded and use those hashes to compare it with existing datasets. This will be an opt-in mechanism, i.e. the user chooses whether to generate hashes. After that, the eventual goal is to inform the user if a similar dataset already exists, to prevent re-uploads.
<br>
<br>
<b>Did I get stuck anywhere?</b> <br>
No.
<br>
<br>
<b><a href="https://blogs.python-gsoc.org/en/rahulbshresthas-blog/gsoc-weekly-check-in-5-july-5/">GSoC Weekly Check-In #5 (July 5)</a></b> <br>
Posted Wed, 07 Jul 2021
<br>
<br>
<b>What did I do this week?</b> <br>
This week, I benchmarked the performance of different hashing algorithms on datasets of different sizes. Based on the benchmark, I've concluded that the best hashing algorithms to use are murmurhash3 or xxhash.
<br>
<br>
<b>What will I do next week?</b> <br>
This week, I'll start integrating my algorithm into Hub. I'll also look into whether Bloom filters are a viable option for dealing with larger datasets.
<br>
<br>
<b>Did I get stuck anywhere?</b> <br>
No, this week's task was straightforward. One thing I did find problematic is that unzipping datasets with more than 1 million files takes 3-4 hours.
<br>
<br>
<b><a href="https://blogs.python-gsoc.org/en/rahulbshresthas-blog/gsoc-weekly-check-in-4-june-28/">GSoC Weekly Check-In #4 (June 28)</a></b> <br>
Posted Tue, 29 Jun 2021
<br>
<br>
<b>What did I do this week?</b> <br>
This week, I set up an environment for experimenting with large datasets using an Amazon S3 bucket and an EC2 instance. Previously, I was testing my code on my local machine, and with only 10 GB of free disk space it wasn't possible to work with TB/PB-sized datasets.
<br>
<br>
<b>What will I do next week?</b>
<ul>
<li>Benchmark the performance of the implemented hashing technique on datasets of different sizes.</li>
<li>Check whether separating hashes by their first three digits spreads them evenly across buckets.</li>
<li>Check the probability of hash collisions for several hashing algorithms (md5, sha-256, murmurhash, etc.). We want to avoid the false positives that collisions could cause in large datasets. Also check whether summing up hashes is a viable quick way to tell whether two datasets are identical.</li>
</ul>
<b>Did I get stuck anywhere?</b> <br>
I was having difficulty working with Boto3, the AWS software development kit for Python. I had to go through the documentation extensively to resolve my problems.
<br>
<br>
<b><a href="https://blogs.python-gsoc.org/en/rahulbshresthas-blog/gsoc-weekly-check-in-3-june-21/">GSoC Weekly Check-In #3 (June 21)</a></b> <br>
Posted Tue, 22 Jun 2021
<br>
<br>
<b>What did I do this week?</b> <br>
I implemented separating hashed samples into buckets. This enables a faster search, since not every hash has to be checked during a comparison. A small sketch of this prefix-bucketing idea is shown below.
<br>
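Here is a small sketch of the bucketing idea: hashes are grouped by a short prefix (their first three characters), so a lookup only scans one bucket instead of the whole list, and the bucket sizes can be inspected to see how evenly the hashes spread. This is purely illustrative; the prefix length is the knob mentioned in the check-ins above.
<pre><code>
# Prefix bucketing of hashes: group by the first three characters so a
# membership check only touches one bucket. Illustrative sketch only.
from collections import defaultdict


def build_buckets(hashes, prefix_len=3):
    buckets = defaultdict(set)
    for h in hashes:
        buckets[h[:prefix_len]].add(h)
    return buckets


def contains(buckets, h, prefix_len=3):
    # Only the bucket matching the hash prefix is searched.
    return h in buckets.get(h[:prefix_len], set())


existing = build_buckets(["1a2b3c", "1a9f00", "ff0102"])
print(contains(existing, "1a2b3c"))   # True
print(contains(existing, "abcdef"))   # False

# How evenly are the hashes spread across the buckets?
print({prefix: len(bucket) for prefix, bucket in existing.items()})
</code></pre>
<br>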
<b>What will I do next week?</b>
<ul>
<li>Understand how Hub stores datasets. I don't fully understand how Hub compresses datasets and how the chunking occurs; I intend to learn about this.</li>
<li>Learn how to work with an Amazon S3 bucket and EC2 instance. I haven't used these before, but I'll need them when working with datasets larger than 1 TB.</li>
<li>Replace the hash list with a hash table or a Merkle tree. Right now, I'm storing hashes in a list, which doesn't seem optimal; a hash table or Merkle tree would make it quicker to find hashes during lookup.</li>
</ul>
<b>Did I get stuck anywhere?</b> <br>
Separating hashes into buckets was a bit more complicated than I thought. The hashes aren't always split evenly among the different buckets.
<br>
<br>
<b><a href="https://blogs.python-gsoc.org/en/rahulbshresthas-blog/introduction/">Introduction</a></b> <br>
Posted Sun, 20 Jun 2021
<br>
<br>
Hi Python community members! My name is Rahul and I'm an incoming Master's student in Informatics at TU Munich. I am stoked about being accepted to GSoC with Activeloop!
<br>
<br>
The problem I'll be working on this GSoC is interesting and challenging. Datasets are often modified, and there is no efficient way to check whether two datasets are identical. This becomes worse for large datasets that don't fit in memory (1 TB+). My project aims to design a hashing technique to compare such large-scale, out-of-core machine learning datasets.
<br>
<br>
<b><a href="https://blogs.python-gsoc.org/en/rahulbshresthas-blog/gsoc-weekly-check-in-2-june-14/">GSoC Weekly Check-In #2 (June 14)</a></b> <br>
Posted Tue, 15 Jun 2021
<br>
<br>
<b>What did I do this week?</b> <br>
I started off by writing some Python code that compares two datasets and prints out a similarity score (in %). The steps are: generate an md5 hash for each file, append the hash and directory location to a list, and compare this list with other hash lists. If both datasets are the same, the similarity score is 100%. I tested it with several cases, for example renaming, deleting and duplicating files; the similarity score changed every time and the differing files were pointed out. My implementation: <a href="https://github.com/rahulbshrestha/hash-dataset">https://github.com/rahulbshrestha/hash-dataset</a>
<br>
<br>
<b>What will I do next week?</b> <br>
I'll be experimenting with larger datasets and monitoring the execution time. I would also like to implement splitting a hash list into smaller subsets and checking whether a subset exists in another hashlist. This could save time, since not every file would have to be compared when a user uploads a new dataset.
<br>
<br>
<b>Did I get stuck anywhere?</b> <br>
I learned that some Python libraries are rather inefficient in terms of speed, so I had to rewrite some parts of my code. The execution time adds up when working with large datasets.
<br>
<br>
<b><a href="https://blogs.python-gsoc.org/en/rahulbshresthas-blog/gsoc-weekly-check-in-1-june-7/">GSoC Weekly Check-In #1 (June 7)</a></b> <br>
Posted Mon, 07 Jun 2021
<br>
<br>
<b>What did I do this week?</b> <br>
Last week was the 'Community Bonding Period'. I met my mentors, Vinn and Abhinav, discussed my project in depth and resolved the concerns I had with my proposal. I had proposed to implement Locality Sensitive Hashing (LSH) for comparing datasets, an algorithmic technique that hashes similar input items into the same "buckets" with high probability. LSH works great for finding similar images or text documents, but for checking whether a dataset has changed, md5-hashing the entire dataset could work fine. So, for the next few weeks, I'll work on implementing a sample-wise hashing algorithm for large datasets. The two most important factors are the time complexity of the algorithm (since some datasets can be petabytes large) and the accuracy (how much similarity can be detected). A couple of interesting links I found during my research:
<ul>
<li><a href="https://towardsdatascience.com/deduplication-using-sparks-mllib-4a08f65e5ab9">Deduplication using Spark's MLlib</a></li>
<li><a href="https://serverfault.com/a/787685">How does one check huge files identity if hashing is CPU bound?</a></li>
<li><a href="https://stackoverflow.com/questions/606739/comparison-between-two-big-directories">Comparison between two big directories</a></li>
<li><a href="http://md5deep.sourceforge.net/">md5deep and hashdeep</a></li>
</ul>
I've also been uploading datasets to Hub from Kaggle ahead of Hub 2.0's official release on June 19th. This has been a great way to get familiar with how Hub stores datasets. Check them out <a href="https://app.activeloop.ai/">here</a>!
<br>
<br>
<b>What will I do next week?</b> <br>
This week I'll focus on implementing sample-wise hashing for datasets: a unique md5 hash is generated for each sample in a dataset, the dataset it is compared against is hashed the same way, and the two sets of hashes are compared to produce a 'similarity score' based on how many hashes match. For testing, I'll start off with two identical datasets, where the similarity score should be 100%. I'll then change some samples in one of the datasets and monitor how the similarity score changes; ideally, the algorithm detects the samples I changed. If this goes well, I'll play around with datasets of different sizes. A condensed sketch of this per-sample hashing and scoring approach is included at the end of this post.
<br>
<br>
<b>Did I get stuck anywhere?</b> <br>
The past week was spent learning more about Hub and the community, so I haven't been stuck anywhere.
<br>
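To close, here is a condensed sketch of the per-file md5 hashing and similarity scoring described in the last two check-ins. It is not the exact code from the hash-dataset repository linked above, and the directory names are made up; it just captures the core idea.
<pre><code>
# md5-hash every file under a dataset directory, then score two datasets
# by the fraction of hashes they share. Condensed illustration of the
# approach in the hash-dataset repo, not a copy of it.
import hashlib
from pathlib import Path


def hash_dataset(root):
    # Map md5 digest to relative path for every file under `root`.
    hashes = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            digest = hashlib.md5(path.read_bytes()).hexdigest()
            hashes[digest] = str(path.relative_to(root))
    return hashes


def similarity_score(hashes_a, hashes_b):
    # Percentage of dataset A's hashes that also appear in dataset B.
    if not hashes_a:
        return 0.0
    shared = set(hashes_a).intersection(hashes_b)
    return 100.0 * len(shared) / len(hashes_a)


ds1 = hash_dataset("dataset-1")   # hypothetical local directories
ds2 = hash_dataset("dataset-2")
print(f"similarity: {similarity_score(ds1, ds2):.1f}%")
</code></pre>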