GSoC has ended! I've written about my experience here
rahulbshrestha's Blog
What did I do this week?
- Had my pull request reviewed again and made some structural changes to my code.
- Added “hidden tensors”. These separate tensors created by the user from those created internally, e.g. the hashes tensor.
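To illustrate the idea of hidden tensors, here is a minimal sketch. It is a toy in-memory model, not Hub's actual implementation: the `Dataset` class, the `hidden` flag, and the `include_hidden` parameter are all names assumed for illustration.

```python
class Dataset:
    """Toy dataset that tracks which tensors are internal (hypothetical, not Hub's API)."""

    def __init__(self):
        self._tensors = {}  # name -> {"samples": list, "hidden": bool}

    def create_tensor(self, name, hidden=False):
        # `hidden=True` marks a tensor as created internally rather than by the user.
        self._tensors[name] = {"samples": [], "hidden": hidden}

    def tensors(self, include_hidden=False):
        # By default only user-created tensors are listed.
        return [
            name
            for name, tensor in self._tensors.items()
            if include_hidden or not tensor["hidden"]
        ]


ds = Dataset()
ds.create_tensor("images")               # created by the user
ds.create_tensor("hashes", hidden=True)  # created internally

user_view = ds.tensors()
full_view = ds.tensors(include_hidden=True)
```

In this sketch the user only sees `images` unless hidden tensors are explicitly requested.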
What will I do next week?
- Benchmark my branch with Hub’s main branch.
- Add tests for hashing samples with transforms
- Add a brute force method to list all datasets on Activeloop cloud and load their hashes tensor (if it exists)
- Improve documentation and tests
- Present my project at Activeloop (Final GSoC presentation)
Did I get stuck anywhere?
Nope. Everything was pretty straightforward.
What did I do this week?
- Implemented “linked tensors”. If a source tensor is linked to a destination tensor, any append to the source tensor is also applied to the destination tensor.
- Implemented storing hashes as a separate tensor. For example, with create_tensor(images, hash_samples=True), any sample appended to images is hashed and its hash is appended to a separate “hashes” tensor.
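The two items above can be sketched roughly as follows. This is a simplified in-memory model under stated assumptions, not the code from my pull request: the `Tensor` class, the `link_to` helper, and the standalone `create_tensor` function are hypothetical, and `hashlib` stands in for mmh3 so the example needs only the standard library.

```python
import hashlib


class Tensor:
    """Toy in-memory tensor: a named list of samples (not Hub's real class)."""

    def __init__(self, name):
        self.name = name
        self.samples = []
        self._links = []  # (destination tensor, transform) pairs

    def link_to(self, destination, transform):
        """Hypothetical helper: mirror every append into `destination`."""
        self._links.append((destination, transform))

    def append(self, sample):
        self.samples.append(sample)
        # Any append to the source is also applied to each linked tensor.
        for destination, transform in self._links:
            destination.append(transform(sample))


def hash_sample(sample: bytes) -> str:
    # Stand-in hash: the project used mmh3; hashlib keeps this sketch stdlib-only.
    return hashlib.sha1(sample).hexdigest()


def create_tensor(tensors, name, hash_samples=False):
    """Toy counterpart of create_tensor(..., hash_samples=True)."""
    tensor = Tensor(name)
    tensors[name] = tensor
    if hash_samples:
        hashes = tensors.setdefault("hashes", Tensor("hashes"))
        tensor.link_to(hashes, hash_sample)
    return tensor


tensors = {}
images = create_tensor(tensors, "images", hash_samples=True)
images.append(b"first image bytes")
images.append(b"second image bytes")
```

After the two appends, the “hashes” tensor in this sketch holds one hash per image, without the caller ever touching it directly.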
What will I do next week?
- Write test cases and documentation for my code.
- Review any changes requested on my pull request
- Add a feature to account for compression type when hashing samples
Did I get stuck anywhere?
I faced a strange bug: data appended to one tensor’s meta file would also be copied to every other tensor’s meta file. I applied a temporary fix but am still looking into the cause.
What did I do this week?
- Made a pull request to Hub’s main branch. Received some feedback from Abhinav and Dyllan, and made adjustments accordingly.
- Wrote test cases and fixed comments.
- Dyllan proposed switching from my current approach of storing hashes in a meta file called “hashlist.json” to “linked tensors”. Every time a sample is appended to a tensor, its linked tensors also get a sample appended. This will let us treat hashes as a separate tensor (and get additional tensor features). I've created a basic framework but will need to run it by members of the team.
What will I do next week?
- Changing the architecture will take most of my time this week. I need to implement linked tensors and sort out their details. One advantage is that I won’t need to adapt the hashes to transforms and version control, as hashes will be treated as a tensor.
Did I get stuck anywhere?
This was my first ever PR to Hub, so I had to learn about the style guide, code coverage tools, CI tests, etc. I had some trouble running the CI tests after making my PR: one of the libraries I was using, mmh3 (murmurhash3), hadn’t been listed in the requirements.txt file.
What did I do this week?
- Merged my local forked repo with the main branch used by Hub. I had to make some fixes as the architecture had changed.
- Added “hub.compare (dataset-1, dataset-2)”. This compares the hashlists of two datasets using Jaccard similarity.
- Fixed the way samples were hashed, to account for compression. Previously, samples were hashed after compression, so the compressed and uncompressed versions of the same sample produced two different hashes.
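As a rough illustration of the last two items, the sketch below computes the Jaccard similarity (size of the intersection divided by size of the union) of two hashlists, and shows why hashing has to happen on the uncompressed bytes. It is not the code behind hub.compare: the function names are made up for this example, `hashlib` and `zlib` stand in for the hashing and compression actually used, and returning 1.0 for two empty hashlists is an assumed convention.

```python
import hashlib
import zlib


def hash_sample(raw: bytes) -> str:
    # Stand-in hash (the project used mmh3); always called on uncompressed bytes.
    return hashlib.sha1(raw).hexdigest()


def jaccard_similarity(hashes_a, hashes_b) -> float:
    """|A intersect B| / |A union B| over two collections of hashes."""
    a, b = set(hashes_a), set(hashes_b)
    if not a and not b:
        return 1.0  # assumed convention: two empty hashlists count as identical
    return len(a & b) / len(a | b)


# Dataset 1 stores raw samples; dataset 2 stores the same kind of data compressed.
samples_1 = [b"cat", b"dog", b"bird"]
stored_2 = [zlib.compress(s) for s in (b"dog", b"bird", b"fish")]

hashes_1 = [hash_sample(s) for s in samples_1]
# Decompress first, so identical samples get identical hashes in both datasets.
hashes_2 = [hash_sample(zlib.decompress(c)) for c in stored_2]

similarity = jaccard_similarity(hashes_1, hashes_2)  # 2 shared hashes out of 4 distinct

# Hashing the compressed bytes instead would give a different hash for the same sample.
mismatch = hash_sample(zlib.compress(b"dog")) != hash_sample(b"dog")
```

Here two of the four distinct samples are shared, so the similarity is 0.5.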
What will I do next week?
- Use transforms to generate hashes in a distributed manner.
- Complete task from last week to compare hashes of datasets being uploaded to datasets already on Hub.
- Write test cases and document code
- Optimize the size of hashlists. Right now hashes are stored with unnecessary quotation marks, e.g. hashes = [“d2asdsdf”, “asd223gk”]. These should be removed to save space.
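One possible way to drop the per-hash quotation marks is sketched below. This is only an illustration of the space saving, not the format I ended up using: `pack` and `unpack` are hypothetical helpers, and the approach assumes hashes never contain a comma.

```python
import json

hashes = ["d2asdsdf", "asd223gk"]

# Current layout: a JSON list, where every hash carries its own pair of quotes.
quoted = json.dumps(hashes)


def pack(hash_list):
    # Hypothetical compact layout: one comma-separated string, no per-hash quotes.
    return ",".join(hash_list)


def unpack(packed):
    # Inverse of pack(); an empty string means an empty hashlist.
    return packed.split(",") if packed else []


compact = pack(hashes)
```

The compact form saves a few bytes per hash, which adds up for datasets with many samples.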
Did I get stuck anywhere?
As mentioned above, merging with the main branch meant I had to make some fixes to my code, which took time.