Blog Post #2: Research on Datasets 📕

Published: 07/01/2021

Hey friends!
Welcome to the 2nd blog post on my Google Summer of Code '21 journey. These are exciting times: a lot of work is being done on Hub 2.0 and things are looking better than ever! 🚀

How it started

This week, I put myself in the users' shoes and started using Hub to upload some datasets. I was on Hub 1.3.7, however, when I encountered this error:

After some fiddling around with my datasets, I still couldn't figure out what the problem was. I approached the community members on the Slack channel, and they asked me to roll back Hub from version 1.3.7 to 1.3.5, but the issue persisted. This gave me an opportunity to work with the alpha release of Hub 2.0 to solve my error, specifically Hub auto! Hub auto is a feature I am working on for Hub 2.0. It currently works on image classification tasks, thanks to my super mentor Dyllan.

How it's going

With the issues I encountered, Dyllan asked me to test Hub auto with datasets. I took this opportunity to document the errors on Notion. Currently, Hub auto works on 3 file extensions: .jpeg, .jpg and .png. I created a list of all file extensions that do not work and throw the error: does not support the " " extension. Available extensions: ['.jpeg', '.jpg', '.png'].
I went ahead and created a color-coded toggle list that has each dataset up top and reveals what is under the hood when toggled. I tested Hub auto with around 200 datasets and made a detailed list of all errors encountered, as such:
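To make the extension check concrete, here is a minimal sketch of how such a check could look. This is not Hub auto's actual code: the function names `check_extension` and `unsupported_extensions` are hypothetical, and only the list of supported extensions and the wording of the error message come from the post.

```python
from pathlib import Path

# Extensions the post lists as supported by Hub auto at the time.
SUPPORTED_EXTENSIONS = [".jpeg", ".jpg", ".png"]


def check_extension(path):
    """Return the lowercased extension of `path`, or raise if unsupported.

    Hypothetical helper: the real Hub auto check may differ.
    """
    ext = Path(path).suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(
            f'does not support the "{ext}" extension. '
            f"Available extensions: {SUPPORTED_EXTENSIONS}"
        )
    return ext


def unsupported_extensions(paths):
    """Collect, in sorted order, every extension in `paths` that would be rejected."""
    found = {Path(p).suffix.lower() for p in paths}
    return sorted(found - set(SUPPORTED_EXTENSIONS))
```

Running something like `unsupported_extensions` over the file list of each dataset is one way to build the kind of "extensions that do not work" list described above without uploading anything.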

File Extensions

Hub auto errors

Summing up

It's been a productive week. I have utilised my time testing Hub auto and I am looking forward to bringing it up to the release/2.0 state next week.

Week 3-4 Research 💭

Published: 06/22/2021

Let's take a deep dive into what I've been doing for the past week followed by some tasks for next week!

What did I do this week?

Following up on last week's blog, I was tasked with research on dataset structures and file extensions. I have been through 200 datasets in the past week and documented every bit of it on Notion to later present it to my mentor Dyllan McCreary. This task will help us get super equipped to work on hub-auto.
This week, Activeloop organised a few CVPR panels on Clubhouse that I actively took part in as an audience member. I thoroughly enjoyed the speakers' presentations and I'm looking forward to more discussions of this kind on ML and Computer Vision in the future.

What will I do next week?

Next week, I have set myself 2 tasks:
  1. Going through more datasets
  2. Currently, the hub auto branch doesn't work with the compression branch. I am going to experiment and try to make them work together from the ground up!

Did I get stuck anywhere?

I was trying to resolve merge conflicts between the Hub release/2.0 branch and feature/2.0/auto. Unfortunately, I wasn't able to find common ground between these features. I might be able to get it done if I start building it from the ground up this week!

Week 2-3 Coding and Research 🧐

Published: 06/14/2021

The Python Software Foundation provides a pretty template, so I have decided to use that to answer 3 basic questions about my week's progress.

What did I do this week?

I have been in touch with my mentor Dyllan, and we have begun brainstorming on how to approach our project. Last week's work on Index Map was taken over by Dyllan, as it was a little ambiguous and complex for me. Well, I am glad I gave it a shot, because no effort ever goes to waste.
Next up, I am going through a tonne of Kaggle datasets and figuring out how their structures correlate. I am using Notability and Notion for highlighting and creating a decent layout of my work.
Here's some Behind the Scenes: 🐳



What will I do next week?

I plan to go through a lot more datasets to cover as many edge cases as possible. In parallel, I am working on uploading datasets to Hub using the latest Hub 2.0 alpha version, which was released last week.

Did I get stuck anywhere?

I was unable to upload datasets to Hub. It turns out there was a bug which is actively being worked on, and the proactive community recommended that I use Hub 2.0 to proceed further.

Well, that's all for now. I will write a comprehensive one next week! ✨

Week 1-2 (Community Bonding Period) 🌎

Published: 06/07/2021

Hey friends!
Happy to say that I'm off to a great start to Google Summer of Code! The first 2 weeks are the Community Bonding Period. The idea behind this is to get a good grasp of the codebase and connect with the wonderful people at the organisation.
My community bonding period started with an insightful discussion with my mentor Dyllan McCreary, who walked me through the new codebase Activeloop has been working on for Hub 2.0. He was kind enough to get deep into the important parts of Hub. Dyllan assigned me a few (3) tasks that would help me get familiarised with the codebase.
This session was quite helpful and now that I have tasks assigned to me I thought it was great that I could get a head-start to the summer of code!

Task 1: Depickle the code!

As a general practice, using pickle in your code is advised against, as it is susceptible to security vulnerabilities: unpickling untrusted data can execute arbitrary code. Hence, I was required to replace all occurrences of pickle.dumps() and pickle.loads() with something else. (json did the trick!) For more info on why pickle shouldn't be used check this out.
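Here is a minimal sketch of what such a swap can look like, assuming the data being stored is a plain dictionary of JSON-friendly values. The helper names `dumps_meta` and `loads_meta` are hypothetical and are not taken from the Hub codebase.

```python
import json


def dumps_meta(meta: dict) -> bytes:
    """Serialise a metadata dict to bytes with json instead of pickle.dumps()."""
    return json.dumps(meta).encode("utf-8")


def loads_meta(buffer: bytes) -> dict:
    """Parse bytes produced by dumps_meta; unlike pickle.loads(), this never runs code."""
    return json.loads(buffer.decode("utf-8"))


# Round trip: the dict comes back unchanged.
meta = {"dtype": "uint8", "length": 3, "chunk_names": ["c0", "c1"]}
restored = loads_meta(dumps_meta(meta))
```

One trade-off to keep in mind: json only understands basic types (dicts, lists, strings, numbers, booleans and None), so tuples come back as lists and custom objects need to be converted by hand.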

Task 2: Convert IndexMap to a list of IndexMapEntry

This task took the longest to complete. We went through multiple ways we could do this and decided to stick to the class method. I was required to convert IndexMap from a list of dictionaries to a list of IndexMapEntry (a new class). This meant creating new classes for both IndexMap and IndexMapEntry, followed by writing tests for them. In the course of this task, Dyllan introduced me to a datatype called namedtuple and also helped me get started with parametrising tests. I have implemented the classes and tests; however, this broke some parts of the other code the team has tirelessly put together :(
I am currently working on it and it should be done by today 🤞🏻
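To illustrate the idea, here is a rough sketch of moving from a list of dictionaries to a list of named-tuple entries. It is not the implementation that landed in Hub: the field names (`chunk_names`, `start_byte`, `end_byte`) and the `from_dicts` constructor are assumptions made purely for this example.

```python
from typing import Iterable, List, NamedTuple, Tuple


class IndexMapEntry(NamedTuple):
    # Hypothetical fields; the real IndexMapEntry may store different data.
    chunk_names: Tuple[str, ...]
    start_byte: int
    end_byte: int


class IndexMap:
    """A thin wrapper around a list of IndexMapEntry objects."""

    def __init__(self, entries: Iterable[IndexMapEntry] = ()):
        self._entries: List[IndexMapEntry] = list(entries)

    @classmethod
    def from_dicts(cls, dicts: Iterable[dict]) -> "IndexMap":
        """Build an IndexMap from the old list-of-dictionaries layout."""
        return cls(
            IndexMapEntry(tuple(d["chunk_names"]), d["start_byte"], d["end_byte"])
            for d in dicts
        )

    def __len__(self) -> int:
        return len(self._entries)

    def __getitem__(self, index: int) -> IndexMapEntry:
        return self._entries[index]


old_style = [{"chunk_names": ["c0"], "start_byte": 0, "end_byte": 10}]
index_map = IndexMap.from_dicts(old_style)
first = index_map[0]
```

Compared with plain dictionaries, a named tuple gives attribute access (`first.end_byte`), is immutable, and fails loudly on a misspelled field name instead of silently creating a new key.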

Task 3: Modify Read/Write fixtures for cache in Storage tests

This was a 2-minute task but it helped me get familiar with a few storage things that happen under the hood in Hub 2.0.

All these tasks were designed to give me a solid head start, in hope of making me feel at home when working on my project this summer!

Apart from these tasks, we witnessed the launch of the Alpha version of Hub 2.0 on 3rd June 2021. It was a wonderful launch where the team showcased all the shiny features coming to Hub 2.0, with a progress report on what is implemented. The results displayed more than a 6x increase in performance compared to Hub 1.0. The presentation by Activeloop's CEO, Davit Buniatyan, showed that what the team has achieved with Hub 2.0 is truly remarkable!
Things are just getting started and I am beyond excited for what is to come! ♥️

Welcome to my GSoC journey ✨

Published: 06/07/2021

Hello there!
I am Eshan Arora (@thisiseshan everywhere on the internet), a senior year computer engineering undergrad at NIT Surat, India.
Welcome to my blog! I am thrilled to share that I will be working on Hub 2.0 this summer!

Hub aims to reduce the time researchers spend figuring out data by enabling dataset streaming, thus allowing data scientists to spend more of their time on building epic Machine Learning models.

My project this summer is to implement automatic schema generation in Hub 2.0, which would enable any kind of dataset to be seamlessly stored on Hub with just a single line of code.
I am excited to be given this opportunity and aim to bring the best out of it this summer!

You can check out Hub here
For Hub 2.0, check out the release/2.0 branch! 🚀
