Eshan's Blog

Blog Post #4: Write Tests ✅

Published: 07/15/2021

Hey friends!
We are mid-way through the GSoC journey and we have put a considerable amount of work into it. The project is turning out well and we are very close to the finish line. This is where we enable DRS and move quickly!
Recently, another branch that optimized the code and replaced the npz format was merged into Hub's main. So naturally, after last week's check-in I updated the code so Hub auto runs smoothly.
Later last week I spent time on 2 major things:
  • Mid season presentation
  • Writing tests for Hub auto

  • Mid-season presentation

    This week I presented my project's progress to the Activeloop members, highlighting what has been done and laying out a plan to wrap up the project in the second half of GSoC.
    I think it went great, and I received important tips from the members that will enhance Hub auto.
    I'll take this opportunity to share my progress on the blog. Dyllan and I have been through multiple iterations of hub-auto and we have settled on feature/2.0/image-classification-auto.
    I have tested it on about 200 datasets and the success rate is around 50%. Things will turn around once I integrate handling of multiple file extensions in the coming weeks, and this number should shoot up to 80-90% 🚀

    Writing tests for Hub auto

    I wrote tests that make testing hub-auto convenient. Additionally, I added back the "kaggle" functionality that was earlier removed from hub-auto. This API allows users to download Kaggle datasets and convert them to structured datasets with just one line of code.
    To wrap up, this week was super productive and by next week we are hoping to increase the success rate of hub-auto!

    Week 4-5 check-in! 🐳

    Published: 07/08/2021

    Hey friends!
    Welcome to a new weekly check-in! We have made big progress on our GSoC journey, and this week was pretty significant!

    What Did I do this week?

    Before we get started, it's story time! A few months ago, my mentor Dyllan developed a version of hub-auto that worked on an early release of Hub 2.0. This version of hub-auto did the job of structuring datasets and worked locally.
    In the past few weeks, Davit, Abhinav, Kristina and Dyllan (all part of the Activeloop team) have worked on the Wasabi integration, compression, the new index meta and a complete refactor of the chunk-engine. This broke the hub-auto code, which had not been changed in months. So naturally, this week I upgraded hub-auto to work with the latest release of Hub 2.0! I encountered several bugs, but it only took me a few days to get the hang of what hub-auto is supposed to be doing in the latest release.

    What will I do in the coming week?

    Now that I have upgraded hub-auto to work with the latest release of Hub 2.0, I am going to use it to upload datasets directly to Hub cloud ☁ What earlier took hours will now be done in seconds. I had prepared a Notion doc with over 200 datasets and their responses to hub-auto, and Dyllan asked me to turn that data into a spreadsheet. I'm currently working on the spreadsheet. Once it is ready, we can use all that data to improve hub-auto in the coming weeks.

    Where did I get stuck?

    While upgrading hub-auto, I fixed several issues with the code. There were many libraries to be dropped and chunks of code to be removed, particularly to make the code work with the refactored chunk-engine. Fortunately, I wasn't stuck on bugs as much as I used to be during the earlier days of contributing to Hub. Experience is the best teacher, after all.

    Special thanks to my mentor Dyllan, who has been super proactive in helping me improve hub-auto and fixing bugs that were out of my scope. Much respect and love to you man! 💜

    Blog Post #3: Update Hub auto to work with compression 🤖

    Published: 07/01/2021

    Hey friends!
    Welcome to the 3rd blog post on my Google Summer of Code '21 journey. This is a follow-up to last week's blog post. Things have taken an interesting turn, with the Wasabi integration and compression being merged into the release/2.0 branch.

    Previously, I was working on updating the feature/2.0/hub-auto branch to work with release/2.0. With the latest integration, this process has become a little more complex, since a lot of changes have been made, all for the better.

    I tried merging feature/2.0/hub-auto into release/2.0 locally. I fixed several merge conflicts; however, I got stuck at this error.

    I spent a few days trying to solve these errors, but my efforts proved futile. I brought this up with my mentor Dyllan and he suggested an alternate approach. Since neither Compression/Wasabi nor Hub auto was written by me, it was a given that I would face trouble merging the two pieces of code. He also kindly allowed me to take my time with it.

    He suggested that I start from release/2.0 and build Hub auto on top of it. This approach might take time, but it would give me a deep sense of how Hub auto works. Thus, I have begun building Hub auto from the ground up.

    Shh! I am also giving myself one more day to work on the merge with Hub 2.0; if it works, great! Otherwise, I will fall back to building it from the ground up.

    I will mainly be working on 2 functions:
    • from_kaggle
      This function will download a dataset from Kaggle and convert it into a structured Hub dataset, locally.
    • from_path
      This function is the essence of Hub auto: it will allow users to convert any unstructured dataset into a structured Hub dataset with just one line of code.
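To make the intent concrete, here is a minimal sketch of the directory-walking idea behind from_path for image classification. This is my own illustration, not the actual Hub implementation: the function name and the layout assumption (a root folder with one subfolder per class) are mine.

```python
import os

# Extensions hub-auto currently handles for image classification
SUPPORTED_EXTENSIONS = (".jpeg", ".jpg", ".png")


def scan_image_classification_dir(root):
    """Walk a `root/class_name/*.jpg`-style layout and collect
    (image path, class label) pairs for supported image formats.
    Hypothetical sketch of the idea behind from_path, not Hub's API."""
    samples = []
    for class_name in sorted(os.listdir(root)):
        class_dir = os.path.join(root, class_name)
        if not os.path.isdir(class_dir):
            continue  # skip stray files at the top level
        for fname in sorted(os.listdir(class_dir)):
            ext = os.path.splitext(fname)[1].lower()
            if ext in SUPPORTED_EXTENSIONS:
                samples.append((os.path.join(class_dir, fname), class_name))
    return samples
```

A real from_path would then write each (image, label) pair into a structured Hub dataset; the sketch stops at collecting the samples.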

    I have begun working on these functions and things are coming along well. I hope to make a lot of progress this week! 🚀

    Blog Post #2: Research on Datasets 📕

    Published: 07/01/2021

    Hey friends!
    Welcome to the 2nd blog post on my Google Summer of Code '21 Journey. It's exciting times, a lot of work is being done on Hub 2.0 and things are looking better than ever! 🚀

    How it started

    This week, I put myself in the users' shoes and started uploading some datasets with Hub 1.3.7. However, I encountered this error:

    After some fiddling around with my datasets, I couldn't figure out what the problem was. I approached the community members on the Slack channel and they asked me to roll back from Hub 1.3.7 to 1.3.5, but the issue persisted. This gave me an opportunity to work with the alpha release of Hub 2.0 to solve my error, specifically Hub auto! Hub auto is a feature I am working on for Hub 2.0. It currently works on image classification tasks, thanks to my super mentor Dyllan.

    How it's going

    With the issues I encountered, Dyllan asked me to test Hub auto with datasets. I took this opportunity to document the errors on Notion. Currently, Hub auto works with 3 file extensions: [.jpeg, .jpg, .png]. I created a list of all file extensions that do not work and throw the error: does not support the " " extension. Available extensions: ['.jpeg', '.jpg', '.png'].
    I went ahead and created a color-coded toggle list with the dataset name up top, revealing the details underneath. I tested Hub auto with around 200 datasets and made a detailed list of all errors encountered, as such:

    File Extensions

    Hub auto errors
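The extension gate that produces that error message boils down to logic like the following. This is a standalone sketch written for this post, not the actual Hub source; the function name is mine.

```python
import os

SUPPORTED_EXTENSIONS = [".jpeg", ".jpg", ".png"]


def validate_extension(filename):
    """Return the file's extension if it is supported, otherwise raise
    an error mirroring the hub-auto message. Illustrative sketch only."""
    ext = os.path.splitext(filename)[1].lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(
            'does not support the "%s" extension. '
            "Available extensions: %s" % (ext, SUPPORTED_EXTENSIONS)
        )
    return ext
```

Running every file in a dataset through a check like this is what produced the list of unsupported extensions above.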

    Summing up
    It's been a productive week. I have utilised my time testing hub auto and I am looking forward to bringing it up to the release/2.0 state next week.

    Week 3-4 Research 💭

    Published: 06/22/2021

    Let's take a deep dive into what I've been doing for the past week followed by some tasks for next week!

    What did I do this week?

    Following up on last week's blog, I was tasked with research on dataset structures and file extensions. I have been through 200 datasets in the past week and documented every bit of it on Notion, to later present to my mentor Dyllan McCreary. This task will help us get super equipped to work on hub-auto.
    This week, Activeloop organised a few CVPR panels on Clubhouse that I actively took part in as an audience member. I thoroughly enjoyed the speakers' presentations, and I'm looking forward to more discussions of this kind on ML and Computer Vision in the future.

    What will I do next week?

    Next week, I have set myself 2 tasks:
    1. Going through more datasets
    2. Currently, the hub auto branch doesn't work with the compression branch. I am going to experiment and try to make them work together from the ground up!

    Did I get stuck anywhere?

    I was trying to resolve merge conflicts between Hub's release/2.0 branch and feature/2.0/auto. Unfortunately, I wasn't able to find common ground between these branches. I might be able to get it done if I start building from the ground up this week!