GSoC Blog | Activeloop | Week 2

lowlypalace
Published: 06/29/2022

What did you do this week?

This week, I've been focusing on running experiments with automatic dataset augmentations as well as label fixing.

As a first experiment, I used the open-source tool cleanlab to automatically find label issues in the MNIST dataset. cleanlab requires out-of-sample predicted probabilities for every sample in the dataset, i.e. probabilities produced by a model that never saw that sample during training, which is why I had to use cross-validation. There are two main ways to implement cross-validation for neural networks: 1) wrap the model into an sklearn-compatible estimator and use the cross-validation utilities from scikit-learn, or 2) implement your own cross-validation loop and extract the probabilities from each fold. I have been experimenting with both approaches. First, I tried the skorch Python library, which wraps a PyTorch model into an sklearn-compatible model. I had to override a few methods to make it compatible with Hub datasets.
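Below is a minimal sketch of that first route. The CNN, the hyperparameters, and the X/y arrays are placeholders rather than my actual setup; the point is that once the model is wrapped by skorch, scikit-learn's cross_val_predict can produce the out-of-sample probabilities that cleanlab consumes.

import torch.nn as nn
from skorch import NeuralNetClassifier
from sklearn.model_selection import cross_val_predict
from cleanlab.filter import find_label_issues

# Placeholder model, not the architecture I actually trained.
class SimpleCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Flatten(),
            nn.Linear(16 * 14 * 14, 10),  # logits for the 10 classes
        )

    def forward(self, x):
        return self.layers(x)

# X: float32 images of shape (n, 1, 28, 28), y: int64 labels,
# assumed to be already loaded from the dataset.
model = NeuralNetClassifier(SimpleCNN, criterion=nn.CrossEntropyLoss,
                            max_epochs=10, lr=0.1)

# Each sample is predicted by a model trained on the other folds,
# so the probabilities are out-of-sample.
pred_probs = cross_val_predict(model, X, y, cv=5, method="predict_proba")
label_issues = find_label_issues(labels=y, pred_probs=pred_probs)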

As a second experiment, I implemented a data augmentation pipeline with pre-trained policies in Torchvision and compared it to the plain and baseline approaches.

  • Plain — only the Normalize() operation is applied.
  • Baseline — a combination of RandomHorizontalFlip(), RandomCrop(), and RandomErasing().
  • AutoAugment — a policy where AutoAugment is applied as an additional transformation on top of the baseline configuration. Augmentation Example on Colab. torchvision provides policies pre-trained on datasets like CIFAR-10, ImageNet, and SVHN; all of them are available in the AutoAugmentPolicy enum.

After applying the data transformations, I trained the models on the Fashion-MNIST dataset for 40 epochs. Below are the results I obtained.

Plain

transform_train = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,)),
    ])
Epoch: [40 | 40] LR: 0.100000
Processing |################################| (469/469) Data: 0.003s | Batch: 0.021s | Total: 0:00:09 | ETA: 0:00:01 | Loss: 0.1647 | top1:  94.2267 | top5:  99.9650
Processing |################################| (100/100) Data: 0.007s | Batch: 0.017s | Total: 0:00:01 | ETA: 0:00:01 | Loss: 0.2718 | top1:  90.8600 | top5:  99.8200
Best acc:
92.22

Baseline

transform_train = transforms.Compose([
        transforms.RandomHorizontalFlip(),
        transforms.RandomCrop(32, 4),
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,)),
        transforms.RandomErasing()
    ])
Epoch: [40 | 40] LR: 0.100000
Processing |################################| (469/469) Data: 0.023s | Batch: 0.043s | Total: 0:00:20 | ETA: 0:00:01 | Loss: 0.2602 | top1:  90.5000 | top5:  99.9117
Processing |################################| (100/100) Data: 0.008s | Batch: 0.018s | Total: 0:00:01 | ETA: 0:00:01 | Loss: 0.2430 | top1:  91.3800 | top5:  99.8700
Best acc:
91.38

AutoAugment

transform_train = transforms.Compose([
        transforms.AutoAugment(AutoAugmentPolicy.IMAGENET),
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,)),
    ])
Epoch: [40 | 40] LR: 0.100000
Processing |################################| (469/469) Data: 0.033s | Batch: 0.054s | Total: 0:00:25 | ETA: 0:00:01 | Loss: 0.2633 | top1:  90.4683 | top5:  99.9067
Processing |################################| (100/100) Data: 0.006s | Batch: 0.016s | Total: 0:00:01 | ETA: 0:00:01 | Loss: 0.2281 | top1:  91.6000 | top5:  99.9500
Best acc:
92.03


What is coming up next?

As a next step, I will be working on making cleanlab compatible with Hub datasets. Specifically, I will implement my own cross-validation loop to obtain the out-of-sample probabilities, as sketched below.
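The rough shape of that loop is sketched here; build_model() and train_one_fold() are hypothetical helpers, and the real implementation will read batches from a Hub dataset rather than in-memory arrays.

import numpy as np
import torch
import torch.nn.functional as F
from sklearn.model_selection import KFold

def out_of_sample_probs(X, y, n_splits=5, n_classes=10):
    probs = np.zeros((len(X), n_classes), dtype=np.float32)
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, val_idx in kfold.split(X):
        model = build_model()  # fresh, untrained model for every fold
        train_one_fold(model, X[train_idx], y[train_idx])
        model.eval()
        with torch.no_grad():
            logits = model(torch.as_tensor(X[val_idx]))
            # each sample is scored by a model that never trained on it
            probs[val_idx] = F.softmax(logits, dim=1).numpy()
    return probs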

Did you get stuck anywhere?

I mainly had design questions about how users will interact with the API, but I communicated with the team and resolved my doubts.
