GSoC Blog | Activeloop | Week 2

lowlypalace
Published: 06/29/2022

What did you do this week?

This week, I've been focusing on running experiments with automatic dataset augmentation as well as automatic label fixing.

As a first experiment, I used cleanlab to automatically find label issues in the MNIST dataset. cleanlab is an open-source tool that requires out-of-sample predicted probabilities for each sample in the dataset, which means I had to use cross-validation. There are two main ways to implement cross-validation for neural networks: 1) wrap the model into an sklearn-compatible model and use scikit-learn's cross-validation, or 2) implement your own cross-validation algorithm and extract the probabilities from each fold. I have been experimenting with both approaches. First, I tried the skorch Python library, which wraps a PyTorch model into an sklearn-compatible model. I had to override a few methods to make it compatible with Hub datasets.
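
To make the first approach concrete, here is a minimal sketch of how a skorch-wrapped model can feed out-of-sample probabilities to cleanlab. The SimpleCNN module and the X/y arrays are hypothetical stand-ins, and the sketch omits the Hub-compatibility overrides mentioned above.

import torch.nn as nn
from skorch import NeuralNetClassifier
from sklearn.model_selection import cross_val_predict
from cleanlab.filter import find_label_issues

class SimpleCNN(nn.Module):  # hypothetical placeholder architecture
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28 * 28, 128),
            nn.ReLU(),
            nn.Linear(128, 10),
        )

    def forward(self, x):
        return self.layers(x)

# With CrossEntropyLoss, skorch applies softmax in predict_proba()
model = NeuralNetClassifier(
    SimpleCNN,
    criterion=nn.CrossEntropyLoss,
    max_epochs=10,
    lr=0.1,
)

# X: float32 images of shape (n, 1, 28, 28), y: int64 labels — assumed to be
# materialized as NumPy arrays from the Hub dataset beforehand.
pred_probs = cross_val_predict(model, X, y, cv=5, method="predict_proba")
label_issues = find_label_issues(labels=y, pred_probs=pred_probs)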

As a second experiment, I implemented a data augmentation pipeline with Pre-Trained Policies in Torchvision and compared it to the baseline and plain approaches:

  • Plain — only the Normalize() operation is applied.
  • Baseline — a combination of RandomHorizontalFlip(), RandomCrop(), and RandomErasing().
  • AutoAugment — a policy where AutoAugment is applied as an additional transformation along with the baseline configuration (see the Augmentation Example on Colab). torchvision provides policies pre-trained on datasets like CIFAR-10, ImageNet, and SVHN; all of them are available in the AutoAugmentPolicy enum.

After applying the data transformations, I trained the models on the Fashion-MNIST dataset for 40 epochs. Below are the results that I've obtained.

Plain

transform_train = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,)),
    ])
Epoch: [40 | 40] LR: 0.100000
Processing |################################| (469/469) Data: 0.003s | Batch: 0.021s | Total: 0:00:09 | ETA: 0:00:01 | Loss: 0.1647 | top1:  94.2267 | top5:  99.9650
Processing |################################| (100/100) Data: 0.007s | Batch: 0.017s | Total: 0:00:01 | ETA: 0:00:01 | Loss: 0.2718 | top1:  90.8600 | top5:  99.8200
Best acc:
92.22

Baseline

transform_train = transforms.Compose([
        transforms.RandomHorizontalFlip(),
        transforms.RandomCrop(32, 4),
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,)),
        transforms.RandomErasing(),
    ])
Epoch: [40 | 40] LR: 0.100000
Processing |################################| (469/469) Data: 0.023s | Batch: 0.043s | Total: 0:00:20 | ETA: 0:00:01 | Loss: 0.2602 | top1:  90.5000 | top5:  99.9117
Processing |################################| (100/100) Data: 0.008s | Batch: 0.018s | Total: 0:00:01 | ETA: 0:00:01 | Loss: 0.2430 | top1:  91.3800 | top5:  99.8700
Best acc:
91.38

AutoAugment

transform_train = transforms.Compose([
        transforms.AutoAugment(AutoAugmentPolicy.IMAGENET),
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,)),
    ])
Epoch: [40 | 40] LR: 0.100000
Processing |################################| (469/469) Data: 0.033s | Batch: 0.054s | Total: 0:00:25 | ETA: 0:00:01 | Loss: 0.2633 | top1:  90.4683 | top5:  99.9067
Processing |################################| (100/100) Data: 0.006s | Batch: 0.016s | Total: 0:00:01 | ETA: 0:00:01 | Loss: 0.2281 | top1:  91.6000 | top5:  99.9500
Best acc:
92.03

Comparing the three runs, the plain configuration reached the best test accuracy (92.22%), followed by AutoAugment (92.03%) and the baseline (91.38%), which suggests that with only 40 epochs the stronger augmentations did not yet translate into higher accuracy.


What is coming up next?

As a next step, I will be working on making cleanlab compatible with Hub datasets. Specifically, I will implement my own cross-validation algorithm to obtain the out-of-sample probabilities.
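
The rough shape of that manual loop is sketched below — train_one_model() is a hypothetical helper standing in for the actual training code, and X/y are assumed to be NumPy arrays materialized from the Hub dataset. The key property is that each sample's probabilities come from a fold whose model never saw that sample.

import numpy as np
import torch
from sklearn.model_selection import StratifiedKFold

def out_of_sample_probs(X, y, n_classes=10, n_splits=5):
    pred_probs = np.zeros((len(y), n_classes))
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, val_idx in skf.split(X, y):
        # Train a fresh model on this fold's training split (hypothetical helper)
        model = train_one_model(X[train_idx], y[train_idx])
        model.eval()
        with torch.no_grad():
            logits = model(torch.as_tensor(X[val_idx]))
            # Held-out samples get probabilities from a model that never saw them
            pred_probs[val_idx] = torch.softmax(logits, dim=1).numpy()
    return pred_probs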

Did you get stuck anywhere?

I mainly had design questions about how users will interact with the API, but I've communicated with the team and resolved my doubts.