lowlypalace's Blog

GSoC Blog | Activeloop | Week 3

lowlypalace
Published: 07/17/2022

Overview

This week, I've been mainly working on making Hub datasets compatible with cleanlab. I implemented three tools to benchmark how cleanlab would work on the same dataset fetched from different sources.

Hub Dataset + Dataloader + Skorch

The first tool runs cleanlab directly on datasets in the Hub format and lets users pass a custom Hub Dataloader. As cleanlab's features leverage scikit-learn compatibility, I wrap the PyTorch neural net with skorch, which makes it scikit-learn compatible. However, I had to override a few methods of the generic NeuralNet class, such as get_dataset, get_iterator, train_step_single, evaluation_step, and validation_step, to make Hub datasets work with skorch.
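
A minimal sketch of these overrides, assuming Hub batches arrive as dictionaries with "images" and "labels" keys (the exact method signatures vary across skorch versions):

from skorch import NeuralNet

class HubNeuralNet(NeuralNet):
    def get_dataset(self, X, y=None):
        # Return the Hub dataset as-is instead of wrapping it in
        # skorch's internal Dataset class.
        return X

    def get_iterator(self, dataset, training=False):
        # Delegate batching to Hub's own dataloader instead of a
        # plain torch DataLoader.
        return dataset.pytorch(batch_size=self.batch_size, shuffle=training)

    def train_step_single(self, batch, **fit_params):
        # Hub batches are dicts, so unpack them into (X, y) before
        # the usual forward/backward pass.
        Xi, yi = batch["images"], batch["labels"]
        return super().train_step_single((Xi, yi), **fit_params)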

PyTorch Dataset + PyTorch Dataloader + Skorch

The second tool fetches the same data from torchvision.datasets; this time, however, I didn't need to override any skorch NeuralNet methods, since skorch supports standard PyTorch datasets and dataloaders by default. This step was mainly to ensure that I'm handling the Hub dataset format properly and to check that the training metrics match those obtained with the Hub dataset format.
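
For reference, the skorch setup with a plain PyTorch dataset needs no customization at all. A minimal sketch, assuming a CNN module MyCNN is defined elsewhere:

import torch.nn as nn
from torchvision import datasets, transforms
from skorch import NeuralNetClassifier

# A standard torchvision dataset; skorch handles torch Datasets natively.
train_ds = datasets.FashionMNIST(
    root="data", train=True, download=True,
    transform=transforms.ToTensor(),
)

net = NeuralNetClassifier(
    module=MyCNN,  # hypothetical nn.Module defined elsewhere
    criterion=nn.CrossEntropyLoss,
    max_epochs=40,
    lr=0.1,
)

# Passing the dataset directly works with the stock NeuralNet methods.
net.fit(train_ds, y=None)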

Computing Out-of-Sample Probabilities with Cross-Validation

The third tool also works with Hub datasets, but it doesn't use skorch. Instead, it computes out-of-sample probabilities using cross-validation for the Hub dataset format. As skorch doesn't include the cross-validation functionality that cleanlab requires, this week I focused on implementing cross-validation from scratch.
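
The core idea is simple: split the dataset into K folds, train on K-1 of them, and predict probabilities for the held-out fold, so that every sample ends up with predictions from a model that never saw it. A minimal sketch, where train_model and predict_probs are hypothetical stand-ins for the actual training and inference routines:

import numpy as np
from sklearn.model_selection import KFold

def out_of_sample_probs(images, labels, num_classes, n_splits=5):
    probs = np.zeros((len(labels), num_classes))
    kfold = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for train_idx, holdout_idx in kfold.split(images):
        # Train on K-1 folds (train_model is a hypothetical helper).
        model = train_model(images[train_idx], labels[train_idx])
        # Predict on the held-out fold, so every sample receives
        # probabilities from a model that never saw it.
        probs[holdout_idx] = predict_probs(model, images[holdout_idx])
    return probs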


GSoC Blog | Activeloop | Week 2

lowlypalace
Published: 06/29/2022

What did you do this week?

This week, I've been focusing on running experiments with automatic dataset augmentations as well as label fixing.

As a first experiment, I used cleanlab to automatically find label issues in the MNIST dataset. The open-source tool cleanlab requires out-of-sample predicted probabilities for each sample in the dataset, so I had to use cross-validation. There are two main ways to implement cross-validation for neural networks: 1) wrap the model into an sklearn-compatible model and use cross-validation from scikit-learn, or 2) implement your own cross-validation algorithm and extract probabilities from each fold. I have been experimenting with both approaches. First, I tried the skorch Python library, which wraps a PyTorch model into an sklearn-compatible model; I had to override a few methods to make it compatible with Hub datasets.
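
Once the out-of-sample probabilities are available, cleanlab can rank samples by how likely their labels are wrong. With a recent cleanlab release, this looks roughly as follows (pred_probs holds the cross-validated probabilities):

from cleanlab.filter import find_label_issues

issue_indices = find_label_issues(
    labels=labels,          # given (possibly noisy) integer labels
    pred_probs=pred_probs,  # (n_samples, n_classes) out-of-sample probabilities
    return_indices_ranked_by="self_confidence",  # most likely errors first
)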

As a second experiment, I implemented a data augmentation pipeline with pre-trained policies in torchvision and compared them to the baseline and plain approaches:

  • Plain — only the Normalize() operation is applied.
  • Baseline — a combination of RandomHorizontalFlip(), RandomCrop(), and RandomErasing().
  • AutoAugment — a policy where AutoAugment is applied as an additional transformation along with the baseline configuration (see the Augmentation Example on Colab). torchvision provides policies pre-trained on datasets like CIFAR-10, ImageNet, and SVHN; all of these are available via the AutoAugmentPolicy enum.

After applying the data transformations, I trained the models on the Fashion-MNIST dataset for 40 epochs. Below are the results I obtained.

Plain

transform_train = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,)),
    ])
Epoch: [40 | 40] LR: 0.100000
Processing |################################| (469/469) Data: 0.003s | Batch: 0.021s | Total: 0:00:09 | ETA: 0:00:01 | Loss: 0.1647 | top1:  94.2267 | top5:  99.9650
Processing |################################| (100/100) Data: 0.007s | Batch: 0.017s | Total: 0:00:01 | ETA: 0:00:01 | Loss: 0.2718 | top1:  90.8600 | top5:  99.8200
Best acc:
92.22

Baseline

transform_train = transforms.Compose([
        transforms.RandomHorizontalFlip(),
        transforms.RandomCrop(32, 4),
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,)),
        transforms.RandomErasing()
    ])
Epoch: [40 | 40] LR: 0.100000
Processing |################################| (469/469) Data: 0.023s | Batch: 0.043s | Total: 0:00:20 | ETA: 0:00:01 | Loss: 0.2602 | top1:  90.5000 | top5:  99.9117
Processing |################################| (100/100) Data: 0.008s | Batch: 0.018s | Total: 0:00:01 | ETA: 0:00:01 | Loss: 0.2430 | top1:  91.3800 | top5:  99.8700
Best acc:
91.38

AutoAugment

transform_train = transforms.Compose([
        transforms.AutoAugment(AutoAugmentPolicy.IMAGENET),
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,)),
    ])
Epoch: [40 | 40] LR: 0.100000
Processing |################################| (469/469) Data: 0.033s | Batch: 0.054s | Total: 0:00:25 | ETA: 0:00:01 | Loss: 0.2633 | top1:  90.4683 | top5:  99.9067
Processing |################################| (100/100) Data: 0.006s | Batch: 0.016s | Total: 0:00:01 | ETA: 0:00:01 | Loss: 0.2281 | top1:  91.6000 | top5:  99.9500
Best acc:
92.03

 

What is coming up next?

As a next step, I will be working on making cleanlab compatible with Hub datasets. Specifically, I will implement my own cross-validation algorithm to obtain out-of-sample probabilities.

Did you get stuck anywhere?

I mainly had design questions about how users will interact with the API, but I've communicated with the team and resolved them.


GSoC Blog | Activeloop | Week 1

lowlypalace
Published: 06/29/2022

This week, along with the community bonding period, I took a deep dive into the codebase of the project. The project also has a strong research component: its goal is to offer users a set of automatic tools that they can use to improve the overall quality of their datasets. I therefore focused on researching various data-centric tools (e.g. auto-augmentation, label fixing, slice discovery) and their trade-offs. Below, I describe a few of the data-centric tools that I discovered and experimented with.

1. Fix Dataset

These tools focus on identifying errors in datasets. They include traditional constraint-based data cleaning methods as well as methods that use machine learning to detect and resolve data errors.

The labels in datasets from real-world applications can be of far lower quality than expected; recent studies have discovered that even ML benchmark datasets are full of label errors. The goal of this step would be to use one of the open-source tools, such as cleanlab, that automatically find and fix errors in any ML dataset.

2. Auto Augmentations

A technique to increase the diversity of your training set by applying random (but realistic) transformations, such as image rotation.

Automatic augmentation is useful not only for increasing accuracy; it also prevents overfitting and helps models generalize better. Transformations enlarge the dataset by adding slightly modified copies of existing images.

2.1 API

First, I came up with a high-level API to automatically augment images:

ds.autoaugment(task)

ds.autoaugment() takes a task and a set of optional parameters and returns a set of optimal augmentation policies.

Args

task - The name of the deep learning task. Supported values are classification and semantic_segmentation.

num_classes - An optional parameter to provide the number of distinct classes in the classification or segmentation dataset. If not given, the number of classes is found automatically.

model - An optional parameter to provide a custom model. By default, uses pytorch-image-models (https://github.com/rwightman/pytorch-image-models) for classification and segmentation_models.pytorch (https://github.com/qubvel/segmentation_models.pytorch) for semantic segmentation.

preprocess - An optional parameter to provide preprocessing transforms. If images have different sizes or formats, you can define preprocessing transforms (such as resizing, cropping, and normalization).

Returns

transform - A wrapper function that contains the discovered policies for the augmentation pipeline. This function can be applied to the complete dataset when loading it or during training.

ds.autoaugment() produces a transform pipeline, i.e. a configuration for the augmentation pipeline, that can be applied when loading the dataset.
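
A sketch of how this could look in practice, with illustrative tensor names and loader arguments:

# Discover augmentation policies for a classification dataset.
transform = ds.autoaugment("classification")

# Apply the returned transform while streaming the dataset into PyTorch.
train_loader = ds.pytorch(
    transform={"images": transform, "labels": None},
    batch_size=64,
    shuffle=True,
)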


2.2 Data Augmentation Approaches

2.2.1 Pre-Trained Policies

  • PyTorch provides pre-trained augmentation transform policies. We can take AutoAugment policies learned on one dataset, apply them to a different dataset, and compare them to a baseline with no transformations or only a few basic ones.
  • DADA provides Data Augmentation policies found for CIFAR-10, CIFAR-100, SVHN, and ImageNet datasets.
  • Pros
    • No time spent on finding policies, training and validation
    • No input parameters needed
  • Cons
    • The policies are not tailored for the dataset at hand

2.2.2 Faster AA / DADA Implementation

Next, I researched various automatic approaches to find data augmentation policies from data.

  • Many data augmentation tools have been developed in recent years.

  • Faster AA and DADA are among the newest tools (libraries) that provide a good accuracy/time trade-off.

  • The libraries implement only basic classification tasks with common datasets (e.g. CIFAR, SVHN).

  • Object detection and image segmentation are not supported.

  • The libraries are research-oriented.

The table below shows the training time on ImageNet for DADA, Faster AA, and Deep AA; DADA is roughly twice as fast as Faster AA.

Number of GPUs   DADA   Faster AA   Deep AA
1 GPU            1.3    2.3         96
2 GPUs           0.6    1.1         48
4 GPUs           0.3    0.5         24
8 GPUs           0.1    0.2         12

While the accuracy of Deep AA on ImageNet with ResNet-50 is higher than that of DADA and Faster AA, it is considerably slower.

Dataset                         DADA   Faster AA   Deep AA
ImageNet (ResNet-50)            77.5   76.5        78.30 ± 0.14
ImageNet (ResNet-200)           -      -           81.32 ± 0.17
CIFAR-10 (Wide-ResNet-28-10)    97.3   97.4        97.56 ± 0.14
CIFAR-100 (Wide-ResNet-28-10)   82.5   82.7        84.02 ± 0.18

2.2.3 Albumentations

  • Albumentations supports different computer vision tasks such as classification, semantic segmentation, instance segmentation, object detection, and pose estimation.

  • For most image operations, Albumentations is faster than all alternatives.

  • AutoAlbument is an AutoML tool that learns image augmentation policies from data.

    • AutoAlbument supports image classification and semantic segmentation tasks.
    • Under the hood, it uses the Faster AutoAugment algorithm.
    • We can use Albumentations to utilize policies discovered by AutoAlbument, as sketched below.
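
A policy file produced by an AutoAlbument search can be loaded back with Albumentations and combined with regular preprocessing. A minimal sketch, with a hypothetical policy file path:

import albumentations as A
from albumentations.pytorch import ToTensorV2

# Load the augmentation policy discovered by AutoAlbument
# (the path to the policy file is hypothetical).
policy = A.load("outputs/policy/latest.json")

# Compose the learned policy with standard preprocessing.
transform = A.Compose([
    policy,
    A.Normalize(mean=(0.1307,), std=(0.3081,)),
    ToTensorV2(),
])

augmented = transform(image=image)["image"]  # image: NumPy array (H, W, C)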


Besides coding, this week I jumped on another task to get to know the Hub community better: I coordinated a team of open-source contributors and allocated some of the tasks to them.

 

 
