This week, alongside the community bonding period, I did a deep dive into the codebase of the project. The project also has a strong research component: its goal is to offer users a set of automatic tools that they can use to improve the overall quality of their datasets. Therefore, I focused on researching various data-centric tools (e.g. auto-augmentation, fixing labels, slice discovery) and their trade-offs. Below, I describe a few of the data-centric tools that I discovered and experimented with.
1. Fix Dataset
These tools focus on identifying errors in datasets. They include traditional constraint-based data cleaning methods, as well as methods that use machine learning to detect and resolve data errors.
The labels in datasets from real-world applications can be of far lower quality than expected. Recent studies have discovered that even ML benchmark datasets are full of label errors. The goal of this step is to use one of the open-source tools, such as cleanlab, that automatically finds and fixes label errors in any ML dataset.
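As an illustration, cleanlab can flag likely label errors given out-of-sample predicted probabilities from any classifier. A minimal sketch (the toy data below is made up purely for illustration):

```python
import numpy as np
from cleanlab.filter import find_label_issues

# Toy example: 6 samples, 2 classes. In practice, pred_probs should be
# out-of-sample predicted probabilities from your own model (e.g. via cross-validation).
labels = np.array([0, 0, 0, 1, 1, 1])
pred_probs = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.7, 0.3],
    [0.1, 0.9],
    [0.85, 0.15],  # the model strongly disagrees with the given label here
    [0.2, 0.8],
])

# Indices of examples whose given labels are most likely wrong, ranked by severity.
issue_indices = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",
)
print(issue_indices)  # e.g. [4]
```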
2. Auto Augmentations
A technique to increase the diversity of your training set by applying random (but realistic) transformations, such as image rotation.
Automatic augmentation is useful not only for increasing accuracy: it also prevents overfitting and helps models generalize better. Transformations enlarge the dataset by adding slightly modified copies of already existing images.
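As a point of reference, a hand-picked augmentation pipeline with torchvision might look like the sketch below; automatic augmentation aims to discover such a policy from the data instead of hand-tuning it.

```python
from torchvision import transforms

# A hand-picked augmentation pipeline; each input image yields a randomly
# transformed (but still realistic) variant every epoch.
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])
```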
2.1 API
First, I came up with a high-level API to automatically augment images:
`ds.autoaugment(task)`

`ds.autoaugment()` takes a task and a set of optional parameters and returns a set of optimal augmentation policies.
Args

- `task` - The name of the deep learning task. Supported values are `classification` and `semantic_segmentation`.
- `num_classes` - An optional parameter specifying the number of distinct classes in the classification or segmentation dataset. If not given, the number of classes is inferred automatically.
- `model` - An optional parameter to provide a custom model. By default uses [pytorch-image-models](https://github.com/rwightman/pytorch-image-models) for classification and [segmentation_models.pytorch](https://github.com/qubvel/segmentation_models.pytorch) for semantic segmentation.
- `preprocess` - An optional parameter to provide preprocessing transforms. If images have different sizes or formats, you can define preprocessing transforms (such as resizing, cropping, and normalization).
Returns

- `transform` - A wrapper function that contains the discovered policies for the augmentation pipeline. This function can be applied to the complete dataset when loading it or during training.
`ds.autoaugment()` produces a transform pipeline (a configuration for an augmentation pipeline). We can augment the dataset as follows:
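A usage sketch of the proposed API (the dataset path and the `ds.pytorch(transform=...)` wiring are illustrative assumptions, not a finalized interface):

```python
import hub
from torchvision import transforms

# Illustrative dataset; any Hub image classification dataset would work.
ds = hub.load("hub://activeloop/cifar10-train")

# Optional preprocessing in case images come in different sizes or formats.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

# Search for augmentation policies tailored to this dataset (proposed API).
transform = ds.autoaugment(task="classification", preprocess=preprocess)

# Apply the discovered policies while iterating over the dataset during training
# (the exact integration with ds.pytorch is still an open design question).
dataloader = ds.pytorch(transform=transform, batch_size=32, shuffle=True)
for batch in dataloader:
    ...  # training loop
```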
2.2 Data Augmentation Approaches
2.2.1 Pre-Trained Policies
PyTorch (torchvision) provides pre-trained augmentation policies. We can take AutoAugment policies learned on one dataset, apply them to a different dataset, and compare the result to a baseline with no or only a few basic transformations.
- DADA provides data augmentation policies found for the CIFAR-10, CIFAR-100, SVHN, and ImageNet datasets.
- Pros
    - No time spent on finding policies, training and validation
    - No input parameters needed
- Cons
    - The policies are not tailored for the dataset at hand
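For example, torchvision ships AutoAugment policies learned on ImageNet, CIFAR-10, and SVHN. A minimal sketch of reusing the ImageNet policy next to a basic baseline:

```python
from torchvision import transforms
from torchvision.transforms import AutoAugment, AutoAugmentPolicy

# Baseline: only a couple of basic transformations.
baseline_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# Pre-trained policy: reuse the AutoAugment policy learned on ImageNet
# (CIFAR10 and SVHN policies are also available) on our own dataset.
autoaugment_transform = transforms.Compose([
    AutoAugment(policy=AutoAugmentPolicy.IMAGENET),
    transforms.ToTensor(),
])
```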
2.2.2 Faster AA / DADA Implementation
Next, I researched various automatic approaches to find data augmentation policies from data.
- There are many data augmentation tools that have been developed in recent years.
- Faster AA / DADA are among the newest tools (libraries) and provide a good accuracy/time trade-off.
- The libraries implement only basic classification tasks with common datasets (e.g. CIFAR, SVHN).
- Object detection and image segmentation are not supported.
- The libraries are research oriented.
The table below shows the training time on ImageNet for DADA, Faster AA, and Deep AA. DADA is roughly twice as fast as Faster AA.
| Number of GPUs | DADA | Faster AA | Deep AA |
| --- | --- | --- | --- |
| 1 GPU | 1.3 | 2.3 | 96 |
| 2 GPUs | 0.6 | 1.1 | 48 |
| 4 GPUs | 0.3 | 0.5 | 24 |
| 8 GPUs | 0.1 | 0.2 | 12 |
While Deep AA achieves higher accuracy than DADA and Faster AA on ImageNet with ResNet-50, it is considerably slower.
| Dataset | DADA | Faster AA | Deep AA |
| --- | --- | --- | --- |
| ImageNet (ResNet-50) | 77.5 | 76.5 | 78.30 ± 0.14 |
| ImageNet (ResNet-200) | - | - | 81.32 ± 0.17 |
| CIFAR-10 (Wide-ResNet-28-10) | 97.3 | 97.4 | 97.56 ± 0.14 |
| CIFAR-100 (Wide-ResNet-28-10) | 82.5 | 82.7 | 84.02 ± 0.18 |
2.2.3 Albumentations
- Albumentations supports different computer vision tasks such as classification, semantic segmentation, instance segmentation, object detection, and pose estimation.
- For most image operations, Albumentations is faster than all alternatives.
- AutoAlbument is an AutoML tool that learns image augmentation policies from data.
- Under the hood, it uses the Faster AutoAugment algorithm.
- AutoAlbument supports image classification and semantic segmentation tasks.
- We can use Albumentations to apply the policies discovered by AutoAlbument.
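A minimal sketch of loading an AutoAlbument-discovered policy with Albumentations (the path to the policy file is illustrative):

```python
import albumentations as A
import numpy as np

# Load the augmentation pipeline discovered by AutoAlbument;
# the JSON file is written to AutoAlbument's output directory after the search.
transform = A.load("outputs/policy/latest.json")

# Apply the pipeline to an image (NumPy array in HWC format).
image = np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)  # placeholder image
augmented_image = transform(image=image)["image"]
```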
Besides coding, this week I also took on another task to get to know the Hub community better: I coordinated a team of open-source contributors and assigned some of the tasks to them.