This week, alongside the community bonding period, I did a deep dive into the codebase of the project. The project also has a strong research component: its goal is to offer users a set of automatic tools that they can use to improve the overall quality of their datasets. Therefore, I focused on researching various data-centric tools (e.g. auto-augmentation, fixing labels, slice discovery) and their trade-offs. Below, I describe a few of the data-centric tools that I discovered and experimented with.
1. Fix Dataset
These tools focus on identifying errors in datasets. They include traditional constraint-based data cleaning methods, as well as methods that use machine learning to detect and resolve data errors.
The labels in datasets from real-world applications can be of far lower quality than expected. Recent studies have discovered that even ML benchmark datasets are full of label errors. The goal of this step is to use one of the open-source tools, such as cleanlab, that automatically finds and fixes label errors in any ML dataset.
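As an illustration, cleanlab can flag likely label errors given out-of-sample predicted probabilities from any classifier. A minimal sketch (the toy data below is made up purely for illustration):

```python
import numpy as np
from cleanlab.filter import find_label_issues

# Toy example: 6 samples, 2 classes. In practice, pred_probs should be
# out-of-sample predicted probabilities from your own model (e.g. via cross-validation).
labels = np.array([0, 0, 0, 1, 1, 1])
pred_probs = np.array([
    [0.9, 0.1],
    [0.8, 0.2],
    [0.7, 0.3],
    [0.1, 0.9],
    [0.85, 0.15],  # the model strongly disagrees with the given label here
    [0.2, 0.8],
])

# Indices of examples whose given labels are most likely wrong, ranked by severity.
issue_indices = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",
)
print(issue_indices)  # e.g. [4]
```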
2. Auto Augmentations
A technique to increase the diversity of your training set by applying random (but realistic) transformations, such as image rotation.
Automatic augmentation is useful not only for increasing accuracy: it also prevents overfitting and helps models generalize better. Transformations enlarge the dataset by adding slightly modified copies of already existing images.
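As a point of reference, a hand-picked augmentation pipeline with torchvision might look like the sketch below; automatic augmentation aims to discover such a policy from the data instead of hand-tuning it.

```python
from torchvision import transforms

# A hand-picked augmentation pipeline; each input image yields a randomly
# transformed (but still realistic) variant every epoch.
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])
```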
2.1 API
First, I came up with a high-level API to automatically augment images:
`ds.autoaugment(task)`

`ds.autoaugment()` takes a task and a set of optional parameters and returns a set of optimal augmentation policies.
Args

- `task` - The name of the deep learning task. Supported values are `classification` and `semantic_segmentation`.
- `num_classes` - An optional parameter specifying the number of distinct classes in the classification or segmentation dataset. If not given, the number of classes is inferred automatically.
- `model` - An optional parameter to provide a custom model. By default uses [pytorch-image-models](https://github.com/rwightman/pytorch-image-models) for classification and [segmentation_models.pytorch](https://github.com/qubvel/segmentation_models.pytorch) for semantic segmentation.
- `preprocess` - An optional parameter to provide preprocessing transforms. If images have different sizes or formats, you can define preprocessing transforms (such as resizing, cropping, and normalization).
Returns

- `transform` - A wrapper function that contains the discovered policies for the augmentation pipeline. This function can be applied to the complete dataset when loading it or during training.
`ds.autoaugment()` produces a transform pipeline (a configuration for an augmentation pipeline). We can augment the dataset as follows:
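A usage sketch of the proposed API (the dataset path and the `ds.pytorch(transform=...)` wiring are illustrative assumptions, not a finalized interface):

```python
import hub
from torchvision import transforms

# Illustrative dataset; any Hub image classification dataset would work.
ds = hub.load("hub://activeloop/cifar10-train")

# Optional preprocessing in case images come in different sizes or formats.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

# Search for augmentation policies tailored to this dataset (proposed API).
transform = ds.autoaugment(task="classification", preprocess=preprocess)

# Apply the discovered policies while iterating over the dataset during training
# (the exact integration with ds.pytorch is still an open design question).
dataloader = ds.pytorch(transform=transform, batch_size=32, shuffle=True)
for batch in dataloader:
    ...  # training loop
```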
2.2 Data Augmentation Approaches
2.2.1 Pre-Trained Policies
PyTorch (torchvision) provides pre-trained augmentation policies. We can take AutoAugment policies learned on one dataset, apply them to a different dataset, and compare the result to a baseline with no or only a few basic transformations.
- DADA provides data augmentation policies found for the CIFAR-10, CIFAR-100, SVHN, and ImageNet datasets.
- Pros
    - No time spent on finding policies, training and validation
    - No input parameters needed
- Cons
    - The policies are not tailored for the dataset at hand
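For example, torchvision ships AutoAugment policies learned on ImageNet, CIFAR-10, and SVHN. A minimal sketch of reusing the ImageNet policy next to a basic baseline:

```python
from torchvision import transforms
from torchvision.transforms import AutoAugment, AutoAugmentPolicy

# Baseline: only a couple of basic transformations.
baseline_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# Pre-trained policy: reuse the AutoAugment policy learned on ImageNet
# (CIFAR10 and SVHN policies are also available) on our own dataset.
autoaugment_transform = transforms.Compose([
    AutoAugment(policy=AutoAugmentPolicy.IMAGENET),
    transforms.ToTensor(),
])
```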
2.2.2 Faster AA / DADA Implementation
Next, I researched various automatic approaches to find data augmentation policies from data.
- There are many data augmentation tools that have been developed in recent years.
- Faster AA / DADA are among the newest tools (libraries) and provide a good accuracy/time trade-off.
- The libraries implement only basic classification tasks with common datasets (e.g. CIFAR, SVHN).
- Object detection and image segmentation are not supported.
- The libraries are research oriented.
The table below shows the training time on ImageNet for DADA, Faster AA, and Deep AA. DADA is roughly twice as fast as Faster AA.
| Number of GPUs | DADA | Faster AA | Deep AA |
| --- | --- | --- | --- |
| 1 GPU | 1.3 | 2.3 | 96 |
| 2 GPUs | 0.6 | 1.1 | 48 |
| 4 GPUs | 0.3 | 0.5 | 24 |
| 8 GPUs | 0.1 | 0.2 | 12 |
While Deep AA achieves higher accuracy than DADA and Faster AA on ImageNet with ResNet-50, it is considerably slower.
| Dataset | DADA | Faster AA | Deep AA |
| --- | --- | --- | --- |
| ImageNet (ResNet-50) | 77.5 | 76.5 | 78.30 ± 0.14 |
| ImageNet (ResNet-200) | - | - | 81.32 ± 0.17 |
| CIFAR-10 (Wide-ResNet-28-10) | 97.3 | 97.4 | 97.56 ± 0.14 |
| CIFAR-100 (Wide-ResNet-28-10) | 82.5 | 82.7 | 84.02 ± 0.18 |
2.2.3 Albumentations
- Albumentations supports different computer vision tasks such as classification, semantic segmentation, instance segmentation, object detection, and pose estimation.
- For most image operations, Albumentations is faster than all alternatives.
- AutoAlbument is an AutoML tool that learns image augmentation policies from data.
- Under the hood, it uses the Faster AutoAugment algorithm.
- AutoAlbument supports image classification and semantic segmentation tasks.
- We can use Albumentations to apply the policies discovered by AutoAlbument.
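A minimal sketch of loading an AutoAlbument-discovered policy with Albumentations (the path to the policy file is illustrative):

```python
import albumentations as A
import numpy as np

# Load the augmentation pipeline discovered by AutoAlbument;
# the JSON file is written to AutoAlbument's output directory after the search.
transform = A.load("outputs/policy/latest.json")

# Apply the pipeline to an image (NumPy array in HWC format).
image = np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)  # placeholder image
augmented_image = transform(image=image)["image"]
```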
Besides coding, this week I also took on another task to get to know the Hub community better: I coordinated a team of open-source contributors and assigned some of the tasks to them.