Articles on lowlypalace's Bloghttps://blogs.python-gsoc.orgUpdates on different articles published on lowlypalace's BlogenSun, 11 Sep 2022 10:15:45 +0000GSoC Blog | Activeloop | Week 12https://blogs.python-gsoc.org/en/lowlypalaces-blog/gsoc-blog-activeloop-week-12/<h1>What's Done</h1> <ul> <li>Created a tutorial showing how to find label errors in Hub datasets: <a href="https://colab.research.google.com/drive/1ufji2akWX0r6DcUD70vK3KiBvq0m6xbq#scrollTo=3zK9b4yiMRzB&amp;uniqifier=1">Finding Label Issues in Image Classification Datasets</a></li> <li>Completed a blog post, How Noisy Labels Impact ML Models. The post touches on why labeling errors happen, why it is important to address them, and what tools and techniques can be used to overcome them. At the end, it shows how to use cleanlab to easily find noise in Hub datasets.</li> </ul> <h1>Next Steps</h1> <ul> <li>Finalize the PR and address reviewers’ feedback.</li> <li>Try to create and run unit tests.</li> <li>Check if the <a href="https://docs.activeloop.ai/hub-tutorials/training-models/training-an-object-detection-and-segmentation-model-in-pytorch">custom transform function</a> works with the workflow.</li> <li>Settle on the final names of functions, such as <code>find_mislabels</code>, <code>fix_issues</code>, <code>find_issues</code>, <code>add_issues_tensors</code>.</li> </ul> <p>I’ll do the following if I have extra time:</p> <ul> <li>Add a <code>valid_transform</code> parameter.</li> <li>Make it possible to select specific tensors from the validation set.</li> <li>Add a message noting that the branch was checked out after adding tensors.</li> <li>Add a dataset health printout.</li> <li>Try to pass <code>x</code> and <code>y</code> instead.</li> </ul>danielgareev@gmail.com (lowlypalace)Sun, 11 Sep 2022 10:15:45 +0000https://blogs.python-gsoc.org/en/lowlypalaces-blog/gsoc-blog-activeloop-week-12/GSoC Blog | Activeloop | Week 11https://blogs.python-gsoc.org/en/lowlypalaces-blog/gsoc-blog-activeloop-week-11/<h1>What’s Done ✅</h1> <p>→ Updated API</p> <pre><code>import hub
from torchvision import transforms

from hub.integrations.cleanlab import find_label_issues, create_label_issues_tensors, clean_view
from hub.integrations import skorch

ds = hub.load("hub://ds")

tform = transforms.Compose(
    [
        transforms.ToTensor(),
        transforms.Normalize((0.5,), (0.5,)),
    ]
)
transform = {"images": tform, "labels": None}

# Get a scikit-learn compatible PyTorch module to pass into find_label_issues as a classifier
model = skorch(
    dataset=ds,
    epochs=5,
    batch_size=16,
    transform=transform,
    tensors=[],
    valid_transform=None,
    skorch_kwargs=None,
)

# Obtain a DataFrame with columns is_label_issue, label_quality and predicted_label
label_issues = find_label_issues(
    dataset=ds,
    model=model,
    folds=3,
)

# Create the label_issues tensor group on the "labels" branch
create_label_issues_tensors(
    dataset=ds,
    label_issues=label_issues,
    branch="labels",
)

# Get a dataset view where only clean labels are present and the rest are filtered out
ds_clean = clean_view(ds)
</code></pre> <p>→ Link to PR: <a href="https://github.com/activeloopai/Hub/pull/1821">https://github.com/activeloopai/Hub/pull/1821</a></p> <h2>Skorch Integration</h2> <h3><code>skorch()</code></h3> <ul> <li> <p>Added support for providing a validation set for training: <code>skorch(dataset=ds, valid_dataset=valid_ds)</code></p> </li> <li> <p>Added keyword arguments that can be passed into modules to fine-tune the parameters for advanced users.</p> <ul> <li><code>skorch_kwargs</code> arguments to be passed to the skorch <code>NeuralNet</code> constructor.
Additionally, <code>iterator_train__transform</code> and <code>iterator_valid__transform</code> can be used to set params for the training and validation iterators.</li> </ul> </li> <li> <p>Made passing in the images and labels tensors more explicit.</p> </li> <li> <p>Modularized methods.</p> <ul> <li>Separated <code>skorch</code> module from <code>cleanlab</code> to make it easier to instantiate skorch even if you’re not using <code>cleanlab</code> in the downstream.</li> <li>Further modularized <code>skorch</code> module into separate functions and modules.</li> <li>Added utils functions in a separate file.</li> </ul> </li> <li> <p>Added error-checking utils to check errors early.</p> <p>→ Check if a <code>dataset</code> and <code>valid_dataset</code> that’s passed in is a Hub Dataset.</p> <p>→ Check if the tensors’ <code>htypes</code> are supported for image classification tasks.</p> </li> </ul> <h2>Cleanlab Integration</h2> <h3><code>clean_labels()</code></h3> <ul> <li>Implemented a function to compute  <code>guessed_label</code>  by the classifier after pruning.</li> <li>Added flag <code>pretrained</code> to skip cross-validation if <strong>pretrained</strong> model is used to compute out-of-sample probabilities faster on a single <code>fit()</code>.</li> <li>Instead of returning a tuple of numpy ndarrays <code>label_issues</code>, <code>label_quality_scores</code> and <code>predicted_labels</code>, now <code>clean_labels()</code> returns a single <code>label_issues</code> dataframe with columns <code>is_label_issue</code>, <code>label_quality</code>, <code>predicted_label</code>.</li> <li>Added keyword arguments that can be passed into modules to fine-tune the parameters for advanced users. <ul> <li><code>label_issues_kwargs</code> can be be passed to the <code>cleanlab.filter.find_label_issues</code> function.</li> <li><code>label_quality_kwargs</code> can be passed to the <code>cleanlab.rank.get_label_quality_scores</code> function.</li> </ul> </li> </ul> <h3><code>create_tensors()</code></h3> <ul> <li>Added the ability to select <code>branch</code> to commit to when creating tensors</li> <li>Modularized methods. <ul> <li><code>create_tensors()</code> is now a separate method that takes in <code>label_issues</code> dataframe or looks for <code>label_issues</code> in tensors to get a view where only clean labels are present and the rest are filtered out. This will now return <code>commit_id</code>.</li> <li>Added utils functions in a separate file.</li> </ul> </li> <li>Added error-checking utils to check errors early. <ul> <li>Check early if a user has write access to the dataset before creating the tensors.</li> <li>Check if <code>label_issues</code> dataframe columns have correct <code>dtypes</code> and are a subset of a dataset before appending them to tensors.</li> </ul> </li> </ul> <h3><code>clean_view()</code></h3> <ul> <li>Added a method <code>clean_view(ds)</code> to get a dataset view where only clean labels are present, and the rest are filtered out. 
This can be useful to pass the clean dataset to downstream ML frameworks for training.</li> </ul> <h2>Other</h2> <ul> <li>Created custom config for dependencies <code>hub[’cleanlab’]</code>.</li> <li>Created common utils that are reused across modules.</li> <li>Renamed some of the function and variable names to be more clear.</li> <li>Clarified the docstrings parameters and improved readability.</li> <li>Merged <code>main</code> branch and resolved conflicts.</li> <li>Commented on the aspects I’m not sure about in my PR.</li> <li>Run tests on 10+ Activeloop image classification datasets (without creating tensors).</li> </ul> <h1>Next Steps</h1> <ul> <li>Finalize PR after the getting reviewers’ feedback.</li> <li>Try to create and run unit tests.</li> <li>Create a notebook that showcases the workflow (such as <a href="https://docs.activeloop.ai/playbooks/evaluating-model-performance">https://docs.activeloop.ai/playbooks/evaluating-model-performance</a>)</li> <li>Create a blog post with a bit more insight into the problem statement and results of the running workflow on various datasets with varying noise levels (such as <a href="https://www.activeloop.ai/resources/">https://www.activeloop.ai/resources/</a>)</li> </ul>danielgareev@gmail.com (lowlypalace)Sun, 11 Sep 2022 10:11:59 +0000https://blogs.python-gsoc.org/en/lowlypalaces-blog/gsoc-blog-activeloop-week-11/GSoC Blog | Activeloop | Week 10https://blogs.python-gsoc.org/en/lowlypalaces-blog/gsoc-blog-activeloop-week-10/<h1>What’s Done <strong><strong>✅</strong></strong></h1> <p>→ Updated API</p> <pre><code>from hub.integrations.cleanlab import clean_labels training_params = {'module' = resnet18(), 'criterion' = CrossEntropyLoss, 'optimizer' = SGD, 'epochs' = 10, 'optimizer_lr' = 0.01, 'device' = "cpu", 'folds = 5'} clean_labels( ds, training_params = training_params, verbose = True, tensors = ['images', 'labels'], overwrite = False, num_workers = 1, batch_size = 1, shuffle = True, transform = {}, create_tensors = True ... ) </code></pre> <p>→ Added <code>create_tensors</code> flag.</p> <ul> <li><code>create_tensors</code> boolean flag would be useful here to confirm if a user wants to append new <code>label_issues</code> tensor. If the flag <code>create_tensors</code> is <code>False</code>, then <code>is_label_issues</code>, <code>label_quality_scores</code> numpy arrays are returned. If <code>True</code>, tensors <code>is_label_issues</code> and <code>label_quality_scores</code> are created and also returned as numpy arrays.</li> </ul> <p>→ Added support to provide validation set for training</p> <ul> <li><code>clean_labels(*ds_train, ds_valid)*</code></li> <li>No support yet to compute label errors for validation set</li> </ul> <p>→ Made providing tensors names more explicit</p> <p>→ Fixed some errors related to checking if an image tensor is RGB or Grayscale</p> <p>→ Minor improvements (e.g. matching <code>device</code> in the core function rather than making it a required parameter)</p> <h1>What’s Next</h1> <h2>Coding</h2> <p>→ Prune API</p> <ul> <li><code>prune_labels(ds)</code></li> <li>Instead of deleting samples, enable users to create an instance of the dataset that would only fetch correct samples when filling up batches? 
<ul> <li>It could be easily possible for users to <code>ds = ds[clean_idx]</code> and then use a clean dataset for the downstream.</li> </ul> </li> <li>Leave out pruning to the users and code it up in the blog post instead?</li> <li>Create a new branch</li> </ul> <p>→ Create a tensor <code>guessed_label</code> to add labels guessed by the classifier after pruning.</p> <ul> <li>Relabeling workflow on Activeloop?</li> </ul> <p>→ Create custom config for <code>pip install</code> (e.g. <code>pip install hub[’cleanlab’]</code>)</p> <p>→ Add flag <code>branch</code> to move to a different branch instead of making a commit on a current branch.</p> <p>→ Add flags <code>add_branch = True</code></p> <p><s>→ Add support for bounding boxes, <code>task = 'classification'</code> or <code>task = 'segmentation'</code></s></p> <p>→ Raise error if not htype <code>image</code></p> <p><s>→ Add support for <code>TensorFlow</code> modules</s></p> <p>→ Add optional <code>cleanlab</code> kwargs to pass down</p> <p>→ Add optional <code>skorch</code> kwargs to pass down</p> <p>→ Tests</p> <ul> <li>Unit tests</li> <li>Tests with Activeloop datasets</li> </ul> <p>→ Make it possible to <code>skorch(ds)</code></p> <p>→ Raise error if I don’t have write access</p>danielgareev@gmail.com (lowlypalace)Sun, 11 Sep 2022 10:11:13 +0000https://blogs.python-gsoc.org/en/lowlypalaces-blog/gsoc-blog-activeloop-week-10/GSoC Blog | Activeloop | Week 9https://blogs.python-gsoc.org/en/lowlypalaces-blog/gsoc-blog-activeloop-week-9/<h1>What’s Done <strong><strong>✅</strong></strong></h1> <h3>API</h3> <ul> <li>Created an API entry point for cleaning labels in <code>dataset.py</code>. <ul> <li> <p>Cleans the labels of the dataset and creates a set of tensors under <code>label_issues</code> group for the entire dataset.</p> </li> <li> <p>API</p> <pre><code>def clean_labels( self, module = None, criterion = None, optimizer = None, optimizer_lr: int = 0.01, device: str = "cpu", epochs: int = 10, folds: int = 5, verbose: bool = True, tensors: Optional[list] = None, dataloader_train_params: [dict] = None, dataloader_valid_params: Optional[dict] = None, overwrite: bool = False # skorch_kwargs: Optional[dict] = None, ): """ Cleans the labels of the dataset. Computes out-of-sample predictions and uses Confident Learning (CL) algorithm to clean the labels. Creates a set of tensors under label_issues group for the entire dataset. Note: Currently, only image classification task us supported. Therefore, the method accepts two tensors for the images and labels (e.g. ['images', 'labels']). The tensors can be specified in dataloader_train_params or tensors. Any PyTorch module can be used as a classifier. Args: module (class): A PyTorch torch.nn.Module module (class or instance). In general, the uninstantiated class should be passed, although instantiated modules will also work. Default is torchvision.models.resnet18(), which is a PyTorch ResNet-18 model. criterion (class): A PyTorch criterion. The uninitialized criterion (loss) used to optimize the module. Default is torch.nn.CrossEntropyLoss. optimizer (class): A PyTorch optimizer. The uninitialized optimizer (update rule) used to optimize the module. Default is torch.optim.SGD. optimizer_lr (int): The learning rate passed to the optimizer. Default is 0.01. device (str): A PyTorch device. The device on which the module and criterion are located. Default is "cpu". epochs (int): The number of epochs to train for each fit() call. Default is 10. 
tensors (list): A list of tensor names that would be considered for cleaning (e.g. ['images', 'labels']). dataloader_train_params (dict): Keyword arguments to pass into torch.utils.data.DataLoader. Options that may especially impact accuracy include: shuffle, batch_size. dataloader_valid_params (dict): Keyword arguments to pass into torch.utils.data.DataLoader. Options that may especially impact accuracy include: shuffle, batch_size. If not provided, dataloader_train_params will be used with shuffle=False. overwrite (bool): If True, will overwrite label_issues tensors if they already exists. Default is False. fold (int): Sets the number of cross-validation folds used to compute out-of-sample probabilities for each example in the dataset. The default is 5. skorch_kwargs (dict): Keyword arguments to pass into skorch.NeuralNet. Options that may especially impact accuracy include: ... Returns: label_issues: A boolean mask for the entire dataset where True represents a label issue and False represents an example that is confidently/accurately labeled. label_quality_scores: Returns label quality scores for each datapoint, where lower scores indicate labels less likely to be correct. """ </code></pre> </li> </ul> </li> </ul> <h3>Skorch Integration</h3> <ul> <li>Made <code>skorch</code> compatitable with Hub dataset format. <ul> <li>Added the integration <code>skorch.py</code> in <code>hub/integrations/pytorch</code>.</li> <li>Created a class <code>VisionClassifierNet</code> that wraps the PyTorch Module in an sklearn interface.</li> <li>Make skorch compatitable with Hub’s PyTorch Dataloader.</li> <li>Set the defaults for relevant <code>skorch</code> parameters such as <code>module</code>, <code>criterion</code>, <code>optimizer</code>.</li> </ul> </li> </ul> <h3>Core Functions for Cleaning Labels</h3> <ul> <li>Created the component <code>clean_labels.py</code> in <code>hub/core/experimental/labels</code> . <ul> <li>Implemented core function <code>clean_labels()</code> which cleans the labels of a dataset. <ul> <li>Wraps a PyTorch instance in a sklearn classifier. Next, it runs cross-validation to get out-of-sample predicted probabilities for each example. Then, it finds label issues (boolean mask) and label quality scores (floats from 0 to 1) for each sample in the dataset. At the end, it creates tensors with label issues.</li> </ul> </li> <li>Implemented helper functions. <ul> <li><code>get_dataset_tensors()</code> returns the tensors of a dataset. If a list of tensors is not provided, it will try to find them in the <code>dataloader_train_params</code> in the transform. If none of these are provided, it will iterate over the dataset tensors and return any tensors that match <em>htype</em> <code>'image'</code> for images and <em>htype</em> <code>'class_label'</code> for labels. Additionally, this function will also check if the dataset already has a <code>label_issues</code> group.</li> <li><code>estimate_cv_predicted_probabilities()</code> computes an out-of-sample predicted probability for every example in a dataset using cross validation.</li> <li><code>append_label_issues_tensors()</code> creates a group of tensors <code>label_issues</code>. 
After creating tensors, automatically commits the changes.</li> </ul> </li> </ul> </li> </ul>danielgareev@gmail.com (lowlypalace)Sun, 11 Sep 2022 10:09:52 +0000https://blogs.python-gsoc.org/en/lowlypalaces-blog/gsoc-blog-activeloop-week-9/GSoC Blog | Activeloop | Week 8https://blogs.python-gsoc.org/en/lowlypalaces-blog/gsoc-blog-activeloop-week-8/<h2>What did you do this week?</h2> <p>This week, after deriving conclusions from my previous experiments, it was the time to take all of the insights as well as the code and try to make cleanlab work with Hub datasets. After working on the integration for a few weeks, I created my <a href="https://github.com/activeloopai/Hub/pull/1821">draft PR</a>.</p> <h2>What is coming up next?</h2> <p>As a next step, I will be finalizing the API structure, as well as adding some additional functionality to the feature.</p> <h2>Did you get stuck anywhere?</h2> <p>Not really.</p>danielgareev@gmail.com (lowlypalace)Sun, 11 Sep 2022 10:08:41 +0000https://blogs.python-gsoc.org/en/lowlypalaces-blog/gsoc-blog-activeloop-week-8/GSoC Blog | Activeloop | Week 7https://blogs.python-gsoc.org/en/lowlypalaces-blog/gsoc-blog-activeloop-week-7-1/<p>This week, I had a quick sync with the mentors as it seemed like I couldn't derive systematic results from running my previous experiments.</p> <p>I liked the idea to introduce some noise to some dataset that has a low rate (e.g. less than 1-3% of misclassified labels to compare baseline with cleanlab. I used Fashion MNIST and introduced some random corruption to the training set by flipping the labels.</p> <p>In one of my experiments, I set the maximum noise to 50% and gradually introduced 5% the noise at each step, comparing the performance of baseline and cleanlab in parallel. This time, instead of relabelling, I decided to prune the samples with low confidence scores. Here’s a quick example: if I have 60,000 samples in the dataset, at 10% of the noise, I’d randomly flip 6,000 labels. The baseline would then be trained with all samples (60,000), but the cleanlab would be trained only on the labels that weren’t classified as erroneous by cleanlab . For example, If cleanlab found that 5,000 are labeled incorrectly, then I would only use 55,000 images for training.</p> <p><img alt="" height="600" src="https://i.postimg.cc/wx08vThM/newplot-20.png" width="1000"></p> <p>It seems that pruning the samples with lower confidence works well, as cleanlab seems to remove the labels that were introduced with each noise level. We can also see that the accuracy stays around 80% with cleanlab, while with the random noise (e.g. without removing any samples) it drops linearly. On average, I can also see that cleanlab  on average prunes more labels than I initially introduced in the data. The mean of additional samples that cleanlab discards is ≈4500 samples across all noise levels. Since I don’t know the true noise in the original dataset, it’s hard to say whether cleanlab is doing a good job on removing these, but I would argue that cleanlab seems to be overestimating. However, it seems to systematically pick up the newly introduced noisy labels and identify them as erroneous.</p> <p><img alt="" height="600" src="https://i.postimg.cc/NMvY1JM0/newplot-21.png" width="1000"></p> <p>I was surprised that after introducing random noise, CL would still prune the erroneous labels. The way CL algorithm works is by accurately and directly characterizing the uncertainty of label noise in the data. 
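</p> <p>As a concrete illustration of the flip-and-prune setup described above, here is a minimal sketch of the label-flipping step (illustrative only; <code>flip_labels</code> and <code>train_labels</code> are hypothetical names, not the exact code used in these experiments):</p> <pre><code>import numpy as np

def flip_labels(labels, noise_level, num_classes, seed=0):
    """Randomly reassign a fraction of labels to a different, uniformly chosen class."""
    rng = np.random.default_rng(seed)
    labels = labels.copy()
    n_flip = int(noise_level * len(labels))
    flip_idx = rng.choice(len(labels), size=n_flip, replace=False)
    for i in flip_idx:
        other_classes = [c for c in range(num_classes) if c != labels[i]]
        labels[i] = rng.choice(other_classes)
    return labels, flip_idx

# e.g. 10% noise on the 60,000 Fashion MNIST training labels flips 6,000 of them
# noisy_labels, flipped_idx = flip_labels(train_labels, noise_level=0.10, num_classes=10)
</code></pre> <p>The baseline is then trained on all samples with the noisy labels, while the cleanlab run drops the samples flagged as label issues before training.</p> <p>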
The foundation CL depends on is that label noise is class-conditional, depending only on the latent true class, not the data . For instance, a leopard is likely to be mistakenly labeled as a jaguar . Cleanlab takes that assumption and computes joint distribution among different classes (e.g. 3% of the data is labeled leopard (noisy label), but the true label is jaguar). The main idea is that underlying data has implications for the labeler’s decisions, and I basically took this assumption out of equation after randomly swapping labels in the dataset (e.g. I didn’t care if a certain class would be more likely to be mislabelled as another class). I would think that CL algorithm relies on this assumption heavily to guess which label to prune, but the performance was still accurate and stable. In the real-world noisy data, the mislabelling between different classes would have a stronger statistical dependence, so I can say that this example was even a bit more difficult for cleanlab.</p> <p>I’ve tried to experiment with different threshold values for pruning and relabelling the images (e.g. remove <b>20%</b> of the images with lowest label quality, but leave and relabel the rest). I’ve started with a threshold of <b>0%</b> (e.g. relabel all labels to the ones predicted by <code>cleanlab</code> ) and then gradually increased the threshold value with a <b>10%</b> step till I reached <b>100%</b> prune level (e.g. remove all labels that were found to be erroneous by <code>cleanlab</code>). As before, I run these from <b>0</b> to <b>50% </b>of the noise level.On the graph, I plotted the accuracy of the models trained with training sets that were fixed with different threshold values. For example, <code>100% Prune / 0% Relabel</code> indicates the accuracy of the model when all erroneous samples and their labels were deleted, while <code>0% Prune / 100% Relabel</code> shows the accuracy of the model when all of the samples were left but relabelled.Looking at the graph, I can say that <code>cleanlab</code> definitely does a great job at <i>identifying</i> labels, but not necessarily at <i>fixing them automatically</i>. As soon as I increase the % of labels that I’d like to relabel, the accuracy starts to go down in linear way. The training set with <b>100%</b> of pruning got the highest accuracy, while the training set with all labels relabelled got the worst accuracy on the fixed model.As a next step, I can try to see what happens if we only remove a certain % of erroneous samples, but leave the labels of the other erroneous samples as they are. (edited) </p> <p><img alt="" src="https://i.postimg.cc/Sx9vWyzW/newplot-24.png"></p> <p>I also run this pipeline on the <a href="https://https-deeplearning-ai.github.io/data-centric-comp/">Roman Dataset (DCAI)</a>. This dataset is quite noisy, so there’s not a ton of improvement on 0-10% of the noise level. However, as I introduce more noisy labels, looks like <code>cleanlab</code> is still able to pick them up. Running a few more trials to see what’s the performance with different Prune vs Relabel threshold. 
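</p> <p>For reference, a rough sketch of how an "X% Prune / Y% Relabel" split can be implemented on top of cleanlab is shown below. This is not the exact experiment code: <code>noisy_labels</code> and <code>pred_probs</code> are assumed to be the label array and the out-of-sample predicted probabilities from the cross-validated model.</p> <pre><code>import numpy as np
from cleanlab.filter import find_label_issues
from cleanlab.rank import get_label_quality_scores

issue_mask = find_label_issues(labels=noisy_labels, pred_probs=pred_probs)
quality = get_label_quality_scores(labels=noisy_labels, pred_probs=pred_probs)

def prune_and_relabel(labels, issue_mask, quality, pred_probs, prune_frac):
    """Drop the lowest-quality prune_frac of flagged samples; relabel the remaining flagged ones."""
    labels = labels.copy()
    flagged = np.where(issue_mask)[0]
    flagged = flagged[np.argsort(quality[flagged])]  # lowest label quality first
    n_prune = int(prune_frac * len(flagged))
    pruned, to_relabel = flagged[:n_prune], flagged[n_prune:]
    labels[to_relabel] = pred_probs[to_relabel].argmax(axis=1)  # cleanlab's guessed labels
    keep_idx = np.setdiff1d(np.arange(len(labels)), pruned)
    return keep_idx, labels

# e.g. "20% Prune / 80% Relabel"
keep_idx, fixed_labels = prune_and_relabel(noisy_labels, issue_mask, quality, pred_probs, prune_frac=0.2)
</code></pre> <p>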
</p> <p><img alt="" height="600" src="https://i.postimg.cc/L6cy4r0h/newplot-25.png" width="1000"></p> <p><img alt="" src="https://i.postimg.cc/mr8N9qVY/newplot-26.png"></p> <p><img alt="" src="https://i.postimg.cc/C1RPxSYB/newplot-27.png"></p>danielgareev@gmail.com (lowlypalace)Sun, 11 Sep 2022 10:02:06 +0000https://blogs.python-gsoc.org/en/lowlypalaces-blog/gsoc-blog-activeloop-week-7-1/GSoC Blog | Activeloop | Week 7https://blogs.python-gsoc.org/en/lowlypalaces-blog/gsoc-blog-activeloop-week-7/<p>This week, I worked on running the experiments on a variety of datasets. </p> <p>Here are a few exeriments that I did to benchmark <code>cleanlab</code> performance of three different datasets (<a href="https://https-deeplearning-ai.github.io/data-centric-comp/">MNIST Roman</a> (DCAI), <a href="https://www.notion.so/Dataset-Optimization-a658f8d56d0b4ba5b06cd7e8d8719b33">Flower 102</a>, <a href="https://www.notion.so/Dataset-Optimization-a658f8d56d0b4ba5b06cd7e8d8719b33">Fashion MNIST</a>). For all experiments, I used a fixed model and applied resizing and normalization to the original images.</p> <p>What do 1️⃣, 2️⃣, 3️⃣, 4️⃣ mean? I re-run the entire fitting with a a few random seeds to get an estimate of the variance between the accuracies of <code>baseline</code> and <code>cleanlab</code>.</p> <blockquote> <p>1️⃣  = <code>seed(0)</code></p> <p>2️⃣  = <code>seed(1)</code></p> <p>3️⃣  = <code>seed(123)</code></p> <p>4️⃣ = <code>seed(42)</code></p> </blockquote> <h2>First Training Run</h2> <table> <tbody> <tr> <td> </td> <td>Roman MNIST (DCAI)</td> <td>Flower 102</td> <td>Fashion MNIST</td> <td>KMNIST</td> <td> </td> </tr> <tr> <td>&lt;mark&gt;<strong>Baseline Accuracy</strong>&lt;/mark&gt;</td> <td> <p>1️⃣ <strong>0.7945</strong> </p> <p>2️⃣ <strong>0.8511</strong> </p> <p>3️⃣ <strong>0.7699</strong></p> </td> <td> <p>1️⃣ <strong>0.6568</strong> </p> <p>2️⃣ <strong>0.6176</strong> </p> <p>3️⃣ <strong>0.6274</strong></p> </td> <td> <p>1️⃣ <strong>0.8958</strong> </p> <p>2️⃣ <strong>0.8944</strong></p> <p> 3️⃣ <strong>0.8987</strong></p> </td> <td> </td> <td> </td> </tr> <tr> <td>&lt;mark&gt;<strong>+ Cleanlab Accuracy</strong>&lt;/mark&gt;</td> <td> <p>1️⃣ <strong>0.7933 → </strong>-0.0012 ⬇️</p> <p>2️⃣ <strong>0.8031 → </strong>-0.048 ⬇️</p> <p>3️⃣ <strong>0.7109 → </strong>-0.059 ⬇️</p> </td> <td> <p>1️⃣ <strong>0.5421 → -</strong>0.1147 ⬇️</p> <p>2️⃣ <strong>0.5441 → </strong>-0.0735 ⬇️</p> <p>3️⃣ <strong>0.5647 → </strong>-0.0627 ⬇️</p> </td> <td> <p>1️⃣ <strong>0.8992 → </strong>0.0034 ⬆️</p> <p>2️⃣ <strong>0.8951 →</strong> 0.0007 ⬆️</p> <p>3️⃣ <strong>0.8866 → </strong>-0.0121 ⬇️</p> </td> <td> </td> <td> </td> </tr> <tr> <td>Parameters</td> <td><code>batch_size = 8 model = resnet18() train_shuffle = True test_shuffle = False train_split = None epochs = 10</code></td> <td><code>batch_size = 16 model = resnet18() train_shuffle = True test_shuffle = False train_split = None epochs = 10</code></td> <td><code>batch_size = 32 model = resnet18() train_shuffle = True test_shuffle = False train_split = None epochs = 10</code></td> <td> </td> <td> </td> </tr> <tr> <td>Transform</td> <td><code>Resize((224, 224)), ToTensor(), Normalize( [0.485, 0.456, 0.406], [0.229, 0.224, 0.225] )</code></td> <td><code>Resize((224, 224)), ToTensor(), Normalize( [0.485, 0.456, 0.406], [0.229, 0.224, 0.225] )</code></td> <td><code>ToTensor(), Normalize((0.), (1.))</code></td> <td><code>ToTensor(), Normalize((0.), (1.))</code></td> <td><code>Resize((300, 300)), ToTensor(), Normalize( [0.485, 0.456, 0.406], [0.229, 
0.224, 0.225] )</code></td> </tr> <tr> <td>Network</td> <td><code>resnet = models.resnet18() resnet.fc = nn.Linear(resnet.fc.in_features, 10)</code></td> <td><code>resnet = models.resnet18() resnet.fc = nn.Linear(resnet.fc.in_features, 102)</code></td> <td><code>resnet = models.resnet18() resnet.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)</code> <code>resnet.fc = nn.Linear(resnet.fc.in_features, 10)</code></td> <td><code>resnet = models.resnet18() resnet.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)</code> <code>resnet.fc = nn.Linear(resnet.fc.in_features, 10)</code></td> <td><code>resnet = models.resnet18() resnet.fc = nn.Linear(resnet.fc.in_features, 47)</code></td> </tr> <tr> <td>Parameters</td> <td><code>batch_size = 8 model = resnet18() train_shuffle = True test_shuffle = False train_split = None</code> <code>epochs = 10</code></td> <td><code>batch_size = 16 model = resnet18() train_shuffle = True test_shuffle = False train_split = None epochs = 10</code></td> <td><code>batch_size = 32 model = resnet18() train_shuffle = True test_shuffle = False train_split = None epochs = 10</code></td> <td><code>batch_size = 64 model = resnet18() train_shuffle = True test_shuffle = False train_split = None epochs = 10</code></td> <td><code>batch_size = 32 model = resnet18() train_shuffle = True test_shuffle = False train_split = None epochs = 10</code></td> </tr> <tr> <td>Number of Classes</td> <td>10</td> <td>102</td> <td>10</td> <td>10</td> <td>47</td> </tr> <tr> <td>Images Dimension</td> <td>224x224</td> <td>224x224</td> <td>28x28</td> <td>28x28</td> <td>300x300</td> </tr> </tbody> </table> <h2>Training with 20 Epochs</h2> <p>In the results below, I used <code>epochs = 20</code> instead of <code>epochs = 10</code>. The rest of the parameters (e.g. 
Network, Transform) are unchanged.</p> <table> <tbody> <tr> <td> </td> <td>MNIST Roman</td> <td>Flower 102</td> <td>MNIST Fashion</td> <td>KMNIST</td> <td>Describable Textures Dataset</td> </tr> <tr> <td>&lt;mark&gt;<strong>Baseline Accuracy</strong>&lt;/mark&gt;</td> <td> <p>1️⃣ <strong>0.6875</strong></p> <p> 2️⃣ <strong>0.7736</strong> </p> <p>3️⃣ <strong>0.6617 </strong></p> <p><strong>4️⃣ 0.7945 </strong></p> <p>Mean = 0.7293</p> </td> <td> <p>1️⃣ <strong>0.5421 </strong></p> <p>2️⃣ <strong>0.5617</strong> </p> <p>3️⃣ <strong>0.6578</strong> </p> <p>4️⃣ <strong>0.6294</strong></p> </td> <td> <p>1️⃣ <strong>0.891</strong> </p> <p>2️⃣ <strong>0.8977</strong></p> <p>3️⃣ 0.8977</p> </td> <td> </td> <td> </td> </tr> <tr> <td>&lt;mark&gt;<strong>+ Cleanlab Accuracy</strong>&lt;/mark&gt;</td> <td> <p>1️⃣ <strong>0.7257</strong> <strong>→</strong> 0.0382 ⬆️</p> <p>2️⃣ <strong>0.8400</strong> <strong>→</strong> 0.0664 ⬆️</p> <p><strong> </strong>3️⃣ <strong>0.8511</strong> <strong>→</strong> 0.1894 ⬆️</p> <p>4️⃣ <strong>0.8757</strong> <strong>→</strong> 0.0812</p> <p>⬆️ Mean = 0.8231</p> <p>Mean Difference = <strong>0.0938</strong> ⬆️</p> </td> <td> <p>1️⃣ <strong>0.6117 → </strong>0.0696 ⬆️</p> <p>2️⃣<strong> 0.6254</strong> <strong>→ </strong>0.0833 ⬆️</p> <p>3️⃣ <strong>0.5598</strong> <strong>→ </strong>-0.098 ⬇️</p> <p>4️⃣ <strong>0.5705</strong> <strong>→ </strong>-0.0589 ⬇️</p> </td> <td> <p>1️⃣ <strong>0.8982</strong> </p> <p>3️⃣ 0.897</p> </td> <td> </td> <td> </td> </tr> <tr> <td>Parameters</td> <td><code>batch_size = 8 model = resnet18() train_shuffle = True test_shuffle = False train_split = None</code> <code>epochs = 20</code></td> <td><code>batch_size = 16 model = resnet18() train_shuffle = True test_shuffle = False train_split = None</code> <code>epochs = 20</code></td> <td><code>batch_size = 32 model = resnet18() train_shuffle = True test_shuffle = False train_split = None</code> <code>epochs = 20</code></td> <td><code>batch_size = 64 model = resnet18() train_shuffle = True test_shuffle = False train_split = None</code> <code>epochs = 20</code></td> </tr> </tbody> </table> <h2>Training with 30 Epochs</h2> <p>In the results below, I used <code>epochs = 30</code> instead of <code>epochs = 20</code>. The rest of the parameters (e.g. 
Network, Transform) are unchanged.</p> <table> <tbody> <tr> </tr> <tr> <td>MNIST Roman</td> <td>Flower 102</td> <td>MNIST Fashion</td> <td>KMNIST</td> <td>Describable Textures Dataset</td> </tr> <tr> <td>&lt;mark&gt;<strong>Baseline Accuracy</strong>&lt;/mark&gt;</td> <td> <p>0.8425</p> <p>0.8831</p> <p>0.8769</p> <p>0.8560</p> <p>Mean = 0.8646</p> </td> <td> </td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>&lt;mark&gt;<strong>+ Cleanlab Accuracy</strong>&lt;/mark&gt;</td> <td> <p>0.8646 → 0.0221</p> <p>0.8228 → -0.0602</p> <p>0.8696 → -0.0073</p> <p>0.8720 → 0.0159</p> <p>Mean = 0.8573</p> <p>Mean Difference = 0.0073</p> </td> <td> </td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>Parameters</td> <td><code>batch_size = 8 model = resnet18() train_shuffle = True test_shuffle = False train_split = None</code> <code>epochs = 30</code></td> <td><code>batch_size = 16 model = resnet18() train_shuffle = True test_shuffle = False train_split = None</code> <code>epochs = 30</code></td> <td><code>batch_size = 32 model = resnet18() train_shuffle = True test_shuffle = False train_split = None</code> <code>epochs = 30</code></td> <td><code>batch_size = 64 model = resnet18() train_shuffle = True test_shuffle = False train_split = None</code> <code>epochs = 30</code></td> </tr> </tbody> </table> <p> </p> <h2>Training with Resnet50</h2> <p>In the results below, all of the parameters stay the same as in run above, but this time I changed the network to <code>resnet50()</code> instead of <code>resnet18()</code>. Epochs are also back to <code>epochs = 20</code>.</p> <table> <tbody> <tr> <td> </td> <td>MNIST Roman</td> <td>Flower 102</td> <td>MNIST Fashion</td> <td> </td> <td>Describable Textures Dataset</td> </tr> <tr> <td>&lt;mark&gt;<strong>Baseline Accuracy</strong>&lt;/mark&gt;</td> <td> <p>1️⃣ <strong>0.7835</strong> </p> <p>2️⃣ <strong>0.7589</strong> </p> <p>3️⃣ <strong>0.8068</strong> </p> <p>4️⃣ <strong>0.8560</strong></p> </td> <td> </td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>&lt;mark&gt;<strong>+ Cleanlab Accuracy</strong>&lt;/mark&gt;</td> <td> <p>1️⃣ <strong>0.8044</strong></p> <p> 2️⃣ <strong>0.8560</strong> </p> <p>3️⃣ <strong>0.8376</strong> </p> <p>4️⃣ <strong>0.7859</strong></p> </td> <td> </td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>Parameters</td> <td><code>batch_size = 8 model = resnet50() train_shuffle = True test_shuffle = False train_split = None</code> <code>epochs = 20</code></td> <td><code>batch_size = 16 model = resnet50() train_shuffle = True test_shuffle = False train_split = None</code> <code>epochs = 20</code></td> <td><code>batch_size = 32 model = resnet50() train_shuffle = True test_shuffle = False train_split = None</code> <code>epochs = 20</code></td> <td><code>batch_size = 64 model = resnet50() train_shuffle = True test_shuffle = False train_split = None</code> <code>epochs = 20</code></td> <td><code>batch_size = 8 model = resnet50() train_shuffle = True test_shuffle = False train_split = None</code> <code>epochs = 20</code></td> </tr> </tbody> </table> <h2>Training with Validation Set</h2> <p>In the results below, all of the parameters stay the same as in run above, however, in this run I tried to run trainings with validation set. Therefore, 20% of the dataset is used for the internal training validation. In other datasets where the validation set exist, it’s used as a validation set. 
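</p> <p>For reference, the validation-split setups in the tables below correspond roughly to the following skorch configuration (a simplified sketch on plain arrays; the actual runs wrap the Hub dataloader, and <code>valid_data</code>/<code>valid_labels</code> stand in for a predefined validation split):</p> <pre><code>import torch.nn as nn
from torchvision import models
from skorch import NeuralNetClassifier
from skorch.dataset import Dataset, ValidSplit
from skorch.helper import predefined_split

resnet = models.resnet18()
resnet.fc = nn.Linear(resnet.fc.in_features, 10)

# Internal hold-out of 1/5 of the training data (20%) when no predefined validation set exists
net = NeuralNetClassifier(
    resnet,
    criterion=nn.CrossEntropyLoss,
    max_epochs=20,
    batch_size=32,
    train_split=ValidSplit(cv=5, stratified=False),
    iterator_train__shuffle=True,
)

# When a predefined validation set exists (e.g. Flower 102), use it directly:
# net.set_params(train_split=predefined_split(Dataset(valid_data, valid_labels)))
</code></pre> <p>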
I set the model back to <code>resnet18()</code> as it seems it gives better baseline accuracies over three datasets.</p> <table> <tbody> <tr> </tr> <tr> <td>MNIST Roman</td> <td>Flower 102</td> <td>MNIST Fashion</td> <td> </td> <td>Describable Textures Dataset</td> </tr> <tr> <td>&lt;mark&gt;<strong>Baseline Accuracy</strong>&lt;/mark&gt;</td> <td> <p>1️⃣ <strong>0.6432</strong> </p> <p>2️⃣ <strong>0.6445</strong> </p> <p>3️⃣ <strong>0.6777 </strong></p> <p>4️⃣ <strong>0.6383 </strong></p> <p>Mean = 0.6509</p> </td> <td> <p>0.6421</p> <p>0.6578</p> <p>0.6823</p> <p>0.6372</p> <p>Mean = 0.6549</p> </td> <td> <p>0.8974</p> <p>0.891</p> <p>0.8906</p> <p>0.894</p> <p>Mean = 0.89325</p> </td> <td> <p>0.9425</p> <p>0.9439</p> <p>0.9436</p> <p>0.9364</p> <p>Mean = 0.9416</p> </td> <td> </td> </tr> <tr> <td>&lt;mark&gt;<strong>+ Cleanlab Accuracy</strong>&lt;/mark&gt;</td> <td> <p>1️⃣ <strong>0.5584</strong> </p> <p>2️⃣ <strong>0.5879</strong></p> <p>3️⃣ <strong>0.6076 </strong></p> <p>4️⃣ <strong>0.4587 </strong>Mean = 0.5531</p> </td> <td> </td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>Parameters</td> <td><code>batch_size = 8 model = resnet18() train_shuffle = True test_shuffle = False</code> <code>train_split=ValidSplit(cv=5, stratified=False) valid_shuffle = False</code> <code>epochs = 20</code></td> <td><code>batch_size = 16 model = resnet18() train_shuffle = True test_shuffle = False</code> <code>train_split=predefined_split(Dataset(valid_data, valid_labels)) valid_shuffle = False</code> <code>epochs = 20</code></td> <td><code>batch_size = 32 model = resnet18() train_shuffle = True test_shuffle = False</code> <code>train_split=ValidSplit(cv=5, stratified=False) valid_shuffle = False</code> <code>epochs = 20</code></td> <td><code>batch_size = 64 model = resnet18() train_shuffle = True test_shuffle = False</code> <code>train_split=ValidSplit(cv=5, stratified=False)</code> <code>valid_shuffle = False</code> <code>epochs = 20</code></td> </tr> </tbody> </table> <p> </p> <h2>Training with Validation Set (Stratified Sampling)</h2> <p>Using arbitrary random seed can result in large differences between the training and validation set distributions. These differences can have unintended downstream consequences in the modeling process. As an example, the proportion of digit X can much higher in the training set than in the validation set. To overcome this, I’m using stratified sampling (sampling from each class with equal probability) to create the validation set for the datasets where it’s not available by default (e.g. 
MNIST Roman, MNIST Fashion, KMNIST).</p> <table> <tbody> <tr> <td> </td> <td>MNIST Roman</td> <td>Flower 102</td> <td>MNIST Fashion</td> <td> </td> <td>Describable Textures Dataset</td> </tr> <tr> <td>&lt;mark&gt;<strong>Baseline Accuracy</strong>&lt;/mark&gt;</td> <td> <p>0.7515</p> <p>0.6998</p> <p>0.7958</p> <p>0.8610</p> <p>Mean = 0.7770</p> </td> <td>N/A</td> <td> <p>0.8928</p> <p>0.8948</p> <p>0.8969</p> <p>0.895</p> <p>Mean = 0.894875</p> </td> <td> </td> <td>N/A</td> </tr> <tr> <td>&lt;mark&gt;<strong>+ Cleanlab Accuracy</strong>&lt;/mark&gt;</td> <td> <p>0.6027 → -0.1488</p> <p>0.8228 → 0.1230</p> <p>0.8130 → 0.0172</p> <p>0.6900 → -0.1709</p> <p>Mean = 0.7321</p> <p>Mean Difference = -0.0448</p> </td> <td> </td> <td> </td> <td> </td> <td> </td> </tr> <tr> <td>Parameters</td> <td><code>batch_size = 8 model = resnet18() train_shuffle = True test_shuffle = False</code> <code>train_split=ValidSplit(cv=5, stratified=True) valid_shuffle = False</code> <code>epochs = 20</code></td> <td> </td> <td><code>batch_size = 32 model = resnet18() train_shuffle = True test_shuffle = False</code> <code>train_split=ValidSplit(cv=5, stratified=True) valid_shuffle = False</code> <code>epochs = 20</code></td> <td><code>batch_size = 64 model = resnet18() train_shuffle = True test_shuffle = False</code> <code>train_split=ValidSplit(cv=5, stratified=True)</code> <code>valid_shuffle = False</code> <code>epochs = 20</code></td> </tr> </tbody> </table> <h2>Training with Early Stopping</h2> <table> <tbody> <tr> <td> </td> <td>MNIST Roman</td> <td>Flower 102</td> <td>MNIST Fashion</td> <td> </td> <td>Describable Textures Dataset</td> </tr> <tr> <td>&lt;mark&gt;<strong>Baseline Accuracy</strong>&lt;/mark&gt;</td> <td> <p>0.8327</p> <p>0.8388</p> <p>0.7921</p> <p>0.8597</p> <p>Mean = 0.8308</p> </td> <td> <p>0.6705</p> <p>0.6578</p> <p>0.6558</p> <p>0.6313</p> <p>Mean = 0.6539</p> </td> <td> <p>0.8856,</p> <p>0.8916,</p> <p>0.8856,</p> <p>0.8917</p> <p>Mean = 0.88862</p> </td> <td> </td> <td> </td> </tr> <tr> <td>&lt;mark&gt;<strong>+ Cleanlab Accuracy</strong>&lt;/mark&gt;</td> <td> <p>0.8683</p> <p>0.8339</p> <p>0.8597</p> <p>0.8105</p> <p>Mean = 0.8431</p> </td> <td> <p>0.6539</p> <p>0.6176</p> </td> <td> <p>0.887</p> <p>0.8904</p> </td> <td> </td> <td> </td> </tr> <tr> <td>Parameters</td> <td><code>callbacks=[EarlyStopping(monitor='train_loss', patience=5)]</code></td> <td><code>callbacks=[EarlyStopping(monitor='train_loss', patience=5)]</code> <code>train_split=predefined_split(Dataset(valid_data, valid_labels))</code></td> <td> </td> <td> </td> </tr> </tbody> </table> <p>Notebooks to reproduce results:</p> <ul> <li> <p><a href="https://www.notion.so/Dataset-Optimization-a658f8d56d0b4ba5b06cd7e8d8719b33">Roman MNIST</a></p> </li> <li> <p><a href="https://colab.research.google.com/drive/1pse7vnwKYMbZ0ahwZt68hxPBYp4zF4ec?usp=sharing"><strong>Flower 102</strong></a></p> </li> <li> <p><a href="https://colab.research.google.com/drive/1H4L5Ynch4IA2Cdu28bfTR0FO-qHCCnMF?usp=sharing"><strong>Fashion MNIST</strong></a></p> </li> </ul>danielgareev@gmail.com (lowlypalace)Sun, 11 Sep 2022 09:50:57 +0000https://blogs.python-gsoc.org/en/lowlypalaces-blog/gsoc-blog-activeloop-week-7/GSoC Blog | Activeloop | Week 6https://blogs.python-gsoc.org/en/lowlypalaces-blog/gsoc-blog-activeloop-week-6/<h2>What did you do this week?</h2> <p>This week, I've set up a pipeline for running experiments on a variety of datasets to compare the accuracies of baseline (the model trained on dirty data) and cleanlab (the model trained on clean 
data).</p> <h2>What is coming up next?</h2> <p>As a next step, I will be running experiments on a variety of datasets to benchmark the accuracy.</p> <h2>Did you get stuck anywhere?</h2> <p>Not really, it was a bit tricky to set up a pipeline in a way that's reproducible. I managed to overcome this by fixing the seeds.</p>danielgareev@gmail.com (lowlypalace)Sun, 11 Sep 2022 09:43:49 +0000https://blogs.python-gsoc.org/en/lowlypalaces-blog/gsoc-blog-activeloop-week-6/GSoC Blog | Activeloop | Week 5https://blogs.python-gsoc.org/en/lowlypalaces-blog/gsoc-blog-activeloop-week-5/<p>This week, I've been focusing on the high-level overview of data-centric strategies. I used <a href="https://worksheets.codalab.org/worksheets/0x7a8721f11e61436e93ac8f76da83f0e6">Roman MNIST Dataset</a> that was provided in the <a href="https://https-deeplearning-ai.github.io/data-centric-comp/">Data-Centric AI Competition</a>. For all experiments, I used fixed Resnet50 and applied resizing and normalization to the original dataset. I also used a fixed seed to be able to replicate the runs. Here are some of the metrics I’ve been getting:</p> <ol> <li><strong>Fixing Labels with Cleanlab.</strong></li> </ol> <ul> <li>Baseline Accuracy: 0.7318</li> <li>+ <code>CleanLearning</code> Accuracy: 0.8290</li> </ul> <ol> <li><strong>Automatic Augmentations with <a href="https://albumentations.ai/docs/autoalbument/">AutoAlbument</a> (uses <a href="https://arxiv.org/abs/1911.06987">Faster AutoAugment</a> algorithm)</strong></li> </ol> <ul> <li>Baseline Accuracy: 0.7318</li> <li>+ <code>AutoAlbument</code> Augmentations: 0.7404</li> </ul> <ol> <li><strong>Augmentations with Basic Augmentations and <a href="https://pytorch.org/vision/main/transforms.html#automatic-augmentation-transforms">Pre-trained Torch Policies</a></strong></li> </ol> <ul> <li>Baseline Accuracy: 0.7318</li> <li>Basic Baseline Augmentations: 0.8696</li> <li>ImageNet Pre-Trained Policy: 0.8560</li> </ul> <p>Here are my findings for these strategies:</p> <ul> <li><strong>Fixing Labels with Cleanlab.</strong> <ul> <li>I’m now trying out smaller <code>k</code> values for cross-validation to see to which extent it impacts the accuracy and improves the training speed. I will also try out more epoch ranges to see if this has impact on the accuracy. I’ve noticed that <code>cleanlab</code> performs well only when we already have a robust baseline model. For this specific dataset, I’ve noticed that if the accuracy of the baseline model is less than 0.7, then <code>cleanlab</code> actually has negative effect on the accuracy. I believe this is because of the confident learning algorithm, as it needs to get as accurate confidence scores for each label as possible.</li> <li>For now, the labels fixing and augmentations are applied separately. I think it would be also interesting to see how the accuracy changes after applying <code>CleanLearning</code> and then <code>AutoAlbument</code>. But I haven’t found out an easy way to get the corrected labels from CleanLearning to overwrite the initial labels of the dataset. I’ve messaged the <code>cleanlab</code> team to get their help on this.</li> </ul> </li> <li><strong>Automatic Augmentations with <a href="https://albumentations.ai/docs/autoalbument/">AutoAlbument</a></strong>. <ul> <li>I’ve only used <code>epochs = 15</code> to find optimal augmentation policy. 
I’ll try out more epochs ranges to see if this can improve the accuracy.</li> <li>I have also used the whole train and validation datasets for finding the optimal augmentation policy. I then applied augmentations only to train dataset and validated it on the test dataset.</li> </ul> </li> <li><strong>Augmentations with Basic Augmentations and <a href="https://pytorch.org/vision/main/transforms.html#automatic-augmentation-transforms">Pre-trained Torch Policies</a></strong> <ul> <li>I was surprised by the accuracy improvement after applying basic transformations, such as <code>HorizontalFlip()</code>, <code>RandomCrop()</code>, and <code>RandomErasing().</code></li> <li>I only applied augmentations to the train dataset.</li> </ul> </li> </ul>danielgareev@gmail.com (lowlypalace)Tue, 19 Jul 2022 13:48:34 +0000https://blogs.python-gsoc.org/en/lowlypalaces-blog/gsoc-blog-activeloop-week-5/GSoC Blog | Activeloop | Week 4https://blogs.python-gsoc.org/en/lowlypalaces-blog/gsoc-blog-activeloop-week-4/<h2>What did you do this week?</h2> <p>This week, I've been focusing on implementing custom cross-validation algorithm to compute out-of-sample probabilities for Hub Datasets.</p> <h2>What is coming up next?</h2> <p>As a next step, I will be working on a generic high-level pipeline that consists of fixing labels on a particular dataset, applying augmentations to a dataset and finding slices that underperform on a particular dataset.</p> <h2>Did you get stuck anywhere?</h2> <p>Not particularly, I had a few issues with ensuring that my experiments are deterministic with PyTorch but at the end I was able to fix it.</p>danielgareev@gmail.com (lowlypalace)Mon, 18 Jul 2022 09:02:55 +0000https://blogs.python-gsoc.org/en/lowlypalaces-blog/gsoc-blog-activeloop-week-4/GSoC Blog | Activeloop | Week 3https://blogs.python-gsoc.org/en/lowlypalaces-blog/gsoc-blog-activeloop-week-3/<h2>Overview</h2> <p>This week, I've been mainly working on making Hub datasets compatible with cleanlab. I implemented three tools to benchmark how cleanlab would work on the same dataset fetched from different sources.</p> <h2>Hub Dataset + Dataloader + Skorch</h2> <p>The first tool allows to run cleanlab with Hub dataset format and allows to directly pass custom Hub Dataloader. As cleanlab features leverage scikit-learn compatibility, I wrap the PyTorch neural net using skorch, which makes it scikit-learn-compatible. However, I had to overwrite a few of the methods such as get_dataset, <code>get_iterator</code>, <code>train_step_single</code>, <code>evaluation_step</code> and <code>validation_step</code> in the generic <code>NeuralNet</code> class to make Hub datasets work with skorch.</p> <h2>Pytorch Dataset + Pytorch Dataloader + Skorch</h2> <p>The second tool fetches the same data from torch.datasets, however, this time I didn't need to overwrite any scorch <code>NeuralNet </code>methods as they support standard PyTorch datasets and Dataloader by default. This step was mainly to ensure that I'm handling the Hub dataset format properly and to compare that the metrics for training match the one with the Hub dataset format.</p> <h2>Computing Out-of-sample Probabilities with Cross Validation</h2> <p>The third tool works with Hub datasets, however, it doesn't use skorch. Instead, this tool computes out-of-sample probabilities using cross validation for Hub dataset format. 
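</p> <p>The core idea is sketched below as a simplified, in-memory version (the actual implementation iterates over Hub dataloaders; <code>make_model</code> is a hypothetical factory that returns a fresh scikit-learn-compatible classifier, e.g. a skorch-wrapped network):</p> <pre><code>import numpy as np
from sklearn.model_selection import StratifiedKFold

def out_of_sample_probs(X, y, make_model, n_folds=5, seed=0):
    """Predicted probabilities where each sample is scored by a model that never saw it in training."""
    n_classes = len(np.unique(y))
    pred_probs = np.zeros((len(y), n_classes))
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for train_idx, holdout_idx in skf.split(X, y):
        model = make_model()  # fresh classifier for every fold
        model.fit(X[train_idx], y[train_idx])
        pred_probs[holdout_idx] = model.predict_proba(X[holdout_idx])
    return pred_probs
</code></pre> <p>These out-of-sample probabilities are exactly what cleanlab needs to find label issues.</p> <p>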
As skorch doesn't include functionality for cross-validation that's required by cleanlab, this week I focused on implementing cross-validation from scratch.</p>danielgareev@gmail.com (lowlypalace)Sun, 17 Jul 2022 22:27:16 +0000https://blogs.python-gsoc.org/en/lowlypalaces-blog/gsoc-blog-activeloop-week-3/GSoC Blog | Activeloop | Week 2https://blogs.python-gsoc.org/en/lowlypalaces-blog/gsoc-blog-activeloop-week-2/<h2>What did you do this week?</h2> <p>This week, I've been focusing on running experiments with automatic dataset augmentations as well as labels fixing.</p> <p>As a first experiment, I have used cleanlab to automatically find label issues in MNIST dataset. As I'm using an open-source tool cleanlab, it requires to get out-of-sample probabilities for each sample in the dataset. For that, I had to use cross-validation. There are two main ways to implement cross validation for neural networks: 1) wrap the model into sklearn-compatible model and use cross validation from scikit-learn, 2) implement your own cross validation algorithm and extract probabilities from each fold. I have been experimenting with both of the ways. First, I tried to use skorch Python library that wraps PyTorch model into a sklearn-compatible model. I had to overwrite a few methods to make it compatible with Hub datasets. <br> <br> As a second experiment, I have implemented data augmentation pipeline with Pre-Trained Policies in Torchvision as well as compared them to the baseline and plain approaches.</p> <ul> <li><strong>Plain</strong> — only <code>Normalize()</code> operation is applied.</li> <li><strong>Baseline</strong> — combination of <code>HorizontalFlip()</code>, <code>RandomCrop()</code>, and <code>RandomErasing()</code> .</li> <li><strong>AutoAugment</strong> — ****policy where <code>AutoAugment</code> is an additional transformation along with the baseline configuration. <a href="https://colab.research.google.com/drive/1d0x1rDCwANnymb6JnvWVYHQ6HTfravus?usp=sharing">Augmentation Example</a> on Colab. <code>torchvision</code> provides pre-trained policies on datasets like CIFAR-10, ImageNet, or SVHN. All of these are available in <code>AutoAugemntPolicy</code> package.</li> </ul> <p>After applying data transformations, I trained the models with Fashion-MNIST dataset on 40 epochs. 
Below, I show the results that I've obtained.</p> <h2>Plain</h2> <pre><code>transform_train = transforms.Compose([ transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,)), ]) </code></pre> <pre><code>Epoch: [40 | 40] LR: 0.100000 Processing |################################| (469/469) Data: 0.003s | Batch: 0.021s | Total: 0:00:09 | ETA: 0:00:01 | Loss: 0.1647 | top1: 94.2267 | top5: 99.9650 Processing |################################| (100/100) Data: 0.007s | Batch: 0.017s | Total: 0:00:01 | ETA: 0:00:01 | Loss: 0.2718 | top1: 90.8600 | top5: 99.8200 92.22 </code></pre> <h2>Baseline</h2> <pre><code>transform_train = transforms.Compose([ transforms.RandomHorizontalFlip(), transforms.RandomCrop(32, 4), transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,)), transforms.RandomErasing() ]) </code></pre> <pre><code>Epoch: [40 | 40] LR: 0.100000 Processing |################################| (469/469) Data: 0.023s | Batch: 0.043s | Total: 0:00:20 | ETA: 0:00:01 | Loss: 0.2602 | top1: 90.5000 | top5: 99.9117 Processing |################################| (100/100) Data: 0.008s | Batch: 0.018s | Total: 0:00:01 | ETA: 0:00:01 | Loss: 0.2430 | top1: 91.3800 | top5: 99.8700 Best acc: 91.38 </code></pre> <h2><strong>AutoAugment</strong></h2> <pre><code>transform_train = transforms.Compose([ transforms.AutoAugment(AutoAugmentPolicy.IMAGENET), transforms.ToTensor(), transforms.Normalize((0.1307,), (0.3081,)), ]) </code></pre> <pre><code>Epoch: [40 | 40] LR: 0.100000 Processing |################################| (469/469) Data: 0.033s | Batch: 0.054s | Total: 0:00:25 | ETA: 0:00:01 | Loss: 0.2633 | top1: 90.4683 | top5: 99.9067 Processing |################################| (100/100) Data: 0.006s | Batch: 0.016s | Total: 0:00:01 | ETA: 0:00:01 | Loss: 0.2281 | top1: 91.6000 | top5: 99.9500 Best acc: 92.03 </code></pre> <p> </p> <h2>What is coming up next?</h2> <p>As a next step, I will be working on making cleanlab compatible with Hub datasets. Specifically, I will implement my own cross validation algorithm to obtain out-of-sample probabilities.</p> <h2>Did you get stuck anywhere?</h2> <p>I mainly had design questions on how the users will be interacting with the API, but I've communicated with the team and resolved the doubts.</p>danielgareev@gmail.com (lowlypalace)Wed, 29 Jun 2022 13:22:15 +0000https://blogs.python-gsoc.org/en/lowlypalaces-blog/gsoc-blog-activeloop-week-2/GSoC Blog | Activeloop | Week 1https://blogs.python-gsoc.org/en/lowlypalaces-blog/gsoc-blog-activeloop-week-1/<p>This week, along with the community bonding period, I had a deep dive into the codebase of the project. Along with that, the project has a strong research component. The goal of the project is to offer users a set of automatic tools that they can use to improve the overall quality of their datasets. Therefore, I focused on researching various data-centric tools (e.g. auto-augmentation, fixing labels, slice discovery) and their trade-offs. Below, I describe a few of the data-centric tools that I discovered and experimented with. </p> <h2>1. Fix Dataset</h2> <blockquote> <p>These tools focus on identify errors in datasets. These include traditional constraint-based data cleaning methods, as well as those that use machine learning to detect and resolve data errors.</p> </blockquote> <p>The labels in datasets from real-world applications can be of far lower quality. 
<a href="https://www.technologyreview.com/2021/04/01/1021619/ai-data-errors-warp-machine-learning-progress/">Recent studies</a> have discovered that even ML benchmark datasets are full of label errors. The goal of this step would be to use one of the open-source tools, such as <a href="https://github.com/cleanlab/cleanlab">cleanlab</a>, that automatically finds and fixes errors in any ML dataset. </p> <h2>2. Auto Augmentations</h2> <blockquote> <p>A technique to increase the diversity of your training set by applying random (but realistic) transformations, such as image rotation.</p> </blockquote> <p>Automatic augmentation is useful not only increase the accuracy, it also prevents overfitting and makes models generalize better. Transformations enlarge the dataset by adding slightly modified copies of already existing images.<br> <br> 2.1 API</p> <p>First, I came up with a high-level API to automatically augment images:</p> <p><code>ds.autoaugment(task)</code></p> <p><code>ds.autoaugment()</code> takes a <code>task</code> and a set of optional parameters and returns a set of optimal augmentation policies.</p> <p><strong>Args</strong></p> <p><code>task</code> - A name of deep learning task. Supported values are <code>classification</code> and <code>semantic_segmentation</code>.</p> <p><code>num_classes</code> - An optional parameter to provide a number of distinct classes in the classification or segmentation dataset. If not given, finds the number of classes automatically.</p> <p><code>model</code> - An optional parameter to provide a custom model. By default uses <code>[pytorch-image-model](&lt;https://github.com/rwightman/pytorch-image-models&gt;)</code> for classification and <code>[segmentation_models.pytorch](&lt;https://github.com/qubvel/segmentation_models.pytorch&gt;)</code> for semantic segmentation.</p> <p><code>preprocess</code> - An optional parameter to provide preprocessing transofrms. If images have different sizes or formats, you could define preprocessing transforms (such as Resizing, Cropping and Normalization).</p> <p><strong>Returns</strong></p> <p><code>transform</code> - A wrapper function that contains discovered policies for the augmentation pipeline. This function can be applied on a complete dataset when loading the dataset or during training.</p> <p><code>ds.autoaugment()</code> produces a transform pipeline (a configuration for an augmentation pipeline). We can augment the dataset as follows:</p> <h3><br> 2.2 Data Augmentation Approaches</h3> <h4>2.2.1 Pre-Trained Policies</h4> <ul> <li><code>PyTorch</code> provides <a href="https://pytorch.org/vision/main/transforms.html#automatic-augmentation-transforms">pre-trained augmentation transforms policies</a>. 
We can try to use AutoAugment policies learned on different datasets, try it on a different dataset and compare it to the baseline with or only a few basic transformations.</li> <li><a href="https://github.com/VDIGPKU/DADA#found-policy">DADA</a> provides Data Augmentation policies found for CIFAR-10, CIFAR-100, SVHN, and ImageNet datasets.</li> <li><strong>Pros</strong> <ul> <li>No time spent on finding policies, training and validation</li> <li>No input parameters needed</li> </ul> </li> <li><strong>Cons</strong> <ul> <li>The policies are not tailored for the dataset at hand</li> </ul> </li> </ul> <h4>2.2.2 Faster AA / DADA Implementation</h4> <p>Next, I researched various automatic approaches to find data augmentation policies from data.</p> <ul> <li> <p>There are many data augmentation tools that were developed in recent years.</p> </li> <li> <p><a href="https://github.com/moskomule/dda/tree/master/faster_autoaugment">Faster AA</a> / <a href="https://github.com/VDIGPKU/DADA#license">DADA</a> are a few of the newest tools (libraries) that provide a good accuracy and time trade-off.</p> </li> <li> <p>The libraries implement only basic classification tasks with common datasets (e.g. CIFAR, SVHN).</p> </li> <li> <p>Object detection and image segmentation is not supported.</p> </li> <li> <p>The libraries are research oriented.</p> </li> </ul> <p>Table below shows the training time on ImageNet for DADA, Faster AA and Deep AA. The DADA is as twice as fast as Faster AA.</p> <table style="width: 500px;"> <tbody> <tr> <td>Number of GPUs</td> <td>DADA</td> <td>Faster AA</td> <td>Deep AA</td> </tr> <tr> <td>1 GPU</td> <td>1.3</td> <td>2.3</td> <td>96</td> </tr> <tr> <td>2 GPU</td> <td>0.6</td> <td>1.1</td> <td>48</td> </tr> <tr> <td>4 GPU</td> <td>0.3</td> <td>0.5</td> <td>24</td> </tr> <tr> <td>8 GPU</td> <td>0.1</td> <td>0.2</td> <td>12</td> </tr> </tbody> </table> <p>While the accuracy of Deep AA on ImageNet with ResNet-50 is higher than DADA and Faster AA, it is considerably slower.</p> <table style="width: 500px;"> <tbody> <tr> <td>Dataset</td> <td>DADA</td> <td>Faster AA</td> <td>Deep AA</td> </tr> <tr> <td>ImageNet (ResNet-50)</td> <td>77.5</td> <td>76.5</td> <td>78.30 ± 0.14</td> </tr> <tr> <td>ImageNet (ResNet-200)</td> <td>-</td> <td>-</td> <td>81.32 ± 0.17</td> </tr> <tr> <td>CIFAR 10 (Wide-ResNet-28-10)</td> <td>97.3</td> <td>97.4</td> <td>97.56 ± 0.14</td> </tr> <tr> <td>CIFAR 100 (Wide-ResNet-28-10)</td> <td>82.5</td> <td>82.7</td> <td>84.02 ± 0.18</td> </tr> </tbody> </table> <h4>2.2.3 Albumentations</h4> <ul> <li> <p><a href="https://github.com/albumentations-team/albumentations">Albumentations</a> supports different computer vision tasks such as classification, semantic segmentation, instance segmentation, object detection, and pose estimation.</p> </li> <li> <p>For the most image operations, Albumentations is faster than all alternatives</p> </li> <li> <p><code>AutoAlbument</code> is an AutoML tool that learns image augmentation policies from data using the <a href="https://arxiv.org/abs/1911.06987">Faster AutoAugment algorithm</a></p> <ul> <li>AutoAlbument supports image classification and semantic segmentation tasks.</li> <li>Under the hood, it uses the Faster AutoAugment algorithm.</li> <li>We can use Albumentations to utilize policies discovered by AutoAlbument. <ul> </ul> </li> </ul> </li> </ul> <p><br> Besides coding, this week I jumped on another task to get to know the Hub community better. 
I coordinated a team of open-source contributors and allocated them some of the tasks. </p> <p> </p> <p> </p>danielgareev@gmail.com (lowlypalace)Wed, 29 Jun 2022 13:01:40 +0000https://blogs.python-gsoc.org/en/lowlypalaces-blog/gsoc-blog-activeloop-week-1/