GSoC Blog | Activeloop | Week 7

lowlypalace
Published: 09/11/2022

This week, I worked on running the experiments on a variety of datasets. 

Here are a few experiments that I ran to benchmark cleanlab's performance on three different datasets (MNIST Roman (DCAI), Flower 102, Fashion MNIST). For all experiments, I used a fixed model and applied resizing and normalization to the original images.
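As a refresher on what the "+ Cleanlab" rows measure: cleanlab flags likely label errors from out-of-sample predicted probabilities, and the model is then retrained with those samples pruned. Below is a stdlib-only sketch of the class-threshold rule at the heart of it; this is a simplification, as the actual library (e.g. `CleanLearning` in cleanlab 2.x) uses cross-validated probabilities and estimates the full joint distribution of given vs. true labels.

```python
def find_label_issues(pred_probs, labels):
    """Flag samples whose given label looks wrong (simplified cleanlab rule)."""
    n_classes = len(pred_probs[0])
    # Per-class threshold: the model's average confidence in class c,
    # taken over the samples that are actually labeled c.
    thresholds = []
    for c in range(n_classes):
        confs = [p[c] for p, y in zip(pred_probs, labels) if y == c]
        thresholds.append(sum(confs) / len(confs))
    # A sample is a likely label issue when the model's confidence in the
    # sample's given label falls below that label's threshold.
    return [p[y] < thresholds[y] for p, y in zip(pred_probs, labels)]

# Four samples, two classes; the third sample is labeled 0 but the
# model is confident it is class 1, so it gets flagged.
probs = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]
labels = [0, 0, 0, 1]
print(find_label_issues(probs, labels))  # [False, False, True, False]
```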

What do 1️⃣, 2️⃣, 3️⃣, 4️⃣ mean? I re-ran the entire fitting with a few random seeds to get an estimate of the variance between the baseline and cleanlab accuracies.

1️⃣  = seed(0)

2️⃣  = seed(1)

3️⃣  = seed(123)

4️⃣ = seed(42)
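Concretely, the per-seed loop looks something like the stdlib-only sketch below; in the actual PyTorch/skorch runs you would also seed `numpy.random.seed` and `torch.manual_seed` (plus `torch.cuda.manual_seed_all` on GPU).

```python
import random

def set_seed(seed):
    # Fix the RNG so a run is repeatable. The real training script would
    # also call numpy.random.seed(seed) and torch.manual_seed(seed) here.
    random.seed(seed)

for seed in (0, 1, 123, 42):  # the four seeds behind 1️⃣, 2️⃣, 3️⃣, 4️⃣
    set_seed(seed)
    # ...fit the baseline, fit with cleanlab, record both test accuracies
```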

First Training Run

**Baseline Accuracy**

| Seed | Roman MNIST (DCAI) | Flower 102 | Fashion MNIST |
|------|--------------------|------------|---------------|
| 1️⃣ | 0.7945 | 0.6568 | 0.8958 |
| 2️⃣ | 0.8511 | 0.6176 | 0.8944 |
| 3️⃣ | 0.7699 | 0.6274 | 0.8987 |

**+ Cleanlab Accuracy**

| Seed | Roman MNIST (DCAI) | Flower 102 | Fashion MNIST |
|------|--------------------|------------|---------------|
| 1️⃣ | 0.7933 (-0.0012 ⬇️) | 0.5421 (-0.1147 ⬇️) | 0.8992 (+0.0034 ⬆️) |
| 2️⃣ | 0.8031 (-0.048 ⬇️) | 0.5441 (-0.0735 ⬇️) | 0.8951 (+0.0007 ⬆️) |
| 3️⃣ | 0.7109 (-0.059 ⬇️) | 0.5647 (-0.0627 ⬇️) | 0.8866 (-0.0121 ⬇️) |

Per-dataset setup (shared: `model = resnet18()`, `train_shuffle = True`, `test_shuffle = False`, `train_split = None`, `epochs = 10`):

- **Roman MNIST (DCAI)**: 10 classes, 224x224 images, `batch_size = 8`. Transform: `Resize((224, 224)), ToTensor(), Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])`. Network: `resnet.fc = nn.Linear(resnet.fc.in_features, 10)`.
- **Flower 102**: 102 classes, 224x224 images, `batch_size = 16`. Transform: same as above. Network: `resnet.fc = nn.Linear(resnet.fc.in_features, 102)`.
- **Fashion MNIST**: 10 classes, 28x28 images, `batch_size = 32`. Transform: `ToTensor(), Normalize((0.), (1.))`. Network: `resnet.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)` and `resnet.fc = nn.Linear(resnet.fc.in_features, 10)`.
- **KMNIST**: 10 classes, 28x28 images, `batch_size = 64`. Transform and network: same as Fashion MNIST.
- **Describable Textures Dataset**: 47 classes, 300x300 images, `batch_size = 32`. Transform: `Resize((300, 300)), ToTensor(), Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])`. Network: `resnet.fc = nn.Linear(resnet.fc.in_features, 47)`.

Training with 20 Epochs

In the results below, I used epochs = 20 instead of epochs = 10. The rest of the parameters (e.g. Network, Transform) are unchanged.

**Baseline Accuracy**

| Seed | MNIST Roman | Flower 102 | MNIST Fashion |
|------|-------------|------------|---------------|
| 1️⃣ | 0.6875 | 0.5421 | 0.891 |
| 2️⃣ | 0.7736 | 0.5617 | 0.8977 |
| 3️⃣ | 0.6617 | 0.6578 | 0.8977 |
| 4️⃣ | 0.7945 | 0.6294 | |
| Mean | 0.7293 | | |

**+ Cleanlab Accuracy**

| Seed | MNIST Roman | Flower 102 | MNIST Fashion |
|------|-------------|------------|---------------|
| 1️⃣ | 0.7257 (+0.0382 ⬆️) | 0.6117 (+0.0696 ⬆️) | 0.8982 |
| 2️⃣ | 0.8400 (+0.0664 ⬆️) | 0.6254 (+0.0833 ⬆️) | |
| 3️⃣ | 0.8511 (+0.1894 ⬆️) | 0.5598 (-0.098 ⬇️) | 0.897 |
| 4️⃣ | 0.8757 (+0.0812 ⬆️) | 0.5705 (-0.0589 ⬇️) | |
| Mean | 0.8231 (+0.0938 ⬆️) | | |

Parameters: as in the first run, but with `epochs = 20` (per column: `batch_size = 8 / 16 / 32 / 64`, `model = resnet18()`, `train_shuffle = True`, `test_shuffle = False`, `train_split = None`).

Training with 30 Epochs

In the results below, I used epochs = 30 instead of epochs = 20. The rest of the parameters (e.g. Network, Transform) are unchanged.

Only MNIST Roman has results for this run:

| Run | Baseline Accuracy | + Cleanlab Accuracy | Difference |
|------|------|------|------|
| 1 | 0.8425 | 0.8646 | +0.0221 |
| 2 | 0.8831 | 0.8228 | -0.0602 |
| 3 | 0.8769 | 0.8696 | -0.0073 |
| 4 | 0.8560 | 0.8720 | +0.0159 |
| Mean | 0.8646 | 0.8573 | -0.0073 |

Parameters: as above, but with `epochs = 30` (per column: `batch_size = 8 / 16 / 32 / 64`, `model = resnet18()`, `train_shuffle = True`, `test_shuffle = False`, `train_split = None`).

 

Training with Resnet50

In the results below, all of the parameters stay the same as in the run above, but this time I changed the network to resnet50() instead of resnet18(). Epochs are also back to epochs = 20.

Only MNIST Roman has results for this run:

| Seed | Baseline Accuracy | + Cleanlab Accuracy |
|------|------|------|
| 1️⃣ | 0.7835 | 0.8044 |
| 2️⃣ | 0.7589 | 0.8560 |
| 3️⃣ | 0.8068 | 0.8376 |
| 4️⃣ | 0.8560 | 0.7859 |

Parameters: as in the 20-epoch run, but with `model = resnet50()` (per column: `batch_size = 8 / 16 / 32 / 64 / 8`, `train_shuffle = True`, `test_shuffle = False`, `train_split = None`, `epochs = 20`).

Training with Validation Set

In the results below, all of the parameters stay the same as in the run above; however, this time I ran the trainings with a validation set, so 20% of each dataset is held out for internal training validation. For datasets that come with a predefined validation set, that set is used instead. I set the model back to resnet18(), as it seems to give better baseline accuracies across the three datasets.
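For reference, `ValidSplit(cv=5)` is what produces the 20% holdout: it carves one fifth of the training data into an internal validation set. A stdlib sketch of that behavior (skorch's real implementation also handles stratification and predefined splits):

```python
import random

def valid_split(n_samples, cv=5, seed=0):
    """Hold out 1/cv of the data (20% for cv=5) as a validation set,
    roughly what ValidSplit(cv=5) does inside skorch."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_valid = n_samples // cv
    return indices[n_valid:], indices[:n_valid]  # train, validation

train_idx, valid_idx = valid_split(1000, cv=5)
print(len(train_idx), len(valid_idx))  # 800 200
```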

**Baseline Accuracy**

| Seed | MNIST Roman | Flower 102 | MNIST Fashion | Describable Textures Dataset |
|------|-------------|------------|---------------|------------------------------|
| 1️⃣ | 0.6432 | 0.6421 | 0.8974 | 0.9425 |
| 2️⃣ | 0.6445 | 0.6578 | 0.891 | 0.9439 |
| 3️⃣ | 0.6777 | 0.6823 | 0.8906 | 0.9436 |
| 4️⃣ | 0.6383 | 0.6372 | 0.894 | 0.9364 |
| Mean | 0.6509 | 0.6549 | 0.89325 | 0.9416 |

**+ Cleanlab Accuracy** (MNIST Roman only)

| Seed | MNIST Roman |
|------|-------------|
| 1️⃣ | 0.5584 |
| 2️⃣ | 0.5879 |
| 3️⃣ | 0.6076 |
| 4️⃣ | 0.4587 |
| Mean | 0.5531 |

Parameters per dataset (shared: `model = resnet18()`, `train_shuffle = True`, `test_shuffle = False`, `valid_shuffle = False`, `epochs = 20`):

- **MNIST Roman**: `batch_size = 8`, `train_split=ValidSplit(cv=5, stratified=False)`
- **Flower 102**: `batch_size = 16`, `train_split=predefined_split(Dataset(valid_data, valid_labels))`
- **MNIST Fashion**: `batch_size = 32`, `train_split=ValidSplit(cv=5, stratified=False)`
- **Describable Textures Dataset**: `batch_size = 64`, `train_split=ValidSplit(cv=5, stratified=False)`

 

Training with Validation Set (Stratified Sampling)

Using an arbitrary random seed can result in large differences between the training and validation set distributions. These differences can have unintended downstream consequences in the modeling process. As an example, the proportion of digit X can be much higher in the training set than in the validation set. To overcome this, I'm using stratified sampling (sampling from each class with equal probability) to create the validation set for the datasets where one isn't available by default (e.g. MNIST Roman, MNIST Fashion, KMNIST).

**Baseline Accuracy**

| Seed | MNIST Roman | Flower 102 | MNIST Fashion | Describable Textures Dataset |
|------|-------------|------------|---------------|------------------------------|
| 1️⃣ | 0.7515 | N/A | 0.8928 | N/A |
| 2️⃣ | 0.6998 | N/A | 0.8948 | N/A |
| 3️⃣ | 0.7958 | N/A | 0.8969 | N/A |
| 4️⃣ | 0.8610 | N/A | 0.895 | N/A |
| Mean | 0.7770 | | 0.894875 | |

**+ Cleanlab Accuracy** (MNIST Roman only)

| Seed | MNIST Roman |
|------|-------------|
| 1️⃣ | 0.6027 (-0.1488) |
| 2️⃣ | 0.8228 (+0.1230) |
| 3️⃣ | 0.8130 (+0.0172) |
| 4️⃣ | 0.6900 (-0.1709) |
| Mean | 0.7321 (-0.0448) |

Parameters, three configurations (shared: `model = resnet18()`, `train_shuffle = True`, `test_shuffle = False`, `valid_shuffle = False`, `train_split=ValidSplit(cv=5, stratified=True)`, `epochs = 20`): `batch_size = 8`, `batch_size = 32`, and `batch_size = 64`.

Training with Early Stopping

**Baseline Accuracy**

| Run | MNIST Roman | Flower 102 | MNIST Fashion |
|------|-------------|------------|---------------|
| 1 | 0.8327 | 0.6705 | 0.8856 |
| 2 | 0.8388 | 0.6578 | 0.8916 |
| 3 | 0.7921 | 0.6558 | 0.8856 |
| 4 | 0.8597 | 0.6313 | 0.8917 |
| Mean | 0.8308 | 0.6539 | 0.88862 |

**+ Cleanlab Accuracy**

| Run | MNIST Roman | Flower 102 | MNIST Fashion |
|------|-------------|------------|---------------|
| 1 | 0.8683 | 0.6539 | 0.887 |
| 2 | 0.8339 | 0.6176 | 0.8904 |
| 3 | 0.8597 | | |
| 4 | 0.8105 | | |
| Mean | 0.8431 | | |

Parameters: `callbacks=[EarlyStopping(monitor='train_loss', patience=5)]`, plus `train_split=predefined_split(Dataset(valid_data, valid_labels))` for the dataset with a predefined validation set.
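The patience rule behind `EarlyStopping(monitor='train_loss', patience=5)` can be sketched as follows; this is a simplified stand-in for skorch's callback, which also supports a relative improvement threshold.

```python
class EarlyStopper:
    """Stop training once the monitored loss has gone `patience`
    consecutive epochs without improving on its best value."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, loss):
        if loss < self.best:
            self.best = loss      # new best -> reset the counter
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1  # no improvement this epoch
        return self.bad_epochs >= self.patience

# The loss improves twice, then plateaus; training stops after 5 flat epochs.
stopper = EarlyStopper(patience=5)
flags = [stopper.should_stop(l) for l in [1.0, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9]]
print(flags)  # [False, False, False, False, False, False, True]
```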

Notebooks to reproduce results:
