GSoC Blog | Activeloop | Week 7

lowlypalace
Published: 09/11/2022

This week, I worked on running the experiments on a variety of datasets. 

Here are a few experiments that I ran to benchmark cleanlab's performance on three different datasets (MNIST Roman (DCAI), Flower 102, Fashion MNIST). For all experiments, I used a fixed model and applied resizing and normalization to the original images.
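As a refresher on what the "+ Cleanlab" rows measure: cleanlab flags likely label errors from out-of-sample predicted probabilities, and the model is then retrained with those samples pruned. Below is a stdlib-only sketch of the class-threshold rule at the heart of it; this is a simplification, as the actual library (e.g. `CleanLearning` in cleanlab 2.x) uses cross-validated probabilities and estimates the full joint distribution of given vs. true labels.

```python
def find_label_issues(pred_probs, labels):
    """Flag samples whose given label looks wrong (simplified cleanlab rule)."""
    n_classes = len(pred_probs[0])
    # Per-class threshold: the model's average confidence in class c,
    # taken over the samples that are actually labeled c.
    thresholds = []
    for c in range(n_classes):
        confs = [p[c] for p, y in zip(pred_probs, labels) if y == c]
        thresholds.append(sum(confs) / len(confs))
    # A sample is a likely label issue when the model's confidence in the
    # sample's given label falls below that label's threshold.
    return [p[y] < thresholds[y] for p, y in zip(pred_probs, labels)]

# Four samples, two classes; the third sample is labeled 0 but the
# model is confident it is class 1, so it gets flagged.
probs = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]
labels = [0, 0, 0, 1]
print(find_label_issues(probs, labels))  # [False, False, True, False]
```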

What do 1️⃣, 2️⃣, 3️⃣, 4️⃣ mean? I re-ran the entire fitting with a few random seeds to get an estimate of the variance between the baseline and cleanlab accuracies.

1️⃣  = seed(0)

2️⃣  = seed(1)

3️⃣  = seed(123)

4️⃣ = seed(42)
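Concretely, the per-seed loop looks something like the stdlib-only sketch below; in the actual PyTorch/skorch runs you would also seed `numpy.random.seed` and `torch.manual_seed` (plus `torch.cuda.manual_seed_all` on GPU).

```python
import random

def set_seed(seed):
    # Fix the RNG so a run is repeatable. The real training script would
    # also call numpy.random.seed(seed) and torch.manual_seed(seed) here.
    random.seed(seed)

for seed in (0, 1, 123, 42):  # the four seeds behind 1️⃣, 2️⃣, 3️⃣, 4️⃣
    set_seed(seed)
    # ...fit the baseline, fit with cleanlab, record both test accuracies
```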

First Training Run

**Baseline Accuracy**

| Seed | Roman MNIST (DCAI) | Flower 102 | Fashion MNIST |
|------|--------------------|------------|---------------|
| 1️⃣ | 0.7945 | 0.6568 | 0.8958 |
| 2️⃣ | 0.8511 | 0.6176 | 0.8944 |
| 3️⃣ | 0.7699 | 0.6274 | 0.8987 |

**+ Cleanlab Accuracy**

| Seed | Roman MNIST (DCAI) | Flower 102 | Fashion MNIST |
|------|--------------------|------------|---------------|
| 1️⃣ | 0.7933 (-0.0012 ⬇️) | 0.5421 (-0.1147 ⬇️) | 0.8992 (+0.0034 ⬆️) |
| 2️⃣ | 0.8031 (-0.048 ⬇️) | 0.5441 (-0.0735 ⬇️) | 0.8951 (+0.0007 ⬆️) |
| 3️⃣ | 0.7109 (-0.059 ⬇️) | 0.5647 (-0.0627 ⬇️) | 0.8866 (-0.0121 ⬇️) |

Per-dataset setup (shared: `model = resnet18()`, `train_shuffle = True`, `test_shuffle = False`, `train_split = None`, `epochs = 10`):

- **Roman MNIST (DCAI)**: 10 classes, 224x224 images, `batch_size = 8`. Transform: `Resize((224, 224)), ToTensor(), Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])`. Network: `resnet.fc = nn.Linear(resnet.fc.in_features, 10)`.
- **Flower 102**: 102 classes, 224x224 images, `batch_size = 16`. Transform: same as above. Network: `resnet.fc = nn.Linear(resnet.fc.in_features, 102)`.
- **Fashion MNIST**: 10 classes, 28x28 images, `batch_size = 32`. Transform: `ToTensor(), Normalize((0.), (1.))`. Network: `resnet.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)` and `resnet.fc = nn.Linear(resnet.fc.in_features, 10)`.
- **KMNIST**: 10 classes, 28x28 images, `batch_size = 64`. Transform and network: same as Fashion MNIST.
- **Describable Textures Dataset**: 47 classes, 300x300 images, `batch_size = 32`. Transform: `Resize((300, 300)), ToTensor(), Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])`. Network: `resnet.fc = nn.Linear(resnet.fc.in_features, 47)`.

Training with 20 Epochs

In the results below, I used epochs = 20 instead of epochs = 10. The rest of the parameters (e.g. Network, Transform) are unchanged.

**Baseline Accuracy**

| Seed | MNIST Roman | Flower 102 | MNIST Fashion |
|------|-------------|------------|---------------|
| 1️⃣ | 0.6875 | 0.5421 | 0.891 |
| 2️⃣ | 0.7736 | 0.5617 | 0.8977 |
| 3️⃣ | 0.6617 | 0.6578 | 0.8977 |
| 4️⃣ | 0.7945 | 0.6294 | |
| Mean | 0.7293 | | |

**+ Cleanlab Accuracy**

| Seed | MNIST Roman | Flower 102 | MNIST Fashion |
|------|-------------|------------|---------------|
| 1️⃣ | 0.7257 (+0.0382 ⬆️) | 0.6117 (+0.0696 ⬆️) | 0.8982 |
| 2️⃣ | 0.8400 (+0.0664 ⬆️) | 0.6254 (+0.0833 ⬆️) | |
| 3️⃣ | 0.8511 (+0.1894 ⬆️) | 0.5598 (-0.098 ⬇️) | 0.897 |
| 4️⃣ | 0.8757 (+0.0812 ⬆️) | 0.5705 (-0.0589 ⬇️) | |
| Mean | 0.8231 (+0.0938 ⬆️) | | |

Parameters: as in the first run, but with `epochs = 20` (per column: `batch_size = 8 / 16 / 32 / 64`, `model = resnet18()`, `train_shuffle = True`, `test_shuffle = False`, `train_split = None`).

Training with 30 Epochs

In the results below, I used epochs = 30 instead of epochs = 20. The rest of the parameters (e.g. Network, Transform) are unchanged.

Only MNIST Roman has results for this run:

| Run | Baseline Accuracy | + Cleanlab Accuracy | Difference |
|------|------|------|------|
| 1 | 0.8425 | 0.8646 | +0.0221 |
| 2 | 0.8831 | 0.8228 | -0.0602 |
| 3 | 0.8769 | 0.8696 | -0.0073 |
| 4 | 0.8560 | 0.8720 | +0.0159 |
| Mean | 0.8646 | 0.8573 | -0.0073 |

Parameters: as above, but with `epochs = 30` (per column: `batch_size = 8 / 16 / 32 / 64`, `model = resnet18()`, `train_shuffle = True`, `test_shuffle = False`, `train_split = None`).

 

Training with Resnet50

In the results below, all of the parameters stay the same as in the run above, but this time I changed the network to resnet50() instead of resnet18(). Epochs are also back to epochs = 20.

Only MNIST Roman has results for this run:

| Seed | Baseline Accuracy | + Cleanlab Accuracy |
|------|------|------|
| 1️⃣ | 0.7835 | 0.8044 |
| 2️⃣ | 0.7589 | 0.8560 |
| 3️⃣ | 0.8068 | 0.8376 |
| 4️⃣ | 0.8560 | 0.7859 |

Parameters: as in the 20-epoch run, but with `model = resnet50()` (per column: `batch_size = 8 / 16 / 32 / 64 / 8`, `train_shuffle = True`, `test_shuffle = False`, `train_split = None`, `epochs = 20`).

Training with Validation Set

In the results below, all of the parameters stay the same as in the run above; however, this time I ran the trainings with a validation set, so 20% of each dataset is held out for internal training validation. For datasets that come with a predefined validation set, that set is used instead. I set the model back to resnet18(), as it seems to give better baseline accuracies across the three datasets.
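For reference, `ValidSplit(cv=5)` is what produces the 20% holdout: it carves one fifth of the training data into an internal validation set. A stdlib sketch of that behavior (skorch's real implementation also handles stratification and predefined splits):

```python
import random

def valid_split(n_samples, cv=5, seed=0):
    """Hold out 1/cv of the data (20% for cv=5) as a validation set,
    roughly what ValidSplit(cv=5) does inside skorch."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_valid = n_samples // cv
    return indices[n_valid:], indices[:n_valid]  # train, validation

train_idx, valid_idx = valid_split(1000, cv=5)
print(len(train_idx), len(valid_idx))  # 800 200
```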

**Baseline Accuracy**

| Seed | MNIST Roman | Flower 102 | MNIST Fashion | Describable Textures Dataset |
|------|-------------|------------|---------------|------------------------------|
| 1️⃣ | 0.6432 | 0.6421 | 0.8974 | 0.9425 |
| 2️⃣ | 0.6445 | 0.6578 | 0.891 | 0.9439 |
| 3️⃣ | 0.6777 | 0.6823 | 0.8906 | 0.9436 |
| 4️⃣ | 0.6383 | 0.6372 | 0.894 | 0.9364 |
| Mean | 0.6509 | 0.6549 | 0.89325 | 0.9416 |

**+ Cleanlab Accuracy** (MNIST Roman only)

| Seed | MNIST Roman |
|------|-------------|
| 1️⃣ | 0.5584 |
| 2️⃣ | 0.5879 |
| 3️⃣ | 0.6076 |
| 4️⃣ | 0.4587 |
| Mean | 0.5531 |

Parameters per dataset (shared: `model = resnet18()`, `train_shuffle = True`, `test_shuffle = False`, `valid_shuffle = False`, `epochs = 20`):

- **MNIST Roman**: `batch_size = 8`, `train_split=ValidSplit(cv=5, stratified=False)`
- **Flower 102**: `batch_size = 16`, `train_split=predefined_split(Dataset(valid_data, valid_labels))`
- **MNIST Fashion**: `batch_size = 32`, `train_split=ValidSplit(cv=5, stratified=False)`
- **Describable Textures Dataset**: `batch_size = 64`, `train_split=ValidSplit(cv=5, stratified=False)`

 

Training with Validation Set (Stratified Sampling)

Using an arbitrary random seed can result in large differences between the training and validation set distributions. These differences can have unintended downstream consequences in the modeling process. As an example, the proportion of digit X can be much higher in the training set than in the validation set. To overcome this, I'm using stratified sampling (sampling from each class with equal probability) to create the validation set for the datasets where one isn't available by default (e.g. MNIST Roman, MNIST Fashion, KMNIST).

**Baseline Accuracy**

| Seed | MNIST Roman | Flower 102 | MNIST Fashion | Describable Textures Dataset |
|------|-------------|------------|---------------|------------------------------|
| 1️⃣ | 0.7515 | N/A | 0.8928 | N/A |
| 2️⃣ | 0.6998 | N/A | 0.8948 | N/A |
| 3️⃣ | 0.7958 | N/A | 0.8969 | N/A |
| 4️⃣ | 0.8610 | N/A | 0.895 | N/A |
| Mean | 0.7770 | | 0.894875 | |

**+ Cleanlab Accuracy** (MNIST Roman only)

| Seed | MNIST Roman |
|------|-------------|
| 1️⃣ | 0.6027 (-0.1488) |
| 2️⃣ | 0.8228 (+0.1230) |
| 3️⃣ | 0.8130 (+0.0172) |
| 4️⃣ | 0.6900 (-0.1709) |
| Mean | 0.7321 (-0.0448) |

Parameters, three configurations (shared: `model = resnet18()`, `train_shuffle = True`, `test_shuffle = False`, `valid_shuffle = False`, `train_split=ValidSplit(cv=5, stratified=True)`, `epochs = 20`): `batch_size = 8`, `batch_size = 32`, and `batch_size = 64`.

Training with Early Stopping

**Baseline Accuracy**

| Run | MNIST Roman | Flower 102 | MNIST Fashion |
|------|-------------|------------|---------------|
| 1 | 0.8327 | 0.6705 | 0.8856 |
| 2 | 0.8388 | 0.6578 | 0.8916 |
| 3 | 0.7921 | 0.6558 | 0.8856 |
| 4 | 0.8597 | 0.6313 | 0.8917 |
| Mean | 0.8308 | 0.6539 | 0.88862 |

**+ Cleanlab Accuracy**

| Run | MNIST Roman | Flower 102 | MNIST Fashion |
|------|-------------|------------|---------------|
| 1 | 0.8683 | 0.6539 | 0.887 |
| 2 | 0.8339 | 0.6176 | 0.8904 |
| 3 | 0.8597 | | |
| 4 | 0.8105 | | |
| Mean | 0.8431 | | |

Parameters: `callbacks=[EarlyStopping(monitor='train_loss', patience=5)]`, plus `train_split=predefined_split(Dataset(valid_data, valid_labels))` for the dataset with a predefined validation set.
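The patience rule behind `EarlyStopping(monitor='train_loss', patience=5)` can be sketched as follows; this is a simplified stand-in for skorch's callback, which also supports a relative improvement threshold.

```python
class EarlyStopper:
    """Stop training once the monitored loss has gone `patience`
    consecutive epochs without improving on its best value."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, loss):
        if loss < self.best:
            self.best = loss      # new best -> reset the counter
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1  # no improvement this epoch
        return self.bad_epochs >= self.patience

# The loss improves twice, then plateaus; training stops after 5 flat epochs.
stopper = EarlyStopper(patience=5)
flags = [stopper.should_stop(l) for l in [1.0, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9]]
print(flags)  # [False, False, False, False, False, False, True]
```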

Notebooks to reproduce results:
