This week, I worked on running the experiments on a variety of datasets. Here are a few experiments that I did to benchmark cleanlab performance on three different datasets (MNIST Roman (DCAI), Flower 102, Fashion MNIST). For all experiments, I used a fixed model and applied resizing and normalization to the original images.

What do 1️⃣, 2️⃣, 3️⃣, 4️⃣ mean? I re-ran the entire fitting with a few random seeds to get an estimate of the variance between the accuracies of the baseline and cleanlab. The seeds are listed below, and a sketch of the per-seed setup follows the list.
- 1️⃣ = `seed(0)`
- 2️⃣ = `seed(1)`
- 3️⃣ = `seed(123)`
- 4️⃣ = `seed(42)`
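The full training code lives in the notebooks, so the snippet below is only a minimal sketch of one seeded baseline-vs-cleanlab run, assuming a skorch-wrapped ResNet-18 and cleanlab's `CleanLearning` wrapper. The data arrays (`X_train`, `y_train`, `X_test`, `y_test`), the mapping of `train_shuffle`/`test_shuffle` onto skorch's iterator flags, and a recent skorch version (whose `predict_proba` returns softmax probabilities when the criterion is `CrossEntropyLoss`) are all assumptions here, not necessarily what the notebooks do.

```python
import torch
import torch.nn as nn
from torchvision import models
from skorch import NeuralNetClassifier
from cleanlab.classification import CleanLearning
from sklearn.metrics import accuracy_score

def make_net(seed, num_classes=10, batch_size=8, epochs=10):
    torch.manual_seed(seed)             # 1️⃣-4️⃣ are just different values of this seed
    resnet = models.resnet18()
    resnet.fc = nn.Linear(resnet.fc.in_features, num_classes)
    return NeuralNetClassifier(
        resnet,
        criterion=nn.CrossEntropyLoss,  # the modified ResNet outputs raw logits
        max_epochs=epochs,
        batch_size=batch_size,
        train_split=None,               # no internal validation in the first runs
        iterator_train__shuffle=True,   # "train_shuffle = True"
        iterator_valid__shuffle=False,  # "test_shuffle = False"
        device="cuda" if torch.cuda.is_available() else "cpu",
    )

# X_train, y_train, X_test, y_test are assumed to be prepared, transformed arrays.
baseline = make_net(seed=0).fit(X_train, y_train)
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))

# + cleanlab: cross-validates the same classifier, flags likely label issues,
# prunes them, and refits on the cleaned training data.
cleaner = CleanLearning(make_net(seed=0))
cleaner.fit(X_train, y_train)
cleanlab_acc = accuracy_score(y_test, cleaner.predict(X_test))
```

The seed is set before the network is built so that the baseline run and the cleanlab run start from the same initialization.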
First Training Run
| | Roman MNIST (DCAI) | Flower 102 | Fashion MNIST | KMNIST | Describable Textures Dataset |
|---|---|---|---|---|---|
| <mark>Baseline Accuracy</mark> | 1️⃣ 0.7945<br>2️⃣ 0.8511<br>3️⃣ 0.7699 | 1️⃣ 0.6568<br>2️⃣ 0.6176<br>3️⃣ 0.6274 | 1️⃣ 0.8958<br>2️⃣ 0.8944<br>3️⃣ 0.8987 | | |
| <mark>+ Cleanlab Accuracy</mark> | 1️⃣ 0.7933 → -0.0012 ⬇️<br>2️⃣ 0.8031 → -0.048 ⬇️<br>3️⃣ 0.7109 → -0.059 ⬇️ | 1️⃣ 0.5421 → -0.1147 ⬇️<br>2️⃣ 0.5441 → -0.0735 ⬇️<br>3️⃣ 0.5647 → -0.0627 ⬇️ | 1️⃣ 0.8992 → 0.0034 ⬆️<br>2️⃣ 0.8951 → 0.0007 ⬆️<br>3️⃣ 0.8866 → -0.0121 ⬇️ | | |
| Parameters | batch_size = 8<br>model = resnet18()<br>train_shuffle = True<br>test_shuffle = False<br>train_split = None<br>epochs = 10 | batch_size = 16<br>model = resnet18()<br>train_shuffle = True<br>test_shuffle = False<br>train_split = None<br>epochs = 10 | batch_size = 32<br>model = resnet18()<br>train_shuffle = True<br>test_shuffle = False<br>train_split = None<br>epochs = 10 | batch_size = 64<br>model = resnet18()<br>train_shuffle = True<br>test_shuffle = False<br>train_split = None<br>epochs = 10 | batch_size = 32<br>model = resnet18()<br>train_shuffle = True<br>test_shuffle = False<br>train_split = None<br>epochs = 10 |
| Transform | Resize((224, 224)),<br>ToTensor(),<br>Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]) | Resize((224, 224)),<br>ToTensor(),<br>Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]) | ToTensor(),<br>Normalize((0.), (1.)) | ToTensor(),<br>Normalize((0.), (1.)) | Resize((300, 300)),<br>ToTensor(),<br>Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]) |
| Network | resnet = models.resnet18()<br>resnet.fc = nn.Linear(resnet.fc.in_features, 10) | resnet = models.resnet18()<br>resnet.fc = nn.Linear(resnet.fc.in_features, 102) | resnet = models.resnet18()<br>resnet.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)<br>resnet.fc = nn.Linear(resnet.fc.in_features, 10) | resnet = models.resnet18()<br>resnet.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)<br>resnet.fc = nn.Linear(resnet.fc.in_features, 10) | resnet = models.resnet18()<br>resnet.fc = nn.Linear(resnet.fc.in_features, 47) |
| Number of Classes | 10 | 102 | 10 | 10 | 47 |
| Image Dimensions | 224x224 | 224x224 | 28x28 | 28x28 | 300x300 |
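A note on the Network row: ResNet-18's first convolution expects 3-channel input, so for the grayscale datasets (Fashion MNIST, KMNIST) it is replaced with a single-channel one and the classification head is resized. A self-contained version of that cell:

```python
import torch.nn as nn
from torchvision import models

# Network cell for Fashion MNIST / KMNIST: swap conv1 for a 1-channel convolution
# so the 28x28 grayscale images can be fed in directly, and resize the final
# fully connected layer to the 10 dataset classes.
resnet = models.resnet18()
resnet.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
resnet.fc = nn.Linear(resnet.fc.in_features, 10)
```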
Training with 20 Epochs
In the results below, I used `epochs = 20` instead of `epochs = 10`. The rest of the parameters (e.g. Network, Transform) are unchanged.
| | MNIST Roman | Flower 102 | MNIST Fashion | KMNIST | Describable Textures Dataset |
|---|---|---|---|---|---|
| <mark>Baseline Accuracy</mark> | 1️⃣ 0.6875<br>2️⃣ 0.7736<br>3️⃣ 0.6617<br>4️⃣ 0.7945<br>Mean = 0.7293 | 1️⃣ 0.5421<br>2️⃣ 0.5617<br>3️⃣ 0.6578<br>4️⃣ 0.6294 | 1️⃣ 0.891<br>2️⃣ 0.8977<br>3️⃣ 0.8977 | | |
| <mark>+ Cleanlab Accuracy</mark> | 1️⃣ 0.7257 → 0.0382 ⬆️<br>2️⃣ 0.8400 → 0.0664 ⬆️<br>3️⃣ 0.8511 → 0.1894 ⬆️<br>4️⃣ 0.8757 → 0.0812 ⬆️<br>Mean = 0.8231<br>Mean Difference = 0.0938 ⬆️ | 1️⃣ 0.6117 → 0.0696 ⬆️<br>2️⃣ 0.6254 → 0.0833 ⬆️<br>3️⃣ 0.5598 → -0.098 ⬇️<br>4️⃣ 0.5705 → -0.0589 ⬇️ | 1️⃣ 0.8982<br>3️⃣ 0.897 | | |
| Parameters | batch_size = 8<br>model = resnet18()<br>train_shuffle = True<br>test_shuffle = False<br>train_split = None<br>epochs = 20 | batch_size = 16<br>model = resnet18()<br>train_shuffle = True<br>test_shuffle = False<br>train_split = None<br>epochs = 20 | batch_size = 32<br>model = resnet18()<br>train_shuffle = True<br>test_shuffle = False<br>train_split = None<br>epochs = 20 | batch_size = 64<br>model = resnet18()<br>train_shuffle = True<br>test_shuffle = False<br>train_split = None<br>epochs = 20 | |
Training with 30 Epochs
In the results below, I used `epochs = 30` instead of `epochs = 20`. The rest of the parameters (e.g. Network, Transform) are unchanged.
| | MNIST Roman | Flower 102 | MNIST Fashion | KMNIST | Describable Textures Dataset |
|---|---|---|---|---|---|
| <mark>Baseline Accuracy</mark> | 0.8425<br>0.8831<br>0.8769<br>0.8560<br>Mean = 0.8646 | | | | |
| <mark>+ Cleanlab Accuracy</mark> | 0.8646 → 0.0221<br>0.8228 → -0.0602<br>0.8696 → -0.0073<br>0.8720 → 0.0159<br>Mean = 0.8573<br>Mean Difference = -0.0073 | | | | |
| Parameters | batch_size = 8<br>model = resnet18()<br>train_shuffle = True<br>test_shuffle = False<br>train_split = None<br>epochs = 30 | batch_size = 16<br>model = resnet18()<br>train_shuffle = True<br>test_shuffle = False<br>train_split = None<br>epochs = 30 | batch_size = 32<br>model = resnet18()<br>train_shuffle = True<br>test_shuffle = False<br>train_split = None<br>epochs = 30 | batch_size = 64<br>model = resnet18()<br>train_shuffle = True<br>test_shuffle = False<br>train_split = None<br>epochs = 30 | |
Training with Resnet50
In the results below, all of the parameters stay the same as in the run above, but this time I changed the network to `resnet50()` instead of `resnet18()`. Epochs are also back to `epochs = 20`.
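Concretely, only the backbone constructor changes; a short sketch for the 10-class MNIST Roman column (the other columns adjust the output size as before):

```python
import torch.nn as nn
from torchvision import models

# ResNet-50's final layer has 2048 input features (vs. 512 for ResNet-18), so
# reading resnet.fc.in_features keeps the head replacement identical to before.
resnet = models.resnet50()
resnet.fc = nn.Linear(resnet.fc.in_features, 10)  # 10 classes for MNIST Roman
```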
| | MNIST Roman | Flower 102 | MNIST Fashion | Describable Textures Dataset |
|---|---|---|---|---|
| <mark>Baseline Accuracy</mark> | 1️⃣ 0.7835<br>2️⃣ 0.7589<br>3️⃣ 0.8068<br>4️⃣ 0.8560 | | | |
| <mark>+ Cleanlab Accuracy</mark> | 1️⃣ 0.8044<br>2️⃣ 0.8560<br>3️⃣ 0.8376<br>4️⃣ 0.7859 | | | |
| Parameters | batch_size = 8<br>model = resnet50()<br>train_shuffle = True<br>test_shuffle = False<br>train_split = None<br>epochs = 20 | batch_size = 16<br>model = resnet50()<br>train_shuffle = True<br>test_shuffle = False<br>train_split = None<br>epochs = 20 | batch_size = 32<br>model = resnet50()<br>train_shuffle = True<br>test_shuffle = False<br>train_split = None<br>epochs = 20 | batch_size = 64<br>model = resnet50()<br>train_shuffle = True<br>test_shuffle = False<br>train_split = None<br>epochs = 20 |
Training with Validation Set
In the results below, all of the parameters stay the same as in the run above; however, this time I trained with a validation set, so 20% of the dataset is used for internal training validation. For datasets that come with their own validation set, that set is used instead. I also set the model back to `resnet18()`, as it seems to give better baseline accuracies across the three datasets.
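Roughly, the two `train_split` settings in the Parameters row map onto skorch as in the sketch below; `valid_data`/`valid_labels` are placeholders for a dataset's own validation split.

```python
from skorch.dataset import Dataset, ValidSplit
from skorch.helper import predefined_split

# Datasets without an official validation split: let skorch hold out 20% of the
# training data internally (cv=5 means a 1/5 validation fold).
internal_split = ValidSplit(cv=5, stratified=False)

# Datasets that ship their own validation set (e.g. Flower 102): wrap it and
# hand it to skorch directly. valid_data / valid_labels are placeholders here.
flower_split = predefined_split(Dataset(valid_data, valid_labels))

# Either object is then passed to NeuralNetClassifier via train_split=...
```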
| | MNIST Roman | Flower 102 | MNIST Fashion | Describable Textures Dataset |
|---|---|---|---|---|
| <mark>Baseline Accuracy</mark> | 1️⃣ 0.6432<br>2️⃣ 0.6445<br>3️⃣ 0.6777<br>4️⃣ 0.6383<br>Mean = 0.6509 | 0.6421<br>0.6578<br>0.6823<br>0.6372<br>Mean = 0.6549 | 0.8974<br>0.891<br>0.8906<br>0.894<br>Mean = 0.89325 | 0.9425<br>0.9439<br>0.9436<br>0.9364<br>Mean = 0.9416 |
| <mark>+ Cleanlab Accuracy</mark> | 1️⃣ 0.5584<br>2️⃣ 0.5879<br>3️⃣ 0.6076<br>4️⃣ 0.4587<br>Mean = 0.5531 | | | |
| Parameters | batch_size = 8<br>model = resnet18()<br>train_shuffle = True<br>test_shuffle = False<br>train_split=ValidSplit(cv=5, stratified=False)<br>valid_shuffle = False<br>epochs = 20 | batch_size = 16<br>model = resnet18()<br>train_shuffle = True<br>test_shuffle = False<br>train_split=predefined_split(Dataset(valid_data, valid_labels))<br>valid_shuffle = False<br>epochs = 20 | batch_size = 32<br>model = resnet18()<br>train_shuffle = True<br>test_shuffle = False<br>train_split=ValidSplit(cv=5, stratified=False)<br>valid_shuffle = False<br>epochs = 20 | batch_size = 64<br>model = resnet18()<br>train_shuffle = True<br>test_shuffle = False<br>train_split=ValidSplit(cv=5, stratified=False)<br>valid_shuffle = False<br>epochs = 20 |
Training with Validation Set (Stratified Sampling)
Using an arbitrary random seed can result in large differences between the training and validation set distributions, and these differences can have unintended downstream consequences in the modeling process. For example, the proportion of a given digit can be much higher in the training set than in the validation set. To overcome this, I'm using stratified sampling (splitting so that each class appears in the training and validation sets in the same proportion) to create the validation set for the datasets where one isn't available by default (e.g. MNIST Roman, MNIST Fashion, KMNIST).
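Relative to the previous section, only the split object changes; a minimal sketch:

```python
from skorch.dataset import ValidSplit

# Stratified 80/20 split: the 20% validation fold keeps the same class
# proportions as the training data instead of being a purely random holdout.
# The object is passed to NeuralNetClassifier via train_split=... as before.
train_split = ValidSplit(cv=5, stratified=True)
```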
| | MNIST Roman | Flower 102 | MNIST Fashion | KMNIST | Describable Textures Dataset |
|---|---|---|---|---|---|
| <mark>Baseline Accuracy</mark> | 0.7515<br>0.6998<br>0.7958<br>0.8610<br>Mean = 0.7770 | N/A | 0.8928<br>0.8948<br>0.8969<br>0.895<br>Mean = 0.894875 | | N/A |
| <mark>+ Cleanlab Accuracy</mark> | 0.6027 → -0.1488<br>0.8228 → 0.1230<br>0.8130 → 0.0172<br>0.6900 → -0.1709<br>Mean = 0.7321<br>Mean Difference = -0.0448 | | | | |
| Parameters | batch_size = 8<br>model = resnet18()<br>train_shuffle = True<br>test_shuffle = False<br>train_split=ValidSplit(cv=5, stratified=True)<br>valid_shuffle = False<br>epochs = 20 | | batch_size = 32<br>model = resnet18()<br>train_shuffle = True<br>test_shuffle = False<br>train_split=ValidSplit(cv=5, stratified=True)<br>valid_shuffle = False<br>epochs = 20 | batch_size = 64<br>model = resnet18()<br>train_shuffle = True<br>test_shuffle = False<br>train_split=ValidSplit(cv=5, stratified=True)<br>valid_shuffle = False<br>epochs = 20 | |
Training with Early Stopping
| | MNIST Roman | Flower 102 | MNIST Fashion | Describable Textures Dataset |
|---|---|---|---|---|
| <mark>Baseline Accuracy</mark> | 0.8327<br>0.8388<br>0.7921<br>0.8597<br>Mean = 0.8308 | 0.6705<br>0.6578<br>0.6558<br>0.6313<br>Mean = 0.6539 | 0.8856<br>0.8916<br>0.8856<br>0.8917<br>Mean = 0.88862 | |
| <mark>+ Cleanlab Accuracy</mark> | 0.8683<br>0.8339<br>0.8597<br>0.8105<br>Mean = 0.8431 | 0.6539<br>0.6176 | 0.887<br>0.8904 | |
| Parameters | callbacks=[EarlyStopping(monitor='train_loss', patience=5)] | callbacks=[EarlyStopping(monitor='train_loss', patience=5)]<br>train_split=predefined_split(Dataset(valid_data, valid_labels)) | | |
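For completeness, the early-stopping setup from the Parameters row as a self-contained snippet; it is attached to the same skorch net as before, and Flower 102 additionally keeps its predefined validation split.

```python
from skorch.callbacks import EarlyStopping

# Stop training once train_loss has not improved for 5 consecutive epochs;
# the list is passed to NeuralNetClassifier via callbacks=[...].
callbacks = [EarlyStopping(monitor='train_loss', patience=5)]

# For Flower 102, train_split=predefined_split(Dataset(valid_data, valid_labels))
# is also set, exactly as in the validation-set runs above (valid_data and
# valid_labels are placeholders for that dataset's own validation split).
```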
Notebooks to reproduce results: