lowlypalace's Blog

GSoC Blog | Activeloop | Week 12

Published: 09/11/2022

What's Done

  • Created a tutorial showing how to find label errors in Hub datasets: Finding Label Issues in Image Classification Datasets
  • Completed a blog post, How Noisy Labels Impact ML Models. It covers why labeling errors happen, why they matter, and which tools and techniques can be used to overcome them. At the end, it shows how to use cleanlab to find label noise in Hub datasets.
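To give a feel for how label noise can be detected, here is a minimal pure-Python sketch of the thresholding idea behind Confident Learning. It is a simplified illustration, not cleanlab's actual implementation: the function name find_label_issues_sketch and the toy data are my own assumptions, and real workflows should use out-of-sample predicted probabilities from a trained model.

```python
from collections import defaultdict


def find_label_issues_sketch(labels, pred_probs):
    """Flag examples whose given label disagrees with a confident prediction.

    Simplified illustration only; cleanlab's real implementation differs.
    """
    n_classes = len(pred_probs[0])

    # Per-class threshold: mean predicted probability of class k over the
    # examples that are labeled k.
    sums = defaultdict(float)
    counts = defaultdict(int)
    for label, probs in zip(labels, pred_probs):
        sums[label] += probs[label]
        counts[label] += 1
    thresholds = [
        sums[k] / counts[k] if counts[k] else 1.0 for k in range(n_classes)
    ]

    issues = []
    for label, probs in zip(labels, pred_probs):
        # Classes the model is confident about for this example.
        confident = [k for k in range(n_classes) if probs[k] >= thresholds[k]]
        if confident:
            guess = max(confident, key=lambda k: probs[k])
            issues.append(guess != label)
        else:
            issues.append(False)
    return issues


labels = [0, 0, 0, 1, 1, 1]
pred_probs = [
    [0.9, 0.1],
    [0.8, 0.2],
    [0.1, 0.9],  # labeled 0, but the model strongly predicts 1
    [0.2, 0.8],
    [0.1, 0.9],
    [0.3, 0.7],
]
issues = find_label_issues_sketch(labels, pred_probs)
print(issues)
```

Only the third example is flagged, because it is the only one where a class other than the given label clears its own confidence threshold.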

Next Steps

  • Finalize the PR and address reviewers’ feedback.
  • Try to create and run unit tests.
  • Check if custom transform function works with the workflow.
  • Finalize the names of functions, such as find_mislabels, fix_issues, find_issues, add_issues_tensors.

I’ll do the following if I have extra time:

  • Add valid_transform parameter.
  • Make it possible to select specific tensors from validation set.
  • Add a message saying that the branch was checked out after adding tensors.
  • Add dataset health printout.
  • Try to pass x and y instead.

GSoC Blog | Activeloop | Week 11

Published: 09/11/2022

What’s Done

→ Updated API

import hub
from torchvision import transforms
from hub.integrations.cleanlab import clean_labels, create_tensors, clean_view
from hub.integrations import skorch

ds = hub.load("hub://ds")

tform = transforms.Compose(
    [
        transforms.Normalize((0.5,), (0.5,)),
    ]
)

transform = {"images": tform, "labels": None}

# Get scikit-learn compatible PyTorch module to pass into clean_labels as a classifier
model = skorch(dataset=ds, epochs=5, batch_size=16, transform=transform, tensors=[], valid_transform=valid_transform, skorch_kwargs=skorch_kwargs)

# Obtain a DataFrame with columns is_label_issue, label_quality and predicted_label 
label_issues = find_label_issues(

# Create label_issues tensor group on "labels" branch

# Get dataset view where only clean labels are present, and the rest are filtered out.
ds_clean = clean_view(ds)

→ Link to PR: https://github.com/activeloopai/Hub/pull/1821

Skorch Integration


  • Added support for providing a validation set for training: skorch(dataset=ds, valid_dataset=valid_ds).

  • Added keyword arguments that can be passed into modules to fine-tune the parameters for advanced users.

    • skorch_kwargs: arguments passed to the skorch NeuralNet constructor. Additionally, iterator_train__transform and iterator_valid__transform can be used to set parameters for the training and validation iterators.
  • Made passing in the images and labels tensors more explicit.

  • Modularized methods.

    • Separated the skorch module from cleanlab to make it easier to instantiate skorch even if you’re not using cleanlab downstream.
    • Further modularized skorch module into separate functions and modules.
    • Added utils functions in a separate file.
  • Added error-checking utils to check errors early.

    → Check that the dataset and valid_dataset passed in are Hub Datasets.

    → Check if the tensors’ htypes are supported for image classification tasks.
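As an illustration of the early htype check, here is a minimal sketch. It assumes the tensors are represented by a plain dictionary mapping tensor names to htype strings (a stand-in for a real Hub dataset), and check_tensor_htypes is a hypothetical name, not the function in the PR. The expected htypes 'image' and 'class_label' are the ones used for image classification.

```python
def check_tensor_htypes(tensor_htypes, images="images", labels="labels"):
    """Raise early if the chosen tensors are not an image/label pair.

    `tensor_htypes` is a stand-in mapping of tensor name -> htype string.
    """
    expected = {images: "image", labels: "class_label"}
    for name, htype in expected.items():
        if name not in tensor_htypes:
            raise KeyError(f"Tensor '{name}' not found in dataset.")
        if tensor_htypes[name] != htype:
            raise TypeError(
                f"Tensor '{name}' has htype '{tensor_htypes[name]}', "
                f"expected '{htype}'."
            )


# Passes silently for a valid image classification dataset.
check_tensor_htypes({"images": "image", "labels": "class_label"})

# Fails fast before any training starts.
try:
    check_tensor_htypes({"images": "image", "labels": "bbox"})
except TypeError as error:
    print(error)
```

Failing before training starts saves the user from waiting through several epochs only to hit an error at the end.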

Cleanlab Integration


  • Implemented a function to compute guessed_label, the label guessed by the classifier after pruning.
  • Added a pretrained flag that skips cross-validation when a pretrained model is used, so out-of-sample probabilities are computed faster with a single fit().
  • Instead of returning a tuple of numpy ndarrays (label_issues, label_quality_scores and predicted_labels), clean_labels() now returns a single label_issues dataframe with columns is_label_issue, label_quality, predicted_label.
  • Added keyword arguments that can be passed into modules to fine-tune the parameters for advanced users.
    • label_issues_kwargs can be passed to the cleanlab.filter.find_label_issues function.
    • label_quality_kwargs can be passed to the cleanlab.rank.get_label_quality_scores function.


  • Added the ability to select branch to commit to when creating tensors
  • Modularized methods.
    • create_tensors() is now a separate method that takes in label_issues dataframe or looks for label_issues in tensors to get a view where only clean labels are present and the rest are filtered out. This will now return commit_id.
    • Added utils functions in a separate file.
  • Added error-checking utils to check errors early.
    • Check early if a user has write access to the dataset before creating the tensors.
    • Check if label_issues dataframe columns have correct dtypes and are a subset of a dataset before appending them to tensors.


  • Added a method clean_view(ds) to get a dataset view where only clean labels are present, and the rest are filtered out. This can be useful to pass the clean dataset to downstream ML frameworks for training.


  • Created custom config for dependencies hub['cleanlab'].
  • Created common utils that are reused across modules.
  • Renamed some of the function and variable names to be more clear.
  • Clarified the docstrings parameters and improved readability.
  • Merged main branch and resolved conflicts.
  • Commented on the aspects I’m not sure about in my PR.
  • Ran tests on 10+ Activeloop image classification datasets (without creating tensors).
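The check on the label_issues table before appending it to tensors could look roughly like the sketch below. To keep it self-contained, it uses a plain dictionary of lists in place of a pandas DataFrame; validate_label_issues and REQUIRED_COLUMNS are hypothetical names, and the expected types are my assumption based on the column descriptions above.

```python
# Assumed column types: a boolean mask, a quality score, and a class index.
REQUIRED_COLUMNS = {
    "is_label_issue": bool,
    "label_quality": float,
    "predicted_label": int,
}


def validate_label_issues(label_issues, dataset_length):
    """Check column names, value types, and length before writing tensors."""
    for column, expected_type in REQUIRED_COLUMNS.items():
        if column not in label_issues:
            raise KeyError(f"Missing column '{column}'.")
        values = label_issues[column]
        if len(values) > dataset_length:
            raise ValueError(
                f"Column '{column}' has {len(values)} rows, "
                f"but the dataset has only {dataset_length} samples."
            )
        for value in values:
            # bool is a subclass of int in Python, so reject it explicitly
            # when an integer class index is expected.
            is_bool_as_int = expected_type is int and isinstance(value, bool)
            if not isinstance(value, expected_type) or is_bool_as_int:
                raise TypeError(
                    f"Column '{column}' expects {expected_type.__name__}, "
                    f"got {type(value).__name__}."
                )
    return True


label_issues = {
    "is_label_issue": [False, True, False],
    "label_quality": [0.97, 0.12, 0.88],
    "predicted_label": [3, 7, 1],
}
is_valid = validate_label_issues(label_issues, dataset_length=3)
```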

Next Steps


GSoC Blog | Activeloop | Week 10

Published: 09/11/2022

What’s Done 

→ Updated API

from torch.nn import CrossEntropyLoss
from torch.optim import SGD
from torchvision.models import resnet18
from hub.integrations.cleanlab import clean_labels

training_params = {
    'module': resnet18(),
    'criterion': CrossEntropyLoss,
    'optimizer': SGD,
    'epochs': 10,
    'optimizer_lr': 0.01,
    'device': "cpu",
    'folds': 5,
}

clean_labels(
    ds,
    training_params=training_params,
    verbose=True,
    tensors=['images', 'labels'],
    overwrite=False,
    num_workers=1,
    batch_size=1,
    shuffle=True,
    transform={},
    create_tensors=True,
)

→ Added create_tensors flag.

  • The create_tensors boolean flag confirms whether a user wants to append new label_issues tensors. If create_tensors is False, only the is_label_issues and label_quality_scores numpy arrays are returned. If True, the tensors is_label_issues and label_quality_scores are created, and both are also returned as numpy arrays.

→ Added support to provide validation set for training

  • clean_labels(ds_train, ds_valid)
  • No support yet to compute label errors for validation set

→ Made providing tensors names more explicit

→ Fixed some errors related to checking if an image tensor is RGB or Grayscale

→ Minor improvements (e.g. matching device in the core function rather than making it a required parameter)

What’s Next


→ Prune API

  • prune_labels(ds)
  • Instead of deleting samples, enable users to create an instance of the dataset that would only fetch correct samples when filling up batches?
    • Users could then simply run ds = ds[clean_idx] and use the clean dataset downstream.
  • Leave out pruning to the users and code it up in the blog post instead?
  • Create a new branch
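To make the ds = ds[clean_idx] idea concrete, here is a small sketch of deriving clean_idx from a boolean label-issue mask. A plain Python list stands in for the dataset, and clean_indices is a hypothetical helper; whether a Hub dataset accepts this index list directly is an assumption I have not verified here.

```python
def clean_indices(is_label_issue):
    """Return the positions of samples that were NOT flagged as label issues."""
    return [i for i, flagged in enumerate(is_label_issue) if not flagged]


# A plain list stands in for the dataset in this sketch.
samples = ["cat_0", "dog_1", "cat_2", "dog_3"]
is_label_issue = [False, True, False, False]

clean_idx = clean_indices(is_label_issue)
clean_samples = [samples[i] for i in clean_idx]
print(clean_idx, clean_samples)
```

Nothing is deleted in this approach: the flagged samples stay in the original dataset, and only the selection passed downstream changes.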

→ Create a tensor guessed_label to add labels guessed by the classifier after pruning.

  • Relabeling workflow on Activeloop?

→ Create custom config for pip install (e.g. pip install hub['cleanlab'])

→ Add a branch flag to switch to a different branch instead of committing on the current branch.

→ Add flags add_branch = True

→ Add support for bounding boxes, task = 'classification' or task = 'segmentation'

→ Raise error if not htype image

→ Add support for TensorFlow modules

→ Add optional cleanlab kwargs to pass down

→ Add optional skorch kwargs to pass down

→ Tests

  • Unit tests
  • Tests with Activeloop datasets

→ Make it possible to call skorch(ds)

→ Raise an error if the user doesn’t have write access


GSoC Blog | Activeloop | Week 9

Published: 09/11/2022

What’s Done 


  • Created an API entry point for cleaning labels in dataset.py.
    • Cleans the labels of the dataset and creates a set of tensors under label_issues group for the entire dataset.

    • API

      def clean_labels(
              module=None,
              criterion=None,
              optimizer=None,
              optimizer_lr: float = 0.01,
              device: str = "cpu",
              epochs: int = 10,
              folds: int = 5,
              verbose: bool = True,
              tensors: Optional[list] = None,
              dataloader_train_params: Optional[dict] = None,
              dataloader_valid_params: Optional[dict] = None,
              overwrite: bool = False,
              # skorch_kwargs: Optional[dict] = None,
      ):
          """
          Cleans the labels of the dataset. Computes out-of-sample predictions and uses the Confident Learning (CL) algorithm to clean the labels.
          Creates a set of tensors under the label_issues group for the entire dataset.

          Note:
              Currently, only the image classification task is supported. Therefore, the method accepts two tensors for the images and labels (e.g. ['images', 'labels']).
              The tensors can be specified in dataloader_train_params or tensors. Any PyTorch module can be used as a classifier.

          Args:
              module (class): A PyTorch torch.nn.Module module (class or instance). In general, the uninstantiated class should be passed, although instantiated modules will also work. Default is torchvision.models.resnet18(), which is a PyTorch ResNet-18 model.
              criterion (class): A PyTorch criterion. The uninitialized criterion (loss) used to optimize the module. Default is torch.nn.CrossEntropyLoss.
              optimizer (class): A PyTorch optimizer. The uninitialized optimizer (update rule) used to optimize the module. Default is torch.optim.SGD.
              optimizer_lr (float): The learning rate passed to the optimizer. Default is 0.01.
              device (str): A PyTorch device. The device on which the module and criterion are located. Default is "cpu".
              epochs (int): The number of epochs to train for each fit() call. Default is 10.
              folds (int): Sets the number of cross-validation folds used to compute out-of-sample probabilities for each example in the dataset. Default is 5.
              tensors (list): A list of tensor names that would be considered for cleaning (e.g. ['images', 'labels']).
              dataloader_train_params (dict): Keyword arguments to pass into torch.utils.data.DataLoader. Options that may especially impact accuracy include: shuffle, batch_size.
              dataloader_valid_params (dict): Keyword arguments to pass into torch.utils.data.DataLoader. Options that may especially impact accuracy include: shuffle, batch_size. If not provided, dataloader_train_params will be used with shuffle=False.
              overwrite (bool): If True, will overwrite label_issues tensors if they already exist. Default is False.
              skorch_kwargs (dict): Keyword arguments to pass into skorch.NeuralNet. Options that may especially impact accuracy include: ...

          Returns:
              label_issues: A boolean mask for the entire dataset where True represents a label issue and False represents an example that is confidently/accurately labeled.
              label_quality_scores: Label quality scores for each datapoint, where lower scores indicate labels less likely to be correct.
          """

Skorch Integration

  • Made skorch compatible with the Hub dataset format.
    • Added the integration skorch.py in hub/integrations/pytorch.
    • Created a class VisionClassifierNet that wraps the PyTorch Module in an sklearn interface.
    • Made skorch compatible with Hub’s PyTorch Dataloader.
    • Set the defaults for relevant skorch parameters such as module, criterion, optimizer.

Core Functions for Cleaning Labels

  • Created the component clean_labels.py in hub/core/experimental/labels.
    • Implemented core function clean_labels() which cleans the labels of a dataset.
      • Wraps a PyTorch instance in a sklearn classifier. Next, it runs cross-validation to get out-of-sample predicted probabilities for each example. Then, it finds label issues (boolean mask) and label quality scores (floats from 0 to 1) for each sample in the dataset. At the end, it creates tensors with label issues.
    • Implemented helper functions.
      • get_dataset_tensors() returns the tensors of a dataset. If a list of tensors is not provided, it will try to find them in the dataloader_train_params in the transform. If none of these are provided, it will iterate over the dataset tensors and return any tensors that match htype 'image' for images and htype 'class_label' for labels. Additionally, this function will also check if the dataset already has a label_issues group.
      • estimate_cv_predicted_probabilities() computes an out-of-sample predicted probability for every example in a dataset using cross validation.
      • append_label_issues_tensors() creates a group of tensors label_issues. After creating tensors, automatically commits the changes.
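The idea behind estimate_cv_predicted_probabilities() can be sketched in plain Python as follows. This is only an illustration under my own assumptions: PriorClassifier is a toy stand-in for the real skorch-wrapped model, the folds are contiguous and unshuffled for simplicity, and the helper names kfold_indices and cv_pred_probs are hypothetical.

```python
def kfold_indices(n, folds):
    """Split range(n) into `folds` contiguous, nearly equal held-out folds."""
    sizes = [n // folds + (1 if i < n % folds else 0) for i in range(folds)]
    start = 0
    for size in sizes:
        yield list(range(start, start + size))
        start += size


class PriorClassifier:
    """Toy stand-in for a real model: predicts training-set class frequencies."""

    def fit(self, X, y, n_classes):
        counts = [0] * n_classes
        for label in y:
            counts[label] += 1
        self.probs = [count / len(y) for count in counts]
        return self

    def predict_proba(self, X):
        return [list(self.probs) for _ in X]


def cv_pred_probs(X, y, n_classes, folds=5):
    """Predict each example with a model that never saw it during training."""
    n = len(X)
    pred_probs = [None] * n
    for held_out in kfold_indices(n, folds):
        held = set(held_out)
        train = [i for i in range(n) if i not in held]
        model = PriorClassifier().fit(
            [X[i] for i in train], [y[i] for i in train], n_classes
        )
        predictions = model.predict_proba([X[i] for i in held_out])
        for i, probs in zip(held_out, predictions):
            pred_probs[i] = probs
    return pred_probs


X = list(range(10))
y = [0] * 6 + [1] * 4
pred_probs = cv_pred_probs(X, y, n_classes=2, folds=5)
```

Because each fold is predicted by a model trained on the other folds, the probabilities for a sample are never influenced by that sample's own (possibly wrong) label, which is what the label-issue detection step relies on.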

GSoC Blog | Activeloop | Week 8

Published: 09/11/2022

What did you do this week?

This week, after drawing conclusions from my previous experiments, it was time to take all of the insights and the code and make cleanlab work with Hub datasets. After working on the integration for a few weeks, I created my draft PR.

What is coming up next?

As a next step, I will be finalizing the API structure, as well as adding some additional functionality to the feature.

Did you get stuck anywhere?

Not really.
