GSoC Blog | Activeloop | Week 9

Published: 09/11/2022

What’s Done 


  • Created an API entry point for cleaning labels in
    • Cleans the labels of the dataset and creates a set of tensors under the label_issues group for the entire dataset.

    • API

      def clean_labels(
              module = None,
              criterion = None,
              optimizer = None,
              optimizer_lr: float = 0.01,
              device: str = "cpu",
              epochs: int = 10,
              folds: int = 5,
              verbose: bool = True,
              tensors: Optional[list] = None,
              dataloader_train_params: Optional[dict] = None,
              dataloader_valid_params: Optional[dict] = None,
              overwrite: bool = False,
              # skorch_kwargs: Optional[dict] = None,
      ):
          """
          Cleans the labels of the dataset. Computes out-of-sample predictions and uses the Confident Learning (CL) algorithm to clean the labels.
          Creates a set of tensors under the label_issues group for the entire dataset.

          Note:
              Currently, only the image classification task is supported. Therefore, the method accepts two tensors, one for the images and one for the labels (e.g. ['images', 'labels']).
              The tensors can be specified in dataloader_train_params or tensors. Any PyTorch module can be used as a classifier.

          Args:
              module (class): A PyTorch torch.nn.Module module (class or instance). In general, the uninstantiated class should be passed, although instantiated modules will also work. Default is torchvision.models.resnet18(), which is a PyTorch ResNet-18 model.
              criterion (class): A PyTorch criterion. The uninitialized criterion (loss) used to optimize the module. Default is torch.nn.CrossEntropyLoss.
              optimizer (class): A PyTorch optimizer. The uninitialized optimizer (update rule) used to optimize the module. Default is torch.optim.SGD.
              optimizer_lr (float): The learning rate passed to the optimizer. Default is 0.01.
              device (str): A PyTorch device. The device on which the module and criterion are located. Default is "cpu".
              epochs (int): The number of epochs to train for each fit() call. Default is 10.
              folds (int): The number of cross-validation folds used to compute out-of-sample probabilities for each example in the dataset. Default is 5.
              tensors (list): A list of tensor names to consider for cleaning (e.g. ['images', 'labels']).
              dataloader_train_params (dict): Keyword arguments for the training dataloader. Options that may especially impact accuracy include: shuffle, batch_size.
              dataloader_valid_params (dict): Keyword arguments for the validation dataloader. Options that may especially impact accuracy include: shuffle, batch_size. If not provided, dataloader_train_params will be used with shuffle=False.
              overwrite (bool): If True, will overwrite the label_issues tensors if they already exist. Default is False.
              skorch_kwargs (dict): Keyword arguments to pass into skorch.NeuralNet. Options that may especially impact accuracy include: ...

          Returns:
              label_issues: A boolean mask for the entire dataset where True represents a label issue and False represents an example that is confidently/accurately labeled.
              label_quality_scores: Label quality scores for each datapoint, where lower scores indicate labels less likely to be correct.
          """
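
The docstring says that when dataloader_valid_params is not provided, dataloader_train_params is reused with shuffle=False. A minimal sketch of that fallback is shown here; the helper name resolve_valid_params is hypothetical and is not part of the Hub API, it only illustrates the rule as described.

```python
from typing import Optional


def resolve_valid_params(
    train_params: Optional[dict], valid_params: Optional[dict]
) -> dict:
    """Hypothetical helper: choose keyword arguments for the validation dataloader."""
    if valid_params is not None:
        # Explicit validation parameters win and are used unchanged.
        return dict(valid_params)
    # Otherwise reuse a copy of the training parameters with shuffling
    # disabled, as the docstring describes.
    resolved = dict(train_params or {})
    resolved["shuffle"] = False
    return resolved


train = {"batch_size": 32, "shuffle": True}
print(resolve_valid_params(train, None))
print(resolve_valid_params(train, {"batch_size": 64}))
```

Copying the dictionary rather than editing it in place keeps the caller's training parameters untouched.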

Skorch Integration

  • Made skorch compatible with the Hub dataset format.
    • Added the integration in hub/integrations/pytorch.
    • Created a class VisionClassifierNet that wraps the PyTorch Module in an sklearn interface.
    • Made skorch compatible with Hub’s PyTorch Dataloader.
    • Set the defaults for relevant skorch parameters such as module, criterion, optimizer.
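
To illustrate what "wrapping a module in an sklearn interface" means, here is a dependency-free sketch of the pattern. It is not the VisionClassifierNet class and it does not use skorch or PyTorch; FrequencyModel, train_on, forward and SklearnStyleClassifier are hypothetical names chosen for the example. It only shows the fit / predict_proba / predict surface, and the idea of accepting either an uninstantiated class or an instance.

```python
class FrequencyModel:
    """Hypothetical stand-in for a trainable module: predicts class frequencies."""

    def __init__(self, n_classes):
        self.n_classes = n_classes
        self.freqs = [1.0 / n_classes] * n_classes

    def train_on(self, inputs, targets):
        counts = [0] * self.n_classes
        for target in targets:
            counts[target] += 1
        self.freqs = [count / len(targets) for count in counts]

    def forward(self, inputs):
        # Same probability vector for every input; enough for a sketch.
        return [list(self.freqs) for _ in inputs]


class SklearnStyleClassifier:
    """Hypothetical wrapper exposing fit / predict_proba / predict."""

    def __init__(self, module, **module_kwargs):
        # Accept either an uninstantiated class or a ready-made instance.
        self.module = module
        self.module_kwargs = module_kwargs

    def fit(self, X, y):
        if isinstance(self.module, type):
            self.module_ = self.module(**self.module_kwargs)
        else:
            self.module_ = self.module
        self.module_.train_on(X, y)
        return self  # sklearn estimators return self from fit()

    def predict_proba(self, X):
        return self.module_.forward(X)

    def predict(self, X):
        # Index of the largest probability in each row.
        return [max(range(len(p)), key=p.__getitem__) for p in self.predict_proba(X)]


clf = SklearnStyleClassifier(FrequencyModel, n_classes=2).fit([[0], [1], [2]], [1, 1, 0])
print(clf.predict([[5]]))
```

Exposing this surface is what lets generic cross-validation code treat the wrapped model like any other classifier.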

Core Functions for Cleaning Labels

  • Created the component in hub/core/experimental/labels.
    • Implemented core function clean_labels() which cleans the labels of a dataset.
      • Wraps a PyTorch module in an sklearn-compatible classifier. Next, it runs cross-validation to get out-of-sample predicted probabilities for each example. Then, it finds label issues (a boolean mask) and label quality scores (floats from 0 to 1) for each sample in the dataset. Finally, it creates tensors holding the label issues.
    • Implemented helper functions.
      • get_dataset_tensors() returns the tensors of a dataset. If a list of tensors is not provided, it tries to find them in the transform given in dataloader_train_params. If neither is provided, it iterates over the dataset tensors and returns any tensors that match htype 'image' for images and htype 'class_label' for labels. This function also checks whether the dataset already has a label_issues group.
      • estimate_cv_predicted_probabilities() computes an out-of-sample predicted probability for every example in a dataset using cross-validation.
      • append_label_issues_tensors() creates a tensor group named label_issues. After creating the tensors, it automatically commits the changes.
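
The flow described above (cross-validation for out-of-sample probabilities, then a boolean mask of label issues plus quality scores) can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the Hub implementation: a toy nearest-centroid classifier stands in for the wrapped PyTorch module, the flagging rule is a reduced form of the Confident Learning idea (per-class thresholds from average self-confidence), and the quality score is the predicted probability of the given label. All function names here are hypothetical.

```python
import math


def nearest_centroid_proba(train_X, train_y, test_X, n_classes):
    """Toy classifier: softmax over negative squared distance to class centroids."""
    groups = {}
    for row, label in zip(train_X, train_y):
        groups.setdefault(label, []).append(row)
    centroids = {
        k: [sum(col) / len(rows) for col in zip(*rows)] for k, rows in groups.items()
    }
    probs = []
    for row in test_X:
        scores = [
            -sum((a - b) ** 2 for a, b in zip(row, centroids[k]))
            if k in centroids else float("-inf")
            for k in range(n_classes)
        ]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        probs.append([e / sum(exps) for e in exps])
    return probs


def cross_val_pred_probs(X, y, n_classes, folds=5):
    """Score each example with a model fitted on the other folds only."""
    pred_probs = [None] * len(X)
    for fold in range(folds):
        held_out = [i for i in range(len(X)) if i % folds == fold]
        train = [i for i in range(len(X)) if i % folds != fold]
        fold_probs = nearest_centroid_proba(
            [X[i] for i in train], [y[i] for i in train],
            [X[i] for i in held_out], n_classes,
        )
        for i, p in zip(held_out, fold_probs):
            pred_probs[i] = p
    return pred_probs


def find_label_issues(pred_probs, labels, n_classes):
    """Simplified Confident Learning rule; returns (issue mask, quality scores)."""
    # Threshold for class k: mean predicted probability of k over examples labeled k.
    thresholds = []
    for k in range(n_classes):
        own = [p[k] for p, label in zip(pred_probs, labels) if label == k]
        thresholds.append(sum(own) / len(own) if own else 1.0)
    issues, scores = [], []
    for p, given in zip(pred_probs, labels):
        confident = [k for k in range(n_classes) if p[k] >= thresholds[k]]
        # Flag the example when its most likely confident class differs from its label.
        best = max(confident, key=lambda k: p[k]) if confident else given
        issues.append(best != given)
        scores.append(p[given])  # lower score = label less likely to be correct
    return issues, scores


# Two well-separated clusters; the example at index 7 is deliberately mislabeled.
X = [[0.0], [0.1], [0.2], [0.3], [0.4], [5.0], [5.1], [5.2], [5.3], [5.4]]
y = [0, 0, 0, 0, 0, 1, 1, 0, 1, 1]
pred_probs = cross_val_pred_probs(X, y, n_classes=2, folds=5)
issues, scores = find_label_issues(pred_probs, y, n_classes=2)
print([i for i, flagged in enumerate(issues) if flagged])
```

The key property is that each example's probabilities come from a model that never saw that example's label, so a mislabeled example cannot vouch for itself.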