What’s Done ✅
API
- Created an API entry point for cleaning labels in `dataset.py`.
- Cleans the labels of the dataset and creates a set of tensors under the `label_issues` group for the entire dataset.
API

```python
def clean_labels(
    self,
    module=None,
    criterion=None,
    optimizer=None,
    optimizer_lr: int = 0.01,
    device: str = "cpu",
    epochs: int = 10,
    folds: int = 5,
    verbose: bool = True,
    tensors: Optional[list] = None,
    dataloader_train_params: Optional[dict] = None,
    dataloader_valid_params: Optional[dict] = None,
    overwrite: bool = False,
    # skorch_kwargs: Optional[dict] = None,
):
    """Cleans the labels of the dataset.

    Computes out-of-sample predictions and uses the Confident Learning (CL)
    algorithm to clean the labels. Creates a set of tensors under the
    `label_issues` group for the entire dataset.

    Note: Currently, only the image classification task is supported.
    Therefore, the method accepts two tensors, for the images and the labels
    (e.g. ['images', 'labels']). The tensors can be specified in
    `dataloader_train_params` or `tensors`. Any PyTorch module can be used
    as a classifier.

    Args:
        module (class): A PyTorch `torch.nn.Module` (class or instance). In
            general, the uninstantiated class should be passed, although
            instantiated modules will also work. Default is
            `torchvision.models.resnet18()`, a PyTorch ResNet-18 model.
        criterion (class): A PyTorch criterion. The uninitialized criterion
            (loss) used to optimize the module. Default is
            `torch.nn.CrossEntropyLoss`.
        optimizer (class): A PyTorch optimizer. The uninitialized optimizer
            (update rule) used to optimize the module. Default is
            `torch.optim.SGD`.
        optimizer_lr (int): The learning rate passed to the optimizer.
            Default is 0.01.
        device (str): A PyTorch device. The device on which the module and
            criterion are located. Default is "cpu".
        epochs (int): The number of epochs to train for each fit() call.
            Default is 10.
        tensors (list): A list of tensor names that would be considered for
            cleaning (e.g. ['images', 'labels']).
        dataloader_train_params (dict): Keyword arguments to pass into
            torch.utils.data.DataLoader. Options that may especially impact
            accuracy include: shuffle, batch_size.
        dataloader_valid_params (dict): Keyword arguments to pass into
            torch.utils.data.DataLoader. Options that may especially impact
            accuracy include: shuffle, batch_size. If not provided,
            dataloader_train_params will be used with shuffle=False.
        overwrite (bool): If True, overwrites the label_issues tensors if
            they already exist. Default is False.
        folds (int): Sets the number of cross-validation folds used to
            compute out-of-sample probabilities for each example in the
            dataset. Default is 5.
        skorch_kwargs (dict): Keyword arguments to pass into
            skorch.NeuralNet. Options that may especially impact accuracy
            include: ...

    Returns:
        label_issues: A boolean mask for the entire dataset where True
            represents a label issue and False represents an example that
            is confidently/accurately labeled.
        label_quality_scores: Label quality scores for each datapoint,
            where lower scores indicate labels less likely to be correct.
    """
```
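As a sketch of how the two return values could be consumed downstream, the snippet below uses made-up toy values shaped per the docstring (a boolean mask plus per-example quality scores); `clean_labels` itself is not called here:

```python
import numpy as np

# Hypothetical outputs of clean_labels(): a boolean issue mask and
# per-example quality scores in [0, 1] (lower = more likely mislabeled).
label_issues = np.array([False, True, False, False, True])
label_quality_scores = np.array([0.95, 0.12, 0.88, 0.77, 0.05])

# Indices of examples flagged as label issues.
issue_indices = np.flatnonzero(label_issues)

# Rank the flagged examples so the most suspicious labels come first.
ranked = issue_indices[np.argsort(label_quality_scores[issue_indices])]

print(issue_indices.tolist())  # [1, 4]
print(ranked.tolist())         # [4, 1]
```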
Skorch Integration
- Made `skorch` compatible with the Hub dataset format.
- Added the integration file `skorch.py` in `hub/integrations/pytorch`.
- Created a class `VisionClassifierNet` that wraps the PyTorch module in an sklearn interface.
- Made `skorch` compatible with Hub's PyTorch dataloader.
- Set the defaults for relevant `skorch` parameters such as `module`, `criterion`, and `optimizer`.
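To make the "sklearn interface" point concrete, here is a minimal, hypothetical stand-in showing the `fit`/`predict_proba`/`predict` surface such a wrapper exposes. The real `VisionClassifierNet` delegates these calls to skorch and a PyTorch module; this sketch substitutes a trivial class-prior "model" so it runs standalone:

```python
from collections import Counter

class VisionClassifierNetSketch:
    """Illustrative stand-in for VisionClassifierNet: exposes the
    sklearn-style estimator surface that cross-validation utilities
    expect, with the skorch/PyTorch internals stubbed out."""

    def __init__(self, n_classes):
        self.n_classes = n_classes
        self.priors_ = None

    def fit(self, X, y):
        # The real wrapper trains the PyTorch module here; this sketch
        # just records class frequencies as a trivial "model".
        counts = Counter(y)
        total = len(y)
        self.priors_ = [counts.get(c, 0) / total for c in range(self.n_classes)]
        return self  # sklearn convention: fit returns self

    def predict_proba(self, X):
        # One probability row per input; here simply the class priors.
        return [list(self.priors_) for _ in X]

    def predict(self, X):
        best = max(range(self.n_classes), key=lambda c: self.priors_[c])
        return [best for _ in X]

net = VisionClassifierNetSketch(n_classes=2).fit([[0], [1], [2], [3]], [0, 0, 0, 1])
print(net.predict_proba([[5]]))  # [[0.75, 0.25]]
```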
Core Functions for Cleaning Labels
- Created the component `clean_labels.py` in `hub/core/experimental/labels`.
- Implemented the core function `clean_labels()`, which cleans the labels of a dataset. It wraps a PyTorch module in an sklearn classifier, then runs cross-validation to get out-of-sample predicted probabilities for each example. It then finds label issues (a boolean mask) and label quality scores (floats from 0 to 1) for each sample in the dataset. Finally, it creates tensors with the label issues.
- Implemented helper functions:
  - `get_dataset_tensors()` returns the tensors of a dataset. If a list of tensors is not provided, it tries to find them in the transform of `dataloader_train_params`. If neither is provided, it iterates over the dataset tensors and returns any tensors that match htype `'image'` for images and htype `'class_label'` for labels. Additionally, this function checks whether the dataset already has a `label_issues` group.
  - `estimate_cv_predicted_probabilities()` computes an out-of-sample predicted probability for every example in a dataset using cross-validation.
  - `append_label_issues_tensors()` creates a group of tensors `label_issues` and automatically commits the changes after creating the tensors.
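The htype-based fallback described for `get_dataset_tensors()` can be sketched as follows. The dataset is mocked as a plain mapping of tensor name to htype, and the helper name is illustrative; the real function inspects Hub tensor metadata:

```python
def find_tensors_by_htype(tensor_htypes):
    """Pick the first tensor whose htype is 'image' and the first whose
    htype is 'class_label', mirroring the documented fallback behavior."""
    images = labels = None
    for name, htype in tensor_htypes.items():
        if htype == "image" and images is None:
            images = name
        elif htype == "class_label" and labels is None:
            labels = name
    if images is None or labels is None:
        raise ValueError("dataset must contain 'image' and 'class_label' tensors")
    return [images, labels]

print(find_tensors_by_htype({"images": "image", "labels": "class_label", "boxes": "bbox"}))
```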
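Putting the pieces together, the out-of-sample probability step and the issue mask can be sketched in plain NumPy. A toy class-prior classifier stands in for the wrapped PyTorch model, and the plain-argmax issue rule below is a simplification of Confident Learning's calibrated per-class thresholds:

```python
import numpy as np

class PriorClassifier:
    """Toy stand-in for the wrapped model: predicts training-set class
    frequencies. Any estimator with fit/predict_proba works here."""
    def __init__(self, n_classes):
        self.n_classes = n_classes
    def fit(self, X, y):
        self.priors_ = np.bincount(y, minlength=self.n_classes) / len(y)
        return self
    def predict_proba(self, X):
        return np.tile(self.priors_, (len(X), 1))

def out_of_sample_probs(X, y, n_classes, folds=5):
    # K-fold cross-validation: each example is scored by a model that
    # never saw it during training.
    n = len(y)
    probs = np.zeros((n, n_classes))
    idx = np.arange(n)
    for fold in range(folds):
        valid = idx % folds == fold          # held-out split
        model = PriorClassifier(n_classes).fit(X[~valid], y[~valid])
        probs[valid] = model.predict_proba(X[valid])
    return probs

def label_issue_mask(y, probs):
    # Simplified rule: flag an example when the model's most likely class
    # disagrees with the given label.
    return probs.argmax(axis=1) != y

X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
probs = out_of_sample_probs(X, y, n_classes=2)
issues = label_issue_mask(y, probs)
quality = probs[np.arange(len(y)), y]  # self-confidence quality score
```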