What’s Done ✅
API
- Created an API entry point for cleaning labels in `dataset.py`.
- Cleans the labels of the dataset and creates a set of tensors under the `label_issues` group for the entire dataset.
API

```python
def clean_labels(
    self,
    module=None,
    criterion=None,
    optimizer=None,
    optimizer_lr: int = 0.01,
    device: str = "cpu",
    epochs: int = 10,
    folds: int = 5,
    verbose: bool = True,
    tensors: Optional[list] = None,
    dataloader_train_params: Optional[dict] = None,
    dataloader_valid_params: Optional[dict] = None,
    overwrite: bool = False,
    # skorch_kwargs: Optional[dict] = None,
):
    """Cleans the labels of the dataset.

    Computes out-of-sample predictions and uses the Confident Learning (CL)
    algorithm to clean the labels. Creates a set of tensors under the
    `label_issues` group for the entire dataset.

    Note: Currently, only the image classification task is supported.
    Therefore, the method accepts two tensors, for the images and the labels
    (e.g. ['images', 'labels']). The tensors can be specified in
    `dataloader_train_params` or `tensors`. Any PyTorch module can be used
    as a classifier.

    Args:
        module (class): A PyTorch `torch.nn.Module` (class or instance). In
            general, the uninstantiated class should be passed, although
            instantiated modules will also work. Default is
            `torchvision.models.resnet18()`, a PyTorch ResNet-18 model.
        criterion (class): A PyTorch criterion. The uninitialized criterion
            (loss) used to optimize the module. Default is
            `torch.nn.CrossEntropyLoss`.
        optimizer (class): A PyTorch optimizer. The uninitialized optimizer
            (update rule) used to optimize the module. Default is
            `torch.optim.SGD`.
        optimizer_lr (int): The learning rate passed to the optimizer.
            Default is 0.01.
        device (str): A PyTorch device. The device on which the module and
            criterion are located. Default is "cpu".
        epochs (int): The number of epochs to train for each fit() call.
            Default is 10.
        tensors (list): A list of tensor names that would be considered for
            cleaning (e.g. ['images', 'labels']).
        dataloader_train_params (dict): Keyword arguments to pass into
            torch.utils.data.DataLoader. Options that may especially impact
            accuracy include: shuffle, batch_size.
        dataloader_valid_params (dict): Keyword arguments to pass into
            torch.utils.data.DataLoader. Options that may especially impact
            accuracy include: shuffle, batch_size. If not provided,
            dataloader_train_params will be used with shuffle=False.
        overwrite (bool): If True, overwrites the label_issues tensors if
            they already exist. Default is False.
        folds (int): Sets the number of cross-validation folds used to
            compute out-of-sample probabilities for each example in the
            dataset. Default is 5.
        skorch_kwargs (dict): Keyword arguments to pass into
            skorch.NeuralNet. Options that may especially impact accuracy
            include: ...

    Returns:
        label_issues: A boolean mask for the entire dataset where True
            represents a label issue and False represents an example that
            is confidently/accurately labeled.
        label_quality_scores: Label quality scores for each datapoint,
            where lower scores indicate labels less likely to be correct.
    """
```
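As a sketch of how the two return values could be consumed downstream, the snippet below uses made-up toy values shaped per the docstring (a boolean mask plus per-example quality scores); `clean_labels` itself is not called here:

```python
import numpy as np

# Hypothetical outputs of clean_labels(): a boolean issue mask and
# per-example quality scores in [0, 1] (lower = more likely mislabeled).
label_issues = np.array([False, True, False, False, True])
label_quality_scores = np.array([0.95, 0.12, 0.88, 0.77, 0.05])

# Indices of examples flagged as label issues.
issue_indices = np.flatnonzero(label_issues)

# Rank the flagged examples so the most suspicious labels come first.
ranked = issue_indices[np.argsort(label_quality_scores[issue_indices])]

print(issue_indices.tolist())  # [1, 4]
print(ranked.tolist())         # [4, 1]
```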
Skorch Integration
- Made `skorch` compatible with the Hub dataset format.
- Added the integration file `skorch.py` in `hub/integrations/pytorch`.
- Created a class `VisionClassifierNet` that wraps the PyTorch module in an sklearn interface.
- Made `skorch` compatible with Hub's PyTorch dataloader.
- Set the defaults for relevant `skorch` parameters such as `module`, `criterion`, and `optimizer`.
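To make the "sklearn interface" point concrete, here is a minimal, hypothetical stand-in showing the `fit`/`predict_proba`/`predict` surface such a wrapper exposes. The real `VisionClassifierNet` delegates these calls to skorch and a PyTorch module; this sketch substitutes a trivial class-prior "model" so it runs standalone:

```python
from collections import Counter

class VisionClassifierNetSketch:
    """Illustrative stand-in for VisionClassifierNet: exposes the
    sklearn-style estimator surface that cross-validation utilities
    expect, with the skorch/PyTorch internals stubbed out."""

    def __init__(self, n_classes):
        self.n_classes = n_classes
        self.priors_ = None

    def fit(self, X, y):
        # The real wrapper trains the PyTorch module here; this sketch
        # just records class frequencies as a trivial "model".
        counts = Counter(y)
        total = len(y)
        self.priors_ = [counts.get(c, 0) / total for c in range(self.n_classes)]
        return self  # sklearn convention: fit returns self

    def predict_proba(self, X):
        # One probability row per input; here simply the class priors.
        return [list(self.priors_) for _ in X]

    def predict(self, X):
        best = max(range(self.n_classes), key=lambda c: self.priors_[c])
        return [best for _ in X]

net = VisionClassifierNetSketch(n_classes=2).fit([[0], [1], [2], [3]], [0, 0, 0, 1])
print(net.predict_proba([[5]]))  # [[0.75, 0.25]]
```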
Core Functions for Cleaning Labels
- Created the component `clean_labels.py` in `hub/core/experimental/labels`.
- Implemented the core function `clean_labels()`, which cleans the labels of a dataset. It wraps a PyTorch module in an sklearn classifier, then runs cross-validation to get out-of-sample predicted probabilities for each example. It then finds label issues (a boolean mask) and label quality scores (floats from 0 to 1) for each sample in the dataset. Finally, it creates tensors with the label issues.
- Implemented helper functions:
  - `get_dataset_tensors()` returns the tensors of a dataset. If a list of tensors is not provided, it tries to find them in the transform of `dataloader_train_params`. If neither is provided, it iterates over the dataset tensors and returns any tensors that match htype `'image'` for images and htype `'class_label'` for labels. Additionally, this function checks whether the dataset already has a `label_issues` group.
  - `estimate_cv_predicted_probabilities()` computes an out-of-sample predicted probability for every example in a dataset using cross-validation.
  - `append_label_issues_tensors()` creates a group of tensors `label_issues` and automatically commits the changes after creating the tensors.
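The htype-based fallback described for `get_dataset_tensors()` can be sketched as follows. The dataset is mocked as a plain mapping of tensor name to htype, and the helper name is illustrative; the real function inspects Hub tensor metadata:

```python
def find_tensors_by_htype(tensor_htypes):
    """Pick the first tensor whose htype is 'image' and the first whose
    htype is 'class_label', mirroring the documented fallback behavior."""
    images = labels = None
    for name, htype in tensor_htypes.items():
        if htype == "image" and images is None:
            images = name
        elif htype == "class_label" and labels is None:
            labels = name
    if images is None or labels is None:
        raise ValueError("dataset must contain 'image' and 'class_label' tensors")
    return [images, labels]

print(find_tensors_by_htype({"images": "image", "labels": "class_label", "boxes": "bbox"}))
```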
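Putting the pieces together, the out-of-sample probability step and the issue mask can be sketched in plain NumPy. A toy class-prior classifier stands in for the wrapped PyTorch model, and the plain-argmax issue rule below is a simplification of Confident Learning's calibrated per-class thresholds:

```python
import numpy as np

class PriorClassifier:
    """Toy stand-in for the wrapped model: predicts training-set class
    frequencies. Any estimator with fit/predict_proba works here."""
    def __init__(self, n_classes):
        self.n_classes = n_classes
    def fit(self, X, y):
        self.priors_ = np.bincount(y, minlength=self.n_classes) / len(y)
        return self
    def predict_proba(self, X):
        return np.tile(self.priors_, (len(X), 1))

def out_of_sample_probs(X, y, n_classes, folds=5):
    # K-fold cross-validation: each example is scored by a model that
    # never saw it during training.
    n = len(y)
    probs = np.zeros((n, n_classes))
    idx = np.arange(n)
    for fold in range(folds):
        valid = idx % folds == fold          # held-out split
        model = PriorClassifier(n_classes).fit(X[~valid], y[~valid])
        probs[valid] = model.predict_proba(X[valid])
    return probs

def label_issue_mask(y, probs):
    # Simplified rule: flag an example when the model's most likely class
    # disagrees with the given label.
    return probs.argmax(axis=1) != y

X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
probs = out_of_sample_probs(X, y, n_classes=2)
issues = label_issue_mask(y, probs)
quality = probs[np.arange(len(y)), y]  # self-confidence quality score
```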