GSoC Blog | Activeloop | Week 10

Published: 09/11/2022

What’s Done 

→ Updated API

from torch.nn import CrossEntropyLoss
from torch.optim import SGD
from torchvision.models import resnet18

from hub.integrations.cleanlab import clean_labels

training_params = {'module': resnet18(), 'criterion': CrossEntropyLoss,
                   'optimizer': SGD, 'epochs': 10, 'optimizer_lr': 0.01,
                   'device': "cpu", 'folds': 5}

# ds is a Hub dataset loaded beforehand
clean_labels(
    ds,
    training_params=training_params,
    verbose=True,
    tensors=['images', 'labels'],
    overwrite=False,
    num_workers=1,
    batch_size=1,
    shuffle=True,
    transform={},
    create_tensors=True,
)

→ Added create_tensors flag.

  • The boolean create_tensors flag lets a user confirm whether new label-issue tensors should be appended to the dataset. If create_tensors is False, is_label_issues and label_quality_scores are only returned as numpy arrays. If True, the tensors is_label_issues and label_quality_scores are created in the dataset and the same values are also returned as numpy arrays.
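
The return behavior of the flag can be sketched as follows. This is only an illustration under stated assumptions, not the integration's actual code: the helper name store_label_issues is hypothetical, a plain dict stands in for a Hub dataset, and Python lists stand in for the numpy arrays.

```python
def store_label_issues(ds, is_label_issues, label_quality_scores, create_tensors=True):
    """Hypothetical helper: optionally write the results to ds, always return them."""
    if create_tensors:
        # Stand-in for creating the two new tensors on the dataset.
        ds["is_label_issues"] = list(is_label_issues)
        ds["label_quality_scores"] = list(label_quality_scores)
    # The results are returned in both cases.
    return is_label_issues, label_quality_scores


# A dict stands in for a dataset with 'images' and 'labels' tensors (assumption).
ds = {"images": ["img0", "img1", "img2"], "labels": [0, 1, 1]}

# create_tensors=False: the dataset is left untouched, results are only returned.
issues, scores = store_label_issues(
    ds, [False, True, False], [0.97, 0.12, 0.88], create_tensors=False
)
untouched = "is_label_issues" not in ds

# create_tensors=True: the results are also stored on the dataset.
store_label_issues(ds, issues, scores, create_tensors=True)
```

Returning the results in both cases means a caller can inspect the label issues first and decide later whether to write them to the dataset.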

→ Added support to provide validation set for training

  • clean_labels(ds_train, ds_valid)
  • No support yet to compute label errors for validation set

→ Made specifying tensor names more explicit

→ Fixed some errors related to checking if an image tensor is RGB or Grayscale

→ Minor improvements (e.g. matching device in the core function rather than making it a required parameter)

What’s Next


→ Prune API

  • prune_labels(ds)
  • Instead of deleting samples, let users create a view of the dataset that fetches only correctly labeled samples when filling up batches?
    • Users could already run ds = ds[clean_idx] and then use the clean dataset downstream.
  • Leave pruning to the users and show how to do it in the blog post instead?
  • Create a new branch
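
The index-based pruning idea can be sketched in a few lines. This is a minimal illustration, assuming is_label_issues is a boolean mask with one entry per sample; a plain list stands in for the dataset, and the sample values are made up.

```python
# Boolean mask: True means the sample was flagged as a likely label issue.
is_label_issues = [False, True, False, False, True]

# Stand-in for the dataset; the sample names are invented for illustration.
samples = ["cat_0", "dog_1", "cat_2", "dog_3", "cat_4"]

# Keep the indices of samples that were not flagged.
clean_idx = [i for i, has_issue in enumerate(is_label_issues) if not has_issue]

# With a dataset this step would be ds = ds[clean_idx].
clean_samples = [samples[i] for i in clean_idx]
```

Nothing is deleted here: the flagged samples stay in the original data, and only the selection used downstream changes.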

→ Create a guessed_label tensor to store the labels predicted by the classifier after pruning.

  • Relabeling workflow on Activeloop?

→ Add an optional extra for pip install (e.g. pip install "hub[cleanlab]")
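
One way this could look is a setuptools extra. The fragment below is a hypothetical sketch of a setup.py entry, not the project's actual packaging; the extra name and the unpinned package list are assumptions.

```python
# Hypothetical setup.py fragment: packages installed only when the
# "cleanlab" extra is requested. The package list is an assumption.
extras_require = {
    "cleanlab": ["cleanlab", "skorch"],
}

# In setup.py this dict would be passed on as:
#     setup(..., extras_require=extras_require)
```

With an extra like this, the additional dependencies are installed only for users who request the extra by name.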

→ Add a branch flag to switch to a different branch instead of committing on the current branch.

→ Add flag add_branch = True

→ Add support for bounding boxes, task = 'classification' or task = 'segmentation'

→ Raise an error if the tensor's htype is not image
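
Such a check might look like the sketch below. The function name check_image_htype and the FakeTensor stand-in are hypothetical, and comparing against the exact string "image" is an assumption about how the htype would be matched.

```python
class FakeTensor:
    """Stand-in for a dataset tensor that exposes an htype attribute."""

    def __init__(self, htype):
        self.htype = htype


def check_image_htype(tensor, name="images"):
    """Hypothetical validation: reject tensors whose htype is not 'image'."""
    if tensor.htype != "image":
        raise TypeError(
            f"Tensor '{name}' has htype '{tensor.htype}', expected 'image'."
        )


# Passes silently for an image tensor.
check_image_htype(FakeTensor("image"))
```

Failing early with a message that names the tensor avoids a less readable error later, during training.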

→ Add support for TensorFlow modules

→ Add optional cleanlab kwargs to pass down

→ Add optional skorch kwargs to pass down

→ Tests

  • Unit tests
  • Tests with Activeloop datasets

→ Make it possible to skorch(ds)

→ Raise an error if the user doesn’t have write access to the dataset