GSoC Blog | Activeloop | Week 11

lowlypalace
Published: 09/11/2022

What’s Done

→ Updated API

import hub
from torchvision import transforms

from hub.integrations.cleanlab import clean_labels, create_tensors, clean_view
from hub.integrations import skorch

ds = hub.load("hub://ds")

tform = transforms.Compose(
    [
        transforms.ToTensor(),
        transforms.Normalize((0.5,), (0.5,)),
    ]
)

transform = {"images": tform, "labels": None}

# Get a scikit-learn compatible PyTorch module to pass into clean_labels as a classifier.
# Optional arguments: tensors, valid_dataset, valid_transform, skorch_kwargs.
model = skorch(dataset=ds, epochs=5, batch_size=16, transform=transform)

# Obtain a DataFrame with columns is_label_issue, label_quality and predicted_label
label_issues = clean_labels(
    dataset=ds,
    model=model,
    folds=3,
)

# Create label_issues tensor group on "labels" branch
create_tensors(
    dataset=ds,
    label_issues=label_issues,
    branch="labels"
)

# Get dataset view where only clean labels are present, and the rest are filtered out.
ds_clean = clean_view(ds)

→ Link to PR: https://github.com/activeloopai/Hub/pull/1821

Skorch Integration

skorch()

  • Added support for providing a validation set for training: skorch(dataset=ds, valid_dataset=valid_ds).

  • Added keyword arguments that advanced users can pass through to fine-tune the parameters of the underlying modules.

    • skorch_kwargs: keyword arguments passed to the skorch NeuralNet constructor. Additionally, iterator_train__transform and iterator_valid__transform can be used to set parameters for the training and validation iterators.
  • Made passing in the images and labels tensors more explicit.

  • Modularized methods.

    • Separated the skorch module from cleanlab to make it easier to instantiate skorch even if you’re not using cleanlab downstream.
    • Further modularized skorch module into separate functions and modules.
    • Added utils functions in a separate file.
  • Added error-checking utils to check errors early.

    → Check that the dataset and valid_dataset passed in are Hub Datasets.

    → Check if the tensors’ htypes are supported for image classification tasks.
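
The iterator_train__transform style of naming follows the double-underscore convention that skorch borrows from scikit-learn: the part before "__" names a component and the rest names one of its parameters. The sketch below illustrates only that routing idea in plain Python. split_prefixed_kwargs is a hypothetical helper written for this post, not a function from skorch or Hub, and skorch performs its own routing internally.

```python
def split_prefixed_kwargs(kwargs, prefix):
    """Return (matched, rest): params addressed to `prefix`, and everything else.

    Hypothetical helper for illustration only; skorch does its own routing.
    """
    marker = prefix + "__"
    matched = {k[len(marker):]: v for k, v in kwargs.items() if k.startswith(marker)}
    rest = {k: v for k, v in kwargs.items() if not k.startswith(marker)}
    return matched, rest


# Example keyword arguments in the style accepted by skorch's NeuralNet.
skorch_kwargs = {
    "lr": 0.01,
    "iterator_train__shuffle": True,
    "iterator_valid__shuffle": False,
}

# Peel off the parameters meant for each iterator; the rest stay with the net.
train_params, remaining = split_prefixed_kwargs(skorch_kwargs, "iterator_train")
valid_params, remaining = split_prefixed_kwargs(remaining, "iterator_valid")
```

Here train_params and valid_params hold the options for the training and validation iterators, while remaining holds the options meant for the net itself.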

Cleanlab Integration

clean_labels()

  • Implemented a function to compute the guessed_label predicted by the classifier after pruning.
  • Added a pretrained flag to skip cross-validation when a pretrained model is used, so that out-of-sample probabilities are computed faster with a single fit().
  • Instead of returning a tuple of NumPy ndarrays (label_issues, label_quality_scores and predicted_labels), clean_labels() now returns a single label_issues DataFrame with columns is_label_issue, label_quality and predicted_label.
  • Added keyword arguments that advanced users can pass through to fine-tune the parameters of the underlying modules.
    • label_issues_kwargs can be passed to the cleanlab.filter.find_label_issues function.
    • label_quality_kwargs can be passed to the cleanlab.rank.get_label_quality_scores function.
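
To make the three columns concrete, here is a minimal pure-Python sketch of how they could be derived from out-of-sample predicted probabilities. It is a simplification under stated assumptions: label_quality uses the self-confidence score (the predicted probability of the given label), and is_label_issue uses a naive disagreement-plus-threshold rule rather than cleanlab's confident-learning procedure. The summarize_labels name and the threshold value are hypothetical, and plain dicts stand in for DataFrame rows.

```python
def summarize_labels(labels, pred_probs, threshold=0.5):
    """Build one row per sample with the three columns described above."""
    rows = []
    for given, probs in zip(labels, pred_probs):
        predicted = max(range(len(probs)), key=probs.__getitem__)  # argmax
        quality = probs[given]  # self-confidence: probability of the given label
        rows.append(
            {
                # Naive stand-in rule, NOT cleanlab's actual algorithm.
                "is_label_issue": predicted != given and quality < threshold,
                "label_quality": quality,
                "predicted_label": predicted,
            }
        )
    return rows


labels = [0, 1, 1]
pred_probs = [
    [0.9, 0.1],  # model agrees with the given label
    [0.8, 0.2],  # model disagrees with the given label
    [0.4, 0.6],  # model agrees with the given label
]
rows = summarize_labels(labels, pred_probs)
```

Only the second sample is flagged here, because the model confidently predicts a different class than the one it was given.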

create_tensors()

  • Added the ability to select the branch to commit to when creating tensors.
  • Modularized methods.
    • create_tensors() is now a separate method that takes in a label_issues DataFrame or looks for label_issues in tensors to get a view where only clean labels are present and the rest are filtered out. It now returns the commit_id.
    • Added utils functions in a separate file.
  • Added error-checking utils to check errors early.
    • Check early if a user has write access to the dataset before creating the tensors.
    • Check that the label_issues DataFrame columns have the correct dtypes and are a subset of the dataset before appending them to tensors.
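
A rough sketch of the kind of early check described above, using plain Python structures in place of a DataFrame and a Hub dataset. The function name, the expected column types, and the use of row indices to represent "a subset of the dataset" are all assumptions made for illustration, not the code in the PR.

```python
# Assumed column types, based on the column names used in this post.
EXPECTED_COLUMNS = {
    "is_label_issue": bool,
    "label_quality": float,
    "predicted_label": int,
}


def validate_label_issues(rows, dataset_length):
    """Raise ValueError early if rows (index -> column dict) cannot be appended."""
    for index, row in rows.items():
        if not 0 <= index < dataset_length:
            raise ValueError(f"Row index {index} is outside the dataset.")
        for column, expected_type in EXPECTED_COLUMNS.items():
            if column not in row:
                raise ValueError(f"Missing column {column!r} at index {index}.")
            # bool is a subclass of int, so compare exact types.
            if type(row[column]) is not expected_type:
                raise ValueError(f"Column {column!r} has the wrong type at index {index}.")


good = {0: {"is_label_issue": False, "label_quality": 0.9, "predicted_label": 3}}
validate_label_issues(good, dataset_length=10)  # passes silently

bad = {0: {"is_label_issue": "no", "label_quality": 0.9, "predicted_label": 3}}
try:
    validate_label_issues(bad, dataset_length=10)
    error_message = None
except ValueError as exc:
    error_message = str(exc)
```

Failing before any tensor is written means a malformed input never leaves the dataset in a half-updated state.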

clean_view()

  • Added a method clean_view(ds) to get a dataset view where only clean labels are present, and the rest are filtered out. This can be useful to pass the clean dataset to downstream ML frameworks for training.
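
Conceptually, the view keeps only the samples whose is_label_issue flag is false. A minimal sketch of that selection follows, with a plain list standing in for the label_issues tensor. clean_indices is a hypothetical name, and the real clean_view returns a Hub dataset view rather than a list of indices.

```python
def clean_indices(is_label_issue):
    """Return the indices of samples that were not flagged as label issues."""
    return [i for i, flagged in enumerate(is_label_issue) if not flagged]


flags = [False, True, False, False, True]
kept = clean_indices(flags)  # indices that would remain in the clean view
```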

Other

  • Created a custom config for dependencies: hub['cleanlab'].
  • Created common utils that are reused across modules.
  • Renamed some functions and variables for clarity.
  • Clarified the parameters in the docstrings and improved readability.
  • Merged main branch and resolved conflicts.
  • Commented on the aspects I’m not sure about in my PR.
  • Ran tests on 10+ Activeloop image classification datasets (without creating tensors).

Next Steps