GSoC Blog | Activeloop | Week 11

lowlypalace
Published: 09/11/2022

What’s Done ✅

→ Updated API

import hub
from torchvision import transforms

from hub.integrations.cleanlab import clean_labels, create_tensors, clean_view
from hub.integrations import skorch

ds = hub.load("hub://ds")

tform = transforms.Compose(
    [
        transforms.ToTensor(),
        transforms.Normalize((0.5,), (0.5,)),
    ]
)

transform = {"images": tform, "labels": None}

# Get a scikit-learn compatible PyTorch module to pass into clean_labels as a classifier.
# Further optional arguments (tensors, valid_transform, skorch_kwargs) are omitted here.
model = skorch(dataset=ds, epochs=5, batch_size=16, transform=transform)

# Obtain a DataFrame with columns is_label_issue, label_quality and predicted_label
label_issues = clean_labels(
    dataset=ds,
    model=model,
    folds=3,
)

# Create label_issues tensor group on "labels" branch
create_tensors(
    dataset=ds,
    label_issues=label_issues,
    branch="labels"
)

# Get dataset view where only clean labels are present, and the rest are filtered out.
ds_clean = clean_view(ds)

→ Link to PR: https://github.com/activeloopai/Hub/pull/1821

Skorch Integration

`skorch()`

Added support for providing a validation set for training: skorch(dataset=ds, valid_dataset=valid_ds).
Added keyword arguments that advanced users can pass into the modules to fine-tune their parameters.
- skorch_kwargs: arguments passed to the skorch NeuralNet constructor. Additionally, iterator_train__transform and iterator_valid__transform can be used to set parameters for the training and validation iterators.
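
skorch routes arguments whose names carry a double-underscore prefix (such as iterator_train__) to the matching component. A minimal sketch of that naming convention in plain Python follows; the helper collect_prefixed and the example values are hypothetical and are not part of the PR.

```python
def collect_prefixed(prefix, kwargs):
    """Return the entries named '<prefix>__<name>' as a {name: value} dict."""
    marker = prefix + "__"
    return {
        key[len(marker):]: value
        for key, value in kwargs.items()
        if key.startswith(marker)
    }


# Hypothetical skorch_kwargs, for illustration only.
skorch_kwargs = {
    "lr": 0.01,
    "iterator_train__transform": "train_tform",
    "iterator_valid__transform": "valid_tform",
}

# Each iterator only sees the arguments addressed to it.
train_iterator_params = collect_prefixed("iterator_train", skorch_kwargs)
valid_iterator_params = collect_prefixed("iterator_valid", skorch_kwargs)
```

Unprefixed entries such as lr stay with the top-level constructor, which is why they are not returned by the helper.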
Made passing in the images and labels tensors more explicit.
Modularized methods.
- Separated the skorch module from cleanlab to make it easier to instantiate skorch even if you’re not using cleanlab downstream.
- Further modularized skorch module into separate functions and modules.
- Added utility functions in a separate file.
Added error-checking utilities to catch errors early.

→ Check that the dataset and valid_dataset passed in are Hub Datasets.

→ Check if the tensors’ htypes are supported for image classification tasks.
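
The two early checks above could look roughly like the sketch below. It uses stand-in classes rather than real Hub objects, and the names StubDataset, StubTensor, check_dataset, check_htypes and the htype mapping are all assumptions made for illustration, not the code in the PR.

```python
from dataclasses import dataclass, field


@dataclass
class StubTensor:
    """Stand-in for a Hub tensor; only the htype matters here."""
    htype: str


@dataclass
class StubDataset:
    """Stand-in for a Hub Dataset holding named tensors."""
    tensors: dict = field(default_factory=dict)


# Assumed htypes for an image classification task.
ASSUMED_HTYPES = {"images": "image", "labels": "class_label"}


def check_dataset(dataset, dataset_type=StubDataset):
    """Fail early if the object passed in is not the expected dataset type."""
    if not isinstance(dataset, dataset_type):
        raise TypeError(
            f"Expected {dataset_type.__name__}, got {type(dataset).__name__}."
        )


def check_htypes(dataset, expected=ASSUMED_HTYPES):
    """Fail early if a required tensor is missing or has an unsupported htype."""
    for name, htype in expected.items():
        if name not in dataset.tensors:
            raise KeyError(f"Tensor '{name}' not found in the dataset.")
        actual = dataset.tensors[name].htype
        if actual != htype:
            raise ValueError(f"Tensor '{name}' has htype '{actual}', expected '{htype}'.")


good = StubDataset({"images": StubTensor("image"), "labels": StubTensor("class_label")})
check_dataset(good)
check_htypes(good)
```

Running these checks before any training starts means a wrong argument surfaces in seconds rather than after the first epoch.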

Cleanlab Integration

`clean_labels()`

Implemented a function to compute the guessed_label predicted by the classifier after pruning.
Added a pretrained flag that skips cross-validation when a pretrained model is used, so that out-of-sample probabilities are computed faster with a single fit().
Instead of returning a tuple of NumPy ndarrays (label_issues, label_quality_scores and predicted_labels), clean_labels() now returns a single label_issues DataFrame with the columns is_label_issue, label_quality and predicted_label.
Added keyword arguments that advanced users can pass into the modules to fine-tune their parameters.
- label_issues_kwargs can be passed to the cleanlab.filter.find_label_issues function.
- label_quality_kwargs can be passed to the cleanlab.rank.get_label_quality_scores function.
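
To make the three columns concrete, here is a minimal sketch that builds them from predicted probabilities in plain Python. It scores each label by self-confidence (the probability the model assigns to the given label) and flags issues with a fixed threshold. That threshold rule is a deliberately simple stand-in, not the confident-learning procedure cleanlab uses, and summarize_labels is a hypothetical name.

```python
def summarize_labels(labels, pred_probs, threshold=0.5):
    """Build one record per sample with the three label_issues columns."""
    rows = []
    for given, probs in zip(labels, pred_probs):
        # Class with the highest predicted probability.
        predicted = max(range(len(probs)), key=probs.__getitem__)
        # Self-confidence: probability assigned to the given label.
        quality = probs[given]
        rows.append(
            {
                # Simplified rule (assumption): the model disagrees with the
                # given label and is not confident in it.
                "is_label_issue": predicted != given and quality < threshold,
                "label_quality": quality,
                "predicted_label": predicted,
            }
        )
    return rows


labels = [0, 1, 1]
pred_probs = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7]]
rows = summarize_labels(labels, pred_probs)
```

In this toy input only the second sample is flagged: its given label is 1, but the model puts 0.8 on class 0.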

`create_tensors()`

Added the ability to select the branch to commit to when creating tensors.
Modularized methods.
- create_tensors() is now a separate method that takes in a label_issues DataFrame, or looks for label_issues in the existing tensors, to get a view where only clean labels are present and the rest are filtered out. It now returns the commit_id.
- Added utility functions in a separate file.
Added error-checking utilities to catch errors early.
- Check early if a user has write access to the dataset before creating the tensors.
- Check that the label_issues DataFrame columns have the correct dtypes and are a subset of the dataset before appending them to tensors.
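
A minimal sketch of the column check, using a plain dict of lists in place of a DataFrame. The expected types, the reading of "subset" as "no more rows than the dataset", and the name validate_label_issues are assumptions made for illustration.

```python
# Assumed column names and element types.
EXPECTED_COLUMNS = {
    "is_label_issue": bool,
    "label_quality": float,
    "predicted_label": int,
}


def validate_label_issues(columns, dataset_length):
    """Raise early if columns are missing, too long, or wrongly typed."""
    for name, expected_type in EXPECTED_COLUMNS.items():
        if name not in columns:
            raise KeyError(f"Missing column '{name}'.")
        values = columns[name]
        if len(values) > dataset_length:
            raise ValueError(f"Column '{name}' has more rows than the dataset.")
        for value in values:
            # Exact type match, because bool is a subclass of int in Python.
            if type(value) is not expected_type:
                raise TypeError(f"Column '{name}' expects {expected_type.__name__}.")


valid = {
    "is_label_issue": [False, True],
    "label_quality": [0.9, 0.2],
    "predicted_label": [0, 0],
}
validate_label_issues(valid, dataset_length=2)
```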

`clean_view()`

Added a method clean_view(ds) to get a dataset view where only clean labels are present, and the rest are filtered out. This can be useful to pass the clean dataset to downstream ML frameworks for training.
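
Conceptually, the view keeps only the samples that were not flagged. A minimal sketch of that filtering step on plain lists, where clean_indices and the sample names are hypothetical:

```python
def clean_indices(is_label_issue):
    """Return the positions of samples that were not flagged as label issues."""
    return [i for i, flagged in enumerate(is_label_issue) if not flagged]


samples = ["cat_0", "dog_1", "cat_2", "dog_3"]
flags = [False, True, False, False]

kept = clean_indices(flags)
clean_samples = [samples[i] for i in kept]
```

The original data stays untouched, since the view is just a list of indices into it.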

Other

Created a custom config for dependencies: hub['cleanlab'].
Created common utils that are reused across modules.
Renamed some functions and variables to make them clearer.
Clarified the docstring parameters and improved readability.
Merged main branch and resolved conflicts.
Commented on the aspects I’m not sure about in my PR.
Ran tests on 10+ Activeloop image classification datasets (without creating tensors).

Next Steps

Finalize the PR after getting the reviewers’ feedback.
Try to create and run unit tests.
Create a notebook that showcases the workflow (such as https://docs.activeloop.ai/playbooks/evaluating-model-performance).
Create a blog post with more insight into the problem statement and the results of running the workflow on various datasets with varying noise levels (such as https://www.activeloop.ai/resources/).
