What’s Done ✅
→ Updated API
```python
import hub
from torchvision import transforms

from hub.integrations.cleanlab import clean_labels, create_tensors, clean_view
from hub.integrations import skorch

ds = hub.load("hub://ds")

tform = transforms.Compose(
    [
        transforms.ToTensor(),
        transforms.Normalize((0.5,), (0.5,)),
    ]
)
transform = {"images": tform, "labels": None}

# Get a scikit-learn compatible PyTorch module to pass into clean_labels as a classifier.
# Optional valid_transform and skorch_kwargs arguments are covered below.
model = skorch(
    dataset=ds,
    epochs=5,
    batch_size=16,
    transform=transform,
    tensors=["images", "labels"],  # explicit tensor names (assumed from the transform keys)
)

# Obtain a DataFrame with columns is_label_issue, label_quality and predicted_label
label_issues = clean_labels(
    dataset=ds,
    model=model,
    folds=3,
)

# Create a label_issues tensor group on the "labels" branch
create_tensors(
    dataset=ds,
    label_issues=label_issues,
    branch="labels",
)

# Get a dataset view where only clean labels are present, and the rest are filtered out
ds_clean = clean_view(ds)
```
→ Link to PR: https://github.com/activeloopai/Hub/pull/1821
Skorch Integration
**skorch**()
- Added support for providing a validation set for training: `skorch(dataset=ds, valid_dataset=valid_ds)`.
- Added keyword arguments that can be passed into modules to fine-tune the parameters for advanced users: `skorch_kwargs` arguments are passed to the skorch `NeuralNet` constructor. Additionally, `iterator_train__transform` and `iterator_valid__transform` can be used to set params for the training and validation iterators (see the sketch after this list).
- Made passing in the images and labels tensors more explicit.
- Modularized methods.
    - Separated the `skorch` module from `cleanlab` to make it easier to instantiate skorch even if you're not using `cleanlab` downstream.
    - Further modularized the `skorch` module into separate functions and modules.
    - Added utils functions in a separate file.
- Added error-checking utils to catch errors early.
    → Check that the `dataset` and `valid_dataset` passed in are Hub Datasets.
    → Check that the tensors' `htypes` are supported for image classification tasks.
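For illustration, here is a minimal sketch combining the new validation-set and keyword-argument options. The `valid_ds` dataset and the specific `NeuralNet` values are hypothetical, and I'm assuming the iterator transforms ride along inside `skorch_kwargs`:

```python
valid_ds = hub.load("hub://ds-valid")  # hypothetical validation dataset

model = skorch(
    dataset=ds,
    valid_dataset=valid_ds,  # validation set used during training
    epochs=5,
    batch_size=16,
    transform=transform,
    tensors=["images", "labels"],
    # Forwarded to the skorch NeuralNet constructor; the iterator__ keys
    # configure the training and validation iterators.
    skorch_kwargs={
        "lr": 1e-3,
        "iterator_train__transform": transform,
        "iterator_valid__transform": transform,
    },
)
```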
Cleanlab Integration
**clean_labels**()
- Implemented a function to compute `guessed_label` by the classifier after pruning.
- Added a `pretrained` flag to skip cross-validation when a pretrained model is used, so out-of-sample probabilities are computed faster on a single `fit()`.
- Instead of returning a tuple of numpy ndarrays `label_issues`, `label_quality_scores`, and `predicted_labels`, `clean_labels()` now returns a single `label_issues` DataFrame with columns `is_label_issue`, `label_quality`, and `predicted_label`.
- Added keyword arguments that can be passed into modules to fine-tune the parameters for advanced users (see the sketch after this list): `label_issues_kwargs` can be passed to the `cleanlab.filter.find_label_issues` function; `label_quality_kwargs` can be passed to the `cleanlab.rank.get_label_quality_scores` function.
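A sketch of how these knobs might be passed through; the specific `filter_by` and `method` values are just examples of cleanlab's documented parameters, not defaults of this integration:

```python
label_issues = clean_labels(
    dataset=ds,
    model=model,
    folds=3,
    # pretrained=True would skip cross-validation and compute
    # out-of-sample probabilities on a single fit().
    label_issues_kwargs={"filter_by": "prune_by_noise_rate"},  # -> cleanlab.filter.find_label_issues
    label_quality_kwargs={"method": "self_confidence"},        # -> cleanlab.rank.get_label_quality_scores
)

# Inspect the returned DataFrame
print(label_issues[["is_label_issue", "label_quality", "predicted_label"]].head())
```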
**create_tensors**()
- Added the ability to select the `branch` to commit to when creating tensors.
- Modularized methods: `create_tensors()` is now a separate method that takes in a `label_issues` DataFrame, or looks for `label_issues` in tensors, to get a view where only clean labels are present and the rest are filtered out. It now returns a `commit_id` (see the sketch after this list).
- Added utils functions in a separate file.
- Added error-checking utils to catch errors early.
    - Check early whether the user has write access to the dataset before creating the tensors.
    - Check that the `label_issues` DataFrame columns have the correct `dtypes` and are a subset of the dataset before appending them to tensors.
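The branch-aware call from the example above, capturing the returned commit id (a sketch, assuming `create_tensors` returns it directly as described):

```python
commit_id = create_tensors(
    dataset=ds,
    label_issues=label_issues,
    branch="labels",  # commit the label_issues tensor group to this branch
)
print(f"label_issues tensors committed at: {commit_id}")
```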
**clean_view**()
- Added a method `clean_view(ds)` to get a dataset view where only clean labels are present and the rest are filtered out. This can be useful for passing the clean dataset to downstream ML frameworks for training (see the sketch below).
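For example, the clean view could feed straight into training, assuming Hub's standard `ds.pytorch()` loader accepts the same per-tensor transform dict used earlier:

```python
ds_clean = clean_view(ds)

# Train only on samples whose labels were not flagged as issues.
train_loader = ds_clean.pytorch(
    transform=transform,
    batch_size=16,
    shuffle=True,
)
```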
Other
- Created a custom config for dependencies: `hub[cleanlab]` (install sketch below).
- Created common utils that are reused across modules.
- Renamed some of the functions and variables to be clearer.
- Clarified the docstring parameters and improved readability.
- Merged the `main` branch and resolved conflicts.
- Commented on the aspects I'm not sure about in my PR.
- Ran tests on 10+ Activeloop image classification datasets (without creating tensors).
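Assuming the extras name matches the config above, installation would be:

```bash
pip install "hub[cleanlab]"
```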
Next Steps
- Finalize the PR after getting reviewers' feedback.
- Try to create and run unit tests.
- Create a notebook that showcases the workflow (such as https://docs.activeloop.ai/playbooks/evaluating-model-performance)
- Create a blog post with more insight into the problem statement and the results of running the workflow on various datasets with varying noise levels (such as https://www.activeloop.ai/resources/)