What’s Done ✅
→ Updated API
import hub
from torchvision import transforms

from hub.integrations.cleanlab import clean_labels, create_tensors, clean_view
from hub.integrations import skorch

ds = hub.load("hub://ds")

tform = transforms.Compose(
    [
        transforms.ToTensor(),
        transforms.Normalize((0.5,), (0.5,)),
    ]
)
transform = {"images": tform, "labels": None}

# Get a scikit-learn compatible PyTorch module to pass into clean_labels as a classifier
model = skorch(dataset=ds, epochs=5, batch_size=16, transform=transform)

# Obtain a DataFrame with columns is_label_issue, label_quality and predicted_label
label_issues = clean_labels(
    dataset=ds,
    model=model,
    folds=3,
)

# Create the label_issues tensor group on the "labels" branch
create_tensors(
    dataset=ds,
    label_issues=label_issues,
    branch="labels",
)

# Get a dataset view where only clean labels are present, and the rest are filtered out
ds_clean = clean_view(ds)
→ Link to PR: https://github.com/activeloopai/Hub/pull/1821
Skorch Integration
skorch()
- Added support for providing the validation set for training:
  skorch(dataset=ds, valid_dataset=valid_ds)
- Added keyword arguments that can be passed into modules to fine-tune the parameters for advanced users: skorch_kwargs are arguments to be passed to the skorch NeuralNet constructor. Additionally, iterator_train__transform and iterator_valid__transform can be used to set params for the training and validation iterators.
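  A minimal sketch of passing these through (the specific values here are illustrative assumptions, not the PR's defaults; the iterator params use skorch's double-underscore convention):

  model = skorch(
      dataset=ds,
      transform=transform,
      skorch_kwargs={
          "optimizer__lr": 0.01,  # forwarded to the skorch NeuralNet constructor
          "iterator_train__transform": tform,  # params for the training iterator
          "iterator_valid__transform": tform,  # params for the validation iterator
      },
  )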
- Made passing in the images and labels tensors more explicit.
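  For example (the tensor names here are assumed to match the dataset in the snippet above):

  model = skorch(dataset=ds, tensors=["images", "labels"], transform=transform)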
- Modularized methods.
  - Separated the skorch module from cleanlab to make it easier to instantiate skorch even if you're not using cleanlab downstream.
  - Further modularized the skorch module into separate functions and modules.
  - Added utils functions in a separate file.
- Added error-checking utils to check errors early.
→ Check if the dataset and valid_dataset that are passed in are Hub Datasets.
→ Check if the tensors' htypes are supported for image classification tasks.
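As a sketch, the dataset check might look like the following (the helper name and import path are assumptions about Hub's internals, not the PR's actual code):

from hub.core.dataset import Dataset  # assumed import path

def _validate_datasets(dataset, valid_dataset=None):
    # Fail early with a clear error instead of deep inside training.
    for name, candidate in (("dataset", dataset), ("valid_dataset", valid_dataset)):
        if candidate is not None and not isinstance(candidate, Dataset):
            raise TypeError(f"`{name}` must be a Hub Dataset, got {type(candidate).__name__}")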
Cleanlab Integration
clean_labels()
- Implemented a function to compute the guessed_label by the classifier after pruning.
- Added a pretrained flag to skip cross-validation if a pretrained model is used, so out-of-sample probabilities can be computed faster on a single fit().
- Instead of returning a tuple of numpy ndarrays label_issues, label_quality_scores and predicted_labels, clean_labels() now returns a single label_issues dataframe with the columns is_label_issue, label_quality, predicted_label.
- Added keyword arguments that can be passed into modules to fine-tune the parameters for advanced users: label_issues_kwargs can be passed to the cleanlab.filter.find_label_issues function, and label_quality_kwargs can be passed to the cleanlab.rank.get_label_quality_scores function.
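A hedged example of forwarding these kwargs (the dict values are real cleanlab options, but picking them here is illustrative, not the PR's defaults):

label_issues = clean_labels(
    dataset=ds,
    model=model,
    label_issues_kwargs={"filter_by": "prune_by_noise_rate"},  # forwarded to cleanlab.filter.find_label_issues
    label_quality_kwargs={"method": "self_confidence"},  # forwarded to cleanlab.rank.get_label_quality_scores
)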
create_tensors()
- Added the ability to select the branch to commit to when creating tensors.
- Modularized methods. create_tensors() is now a separate method that takes in the label_issues dataframe, or looks for label_issues in the tensors, to get a view where only clean labels are present and the rest are filtered out. It now returns a commit_id (sketched after this list).
- Added utils functions in a separate file.
- Added error-checking utils to check errors early.
- Check early if a user has write access to the dataset before creating the tensors.
- Check if the label_issues dataframe columns have the correct dtypes and are a subset of the dataset before appending them to tensors.
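A minimal sketch of the resulting call (signature as in the snippet at the top of this update; the returned commit_id reflects this PR's change):

commit_id = create_tensors(
    dataset=ds,
    label_issues=label_issues,
    branch="labels",  # commit the new tensor group to this branch
)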
clean_view()
- Added a method clean_view(ds) to get a dataset view where only clean labels are present, and the rest are filtered out. This can be useful to pass the clean dataset to downstream ML frameworks for training.
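For instance, assuming views expose the same .pytorch() loader as datasets, the clean view could feed straight into training (the batch size is arbitrary):

ds_clean = clean_view(ds)
dataloader = ds_clean.pytorch(batch_size=16)  # train downstream on clean samples only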
Other
- Created a custom config for the hub['cleanlab'] dependencies (see the install line after this list).
- Created common utils that are reused across modules.
- Renamed some of the functions and variables to be clearer.
- Clarified the docstring parameters and improved readability.
- Merged the main branch and resolved conflicts.
- Commented on the aspects I'm not sure about in my PR.
- Ran tests on 10+ Activeloop image classification datasets (without creating tensors).
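Assuming the extra is registered in the package config, the optional dependencies would install with:

pip install "hub[cleanlab]"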
Next Steps
- Finalize the PR after getting reviewers' feedback.
- Try to create and run unit tests.
- Create a notebook that showcases the workflow (such as https://docs.activeloop.ai/playbooks/evaluating-model-performance).
- Create a blog post with a bit more insight into the problem statement and the results of running the workflow on various datasets with varying noise levels (such as https://www.activeloop.ai/resources/).