GSoC Weekly Check-In #8 (July 26)

Published: 07/27/2021

What did I do this week?
  • Merged my local forked repo with the main branch used by Hub. The architecture had changed, so I had to fix parts of my code.
  • Added “ (dataset-1, dataset-2)”. This compares the hashlists of two datasets using Jaccard similarity.
  • Fixed the way samples are hashed to account for compression. Previously, samples were hashed after compression, so the compressed and uncompressed versions of the same sample produced two different hashes.
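
To illustrate the last two points, here is a minimal sketch rather than the actual Hub code. It assumes the fix means hashing the uncompressed bytes of each sample, and it uses SHA-256 from the standard library as a stand-in hash function. The names hash_sample and jaccard_similarity are hypothetical and chosen only for this example.

```python
import hashlib
import zlib


def hash_sample(raw: bytes) -> str:
    """Hash the uncompressed bytes of a sample (SHA-256 is an assumption)."""
    return hashlib.sha256(raw).hexdigest()


def jaccard_similarity(hashes_a, hashes_b) -> float:
    """Size of the intersection divided by the size of the union."""
    set_a, set_b = set(hashes_a), set(hashes_b)
    union = set_a | set_b
    if not union:
        # Convention chosen for this sketch: two empty hashlists are not similar.
        return 0.0
    return len(set_a & set_b) / len(union)


# The same sample, stored uncompressed and compressed.
sample = b"pixels" * 100
compressed = zlib.compress(sample)

# Hashing the stored bytes directly gives two different hashes ...
hashes_differ = hash_sample(sample) != hash_sample(compressed)
# ... while hashing after decompression gives the same hash.
hashes_match = hash_sample(zlib.decompress(compressed)) == hash_sample(sample)

# Two toy datasets that share two of their three samples.
dataset_1 = [hash_sample(s) for s in (b"a", b"b", b"c")]
dataset_2 = [hash_sample(s) for s in (b"b", b"c", b"d")]
similarity = jaccard_similarity(dataset_1, dataset_2)
```

With two shared samples out of four distinct ones, the similarity here is 2 / 4 = 0.5. A value of 1.0 would mean both datasets contain exactly the same set of samples.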

What will I do next week?
  • Use transforms to generate hashes in a distributed manner.
  • Complete last week's task of comparing the hashes of datasets being uploaded against those of datasets already on Hub.
  • Write test cases and document the code.
  • Optimize the size of hashlists. Hashes are currently stored with unnecessary quotation marks, e.g. hashes = ["d2asdsdf", "asd223gk"]. These should be removed to save space.
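
As a rough illustration of the space the quotation marks take up, the sketch below compares a quoted list with a plain newline-separated string. Both storage formats are assumptions made for this example, not how Hub stores hashlists, and pack_hashes and unpack_hashes are hypothetical helpers.

```python
import json


def pack_hashes(hashes):
    """Join hashes with newlines, so no quotes or brackets are stored."""
    return "\n".join(hashes)


def unpack_hashes(packed):
    """Recover the list of hashes from the newline-separated string."""
    return packed.split("\n") if packed else []


hashes = ["d2asdsdf", "asd223gk"]

quoted = json.dumps(hashes)   # quoted list, similar to the example above
packed = pack_hashes(hashes)  # same hashes without quotes or brackets

saved_chars = len(quoted) - len(packed)
```

The quoted form spends two quotation marks plus a separator on every hash, so the saving grows linearly with the number of samples in the dataset.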

Did I get stuck anywhere?
As mentioned above, merging with the main branch meant I had to make some fixes to my code, which took time.