GSoC Weekly Check-In #2 (June 14)

rahulbshrestha
Published: 06/15/2021

What did I do this week?
I started off by writing some Python code that compares two datasets and prints out a similarity score (in %). The steps are: generate an md5 hash for each file, append the hash and its file path to a list, and compare this list against the hash lists of other datasets. If both datasets are identical, the similarity score is 100%. I tested it with several cases, for example renaming, deleting and duplicating files. The similarity score changed correctly every time, and the differing files were pointed out. My implementation: https://github.com/rahulbshrestha/hash-dataset
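Here is a minimal sketch of that hash-and-compare idea, not the code from the repository above; the function names, the directory walking, and the exact scoring formula are my own simplifications.

```python
import hashlib
from pathlib import Path

def hash_dataset(directory):
    """Map each file's relative path to the md5 hash of its contents."""
    hashes = {}
    for path in sorted(Path(directory).rglob("*")):
        if path.is_file():
            digest = hashlib.md5(path.read_bytes()).hexdigest()
            hashes[str(path.relative_to(directory))] = digest
    return hashes

def similarity(dir_a, dir_b):
    """Return the percentage of file hashes shared between two datasets."""
    hashes_a, hashes_b = hash_dataset(dir_a), hash_dataset(dir_b)
    set_a, set_b = set(hashes_a.values()), set(hashes_b.values())
    if not set_a and not set_b:
        return 100.0
    # Point out files whose content has no matching hash on the other side.
    for path, digest in hashes_a.items():
        if digest not in set_b:
            print(f"Differs or missing in {dir_b}: {path}")
    shared = set_a & set_b
    return 100.0 * len(shared) / max(len(set_a), len(set_b))

if __name__ == "__main__":
    print(f"Similarity: {similarity('dataset_a', 'dataset_b'):.1f}%")
```

With this kind of scoring, renaming a file keeps the hash identical (so only the path report changes), while editing, deleting or duplicating files shifts the percentage.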

What will I do next week?
I’ll be experimenting with larger datasets and monitoring the execution time. I would also like to implement splitting a hash list into smaller subsets and checking whether a subset exists in another hash list. This could save time because, when a user uploads a new dataset, not every file would have to be hashed and compared. A rough sketch of the idea is below.
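A hedged sketch of that subset idea (the sampling strategy and function name are my own, nothing from the project yet): sample a few hashes from the newly uploaded dataset and check how many already appear in an existing hash list; a high hit rate suggests the datasets overlap and a full comparison is worth doing.

```python
import random

def subset_overlap(new_hashes, existing_hashes, sample_size=100, seed=0):
    """Estimate overlap by sampling hashes from the new dataset and
    checking membership in the existing dataset's hash set."""
    rng = random.Random(seed)
    population = list(new_hashes)
    sample = rng.sample(population, min(sample_size, len(population)))
    existing = set(existing_hashes)
    hits = sum(h in existing for h in sample)
    return hits / len(sample) if sample else 0.0

# e.g. subset_overlap(hash_dataset("new_upload").values(),
#                     hash_dataset("existing").values())
```
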
Did I get stuck anywhere?
I learned that some Python libraries are slower than I expected, so I had to rewrite parts of my code. The execution time adds up when working with large datasets.