GSoC Weekly Check-In #4 (June 28)

Published: 06/29/2021

What did I do this week?
This week, I set up an environment for experimenting with large datasets using an Amazon S3 bucket and an EC2 instance. Previously, I was testing my code on my local machine, but my hard drive has only 10 GB free, so it wasn’t possible to work with TB- or PB-sized datasets.

What will I do next week?
  • Benchmark the performance of the implemented hashing technique on datasets of different sizes.
  • Check whether separating hashes by their first three digits spreads them evenly across buckets.
  • Check the probability of hash collisions for several hashing algorithms (MD5, SHA-256, MurmurHash, etc.). We want to avoid false positives that could arise from collisions in large datasets. Also, check whether summing up hashes is a viable quick way to tell if two datasets are identical.
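
To make the last two items concrete, here is a minimal sketch of what I have in mind. It is not the project's actual code. It assumes each record is a string hashed with MD5, that the "first three digits" are the first three hex digits of the digest (so 16^3 = 4096 possible buckets), and that the function names row_hash, bucket_counts and summed_fingerprint are made up for illustration.

```python
import hashlib
from collections import Counter


def row_hash(row: str) -> str:
    """Hex MD5 digest of one record (MD5 is an assumption; any hash would do)."""
    return hashlib.md5(row.encode("utf-8")).hexdigest()


def bucket_counts(rows, prefix_len: int = 3) -> Counter:
    """Count how many records land in each bucket, keyed by hash prefix."""
    return Counter(row_hash(r)[:prefix_len] for r in rows)


def summed_fingerprint(rows, bits: int = 128) -> int:
    """Add all record hashes modulo 2**bits.

    Addition is commutative, so the result does not depend on record order.
    Equal sums only suggest that two datasets match; they do not prove it.
    """
    mask = (1 << bits) - 1
    total = 0
    for r in rows:
        total = (total + int(row_hash(r), 16)) & mask
    return total


# Synthetic records standing in for a real dataset.
rows = [f"record-{i}" for i in range(100_000)]

counts = bucket_counts(rows)
expected_per_bucket = len(rows) / 16 ** 3
print("buckets used:", len(counts))
print("min / max per bucket:", min(counts.values()), max(counts.values()))
print("expected per bucket:", round(expected_per_bucket, 1))

# The same records in a different order give the same fingerprint.
print(summed_fingerprint(rows) == summed_fingerprint(reversed(rows)))
```

Comparing the min and max bucket sizes against the expected count gives a rough first look at how even the spread is. The summed fingerprint is cheap and order-independent, but different datasets can still add up to the same value, which is exactly the false-positive risk I want to measure.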

Did I get stuck anywhere?
I had difficulty working with Boto3, the AWS software development kit for Python, and had to go through the documentation extensively to resolve my problems.