GSoC Weekly Check-In #4 (June 28)

rahulbshrestha
Published: 06/29/2021

What did I do this week?
This week, I set up an environment for experimenting with large datasets using an Amazon S3 bucket and an EC2 instance. Previously, I was testing my code on my local machine, but my hard drive has only 10 GB free, so it wasn't possible to work with TB- or PB-sized datasets.

What will I do next week?

Benchmark performance of the implemented hashing technique with datasets of different sizes.
Check whether bucketing hashes by their first three digits spreads them evenly across the buckets.
Check the probability of hash collisions for several hashing algorithms (MD5, SHA-256, MurmurHash, etc.). We want to avoid false positives that could arise from collisions in large datasets. Also, check whether summing up hashes is a viable quick way to find out if two datasets are identical.
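
To make these three checks concrete, here is a rough sketch of how they could look in Python. It is not the project's actual code: the function names, the choice of SHA-256, the hex-digit prefix, and the generated test records are all assumptions made for illustration. The collision estimate uses the standard birthday-bound approximation, and the summed fingerprint is only a quick heuristic (it ignores record order, and equal sums make identical datasets very likely rather than certain).

```python
import hashlib
import math
from collections import Counter


def record_hash(record: bytes) -> str:
    """Hex SHA-256 digest of one record (algorithm picked for illustration)."""
    return hashlib.sha256(record).hexdigest()


def bucket_counts(records, prefix_len=3):
    """Count how many record hashes land in each bucket, keyed by hash prefix."""
    return Counter(record_hash(r)[:prefix_len] for r in records)


def summed_fingerprint(records, bits=256):
    """Order-independent fingerprint: sum of all record hashes modulo 2**bits."""
    modulus = 1 << bits
    total = 0
    for r in records:
        total = (total + int(record_hash(r), 16)) % modulus
    return total


def collision_probability(n_items, hash_bits):
    """Birthday-bound approximation: 1 - exp(-n(n-1) / (2 * 2**bits))."""
    exponent = -n_items * (n_items - 1) / (2.0 * 2.0 ** hash_bits)
    return -math.expm1(exponent)  # expm1 keeps precision for tiny values


if __name__ == "__main__":
    # Hypothetical records standing in for rows of a real dataset.
    records = [f"row-{i}".encode() for i in range(20_000)]

    counts = bucket_counts(records)
    print("buckets used:", len(counts))
    print("smallest / largest bucket:", min(counts.values()), max(counts.values()))

    shuffled = list(reversed(records))
    print("same fingerprint:", summed_fingerprint(records) == summed_fingerprint(shuffled))

    for bits in (32, 128, 256):
        print(bits, "bits:", collision_probability(10**9, bits))
```

With three hex digits there are 16^3 = 4096 possible buckets, so comparing the smallest and largest bucket against the average (records / 4096) gives a first impression of how even the spread is.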

Did I get stuck anywhere?
I had difficulty working with Boto3, the AWS SDK for Python, and had to go through the documentation extensively to resolve my problems.
