What did I do this week?
This week, I set up an environment for experimenting with large datasets using an Amazon S3 bucket and an EC2 instance. Previously, I was testing my code on my local machine. My hard drive has only 10 GB free, so it wasn’t possible to work with TB- or PB-sized datasets.
What will I do next week?
- Benchmark the performance of the implemented hashing technique on datasets of different sizes.
- Check whether partitioning hashes by their first three digits spreads them evenly across buckets.
- Check the probability of hash collisions for several hashing algorithms (MD5, SHA-256, MurmurHash, etc.). We want to avoid false positives that could arise from collisions in large datasets. Also, check whether summing hashes is a viable quick way to tell if two datasets are identical.
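
As a rough starting point for these checks, here is a minimal Python sketch. It is not the implementation used in this project. The function names (`record_hash`, `bucket_counts`, `collision_probability`, `dataset_fingerprint`) are hypothetical, records are assumed to be byte strings, and "first three digits" is assumed to mean the first three hexadecimal digits of the digest. MurmurHash is left out because it is not in the Python standard library.

```python
import hashlib
import math
from collections import Counter


def record_hash(record: bytes, algorithm: str = "sha256") -> str:
    """Return the hex digest of one record (hypothetical helper)."""
    return hashlib.new(algorithm, record).hexdigest()


def bucket_counts(records, prefix_len: int = 3, algorithm: str = "sha256") -> Counter:
    """Count how many records land in each bucket, where a bucket is
    identified by the first `prefix_len` hex digits of the record's hash."""
    return Counter(record_hash(r, algorithm)[:prefix_len] for r in records)


def collision_probability(n_items: int, hash_bits: int) -> float:
    """Birthday-bound approximation of the chance that at least two of
    `n_items` distinct inputs share a hash, assuming uniformly distributed
    outputs: p ~ 1 - exp(-n(n-1) / 2^(bits+1))."""
    exponent = -n_items * (n_items - 1) / 2.0 ** (hash_bits + 1)
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(exponent)


def dataset_fingerprint(records, algorithm: str = "sha256") -> int:
    """Sum all record hashes modulo 2^bits. The result does not depend on
    record order, so two datasets with the same records in a different
    order produce the same fingerprint."""
    bits = hashlib.new(algorithm).digest_size * 8
    total = 0
    for r in records:
        total = (total + int(record_hash(r, algorithm), 16)) % (1 << bits)
    return total


if __name__ == "__main__":
    sample = [f"row-{i}".encode() for i in range(100_000)]

    counts = bucket_counts(sample)
    print("buckets used:", len(counts), "of", 16 ** 3)
    print("smallest / largest bucket:", min(counts.values()), max(counts.values()))

    for name, bits in [("md5", 128), ("sha256", 256)]:
        print(name, "collision probability for 1e12 records:",
              collision_probability(10 ** 12, bits))

    print("same fingerprint after reordering:",
          dataset_fingerprint(sample) == dataset_fingerprint(reversed(sample)))
```

The birthday bound only describes accidental collisions for a hash with uniformly distributed output. It says nothing about deliberately crafted inputs, and a plain sum of hashes is likely weaker than the underlying hash in that setting, so the summing approach probably suits quick, non-adversarial comparisons best.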
Did I get stuck anywhere?
I had difficulty working with Boto3, the AWS software development kit (SDK) for Python, and had to go through the documentation extensively to resolve my problems.