Weekly Check-in #10: (26 July - 1 Aug)
anubhavp
Published: 07/30/2019
What did you do this week?
- Improved the performance of Protego by implementing lazy regex compilation (a minimal sketch of the idea follows the benchmark results below).
- Benchmark results (all times in seconds; Rerp = Robotexclusionrulesparser):
  - Time taken to parse 570 `robots.txt` files:

    | Parser | 25th percentile | 50th percentile | 75th percentile | 100th percentile | Total time |
    | --- | --- | --- | --- | --- | --- |
    | Protego | 0.000134 | 0.000340 | 0.000911 | 0.345727 | 0.999360 |
    | Rerp | 0.000066 | 0.000123 | 0.000279 | 0.101409 | 0.317715 |
    | Reppy | 0.000028 | 0.000038 | 0.000063 | 0.015579 | 0.055850 |

  - Time taken to parse 570 `robots.txt` files and answer 1000 queries per `robots.txt`:

    | Parser | 25th percentile | 50th percentile | 75th percentile | 100th percentile | Total time |
    | --- | --- | --- | --- | --- | --- |
    | Protego | 0.009057 | 0.012806 | 0.023660 | 9.033481 | 21.999680 |
    | Rerp | 0.006096 | 0.011864 | 0.041876 | 35.027233 | 68.811635 |
    | Reppy | 0.000858 | 0.001018 | 0.001472 | 0.236081 | 1.132098 |
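For the lazy regex compilation mentioned in the first bullet, the idea is simply to defer `re.compile()` until a rule is first matched, so parsing files whose rules are never queried skips the compilation cost. A minimal, illustrative sketch (the class and attribute names are mine, not Protego's actual internals):

```python
import re


class _Rule:
    """Illustrative rule: compile the regex only when it is first needed."""

    def __init__(self, pattern):
        # `pattern` is the robots.txt path already translated to a regex string.
        self._pattern = pattern
        self._regex = None  # compiled lazily on first match

    def match(self, path):
        if self._regex is None:
            self._regex = re.compile(self._pattern)
        return self._regex.match(path) is not None
```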
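The percentile figures above are per-file timings in seconds. For context, a rough sketch of how such numbers can be collected; the file locations, query URL, and the use of `statistics.quantiles` are illustrative assumptions, not the actual benchmark script:

```python
import glob
import statistics
import time

from protego import Protego


def benchmark(paths, queries_per_file=1000, url="https://example.com/some/path"):
    """Parse each robots.txt file, answer queries, and report per-file timings."""
    per_file = []
    for path in paths:
        with open(path, encoding="utf-8", errors="ignore") as f:
            content = f.read()
        start = time.perf_counter()
        rp = Protego.parse(content)
        for _ in range(queries_per_file):
            rp.can_fetch(url, "mybot")
        per_file.append(time.perf_counter() - start)
    q25, q50, q75 = statistics.quantiles(per_file, n=4)  # quartile cut points
    return q25, q50, q75, max(per_file), sum(per_file)


if __name__ == "__main__":
    print(benchmark(glob.glob("robots-files/*.txt")))
```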
What is coming up next?
- This will depend on the review from the mentors. If everything looks good to them, I will shift my focus back to Scrapy.
Did you get stuck anywhere?
[Blog #4] Need For Speed
anubhavp
Published: 07/24/2019
Hey! This is my fifth blog post for GSoC 2019, covering weeks 7 and 8.
Most of week 7 was spent making Protego compatible with Google's parser. I also worked on documentation; since the Protego codebase is small, proper comments and a good README were sufficient. I uploaded Protego to PyPI: `pip install Protego` is all it takes to install it.
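For reference, basic usage looks roughly like this; the `robots.txt` content and bot name below are made up, and `Protego.parse()` / `can_fetch()` are assumed to be the public entry points at the time of writing:

```python
from protego import Protego

robotstxt = """
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

rp = Protego.parse(robotstxt)
print(rp.can_fetch("https://example.com/private/page", "mybot"))  # False
print(rp.can_fetch("https://example.com/public/page", "mybot"))   # True
print(rp.crawl_delay("mybot"))                                     # crawl delay for "mybot"
```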
Week 8 was quite interesting. For Protego to become the default in Scrapy, it is necessary that it doesn't throw any kind of error while parsing `robots.txt` files. To make sure of that, I downloaded the `robots.txt` of the top 10,000 websites and added tests to check that Protego doesn't throw any exceptions while parsing them (a sketch of such a test is shown below). I also benchmarked Protego, and the results were quite disappointing. You can see the results here.
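A sketch of what such a "must not raise" test can look like with pytest; the directory path is a placeholder, not the actual test layout in the repository:

```python
import glob

import pytest

from protego import Protego

# Placeholder path: directory containing the downloaded robots.txt files.
ROBOTS_FILES = glob.glob("tests/downloaded-robots/*.txt")


@pytest.mark.parametrize("path", ROBOTS_FILES)
def test_parsing_does_not_raise(path):
    """Protego should parse real-world robots.txt files without raising."""
    with open(path, encoding="utf-8", errors="ignore") as f:
        content = f.read()
    # Any exception raised here fails the test.
    Protego.parse(content)
```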
We decided to spend the next week improving Protego's performance. I am going to try profiling and heuristics to see if the performance can be improved.
Weekly Check-in #9: (19 July - 26 July)
anubhavp
Published: 07/23/2019
What did you do this week?
- I added tests to make sure Protego doesn't throw exceptions on the `robots.txt` files of the 10,000 most popular websites.
- Utilised Scrapy to create a tool to download the `robots.txt` files of the 10,000 most popular websites (a rough sketch of such a downloader follows this list).
- Benchmarked Protego: I ran Protego (written in Python), Robotexclusionrulesparser (written in Python), and Reppy (written in C++ but with a Python interface) on 570 `robots.txt` files downloaded from the top 1000 websites, and here are the results.
  - Time spent parsing the `robots.txt` files:
    - Protego: 79.00128873897484 seconds
    - Robotexclusionrulesparser: 0.30100024401326664 seconds
    - Reppy: 0.05821833698428236 seconds
  - Time spent answering queries (1000 identical queries per `robots.txt`):
    - Protego: 14.736387926022871 seconds
    - Robotexclusionrulesparser: 67.33521732398367 seconds
    - Reppy: 1.0866852040198864 seconds
- Added logging to Protego.
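As referenced above, the download tool can be as small as a single Scrapy spider along these lines; the input file, output directory, and settings are illustrative assumptions, not the actual tool:

```python
import pathlib

import scrapy


class RobotsSpider(scrapy.Spider):
    """Illustrative spider: fetch /robots.txt for every domain in a text file."""

    name = "robots_downloader"
    # Don't let Scrapy's own robots.txt middleware filter these requests.
    custom_settings = {"ROBOTSTXT_OBEY": False}

    def start_requests(self):
        # Placeholder input: one domain per line, e.g. taken from a top-sites list.
        for domain in pathlib.Path("domains.txt").read_text().splitlines():
            yield scrapy.Request(f"https://{domain}/robots.txt")

    def parse(self, response):
        # Save each response under a file named after its host.
        host = response.url.split("/")[2]
        out_dir = pathlib.Path("robots-files")
        out_dir.mkdir(exist_ok=True)
        (out_dir / f"{host}.txt").write_bytes(response.body)
```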
What is coming up next?
- This will depend on the review from the mentors. If everything looks good to them, I will shift my focus back to Scrapy.
- Make `SitemapSpider` use the new interface for `robots.txt` parsers.
- Implement Crawl-delay & Host directive support in Scrapy.
Did you get stuck anywhere?
[Blog #3] Google open-sourced its robots.txt parser
anubhavp
Published: 07/22/2019
Hey! This is my fourth blog post for GSoC 2019, covering weeks 5 and 6.
A few interesting things have happened: Google has open-sourced its robots.txt parser and has also taken the lead in pushing the community and enterprises to create an official specification for `robots.txt`. I spent a good amount of time making Protego compatible with Google's parser. This required rewriting a good chunk of Protego to support Google's parser-specific behaviour, such as merging record groups, supporting misspelled directives, etc.
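To illustrate what merging record groups means: when the same user agent appears in more than one group, the rules are treated as one combined group, so both `Disallow` lines below apply. A small, hedged example assuming the `Protego.parse()` / `can_fetch()` API:

```python
from protego import Protego

# The same user agent appears in two groups; after merging, both
# Disallow rules apply to it.
robotstxt = """
User-agent: *
Disallow: /a/

User-agent: *
Disallow: /b/
"""

rp = Protego.parse(robotstxt)
print(rp.can_fetch("https://example.com/a/page", "somebot"))  # False
print(rp.can_fetch("https://example.com/b/page", "somebot"))  # False
```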
I am scared of reading or writing C++ code that uses the STL or pointers heavily, so going through the source code of Google's parser was somewhat uncomfortable, but I was able to understand a good chunk of it after a few days of struggle.
Next up, I will work on making Protego 100% compatible with Google's parser, document Protego, and collect `robots.txt` files from the top 1000 websites to understand usage patterns.
[Blog #2] Protego parse!
anubhavp
Published: 07/21/2019
Hey! This is my third blog post for GSoC 2019, covering weeks 3 and 4.
The first part of my project, concerning the interface for `robots.txt` parsers, is almost complete.
I have started working on a pure-Python `robots.txt` parser, which I have named Protego. The name is borrowed from the Harry Potter universe, where Protego is a charm that creates a shield to protect the caster. The end goal for Protego is to support all of the popular directives, wildcard matching, and a good number of less popular directives. We also aim to make Protego 100% compatible with Google's robots.txt parser, and we intend Protego to become the default `robots.txt` parser in Scrapy.
I have implemented support for all major directives in Protego, as well as wildcard matching. I utilised pytest and tox to automate testing Protego on every supported version of Python, and further used Travis to run the tests automatically on every push and pull request. I also borrowed tests from other parsers; Protego currently passes all tests borrowed from `reppy`, `rep-cpp` and `robotexclusionrulesparser`.
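As a small illustration of the wildcard support mentioned above, `*` matches any sequence of characters and `$` anchors the end of a URL; the content and bot name below are made up:

```python
from protego import Protego

robotstxt = """
User-agent: *
Disallow: /*.pdf$
Disallow: /private*/
"""

rp = Protego.parse(robotstxt)
print(rp.can_fetch("https://example.com/docs/report.pdf", "mybot"))       # False: matches /*.pdf$
print(rp.can_fetch("https://example.com/docs/report.pdf.html", "mybot"))  # True: $ anchors the end
print(rp.can_fetch("https://example.com/private-area/page", "mybot"))     # False: matches /private*/
```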