anubhavp's Blog

Weekly Check-in #10 : ( 26 July - 1 Aug )

Published: 07/30/2019

What did you do this week?

  • Improved performance of Protego by implementing lazy regex compilation.
  • Benchmark Results :
    • Time taken to parse 570 `robots.txt` files (in seconds) :

Protego : 
25th percentile : 0.000134
50th percentile : 0.000340
75th percentile : 0.000911
100th percentile : 0.345727
Total Time : 0.999360

Robotexclusionrulesparser (Rerp) : 
25th percentile : 0.000066
50th percentile : 0.000123
75th percentile : 0.000279
100th percentile : 0.101409
Total Time : 0.317715

Reppy : 
25th percentile : 0.000028
50th percentile : 0.000038
75th percentile : 0.000063
100th percentile : 0.015579
Total Time : 0.055850

    • Time taken to parse 570 `robots.txt` files and answer 1000 queries per file (in seconds) :

Protego : 
25th percentile : 0.009057
50th percentile : 0.012806
75th percentile : 0.023660
100th percentile : 9.033481
Total Time : 21.999680

Rerp : 
25th percentile : 0.006096
50th percentile : 0.011864
75th percentile : 0.041876
100th percentile : 35.027233
Total Time : 68.811635

Reppy : 
25th percentile : 0.000858
50th percentile : 0.001018
75th percentile : 0.001472
100th percentile : 0.236081
Total Time : 1.132098
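
Lazy regex compilation can be sketched as follows. This is a minimal illustration of the general idea, not Protego's actual code: the `LazyPattern` class and the example patterns are hypothetical. Parsing only stores the pattern strings, and a pattern is compiled the first time a query needs it.

```python
import re


class LazyPattern:
    """Holds a regex source string and compiles it only on first use.

    Hypothetical helper for illustration; not part of Protego's API.
    """

    def __init__(self, source):
        self.source = source
        self._compiled = None  # filled in by the first call to match()

    def match(self, path):
        if self._compiled is None:
            self._compiled = re.compile(self.source)
        return self._compiled.match(path) is not None


# "Parsing" only stores strings, so it stays cheap even for large files.
rules = [LazyPattern(r"/private/.*"), LazyPattern(r"/tmp/.*\.html$")]

# Only the rule that is actually consulted pays the compilation cost.
print(rules[0].match("/private/data"))
print(rules[1]._compiled is None)
```

The trade-off is that the first query touching a rule is slower than later ones, which matters little when most rules in a file are never consulted.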

What is coming up next?

  • Will depend on the review from the mentors. If everything looks good to them, I will shift my focus back to Scrapy.

Did you get stuck anywhere?

  • Nothing major.

[Blog #4] Need For Speed

Published: 07/24/2019

Hey! This is my fifth blog post for GSoC 2019, covering weeks 7 and 8.

Most of week 7 was spent making Protego compatible with Google's parser. I also worked on the documentation; since the Protego codebase is small, proper comments and a good README were sufficient. I uploaded Protego to PyPI: `pip install Protego` is all it takes to install it.

Week 8 was quite interesting. For Protego to become the default in Scrapy, it must not throw any kind of error while parsing `robots.txt` files. To make sure of that, I decided to download `robots.txt` files from the top 10,000 websites. I added tests to check whether Protego throws any exceptions while parsing the downloaded files. I also benchmarked Protego, and the results were quite disappointing. You can see the result here.

We decided to spend the next week improving the performance of Protego. I am going to try profiling and heuristics, and see if the performance can be improved.


Weekly Check-in #9 : ( 19 July - 26 July )

Published: 07/23/2019

What did you do this week?

  • I added tests to make sure Protego doesn't throw exceptions on the `robots.txt` files of the 10,000 most popular websites.
  • Utilised Scrapy to create a tool to download the `robots.txt` files of the 10,000 most popular websites.
  • Benchmarked Protego : I ran Protego (written in Python), Robotexclusionrulesparser (written in Python) and Reppy (written in C++ but with a Python interface) on 570 `robots.txt` files downloaded from the top 1000 websites, and here are the results.
    • Time spent parsing the `robots.txt` files
      • Protego : 79.00128873897484 seconds 💔
      • Robotexclusionrulesparser : 0.30100024401326664 seconds
      • Reppy : 0.05821833698428236 seconds
    • Time spent answering queries (1000 identical queries per `robots.txt`)
      • Protego : 14.736387926022871 seconds
      • Robotexclusionrulesparser : 67.33521732398367 seconds
      • Reppy : 1.0866852040198864 seconds
  • Added logging to Protego.
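
A benchmark of this shape can be sketched as below. This is an assumption-laden illustration, not the script used for the numbers above: the standard library's `urllib.robotparser` stands in for the parsers under test, and the `benchmark` function, the sample files and the query URL are all hypothetical.

```python
import time
from urllib.robotparser import RobotFileParser


def benchmark(bodies, url, user_agent, queries_per_file=1000):
    """Return (total parse seconds, total query seconds) over all bodies."""
    parse_time = 0.0
    query_time = 0.0
    for body in bodies:
        # Time the parsing step on its own.
        start = time.perf_counter()
        parser = RobotFileParser()
        parser.parse(body.splitlines())
        parse_time += time.perf_counter() - start

        # Time the same query repeated many times against the parsed file.
        start = time.perf_counter()
        for _ in range(queries_per_file):
            parser.can_fetch(user_agent, url)
        query_time += time.perf_counter() - start
    return parse_time, query_time


# Two tiny in-memory files stand in for the downloaded corpus.
sample_bodies = [
    "User-agent: *\nDisallow: /private/\n",
    "User-agent: *\nAllow: /\nCrawl-delay: 2\n",
]
parse_total, query_total = benchmark(
    sample_bodies, "https://example.com/private/page", "mybot"
)
print(f"Time spent parsing : {parse_total:.6f} seconds")
print(f"Time spent answering queries : {query_total:.6f} seconds")
```

Timing the two phases separately is what makes it possible to see that a parser can be slow at parsing yet fast at answering queries, or the other way round.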

What is coming up next?

  • Will depend on the review from the mentors. If everything looks good to them, I will shift my focus back to Scrapy.
  • Make `SitemapSpider` use the new interface for `robots.txt` parsers.
  • Implement Crawl-delay & Host directive support in Scrapy.

Did you get stuck anywhere?

  • Nothing major.

[Blog #3] Google open-sourced its robots.txt parser

Published: 07/22/2019


Hey! This is my fourth blog post for GSoC 2019, covering weeks 5 and 6.

A few interesting things have happened: Google has open-sourced its robots.txt parser and has also taken the lead in pushing the community and enterprises to create an official specification for `robots.txt`. I spent a good amount of time making Protego compatible with Google’s parser. This required rewriting a good chunk of Protego to support behaviour specific to Google’s parser, such as merging record groups and supporting misspellings.
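
The two behaviours mentioned above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not Protego's or Google's code: the `parse_groups` function is hypothetical, and the misspelling table is a small made-up sample rather than a copy of any parser's real list.

```python
from collections import defaultdict

# Illustrative misspellings only; not a copy of any parser's actual list.
DIRECTIVE_ALIASES = {
    "disalow": "disallow",
    "dissallow": "disallow",
    "useragent": "user-agent",
}


def parse_groups(text):
    """Map each lower-cased user-agent to the rules from all of its groups."""
    groups = defaultdict(list)
    current_agents = []
    last_was_agent = False
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        key = DIRECTIVE_ALIASES.get(key.lower(), key.lower())
        if key == "user-agent":
            # Consecutive user-agent lines share one group.
            if not last_was_agent:
                current_agents = []
            current_agents.append(value.lower())
            last_was_agent = True
        else:
            if key in ("allow", "disallow"):
                # Appending per agent merges groups that name the same agent.
                for agent in current_agents:
                    groups[agent].append((key, value))
            last_was_agent = False
    return dict(groups)


sample = """\
User-agent: a
Disallow: /x

User-agent: b
Disallow: /y

Useragent: a
Disalow: /z
"""
merged = parse_groups(sample)
print(merged["a"])
print(merged["b"])
```

Here the two separate groups for agent `a` end up as one rule list, and the misspelled `Useragent` and `Disalow` lines are treated like their correctly spelled forms.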

I am scared of reading or writing C++ code that uses the STL or pointers heavily, so going through the source code of Google’s parser was rather uncomfortable, but after a few days of struggle I was able to understand a good chunk of it.

Next up, I will work on making Protego 100% compatible with Google’s parser. I will also document Protego and collect `robots.txt` files from the top 1000 websites to understand usage patterns.




[Blog #2] Protego parse!

Published: 07/21/2019

Hey! This is my third blog post for GSoC 2019, covering weeks 3 and 4.

The first part of my project, concerning the interface for `robots.txt` parsers, is almost complete.

I have started working on a pure-Python `robots.txt` parser, which I have named Protego. The name is borrowed from the Harry Potter universe, where Protego is a charm that creates a shield to protect the caster. The end goal for Protego is to support all of the popular directives, wildcard matching, and a good number of less popular directives. We also aim to make Protego 100% compatible with Google's robots.txt parser, and we intend Protego to become the default `robots.txt` parser in Scrapy.

I have implemented support for all major directives in Protego, as well as support for wildcard matching. I utilised pytest and tox to automate testing Protego on every version of Python, and I further used Travis to run the tests automatically on code pushes and pull requests. I borrowed tests from other parsers to check Protego against. Protego currently passes all tests borrowed from `reppy`, `rep-cpp` and `robotexclusionrulesparser`.
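
Wildcard matching in `robots.txt` rules is commonly handled by translating each path pattern into a regular expression. The sketch below shows that approach under the usual conventions (`*` matches any run of characters, a trailing `$` anchors the end of the path). The helper names are hypothetical, and this is not necessarily how Protego does it.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    `*` matches any run of characters and a trailing `$` anchors the end
    of the path; everything else is matched literally as a prefix.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces, then join them with ".*" for each "*".
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def matches(pattern, path):
    """Return True if the path is covered by the robots.txt pattern."""
    return pattern_to_regex(pattern).match(path) is not None


print(matches("/private/*.html", "/private/a/b.html"))
print(matches("/*.php$", "/index.php?x=1"))
```

Escaping the literal pieces matters because characters such as `.` and `?` are common in URL paths and would otherwise be read as regex operators.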
