Weekly Check-in #12: (9 Aug - 15 Aug)

anubhavp
Published: 08/13/2019

What did you do this week?

  • Benchmarking Protego (again). This time we crawled multiple domains (~1,100 domains) and collected links to pages as the crawler encountered them - 111,824 links in total.
    • Next, we made each `robots.txt` parser parse and answer queries (each query was answered 5 times) in an order similar to how they would come up in a broad crawl (a sketch of the timing loop follows the results). Here are the results :

Protego :

25th percentile : 0.002419 seconds
50th percentile : 0.006798 seconds
75th percentile : 0.014307 seconds
100th percentile : 2.546978 seconds
Total Time : 19.545984 seconds

RobotFileParser (default in Scrapy) :

25th percentile : 0.002188 seconds
50th percentile : 0.005350 seconds
75th percentile : 0.010492 seconds
100th percentile : 1.805923 seconds
Total Time : 13.799954 seconds

Rerp Parser :

25th percentile : 0.001288 seconds
50th percentile : 0.005222 seconds
75th percentile : 0.014640 seconds
100th percentile : 52.706880 seconds
Total Time : 76.460496 seconds

Reppy Parser :

25th percentile : 0.000210 seconds
50th percentile : 0.000492 seconds
75th percentile : 0.000997 seconds
100th percentile : 0.129440 seconds
Total Time : 1.405558 seconds
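
To make the methodology concrete, here is a minimal sketch of the timing loop described above. The names (`events`, `make_parser`, the `can_fetch`-style method) are illustrative stand-ins rather than the actual benchmark code, since each parser exposes a slightly different API:

```python
import time

def benchmark(events, make_parser, repeats=5):
    """Replay (robotstxt_content, url) pairs in crawl order, parsing
    each robots.txt the first time it is seen and answering every
    query `repeats` times."""
    timings = []
    parsers = {}  # cache: one parser instance per robots.txt body
    for robotstxt, url in events:
        start = time.perf_counter()
        parser = parsers.get(robotstxt)
        if parser is None:
            parser = parsers[robotstxt] = make_parser(robotstxt)
        for _ in range(repeats):
            parser.can_fetch(url, 'mybot')
        timings.append(time.perf_counter() - start)
    return timings
```
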
  • Removing a hack used in Protego, needed because `urllib.parse.unquote` lacks an option to leave selected characters percent-encoded (a sketch of that kind of workaround follows this list). Added a few new features to Protego as well.
  • Protego has been moved to the Scrapy organisation on GitHub.
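
For context, the workaround looked something like the sketch below: hide the escapes you want to keep behind placeholder tokens, unquote, then restore them. The function name, placeholder scheme, and the default escapes kept are illustrative, not Protego's actual code:

```python
from urllib.parse import unquote

def unquote_keeping(value, keep=('%2F',)):
    """Percent-decode `value` while leaving the escapes listed in
    `keep` intact (note: kept escapes are canonicalised to upper case)."""
    placeholders = {}
    for i, escape in enumerate(keep):
        token = '\x00{}\x00'.format(i)  # unlikely to appear in a URL path
        placeholders[token] = escape
        value = value.replace(escape, token)
        value = value.replace(escape.lower(), token)
    value = unquote(value)
    for token, escape in placeholders.items():
        value = value.replace(token, escape)
    return value

# e.g. unquote_keeping('/a%2Fb%20c') returns '/a%2Fb c'
```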

What is coming up next?

  • Configuring Travis CI to push releases to PyPI automatically.
  • Introducing a new `ROBOTSTXT_USER_AGENT` setting in Scrapy (a sketch of its intended use follows this list).
  • Making `SitemapSpider` use the new interface.
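
Here is how the proposed setting might look in a project's `settings.py`; the bot name and URL are made up for illustration:

```python
# settings.py - sketch of the proposed setting in use
ROBOTSTXT_OBEY = True
# Match robots.txt rules against this short token instead of the full
# USER_AGENT string, which usually carries version and contact info.
ROBOTSTXT_USER_AGENT = 'mybot'
USER_AGENT = 'mybot/1.0 (+https://example.com/bot-info)'
```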

Did you get stuck anywhere?

I got blocked by StackExchange for a few hours. I think they don't like crawlers on their websites. "It is a temporary automatic ban that is put up by our HAProxy instance when you hit our rate limiter," reads an answer to one of the questions on their website.


Weekly Check-in #11: (2 Aug - 8 Aug)

anubhavp
Published: 08/06/2019

What did you do this week?

  • Added an API description and more usage examples to the readme.
  • Added a PyPy test environment.
  • Opened a pull request to add Protego integration to Scrapy.
  • Modified Protego to treat non-terminal dollar signs as ordinary characters (an example follows this list).
  • Minor aesthetic changes.
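
A quick illustration of the intended behaviour, using Protego's `parse`/`can_fetch` API. The rules and paths are made up, and the results in the comments are what I expect assuming the change works as described:

```python
from protego import Protego

robotstxt = "\n".join([
    "User-agent: *",
    "Disallow: /exact$",      # trailing `$` anchors the rule
    "Disallow: /price$list",  # non-terminal `$` is now a plain character
])
rp = Protego.parse(robotstxt)
print(rp.can_fetch('/exact', 'mybot'))       # expected: False (anchored rule matches)
print(rp.can_fetch('/exactly', 'mybot'))     # expected: True (anchor blocks prefix match)
print(rp.can_fetch('/price$list', 'mybot'))  # expected: False (literal `$` matches)
```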

What is coming up next?

  • Transferring the Protego repository to the Scrapy organisation on GitHub. It seems that write permissions are necessary to initiate the transfer process.
  • Modifying Protego to treat wildcards such as `*` and `$` as ordinary characters as well.
  • Modifying `SitemapSpider` to use the new interface.
  • Implementing support for the `host` and `crawl-delay` directives in Scrapy.
  • Some performance improvement might be possible by using custom pattern-matching logic in place of regexes, but I am not sure - I will need to test it (a sketch of what such a matcher could look like follows this list).
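
For the record, this is roughly the regex-free matcher I have in mind. It handles `*` wildcards and a terminal `$` anchor; it is a sketch, untested against edge cases, and not what Protego currently ships:

```python
def robots_match(pattern, path):
    """Return True if `path` matches a robots.txt rule `pattern`,
    where `*` matches any run of characters and a trailing `$`
    anchors the rule to the end of the path."""
    anchored = pattern.endswith('$')
    if anchored:
        pattern = pattern[:-1]
    parts = pattern.split('*')
    if len(parts) == 1:
        # No wildcard: exact match if anchored, else prefix match.
        return path == pattern if anchored else path.startswith(pattern)
    if not path.startswith(parts[0]):
        return False
    end = len(path)
    middle = parts[1:]
    if anchored:
        # Reserve the tail of the path for the final literal chunk.
        if not path.endswith(parts[-1]):
            return False
        end -= len(parts[-1])
        middle = parts[1:-1]
    pos = len(parts[0])
    for chunk in middle:
        # Greedy left-to-right placement is correct when `*` is the
        # only wildcard.
        idx = path.find(chunk, pos, end)
        if idx == -1:
            return False
        pos = idx + len(chunk)
    return pos <= end

# e.g. robots_match('/foo*bar$', '/fooXYZbar') returns True
```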

Did you get stuck anywhere?

  • Faced problems setting up the PyPy test environment. With help from my mentors, I was able to solve the issue.

Weekly Check-in #10: (26 July - 1 Aug)

anubhavp
Published: 07/30/2019

What did you do this week?

  • Improved the performance of Protego by implementing lazy regex compilation (a sketch of the idea follows the benchmark results below).
  • Benchmark Results :
    • Time taken to parse 570 `robots.txt` files (all times in seconds) :

Protego : 
25th percentile : 0.000134
50th percentile : 0.000340
75th percentile : 0.000911
100th percentile : 0.345727
Total Time : 0.999360

Rerp : 
25th percentile : 0.000066
50th percentile : 0.000123
75th percentile : 0.000279
100th percentile : 0.101409
Total Time : 0.317715

Reppy : 
25th percentile : 0.000028
50th percentile : 0.000038
75th percentile : 0.000063
100th percentile : 0.015579
Total Time : 0.055850

    • Time taken to parse 570 `robots.txt` files and answer 1000 queries per `robots.txt` (all times in seconds) :

Protego : 
25th percentile : 0.009057
50th percentile : 0.012806
75th percentile : 0.023660
100th percentile : 9.033481
Total Time : 21.999680

Rerp : 
25th percentile : 0.006096
50th percentile : 0.011864
75th percentile : 0.041876
100th percentile : 35.027233
Total Time : 68.811635

Reppy : 
25th percentile : 0.000858
50th percentile : 0.001018
75th percentile : 0.001472
100th percentile : 0.236081
Total Time : 1.132098
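
For clarity, the lazy-compilation idea looks roughly like the sketch below: store the raw pattern and only build the regex the first time a rule is actually queried, so rules that are never consulted cost nothing beyond storing a string. The class and translation helper are illustrative, not Protego's actual implementation:

```python
import re

class LazyRule:
    """Compile a rule's regex only on first use."""

    def __init__(self, pattern):
        self._pattern = pattern
        self._regex = None  # built lazily

    def match(self, path):
        if self._regex is None:  # compile on first query
            self._regex = re.compile(self._translate(self._pattern))
        return self._regex.match(path) is not None

    @staticmethod
    def _translate(pattern):
        # Escape regex metacharacters, then restore robots.txt wildcards.
        escaped = re.escape(pattern)
        escaped = escaped.replace(r'\*', '.*')
        if escaped.endswith(r'\$'):
            escaped = escaped[:-2] + '$'
        return escaped
```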

What is coming up next?

  • That will depend on the review from my mentors. If everything looks good to them, I will shift my focus back to Scrapy.

Did you get stuck anywhere?

  • Nothing major.

[Blog #4] Need For Speed

anubhavp
Published: 07/24/2019

Hey! This is my fifth blog post for GSoC 2019, covering weeks 7 and 8.

Most of week 7 was spent making Protego compatible with Google's parser. I also worked on the documentation; since the Protego codebase is small, proper comments and a good readme were sufficient. I uploaded Protego to PyPI - `pip install Protego` is all it takes to install it.

Week 8 was quite interesting. For Protego to become the default in Scrapy, it must not throw any kind of error while parsing `robots.txt` files. To make sure of that, I downloaded the `robots.txt` files of the top 10,000 websites and added tests to check whether Protego throws any exceptions while parsing them. I also benchmarked Protego, and the results were quite disappointing. You can see the results here.

We decided to spend the next week improving the performance of Protego. I am going to try profiling and heuristics to see where the time goes and whether it can be reduced (a profiling sketch is below).
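
A profiling run can start as simply as the sketch below, using `cProfile` from the standard library. The file name, URL, and user agent are placeholders:

```python
import cProfile
import pstats

from protego import Protego

# Profile repeated parsing and querying of one downloaded robots.txt.
with open('robots.txt') as f:
    content = f.read()

profiler = cProfile.Profile()
profiler.enable()
for _ in range(100):
    rp = Protego.parse(content)
    rp.can_fetch('https://example.com/some/page', 'mybot')
profiler.disable()

# Show the ten functions with the highest cumulative time.
pstats.Stats(profiler).sort_stats('cumulative').print_stats(10)
```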


Weekly Check-in #9: (19 July - 26 July)

anubhavp
Published: 07/23/2019

What did you do this week?

  • I added tests to make sure Protego doesn't throw exceptions on the `robots.txt` files of the top 10,000 most popular websites.
  • Utilised Scrapy to create a tool to download the `robots.txt` files of the top 10,000 most popular websites.
  • Benchmarked Protego : I ran Protego (written in Python), Robotexclusionrulesparser (written in Python), and Reppy (written in C++ but with a Python interface) on 570 `robots.txt` files downloaded from the top 1,000 websites, and here are the results.
    • Time spent parsing the `robots.txt` files :
      • Protego : 79.00128873897484 seconds
      • Robotexclusionrulesparser : 0.30100024401326664 seconds
      • Reppy : 0.05821833698428236 seconds
    • Time spent answering queries (1000 identical queries per `robots.txt`) :
      • Protego : 14.736387926022871 seconds
      • Robotexclusionrulesparser : 67.33521732398367 seconds
      • Reppy : 1.0866852040198864 seconds
  • Added logging to Protego.

What is coming up next?

  • That will depend on the review from my mentors. If everything looks good to them, I will shift my focus back to Scrapy.
  • Make `SitemapSpider` use the new interface for `robots.txt` parsers (a sketch of the interface shape follows this list).
  • Implement `Crawl-delay` and `Host` directive support in Scrapy.
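
For context, the parser interface being discussed looks roughly like the abstract base below; treat the names and signatures as an approximation of what is in the Scrapy pull request rather than a final API:

```python
from abc import ABCMeta, abstractmethod

class RobotParser(metaclass=ABCMeta):
    """Common interface each robots.txt parser is wrapped in, so that
    Scrapy components can swap parsers without code changes."""

    @classmethod
    @abstractmethod
    def from_crawler(cls, crawler, robotstxt_body):
        """Build a parser from the raw bytes of a robots.txt response."""

    @abstractmethod
    def allowed(self, url, user_agent):
        """Return True if `user_agent` is allowed to fetch `url`."""
```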

Did you get stuck anywhere?

  • Nothing major.