anubhavp's Blog

Weekly Check-in #9: (19 July - 26 July)

Published: 07/23/2019

What did you do this week?

  • I added tests to make sure Protego doesn't throw exceptions on the `robots.txt` files of the 10,000 most popular websites.
  • Utilised Scrapy to create a tool to download the `robots.txt` files of those websites.
  • Benchmarked Protego: I ran Protego (written in Python), Robotexclusionrulesparser (written in Python), and Reppy (written in C++ but with a Python interface) on 570 `robots.txt` files downloaded from the top 1,000 websites. Here are the results.
    • Time spent parsing the `robots.txt` files
      • Protego: 79.00128873897484 seconds
      • Robotexclusionrulesparser: 0.30100024401326664 seconds
      • Reppy: 0.05821833698428236 seconds
    • Time spent answering queries (1,000 identical queries per `robots.txt`)
      • Protego: 14.736387926022871 seconds
      • Robotexclusionrulesparser: 67.33521732398367 seconds
      • Reppy: 1.0866852040198864 seconds
  • Added logging to Protego.
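The timing methodology above can be sketched with the stdlib `timeit` module and the stdlib `RobotFileParser` standing in for the three libraries (the `robots.txt` content and bot name below are illustrative, not the actual benchmark data):

```python
import timeit
from urllib.robotparser import RobotFileParser

# Toy robots.txt standing in for a downloaded file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

def parse_once():
    rp = RobotFileParser()
    rp.parse(ROBOTS_TXT.splitlines())
    return rp

parser = parse_once()

# Mirror the methodology above: time repeated parses, then 1,000
# identical queries against a single parsed file.
parse_time = timeit.timeit(parse_once, number=1000)
query_time = timeit.timeit(
    lambda: parser.can_fetch("mybot", "https://example.com/private/x"),
    number=1000,
)
print(f"parse: {parse_time:.3f}s  query: {query_time:.3f}s")
```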

What is coming up next?

  • Will depend on the mentors' review. If everything looks good to them, I will shift my focus back to Scrapy.
  • Make `SitemapSpider` use the new interface for `robots.txt` parsers.
  • Implement Crawl-delay & Host directive support in Scrapy.

Did you get stuck anywhere?

  • Nothing major.

[Blog #3] Google open-sourced its robots.txt parser

Published: 07/22/2019

Hey! This is my fourth blog post for GSoC 2019, covering weeks 5 and 6.

A few interesting things have happened: Google has open-sourced its robots.txt parser, and has also taken the lead in pushing the community and enterprises to create an official specification for `robots.txt`. I spent a good amount of time making Protego compatible with Google's parser. This required rewriting a good chunk of Protego to support Google's parser-specific behaviour, such as merging record groups and supporting misspelled directives.
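For example, "merging record groups" refers to the rule in Google's parser that multiple groups naming the same user agent are combined into one. In the hypothetical snippet below, both `Disallow` lines apply to `examplebot`:

```
User-agent: examplebot
Disallow: /a

User-agent: examplebot
Disallow: /b
```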

I am scared of reading or writing C++ code that uses the STL or pointers heavily, so going through the source code of Google's parser was somewhat uncomfortable. Still, after a few days of struggle, I was able to understand a good chunk of it.

Next up, I will work on making Protego 100% compatible with Google's parser. I will also have to document Protego, and I will collect `robots.txt` files from the top 1,000 websites to understand usage patterns.




[Blog #2] Protego parse!

Published: 07/21/2019

Hey! This is my third blog post for GSoC 2019, covering weeks 3 and 4.

The first part of my project, the interface for `robots.txt` parsers, is almost complete.

I have started working on a pure-Python `robots.txt` parser, which I have named Protego. The name is borrowed from the Harry Potter universe, where Protego is a charm that creates a shield to protect the caster. The end goal for Protego is to support all of the popular directives, wildcard matching, and a good number of less popular directives. We also aim to make Protego 100% compatible with Google's robots.txt parser, and we intend Protego to become the default `robots.txt` parser in Scrapy.
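Wildcard matching is one of the trickier parts: `*` matches any sequence of characters, and a trailing `$` anchors a rule to the end of the URL path. A minimal stdlib-only sketch of the idea (this is an illustration, not Protego's actual implementation):

```python
import re

def rule_to_regex(pattern: str) -> "re.Pattern":
    """Translate a robots.txt path pattern into a compiled regex.

    '*' matches any sequence of characters; a trailing '$' anchors
    the pattern to the end of the URL path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape regex metacharacters, then restore '*' as '.*'.
    regex = re.escape(pattern).replace(r"\*", ".*")
    return re.compile(regex + ("$" if anchored else ""))

# '/fish*' matches any path starting with '/fish'.
print(bool(rule_to_regex("/fish*").match("/fishheads")))   # True
# '/*.php$' matches only paths that end in '.php'.
print(bool(rule_to_regex("/*.php$").match("/index.php")))  # True
print(bool(rule_to_regex("/*.php$").match("/index.php5"))) # False
```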

I have implemented support for all major directives in Protego, along with wildcard matching. I utilised pytest and tox to automate testing Protego on every supported version of Python, and further used Travis to run the tests automatically on every push and pull request. I also borrowed tests from other parsers to check Protego against: it currently passes all tests borrowed from `reppy`, `rep-cpp` and `robotexclusionrulesparser`.


[Blog #1 - Week 3] We are going faster than I predicted

Published: 07/21/2019

Hey, welcome back! It has been about 3 weeks since the beginning of the coding period, and I am making progress at a faster pace than I thought we could. Here are the details of the work I did in the first three weeks.

My week prior to the beginning of the coding period was spent setting up a development environment on my computer and learning about the internals of Scrapy. Since Scrapy is a Python project, setting up a development environment is fairly straightforward: clone the Scrapy repository with `git clone`, then create a virtual environment using `python3 -m venv venv` (this creates a Python 3 virtual environment on Ubuntu) and activate it using `source venv/bin/activate`. Install the Python dependencies by running `pip install -r requirements-py3.txt`. If you are planning to run tests locally, also install the test dependencies by running `pip install -r tests/requirements-py3.txt`. The setup is then complete.
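Collected in one place, the setup steps above look like this (repository URL and requirements file names as in the Scrapy repository at the time of writing):

```shell
# Clone Scrapy and enter the repository.
git clone https://github.com/scrapy/scrapy.git
cd scrapy

# Create and activate a Python 3 virtual environment.
python3 -m venv venv
source venv/bin/activate

# Install runtime dependencies, and optionally the test dependencies.
pip install -r requirements-py3.txt
pip install -r tests/requirements-py3.txt
```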

Scrapy uses Tox for testing. `tox` is a virtual-environment management tool that can be used to run tests under multiple versions of Python; it also reduces the work required to integrate CIs. To run tests under Python 2.7 and Python 3.7, use `tox -e py27,py37`. I had minor difficulty (I was unable to run `tox` using any version other than 2.7) as I was using `tox` for the first time, but I was easily able to solve it using the online documentation and help from my mentors. You can learn more about `tox` in its documentation.
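Each `tox` environment is just a named entry in `tox.ini`. A minimal, hypothetical configuration for the two interpreters mentioned above might look like this (Scrapy's real `tox.ini` is more involved):

```ini
[tox]
envlist = py27,py37

[testenv]
deps = -rtests/requirements-py3.txt
commands = pytest {posargs:tests}
```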

The first and second weeks were spent deciding on an interface specification and implementing the interface on three existing `robots.txt` parsers. Using feedback from the mentors, I improved upon the specification I proposed in my GSoC proposal. Documenting the interface turned out to be harder than implementing it: both Sphinx and `reStructuredText` were new to me, and I occasionally found it difficult to describe things clearly (documentation is hard). While implementing the interface on top of `RobotFileParser`, I had to take a deep dive into its source code. For some reason, I had always believed that reading through the implementation of Python (or any language) and its built-in modules would be difficult and not really useful, and that the code would mostly be complex and cryptic. This doesn't seem to be the case (at least with Python). I should do more of this, looking at a module's implementation for fun :smile:. I also modified Scrapy to use the new interface instead of directly calling `RobotFileParser`.
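The interface itself is small. Here is a sketch of the idea, with an adapter over the stdlib `RobotFileParser`; the method names are illustrative and not necessarily the ones that ended up in Scrapy:

```python
from abc import ABC, abstractmethod
from urllib.robotparser import RobotFileParser

class RobotParser(ABC):
    """A parser-agnostic robots.txt interface (illustrative sketch)."""

    @abstractmethod
    def allowed(self, url: str, user_agent: str) -> bool:
        """Return True if `user_agent` may fetch `url`."""

class PythonRobotParser(RobotParser):
    """Adapter implementing the interface on top of the stdlib parser."""

    def __init__(self, robotstxt_body: str):
        self._rp = RobotFileParser()
        self._rp.parse(robotstxt_body.splitlines())

    def allowed(self, url: str, user_agent: str) -> bool:
        # RobotFileParser takes its arguments in the opposite order.
        return self._rp.can_fetch(user_agent, url)

parser = PythonRobotParser("User-agent: *\nDisallow: /admin/\n")
print(parser.allowed("https://example.com/", "mybot"))         # True
print(parser.allowed("https://example.com/admin/x", "mybot"))  # False
```

With this in place, Scrapy can swap parser backends (stdlib, Protego, Reppy, ...) behind one interface without touching the middleware code.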

Next, I will work on creating extra test environments in Travis for testing third-party parser integration; I haven't worked with Travis before. I will also document the functionality these parsers provide in the Scrapy documentation, and I will need to make the code for testing the integration more maintainable.


Weekly Check-in #8: (12 July - 18 July)

Published: 07/18/2019

What did you do this week?

  • Protego is now compatible with Google's parser: it passes all the test cases in the test suite of Google's parser.
  • Documented Protego - functions now have docstrings, and I have created a readme file.
  • Protego is now on PyPI. To install Protego, simply type `pip install protego`.
  • Made a few changes to improve performance, as suggested by the mentors.

What is coming up next?

  • Additions to CI: making sure Protego doesn't throw exceptions on the `robots.txt` files of the top 1,000 Alexa websites.
  • Benchmark Protego.

Did you get stuck anywhere?

Nothing major.
