Hello! The fourth week of the GSoC coding period is coming to an end. Here is an update on what I achieved in the past week and what I am looking forward to.
What did you do this week?
- Implemented minor changes suggested by the Scrapy maintainers.
- Started working on a new pure Python robots.txt parser (it currently lives at https://github.com/anubhavp28/protego); it will eventually be moved to the Scrapy organisation.
- Implemented support for the standard robots.txt directives in the new parser (a usage sketch follows this list).
- Integrated the code with pytest and tox for testing (a sample configuration is sketched below).
- Integrated the repo with Travis CI to trigger tests automatically on pull requests (see the CI sketch below).
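To give an idea of how the new parser is meant to be used with the standard directives, here is a minimal sketch assuming the `Protego.parse()` / `can_fetch()` style interface the project is converging on (the API is still early and may change; the URLs and the "mybot" user agent are just placeholders):

```python
from protego import Protego

robotstxt = """
User-agent: *
Disallow: /admin/
Allow: /admin/public/
Crawl-delay: 10
"""

rp = Protego.parse(robotstxt)

# Longest-match semantics: the more specific Allow rule wins for /admin/public/.
print(rp.can_fetch("https://example.com/admin/secret.html", "mybot"))       # False
print(rp.can_fetch("https://example.com/admin/public/page.html", "mybot"))  # True
print(rp.crawl_delay("mybot"))                                              # 10.0
```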
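The pytest/tox setup boils down to a small tox configuration that runs the test suite against several interpreters. This is a hypothetical minimal sketch, not necessarily the exact file in the repo:

```ini
# tox.ini -- hypothetical minimal configuration
[tox]
envlist = py27, py36, py37

[testenv]
deps = pytest
commands = pytest {posargs:tests}
```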
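On the CI side, Travis simply installs tox and runs it for each Python version in the build matrix. Again, a hedged sketch rather than the real configuration:

```yaml
# .travis.yml -- hypothetical sketch; the real configuration may differ
language: python
python:
  - "2.7"
  - "3.6"
  - "3.7"
install:
  - pip install tox tox-travis
script:
  - tox
```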
What is coming up next?
- Implement support for modern conventions such as wildcard matching, the Clean-param directive, etc. (a test sketch follows this list).
- Add a lot of tests (mostly borrowed from existing parsers).
- Benchmark the new parser's performance against existing parsers (see the timing sketch below).
- Collect statistics on how robots.txt is used in the wild. On a mentor's suggestion, I am planning to fetch the robots.txt files of the top 1000 websites in the Alexa rankings and collect stats such as how many of them serve a robots.txt file, how many records a record group contains on average, and how often each directive is used (see the collection sketch below). This could help us make better choices for improving performance, such as whether to use a trie for prefix matching.
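For the wildcard-matching work and the test suite, the new tests will look roughly like this. It is a hedged sketch of behaviour the parser should have once the feature lands, written against the draft `Protego.parse()` / `can_fetch()` interface; the URLs and user agent are placeholders:

```python
from protego import Protego


def test_wildcard_and_end_anchor():
    # '*' should match any sequence of characters and '$' should anchor
    # the pattern to the end of the URL path.
    content = (
        "User-agent: *\n"
        "Disallow: /*.json$\n"
        "Allow: /public/\n"
    )
    rp = Protego.parse(content)
    assert not rp.can_fetch("https://example.com/data/records.json", "mybot")
    assert rp.can_fetch("https://example.com/public/index.html", "mybot")
```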
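For benchmarking, the rough plan is to time the new parser against existing ones on the same input. A sketch of what such a comparison might look like, here against the standard library's robotparser (the local "robots.txt" file and the iteration count are arbitrary):

```python
import timeit

setup = """
from urllib.robotparser import RobotFileParser
from protego import Protego

with open("robots.txt") as f:
    content = f.read()
"""

# Time how long each parser takes to parse the same robots.txt file.
protego_time = timeit.timeit("Protego.parse(content)", setup=setup, number=1000)
stdlib_time = timeit.timeit(
    "rp = RobotFileParser(); rp.parse(content.splitlines())", setup=setup, number=1000
)
print(f"Protego: {protego_time:.3f}s, stdlib robotparser: {stdlib_time:.3f}s")
```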
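For the statistics collection, the idea is simply to download each site's robots.txt and count what it contains. A hedged sketch of the approach (the input file name and its one-domain-per-line format are assumptions):

```python
import collections
import urllib.request

# Hypothetical input: a text file with one domain per line (e.g. the Alexa top 1000).
with open("alexa_top_1000.txt") as f:
    domains = [line.strip() for line in f if line.strip()]

directive_counts = collections.Counter()
sites_with_robotstxt = 0

for domain in domains:
    try:
        response = urllib.request.urlopen(f"https://{domain}/robots.txt", timeout=10)
        content = response.read().decode("utf-8", errors="ignore")
    except Exception:
        continue  # no robots.txt, connection error, etc.
    sites_with_robotstxt += 1
    for line in content.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if ":" in line:
            directive = line.split(":", 1)[0].strip().lower()
            directive_counts[directive] += 1

print(f"{sites_with_robotstxt}/{len(domains)} sites served a robots.txt file")
print(directive_counts.most_common(10))
```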
Did you get stuck anywhere?
Oh, naming the parser was actually the hardest part. I am still not satisfied with the name; I just ran out of ideas.