[Blog #2] Protego parse!

anubhavp
Published: 07/21/2019

Hey! This is my third blog post for GSoC 2019, covering weeks 3 and 4.

The first part of my project, the interface for `robots.txt` parsers, is almost complete.

I have started working on a pure-Python `robots.txt` parser, which I have named Protego. The name is borrowed from the Harry Potter universe, where Protego is a charm that creates a shield to protect the caster. The end goal for Protego is to support all of the popular directives, wildcard matching, and a good number of less popular directives. We also aim to make Protego 100% compatible with Google's `robots.txt` parser, and we intend for Protego to become the default `robots.txt` parser in Scrapy.

I have implemented support for all major directives in Protego, as well as wildcard matching. I used pytest and tox to automate testing Protego on every version of Python, and I set up Travis to run the tests automatically on code pushes and pull requests. To check Protego's behaviour, I borrowed tests from other parsers. Protego currently passes all tests borrowed from `reppy`, `rep-cpp` and `robotexclusionrulesparser`.
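To give a rough idea of what wildcard matching involves, here is a minimal sketch. It is not Protego's actual code: the function names `pattern_to_regex` and `path_matches` are made up for illustration, and it assumes the common convention where `*` matches any sequence of characters and a trailing `$` anchors the end of the URL path.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex.

    Hypothetical helper, not part of Protego's API.
    Assumes '*' matches any run of characters and a trailing '$'
    means the path must end there.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with '.*' for every '*'.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    # r"\Z" matches only at the very end of the string.
    return re.compile(body + (r"\Z" if anchored else ""))


def path_matches(pattern, path):
    """Return True if the rule pattern applies to the given URL path."""
    # re.match anchors at the start, so an unanchored pattern is a prefix match.
    return pattern_to_regex(pattern).match(path) is not None
```

With this sketch, a rule such as `/*.php$` would apply to `/index.php` but not to `/index.php?x=1`, while a plain rule such as `/fish` would still apply to `/fishing`, because rules without `$` are treated as prefixes.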
