[Blog #3] Google open-sourced its robots.txt parser

anubhavp
Published: 07/22/2019


Hey! This is my fourth blog post for GSoC 2019, covering weeks 5 and 6.

A few interesting things have happened: Google has open-sourced its robots.txt parser, and has also taken the lead in pushing the community and enterprises to create an official specification for `robots.txt`. I spent a good amount of time making Protego compatible with Google's parser. This required rewriting a good chunk of Protego to support behaviours specific to Google's parser, such as merging record groups and supporting common misspellings of directives.
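To illustrate the record-group merging mentioned above: under Google's rules, if a robots.txt file contains more than one group naming the same user-agent, the rules from all of those groups apply to that agent. The following is a minimal hypothetical sketch of that idea, not Protego's or Google's actual code:

```python
# Hypothetical sketch (not Protego's actual implementation): merge
# record groups so that duplicate User-agent groups are combined.
from collections import defaultdict

def parse_groups(robotstxt):
    """Parse robots.txt into {user-agent: [rules]}, merging duplicate groups."""
    groups = defaultdict(list)
    current_agents = []
    seen_rule = False  # a User-agent line after rules starts a new group
    for line in robotstxt.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line or ":" not in line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if seen_rule:
                current_agents, seen_rule = [], False
            current_agents.append(value.lower())
        elif field in ("allow", "disallow"):
            seen_rule = True
            for agent in current_agents:
                groups[agent].append((field, value))
    return dict(groups)

robots = """
User-agent: foobot
Disallow: /private

User-agent: foobot
Allow: /public
"""
# Both groups for "foobot" are merged into one list of rules:
print(parse_groups(robots)["foobot"])
# [('disallow', '/private'), ('allow', '/public')]
```

A real parser also has to handle precedence (longest match wins) and the misspelling tolerance mentioned above; this only shows the merging step.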

I am wary of reading and writing C++ code that uses the STL or pointers heavily, so going through the source code of Google's parser was uncomfortable at first. But after a few days of struggle, I was able to understand a good chunk of it.

Next up, I will work on making Protego 100% compatible with Google's parser. I also need to document Protego, and I will collect robots.txt files from the top 1000 websites to understand real-world usage patterns.
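The collection step could look something like this. This is only a hypothetical sketch with the standard library; the list of top-1000 domains (`sites.txt` here) is an assumption, not something from the project:

```python
# Hypothetical sketch: fetch robots.txt for a list of domains.
# Assumes a plain-text file "sites.txt" with one domain per line.
import urllib.request
import urllib.error

def robots_url(domain):
    """Build the canonical robots.txt URL for a bare domain."""
    return f"https://{domain}/robots.txt"

def fetch_robots(domain, timeout=10):
    """Return the robots.txt body for a domain, or None on failure."""
    try:
        with urllib.request.urlopen(robots_url(domain), timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, TimeoutError):
        return None

if __name__ == "__main__":
    with open("sites.txt") as f:
        for domain in (line.strip() for line in f if line.strip()):
            body = fetch_robots(domain)
            if body is not None:
                # Save each file for later analysis of usage patterns.
                with open(f"robots_{domain}.txt", "w") as out:
                    out.write(body)
```

In practice this would want rate limiting and parallel fetching, but the shape of the task is just "download one well-known URL per site and store it".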
