Weekly Check-in #6: (28 Jun - 4 July)

anubhavp
Published: 07/02/2019

Hello! It has been more than a month since the beginning of the GSoC coding period, and I am completely satisfied with my progress. I am glad that I chose to work remotely this summer, since it has been raining heavily for the last 5-6 days in Mumbai (where I live). 🙂

What did you do this week?

  • Implemented support for wildcard matching and the `request-rate` directive.
  • Increased the number of test cases; some of them were borrowed from `rep-cpp`, another robots.txt parser. The new test cases ensure that the parser's behaviour conforms to the Google specification for robots.txt.
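
To give a rough idea of what these two features involve, here is a minimal sketch. It is not Protego's actual code. It assumes the wildcard semantics from Google's specification (`*` matches any run of characters, a trailing `$` anchors the end of the path, and rules otherwise match as prefixes). The names `path_matches`, `RequestRate` and `parse_request_rate` are made up for illustration, and the `request-rate` parser only handles the plain `requests/seconds` form.

```python
import re
from typing import NamedTuple, Optional


def pattern_to_regex(pattern: str):
    """Compile a robots.txt path pattern into a regular expression."""
    # A trailing '$' means the pattern must match up to the end of the path.
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape every literal piece, then join the pieces with '.*' for each '*'.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def path_matches(pattern: str, path: str) -> bool:
    """Return True if the rule pattern applies to the given URL path."""
    # re.match anchors at the start only, which gives prefix matching.
    return pattern_to_regex(pattern).match(path) is not None


class RequestRate(NamedTuple):
    requests: int
    seconds: int


def parse_request_rate(value: str) -> Optional[RequestRate]:
    """Parse a value such as '1/5' (one request per five seconds)."""
    match = re.fullmatch(r"\s*(\d+)\s*/\s*(\d+)\s*", value)
    if match is None:
        return None  # Unsupported or malformed value (assumption: ignore it).
    return RequestRate(int(match.group(1)), int(match.group(2)))


print(path_matches("/*.php$", "/index.php"))
print(path_matches("/*.php$", "/index.php?x=1"))
print(parse_request_rate("1/5"))
```

With these assumptions, the first pattern check succeeds and the second fails, because `$` requires the path to end right after `.php`.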

PS: I really overestimated the amount of work I would be able to do in a week while writing the last check-in - Hofstadter's law.

What is coming up next?

  • Google open-sourced their robots.txt parser - here. Since Google is the most popular search engine in the world, it is likely that for most websites the largest share of the crawling requests they receive originates from Google. Hence, there ought to be more robots.txt files written for Google's crawler than for any other crawler. It would make sense to make Protego behave in a similar way to Google's parser.
  • Documenting Protego.
  • Increasing the number of test cases (we can borrow some from Google's parser).
  • Benchmarking Protego's performance against other parsers.
  • Regarding collecting statistics on robots.txt usage, I am not completely sure it would be a good idea to invest time in it, now that I have found a blog that describes the popularity of individual directives and could help us prioritise which ones to implement.
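
For the benchmarking item, a timing harness could look roughly like the sketch below. This is only an assumption about how the comparison might be set up: it times the standard library's `urllib.robotparser` as a stand-in, and the sample robots.txt, the `parse_and_query` function and the `time_parser` helper are made up for illustration. Other parsers would be plugged in by writing an equivalent function for each.

```python
import timeit
from urllib.robotparser import RobotFileParser

# Small made-up robots.txt used as benchmark input.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /public/
"""


def parse_and_query() -> bool:
    """Parse the sample file and answer one can_fetch query."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch("mybot", "https://example.com/private/page.html")


def time_parser(func, number: int = 1000) -> float:
    """Return the total time in seconds for `number` calls of `func`."""
    return timeit.timeit(func, number=number)


elapsed = time_parser(parse_and_query)
print(f"urllib.robotparser: {elapsed:.4f}s for 1000 runs")
```

Timing parsing and querying together keeps the sketch short; measuring them separately would show where each parser spends its time.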

Did you get stuck anywhere?

Nothing major. I found it simpler than I had initially expected.

vipulgupta2048
Good progress. Were you stuck anywhere at all?
July 2, 2019, 12:48 p.m. Reply
anubhavp
Nothing major. I found it simpler than I had initially expected.
July 2, 2019, 3:29 p.m. Reply
vipulgupta2048
Would love to know more about your project, and help out if needed.
July 2, 2019, 12:49 p.m. Reply
anubhavp
I might need help collecting robots.txt files and generating useful statistics. I will ping you if I get stuck anywhere.
July 2, 2019, 3:28 p.m. Reply
vipulgupta2048
I really appreciate it. I would love to learn about different projects and help out. Grow skills together 🐣
July 2, 2019, 6:58 p.m. Reply