anubhavp's Blog

Weekly Check-in #7: (5 July - 11 July)

anubhavp
Published: 07/09/2019

Hey! Here is an update on what I have achieved so far.

What did you do this week?

  • Protego now passes all tests borrowed from reppy, rep-cpp and robotexclusionrulesparser.
  • Made a few changes to Protego to make it compatible with Google's parser.
  • Worked on the changes suggested on the interface pull request.
  • Wrote code to fetch robots.txt files from the top 1000 websites and generate the statistics we need ( link ) - a rough sketch of the idea appears after this list.
  • Looked at the code of Google's robots.txt parser, with the aim of creating a Python interface on top of it. I might need to modify its code, as it currently re-parses the robots.txt file to answer every query. (Working on anything in C++ that uses pointers or the STL heavily makes me feel uncomfortable.)
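For the statistics bullet above, the actual script is linked rather than reproduced here, but the core idea looks roughly like the sketch below. The `sites.txt` input file, the use of `requests`, and the exact counting are my own simplification for illustration, not the real script:

```python
# Rough sketch: fetch robots.txt from a list of sites and tally directive usage.
# Assumes a plain-text file `sites.txt` with one domain per line (hypothetical).
from collections import Counter

import requests

directive_counts = Counter()

with open("sites.txt") as f:
    sites = [line.strip() for line in f if line.strip()]

for site in sites:
    try:
        response = requests.get(f"https://{site}/robots.txt", timeout=10)
        response.raise_for_status()
    except requests.RequestException:
        continue  # no reachable robots.txt for this site
    for line in response.text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and whitespace
        if ":" in line:
            directive = line.split(":", 1)[0].strip().lower()
            directive_counts[directive] += 1

for directive, count in directive_counts.most_common():
    print(f"{directive}: {count}")
```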

What is coming up next?

  • Modify Protego to make it behave similarly to Google's parser (this will need a few more features, like record group merging - see the sketch after this list), and add more tests.
  • Document Protego.
  • Benchmark Protego's performance.
  • Read up on how to call C/C++ code from Python, in order to create an interface on top of Google's parser. I am currently thinking of using Cython.
  • Work on blog posts (I am planning to write 3 blog posts within this week).
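To make "record group merging" concrete: if a robots.txt file contains several groups naming the same user agent, Google's parser treats their rules as one combined group. A minimal sketch of that idea - the function and data layout are just for illustration, not Protego's actual code:

```python
# Sketch of record-group merging: rules from every group that names the same
# user agent are combined into a single rule set. Names are illustrative only.
from collections import defaultdict

def merge_record_groups(groups):
    """groups: list of (user_agents, rules) tuples parsed from robots.txt."""
    merged = defaultdict(list)
    for user_agents, rules in groups:
        for agent in user_agents:
            merged[agent.lower()].extend(rules)
    return dict(merged)

# Two separate "googlebot" groups collapse into one merged rule set.
groups = [
    (["googlebot"], [("disallow", "/private/")]),
    (["*"], [("disallow", "/tmp/")]),
    (["googlebot"], [("allow", "/private/public-page.html")]),
]
print(merge_record_groups(groups)["googlebot"])
# [('disallow', '/private/'), ('allow', '/private/public-page.html')]
```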

Did you get stuck anywhere?

No. I got to work with some data science tools like Jupyter Notebook and pandas.


Weekly Check-in #6: (28 Jun - 4 July)

anubhavp
Published: 07/02/2019

Hello! It has been more than a month since the beginning of the GSoC coding period, and I am completely satisfied with my progress. I am glad that I chose to work remotely this summer, since it has been raining heavily for the last 5-6 days in Mumbai (the city where I live).

What did you do this week?

  • Implemented support for wildcard matching and the `request-rate` directive (a sketch of one way to do wildcard matching follows this list).
  • Increased the number of test cases; some of them were borrowed from `rep-cpp`, another robots.txt parser. The new test cases ensure that the parser's behaviour conforms to Google's robots.txt specification.
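For context, wildcard matching means `*` matches any sequence of characters inside a rule and a trailing `$` anchors the rule at the end of the URL path. One common way to implement it - not necessarily exactly what Protego does internally - is to translate each rule into a regular expression:

```python
import re

def rule_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex:
    '*' matches any sequence of characters, and a trailing '$'
    anchors the match at the end of the path."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape everything except '*', which becomes '.*'.
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    if anchored:
        regex += "$"
    return re.compile(regex)

print(bool(rule_to_regex("/private/*.html$").match("/private/a/b.html")))  # True
print(bool(rule_to_regex("/private/*.html$").match("/private/a/b.htm")))   # False
```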

PS: While writing the last check-in, I really overestimated the amount of work I would be able to do in a week - Hofstadter's law in action.

What is coming up next?

  • Google has open-sourced their robots.txt parser - here. Since Google is the most popular search engine in the world, it is likely that, for most websites, the largest share of the crawling requests they receive originates from Google. Hence, there ought to be more robots.txt files written for Google's crawler than for any other crawler, and it would make sense for Protego to behave similarly to Google's parser.
  • Documenting Protego.
  • Increasing the number of test cases (we can borrow some from Google's parser).
  • Benchmarking Protego's performance against other parsers (a rough timing sketch follows this list).
  • Regarding collecting statistics on robots.txt usage, I am not completely sure it would be a good idea to invest time in it, now that I have found a blog post which describes the popularity of individual directives and could help us prioritise which ones to implement.
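For the benchmarking, I am thinking of something simple along these lines with `timeit`. The Protego API shown here is just how I currently expect the interface to look, not a finalised one:

```python
# Rough benchmark sketch: time parse-plus-query for the stdlib parser and
# Protego. Protego's API below is an assumption, not a finalised interface.
import timeit
from urllib.robotparser import RobotFileParser

ROBOTSTXT = """\
User-agent: *
Disallow: /private/
Allow: /private/public-page.html
"""

def stdlib_parse_and_query():
    rfp = RobotFileParser()
    rfp.parse(ROBOTSTXT.splitlines())
    return rfp.can_fetch("mybot", "https://example.com/private/x")

def protego_parse_and_query():
    from protego import Protego  # assumed import path
    rp = Protego.parse(ROBOTSTXT)
    return rp.can_fetch("https://example.com/private/x", "mybot")

print("stdlib :", timeit.timeit(stdlib_parse_and_query, number=10_000))
print("protego:", timeit.timeit(protego_parse_and_query, number=10_000))
```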

Did you get stuck anywhere?

Nothing major. I found it simpler than I had initially expected.


Weekly Check-in #5: (21 Jun - 27 Jun)

anubhavp
Published: 06/25/2019

Hello! The fourth week of the GSoC coding period is coming to an end. Here is an update on what I achieved in the past week and what I am looking forward to.

 

What did you do this week?

  • Implemented minor changes suggested by Scrapy maintainers.

  • Started working on a new pure-Python robots.txt parser (which currently lives at https://github.com/anubhavp28/protego). It will eventually be moved to the Scrapy organisation.

  • Implemented support for the standard robots.txt directives in the new parser (a simplified sketch of the parsing idea appears after this list).

  • Integrated the code with pytest and tox for testing.

  • Integrated the repo with Travis CI to trigger tests automatically on pull requests.
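To give an idea of what parsing the standard directives involves: the core of the parser groups `User-agent` lines with the `Allow`/`Disallow` (and similar) lines that follow them. A stripped-down sketch of that idea, not Protego's actual implementation:

```python
# Simplified sketch of parsing the standard directives: group each run of
# User-agent lines with the rules that follow them. Not Protego's real code.
def parse_robotstxt(content):
    groups = []                      # list of (user_agents, rules)
    user_agents, rules = [], []
    for raw_line in content.splitlines():
        line = raw_line.split("#", 1)[0].strip()   # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if rules:                # a new record group starts here
                groups.append((user_agents, rules))
                user_agents, rules = [], []
            user_agents.append(value.lower())
        elif field in ("allow", "disallow", "crawl-delay"):
            rules.append((field, value))
    if user_agents:
        groups.append((user_agents, rules))
    return groups

print(parse_robotstxt("User-agent: *\nDisallow: /private/\nCrawl-delay: 2"))
# [(['*'], [('disallow', '/private/'), ('crawl-delay', '2')])]
```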

 

What is coming up next?

  • Implement support for modern conventions like wildcard matching, clean-param, etc.

  • Add a lot of tests (mostly borrowed from existing parsers).

  • Performance benchmarking of the new parser (against existing parsers).

  • Collecting statistics related to the use of robots.txt. On the suggestion of a mentor, I am planning to use the robots.txt files of the top 1000 websites in the Alexa rankings, and collect stats such as how many of them use robots.txt, how many records a record group contains on average, how many times a certain directive is mentioned, etc. This could help us make better choices for improving performance, such as whether to use a trie for prefix matching (a toy illustration of the trie idea follows this list).
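Since I keep mentioning the trie idea, here is a toy illustration of what it would buy us: path rules sharing a common prefix are stored once, and matching a URL path becomes a single walk down the trie that returns the longest rule prefixing it. This is purely exploratory, not something Protego does yet:

```python
# Toy character-level trie: store rule paths once and, for a URL path, find
# the longest stored rule that is a prefix of it. Purely illustrative.
class PrefixTrie:
    def __init__(self):
        self.root = {}

    def insert(self, path, verdict):
        node = self.root
        for ch in path:
            node = node.setdefault(ch, {})
        node["$rule"] = (path, verdict)      # mark the end of a stored rule

    def longest_match(self, url_path):
        node, best = self.root, None
        for ch in url_path:
            if "$rule" in node:
                best = node["$rule"]         # remember the longest rule so far
            if ch not in node:
                return best
            node = node[ch]
        return node.get("$rule", best)

trie = PrefixTrie()
trie.insert("/private/", "disallow")
trie.insert("/private/public/", "allow")
print(trie.longest_match("/private/public/index.html"))  # ('/private/public/', 'allow')
print(trie.longest_match("/private/secret.html"))        # ('/private/', 'disallow')
```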

 

Did you get stuck anywhere?

Oh, actually naming the parser was the hardest part. I am still not satisfied with the name; I just ran out of ideas.

 


Weekly Check-in #4: (14 Jun - 20 Jun)

anubhavp
Published: 06/18/2019

Hello! The third week of the GSoC coding period is coming to an end. Here is an update on what I achieved in the past week and what I am looking forward to.

 

What did you do this week?

  • Created separate tox testing environments for testing integration with third-party parsers like Robotexclusionrulesparser and Reppy (see the shared-test sketch after this list).

  • Made Travis use the new tox environments.

  • Described these parsers in Scrapy documentation.

  • Got Robotexclusionrulesparser to work with Unicode user agents.
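The tox environments themselves are just configuration, but the pattern that keeps all the parsers covered by the same tests is worth sketching: pytest parametrization runs each test case against every interface implementation. The wrapper class names and import path below are placeholders, not the exact ones from the pull request:

```python
# Sketch: run the same test cases against every parser implementation using
# pytest parametrization. Class names and import path are placeholders.
import pytest

from robotstxt_interface import ReppyWrapper, RerpWrapper, StdlibWrapper  # hypothetical module

ROBOTSTXT = b"User-agent: *\nDisallow: /private/\n"
PARSERS = [StdlibWrapper, RerpWrapper, ReppyWrapper]

@pytest.mark.parametrize("parser_cls", PARSERS)
def test_disallowed_path(parser_cls):
    parser = parser_cls(ROBOTSTXT)
    assert not parser.allowed("https://example.com/private/x", "mybot")

@pytest.mark.parametrize("parser_cls", PARSERS)
def test_allowed_path(parser_cls):
    parser = parser_cls(ROBOTSTXT)
    assert parser.allowed("https://example.com/public/x", "mybot")
```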

 

What is coming up next?

I will be working on creating a Python-based robots.txt parser which is compliant with the spec and supports modern conventions.

 

Did you get stuck anywhere?

Nothing major.  


Weekly Check-in #3: (7 Jun - 13 Jun)

anubhavp
Published: 06/08/2019

Hello, wandering Pythonistas! The second week of the GSoC coding period is coming to an end. Here is an update on what I achieved in the past week and what I am looking forward to.

 

What did you do this week?

  • I made a few changes to the interface according to the feedback received from the mentors.

  • I implemented the interface on top of third-party parsers like Robotexclusionrulesparser and Reppy (a simplified sketch of the interface's shape appears after this list).

  • Wrote tests for the implementation of the interface on top of the two parsers. The tricky part was reducing code duplication and keeping the tests maintainable.

  • Modified Scrapy to use the new interface (instead of directly calling Python’s inbuilt RobotFileParser).

  • I had the weekly meeting with my mentors, where we discussed new stretch goals for the project.
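For anyone curious what "the interface" refers to: it is a small abstraction that Scrapy can call without caring which parser sits behind it. Its shape is roughly as below - a simplified approximation on my part, not the exact API from the pull request:

```python
# Rough shape of the robots.txt parser interface (a simplified approximation,
# not the exact API from the pull request).
from abc import ABCMeta, abstractmethod
from urllib.robotparser import RobotFileParser

class RobotParser(metaclass=ABCMeta):
    @abstractmethod
    def allowed(self, url, user_agent):
        """Return True if `user_agent` is allowed to fetch `url`."""

class PythonRobotParser(RobotParser):
    """Implementation backed by Python's inbuilt RobotFileParser."""

    def __init__(self, robotstxt_body):
        self.rp = RobotFileParser()
        self.rp.parse(robotstxt_body.decode("utf-8").splitlines())

    def allowed(self, url, user_agent):
        return self.rp.can_fetch(user_agent, url)

parser = PythonRobotParser(b"User-agent: *\nDisallow: /private/\n")
print(parser.allowed("https://example.com/private/x", "mybot"))  # False
```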

 

What is coming up next?

It will depend on the feedback from the mentors. If everything seems good to them, I will focus my attention on writing a pure Python robots.txt parser.

 

Did you get stuck anywhere?

Nothing major, though I had a little difficulty due to my lack of knowledge of the differences between Python 2 and Python 3. I knew Python 3 uses Unicode strings by default; what I didn't know is that in Python 3 the `bytes` and `str` types are distinct, so encoding a string produces an object of type `bytes`. This actually makes sense - having separate types for text and for arbitrary binary data.
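A tiny interpreter session that shows the distinction:

```python
>>> text = "Dürrenmatt"           # str: a sequence of Unicode code points
>>> data = text.encode("utf-8")   # bytes: arbitrary binary data
>>> type(text), type(data)
(<class 'str'>, <class 'bytes'>)
>>> text == data
False
```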

 
