anubhavp's Blog

Weekly Check-in #4: (14 Jun - 20 Jun)

anubhavp
Published: 06/18/2019

Hello! The third week of GSoC coding period is coming to an end. Here is an update on what I achieved in the past week and what I am looking forward to.

 

What did you do this week?

  • Created separate tox testing environments for testing integration with third-party parsers like Robotexclusionrulesparser and Reppy (a rough sketch of such a configuration follows this list).

  • Made Travis use the new tox environments.

  • Described these parsers in the Scrapy documentation.

  • Got Robotexclusionrulesparser to work with Unicode user agents.
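
For anyone curious what those environments look like, below is a rough sketch of the kind of tox configuration involved. The environment names, dependencies and test paths are my own illustration, not the exact configuration merged into Scrapy.

```ini
# Illustrative tox.ini sketch -- env names and test paths are assumptions,
# not the exact configuration merged into Scrapy.
[tox]
envlist = py37, robotexclusionrulesparser, reppy

[testenv]
deps =
    pytest
commands =
    pytest {posargs:tests}

# Same test suite, but with Robotexclusionrulesparser installed
[testenv:robotexclusionrulesparser]
deps =
    {[testenv]deps}
    robotexclusionrulesparser
commands =
    pytest {posargs:tests/test_robotstxt_interface.py}

# Same test suite, but with Reppy installed
[testenv:reppy]
deps =
    {[testenv]deps}
    reppy
commands =
    pytest {posargs:tests/test_robotstxt_interface.py}
```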

 

What is coming up next?

I will be working on creating a Python-based robots.txt parser that is compliant with the spec and supports modern conventions.

 

Did you get stuck anywhere?

Nothing major.  


Weekly Check-in #3: (7 Jun - 13 Jun)

anubhavp
Published: 06/08/2019

Hello, wandering pythonistas! The second week of GSoC coding period is coming to an end. Here is an update on what I achieved in the past week and what I am looking forward to.

 

What did you do this week?

  • I made a few changes to the interface based on the feedback received from the mentors.

  • I implemented the interface on top of third-party parsers like Robotexclusionrulesparser and Reppy.

  • Wrote tests for the implementation of the interface on top of the two parsers. The tricky part was reducing code duplication and keeping the tests maintainable (see the sketch after this list for one way to structure that).

  • Modified Scrapy to use the new interface (instead of directly calling Python’s inbuilt RobotFileParser).

  • I had the weekly meeting with my mentors, where we discussed new stretch goals for the project.
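
As a rough illustration of the pattern I mean (the class and method names below are hypothetical, not Scrapy's actual test code), the shared assertions can live in a base class that each parser-specific test then subclasses:

```python
# Hypothetical sketch of sharing one set of assertions across several parser
# implementations; names are illustrative, not Scrapy's actual tests.
import unittest
from urllib.robotparser import RobotFileParser


class PythonRobotParser:
    """Tiny wrapper over the stdlib parser, included only to keep the sketch
    self-contained; wrappers for Robotexclusionrulesparser and Reppy would
    expose the same allowed() method."""

    def __init__(self, robotstxt_body):
        self._rfp = RobotFileParser()
        self._rfp.parse(robotstxt_body.decode("utf-8").splitlines())

    def allowed(self, url, user_agent):
        return self._rfp.can_fetch(user_agent, url)


class BaseRobotsTxtParserTest:
    parser_cls = None  # each subclass plugs in the implementation under test
    robotstxt = b"User-agent: *\nDisallow: /private\n"

    def test_disallowed_path(self):
        parser = self.parser_cls(self.robotstxt)
        self.assertFalse(parser.allowed("https://example.com/private/page", "mybot"))

    def test_allowed_path(self):
        parser = self.parser_cls(self.robotstxt)
        self.assertTrue(parser.allowed("https://example.com/public", "mybot"))


# One small subclass per implementation; the shared tests above run for each.
class PythonRobotParserTest(BaseRobotsTxtParserTest, unittest.TestCase):
    parser_cls = PythonRobotParser
```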

 

What is coming up next?

It will depend on the feedback of the mentors. If everything seems good to them, I will focus my attention on writing a pure Python robots.txt parser.

 

Did you get stuck anywhere?

Nothing major, though I had a little difficulty due to my lack of knowledge of the differences between Python 2 and Python 3. I knew Python 3 uses Unicode strings by default; what I didn’t know is that in Python 3, `bytes` and `str` are distinct types. Hence, encoding a string produces an object of type `bytes`. This actually makes sense: having different types for text and arbitrary binary data.
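
A tiny illustration of the behaviour that tripped me up (the user agent string is just an example):

```python
# Python 3: str is text, bytes is raw binary data -- two distinct types.
user_agent = "Mÿ-Bót/1.0"              # str (Unicode text)
encoded = user_agent.encode("utf-8")   # encoding produces a bytes object

print(type(user_agent))        # <class 'str'>
print(type(encoded))           # <class 'bytes'>
print(user_agent == encoded)   # False -- a str never equals a bytes object
```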

 


[Blog #0] Accepted for GSoC 2019: 3 Months of Open Source Ahead

anubhavp
Published: 06/07/2019

 


Hello! I have been selected as a student for Google Summer of Code 2019. For those of you who are unaware, Google Summer of Code is a global program focused on bringing more student developers into open source software development. Student developers get an opportunity to work with an open source organization on a 3-month programming project.

I am working on Scrapy, an open-source scraping and web crawling framework. My task is to implement an interface for `robots.txt` parsers in Scrapy. The stretch goal of the project is to write a fully spec compliant pure Python `robots.txt` parser.

This blog is part of a series of posts in which I will describe my work in depth. Since this is the first post, I have dedicated it to explaining my project. My project has a lot to do with `robots.txt` and the Robots Exclusion Standard. Let’s take a look at what these things are.

Web crawlers (also called spiders) are bots that systematically browse the internet, generally for the purpose of extracting data (called scraping) or indexing. Search engines (e.g. Google) use web crawlers to index pages on the internet; this index is then used to serve users relevant results.

When web crawlers visit a website, they consume web server resources just like a normal user, and a crawler can make many requests per second. Unrestricted crawling can therefore degrade a website’s performance and annoy its users. A solution to this problem is the Robots Exclusion Standard, which makes it possible to specify which parts of a website a crawler should not access, and how frequently a crawler should request a resource from the server. In effect, the standard lets websites specify how “polite” a crawler should be.

Under the Robots Exclusion Standard, instructions for crawlers are specified in a file named `robots.txt` placed at the root of the website. These instructions have to follow a specific format.
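
For illustration, a small `robots.txt` might look like this (the paths and values here are made up):

```
# Example robots.txt (contents made up for illustration)
User-agent: *
Disallow: /private/
Crawl-delay: 10

User-agent: Googlebot
Disallow: /tmp/
```

The `Crawl-delay` line is one of the widely used extensions rather than part of the original specification, which brings us to the next point.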

The `robots.txt` standard is not owned by any standards body and has not been in active development since 1997. So the standard has gone unrevised for more than two decades, which, I would argue, has left it outdated. Meanwhile, search giants like Google, Bing and Yahoo! have collaborated to create an informal extension of the standard which their crawlers adhere to. Since the majority of crawling requests usually originate from search engines, most web administrators write `robots.txt` files adhering to this informal extension of the specification.

Scrapy uses RobotFileParser (Python’s inbuilt `robots.txt` parser), which strictly follows the old specification and has not been updated. Hence, Scrapy is likely to misinterpret the crawling rules for the majority of websites on the internet. There was an effort to switch to another parser, but there isn’t a fully compliant pure Python parser available, and it is difficult to package non-Python code with Scrapy because its wide user base spans a variety of platforms and Python implementations. Including non-Python code would mean dropping support for PyPy, and would require users to install compilers for other languages (which may not be easy on every platform).

A short-term solution could be to allow users to switch to a different parser if they wish to, while keeping `RobotFileParser` as the default in Scrapy. This has the additional benefit of giving users more control. To do this, we are planning to create an interface for `robots.txt` parsers in Scrapy, and to implement this interface on top of a few popular parsers. This is the first goal of my project. In the end, we would like to have a fully spec compliant pure Python parser which Scrapy could use by default. This is the stretch goal of my project.
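
To make the idea concrete, here is a minimal sketch of what such an interface might look like; the class and method names are my own illustration, not the final design agreed with my mentors:

```python
# Minimal sketch of a robots.txt parser interface (names are illustrative).
# Each supported parser would be wrapped in its own subclass.
from abc import ABC, abstractmethod


class RobotParserInterface(ABC):
    @classmethod
    @abstractmethod
    def from_robotstxt_body(cls, robotstxt_body: bytes) -> "RobotParserInterface":
        """Build a parser instance from the raw bytes of a robots.txt file."""

    @abstractmethod
    def allowed(self, url: str, user_agent: str) -> bool:
        """Return True if the given user agent may fetch the given URL."""
```

The part of Scrapy that enforces `robots.txt` would then talk only to this interface, so switching parsers becomes a configuration change rather than a code change.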

Hoping to have an incredible 3 months ahead :smile:. If you need any help regarding Google Summer of Code, or you just want to learn more about my work, feel free to email me at anubhavp28@gmail.com.


Weekly Check-in #2 [31 May - 6 Jun]

anubhavp
Published: 06/02/2019

Hello everyone. The first week of GSoC coding period is coming to an end. Here is an update on what I achieved in the past week and what I am looking forward to.

 

What did you do this week?

I submitted my first pull request related to GSoC. This week mostly involved discussion of the interface specification. I learned that designing an interface involves considering several small but quite important details, and that a good practice is to question every choice you make. I also had a meeting with my mentors where we discussed weekly milestones and decided to hold weekly meetings, every Tuesday. I implemented the interface on top of Python’s in-built robots.txt parser and worked on documentation related to the interface.

I got an opportunity to dive deep into the source code of Python’s in-built robots.txt parser. For some reason, I always believed that reading through the implementation of Python (or any language) and its inbuilt modules would be difficult and not really useful, and that the code would mostly be complex and cryptic (to a beginner like me). This doesn’t seem to be the case (at least with Python). I should do more of this, looking at a module’s implementation for fun •ᴗ• .

What is coming up next?

In the next week, I am looking to finalize the interface and modify Scrapy to use it to communicate with the parsers. I will also work on documenting the interface, and if time permits, implement it on top of a few other parsers.

Did you get stuck anywhere?

Nope. I learned a lot from constant feedback from my mentors. It was an awesome week •ᴗ• 


Weekly Check-in #1 [24 May - 30 May]

anubhavp
Published: 05/23/2019

Hey everyone. I am Anubhav, and this summer I am working on implementing an interface for robots.txt parsers in Scrapy. This is the first of many weekly blog posts in which I will briefly describe the work I did in the previous week and my plans for the upcoming one. So, let's get started.

What did you do this week?

Most of my time was spent configuring a local development environment, and learning how to use tox and run tests locally. For the patches I had submitted before, I didn't run tests locally and relied solely on CI to do it; running tests locally could have saved a lot of time. Also, I went through the Scrapy contribution guide, learned about Twisted (Scrapy uses it heavily) and PEP 8, and worked on a pull request I had opened before.

What is coming up next?

  • I will have my first meeting with mentors of the project.
  • I will work on a few pull requests I had opened before.
  • Since this is the last week of the community bonding period, I am also looking to discuss the interface specification with the mentors.

Did you get stuck anywhere?

I had minor difficulties understanding how to run tests using tox. When I followed the instructions given in the Scrapy documentation, I could only run tests in a Python 2.7 environment. Thankfully, tox has incredible documentation that allowed me to understand the settings inside the tox.ini config file. In the end, I just had to make a few edits to my tox.ini file, and I was able to run tests in a Python 3 environment as well.
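
For reference, the kind of edit involved looked roughly like this (a sketch only, not Scrapy's actual tox.ini):

```ini
# Sketch only -- not Scrapy's actual tox.ini.
[tox]
# Adding a py37 entry (and running `tox -e py37`) runs the suite on Python 3.
envlist = py27, py37
```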
