[Blog #1 - Week 3] We are going faster than I predicted
anubhavp
Published: 07/21/2019
Hey, welcome back! It has been about 3 weeks since the beginning of the coding period, and I am making progress at a faster pace than I thought I would. Here are the details of the work I did in the first three weeks.
My week prior to the beginning of the coding period was spent setting up a development environment on my computer and learning about the internals of Scrapy. Since Scrapy is a Python project, setting up a development environment is fairly straightforward. Clone the Scrapy repository with `git clone https://github.com/scrapy/scrapy`, and create a virtual environment using `python3 -m venv venv` (this creates a Python 3 virtual environment on Ubuntu). Then activate the environment using `source venv/bin/activate`. Install the Python dependencies by running `pip install -r requirements-py3.txt`. If you are planning to run tests locally, also install the test dependencies by running `pip install -r tests/requirements-py3.txt`. The setup is complete.
Scrapy uses Tox for testing. `tox` is a virtual environment management tool that can be used to run tests against multiple versions of Python, and it also reduces the work required to integrate with CI services. To run tests under Python 2.7 and Python 3.7, use `tox -e py27,py37`. I had a minor difficulty (I was unable to run `tox` with any Python version other than 2.7) as I was using `tox` for the first time, but I was able to solve it easily using the online documentation and help from mentors. You can learn more about `tox` here.
The first and second weeks were spent deciding on an interface specification, and implementing the interface on top of three existing `robots.txt` parsers. Using feedback from the mentors, I improved upon the specification I had proposed in my GSoC proposal. Documenting the interface turned out to be harder than implementing it: both Sphinx and `reStructuredText` were new to me, and I occasionally found it difficult to describe things clearly (documentation is hard). While implementing the interface on top of `RobotFileParser`, I had to take a deep dive into its source code. For some reason, I had always believed that reading through the implementation of Python (or any language) and its built-in modules would be difficult and not really useful, and that the code would mostly be complex and cryptic. This doesn't seem to be the case (at least with Python). I should do more of this, looking at a module's implementation for fun :smile:. I also modified Scrapy to use the new interface instead of calling `RobotFileParser` directly.
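Just to give a flavour of the idea, here is a minimal sketch of what such an interface can look like - the class and method names here are illustrative, not the final Scrapy API:

```python
# Minimal sketch of the interface idea: every robots.txt library is hidden
# behind the same abstract class, so Scrapy never cares which one it uses.
from abc import ABCMeta, abstractmethod
from urllib.robotparser import RobotFileParser


class RobotParser(metaclass=ABCMeta):
    @abstractmethod
    def allowed(self, url, user_agent):
        """Return True if `user_agent` is allowed to fetch `url`."""


class PythonRobotParser(RobotParser):
    """Adapter over the stdlib `RobotFileParser`."""

    def __init__(self, robotstxt_body):
        self.rp = RobotFileParser()
        self.rp.parse(robotstxt_body.splitlines())

    def allowed(self, url, user_agent):
        return self.rp.can_fetch(user_agent, url)
```

The nice thing about this shape is that supporting another robots.txt library only means writing one more small adapter class.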
Next, I will work on creating extra test environments in Travis for testing the third-party parser integrations. I haven't worked with Travis or `tox` before. I will also document the functionality these parsers provide in the Scrapy documentation, and make the code for testing the integrations more maintainable.
Weekly Check-in #8: (12 July - 18 July)
anubhavp
Published: 07/18/2019
What did you do this week?
- Protego is now compatible with Google's parser: it passes every test case in Google's parser test suite.
- Documented Protego - functions now have docstrings, and I have created a README file.
- Protego is now on PyPI. To install Protego, simply type `pip install protego`. A short usage sketch follows this list.
- Made a few changes to improve performance, as suggested by mentors.
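In case you want to try it out, usage looks roughly like this (the API may still change slightly as it settles):

```python
from protego import Protego

robotstxt = """
User-agent: *
Disallow: /private
Crawl-delay: 10
"""

rp = Protego.parse(robotstxt)
print(rp.can_fetch("https://example.com/private/page", "mybot"))  # False
print(rp.can_fetch("https://example.com/public/page", "mybot"))   # True
print(rp.crawl_delay("mybot"))                                     # 10.0
```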
What is coming up next?
- Additions to CI - making sure Protego doesn't throw exceptions on the robots.txt files of the top 1000 Alexa websites (a rough sketch of such a check is below).
- Benchmark Protego.
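The CI addition I have in mind is essentially a smoke test along these lines - the test-data directory and file layout are made up for illustration:

```python
# Hypothetical sketch of the planned check: parse saved robots.txt files
# from popular websites and fail if Protego ever raises while parsing.
import pathlib

from protego import Protego


def test_never_raises_on_real_world_robotstxt():
    for path in pathlib.Path("tests/test_data/alexa_top_1000").glob("*.txt"):
        content = path.read_text(encoding="utf-8", errors="ignore")
        Protego.parse(content)  # any exception here fails the test
```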
Did you get stuck anywhere?
Nothing major.
Weekly Check-in #7: (5 July - 11 July)
anubhavp
Published: 07/09/2019
Hey! Here is an update on what I have achieved so far.
What did you do this week?
- Protego now passes all tests borrowed from `reppy`, `rep-cpp` and `robotexclusionrulesparser`.
- Made a few changes to Protego to make it compatible with Google's parser.
- Worked on the changes suggested on the interface pull request.
- Wrote code to fetch robots.txt files from the top 1000 websites and generate the statistics we need. ( link )
- Looked at the code of Google's robots.txt parser with the goal of creating a Python interface on top of it. I might need to modify its code, since it currently re-parses the robots.txt file to answer every query. (Working on anything in C++ that uses pointers or the STL heavily makes me feel uncomfortable.)
What is coming up next?
- Modify Protego to make it behave similarly to Google's parser (I will need to add a few more features, like record group merging), and add more tests.
- Document Protego.
- Benchmark Protego's performance.
- Read up on how to call C/C++ code from Python, for creating an interface on top of Google's parser. I am currently thinking of using Cython.
- Work on blog posts (planning to write 3 blog posts within this week).
Did you get stuck anywhere?
No. On the plus side, I got to work with some data science tools like Jupyter Notebook & pandas.
Weekly Check-in #6: (28 Jun - 4 July)
anubhavp
Published: 07/02/2019
Hello! It has been more than a month since the beginning of the GSoC coding period, and I am completely satisfied with my progress. I am glad that I chose to work remotely this summer, since it has been raining heavily for the last 5-6 days in Mumbai (where I live).
What did you do this week?
- Implemented support for wildcard matching and the `request-rate` directive (a tiny sketch of the wildcard idea follows this list).
- Increased the number of test cases; some of them were borrowed from `rep-cpp`, another robots.txt parser. The new test cases now ensure that the parser behaviour conforms to Google's specification for robots.txt.
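For the curious: wildcard support boils down to translating robots.txt path patterns - where `*` matches any sequence of characters and a trailing `$` anchors the end - into something URLs can be matched against. A minimal sketch of the idea (not Protego's actual implementation):

```python
import re


def pattern_to_regex(pattern):
    # '*' matches any sequence of characters; a trailing '$' anchors the
    # end of the URL path. Everything else is matched literally.
    regex = re.escape(pattern).replace(r"\*", ".*")
    if regex.endswith(r"\$"):
        regex = regex[:-2] + "$"
    return re.compile(regex)


assert pattern_to_regex("/private*").match("/private/data")
assert pattern_to_regex("/*.php$").match("/index.php")
assert not pattern_to_regex("/*.php$").match("/index.html")
```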
PS: While writing the last check-in, I really overestimated the amount of work I would be able to do in a week - Hofstadter's law in action.
What is coming up next?
- Google open-sourced their robots.txt parser - here. Since Google is the most popular search engine in the world, it is likely that for most websites, the largest share of the crawling requests they receive originates from Google. Hence, there ought to be more robots.txt files written with Google's crawler in mind than with any other crawler. It would make sense to make Protego behave in a similar way to Google's parser.
- Documenting Protego.
- Increasing the number of test cases (we can borrow some from Google's parser).
- Benchmarking Protego's performance against other parsers.
- Regarding collecting statistics on robots.txt usage: I am not completely sure it would be a good idea to invest time into it, now that I have found a blog post which describes the popularity of individual directives and could help us prioritise which ones to implement.
Did you get stuck anywhere?
Nothing major. I found it simpler than I had initially expected.
Weekly Check-in #5: (21 Jun - 27 Jun)
anubhavp
Published: 06/25/2019
Hello! The fourth week of GSoC coding period is coming to an end. Here is an update on what I achieved in the past week and what I am looking forward to.
What did you do this week?
- Implemented minor changes suggested by Scrapy maintainers.
- Started working on a new pure-Python robots.txt parser (which currently lives at https://github.com/anubhavp28/protego). It will eventually be moved to the Scrapy organisation.
- Implemented support for standard robots.txt directives in the new parser.
- Integrated the code with pytest and tox for testing (a sample test is below).
- Integrated the repo with Travis CI to trigger tests automatically on pull requests.
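To give an idea of what the pytest side looks like, here is the shape of a typical test - the test names and the exact Protego calls shown are illustrative:

```python
from protego import Protego


def test_disallow_all():
    # A blanket "Disallow: /" should block every path for every agent.
    rp = Protego.parse("User-agent: *\nDisallow: /\n")
    assert not rp.can_fetch("https://example.com/anything", "mybot")


def test_allow_when_no_rule_matches():
    rp = Protego.parse("User-agent: *\nDisallow: /private\n")
    assert rp.can_fetch("https://example.com/public", "mybot")
```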
What is coming up next?
- Implement support for modern conventions like wildcard matching, `clean-param`, etc.
- Add a lot of tests (mostly borrowed from existing parsers).
- Performance benchmarking of the new parser (against existing parsers).
- Collect statistics related to the use of robots.txt. On the suggestion of a mentor, I am planning to use the robots.txt files of the top 1000 websites in the Alexa rankings, and collect stats such as how many of them use robots.txt, how many records an average record group contains, how often certain directives are mentioned, etc. This could help us make better choices for improving performance - such as whether to use a trie for prefix matching. A rough sketch of the kind of script I have in mind is below.
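Something like the following is what I am picturing for the directive-counting part (the domain list and error handling here are simplified for illustration):

```python
import collections
import urllib.request


def directive_counts(domains):
    """Count how often each robots.txt directive appears across `domains`."""
    counts = collections.Counter()
    for domain in domains:
        try:
            with urllib.request.urlopen("https://%s/robots.txt" % domain, timeout=10) as response:
                body = response.read().decode("utf-8", errors="ignore")
        except Exception:
            continue  # site unreachable or no robots.txt
        for line in body.splitlines():
            line = line.split("#", 1)[0].strip()  # drop comments
            if ":" in line:
                directive = line.split(":", 1)[0].strip().lower()
                counts[directive] += 1
    return counts


print(directive_counts(["google.com", "wikipedia.org"]))
```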
Did you get stuck anywhere?
Oh, actually naming the parser was the hardest part. I am still not satisfied with the name. I just ran out of ideas.