Articles on anubhavp's Bloghttps://blogs.python-gsoc.orgUpdates on different articles published on anubhavp's BlogenSun, 25 Aug 2019 20:07:38 +0000[Blog #6] Part of the journey is the end.https://blogs.python-gsoc.org/en/anubhavps-blog/blog-6-part-of-the-journey-is-the-end/<p>&lt;meta charset="utf-8"&gt;</p> <p><strong>Part of the journey is the end. It is time for me to work on my final work report for final evaluation of Google Summer of Code 2019. This week, I will devote my time mainly to write my final report and final blog post. If time permits, I will work on my PRs from last week.</strong></p> <p><strong><b>Last week, I worked on getting Travis to push automatically to PyPI and I redid benchmarking.</b></strong></p> <p> </p>anubhavp28@gmail.com (anubhavp)Sun, 25 Aug 2019 20:07:38 +0000https://blogs.python-gsoc.org/en/anubhavps-blog/blog-6-part-of-the-journey-is-the-end/[Blog #5] Time just seems to fly.https://blogs.python-gsoc.org/en/anubhavps-blog/blog-5-time-just-seems-to-fly/<p> </p> <p>&lt;meta charset="utf-8"&gt;</p> <p><b>Hello! This is my second last blog post for GSoC 2019 - time has gone by so quickly. I spend this week documenting Protego’s API in detail. I opened a pull request to add Protego integration in Scrapy. I added PyPy test environment and modified Protego to treat non-terminal dollar sign as ordinary character.</b></p> <p><b>Up next, I will start the process to transfer Protego to Scrapy organisation on GitHub. I would modify `SitemapCrawler` in Scrapy to use the new interface, and implement a `ROBOTSTXT_USER_AGENT` setting in Scrapy.</b></p> <p><b>I faced minor problems trying to setup PyPy environment in Travis. With the help from mentors, I was able to resolve the issue.</b></p>anubhavp28@gmail.com (anubhavp)Sun, 25 Aug 2019 19:52:41 +0000https://blogs.python-gsoc.org/en/anubhavps-blog/blog-5-time-just-seems-to-fly/Weekly Check-in #13 : ( 16 Aug - 22 Aug )https://blogs.python-gsoc.org/en/anubhavps-blog/weekly-check-in-13-16-aug-22-aug/<h2>What did you do this week?</h2> <ul> <li>I worked on getting Travis to push releases automatically to PyPI, adding a new `ROBOTSTXT_USER_AGENT` setting in Scrapy, and improvements to SitemapSpider.</li> </ul> <h2>What is coming up next?</h2> <ul> <li>I am going to work on the final PSF blog post in which I will focus on my experience of GSoC 2019 working with my awesome mentors, and Scrapy.</li> <li>Next, I will write a final report for third evaluation of Google Summer of Code.</li> <li>Next up, I will work on the changes suggested on my this week's work.</li> </ul> <h2>Did you get stuck anywhere?</h2> <ul> <li>Nothing Major.</li> </ul>anubhavp28@gmail.com (anubhavp)Tue, 20 Aug 2019 13:10:39 +0000https://blogs.python-gsoc.org/en/anubhavps-blog/weekly-check-in-13-16-aug-22-aug/Weekly Check-in #12: (9 Aug - 15 Aug)https://blogs.python-gsoc.org/en/anubhavps-blog/weekly-check-in-12-9-aug-15-aug/<h2>What did you do this week?</h2> <ul> <li>Benchmarking Protego (again). This time we crawled multiple domains (~1,100 domains) and downloaded links to pages as the crawler encounter them. We downloaded 111, 824 links in total. <ul> <li>Next we made each robots.txt parser - parse and answer query (we made parsers answer each query 5 times) in an order similar to how they would need to in a broad crawl. 
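 The harness itself is short; here is a simplified sketch (the `robots_files` and `queries` variables stand in for the downloaded data, and the real script records the same numbers for every parser, not just Protego):
<pre>
import time
from statistics import quantiles

from protego import Protego  # parser under test; the other parsers are timed the same way


def benchmark(robots_files, queries):
    """robots_files: {domain: robots.txt text}, queries: list of (domain, url, user_agent)."""
    parsers, timings = {}, []
    for domain, url, user_agent in queries:
        start = time.perf_counter()
        if domain not in parsers:          # parse lazily, the way a broad crawl encounters domains
            parsers[domain] = Protego.parse(robots_files[domain])
        for _ in range(5):                 # answer each query 5 times
            parsers[domain].can_fetch(url, user_agent)
        timings.append(time.perf_counter() - start)
    p25, p50, p75 = quantiles(timings, n=4)   # 25th/50th/75th percentiles
    return p25, p50, p75, max(timings), sum(timings)
</pre>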
Here are the results :</li> </ul> </li> </ul> <p style="margin-left: 120px;"><strong>Protego :</strong></p> <p style="margin-left: 120px;">25th percentile : 0.002419 seconds<br> 50th percentile : 0.006798 seconds<br> 75th percentile : 0.014307 seconds<br> 100th percentile : 2.546978 seconds<br> Total Time : 19.545984 seconds</p> <p style="margin-left: 120px;"><strong>RobotFileParser (default in Scrapy) :</strong></p> <p style="margin-left: 120px;">25th percentile : 0.002188 seconds<br> 50th percentile : 0.005350 seconds<br> 75th percentile : 0.010492 seconds<br> 100th percentile : 1.805923 seconds<br> Total Time : 13.799954 seconds</p> <p style="margin-left: 120px;"><strong>Rerp Parser :</strong><br> 25th percentile : 0.001288 seconds<br> 50th percentile : 0.005222 seconds<br> 75th percentile : 0.014640 seconds<br> 100th percentile : 52.706880 seconds<br> Total Time : 76.460496 seconds</p> <p style="margin-left: 120px;"><strong>Reppy Parser :</strong><br> 25th percentile : 0.000210 seconds<br> 50th percentile : 0.000492 seconds<br> 75th percentile : 0.000997 seconds<br> 100th percentile : 0.129440 seconds<br> Total Time : 1.405558 seconds</p> <p style="margin-left: 120px;"> </p> <ul> <li>Removed a hack used in Protego due to the lack of an option to ignore characters in `urllib.parse.unquote`. Added a few new features to Protego as well.</li> <li>Protego has been moved to the Scrapy organisation.</li> </ul> <h2>What is coming up next?</h2> <ul> <li>Configuring Travis to push to PyPI automatically.</li> <li>Introducing a new `ROBOTSTXT_USER_AGENT` setting in Scrapy.</li> <li>Making `SitemapCrawler` use the new interface.</li> </ul> <h2>Did you get stuck anywhere?</h2> <p>I got blocked by StackExchange for a few hours. <img alt="laugh" src="https://blogs.python-gsoc.org/static/djangocms_text_ckeditor/ckeditor/plugins/smiley/images/teeth_smile.png"> I think they don't like crawlers on their websites. "It is a temporary automatic ban that is put up by our HAProxy instance when you hit our rate limiter," they explained in an answer to one of the questions on their website.</p>anubhavp28@gmail.com (anubhavp)Tue, 13 Aug 2019 12:59:47 +0000https://blogs.python-gsoc.org/en/anubhavps-blog/weekly-check-in-12-9-aug-15-aug/Weekly Check-in #11: ( 2 Aug - 8 Aug )https://blogs.python-gsoc.org/en/anubhavps-blog/weekly-check-in-11-2-aug-8-aug/<h2>What did you do this week?</h2> <ul> <li>Added an API description and more usage examples to the readme.</li> <li>Added a PyPy test environment.</li> <li>Opened a pull request to add Protego integration to Scrapy.</li> <li>Modified Protego to treat non-terminal dollar signs as ordinary characters (a rough sketch of this matching logic appears at the end of this check-in).</li> <li>Minor aesthetic changes.</li> </ul> <h2>What is coming up next?</h2> <ul> <li>Transferring the Protego repository to the Scrapy organisation on GitHub. It seems that write permissions are necessary for initiating the transfer process.</li> <li>Would modify Protego to treat wildcards such as `*` and `$` as ordinary characters as well.</li> <li>Would modify `SitemapCrawler` to use the new interface.</li> <li>Implementing support for the `host` &amp; `crawl-delay` directives in Scrapy.</li> <li>Some performance improvements might be possible by using custom pattern-matching logic (in place of regex), but I am not sure. I will need to test it.</li> </ul> <h2>Did you get stuck anywhere?</h2> <ul> <li>Faced problems setting up the PyPy test environment. With help from my mentors, I was able to solve the issue. 
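 And, as promised in the first section, a rough sketch of how a rule pattern with `*` wildcards and a non-terminal `$` can be translated for matching (illustrative only, not Protego's actual code):
<pre>
import re


def rule_to_regex(pattern):
    """Translate a robots.txt rule pattern into a compiled regex (toy version).

    '*' matches any run of characters; '$' anchors the end of the URL, but only
    when it is the final character of the pattern - anywhere else it is literal.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile(".*".join(parts) + ("$" if anchored else ""))


assert rule_to_regex("/api/*.json$").match("/api/v1/data.json")
assert rule_to_regex("/price$list").match("/price$list-2019")  # non-terminal '$' stays literal
</pre>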
</li> </ul>anubhavp28@gmail.com (anubhavp)Tue, 06 Aug 2019 13:08:03 +0000https://blogs.python-gsoc.org/en/anubhavps-blog/weekly-check-in-11-2-aug-8-aug/Weekly Check-in #10 : ( 26 July - 1 Aug )https://blogs.python-gsoc.org/en/anubhavps-blog/weekly-check-in-10-26-july-1-aug/<h2>What did you do this week?</h2> <ul> <li>Improved performance of Protego by implementing lazy regex compilation.</li> <li>Benchmark Results : <ul> <li>Time took to parse 570 `robots.txt` files :</li> </ul> </li> </ul> <p style="margin-left: 80px;">Protego : <br> 25th percentile : 0.000134<br> 50th percentile : 0.000340<br> 75th percentile : 0.000911<br> 100th percentile : 0.345727<br> Total Time : 0.999360</p> <p style="margin-left: 80px;">Rerp : <br> 25th percentile : 0.000066<br> 50th percentile : 0.000123<br> 75th percentile : 0.000279<br> 100th percentile : 0.101409<br> Total Time : 0.317715</p> <p style="margin-left: 80px;">Reppy : <br> 25th percentile : 0.000028<br> 50th percentile : 0.000038<br> 75th percentile : 0.000063<br> 100th percentile : 0.015579<br> Total Time : 0.055850</p> <ul style="margin-left: 40px;"> <li>Time took to parse 570 `robots.txt` and answer 1000 queries per `robots.txt` :</li> </ul> <p style="margin-left: 80px;">Protego : <br> 25th percentile : 0.009057<br> 50th percentile : 0.012806<br> 75th percentile : 0.023660<br> 100th percentile : 9.033481<br> Total Time : 21.999680</p> <p style="margin-left: 80px;">Rerp : <br> 25th percentile : 0.006096<br> 50th percentile : 0.011864<br> 75th percentile : 0.041876<br> 100th percentile : 35.027233<br> Total Time : 68.811635</p> <p style="margin-left: 80px;">Reppy : <br> 25th percentile : 0.000858<br> 50th percentile : 0.001018<br> 75th percentile : 0.001472<br> 100th percentile : 0.236081<br> Total Time : 1.132098</p> <h2>What is coming up next?</h2> <ul> <li>Will depend on the review from the mentors. If everything looks good to them, I would shift my focus back on Scrapy.</li> </ul> <h2>Did you get stuck anywhere?</h2> <ul> <li>Nothing major.</li> </ul>anubhavp28@gmail.com (anubhavp)Tue, 30 Jul 2019 13:10:59 +0000https://blogs.python-gsoc.org/en/anubhavps-blog/weekly-check-in-10-26-july-1-aug/[Blog #4] Need For Speedhttps://blogs.python-gsoc.org/en/anubhavps-blog/blog-4-need-for-speed/<p>&lt;meta charset="utf-8"&gt;<b id="docs-internal-guid-f8a95d8f-7fff-c33e-d983-58d043e78f73">Hey! This is my fifth blog post for GSoC 2019, covering week 7 and 8.</b></p> <p dir="ltr"><b id="docs-internal-guid-f8a95d8f-7fff-c33e-d983-58d043e78f73">The most of week 7 was spent making Protego compatible with Google's parser. I also worked on the documentation, since Protego codebase is small enough, proper comments and a good readme was sufficient. I uploaded Protego to PyPI - `pip install Protego` that's all it takes to install Protego. </b></p> <p dir="ltr"><b id="docs-internal-guid-f8a95d8f-7fff-c33e-d983-58d043e78f73">Week 8 was quite interesting. For Protego to become default in Scrapy, it is necessary that it doesn’t throw any kind of error while parsing `robots.txt` files. To make sure that, I decided to download `robots.txt` from top 10,000 websites. I added tests to see if Protego throws any exceptions while parsing the downloaded `robots.txt`. I benchmarked Protego, and the results were quite disappointing. You can see the result here. </b></p> <p dir="ltr"><b id="docs-internal-guid-f8a95d8f-7fff-c33e-d983-58d043e78f73">We decided to spend the next week improving performance of Protego. 
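 The obvious first step is Python's built-in profiler; the kind of run I have in mind looks something like this (the file name and URL are just placeholders):</b></p> <pre>
import cProfile
import pstats

from protego import Protego

with open("downloads/example.com-robots.txt") as f:   # placeholder path to one downloaded file
    content = f.read()


def run():
    rp = Protego.parse(content)
    for _ in range(1000):
        rp.can_fetch("https://example.com/some/long/path?with=query", "mybot")


cProfile.run("run()", "protego.prof")
pstats.Stats("protego.prof").sort_stats("cumulative").print_stats(15)
</pre> <p dir="ltr"><b>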
I am going to try profiling and heuristics, and see if the performance can be improved.</b></p>anubhavp28@gmail.com (anubhavp)Wed, 24 Jul 2019 18:59:44 +0000https://blogs.python-gsoc.org/en/anubhavps-blog/blog-4-need-for-speed/Weekly Check-in #9 : ( 19 July - 26 July )https://blogs.python-gsoc.org/en/anubhavps-blog/weekly-check-in-9-19-july-26-july/<h2>What did you do this week?</h2> <ul> <li>I added tests to make sure Protego doesn't throw exceptions on `robots.txt` of top 10,000 most popular websites.</li> <li>Utilised Scrapy to create a tool to download `robots.txt` of top 10,000 most popular websites.</li> <li>Benchmarked Protego : I ran Protego(written in Python), Robotexclusionrulesparser(written in Python), Reppy(written in C++ but has Python interface) on 570 `robots.txt` downloaded from top 1000 websites, and here are the results. <ul> <li>Time spend parsing the `robots.txt` <ul> <li>Protego : 79.00128873897484 seconds  <img alt="broken heart" height="23" src="https://blogs.python-gsoc.org/static/djangocms_text_ckeditor/ckeditor/plugins/smiley/images/broken_heart.png" title="broken heart" width="23"></li> <li>Robotexclusionrulesparser : 0.30100024401326664 seconds</li> <li>Reppy : 0.05821833698428236 seconds</li> </ul> </li> <li>Time spend answering queries (1000 queries (all were identical) per `robots.txt`) <ul> <li>Protego : 14.736387926022871 seconds</li> <li>Robotexclusionrulesparser : 67.33521732398367 seconds</li> <li>Reppy : 1.0866852040198864 seconds</li> </ul> </li> </ul> </li> <li>Added logging to Protego.</li> </ul> <h2>What is coming up next?</h2> <ul> <li>Will depend on the review from the mentors. If everything looks good to them, I would shift my focus back on Scrapy.</li> <li>Make `SitemapSpider` use the new interface for `robots.txt` parsers.</li> <li>Implement Crawl-delay &amp; Host directive support in Scrapy.</li> </ul> <h2>Did you get stuck anywhere?</h2> <ul> <li>Nothing major.</li> </ul>anubhavp28@gmail.com (anubhavp)Tue, 23 Jul 2019 13:10:46 +0000https://blogs.python-gsoc.org/en/anubhavps-blog/weekly-check-in-9-19-july-26-july/[Blog #3] Google open-sourced its robots.txt parserhttps://blogs.python-gsoc.org/en/anubhavps-blog/blog-3-google-open-sourced-its-robots-txt-parser/<p>&lt;meta charset="utf-8"&gt;</p> <p dir="ltr"><b id="docs-internal-guid-854a5f8d-7fff-b8f8-3115-93df060054c3">Hey! This is my fourth blog post for GSoC 2019, covering week 5 and 6.</b></p> <p dir="ltr"><b id="docs-internal-guid-854a5f8d-7fff-b8f8-3115-93df060054c3">Few interesting things have happened, Google has open-sourced its robots.txt parser, and have also taken the lead role in pushing the community and enterprises to create an official specification for `robots.txt`. I spend a good amount of time making Protego compatible with Google’s parser. This required rewriting a good chunk of Protego to support Google’s parser specific things such as merging record group, supporting misspellings, etc.</b></p> <p dir="ltr"><b id="docs-internal-guid-854a5f8d-7fff-b8f8-3115-93df060054c3">I am scared of reading or writing C++ code that uses STL or pointers heavily. So really going through the source code of Google’s parser was kind of uncomforting, but I was able to understand a good chunk of it, after a few days of struggle.  </b></p> <p dir="ltr"><b id="docs-internal-guid-854a5f8d-7fff-b8f8-3115-93df060054c3">Next up, I will work on making Protego 100% compatible with Google’s parser. I will have to document Protego. 
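 The surface to document is small; basic usage currently looks roughly like this (details may still change):</b></p> <pre>
from protego import Protego

robotstxt = """
User-agent: *
Disallow: /private/
Crawl-delay: 10
Sitemap: https://example.com/sitemap.xml
"""

rp = Protego.parse(robotstxt)
rp.can_fetch("https://example.com/private/data", "mybot")   # False - matches the Disallow rule
rp.can_fetch("https://example.com/public/page", "mybot")    # True
rp.crawl_delay("mybot")                                      # crawl delay, in seconds
list(rp.sitemaps)                                            # ['https://example.com/sitemap.xml']
</pre> <p dir="ltr"><b>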
I will collect robots.txt from top 1000 websites to understand usage patterns.</b></p> <p dir="ltr"><b id="docs-internal-guid-854a5f8d-7fff-b8f8-3115-93df060054c3">     </b></p> <p> </p>anubhavp28@gmail.com (anubhavp)Mon, 22 Jul 2019 11:46:30 +0000https://blogs.python-gsoc.org/en/anubhavps-blog/blog-3-google-open-sourced-its-robots-txt-parser/[Blog #2] Protego parse!https://blogs.python-gsoc.org/en/anubhavps-blog/blog-2-protego-parse/<p><span style="font-family: Arial,Helvetica,sans-serif;">Hey! This is my third blog post for GSoC 2019, covering week 3 and 4.</span></p> <p><span style="font-family: Arial,Helvetica,sans-serif;">The first part of my project concerning interface for `robots.txt` parsers is almost complete.</span></p> <p><span style="font-family: Arial,Helvetica,sans-serif;">I have started working on a pure-Python `robots.txt` parser which I have named Protego. The name is borrowed from <a href="https://en.wikipedia.org/wiki/Harry_Potter_(film_series)">Harry Potter</a> universe, where Protego is a charm that creates a shield to protect the caster. The end goal for Protego is to support all of the popular directives, wildcard matching, and a good number of less popular directives. Also, we aim to make Protego 100% compatible with <a href="https://github.com/google/robotstxt">Google's robots.txt parser</a>. We intend Protego to become the deafult `robots.txt` parser in Scrapy.</span></p> <p><span style="font-family: Arial,Helvetica,sans-serif;">I have implemented support for all major directives in Protego. I have also implemented support for wildcard matching. I utilised pytest and tox to automate testing Protego on every version of Python. Furthur used Travis to run tests automatically on code push and pull requests. I borrowed tests from other parsers to check Protego on. Protego currently passes all tests borrowed from `reppy`, `rep-cpp` and `robotexlusionrulesparser`. </span></p>anubhavp28@gmail.com (anubhavp)Sun, 21 Jul 2019 17:57:00 +0000https://blogs.python-gsoc.org/en/anubhavps-blog/blog-2-protego-parse/[Blog #1 - Week 3] We are going faster than I predictedhttps://blogs.python-gsoc.org/en/anubhavps-blog/blog-1-week-3-we-are-going-faster-than-i-predicted/<p><span style="font-family: Arial,Helvetica,sans-serif;">Hey, welcome back! It has been about 3 weeks since the beginning of the coding period, and I am making progress at a pace faster than I thought we could. Here are the details of the work I did in the first three weeks.</span></p> <p><span style="font-family: Arial,Helvetica,sans-serif;">My week prior to the beginning of the coding period was spent setting up a development environment on my computer, and learning about the internals of Scrapy. Since, Scrapy is python project, setting up an development environment is fairly straightforward. Clone the Scrapy repository `git clone https://github.com/scrapy/scrapy`, and create a <a href="https://realpython.com/python-virtual-environments-a-primer/">virtual environment</a> using `python3 -m venv venv` (it creates a Python 3 virtual environment on Ubuntu). Now, activate the environment using `source venv/bin/activate`. Install python dependencies by running `pip install -r requirements-py3.txt`. If you are planning to run tests locally, install test dependencies by running `pip install -r tests/requirements-py3.txt`. The setup is complete.</span></p> <p><span style="font-family: Arial,Helvetica,sans-serif;">Scrapy uses <a href="https://tox.readthedocs.io/en/latest/">Tox</a> for testing. 
`tox` is virtual environment management tool that can be used to run tests using multiple versions of python. `tox` also reduces the work required for integrating CIs. To run tests using Python 2.7 and Python 3.7, use `tox -e py27 -e py37`. I had minor difficulty (I was unable to run `tox` using any version other than 2.7) as I was using `tox` for the first time, but I was easily able to solve it using online documentation and help from mentors. You can learn more about `tox` <a href="https://medium.com/@alejandrodnm/testing-against-multiple-python-versions-with-tox-9c68799c7880">here</a>.</span></p> <p><span style="font-family: Arial,Helvetica,sans-serif;">First and second weeks were spend deciding on a interface specification, and implementing the interface on three existing `robots.txt` parsers. Using feedback from the mentors, I improved upon the specification I proposed in my GSoC proposal. Documenting the interface turned out to be harder than implementing it. Both <a href="http://www.sphinx-doc.org/en/master/">Sphinx</a> and `reStructuredText` was new for me. I also found it difficult to occasionally describe things clearly (documentation is hard). While implementing the interface on top of `RobotFileParser`, I had to take deep dive into its source code. For some reason, I always had this belief that reading through the implementation of python (or any language) and its inbuilt modules, would be difficult and not really useful, and code would mostly be complex and cryptic. This doesn’t seem to be the case (at least with python). I should do more of this, looking at a module’s implementation for fun :smile:. I modified Scrapy to use the new interface instead of directly calling `RobotFileParser`.</span></p> <p><span style="font-family: Arial,Helvetica,sans-serif;">Next, I will work on creating extra test environments in <a href="https://travis-ci.org/">Travis</a> for testing third-party parser integration. I haven't work with <a href="https://travis-ci.org/">Travis</a> or `tox` before. Also, I will document the functionalities provided by these parser in Scrapy documentation. I will also need to the make the code for testing integration more maintanble.</span></p>anubhavp28@gmail.com (anubhavp)Sun, 21 Jul 2019 09:38:36 +0000https://blogs.python-gsoc.org/en/anubhavps-blog/blog-1-week-3-we-are-going-faster-than-i-predicted/Weekly Check-in #8 : (12 July - 18 July)https://blogs.python-gsoc.org/en/anubhavps-blog/weekly-check-in-8-12-july-18-july/<h2>What did you do this week?</h2> <ul> <li>Protego is now compatible with Google's parser. Protego pass all testcase in Google's parser testsuite.</li> <li>Documented Protego - functions now have docstrings, and I have created a readme file.</li> <li>Protego is now on PyPI. To install Protego, simply type `pip install protego`.</li> <li>Made few changes to improve performance as suggested by mentors.</li> </ul> <h2>What is coming up next?</h2> <ul> <li>Additions to CI - making sure Protego doesn't throw exceptions on robots.txt of top 1000 alexa websites.</li> <li>Benchmark Protego.</li> </ul> <h2>Did you get stuck anywhere?</h2> <p>Nothing major.</p>anubhavp28@gmail.com (anubhavp)Thu, 18 Jul 2019 17:09:00 +0000https://blogs.python-gsoc.org/en/anubhavps-blog/weekly-check-in-8-12-july-18-july/Weekly Check-in #7: (5 July - 11 July)https://blogs.python-gsoc.org/en/anubhavps-blog/weekly-check-in-7-5-july-11-july/<p>Hey! 
Here is an update on what I have achieved so far.</p> <h2>What did you do this week?</h2> <ul> <li>Protego now passes all tests borrowed from reppy, rep-cpp and robotexclusionrulesparser.</li> <li>Made a few changes to Protego to make it compatible with Google's parser.</li> <li>Worked on the changes suggested on the interface pull request.</li> <li>Wrote code to fetch robots.txt files from the top 1000 websites and generate the statistics we need. ( <a href="https://nbviewer.jupyter.org/github/anubhavp28/robotstxt_usage_stats/blob/master/Robotstxt_Stats.ipynb">link</a> )</li> <li>Looked at the code of Google's robots.txt parser for the purpose of creating a Python interface on top of it. I might need to modify its code, as it currently re-parses the robots.txt file to answer every query. (Working on anything in C++ that uses pointers or the STL heavily makes me feel uncomfortable.)</li> </ul> <h2>What is coming up next?</h2> <ul> <li>Modify Protego to make it behave similarly to Google's parser (this will need a few more features, like record-group merging), and add more tests.</li> <li>Document Protego.</li> <li>Benchmark Protego's performance.</li> <li>I will need to read up on how to call C/C++ code from Python, for creating an interface on top of Google's parser. I am currently thinking of using Cython.</li> <li>Work on blog posts (planning to write 3 blog posts within this week).</li> </ul> <h2>Did you get stuck anywhere?</h2> <p>No. I got to work with some data science tools like Jupyter Notebook &amp; pandas.</p>anubhavp28@gmail.com (anubhavp)Tue, 09 Jul 2019 11:57:10 +0000https://blogs.python-gsoc.org/en/anubhavps-blog/weekly-check-in-7-5-july-11-july/Weekly Check-in #6: (28 Jun - 4 July)https://blogs.python-gsoc.org/en/anubhavps-blog/weekly-check-in-6-28-jun-4-july/<p>Hello! It has been more than a month since the beginning of the GSoC coding period, and I am completely satisfied with my progress. I am glad that I chose to work remotely this summer, since it has been raining heavily for the last 5-6 days in Mumbai (the place where I live). <img alt="smiley" height="23" src="https://blogs.python-gsoc.org/static/djangocms_text_ckeditor/ckeditor/plugins/smiley/images/regular_smile.png" title="smiley" width="23"></p> <h2>What did you do this week?</h2> <ul> <li>Implemented support for wildcard matching and the `request-rate` directive.</li> <li>Increased the number of test cases; some of them were borrowed from `rep-cpp`, another robots.txt parser. The new test cases ensure that the parser behaviour conforms to the Google specification for robots.txt.</li> </ul> <p>PS: I really overestimated the amount of work I would be able to do in a week while writing the last check-in - <a href="https://en.wikipedia.org/wiki/Hofstadter%27s_law">Hofstadter's law</a>.</p> <h2>What is coming up next?</h2> <ul> <li>Google open-sourced their robots.txt parser - <a href="https://github.com/google/robotstxt">here</a>. Since Google is the most popular search engine in the world, it is likely that for most websites the largest share of crawling requests they receive originates from Google. Hence, there ought to be more robots.txt files written for Google's crawler than for any other crawler. 
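 For example, under Google's rules, groups that name the same user agent are merged before matching; a toy illustration (not Protego's actual code):
<pre>
def merge_groups(groups):
    """Merge rule groups that name the same user agent, as Google's spec describes.

    groups: list of (user_agent, rules) tuples in file order.
    """
    merged = {}
    for user_agent, rules in groups:
        merged.setdefault(user_agent.lower(), []).extend(rules)
    return merged


groups = [("*", ["Disallow: /a"]), ("googlebot", ["Disallow: /b"]), ("*", ["Disallow: /c"])]
assert merge_groups(groups)["*"] == ["Disallow: /a", "Disallow: /c"]
</pre>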
It would make sense to make Protego behave in a similar way to Google's parser.</li> <li>Documenting Protego.</li> <li>Increasing the number of test cases (we can borrow some from Google's parser).</li> <li>Benchmarking Protego's performance against other parsers.</li> <li>Regarding the collecting statistics on robots.txt usage, I am not completely sure if it would be a good idea to invest time into it, now that I have found a <a href="https://www.ctrl.blog/entry/arcane-robotstxt-directives.html">blog</a>, which describes the popularity of individual directives, and could help us prioritise which ones to implement. </li> </ul> <h2>Did you get stuck anywhere?</h2> <p>Nothing major. I found it simpler than I had initially expected.</p>anubhavp28@gmail.com (anubhavp)Tue, 02 Jul 2019 11:40:42 +0000https://blogs.python-gsoc.org/en/anubhavps-blog/weekly-check-in-6-28-jun-4-july/Weekly Check-in #5 : ( 21 Jun - 27 Jun )https://blogs.python-gsoc.org/en/anubhavps-blog/weekly-check-in-5-21-jun-27-jun/<p><b id="docs-internal-guid-ac11638d-7fff-a4dc-6ca2-7c895d40dc0d">Hello! The fourth week of GSoC coding period is coming to an end. Here is an update on what I achieved in the past week and what I am looking forward to.</b></p> <p> </p> <h2 dir="ltr"><b id="docs-internal-guid-ac11638d-7fff-a4dc-6ca2-7c895d40dc0d">What did you do this week?</b></h2> <ul> <li dir="ltr"> <p dir="ltr"><b>Implemented minor changes suggested by scrapy maintainers.</b></p> </li> <li dir="ltr"> <p dir="ltr"><b>Started working on a new pure python robots.txt parser (which </b><strong>lives here <a href="https://github.com/anubhavp28/protego">https://github.com/anubhavp28/protego</a> currently). It will eventually be moved to Scrapy organisation.</strong></p> </li> <li dir="ltr"> <p dir="ltr"><b>Implemented support for standard robots.txt directives in the new parser.</b></p> </li> <li dir="ltr"> <p dir="ltr"><b>Integrated the code with pytest and tox for testing.</b></p> </li> <li dir="ltr"> <p dir="ltr"><b>Integrated the repo with Travis CI to trigger tests automatically on pull requests.</b></p> </li> </ul> <p> </p> <h2 dir="ltr"><b id="docs-internal-guid-ac11638d-7fff-a4dc-6ca2-7c895d40dc0d">What is coming up next?</b></h2> <ul> <li dir="ltr"> <p dir="ltr"><b>Implement support for modern conventions like wildcard matching, clear-param etc.</b></p> </li> <li dir="ltr"> <p dir="ltr"><b>Add a lot of tests (mostly borrowed from existing parsers).</b></p> </li> <li dir="ltr"> <p dir="ltr"><b>Performance benchmarking of the new parser (against existing parsers).</b></p> </li> <li dir="ltr"> <p dir="ltr"><b>Collecting statistics related to use of robots.txt. On suggestion of a mentor, I am planing to use robots.txt files of Top 1000 websites in alexa rankings, and collect stats such as how many of them use robots.txt, how many records on average a record group contain, how many times a certain directives is mentioned, etc. This could help use make better choices for improving performance - such as whether to use a trie for prefix matching, etc.  </b></p> </li> </ul> <p> </p> <h2 dir="ltr"><b id="docs-internal-guid-ac11638d-7fff-a4dc-6ca2-7c895d40dc0d">Did you get stuck anywhere?</b></h2> <p dir="ltr"><b>Oh, actually naming the parser was the hardest part  </b><span style="font-size: 8px;"><strong><img alt="laugh" height="23" src="https://blogs.python-gsoc.org/static/djangocms_text_ckeditor/ckeditor/plugins/smiley/images/teeth_smile.png" title="laugh" width="23"></strong></span><b> . I am still not satisfied with the name. 
I just ran out of ideas.</b></p> <p dir="ltr"> </p>anubhavp28@gmail.com (anubhavp)Tue, 25 Jun 2019 09:51:27 +0000https://blogs.python-gsoc.org/en/anubhavps-blog/weekly-check-in-5-21-jun-27-jun/Weekly Check-in #4: (14 Jun - 20 Jun )https://blogs.python-gsoc.org/en/anubhavps-blog/weekly-check-in-4-14-jun-20-jun/<p><b id="docs-internal-guid-ac11638d-7fff-a4dc-6ca2-7c895d40dc0d">Hello! The third week of GSoC coding period is coming to an end. Here is an update on what I achieved in the past week and what I am looking forward to.</b></p> <p> </p> <h2 dir="ltr"><b id="docs-internal-guid-ac11638d-7fff-a4dc-6ca2-7c895d40dc0d">What did you do this week?</b></h2> <ul> <li dir="ltr"> <p dir="ltr"><b>Created separate tox testing environments for testing integration with third-party parsers like </b><b id="docs-internal-guid-ac11638d-7fff-a4dc-6ca2-7c895d40dc0d"><a href="http://nikitathespider.com/python/rerp/">Robotexclusionrulesparser</a> and <a href="https://github.com/seomoz/reppy">Reppy</a>.</b></p> </li> <li dir="ltr"> <p dir="ltr"><b>Made Travis use the new tox environments.</b></p> </li> <li dir="ltr"> <p dir="ltr"><b>Described these parsers in Scrapy documentation.</b></p> </li> <li dir="ltr"> <p dir="ltr"><b>Got </b><b id="docs-internal-guid-ac11638d-7fff-a4dc-6ca2-7c895d40dc0d"><a href="http://nikitathespider.com/python/rerp/">Robotexclusionrulesparser</a> to work with unicode user agents.</b></p> </li> </ul> <p> </p> <h2 dir="ltr"><b id="docs-internal-guid-ac11638d-7fff-a4dc-6ca2-7c895d40dc0d">What is coming up next?</b></h2> <p dir="ltr"><b id="docs-internal-guid-ac11638d-7fff-a4dc-6ca2-7c895d40dc0d">I will be working on creating a python based robots.txt parser which compliant with spec and supports modern conventions.</b></p> <h2 dir="ltr"> </h2> <h2 dir="ltr"><b id="docs-internal-guid-ac11638d-7fff-a4dc-6ca2-7c895d40dc0d">Did you get stuck anywhere?</b></h2> <p dir="ltr"><b id="docs-internal-guid-ac11638d-7fff-a4dc-6ca2-7c895d40dc0d">Nothing major.  </b></p>anubhavp28@gmail.com (anubhavp)Tue, 18 Jun 2019 11:33:04 +0000https://blogs.python-gsoc.org/en/anubhavps-blog/weekly-check-in-4-14-jun-20-jun/Weekly Check-in #3 : ( 7 Jun - 13 Jun )https://blogs.python-gsoc.org/en/anubhavps-blog/weekly-check-in-3-7-jun-13-jun/<p><meta charset="utf-8"><b id="docs-internal-guid-ac11638d-7fff-a4dc-6ca2-7c895d40dc0d">Hello, wandering pythonistas! The second week of GSoC coding period is coming to an end. Here is an update on what I achieved in the past week and what I am looking forward to.</b></p> <p> </p> <h2 dir="ltr"><b id="docs-internal-guid-ac11638d-7fff-a4dc-6ca2-7c895d40dc0d">What did you do this week?</b></h2> <ul> <li dir="ltr"> <p dir="ltr"><b id="docs-internal-guid-ac11638d-7fff-a4dc-6ca2-7c895d40dc0d">I made few changes to the interface according to the feedback received from the mentors.</b></p> </li> <li dir="ltr"> <p dir="ltr"><b id="docs-internal-guid-ac11638d-7fff-a4dc-6ca2-7c895d40dc0d">I implemented the interface on top of third party parsers like <a href="http://nikitathespider.com/python/rerp/">Robotexclusionrulesparser</a> and <a href="https://github.com/seomoz/reppy">Reppy</a>.</b></p> </li> <li dir="ltr"> <p dir="ltr"><b id="docs-internal-guid-ac11638d-7fff-a4dc-6ca2-7c895d40dc0d">Wrote tests for testing the implementation of interface on top of the two parsers. 
The tricky part was reducing duplication of code and keeping the test </b><b>maintainable.</b></p> </li> <li dir="ltr"> <p dir="ltr"><b id="docs-internal-guid-ac11638d-7fff-a4dc-6ca2-7c895d40dc0d">Modified Scrapy to use the new interface (instead of directly calling Python’s inbuilt <a href="https://docs.python.org/3/library/urllib.robotparser.html">RobotFileParser</a>). </b></p> </li> <li dir="ltr"> <p dir="ltr"><b id="docs-internal-guid-ac11638d-7fff-a4dc-6ca2-7c895d40dc0d">I had the weekly meeting with my mentors, where we discussed new stretch goals for the project. </b></p> </li> </ul> <p> </p> <h2 dir="ltr"><b id="docs-internal-guid-ac11638d-7fff-a4dc-6ca2-7c895d40dc0d">What is coming up next?</b></h2> <p dir="ltr"><b id="docs-internal-guid-ac11638d-7fff-a4dc-6ca2-7c895d40dc0d">It will depend on the feedback of the mentors. If everything seems good to them, I will focus my attention on writing a pure python robots.txt parser. </b></p> <h2 dir="ltr"> </h2> <h2 dir="ltr"><b id="docs-internal-guid-ac11638d-7fff-a4dc-6ca2-7c895d40dc0d">Did you get stuck anywhere?</b></h2> <p dir="ltr"><b id="docs-internal-guid-ac11638d-7fff-a4dc-6ca2-7c895d40dc0d">Nothing major, though I had little difficulty due my lack of knowledge of difference between Python 2 and Python 3. I knew Python 3 uses unicode string by default, what I didn’t know is that in Python 3 `bytes` and `str` type are different. Hence, encoding a string produces an object of type `bytes`. This actually makes sense, having different types for string and arbitrary binary data.      </b></p> <p> </p>anubhavp28@gmail.com (anubhavp)Sat, 08 Jun 2019 13:21:16 +0000https://blogs.python-gsoc.org/en/anubhavps-blog/weekly-check-in-3-7-jun-13-jun/[Blog #0] Accepted for GSoC 2019 : 3 Months of Open Source Aheadhttps://blogs.python-gsoc.org/en/anubhavps-blog/blog-0-accepted-for-gsoc-2019-3-months-of-open-source-ahead/<p> </p> <p><span style="font-family: Arial,Helvetica,sans-serif;"><img alt="" height="416" src="https://1.bp.blogspot.com/-T-jEcKc3EIc/XJVXUldWEJI/AAAAAAAAB4U/Mqk1-XPQ0LEuemA16SXUQ4gbeXwjiDFDwCLcBGAs/s1600/GSoC%2B-%2BVertical%2BWide%2B-%2BGray%2BText%2B-%2BWhite%2BBG.png" width="700"></span></p> <p><span style="font-family: Arial,Helvetica,sans-serif;">---</span></p> <p><span style="font-family: Arial,Helvetica,sans-serif;">Hello! I have been selected as a student for Google Summer of Code 2019. For those of you who are unaware, <a href="https://summerofcode.withgoogle.com/">Google Summer of Code</a> is a global program focused on bringing more student developers into open source software development. Students developers get an opportunity to work with an open source organization on a 3 month programming project.</span></p> <p><span style="font-family: Arial,Helvetica,sans-serif;">I am working on <a href="https://scrapy.org/">Scrapy</a> - an open-source scraping and web crawling framework. My task is to implement an interface for `robots.txt` parsers in Scrapy. The stretch goal of the project is to write a fully spec compliant pure python `robots.txt` parser.</span></p> <p><span style="font-family: Arial,Helvetica,sans-serif;">This blog is part of a series of blog posts where I will go in depth to describe my work. Since, this is the first blog post, I have dedicated it to explain about my project. My project has a lot to do with `robots.txt` and <a href="https://en.wikipedia.org/wiki/Robots_exclusion_standard">Robot Exclusion Standard</a>. 
Let’s take a look at what these things are.</span></p> <p><span style="font-family: Arial,Helvetica,sans-serif;"><a href="https://en.wikipedia.org/wiki/Web_crawler">Web crawlers</a> (also called spiders) are bots that systematically browse the internet, generally for the purpose of extracting data (called scraping) or indexing. Search engines (eg. google) use web crawlers to index pages on the internet, this index is then used to serve users with relevant results.</span></p> <p><span style="font-family: Arial,Helvetica,sans-serif;">When web crawlers visit a website, they too consume resources of the web servers, similar to a normal user. A crawler can make multiple requests per second. Therefore, unrestricted crawling can lower the performance of a website, and cause annoyance to its users. A solution to this problem is robots exclusion standard, which makes it possible to specify which parts of the website a crawler should not access, and how frequently a crawler should make requests for a resource on the server. The standard allows websites to specify how “polite” a crawler should be.</span></p> <p><span style="font-family: Arial,Helvetica,sans-serif;">Under robots exclusion standard, instructions for crawlers are specified in a file named `robots.txt` placed at the root of the website. These instructions have to be specified in a certain format.</span></p> <p><span style="font-family: Arial,Helvetica,sans-serif;">`robots.txt` standard is not owned by any standard body, and is not in active development since 1997. So, the standard has not been revised for more than two decades which, I will argue, had led it to become outdated. Meanwhile, search giants like google, bing, <s>yahoo!</s> have collaborated to create an informal extension of the standard which their crawlers adhere to. Since usually the majority of crawling requests originates from search engines, most web administrators write `robots.txt` adhering to the informal extension of the specification.</span></p> <p><span style="font-family: Arial,Helvetica,sans-serif;">Scrapy uses <a href="https://docs.python.org/3/library/urllib.robotparser.html">RobotFileParser</a> (Python’s inbuilt `robots.txt` parser), which strictly follows the old specification, and has not been updated. Hence, Scrapy is likely to misinterpret the crawling rules for majority of websites on the internet. There was an effort to switch to another parser, but since there isn’t a fully compliant pure python parser available, and it is difficult to package non-python code with Scrapy because its wide user base consisting of people using a wide variety of platforms and implementations of python. Including non-python code would need dropping the support of <a href="https://pypy.org/">pypy</a>, and would require users to install compilers for other languages (which may not be easy on every platform).</span></p> <p><span style="font-family: Arial,Helvetica,sans-serif;">The short term solution to overcome this could be to allow users to switch to a different parser if they wish too, and keep `RobotFileParser` as the default in Scrapy. It has additional benefit of giving more power to the users. For this, we are planning to create an interface for `robots.txt` parsers in Scrapy, and implementing this interface on top few popular parsers. This the first goal of my project. In the end, we would like to have a fully spec compliant pure python parser which Scrapy could use by default. 
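 To give a flavour of that first goal, the interface could look roughly like the sketch below; the names here are only illustrative, and the real design will come out of discussion with my mentors:</span></p> <pre>
from abc import ABCMeta, abstractmethod


class RobotParser(metaclass=ABCMeta):
    """A common interface that different robots.txt parser implementations could expose."""

    @classmethod
    @abstractmethod
    def from_robotstxt_body(cls, robotstxt_body):
        """Build a parser from the raw bytes of a robots.txt response."""

    @abstractmethod
    def allowed(self, url, user_agent):
        """Return True if the given user agent may fetch the given URL."""


class PythonRobotParser(RobotParser):
    """Adapter over the standard library's RobotFileParser."""

    def __init__(self, rp):
        self.rp = rp

    @classmethod
    def from_robotstxt_body(cls, robotstxt_body):
        from urllib.robotparser import RobotFileParser
        rp = RobotFileParser()
        rp.parse(robotstxt_body.decode("utf-8").splitlines())
        return cls(rp)

    def allowed(self, url, user_agent):
        # note: the standard library takes the user agent first
        return self.rp.can_fetch(user_agent, url)
</pre> <p><span style="font-family: Arial,Helvetica,sans-serif;">Getting a fully compliant pure-Python parser to sit behind such an interface is the more ambitious part. 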
This is the stretch goal of my project.</span></p> <p><span style="font-family: Arial,Helvetica,sans-serif;">Hope, to have an incredible 3 months ahead :smile:. If you need any help regarding Google Summer of Code or you just want to learn more about my work, feel free email me at <a href="mailto:anubhavp28@gmail.cpingom">anubhavp28@gmail.com</a>.</span></p>anubhavp28@gmail.com (anubhavp)Fri, 07 Jun 2019 14:20:15 +0000https://blogs.python-gsoc.org/en/anubhavps-blog/blog-0-accepted-for-gsoc-2019-3-months-of-open-source-ahead/Weekly Check-in #2 [31 May - 6 Jun]https://blogs.python-gsoc.org/en/anubhavps-blog/weekly-check-in-2-31-may-6-jun/<p><meta charset="utf-8"><b id="docs-internal-guid-5ca6b8ae-7fff-e839-1770-1cdee7e63d2e">Hello everyone. The first week of GSoC coding period is coming to an end. Here is an update on what I achieved in the past week and what I am looking forward to.</b></p> <p dir="ltr"> </p> <h2 dir="ltr"><b id="docs-internal-guid-5ca6b8ae-7fff-e839-1770-1cdee7e63d2e">What did you do this week?</b></h2> <p dir="ltr"><b id="docs-internal-guid-5ca6b8ae-7fff-e839-1770-1cdee7e63d2e">I submitted my first pull request related to GSoC. This week mostly involved discussion on interface specification. I learned that designing an interface involves considering several small but quite important details, and a good practice is to question every choice you make. Also, I had a meeting with my mentors where we discussed about weekly milestones and decided to have weekly meetings, every Tuesdays. I implemented the interface on top of Python’s in-built robots.txt parser and worked on documentation related to the interface. </b></p> <p dir="ltr"><b id="docs-internal-guid-5ca6b8ae-7fff-e839-1770-1cdee7e63d2e">I got an opportunity to deep dive into the source code of Python’s in-built robots.txt parser.  For some reason, I always had this belief that reading through the implementation of python (or any language) and its inbuilt modules, would be difficult and not really useful, and code would mostly be complex and cryptic (to a beginner like me). This doesn’t seem to be the case (at least with python). I should do more of this, looking at a module’s implementation for fun •ᴗ• .   </b></p> <h2 dir="ltr"><b id="docs-internal-guid-5ca6b8ae-7fff-e839-1770-1cdee7e63d2e">What is coming up next?</b></h2> <p dir="ltr"><b id="docs-internal-guid-5ca6b8ae-7fff-e839-1770-1cdee7e63d2e">In the next week, I am looking to finalize the interface, and modify Scrapy to use the interface to communicate with the parsers. I would also work on documenting the interface, and if time permits will implement the interface on top of few other parsers.</b></p> <h2 dir="ltr"><b id="docs-internal-guid-5ca6b8ae-7fff-e839-1770-1cdee7e63d2e">Did you get stuck anywhere?</b></h2> <p dir="ltr"><b id="docs-internal-guid-5ca6b8ae-7fff-e839-1770-1cdee7e63d2e">Nope. I learned a lot from constant feedback from my mentors. It was an awesome week •ᴗ• </b></p>anubhavp28@gmail.com (anubhavp)Sun, 02 Jun 2019 02:29:44 +0000https://blogs.python-gsoc.org/en/anubhavps-blog/weekly-check-in-2-31-may-6-jun/Weekly Check-in #1 [24 May - 30 May]https://blogs.python-gsoc.org/en/anubhavps-blog/weekly-check-in-1-24-may-30-may/<p>Hey everyone. I am Anubhav, and this summer I am working to implement an interface for robots.txt parsers in scrapy. This is the first of many upcoming weekly blog posts where I will describe in brief the work I have done in the previous week and my plans for the upcoming week. So, let's get started. 
</p> <h2><big>What did you do this week?</big></h2> <p>Most of my time was spent configuring a local development environment, and learning how to use tox and run tests locally. For the patches I had submitted before, I didn't run tests locally beforehand and relied solely on CI to do it. Running tests locally could have saved a lot of time. Also, I went through the Scrapy contribution guide, learned about Twisted (Scrapy uses it heavily) and PEP 8, and worked on a pull request I had opened before.</p> <h2><big>What is coming up next?</big></h2> <ul> <li>I will have my first meeting with the mentors of the project.</li> <li>I will work on a few pull requests I had opened before.</li> <li>Maybe, since this is the last week of the community bonding period, a discussion with the mentors regarding the interface specification.</li> </ul> <h2>Did you get stuck anywhere?</h2> <p>I had minor difficulties understanding how to run tests using tox. When I followed the instructions given in the Scrapy documentation, I could only run tests in a Python 2.7 environment. Thankfully, tox has incredible documentation that allowed me to understand the settings inside the tox.ini config file. In the end, I just had to make a few edits to my tox.ini file, and I was able to run tests in a Python 3 environment as well.</p>anubhavp28@gmail.com (anubhavp)Thu, 23 May 2019 09:02:57 +0000https://blogs.python-gsoc.org/en/anubhavps-blog/weekly-check-in-1-24-may-30-may/