anubhavp's Blog

[Blog #6] Part of the journey is the end.

anubhavp
Published: 08/25/2019


Part of the journey is the end. It is time for me to write my final work report for the final evaluation of Google Summer of Code 2019. This week, I will devote most of my time to writing the final report and my final blog post. If time permits, I will also work on my PRs from last week.

Last week, I worked on getting Travis to push releases to PyPI automatically, and I redid the benchmarks.

 


[Blog #5] Time just seems to fly.

anubhavp
Published: 08/25/2019

 


Hello! This is my second-to-last blog post for GSoC 2019; time has gone by so quickly. I spent this week documenting Protego’s API in detail. I opened a pull request to add Protego integration to Scrapy. I also added a PyPy test environment and modified Protego to treat a non-terminal dollar sign as an ordinary character.

Up next, I will start the process of transferring Protego to the Scrapy organisation on GitHub. I will also modify `SitemapCrawler` in Scrapy to use the new interface and implement a `ROBOTSTXT_USER_AGENT` setting in Scrapy.

I faced minor problems trying to set up the PyPy environment in Travis. With help from my mentors, I was able to resolve the issue.


Weekly Check-in #13 : ( 16 Aug - 22 Aug )

anubhavp
Published: 08/20/2019

What did you do this week?

  • I worked on getting Travis to push releases automatically to PyPI, adding a new `ROBOTSTXT_USER_AGENT` setting in Scrapy, and improving SitemapSpider.
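
To illustrate the idea behind the `ROBOTSTXT_USER_AGENT` setting, here is a minimal sketch. It is not Scrapy's actual code: it assumes that the setting, when present, overrides the regular `USER_AGENT` only for robots.txt matching. The helper name `robotstxt_user_agent`, the plain-dict settings, and the example bot name are all hypothetical.

```python
def robotstxt_user_agent(settings):
    """Pick the user agent string used for robots.txt matching.

    Assumption: ROBOTSTXT_USER_AGENT, when set, takes priority over
    USER_AGENT; with neither set we fall back to the wildcard agent "*".
    """
    override = settings.get("ROBOTSTXT_USER_AGENT")
    if override:
        return override
    return settings.get("USER_AGENT", "*")


# Hypothetical project settings, written as a plain dict for illustration.
settings = {
    "ROBOTSTXT_OBEY": True,
    "USER_AGENT": "examplebot/1.0 (+https://example.com/bot)",
    "ROBOTSTXT_USER_AGENT": "examplebot",
}
print(robotstxt_user_agent(settings))
```

The point of a separate setting is that the string sent in HTTP headers is often long and descriptive, while robots.txt rules are usually written against a short product token.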

What is coming up next?

  • I am going to work on the final PSF blog post, in which I will focus on my experience of GSoC 2019 working with my awesome mentors and Scrapy.
  • Next, I will write the final report for the third evaluation of Google Summer of Code.
  • After that, I will work on the changes suggested for this week's work.

Did you get stuck anywhere?

  • Nothing major.

Weekly Check-in #12: (9 Aug - 15 Aug)

anubhavp
Published: 08/13/2019

What did you do this week?

  • Benchmarking Protego (again). This time we crawled multiple domains (~1,100 domains) and downloaded links to pages as the crawler encountered them. We downloaded 111,824 links in total.
    • Next, we made each robots.txt parser parse the files and answer queries (each query 5 times) in an order similar to how it would need to in a broad crawl. Here are the results:

Protego :

25th percentile : 0.002419 seconds
50th percentile : 0.006798 seconds
75th percentile : 0.014307 seconds
100th percentile : 2.546978 seconds
Total Time : 19.545984 seconds

RobotFileParser (default in Scrapy) :

25th percentile : 0.002188 seconds
50th percentile : 0.005350 seconds
75th percentile : 0.010492 seconds
100th percentile : 1.805923 seconds
Total Time : 13.799954 seconds

Rerp Parser :
25th percentile : 0.001288 seconds
50th percentile : 0.005222 seconds
75th percentile : 0.014640 seconds
100th percentile : 52.706880 seconds
Total Time : 76.460496 seconds

Reppy Parser :
25th percentile : 0.000210 seconds
50th percentile : 0.000492 seconds
75th percentile : 0.000997 seconds
100th percentile : 0.129440 seconds
Total Time : 1.405558 seconds

 

  • Removed a hack used in Protego to work around the lack of an option to ignore characters in `urllib.parse.unquote`. Added a few new features to Protego as well.
  • Protego has been moved to the Scrapy organisation.
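
The benchmarking procedure from the first item above (parse each robots.txt, answer every query 5 times, then report percentiles and the total) can be sketched roughly as follows. This is not the script that produced the numbers above. It is a minimal sketch that assumes the standard library's `RobotFileParser` and a nearest-rank percentile. The helper names `percentile`, `time_parser`, and `summarize` are hypothetical, as is the sample data.

```python
import time
from urllib.robotparser import RobotFileParser


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    if not sorted_values:
        raise ValueError("need at least one timing")
    # Integer ceiling of n * pct / 100, clamped to at least rank 1.
    rank = max(1, -(-len(sorted_values) * pct // 100))
    return sorted_values[rank - 1]


def time_parser(robots_txt, urls, user_agent="*", repeats=5):
    """Return the seconds spent parsing one robots.txt and answering its queries."""
    start = time.perf_counter()
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    for url in urls:
        for _ in range(repeats):
            parser.can_fetch(user_agent, url)
    return time.perf_counter() - start


def summarize(timings):
    """Report the 25th, 50th, 75th, and 100th percentiles plus the total time."""
    ordered = sorted(timings)
    summary = {p: percentile(ordered, p) for p in (25, 50, 75, 100)}
    summary["total"] = sum(ordered)
    return summary


# Hypothetical sample data: one robots.txt file and two URLs to query.
robots_txt = "User-agent: *\nDisallow: /private"
urls = ["https://example.com/private/a", "https://example.com/b"]
print(summarize([time_parser(robots_txt, urls)]))
```

In a real run there would be one timing per domain, so the 100th percentile is simply the slowest robots.txt file and can dominate the total.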

What is coming up next?

  • Configuring Travis to push to PyPI automatically.
  • Introducing a new `ROBOTSTXT_USER_AGENT` setting in Scrapy.
  • Making `SitemapCrawler` use the new interface.

Did you get stuck anywhere?

I got blocked by StackExchange for a few hours. I think they don't like crawlers on their websites. "It is a temporary automatic ban that is put up by our HAProxy instance when you hit our rate limiter," they wrote in answer to one of the questions on their website.


Weekly Check-in #11: ( 2 Aug - 8 Aug )

anubhavp
Published: 08/06/2019

What did you do this week?

  • Added an API description and more usage examples to the README.
  • Added a PyPy test environment.
  • Opened a pull request to add Protego integration to Scrapy.
  • Modified Protego to treat non-terminal dollar signs as ordinary characters.
  • Minor aesthetic changes. 
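
To show what treating a non-terminal dollar sign as an ordinary character means, here is a minimal sketch of turning a robots.txt path pattern into a regular expression. It is not Protego's actual implementation. The function names `pattern_to_regex` and `matches` are hypothetical, and the sketch assumes that `*` matches any sequence of characters and that only a trailing `$` anchors the end of the path.

```python
import re


def pattern_to_regex(pattern):
    """Compile a robots.txt path pattern into a regular expression."""
    # Only a '$' in the final position is an end-of-path anchor.
    anchored = pattern.endswith("$")
    body = pattern[:-1] if anchored else pattern
    # Escape every literal chunk (this covers any non-terminal '$')
    # and join the chunks with '.*' to represent each '*' wildcard.
    regex = ".*".join(re.escape(chunk) for chunk in body.split("*"))
    return re.compile(regex + (r"\Z" if anchored else ""))


def matches(pattern, path):
    """Return True if the path matches the pattern from its first character."""
    return pattern_to_regex(pattern).match(path) is not None


print(matches("/*.php$", "/index.php"))      # trailing '$' anchors the end
print(matches("/a$b", "/a$b/c"))             # inner '$' is a literal character
```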

What is coming up next?

  • Transferring the Protego repository to the Scrapy organisation on GitHub. It seems that write permissions are necessary to initiate the transfer.
  • Modifying Protego to treat wildcards such as `*` and `$` as ordinary characters as well.
  • Modifying `SitemapCrawler` to use the new interface.
  • Implementing support for the `host` and `crawl-delay` directives in Scrapy.
  • Some performance improvements might be possible by using custom pattern-matching logic (in place of regex), but I am not sure. I will need to test it.
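
As an illustration of the last item, custom pattern matching without regex could look something like the sketch below. It is only one possible approach and not code from Protego. The function name `wildcard_match` is hypothetical, and whether it is actually faster than a compiled regex would still have to be measured.

```python
def wildcard_match(pattern, path):
    """Match a robots.txt pattern against a path without using regex.

    '*' matches any run of characters; a trailing '$' anchors the end.
    Any other '$' is compared as an ordinary character.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    chunks = pattern.split("*")

    # The text before the first '*' must be a prefix of the path.
    first = chunks[0]
    if not path.startswith(first):
        return False
    pos = len(first)
    if len(chunks) == 1:
        return not anchored or pos == len(path)

    # Middle chunks are located left to right, each after the previous one.
    *middle, last = chunks[1:]
    for chunk in middle:
        found = path.find(chunk, pos)
        if found == -1:
            return False
        pos = found + len(chunk)

    if anchored:
        # The final chunk must be a suffix that starts at or after pos.
        return len(path) - len(last) >= pos and path.endswith(last)
    return path.find(last, pos) != -1


print(wildcard_match("/*.php$", "/index.php"))
print(wildcard_match("/a*b*c", "/acb"))
```

Taking the leftmost occurrence of each middle chunk is safe here because it leaves the most room for the chunks that follow.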

Did you get stuck anywhere?

  • Faced problems setting up the PyPy test environment. With help from my mentors, I was able to solve the issue.