Weekly Check-in #12: (9 Aug - 15 Aug)

anubhavp
Published: 08/13/2019

What did you do this week?

Benchmarked Protego (again). This time we crawled multiple domains (~1,100 domains) and downloaded links to pages as the crawler encountered them. We downloaded 111,824 links in total.
Next, we made each robots.txt parser parse and answer queries (each parser answered every query 5 times), in an order similar to how it would need to in a broad crawl. Here are the results:

Protego :

25th percentile : 0.002419 seconds
50th percentile : 0.006798 seconds
75th percentile : 0.014307 seconds
100th percentile : 2.546978 seconds
Total Time : 19.545984 seconds

RobotFileParser (default in Scrapy) :

25th percentile : 0.002188 seconds
50th percentile : 0.005350 seconds
75th percentile : 0.010492 seconds
100th percentile : 1.805923 seconds
Total Time : 13.799954 seconds

Rerp Parser :
25th percentile : 0.001288 seconds
50th percentile : 0.005222 seconds
75th percentile : 0.014640 seconds
100th percentile : 52.706880 seconds
Total Time : 76.460496 seconds

Reppy Parser :
25th percentile : 0.000210 seconds
50th percentile : 0.000492 seconds
75th percentile : 0.000997 seconds
100th percentile : 0.129440 seconds
Total Time : 1.405558 seconds
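
For readers who want to reproduce this kind of measurement, here is a minimal sketch of how the timing and the percentile summary could be computed. It is not the actual benchmark script: the helper names `time_parser` and `summarize` are made up for illustration, and it only exercises the standard library's `RobotFileParser`; the other parsers would need their own parse and query calls in place of the two marked lines.

```python
import time
from statistics import quantiles
from urllib.robotparser import RobotFileParser


def time_parser(robots_txt, urls, user_agent="*", repeats=5):
    """Time parsing one robots.txt body plus answering every query `repeats` times.

    Hypothetical helper: swap the two marked lines to measure another parser.
    """
    start = time.perf_counter()
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())      # parse step
    for _ in range(repeats):
        for url in urls:
            parser.can_fetch(user_agent, url)  # query step
    return time.perf_counter() - start


def summarize(timings):
    """Reduce per-domain timings (in seconds) to the figures reported above."""
    p25, p50, p75 = quantiles(timings, n=4, method="inclusive")
    return {
        "p25": p25,
        "p50": p50,
        "p75": p75,
        "p100": max(timings),   # the slowest single robots.txt
        "total": sum(timings),
    }


if __name__ == "__main__":
    sample = "User-agent: *\nDisallow: /private/\n"
    urls = ["https://example.com/", "https://example.com/private/page"]
    timings = [time_parser(sample, urls) for _ in range(10)]
    print(summarize(timings))
```

Reporting the 100th percentile next to the total is useful here because a single pathological robots.txt file can dominate the total, as the Rerp numbers suggest.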

Removed a hack used in Protego due to the lack of an option to ignore characters in `urllib.parse.unquote`. Added a few new features to Protego as well.
Protego has been moved to Scrapy organisation.
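
To illustrate the `urllib.parse.unquote` limitation mentioned above: the function decodes every percent-escape and has no parameter for leaving some of them untouched. The sketch below shows one possible workaround that keeps an encoded slash (`%2F`) intact while decoding everything else. It is an assumption-laden illustration, not the code Protego uses, and `unquote_keep_slash` is a hypothetical name.

```python
import re
from urllib.parse import unquote

# Capturing group so that re.split keeps the encoded slashes in the result.
_ENCODED_SLASH = re.compile(r"(%2[fF])")


def unquote_keep_slash(path):
    """Percent-decode `path` but leave every encoded slash (%2F) as it is."""
    parts = _ENCODED_SLASH.split(path)
    return "".join(
        part if _ENCODED_SLASH.fullmatch(part) else unquote(part)
        for part in parts
    )


if __name__ == "__main__":
    print(unquote_keep_slash("/a%2Fb%20c"))
```

Splitting on the protected escape and decoding only the pieces in between avoids placeholder tricks, at the cost of one regular expression pass per path.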

What is coming up next?

Configuring Travis to push to PyPI automatically.
Introducing a new `ROBOTSTXT_USER_AGENT` setting in Scrapy.
Making `SitemapCrawler` use the new interface.

Did you get stuck anywhere?

I got blocked by StackExchange for a few hours. I think they don't like crawlers on their websites. In an answer to one of the questions on their website, they explained: "It is a temporary automatic ban that is put up by our HAProxy instance when you hit our rate limiter."
