Supporting asyncio event loop

This blog post deals with supporting asyncio in Scrapy.

Supporting the asyncio event loop in Scrapy

After the first evaluation, I did not have any talk with my mentor, so I went forward with my stipulated task of writing tests for the code that I wrote. Of course, as I stated earlier, I had discussed with my mentor while writing the proposal that I would learn to write tests, learning and adapting as I covered my program code with test suites. So I went about my task. Halfway through, I fell sick, so it took 4-5 days to get ready again and cover up for the lost time. I then had a discussion with my mentor regarding the goals of the proposal, asking his opinion on how I was getting along. My mentor was quite impressed with the work, so he asked me to support other coroutines in Scrapy. Things had been going well up to this point, so my mentor was keen on supporting other coroutines, which could be of real importance.

Preparing for the goal

While I had programmed with asyncio before, I still needed some ideas about how to approach the task: figuring out a plan, charting out the different libraries I might need, and working out how to use asyncio in Scrapy. asyncio is supported in Twisted, so we could use that support as a head start on the task and in designing the new code that would be added to Scrapy. As it is a new API, it would be optional, enabling users to opt in as and when required.
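As a rough illustration of the Twisted support mentioned above (a sketch of my own, not code from my branch): Twisted ships an asyncioreactor whose install() hook runs the Twisted reactor on top of the asyncio event loop, which is what makes this kind of integration possible.

    import asyncio
    from twisted.internet import asyncioreactor

    # The asyncio-backed reactor must be installed before anything
    # else imports twisted.internet.reactor.
    asyncioreactor.install()

    from twisted.internet import reactor, task

    def main():
        # The reactor is now an AsyncioSelectorReactor, driven by asyncio.
        print("running on:", reactor.__class__.__name__)
        reactor.stop()

    task.deferLater(reactor, 0, main)
    reactor.run()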

Progress until now

While a lot of time has passed since the first evaluations, I look forward to implementing the ideas that I planned; supporting other coroutines is another stepping stone towards completing my proposed goal for GSoC. The upcoming week will be quite busy, in the sense that I need to cover a lot of ground, so I am looking forward to another period of intense coding.

Implementing the new syntax

This blog post contains a few sample spiders, so one can use them to try out the new syntax.

Sample spiders that use the new support

Now that the Scrapy codebase in my branch supports the new syntax, users can write new spiders using it. This blog post presents such spiders, which can be used to try out the new syntax.

A few instructions for trying out the new syntactic sugar

While the new syntax hasn't landed in the official Scrapy codebase yet, you can clone the project locally and then install it with python setup.py install to use Scrapy locally.

My Codebase

Some Sample Spiders

Here is the link to my spider: Spider_Coroutine
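In case the linked code moves, here is a minimal sketch of what such a spider can look like. This is my own illustration, assuming the experimental branch sends each downloaded Response back into the async-generator parse method; the spider name and selectors are made up.

    import scrapy

    class QuotesCoroutineSpider(scrapy.Spider):
        # Hypothetical sample spider using the new syntax: parse is
        # declared with async def, and a yielded Request gets its
        # Response sent back into the generator, no callback needed.
        name = "quotes_coroutine"
        start_urls = ["http://quotes.toscrape.com"]

        async def parse(self, response):
            for quote in response.css("div.quote"):
                yield {"text": quote.css("span.text::text").extract_first()}
            next_page = response.css("li.next a::attr(href)").extract_first()
            if next_page:
                next_response = yield scrapy.Request(response.urljoin(next_page))
                yield {"next_url": next_response.url}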

Completion of second task

This blog post deals with the progress during the past couple of weeks and the near completion of the second task that was discussed with the mentors.

Supporting response = await scrapy.Request(...)

In the previous few weeks I made a lot of progress: I was able to support inline callbacks in Scrapy spiders. This meant that we could receive the response on the same line, without needing a separate callback, so it was a good start.
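To make the difference concrete (a sketch of my own, not the exact code from my branch): with the classic API a second callback method receives the response, while with inline callbacks the engine resumes the same generator with the response.

    import scrapy

    class ClassicSpider(scrapy.Spider):
        # Classic style: a separate callback receives the response.
        name = "classic"

        def parse(self, response):
            yield scrapy.Request(response.urljoin("/next"),
                                 callback=self.parse_next)

        def parse_next(self, response):
            yield {"url": response.url}

    class InlineSpider(scrapy.Spider):
        # Inline style (assuming the experimental support): the engine
        # sends the downloaded response back into the same generator.
        name = "inline"

        def parse(self, response):
            next_response = yield scrapy.Request(response.urljoin("/next"))
            yield {"url": next_response.url}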

Starting on the idea using native coroutines

I had discussed with my mentors the possible loopholes and problems that I might face, so it was good that we worked through the probable solutions then and there.

Supporting response = await scrapy.Request(..) requires the Scrapy Request object to be able to schedule itself, which the architecture does not provide for, so scheduling it would require one of the following:

1. Designing a reference on the Request object that would be linked to the crawler object, or
2. Using context variables, which would provide a link to the Request object within a particular context.

Both of them turned out to be quite expensive: designing the reference would require another refactoring of the Scrapy codebase that would not be backwards compatible, and context variables are only supported from Python 3.7 onwards. So an alternative was clearly required to support the above.
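For illustration, here is a rough sketch of the context-variable idea that was considered; it is entirely hypothetical (the names current_crawler and engine.schedule are mine), and it is exactly the part that would have required Python 3.7.

    import contextvars

    # Hypothetical sketch of the context-variable approach: the engine
    # would set current_crawler before resuming a coroutine, so that an
    # awaited Request could look up where it should be scheduled.
    current_crawler = contextvars.ContextVar("current_crawler")

    def schedule_request(request):
        crawler = current_crawler.get()    # crawler bound to this context
        crawler.engine.schedule(request)   # hypothetical scheduling hook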

Supporting response = yield scrapy.Request

This was the alternative that was discussed, so I started implementing it. It required the Scrapy codebase to support asynchronous generators, since we would support the same paradigm using the async/await syntax. I had to refactor scrapy.scraper and scrapy.engine in order to extend the support that had been done earlier.
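To show the mechanism this relies on (a simplified sketch of my own, not the actual engine code): each Request yielded by an async-generator callback is downloaded, and the resulting Response is sent back into the generator with asend().

    import scrapy

    async def drive(agen, download, process_item):
        # Simplified driver for an async-generator callback. `download`
        # is assumed to be an awaitable taking a Request and returning a
        # Response; `process_item` handles anything else that is yielded.
        to_send = None
        try:
            while True:
                yielded = await agen.asend(to_send)
                if isinstance(yielded, scrapy.Request):
                    to_send = await download(yielded)  # response goes back in
                else:
                    process_item(yielded)
                    to_send = None
        except StopAsyncIteration:
            pass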

I had to read a lot of articles, particularly Coroutines and Twisted Cooperator.

While the first article served as a refresher on native coroutines, the second one proved useful after I got stuck refactoring a method in scrapy.misc named parallel. This method used a synchronous iterator, while I needed to support asynchronous iteration. It took me a lot of time to get through the task, so it was fairly challenging as well.
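As a simplified sketch of the problem (with assumed names, and ignoring the concurrency limit that the real parallel helper enforces through a Cooperator): an asynchronous iterator cannot be handed to Twisted's synchronous tooling directly, but a coroutine that drains it can be wrapped in a Deferred with ensureDeferred.

    from twisted.internet.defer import ensureDeferred

    async def _consume(aiterable, func, *args):
        # Drain an asynchronous iterator, applying func to each element.
        async for element in aiterable:
            func(element, *args)

    def parallel_async(aiterable, func, *args):
        # Sequential sketch only: wrap the draining coroutine in a
        # Deferred so the Twisted reactor can schedule it.
        return ensureDeferred(_consume(aiterable, func, *args))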

Working prototype

It took me a few days to understand and then implement the code, so it was fairly challenging for me as well. The code now supports async def parse, so it is safe to say that our implementation is off the ground. It has been one month and we have a lot of ground to cover, but progress like the above does not harm us in any manner.

I will write another blog post explaining how to use the new syntax (though it can safely be considered to be in its beta stage) and its utility, so stay tuned for updates.