Wrapping up GSoC

As I discussed in the last post about the trials with the asyncio reactor, I made a lot of progress in the week just concluded. I completed most of the requirements of the project and, at the end, reviewed the quality of the code together with my mentor. I will discuss this now –

  1. Supporting Asyncio frameworks – The previous blog discussed how I tested the asyncio reactor. This week, I installed the asyncio reactor and tried testing on it. I did not write any test cases, but I tried using asyncio-based frameworks such as aioredis and aiohttp. Both ran smoothly, and based on some discussion with my mentor, I tried out specific use cases for the asyncio frameworks.
  2. Fixing the spider-idle bug – Last week, I was plagued by the spider-idle bug, where the spider would close quite prematurely while running. I thought through the various scenarios in which the spider should be closed, and after designing the implementation, I got a hold of the problem; the bug is now solved (a rough sketch of the general mechanism appears after this list).
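
For illustration only, here is a minimal sketch (not my actual fix) of the general mechanism Scrapy offers for keeping a spider alive while work is still pending: connect to the spider_idle signal and raise DontCloseSpider. The pending_tasks attribute is a hypothetical name used just for this example.

from scrapy import signals
from scrapy.exceptions import DontCloseSpider


class KeepAliveExtension:
    # a toy extension: refuse to close the spider while it still has pending awaitables

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        crawler.signals.connect(ext.spider_idle, signal=signals.spider_idle)
        return ext

    def spider_idle(self, spider):
        # pending_tasks is a hypothetical attribute tracking unfinished awaitables
        if getattr(spider, 'pending_tasks', None):
            raise DontCloseSpider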

I also discussed the prospect of merging the PR with my mentor. While he expressed that the work was good, there are some requirements to meet before it can be merged into the Scrapy codebase –

Requirements before it is merged into scrapy code base

While I have built most of the API required for using the async/await syntax, a few tasks remain before it can be merged into Scrapy's codebase.

  • Writing test suites — I intended to complete the test suites, as set out in the proposal, and I did write some of them, but the main project became rather cumbersome towards the end. So while the new API achieves all the expectations of the proposal, a new API cannot be rolled out until the code is well tested.
  • Writing documentation — The new API adds a few hooks to Scrapy and provides async/await support as an additional feature, but documentation and some example tests are still mandatory for the common user to be able to use them.
  • Supporting Python 2.7 — The Scrapy codebase is backwards compatible, and most of my code is backwards compatible too, but some hooks and additional features are only possible in Python ≥ 3.7. The code should check the version of Python and use the appropriate methods accordingly (a small sketch of such a gate appears after this list).
  • Twisted — This one is completely out of my hands: Twisted has a bug in which some variables are named async, and starting from Python 3.7, async and await are reserved keywords. The bug has been fixed on Twisted's GitHub page, but until the corrected release is out, one has to use the development version to try things. You can clone it to your local computer from here.
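
For illustration only (not the actual Scrapy code), a version gate of this kind could look like the following; the helper below is a hypothetical placeholder.

import sys

# features that rely on the async/await hooks need Python 3.7 or later,
# so fall back to a synchronous code path on older interpreters
ASYNC_SUPPORTED = sys.version_info >= (3, 7)


def collect_output(result):
    # hypothetical helper: choose the appropriate code path at runtime
    if ASYNC_SUPPORTED and hasattr(result, '__aiter__'):
        return 'handle as an asynchronous generator'
    return 'handle as a plain iterable'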

What does it mean for Scrapy users?

After this new API is merged into Scrapy, users will be able to take advantage of the new async/await syntax and run libraries requiring asyncio. The new API also lets users get the response on the same line, rather than receiving it through a separate callback method.
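
As a minimal sketch of what that looks like (the URLs and selector are only an example, and this relies on the not-yet-merged API from my branch):

import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'
    start_urls = ['http://quotes.toscrape.com/']

    async def parse(self, response):
        next_page = response.css('li.next a::attr(href)').extract_first()
        # with the new API, the response comes back on the same line
        # instead of being delivered to a separate callback
        next_response = yield scrapy.Request(response.urljoin(next_page))
        self.logger.info('got %s', next_response.url)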

Learning through the project

The project itself has been quite challenging, but that is the real beauty of Google Summer of Code. I had become quite used to the framework that Scrapy is built on, namely Twisted, but asyncio was quite new to me, and its evolving nature means that developers will certainly have a hard time keeping up with it.

Right at the start, before applying for the project, I had learnt Django and Flask, but going through Scrapy made me look at it with awe, as the concurrency achieved with the single-threaded nature of Twisted pushed me to learn this framework. As for asyncio, I played around with it, and after reading through quite a lot of blogs, I understood that it will take a fair bit of time before asyncio is used as a mainstream event-driven networking framework.

There have been moments where I chalked out milestones for the project, and a few moments where I certainly slipped past a deadline. But I had allotted time considering the complexity of the project, so I did cover most of the project within the stipulated time.

I also picked up a lot of practical knowledge of generators and asynchronous generators in Python, and after a fair number of use cases, I feel quite confident about using them for future requirements.

Important Links

Trials with asyncioreactor

This blog post deals with using the asyncioreactor in Scrapy.

In the earlier post, I discussed using asyncio in Scrapy. I have now installed the asyncio reactor in Scrapy; it is working well, and I am covering the problems that are occurring. I am working day in and day out to iron them out, so that once the remaining trivial problems are covered, we will have time to polish the project for a public version.

Using asyncioreactor in Scrapy

While I have already discussed installing the asyncioreactor in Scrapy, this part describes my experiences in using it. Twisted supports running on top of asyncio, and I used that support in Scrapy. We run the asynchronous generators so that each asend awaitable is wrapped in asyncio.ensure_future() and scheduled to run. This makes it possible to run asyncio frameworks, so we can support them as well. Regarding how it works, there was one issue I was facing. Those who use Scrapy know that a spider closes when it is found idle. This is normal, but it clashes with wanting to await some awaitable: the expectation is that the waiting time will be covered by another task, but if we have only one awaitable, the spider looks idle and closes. I am currently working on it, and I hope the issue will be resolved in a few days.
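
A rough sketch of the scheduling idea (not the exact code in my branch), where agen stands for the async generator returned by a spider callback:

import asyncio
from twisted.internet.defer import Deferred


def schedule_next(agen, value=None):
    # asend() gives an awaitable producing the generator's next item;
    # ensure_future() schedules it on the asyncio event loop, and
    # fromFuture() exposes it to Twisted code as a Deferred
    future = asyncio.ensure_future(agen.asend(value))
    return Deferred.fromFuture(future)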

Some snippets using the new support

While there is one recurring issue plaguing the progress, the new support is otherwise quite ready to be used. Below is a spider snippet that can be tried –

import asyncio

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"

    async def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]

        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    async def parse(self, response):
        print("--------------------------IN PARSE-------------------------------------------------------")
        
        links = [(response.xpath('//@href').extract()[-1])]
        links.append(response.xpath('//@href').extract()[-2])
        for h1 in response.xpath('//h1').extract():
            yield {"title": h1}
        for link in links:
            res = yield scrapy.Request(url=link)  # can also pass callback=self.parse2, so both syntaxes are supported
            await asyncio.sleep(3)  # sleep for 3 seconds; the point is that we can await any awaitable on the asyncio loop
            print("___RESPONSE_____________________________________________________________{!r}".format(res))
        print("---------------------------END OF PARSE------------------------------------------------")
    
    async def parse2(self, response):
        page = response.url.split("/")[-2]
        print("------------------------IN PARSE 2----------------------------")
        filename = 'File-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
        yield
        print("----END OF PARSE2 ------------")

The spider will work, but it needs to be shut down manually, as I have switched off closing the spider when it remains idle.

Future tasks and goals

Now that the asyncioreactor is supported, we can support asyncio-based frameworks. Next, I will discuss with my mentor whether we should look at small enhancements or work on polishing up the remaining work.

Using Asyncio in Twisted

This blog post deals with supporting asyncio in Twisted.

Update of the work

As I had made clear, I would be working on writing tests and supporting asyncio-based frameworks. I managed to write tests for the API supporting response = yield scrapy.Request(url). I am still working on the tests for native coroutines, as there are not many libraries available to support writing tests using async/await.

I also worked on learning how to support asyncio in Twisted, and the rest of this blog post deals with that.

Asyncio and Twisted

Right from the start, I was excited about using asyncio and bringing its support to Scrapy. It is an exciting framework from Python itself, and with the advent of native coroutines, using asyncio for asynchronous programming is a different experience. While there have been other frameworks to help with async programming, asyncio is still quite new to be considered a rugged, battle-tested framework. Twisted, on the other hand, is a mature framework that has been used a lot, so we find plenty of support for any sort of problem we might face.

On a personal note, I faced a lot of problems using asyncio, not because it is complex, but because the solutions to the problems we commonly face have not been widely discussed yet, so one has to work on the problem oneself before asking the community.

Using Twisted on asyncio

There are two ways in which asyncio and Twisted can be combined: running Twisted on top of asyncio, and running asyncio on top of Twisted. While theoretically both are possible, only the first has been implemented so far: Twisted is run over asyncio, and the current blog post deals with that.

How I supported asyncio code in Twisted

I went through quite a few resources on supporting asyncio in Twisted. I even lurked in the Twisted IRC channel and asked various questions, sometimes going through answers that had been discussed previously.

Using Python 3 and Twisted

Twisted already has support for native coroutines (async/await): we can write a coroutine and then wrap it in defer.ensureDeferred to use it in Twisted.

from twisted.internet.task import react
from twisted.internet.defer import ensureDeferred


# our "real" main; some_deferred_returning_function is a placeholder for any
# function that returns a Deferred
async def _main(reactor):
    await some_deferred_returning_function()


# a wrapper that calls ensureDeferred
def main():
    return react(
        lambda reactor: ensureDeferred(
            _main(reactor)
        )
    )


if __name__ == '__main__':
    main()

Using this feature, we can write and await Deferred-returning functions. The drawback is that we cannot await arbitrary awaitables; only Deferreds can be awaited.

Supporting asyncio frameworks in Twisted

So far we have seen that Twisted supports native coroutines, but that is only useful when we are dealing with Deferreds. Up to this point, native coroutines serve as syntactic sugar for writing methods that deal with Deferreds.

But what happens if we want to await other asyncio frameworks, or technically an asyncio Future? Twisted has a solution for this, which we obtain by running Twisted on top of asyncio. All the semantics remain the same; we just install the Twisted reactor on top of asyncio.

Make sure that you install the asyncioreactor as early as possible, so that another reactor does not get installed by default.

import asyncio
from twisted.internet import asyncioreactor
asyncioreactor.install(asyncio.get_event_loop())

The trick of using asyncio frameworks in Twisted

The trick to using asyncio Future objects in Twisted is to wrap them in Deferreds and then use them as Deferreds. We can also use the asyncio loop implementation, and when we get the Future result, wrap it in a Deferred. Twisted allows conversion between Futures and Deferreds in both directions, so one can easily use them in native coroutines without worrying about the conversion between the two.

import asyncio

from twisted.internet.defer import Deferred


def as_future(d):
    # Deferred.asFuture() wraps a Deferred into an asyncio Future bound to the given loop
    return d.asFuture(asyncio.get_event_loop())


def as_deferred(f):
    # Deferred.fromFuture() wraps an asyncio Future (or a coroutine scheduled
    # via ensure_future) back into a Deferred
    return Deferred.fromFuture(asyncio.ensure_future(f))
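
For example, with the asyncioreactor installed, an asyncio coroutine such as asyncio.sleep can be awaited from Twisted-style code by converting it first (a small usage sketch of the helpers above):

from twisted.internet.defer import ensureDeferred


async def pause_then_work():
    # as_deferred turns the asyncio coroutine into a Deferred that this
    # coroutine can await once wrapped with ensureDeferred
    await as_deferred(asyncio.sleep(3))


d = ensureDeferred(pause_then_work())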

What it means for Scrapy

Using asyncio in Scrapy remains the big target, as I am working to support asyncio frameworks in Scrapy. Learning this new API from Twisted was fairly challenging, but more than that, I am happy that Twisted supports utilising asyncio; otherwise, supporting asyncio would have been just as tough a job. Frankly, this is one of the most challenging phases of GSoC for me so far, as this implementation has not been carried out anywhere, as far as I know (though I would love to hear about projects facing the same hurdle). I have completed a part of this, getting the asyncioreactor up and running, and decoupling the parts that require the asyncio implementation. I have progressed fairly well, but I am still behind the actual result. I am hopeful that in a few weeks this might be up and running.

Supporting asyncio event loop

This blog post deals with supporting asyncio in Scrapy.

Supporting the asyncio event loop in Scrapy

After the first evaluation, I had not yet had a talk with my mentor, so I went forward with my stipulated task of writing tests for the code that I wrote. Of course, as I stated earlier, I had agreed with my mentor while writing the proposal that I would learn to write tests, and would learn and adapt while covering the test suites for my program code. So I went about my task. Halfway through, I fell sick, so it took 4-5 days to recover and cover the lost time. I then had a discussion with my mentor regarding the goals of the proposal, asking for opinions on how I was doing. My mentor was pretty impressed with the work, so he asked me to support other coroutines in Scrapy. We were going well so far, and supporting other coroutines can be of good importance.

Preparing for the goal

While I had programmed using asyncio before, I still needed some ideas about approaching the task: figuring out a plan, charting out the different libraries I might need, and working out how to use asyncio in Scrapy. Asyncio is supported in Twisted, so we could use that to get a head start with the task and design the new code that would be added to Scrapy. As it is a new API, it will be optional, enabling users to use it as and when required.

Progress uptil now

While a lot of time has passed since the first evaluations, I am looking forward to implementing the ideas that I planned, and supporting other coroutines is another stepping stone towards completing my proposed goal for GSoC. The upcoming week will be quite busy, in the sense that I need to cover a lot of ground, so I am looking forward to another period of intense coding.

Implementing the new syntaxes

This blog post contains a few sample spiders, so one can use them to try out the new syntax.

Sample spiders which uses new support

Now that the Scrapy codebase has been extended with the new syntax, users can write new spiders using it. This blog post deals with new spiders that can be used to try out the new syntax.

A few instructions to try out the new syntactic sugar

While the new syntax hasn't been merged into the official Scrapy codebase, you can clone the project locally and then install it with python setup.py install to use Scrapy locally.

My Codebase

Some Sample Spiders

Here is the link to my spider Spider_Coroutine

Completion of second task

This blog post deals with the progress during the past couple of weeks and the near completion of the second task that was discussed with the mentors.

Supporting response = await scrapy.Request(...)

Over the previous few weeks I made a lot of progress: I was able to support inline callbacks in Scrapy spiders. This meant that we could yield the response on the same line, without needing a callback, so it was a good start.

Starting with the same idea using native coroutines

I had discussed with my mentors the possible loopholes and problems that I might face, so it was good that we discussed the probable solutions right there.

Supporting response = await scrapy.Request(..) requires the Scrapy Request object to be scheduled, which the architecture does not provide for, so scheduling it would require either: 1. designing a reference on the Request object that would be linked to the crawler object, or 2. using context variables, which would provide a link to the Request object in a particular context. Both turned out to be quite extravagant, considering that designing the reference would require another refactoring of the Scrapy codebase, which would not be backwards compatible, and the other requires context variables, which are only supported in Python 3.7. So an alternative was clearly required to support the above.
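
For illustration only, the context-variable idea (available from Python 3.7) might have looked roughly like this; the names here are hypothetical and this is not code from my branch:

import contextvars

# a context variable that the engine would set before invoking a spider
# coroutine, so that an awaited Request could look up the crawler scheduling it
current_crawler = contextvars.ContextVar('current_crawler')


def set_crawler_for_context(crawler):
    # hypothetically called by the engine before running the coroutine
    current_crawler.set(crawler)


def get_crawler_from_context():
    # hypothetically called from inside an awaitable Request to recover the crawler
    return current_crawler.get()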

Supporting response = yield scrapy.Request

This was another alternative that was discussed, so I started to implement it. This required the Scrapy codebase to support asynchronous generators, as we would support the same paradigm using the async/await syntax. I had to refactor the scrapy scraper and engine modules in order to extend the support that had been done earlier.

I had to read a lot of articles, particularly Coroutines and Twisted Cooperator.

The first article served as a refresher on native coroutines; the second one was useful after I got stuck refactoring a method named parallel in Scrapy's utilities. That method used a synchronous iterator, while I needed to support asynchronous iteration. It took me a lot of time to get through the task, so it was fairly challenging as well.
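
As a rough asyncio-based sketch (not the actual Twisted Cooperator-based refactor), an asynchronous-iteration-aware parallel helper can look like this:

import asyncio


async def parallel_async(aiterable, concurrency, coro_func, *args):
    # consume an asynchronous iterator, running coro_func on each item while
    # keeping at most `concurrency` items in flight at once
    semaphore = asyncio.Semaphore(concurrency)
    tasks = []

    async def work(item):
        async with semaphore:
            await coro_func(item, *args)

    async for item in aiterable:
        tasks.append(asyncio.ensure_future(work(item)))

    await asyncio.gather(*tasks)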

Working prototype

It took me a few days to understand and then implement the code, so it was fairly challenging for me. The code now supports async def parse, so it is safe to say that our implementation is under way. It has been one month, and we have lots of ground to cover, but progress like this does not hurt in any manner.

I will write another blog post on how to use the new syntax (though it can safely be considered in its beta stage) and its utility, so stay tuned for updates.

Completion of first task

This blog post describes my account of the Community Bonding Period and the completion of the first small task for GSoC.

Community Bonding Period

During the Community Bonding Period, a lot of time went into planning, getting used to the codebase, and refining the requirements of the proposal. This post describes the details.

Going through the codebase

I have contributed to the codebase before, so contributing to an open source project is rather straightforward: make the changes for your patch, submit a PR, make the changes the reviewer asks for, and voila, we have made a contribution! But coding for a serious proposal requires a fair bit of responsibility from the developer's side, so getting used to the codebase is a must. I had gone through small portions of the codebase before community bonding, so I made a few notes regarding the 'gotchas' of the project. The notes did come in handy as I went through the codebase, figuring out the flow of the code and noting down the purpose of methods and classes and the high-level interfaces associated with them. Going through a codebase can become highly monotonous, so I took up a task that was discussed with my mentor during the week. Going through the relevant methods and classes that would require refactoring made my objective of understanding the codebase quite interesting.

Discussing the goals, with the mentor

I had a fairly good discussion with my mentor about the short-term objectives and goals, as well as the long-term objective. The best part of GSoC is dividing the work into small manageable parts, so the completion of each part actually helps gauge the progress of the project. We discussed the requirements and decided that, as the first short-term requirement, I would support async/await idioms in Scrapy's built-in parse method. This was fairly challenging work, so I discussed all the points and doubts with my mentor and finalised the objective of the goal.

Going about writing the code

As I spent a lot of time planning and discussing the goals of the project, I did not write as much code as I intended. However, I did work on enabling asynchronous generator support in Scrapy. This is also backwards compatible, so it can be used with Python 2.7 (though Python 2.7 does not support asyncio). This PR will be used to introduce async/await idioms in Scrapy's parse method. So I think this kicks off my community bonding period. I am hopeful that I can code through much faster, though coding with planning requires time.

Tasks that I am currently working upon

I am currently working on supporting response = await scrapy.Request(..), as this is the next task I planned with my mentor. I am looking through a similar Python package, inline_requests, and look forward to discussing it with my mentor. I am quite hopeful that I will make good progress this week.

Details of my project

This blog post covers the details of my project, Async Await support in Scrapy. In this post, I explain the project that I will be working on during this summer.

Details of my project : Async Await support in Scrapy

How did I get to know about the organization ?

Scrapy is a web scraper, which I have used to scrape websites and gather a lot of data for Data Analytics and Machine Learning. Data is an integral part of these fields, and we need data in order to make progress in them. Using Scrapy made me appreciate the philosophy of the software, while its open sourced nature means that users can customize it as needed.

How did I choose the project ?

Choosing the right project is important, because the goal you will be working on should align with your interests and philosophy. The part of the project that fascinated me most was the asynchronous programming paradigm. Those who have programmed in Javascript already know about it; for the others, I will use an analogy. Suppose we have to serve 3 different dishes ordered by 3 different people, in some specific order. The synchronous way is analogous to serving the 3 dishes in the order of the sequence. This wastes a lot of time if the third order is finished first, because we still have to serve it last in order to comply with the rule.

For the asynchronous paradigm, the goal is simple: serve the order that finishes first, regardless of the sequence of the requests. In programming terms, it is like serving requests in the order of their completion.
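
To make the analogy concrete, here is a toy asyncio example: the third order takes the least time to prepare and is served first, even though it was placed last.

import asyncio


async def serve(order, preparation_time):
    # each order is "prepared" concurrently; the quickest one is served first
    await asyncio.sleep(preparation_time)
    print('served order', order)


async def main():
    await asyncio.gather(serve(1, 3), serve(2, 2), serve(3, 1))


asyncio.get_event_loop().run_until_complete(main())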

How does Scrapy use the asynchronous paradigm of programming?

As Scrapy is a web scraper, it makes sense to use async programming, because web pages may not be available at the time of the request, so the async paradigm suits the task. Before Python 3.3, people achieved the asynchronous paradigm using generators; Scrapy uses a framework named Twisted for this. Generators are iterators which can be stopped as and when we wish and resumed whenever we want. Using generators, we can pause the code flow, asking Python to wait until we get the response to the request, and then proceed accordingly. A relevant blog post on using generators is: Generators.
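
For example, Twisted's inlineCallbacks decorator uses exactly this trick: the generator is paused at each yield until the Deferred fires, then resumed with its result. Here, fetch_page is a stand-in for any Deferred-returning function.

from twisted.internet import defer


@defer.inlineCallbacks
def crawl(fetch_page):
    # the generator is suspended here until the Deferred returned by
    # fetch_page fires, then resumed with the downloaded body
    body = yield fetch_page('http://quotes.toscrape.com/')
    defer.returnValue(len(body))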

How is the async/await syntactic sugar relevant?

People who have programmed using the asynchronous paradigm know its drawbacks, especially spaghetti code and callback hell. While newer tools for asynchronous programming have been used in Python (i.e. generators), the main problem starts when we want to use generators both as a tool for asynchronous programming and as iterators. It becomes difficult to differentiate between the two, and people contributing to these projects might not be able to tell them apart. In order to differentiate between the two, Python introduced two keywords for them, async and await. These are generators under the hood, but with the difference that they have their own primitive, called 'coroutines'. Using these, it becomes easy to write asynchronous code in a synchronous manner, while maintaining the async nature of the code.
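
A small contrast between the two, so the difference is visible at a glance:

import asyncio


# a native coroutine: declared with async def, suspended with await
async def wait_and_double(value):
    await asyncio.sleep(1)
    return value * 2


# a plain generator: declared with def and yield, used as an iterator
def doubles(values):
    for value in values:
        yield value * 2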

What would be my goal in this project ?

My objective in the project is to support async/await idioms, so that users can write their projects with the new syntactic sugar, while maintaining the backwards compatibility of the project. At the end, we will be able to write scraping code using the new paradigm and frame our scraping logic without the overhead of callbacks.

My first Blog

This summer, I have been selected for GSoC 2018 at the Python Software Foundation to work on async/await support for spiders in Scrapy. This blog will serve as a medium for sharing my experiences as I code through the project.

Link to my project page: Async Await in Spider

Fun fact

My nickname for the summer, hitman23, is based on two facts. hitman comes from the nickname given to the cricketer Rohit Sharma; he is called the hitman because he is an aggressive hitter of the cricket ball. 23 comes from my birth date.

May the combination of my nickname inspire others!