Completion of first task

This blog post describes my account of the Community Bonding Period and the completion of my first small task for GSoC.

Community Bonding Period

During the Community Bonding Period, a lot of time went into planning, getting used to the codebase, and refining the requirements of the proposal. This post describes the details.

Going through the codebase

I have previously contributed to the codebase, so contributing to an open source project is rather straightforward: make changes on your branch, submit a PR, address the reviewer's comments, and voila, we have made a contribution! But coding for a serious proposal requires a fair bit of responsibility from the developer's side, so getting familiar with the codebase is a must. I had gone through small portions of the codebase before community bonding and had made a few notes regarding the 'gotchas' of the project. Those notes came in handy as I went through the codebase, figuring out the flow of the code and noting down the utility of methods and classes and the high-level interfaces associated with them. Reading a codebase can become quite monotonous, so I took up a task that was discussed with my mentor during the week: going through the relevant methods and classes that would require refactoring. This made my objective of understanding the codebase quite interesting.

Discussing the goals with my mentor

I had a fairly good discussion with my mentor about the short-term objectives as well as the long-term goal. The best part of GSoC is dividing the work into small, manageable parts, so completing each part actually helps us gauge the progress of the project. We discussed the requirements and decided that, as the first short-term objective, I would add support for async/await idioms in Scrapy's built-in parse method. This was fairly challenging work, so I discussed all the points and doubts with my mentor, and we finalised the objective.

Going about writing the code

As I spent a lot of time planning and discussing the goals of the project, I did not write as much code as I had intended. However, I did work on enabling asynchronous generator support in Scrapy. The change is backwards compatible, so it can still be used with Python 2.7 (even though Python 2.7 itself does not support asyncio). This PR will be used in introducing async/await idioms in Scrapy's parse method. I think this kicks off my community bonding period nicely. I am hopeful that I can code much faster now, though coding with planning takes time.
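To illustrate the idea, here is a minimal sketch in plain asyncio (not actual Scrapy internals; the names parse_items and main are made up for the example) of how an asynchronous generator can yield scraped items one at a time while cooperating with an event loop:

```python
import asyncio

# Hypothetical stand-in for a callback that yields items as responses arrive.
async def parse_items():
    for i in range(3):
        await asyncio.sleep(0)  # point where control returns to the event loop
        yield {"item": i}

async def main():
    collected = []
    # "async for" drives the asynchronous generator to completion.
    async for item in parse_items():
        collected.append(item)
    return collected

print(asyncio.run(main()))  # [{'item': 0}, {'item': 1}, {'item': 2}]
```

The key point is that each yield hands an item to the consumer while each await hands control back to the event loop, so the generator is both an iterator and an asynchronous task.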

Tasks that I am currently working upon

I am currently working on supporting response = await scrapy.Request(..), as this is the next task I planned with my mentor. I am looking through a similar Python package, inline_requests, and look forward to discussing it with my mentor. I am quite hopeful of making good progress this week.

Details of my project

This blog post covers the details of my project, async/await support in Scrapy, which I will be working on during this summer.

Details of my project: async/await support in Scrapy

How did I get to know about the organization?

Scrapy is a web scraper, which I have used to scrape websites and gather data for data analytics and machine learning. Data is an integral part of these fields, and we need data in order to make progress in them. Using Scrapy made me appreciate the philosophy of the software, and its open-source nature means that users can customise it as needed.

How did I choose the project?

Choosing the right project is important, because the goal you work towards should align with your interests and philosophy. The part of the project that most fascinated me was the asynchronous programming paradigm. Those who have programmed in JavaScript will know about it; for others, I will use an analogy. Suppose we have to serve three different dishes ordered by three different people, in a specific order. The synchronous way would be to serve the three dishes in the order of the requests. This wastes a lot of time if the third order is completed first, since we would still have to serve it last to comply with the rule.

In the asynchronous paradigm, the goal is simple: serve whichever order finishes first, regardless of the sequence of the requests. In programming terms, it is like serving requests in the order of their completion.
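The serving analogy above can be sketched with asyncio (the dish names and timings are made up for illustration): three orders take different amounts of time, and the asynchronous approach serves each one as soon as it is ready rather than in request order.

```python
import asyncio

async def prepare(dish, seconds):
    # Simulate the time taken to prepare a dish.
    await asyncio.sleep(seconds)
    return dish

async def serve_all():
    orders = [prepare("pasta", 0.03), prepare("soup", 0.01), prepare("steak", 0.02)]
    served = []
    # as_completed yields results in completion order, not submission order.
    for finished in asyncio.as_completed(orders):
        served.append(await finished)
    return served

print(asyncio.run(serve_all()))  # ['soup', 'steak', 'pasta']
```

Even though pasta was ordered first, soup is served first because it finished first, which is exactly the behaviour the analogy describes.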

How does Scrapy use the asynchronous paradigm of programming?

As Scrapy is a web scraper, it makes sense to use asynchronous programming: web pages may not be available at the time of the request, so the async paradigm suits the task well. Before Python 3.3, people wrote asynchronous code using generators, and Scrapy uses a framework named Twisted for this. Generators are iterators that can be stopped as and when wished, and resumed whenever we want. Using generators, we can pause the code flow, asking Python to wait until we get the response to a request, and then proceed accordingly. A relevant blog post on using generators is: Generators.
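Here is a small sketch of that pause-and-resume mechanism (fetch_flow is a made-up name, and the "response" is just a string, not a real network call): each yield suspends the function, and send() resumes it with a value, which is how generator-based asynchronous code waits for a response.

```python
def fetch_flow():
    # Pause here until a "response" is sent back into the generator.
    response = yield "request sent"
    # Resume here once the response arrives, and continue the flow.
    yield "parsed: " + response

gen = fetch_flow()
print(next(gen))           # 'request sent' -- generator is now paused
print(gen.send("<html>"))  # 'parsed: <html>' -- resumed with the response
```

Frameworks like Twisted build on exactly this: the code between yields reads sequentially, while the event loop decides when each generator is resumed.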

How is the async/await syntactic sugar relevant?

People who have programmed in the asynchronous paradigm know its drawbacks, especially spaghetti code and callback hell. While newer approaches to asynchronous programming have been used in Python (i.e. generators), the main problem starts when we want to use generators both as a tool for asynchronous programming and as iterators. It becomes difficult to differentiate between the two, and people contributing to these projects might not be able to tell them apart. To differentiate between the two, Python introduced two keywords, async and await. Under the hood these work like generators, with the difference that they have their own primitive, called 'coroutines'. Using them, it becomes easy to write asynchronous code in a synchronous style while maintaining its async nature.

What would be my goal in this project?

My objective in the project is to support async/await idioms, so that users can write their projects with the new syntactic sugar while maintaining the backwards compatibility of the project. At the end, we will be able to write scraping code using the new paradigm and frame our scraping logic without the overhead of callbacks.