siddharthadr's Blog

Blog Post #1

siddharthadr
Published: 06/20/2021

Going into the third week of GSoC, the Item Filter feature is almost complete. In this post I'll give a rundown of how I implemented it.


What is this feature?

When Scrapy scrapes some data from the interwebs, it saves each scraped item to the declared feed storages. Scrapy provides a lot of flexibility in choosing what this storage can be, ranging from simple CSV files to S3 cloud storage. But currently, every time some data is scraped, it is stored in all of the declared storages. Some users may need to store only certain types of data in a certain storage, so let's make that convenient for them.

How to implement such a feature?

We can introduce a filter for each storage, so that whenever some data is scraped, the storage can check whether the data is acceptable and export it only if it is. Well, that's really simple, right? So how do we do that?
Let's create a class called ItemFilter that handles the filtering process, and assign this class to a declared storage. As we can have more than one storage, we can assign different ItemFilter classes to different storages, with each filter having its own logic for filtering scraped data. So we create our filter, assign it to one of the storages in the settings.py file, and voila. Another convenient thing we can do is let users simply declare which Item classes they wish to accept for a particular storage. That way they can just put a list of Item classes for a storage in the settings.py file instead of creating a filter class.
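
To make the idea concrete, here is a minimal, self-contained sketch in plain Python. It is not the code from the PR and it doesn't import Scrapy at all: the `accepts` method, the `Product` and `Review` items, the `CheapProductFilter` subclass and the `route` helper are all names I made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Product:
    name: str
    price: float


@dataclass
class Review:
    text: str


class ItemFilter:
    """Hypothetical base filter: accepts items of the listed classes.

    An empty list means "accept everything", which mirrors the current
    behaviour where every item goes to every storage.
    """

    def __init__(self, item_classes=None):
        self.item_classes = tuple(item_classes or ())

    def accepts(self, item):
        if not self.item_classes:
            return True
        return isinstance(item, self.item_classes)


class CheapProductFilter(ItemFilter):
    """Hypothetical custom filter with its own logic."""

    def accepts(self, item):
        return isinstance(item, Product) and item.price < 10


def route(items, storages):
    """Send each item to every storage whose filter accepts it.

    `storages` maps a storage name to a filter instance (an assumption
    made for this sketch, not Scrapy's settings format).
    """
    exported = {name: [] for name in storages}
    for item in items:
        for name, item_filter in storages.items():
            if item_filter.accepts(item):
                exported[name].append(item)
    return exported


items = [Product("pen", 2.0), Product("laptop", 900.0), Review("great")]
exported = route(items, {
    "products.csv": ItemFilter([Product]),  # list of accepted classes
    "cheap.json": CheapProductFilter(),     # custom filter class
    "all.xml": ItemFilter(),                # no filter: accept all
})
```

Here `products.csv` receives only the two `Product` items, `cheap.json` only the pen, and `all.xml` everything, which covers both ways of declaring a filter described above.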

My Work

As this was a straightforward implementation, there wasn't much difficulty in implementing this feature apart from some design changes. The testing phase, though, was a little different from what I'm familiar with. For testing the feed storages, we need to decorate every test method with @defer.inlineCallbacks. It is part of the Twisted framework and lets you write code that reads sequentially while working with Deferred objects. This was new to me, and I had to read about it and why it is used. I got to learn about Deferred objects and how to write code for them. Also, this was the first time I wrote documentation in the reStructuredText format.


Weekly Check-in #2

siddharthadr
Published: 06/14/2021

The weather's just great here now that the monsoon has arrived. I can enjoy the rain while sipping coffee and coding for GSoC, with some cozy tunes to go along.

1. What did you do this week?
A. The ItemChecker class that I proposed has been added to the codebase, along with the appropriate modifications in FeedExport so that it can use those filter classes. I pushed the code and created a draft PR. It can be accessed here: #5178. My mentor left some feedback, which I agreed with.

2. What is coming up next?
A. I will look into my mentor's feedback and improve the code. After that I'll continue with adding tests and documentation.

3. Did you get stuck anywhere?
A. I did get stuck once while manually testing the new implementation. I had added a class attribute called item_classes to the ItemChecker class, which I intended to be different for every feed slot. The problem I was facing was that whenever I changed item_classes on one instance, the change showed up on the other instances as well. The problem was quite trivial, actually, and I should have known why it was happening: class attributes are shared amongst all instances of the class. On top of that, I also learned how default parameter values work in Python, which behave in a similar way because a default value is created only once, when the function is defined. For the curious, follow this Stack Overflow thread to understand more about default parameter values in Python.
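
Here is a small sketch that reproduces both behaviours. The class and function names (`SharedChecker`, `PerInstanceChecker`, `add_class`) are made up for illustration and are not the actual ItemChecker code.

```python
class SharedChecker:
    # Class attribute: a single list object shared by every instance.
    item_classes = []


class PerInstanceChecker:
    def __init__(self, item_classes=None):
        # Instance attribute: each instance gets its own fresh list.
        self.item_classes = list(item_classes or [])


a, b = SharedChecker(), SharedChecker()
a.item_classes.append("Product")
# b sees the change too, because a.item_classes and b.item_classes
# are the same list.
leaked = b.item_classes

c, d = PerInstanceChecker(), PerInstanceChecker()
c.item_classes.append("Product")
# d is unaffected: its list is a separate object.
isolated = d.item_classes


def add_class(name, classes=[]):
    # The default list is created once, when the function is defined,
    # and reused on every call that omits `classes`.
    classes.append(name)
    return classes


first = add_class("Product")
second = add_class("Review")  # same list object as `first`
```

The usual fix in both cases is the one in `PerInstanceChecker`: default to None and build a new list inside the function or `__init__`.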


Weekly Check-in #1

siddharthadr
Published: 06/07/2021

Alright then, as this is the first ever post for GSoC, let me introduce myself and what I'm doing. My name is D R Siddhartha; friends call me DR (yeah, those two letters). I like running and hiking (hopefully I get to hike Iceland one day). I'll be working on Scrapy, a web-crawling framework. Specifically, I'll be working on its Feed features, adding enhancements.

Now about the 3 mandatory questions:
1. What did you do this week?
A. I started the week by getting a vaccine shot. The side effects showed up the next day and were horrendous, so I had to delay the scheduled meeting. But my mentor and I have already been discussing the API designs for the enhancements. We made the discussions public on GitHub so other people can join in as well. The discussions are available here: #5168, #5161 and #5169.

2. What is coming up next?
A. The coding phase starts now. I'll begin with the first of the 3 features that I'll be implementing: Item Filters. Some API discussions have already taken place, so now the implementation work starts.

3. Did you get stuck anywhere?
A. Not so far.
