Blog Post #1

siddharthadr
Published: 06/20/2021

As GSoC enters its third week, the Item Filter feature is almost complete and ready to go. In this post I'll give a rundown of how I implemented it.


What is this feature?

When Scrapy scrapes some data from the interwebs, it saves each scraped item to the declared feed storages. Scrapy offers a lot of flexibility in choosing what this storage can be, ranging from simple CSV files to S3 cloud storage. But currently, every time some data is scraped, it is stored in all of the declared storages. Some users may need to store certain types of data only in certain storages, so let's make that convenient for them.

How to implement such a feature?

We can introduce a filter for each storage, so that whenever some data is scraped, the storage's filter checks whether the data is acceptable, and only if it is do we export it to that storage. Well, that's really simple, right? So how do we do that?
Let's create a class called ItemFilter that handles the filtering process, and assign this class to a declared storage. As we can have more than one data storage, we can assign different ItemFilter classes to different storages, with each filter having its own logic for filtering scraped data. So we create our filter, assign it to one of the storages in the settings.py file, and voila. Another convenient thing we can do is let users simply declare which Item classes they wish to accept for a particular storage. That way they can just put a list of Items for a storage in the settings.py file instead of creating a filter class.
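To make the idea concrete, here is a minimal sketch in plain Python of how such per-storage filtering could work. It does not use Scrapy itself: the accepts() method, the item_classes argument, the Product and Review items, the CheapProductFilter subclass and the route() helper are all names I made up for illustration, so treat this as a rough model of the design rather than the actual implementation.

```python
from dataclasses import dataclass


@dataclass
class Product:
    # Hypothetical item type used only for this example
    name: str
    price: float


@dataclass
class Review:
    # Another hypothetical item type
    text: str


class ItemFilter:
    """Accepts every item, unless a list of allowed item classes is given."""

    def __init__(self, item_classes=None):
        self.item_classes = tuple(item_classes or ())

    def accepts(self, item):
        # With no declared classes, everything passes through
        if self.item_classes:
            return isinstance(item, self.item_classes)
        return True


class CheapProductFilter(ItemFilter):
    """Custom filter with its own logic: only products under 10."""

    def accepts(self, item):
        return isinstance(item, Product) and item.price < 10


# Stand-in for the per-storage configuration that would live in settings.py
storages = {
    "products.csv": ItemFilter([Product]),  # just a list of accepted Items
    "cheap.json": CheapProductFilter(),     # a filter class with custom logic
    "everything.jl": ItemFilter(),          # no filtering at all
}


def route(item):
    """Return the storages whose filter accepts the given item."""
    return [uri for uri, item_filter in storages.items() if item_filter.accepts(item)]
```

With this setup, a cheap Product ends up in all three storages, an expensive one skips cheap.json, and a Review only reaches everything.jl.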

My Work

As this was a straightforward implementation, there wasn't much difficulty in implementing this feature apart from some design changes. The testing phase, though, was a little different from what I'm familiar with. For testing the Feed Storages, we need to decorate every test method with @defer.inlineCallbacks. It is part of the Twisted framework and is used when you want to write sequential-looking code while working with Deferred objects. This was new to me, and I had to read up on it and on why it is used. I got to learn about Deferred objects and how to write code that works with them. Also, this was the first time I wrote documentation in reStructuredText format.