siddharthadr's Blog

Blog Post #3

siddharthadr
Published: 07/18/2021

After finalizing the Post-Processing feature, I have moved on to implementing my third proposed feature - Batch Delivery Triggers. This feature will be the toughest of the three. Below I will talk about what it really is and how I am tackling it.

What is this feature?

In Scrapy 2.3.0, a batch creation feature was introduced that generates multiple output files based on a specified item count constraint: whenever a file's item count exceeds the limit, a fresh file is created to store further items. What I intend to do is add more delivery triggers, namely a time duration limit and a file size limit, while also providing users with a way to create their own custom triggers.
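For context, this is roughly how the existing item-count trigger is configured today, using the FEED_EXPORT_BATCH_ITEM_COUNT setting and the batch_id/batch_time placeholders from Scrapy's feed exports documentation:

```python
# settings.py -- the item-count based batching that already exists in Scrapy 2.3+.
# Whenever a batch reaches 100 items, the feed exporter closes the current file
# and starts a new one; %(batch_id)d and %(batch_time)s keep the file names unique.
FEED_EXPORT_BATCH_ITEM_COUNT = 100

FEEDS = {
    "items-%(batch_id)d-%(batch_time)s.json": {
        "format": "json",
    },
}
```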

The Plan

To support custom batch delivery triggers, I planned to use a base BatchHandler class as the default handler for batch creation, covering triggers such as file size, item count and time duration. This makes it easy to swap the base class out for a custom class, if the user desires to use one.
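Purely as an illustration of the plan (the names and signatures here are my own placeholders, not the final API), such a base class might look something like this:

```python
# Illustrative sketch only -- class, method and option names are assumptions,
# not Scrapy's final API.
class BatchHandler:
    """Default handler deciding when a feed should start a new batch."""

    def __init__(self, feed_options):
        # Limits come from the feed's options; None means "no constraint".
        self.item_count_limit = feed_options.get("batch_item_count")
        self.file_size_limit = feed_options.get("batch_file_size")
        self.item_count = 0
        self.file_size = 0

    def item_added(self, item, size_in_bytes):
        # Called by the exporter after every exported item.
        self.item_count += 1
        self.file_size += size_in_bytes

    def should_trigger(self):
        # True when any declared constraint has been breached.
        if self.item_count_limit and self.item_count >= self.item_count_limit:
            return True
        if self.file_size_limit and self.file_size >= self.file_size_limit:
            return True
        return False

    def new_batch(self):
        # Reset counters when the FeedExporter actually starts a fresh file.
        self.item_count = 0
        self.file_size = 0
```

The time duration trigger is deliberately left out of this sketch because, as described below, it cannot be checked reliably from inside item callbacks alone.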

My Work So Far

I have created a base class with the planned methods for batch handling. Next was modifying the FeedExporter so that it can use the BatchHandler classes to determine when to start a new batch. So far it works for the item count and file size limits, but there have been some problems implementing a trigger for time duration.

Initially I came up with a very naive approach: update the time elapsed for a batch whenever an item is scraped, and create a new batch if the limit has been passed. Obvious problems can be deduced from that sentence itself. There is no guarantee that an item is scraped instantaneously, so a batch may very well overshoot its duration limit by a lot by the time the next item is scraped.

One possible solution I have been thinking about is to "schedule" the call that creates a new batch for the feed after the specified duration has elapsed. But if some other declared constraint is breached first, a new batch will be created then and the timer will have to be reset. The Twisted framework's task.deferLater or reactor.callLater could be the answer to my problem. However, I think these methods are not thread-safe and will need some sort of mutexes to guarantee the safety of the code. I will need to come up with a control flow plan to ensure batches are created on time without threatening the safety of the code.
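As a rough sketch of the scheduling idea only, reactor.callLater returns a handle that can be reset whenever another constraint triggers a batch first; everything here apart from the Twisted call itself is a hypothetical placeholder, not the actual implementation:

```python
from twisted.internet import reactor

# Hypothetical sketch of the time-based trigger idea.
class TimedBatchScheduler:
    def __init__(self, duration, start_new_batch):
        self.duration = duration              # seconds allowed per batch
        self.start_new_batch = start_new_batch
        self.delayed_call = None

    def start(self):
        # Schedule a batch rollover after `duration` seconds.
        self.delayed_call = reactor.callLater(self.duration, self._on_timeout)

    def reset(self):
        # Called when another constraint (item count, file size) created a batch first.
        if self.delayed_call and self.delayed_call.active():
            self.delayed_call.reset(self.duration)

    def _on_timeout(self):
        self.start_new_batch()
        self.start()  # schedule the next rollover for the fresh batch
```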


Weekly Check-in #4

siddharthadr
Published: 07/12/2021

1. What did you do this week?
A. I extended the tests for Post-Processing and completed the documentation for it. The feature was completed and approved by my mentor.

2. What is coming up next?
A. Now the last proposed feature, Batch Delivery Triggers, remains. I will polish its user API and start the implementation process.

3. Did you get stuck anywhere?
A. Nothing major this week.


Blog Post #2

siddharthadr
Published: 07/04/2021

Last week I completed most of my second proposed feature, "Post-Processing". Let me run you through how it happened.

The Idea

So this feature is actually an extension of a compression support idea. Initially it was supposed to be just the addition of a compression feature to compress the feeds with different compression algorithms. While discussing this idea with my mentor, he suggested a more general post-processing mechanism rather than just compression, which I quite liked and expanded a little. A few back-and-forth discussions with my mentor later, the idea grew somewhat concrete.

The Plan

The finalized plan was that the processing of a feed would be done by "plugins", which would work in a pipeline-ish way, transferring the processed data from one plugin to the next until it is finally written to the target file. These plugins will be managed by a single plugin manager, which loads all the user-declared plugins and connects each one's output to the next-in-line plugin's input. The plugin manager will act as a wrapper around the feed's storage, so whenever a write event is invoked on the storage, the data first goes through the manager, which sends it sequentially through all the plugins until it finally reaches the destination file, processed as intended.
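To make the pipeline idea concrete, here is a stripped-down sketch assuming each plugin simply wraps the next writable object; the actual interface in the PR may differ:

```python
# Illustrative sketch of the plugin pipeline; interface details are assumptions.
class Plugin:
    def __init__(self, file, feed_options):
        self.file = file                  # the next plugin in line, or the target file
        self.feed_options = feed_options

    def write(self, data):
        # A real plugin would transform `data` (e.g. compress it) before passing it on.
        return self.file.write(data)

    def close(self):
        self.file.close()


class PluginManager:
    """Wraps the feed storage file and chains the declared plugins in front of it."""

    def __init__(self, plugin_classes, target_file, feed_options):
        # Build the chain back to front: the last plugin writes to the target file.
        head = target_file
        for plugin_cls in reversed(plugin_classes):
            head = plugin_cls(head, feed_options)
        self.head = head

    def write(self, data):
        # Data enters the first plugin and trickles down to the file.
        return self.head.write(data)

    def close(self):
        self.head.close()
```

Building the chain back to front keeps the manager unaware of how many plugins there are; it only ever talks to the head of the chain.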

The Implementation

The plugin interface was decided first, and subsequently the built-in plugins were created: GzipPlugin, LZMAPlugin and Bz2Plugin. These plugins are, as you can see, compression based. Parameter passing was achieved through the feed_options dictionary, the feed-specific options, where users can declare the parameters. Next came the testing phase, where I initially made an all-in-one test method for all the post-processing. My mentor suggested creating a more elaborate test class instead of one mega-test-method. Documentation is new to me, so my newbie skills were reflected in my poor documentation attempts. Fortunately, I have a cool mentor with a keen eye for detail, so there was valuable input in my code reviews from which I got to learn a lot.
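To give an idea of how a user might declare plugins and their parameters, the configuration could look something along these lines (the option names, such as postprocessing and gzip_compresslevel, reflect the work-in-progress API and may change):

```python
# settings.py -- sketch of declaring post-processing plugins per feed;
# option names follow the work-in-progress API and may differ in the final release.
FEEDS = {
    "items.jl.gz": {
        "format": "jsonlines",
        "postprocessing": ["scrapy.extensions.postprocessing.GzipPlugin"],
        "gzip_compresslevel": 5,   # parameter picked up by GzipPlugin via feed_options
    },
}
```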


Weekly Check-in #3

siddharthadr
Published: 06/27/2021

1. What did you do this week?
A. This week I created the plugins' manager PostProcessingManager, the Plugin interface and the built-in compression plugins (gzip, bzip2, lzma). PR available here: #5190.

2. What is coming up next?
A. This week I will focus on the tests for the created plugins and manager.

3. Did you get stuck anywhere?
A. While implementing the manager and built-in plugins I had assumed that only two methods would ever be used: write and close. Well, it turned out that one of the exporters (used for exporting a scraped item to a file), CsvItemExporter, also wraps the storage file in io.TextIOWrapper, which actually expects an object with at least the methods and attributes of io.IOBase. The problem was fixed by making PostProcessingManager subclass io.IOBase and also providing a writable method to indicate that the object is open for writing.
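For anyone curious, the fix boils down to something like this minimal, standalone illustration (not the actual PostProcessingManager code):

```python
import io

# Minimal illustration of the io.IOBase fix.
class WritableWrapper(io.IOBase):
    def __init__(self, target):
        self.target = target  # the real binary storage file

    def write(self, data):
        return self.target.write(data)

    def writable(self):
        # io.IOBase returns False by default, so io.TextIOWrapper
        # would otherwise refuse to write through this object.
        return True


with open("out.bin", "wb") as f:
    text_stream = io.TextIOWrapper(WritableWrapper(f), encoding="utf-8")
    text_stream.write("exported row\n")
    text_stream.flush()
```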


Blog Post #1

siddharthadr
Published: 06/20/2021

Going into the 3rd week of GSoC, the Item Filter feature is almost complete and ready to go. In this post I'll give a rundown of how I implemented it.


What is this feature?

When Scrapy scrapes some data from the interwebs, it saves the scraped items to the declared feed storages. Scrapy provides a lot of flexibility in choosing what this storage can be, ranging from simple CSV files to S3 cloud storage. But currently, every time some data is scraped, it is stored in all the storages. Some users may require storing certain types of data in certain storages, so let's make it convenient for them to do so.

How to implement such a feature?

We can introduce a filter for each storage, so that whenever some data is scraped, the storage can check whether the data is acceptable and, if it is, export it to that storage. Well, that's really simple, right? So how do we do that?
Let's create a class called ItemFilter that will handle the filtering process and assign this class to a declared storage. As we can have more than one data storage, we can assign different ItemFilter classes to different storages, with each filter having its own logic to filter scraped data. So we create our filter, assign it to one of the storages in the settings.py file, and voila. Another convenient thing we can do is allow users to simply declare what sort of Item they wish to accept for a particular storage. They can then just put a list of Item classes for a storage in the settings.py file instead of creating a filter class, as sketched below.
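From the user's side, this could look something like the sketch below; the option names (item_filter, item_classes) and the ItemFilter import path are assumptions based on this design and may differ from what ships:

```python
# settings.py -- a sketch of the user-facing idea; option names and the
# ItemFilter import path are assumptions based on the design above.
from scrapy.extensions.feedexport import ItemFilter

class AuthorItemFilter(ItemFilter):
    # Hypothetical filter: accept only items that carry an "author" field.
    def accepts(self, item):
        return "author" in item

FEEDS = {
    "quotes.json": {
        "format": "json",
        "item_filter": AuthorItemFilter,                 # custom filtering logic
    },
    "books.csv": {
        "format": "csv",
        "item_classes": ["myproject.items.BookItem"],    # accept only BookItem
    },
}
```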

My Work

As this was a straightforward implementation, there wasn't much difficulty in implementing this feature apart from some design changes. The testing phase, though, was a little different from what I'm familiar with. For testing the feed storages, we need to decorate every test method with @defer.inlineCallbacks. It is part of the Twisted framework and is used when you want to write sequential-looking code while using Deferred objects. This was new to me and I had to read about it and why it is used. I got to learn about Deferred objects and how to write code with them. Also, this was the first time I was writing documentation in reStructuredText format.
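For anyone unfamiliar with it, @defer.inlineCallbacks lets a test yield Deferreds as if the code were sequential; here is a tiny standalone example (not taken from the Scrapy test suite):

```python
from twisted.internet import defer, reactor
from twisted.trial import unittest


class DeferredStyleTest(unittest.TestCase):
    @defer.inlineCallbacks
    def test_waits_for_deferred(self):
        # Fire the Deferred after a short delay, as an async operation would.
        d = defer.Deferred()
        reactor.callLater(0.1, d.callback, "stored")
        result = yield d  # reads sequentially, but the reactor keeps running underneath
        self.assertEqual(result, "stored")
```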
