Blog Post #3

siddharthadr
Published: 07/18/2021

After finalizing the Post-Processing feature, I have moved on to implementing my third proposed feature: Batch Delivery Triggers. This will be the toughest of the three. Below I explain what it is and how I am tackling it.

What is this feature?

Scrapy 2.3.0 introduced a batch creation feature that generates multiple output files based on a specified item count constraint: whenever a file's item count reaches the limit, a fresh file is created to store further items. I intend to add more delivery triggers, namely a time duration limit and a file size limit, while also giving users a way to create their own custom triggers.

The Plan

To support custom batch delivery triggers, I plan to use a base BatchHandler class as the default handler for batch creation, covering the file size, item count and time duration triggers. A user who wants different behaviour can then easily replace this base class with a custom class.
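To make the idea concrete, here is a minimal sketch of what such a handler could look like. Apart from the BatchHandler name, everything in it (the method names, the constructor arguments, the convention that a limit of 0 disables a trigger) is an assumption made for illustration, not the interface that will end up in Scrapy.

```python
import time


class BatchHandler:
    """Hypothetical default handler: decides when a feed's current batch is full."""

    def __init__(self, item_count=0, file_size=0, duration=0, clock=time.monotonic):
        # Assumption: a limit of 0 means "this trigger is disabled".
        self.item_count_limit = item_count
        self.file_size_limit = file_size
        self.duration_limit = duration
        self._clock = clock
        self.new_batch()

    def new_batch(self):
        """Reset the per-batch counters when a fresh file is started."""
        self.items = 0
        self.bytes_written = 0
        self.started_at = self._clock()

    def item_added(self, size_in_bytes):
        """Record one exported item and the number of bytes it added to the file."""
        self.items += 1
        self.bytes_written += size_in_bytes

    def should_trigger(self):
        """Return True if any enabled limit has been reached."""
        if self.item_count_limit and self.items >= self.item_count_limit:
            return True
        if self.file_size_limit and self.bytes_written >= self.file_size_limit:
            return True
        if self.duration_limit and self._clock() - self.started_at >= self.duration_limit:
            return True
        return False


class TwoItemsHandler(BatchHandler):
    """A custom trigger: always start a new batch after two items."""

    def should_trigger(self):
        return self.items >= 2
```

A custom trigger then only needs to override should_trigger, which is the replaceability the plan is after.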

My Work So Far

I have created a base class with the planned methods for batch handling. The next step was modifying the FeedExporter so that it can use BatchHandler classes to determine when to start a new batch. So far this works for the item count and file size limits, but there have been some problems implementing a trigger for time duration.

Initially I came up with a very naive approach: update the time elapsed for a batch whenever an item is scraped, and create a new batch if the limit has been passed. The problem is apparent from that sentence itself. There is no guarantee that the next item will be scraped soon, so a batch's duration may well exceed the limit by a wide margin by the time an item arrives.
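A small simulation shows how far the naive check can overshoot. The function name and the arrival times are made up for illustration; the only thing taken from the approach above is that the duration is checked solely when an item arrives.

```python
def naive_batch_boundaries(arrival_times, limit):
    """Return (batch_start, batch_closed_at) pairs for the naive approach.

    The elapsed time is only compared against `limit` when an item arrives,
    so a batch cannot close during a gap between items.
    """
    boundaries = []
    batch_start = 0.0
    for t in arrival_times:
        if t - batch_start >= limit:
            boundaries.append((batch_start, t))
            batch_start = t
    return boundaries


# Items arrive at 1s and 2s, then nothing until 50s; the limit is 10s.
print(naive_batch_boundaries([1, 2, 50, 51], limit=10))  # [(0.0, 50)]
```

With a 10 second limit, the first batch is only closed at 50 seconds, because no item arrived in between to run the check.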

One possible solution I have been considering is to schedule a call to the function that creates a new batch for the feed once the specified duration has elapsed. If some other declared constraint is breached first, a new batch is created at that point and the timer is reset. Twisted's task.deferLater or reactor.callLater could be the answer to my problem. However, I think these methods are not thread-safe and would need some sort of mutex to keep the code safe. I will need to come up with a control flow plan that ensures batches are created on time without threatening the safety of the code.
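The scheduling idea could be sketched roughly as follows. BatchTimer and its methods are hypothetical names; the sketch only assumes an object that offers callLater(delay, callable) and returns a handle with active() and cancel(), which is what Twisted's reactor.callLater provides.

```python
class BatchTimer:
    """Hypothetical helper: fires start_new_batch after `duration` seconds
    and can be restarted when another trigger closes the batch earlier."""

    def __init__(self, reactor, duration, start_new_batch):
        self._reactor = reactor
        self._duration = duration
        self._start_new_batch = start_new_batch
        self._call = None  # handle of the pending delayed call, if any

    def start(self):
        """(Re)start the countdown for the current batch."""
        self.stop()
        self._call = self._reactor.callLater(self._duration, self._on_timeout)

    def stop(self):
        """Cancel the pending call, e.g. when the feed is closed."""
        if self._call is not None and self._call.active():
            self._call.cancel()
        self._call = None

    def _on_timeout(self):
        # The duration limit was reached before any other trigger fired.
        self._call = None
        self._start_new_batch()
        self.start()  # begin timing the batch that was just opened
```

When the item count or file size trigger creates a batch first, the exporter would call start() again to reset the countdown. Passing twisted.internet.reactor gives real timing, while twisted.internet.task.Clock offers the same callLater plus advance() for tests. As far as I can tell, the thread-safety concern applies mainly to calling callLater from outside the reactor thread, for which reactor.callFromThread exists; if everything runs inside the reactor thread, the callbacks run one at a time and a mutex may turn out to be unnecessary.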
