Articles on siddharthadr's Blog
https://blogs.python-gsoc.org
Updates on different articles published on siddharthadr's Blog (en). Last updated: Wed, 18 Aug 2021 07:23:14 +0000

GSoC Final Report
https://blogs.python-gsoc.org/en/siddharthadrs-blog/gsoc-final-report/
<h3> Summary </h3> <p> GSoC comes to an end this week. Here I present the contributions I made to the Scrapy project this summer. All the features I proposed were implemented, with one caveat: the last feature had to be divided into two parts, soft limits and hard limits. The soft limits part has been implemented, but the hard limits are yet to be completed. </p> <h3> Contributions </h3> <p> 1. Issues </p><ul> <li><a href="https://github.com/scrapy/scrapy/issues/5161">Feeds Enhancement: Item Filters</a></li> <li><a href="https://github.com/scrapy/scrapy/issues/5168">Feeds Enhancement: Post-Processing</a></li> <li><a href="https://github.com/scrapy/scrapy/issues/5169">Feeds Enhancement: Batch Delivery Triggers</a></li> </ul> 2. Pull Requests <ul> <li><a href="https://github.com/scrapy/scrapy/pull/5178">Item Filters</a></li> <li><a href="https://github.com/scrapy/scrapy/pull/5190">Post-Processing</a></li> <li><a href="https://github.com/scrapy/scrapy/pull/5205">Batch Triggers (Soft Limits)</a></li> </ul> 3. Blog Posts: All my blog posts can be found <a href="https://blogs.python-gsoc.org/en/siddharthadrs-blog/">here</a>. <h3> Future Work </h3> <p> Though the features have been implemented, there is still room for improvement: </p><ul> <li>Implement hard limits for batch delivery triggers</li> <li>Switch batches based on item content</li> <li>Schedule batches based on cron expressions</li> </ul>
siddharthadr11@gmail.com (siddharthadr), Wed, 18 Aug 2021 07:23:14 +0000

Blog Post #5
https://blogs.python-gsoc.org/en/siddharthadrs-blog/blog-post-5-3/
<h3> The Past Week </h3> <p> Last week was spent improving the documentation for the final feature, batch triggers. Hopefully the feature is now in a mergeable state. There is still plenty of room to improve it, though, so I'll continue to work on it. </p> <h3> Phasing Out... </h3> <p> This week brings an end to GSoC. It has been a wonderful experience for me! I'd like to thank my mentors Adrián Chaves and Aditya Kumar for their support and help. Scrapy has been a cool project to work on and I hope I continue to contribute to it. </p>
siddharthadr11@gmail.com (siddharthadr), Sun, 15 Aug 2021 19:19:00 +0000

Weekly Check-in #6
https://blogs.python-gsoc.org/en/siddharthadrs-blog/weekly-check-in-6-22/
<p> <b>What did you do this week?</b><br> A. I am almost done with the soft limits for batch delivery triggers. </p> <p> <b>What is coming up next?</b><br> A. Up next is converting the soft limits into hard limits. </p> <p> <b>Did you get stuck anywhere?</b><br> A. Nothing major. </p>
siddharthadr11@gmail.com (siddharthadr), Sun, 08 Aug 2021 18:36:19 +0000

Blog Post #4
https://blogs.python-gsoc.org/en/siddharthadrs-blog/blog-post-4-3/
<h3> The Past Week </h3> <p> My university classes have finally begun, and placement season along with them, so I am falling a bit behind schedule. Still, I managed to write tests for the file-size and duration batch delivery triggers. As these are soft limits, creating tests for them was tricky. </p> <h3> Work Done So Far </h3> <p> The added triggers are called soft limits because the limits are only checked when an item is scraped, so a batch will not always be cut off exactly at the limit. I was a little unsure how to proceed with the tests, but my mentor's suggestion settled it and I went with that. Tests for soft limits then cover four general cases (presuming we scrape 2 items in total): i) the limit is zero, so no limit is imposed; ii) the limit is the smallest unit, so only 1 item is accepted per batch; iii) the limit causes at most 1 item per batch; iv) the limit causes at most 2 items per batch. </p>
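<p> To make these cases concrete, here is a small, self-contained simulation of a file-size soft limit. This is illustrative rather than the actual Scrapy test code, and it assumes the spider yields two items of 100 serialized bytes each: </p>
<pre><code># Simulate a soft file-size limit: the size check only runs after an item
# is written, so a batch can overshoot the limit by up to one item.
def batches_for(item_sizes, size_limit):
    batches, current = [[]], 0
    for size in item_sizes:
        batches[-1].append(size)
        current += size
        if size_limit and current >= size_limit:  # checked per item: "soft"
            batches.append([])
            current = 0
    return [batch for batch in batches if batch]

items = [100, 100]                        # two items, ~100 bytes each
assert len(batches_for(items, 0)) == 1    # i) zero disables the limit
assert len(batches_for(items, 1)) == 2    # ii) smallest unit: one item per batch
assert len(batches_for(items, 100)) == 2  # iii) at most 1 item fits per batch
assert len(batches_for(items, 250)) == 1  # iv) both items fit in one batch
</code></pre>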
<h3> Work Ahead </h3> <p> I have yet to add tests for custom batch triggers. Once documentation is added, I will finalize my soft limit triggers. After that I will have to convert the soft limits into hard ones. </p>
siddharthadr11@gmail.com (siddharthadr), Sun, 01 Aug 2021 18:17:55 +0000

Weekly Check-in #5
https://blogs.python-gsoc.org/en/siddharthadrs-blog/weekly-check-in-5-22/
<p> <b>What did you do this week?</b><br> A. I added soft limits for file size and time duration as batch triggers. I also tried to figure out ways to enforce hard limits for them. </p> <p> <b>What is coming up next?</b><br> A. Finalize the plan for hard limits and start its implementation. </p> <p> <b>Did you get stuck anywhere?</b><br> A. I am still in the middle of figuring out hard limits for time duration and file size. </p>
siddharthadr11@gmail.com (siddharthadr), Sun, 25 Jul 2021 18:39:47 +0000

Blog Post #3
https://blogs.python-gsoc.org/en/siddharthadrs-blog/blog-post-3-2/
<p> After finalizing the Post-Processing feature, I have moved on to implementing my third proposed feature, Batch Delivery Triggers. This feature will be the toughest of the three. Below I'll explain what it really is and how I am tackling it. </p> <h3>What is this feature?</h3> <p> In Scrapy 2.3.0, a <a href="https://docs.scrapy.org/en/latest/topics/feed-exports.html?highlight=batch#feed-export-batch-item-count">batch creation feature</a> was introduced that generates multiple output files based on a specified item count constraint: whenever a file's item count exceeds the limit, a fresh file is created to store further items. What I intend to do is add more delivery triggers, namely a time duration limit and a file size limit, while also providing users with a way to create their own custom triggers. </p> <h3>The Plan</h3> <p> To support custom batch delivery triggers, I planned a base <code>BatchHandler</code> class that acts as the default handler for batch creation for triggers such as file size, item count and time duration. This makes it easy to replace the base class with a custom class, if the user desires to use one. </p> <h3>My Work So Far</h3> <p> I have created a base class with the planned methods for batch handling. Next was modifying the <code>FeedExporter</code> so that it can use the <code>BatchHandler</code> classes to determine when to start a new batch. So far it works for the item count and file size limits, but there have been some problems implementing a trigger for time duration.<br> <br> Initially I came up with a very naive approach: update the time elapsed for a batch whenever an item is scraped, and create a new batch if the limit has been passed. Obvious problems can be deduced from that sentence itself. </p>
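<p> A hypothetical condensation of that naive approach, with made-up names (this is not the PR code): </p>
<pre><code>import time

class NaiveDurationTrigger:
    """Sketch of the naive idea: elapsed time is only ever
    checked from inside the item-scraped hook."""

    def __init__(self, max_seconds):
        self.max_seconds = max_seconds
        self.batch_started = time.time()

    def item_scraped(self, item):
        # This check runs only when an item arrives, so a slow spider
        # can overshoot max_seconds by an arbitrary amount.
        if time.time() - self.batch_started >= self.max_seconds:
            self.start_new_batch()
            self.batch_started = time.time()

    def start_new_batch(self):
        ...  # close the current batch file and open the next one
</code></pre>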
<p> There are no guarantees that an item is scraped instantaneously, so a batch's duration may very well cross the limit by a lot by the time the next item is scraped.<br> <br> One possible solution I have been considering is to "schedule" a call to the batch-creating function after the specified duration has elapsed. If some other declared constraint is breached first, a new batch will be created at that point and the timer reset. The Twisted framework's <code>task.deferLater</code> or <code>reactor.callLater</code> could be the answer to my problem. But I think these methods are not thread-safe and will need some sort of mutex to guarantee the safety of the code. I will need to come up with a control-flow plan to ensure the batches are created on time without threatening the safety of the code. </p>
siddharthadr11@gmail.com (siddharthadr), Sun, 18 Jul 2021 19:10:40 +0000

Weekly Check-in #4
https://blogs.python-gsoc.org/en/siddharthadrs-blog/weekly-check-in-4-19/
<p> <b>1. What did you do this week?</b><br> A. I extended the tests for Post-Processing and completed its documentation. The feature was completed and approved by my mentor. </p> <p> <b>2. What is coming up next?</b><br> A. Now only the last proposed feature, Batch Triggers, remains. I will polish its user API and start the implementation process. </p> <p> <b>3. Did you get stuck anywhere?</b><br> A. Nothing major this week. </p>
siddharthadr11@gmail.com (siddharthadr), Mon, 12 Jul 2021 07:56:26 +0000

Blog Post #2
https://blogs.python-gsoc.org/en/siddharthadrs-blog/blog-post-2-2/
<p> Last week I completed most of my second proposed feature, "Post-Processing". Let me run you through how it happened. </p> <h3> The Idea </h3> <p> This feature is actually an extension of a compression-support idea. Initially it was supposed to be just the addition of a compression feature, to compress the feeds with the help of different compression algorithms. While discussing this idea with my mentor, he suggested general post-processing rather than just compression, which I quite liked and expanded a little. A few back-and-forth discussions with my mentor and the idea grew concrete. </p> <h3> The Plan </h3> <p> The finalized plan was that the processing of a feed would be done by "plugins", which work in a pipeline-ish way, passing the processed data from one plugin to the next until it is finally written to the target file. The plugins are managed by a single plugin manager, which loads all the user-declared plugins and connects each plugin's output to the next-in-line plugin's input. The plugin manager acts as a wrapper around the feed's storage, so whenever a write is invoked on the storage, the data first goes through the manager, which sends it sequentially through all the plugins until it finally reaches the destination file, processed as intended. </p> <h3> The Implementation </h3> <p> The plugin interface was decided first, and subsequently the built-in plugins were created: <code>GzipPlugin</code>, <code>LZMAPlugin</code> and <code>Bz2Plugin</code>. These plugins are, as you can see, compression based. Parameter passing was achieved through the <code>feed_options</code> dictionary, the feed-specific options, where users can declare the parameters. </p>
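<p> For example, a feed can declare its plugins and their parameters in the settings. The sketch below uses the option names as they were eventually documented in Scrapy (<code>postprocessing</code>, <code>gzip_compresslevel</code>); the file name and level are placeholders: </p>
<pre><code># settings.py: compress a JSON Lines feed with the built-in gzip plugin
FEEDS = {
    "items.jsonl.gz": {
        "format": "jsonlines",
        "postprocessing": [
            "scrapy.extensions.postprocessing.GzipPlugin",
        ],
        # plugin parameters live in the same feed_options dictionary
        "gzip_compresslevel": 5,
    },
}
</code></pre>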
<p> Next came the testing phase, where I initially made one all-in-one test method for all the post-processing. My mentor suggested creating a more elaborate test class instead of one mega test method. Documentation is new to me, so my newbie skills were reflected in my poor documentation attempts. Fortunately, I have a cool mentor with a keen eye for detail, so there was valuable input in my code reviews, from which I got to learn a lot. </p>
siddharthadr11@gmail.com (siddharthadr), Sun, 04 Jul 2021 20:26:53 +0000

Weekly Check-in #3
https://blogs.python-gsoc.org/en/siddharthadrs-blog/weekly-check-in-3-23/
<p> <b>1. What did you do this week?</b><br> A. This week I created the plugin manager <code>PostProcessingManager</code>, the <code>Plugin</code> interface and the built-in compression plugins (gzip, bzip2, lzma). PR available here: <a href="https://github.com/scrapy/scrapy/pull/5190">#5190</a>. </p> <p> <b>2. What is coming up next?</b><br> A. This week I will focus on the tests for the created plugins and manager. </p> <p> <b>3. Did you get stuck anywhere?</b><br> A. While implementing the manager and the built-in plugins, I had assumed that only two methods would ever be used: <code>write</code> and <code>close</code>. Well, it turned out that one of the <a href="https://docs.scrapy.org/en/latest/topics/exporters.html#module-scrapy.exporters">exporters</a> (used for exporting a scraped item to a file), <code>CsvItemExporter</code>, also wraps the storage file in <code>io.TextIOWrapper</code>, and <b>that</b> expects an object with at least the methods and attributes of <code>io.IOBase</code>. The problem was fixed by making <code>PostProcessingManager</code> subclass <code>io.IOBase</code> and providing a <code>writable</code> method to indicate that the object is open for writing. </p>
siddharthadr11@gmail.com (siddharthadr), Sun, 27 Jun 2021 19:39:19 +0000

Blog Post #1
https://blogs.python-gsoc.org/en/siddharthadrs-blog/blog-post-1-2/
<p> Going into the 3rd week of GSoC, the <a href="https://github.com/scrapy/scrapy/pull/5178">Item Filter feature</a> is almost complete and ready to go. In this post I'll give a rundown of how I implemented it. </p> <h3>What is this feature?</h3> <p> When Scrapy scrapes some data from the interwebs, it saves the scraped items to the declared <a href="https://docs.scrapy.org/en/latest/topics/feed-exports.html#feed-exports">feed storages</a>. Scrapy provides a lot of flexibility in choosing what this storage can be, ranging from simple CSV files to S3 cloud storage. But currently, every time some data is scraped, it is stored in all the storages. Some users may need to store certain types of data in certain storages, so let's make it convenient for them to do so. </p> <h3>How to implement such a feature?</h3> <p> We can introduce a filter for each storage, so that whenever some data is scraped, the storage can check whether the data is acceptable and, if it is, export it. Well, that's really simple, right? So how do we do it?<br> Let's create a class called <i>ItemFilter</i> that will handle the filtering process, and assign this class to a declared storage. As we can have more than one data storage, we can assign different <i>ItemFilter</i> classes to different storages, each filter having its own logic for filtering scraped data. </p>
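<p> A minimal sketch of such a filter (the class path and the <code>accepts</code> method follow the API as it was eventually merged into Scrapy; the price-based logic and file names are made-up examples, assuming dict-like items): </p>
<pre><code>from scrapy.extensions.feedexport import ItemFilter

class PricedItemFilter(ItemFilter):
    # accept only items that carry a price field (hypothetical logic)
    def accepts(self, item):
        return item.get("price") is not None

# settings.py: each storage can get its own filter
FEEDS = {
    "priced-items.csv": {
        "format": "csv",
        "item_filter": PricedItemFilter,
    },
    "all-items.json": {
        "format": "json",  # no filter: receives every scraped item
    },
}
</code></pre>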
<p> So we create our filter, assign it to one of the storages in the <a href="https://docs.scrapy.org/en/latest/topics/feed-exports.html#settings">settings.py</a> file, and voila. Another convenient thing we can do is let users simply declare which kinds of <a href="https://docs.scrapy.org/en/latest/topics/items.html">data <i>Item</i></a> they wish to accept for a particular storage: they can just put a list of Item classes for a storage in the settings.py file instead of creating a filter class. </p> <h3>My Work</h3> <p> As this was a straightforward implementation, there wasn't much difficulty in implementing this feature, apart from some design changes. The testing phase, though, was a little different from what I'm familiar with. For testing the feed storages, we need to decorate every test method with <a href="https://twistedmatrix.com/documents/13.2.0/api/twisted.internet.defer.inlineCallbacks.html"><i>@defer.inlineCallbacks</i></a>. It is part of the Twisted framework and is used when you want to write sequential code while using Deferred objects. This was new to me and I had to read about it and why it is used; I got to learn about Deferred objects and how to write code with them. This was also the first time I wrote documentation in reStructuredText format. </p>
siddharthadr11@gmail.com (siddharthadr), Sun, 20 Jun 2021 16:07:27 +0000

Weekly Check-in #2
https://blogs.python-gsoc.org/en/siddharthadrs-blog/weekly-check-in-2-12/
<p> The weather's just great here now that the monsoon has arrived. I can enjoy the rain while sipping coffee and coding for GSoC, with some cozy tunes to go along. </p> <p> 1. What did you do this week?<br> A. The ItemChecker class that I proposed has been included in the codebase, with the appropriate modifications in <code>FeedExporter</code> so that it can use those filter classes. I pushed the code and created a draft PR. It can be accessed here: <a href="https://github.com/scrapy/scrapy/pull/5178">#5178</a>. My mentor left some feedback, which I agreed with. </p> <p> 2. What is coming up next?<br> A. I will look into my mentor's feedback and improve the code. After that I'll continue with adding tests and documentation. </p> <p> 3. Did you get stuck anywhere?<br> A. I got stuck once while manually testing the new implementation. I had added a class attribute called item_classes to the ItemChecker class, which I intended to be different for every feed slot. The problem I was facing was that whenever I changed the item_classes attribute through one instance, the change was reflected in the other instances as well. The problem was quite trivial, actually, and I should have known why it was happening: class attributes are shared among all instances of a class. Along the way I also learned how default parameter values work in Python, which comes down to the same sharing of a single mutable object. For the curious, follow this <a href="https://stackoverflow.com/questions/1132941/least-astonishment-and-the-mutable-default-argument">Stackoverflow thread</a> to understand more about default parameter values in Python. </p>
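<p> For illustration, a minimal self-contained reproduction of the pitfall (not the actual Scrapy code; the string stands in for an Item class): </p>
<pre><code>class ItemChecker:
    item_classes = set()  # class attribute: one set shared by every instance

a = ItemChecker()
b = ItemChecker()
a.item_classes.add("ProductItem")  # mutates the shared, class-level set...
print(b.item_classes)              # {'ProductItem'}: ...so b sees it too

class FixedItemChecker:
    def __init__(self):
        self.item_classes = set()  # instance attribute: one set per instance
</code></pre>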
siddharthadr11@gmail.com (siddharthadr), Mon, 14 Jun 2021 18:40:15 +0000

Weekly Check-in #1
https://blogs.python-gsoc.org/en/siddharthadrs-blog/weekly-check-in-1-8/
<p> Alright then, as this is the first ever post for GSoC, let me introduce myself and what I'm doing. My name is D R Siddhartha; friends call me DR (yeah, those two letters). I like running and hiking (hopefully I get to hike Iceland one day). I'll be working on Scrapy, a web-crawling framework. Specifically, I'll be working on its feed features, adding enhancements. </p> <p> Now for the 3 mandatory questions:<br> 1. What did you do this week?<br> A. I started the week by getting a vaccine shot. The side effects manifested the next day and were horrendous, so I had to delay the scheduled meeting. But my mentor and I have already been discussing the API designs for the enhancements. We made the discussions public on GitHub so other people can join in as well. The discussions are available here: <a href="https://github.com/scrapy/scrapy/issues/5168">#5168</a>, <a href="https://github.com/scrapy/scrapy/issues/5161">#5161</a> and <a href="https://github.com/scrapy/scrapy/issues/5169">#5169</a>. </p> <p> 2. What is coming up next?<br> A. The coding phase starts now. I'll be starting work on the first of the 3 features I'll be implementing: Item Filters. Some API discussion has already been done, so now the implementation work starts. </p> <p> 3. Did you get stuck anywhere?<br> A. Not so far. </p>
siddharthadr11@gmail.com (siddharthadr), Mon, 07 Jun 2021 19:06:01 +0000