Lukas0907's Blog

Blog post for week 3: Passing settings around the right way

Lukas0907
Published: 06/22/2020

Last week I did not implement new features but rather incorporated feedback from my mentors. In this blog post I want to write about one specific problem that I encountered last week.

To connect to Redis, Scrapy needs to know the hostname and port, which are project specific and part of the settings object. The settings object is not global; it must be passed through wherever it is needed. Classes in Scrapy often come with the factory methods from_crawler() and from_settings(). These class methods make it possible to create an object through a common interface and also inject the crawler or settings object.
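For illustration, a class that needs the Redis connection parameters could obtain them from the settings object roughly like this (a minimal sketch; the setting names REDIS_HOST and REDIS_PORT are placeholders, not necessarily what the final implementation uses):

import redis  # redis-py client library


class RedisConnectionExample:

    def __init__(self, host, port):
        self.client = redis.Redis(host=host, port=port)

    @classmethod
    def from_settings(cls, settings):
        # REDIS_HOST/REDIS_PORT are placeholder setting names.
        host = settings.get("REDIS_HOST", "localhost")
        port = settings.getint("REDIS_PORT", 6379)
        return cls(host, port)

    @classmethod
    def from_crawler(cls, crawler):
        # The crawler object carries the project settings.
        return cls.from_settings(crawler.settings)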

This is also the case for ScrapyRequestQueue classes (of which _RedisQueue is one):

class ScrapyRequestQueue(queue_class):

    def __init__(self, crawler, key):
        self.spider = crawler.spider
        super(ScrapyRequestQueue, self).__init__(key)

    @classmethod
    def from_crawler(cls, crawler, key, *args, **kwargs):
        return cls(crawler, key)

The queue object is created by the from_crawler() factory method which has access to the crawler object. The constructor sets the spider instance variable and calls the parent constructor.

The problem is that instances of the class PickleFifoRedisQueueNonRequest do not have the spider variable set; only instances of PickleFifoRedisQueue do (the NonRequest version of the class is used for testing, while the full class is used for storing the Request objects).

My initial solution to accommodate both use cases was as follows:

class _RedisQueue(ABC):

    @classmethod
    def from_settings(cls, settings, path):
        return cls(path, settings)

    def __init__(self, path, settings=None):
        # (...)

        # If called from from_crawler() method, self.spider is set.
        # If called from from_settings() method, settings is given.
        if not settings:
            settings = self.spider.crawler.settings

Depending on how the object was constructed, I relied either on the fact that spider was set or on the settings being passed as an argument. The problem with this approach, as pointed out by my mentors, is that different code paths are executed when the code is tested than when it is actually used. Preserving backwards compatibility for other queues (i.e. not adding a new required parameter) while also coming up with a clean way to implement this requirement was not straightforward, partially due to the complex inheritance hierarchy.

I managed to come up with the following solution:

class ScrapyRequestQueue(queue_class):

    def __init__(self, crawler, key):
        self.spider = crawler.spider
        super(ScrapyRequestQueue, self).__init__(key, crawler.settings)

# ...

class SerializableQueue(queue_class):

    def __init__(self, path, settings=None, *args, **kwargs):
        self.settings = settings
        super(SerializableQueue, self).__init__(path, *args, **kwargs)

# ...

class _RedisQueue(ABC):

    def __init__(self, path):
        # Accessing settings via self.settings regardless of how the object is created

I pass the settings in the call to the super constructor from the ScrapyRequestQueue class (i.e. PickleFifoRedisQueue) and accept it as an optional parameter in the constructor of the parent class SerializableQueue (i.e. PickleFifoRedisQueueNonRequest). This way the settings object can be accessed in both cases while preserving backwards compatibility. This might seem like an obvious solution, but it still took some tinkering to come up with.

This week I'm working on documentation so that users know how to use Redis as an external queue. I will benchmark the implementation to see how it fares against the disk-based queues and I will also add tests.

Check-in for week 2

Lukas0907
Published: 06/15/2020

1. What did you do this week?

I created a new PR that builds on the previous work in the redis branch and introduces the concept of a Persister, which abstracts away the current JOBDIR storage mechanism. The basic idea is that instead of checking whether the JOBDIR setting is configured and using it as a directory where arbitrary data can be read from and written to, a Persister should be used instead. A Persister offers a common interface to its users so they don't have to worry about how the data is persisted.
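A rough sketch of what such an interface could look like (the names load() and store() are illustrative only, not necessarily what the PR uses):

from abc import ABC, abstractmethod


class Persister(ABC):
    """Illustrative sketch of a Persister interface; the actual PR may differ."""

    @abstractmethod
    def load(self, key):
        """Return previously persisted data for the given key, or None."""

    @abstractmethod
    def store(self, key, data):
        """Persist data under the given key (e.g. in JOBDIR on disk or in Redis)."""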

2. What is coming up next?

I plan on stabilizing the PR, incorporating feedback as well as manual and automated testing in different scenarios. I will also take a look at benchmarking so that the PRs do not introduce a performance regression. Also on my list is the Redis connection pooling.

3. Did you get stuck anywhere?

So far everything is working as expected.

Blog post for week 1: Introducing support for Redis

Lukas0907
Published: 06/08/2020

Scrapy uses queues for handling requests. The scheduler pushes requests to the queue and pops them from the queue when the next request is ready to be made. At the moment, there is no support for external message queues (e.g. Redis, Kafka, etc.) implemented in Scrapy; however, there are external libraries (https://github.com/rmax/scrapy-redis and others) that bridge Scrapy with external message queues.

The goal of the first week was to implement a new disk-based queue with Redis as a message queue backend. In the first iteration, which happened last week, Redis is used for storing and retrieving requests. Meta data (request fingerprints, etc.) is still saved on disk in the directory set by the JOBDIR setting. If the setting SCHEDULER_DISK_QUEUE is set to a class name, e.g. scrapy.squeues.PickleFifoRedisQueue, the Redis-based implementation is used as the queue backend.
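In a project's settings.py this could look roughly as follows (SCHEDULER_DISK_QUEUE and JOBDIR are real Scrapy settings; the REDIS_HOST and REDIS_PORT names are placeholders for whatever connection settings the implementation ends up using):

# settings.py
JOBDIR = "crawls/job-1"  # meta data (fingerprints, etc.) is still stored here
SCHEDULER_DISK_QUEUE = "scrapy.squeues.PickleFifoRedisQueue"

# Placeholder names for the Redis connection settings:
REDIS_HOST = "localhost"
REDIS_PORT = 6379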

Implementation

The classes PickleFifoRedisQueue and PickleLifoRedisQueueNonRequest are wrappers around the actual Redis queue classes _FifoRedisQueue, _LifoRedisQueue and _RedisQueue, which handle connecting to Redis and issuing commands. The only difference between a FIFO and a LIFO queue is the side from which an element is popped after it has been pushed to the queue (the left side for a LIFO queue, the right side for a FIFO queue). Therefore the implementation of both queues is based on the common abstract base class _RedisQueue, where most of the code lives (except for the pop() method, which is abstract and implemented in _FifoRedisQueue and _LifoRedisQueue). The implementation uses the redis-py library (https://pypi.org/project/redis/) under the hood; redis-py is the Python client library recommended by the Redis project.
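A minimal sketch of that FIFO/LIFO split might look like the following (heavily simplified; the real classes also handle serialization, key naming and reading the connection parameters from the settings):

import redis
from abc import ABC, abstractmethod


class _RedisQueue(ABC):

    def __init__(self, key, host="localhost", port=6379):
        self.key = key
        self.client = redis.Redis(host=host, port=port)

    def push(self, data):
        # Elements are always pushed to the left side of the Redis list.
        self.client.lpush(self.key, data)

    def __len__(self):
        return self.client.llen(self.key)

    @abstractmethod
    def pop(self):
        ...


class _FifoRedisQueue(_RedisQueue):

    def pop(self):
        # FIFO: pop the oldest element from the right side.
        return self.client.rpop(self.key)


class _LifoRedisQueue(_RedisQueue):

    def pop(self):
        # LIFO: pop the newest element from the left side.
        return self.client.lpop(self.key)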

Testing

Although I was planning to write tests a bit later this month, I had some time and already experimented with testing the Redis integration. Scrapy already comes with tests for generic memory and disk-based queues. An additional requirement in the case of a Redis queue is that a running redis-server is needed for the tests. On a CI service like Travis CI, this can be achieved by enabling redis-server in the CI's configuration file. However, tests should of course also be able to run outside of the CI with little manual intervention. The usual approach in the Scrapy test code base is therefore to start and stop a process if it is needed by the code under test. Furthermore, tests should not be executed at all if redis-server is not available. Fortunately, pytest supports skipping tests based on a condition. I added a function that checks for this condition and added decorators to the appropriate tests so that they are skipped if redis-server is not available.
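In essence, the availability check and the skip decorator look something like this (a simplified sketch, not the exact test code):

import pytest
import redis


def redis_server_available(host="localhost", port=6379):
    # Returns True if a Redis server answers a PING on the given address.
    try:
        return redis.Redis(host=host, port=port, socket_connect_timeout=1).ping()
    except redis.exceptions.ConnectionError:
        return False


@pytest.mark.skipif(not redis_server_available(),
                    reason="redis-server is not available")
def test_fifo_redis_queue():
    ...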

Outlook

This week I am working on saving even more data in Redis and getting rid of storing meta information about the crawl job on the file system. The idea is to use Redis not only as a queue for requests but also as a store for meta information that needs to be persistent between crawls.

Check-in for week 0

Lukas0907
Published: 06/01/2020

1. What did you do this week?

During the community bonding period my mentors and I set up a communication channel, i.e. I joined the GSoC channel in ScrapingHub's Slack instance. I also made sure that I am able to work on the Scrapy code base (all requirements installed, etc.). Today is the first day of working on my project, a basic integration of Redis into Scrapy.

2. What is coming up next?

This week (week 1) I plan on implementing a new disk-based queue with Redis as a message queue backend. On Tuesday I will have the first weekly meeting with my mentors Adrian, Julio and Nikita.

3. Did you get stuck anywhere?

So far everything is working as expected.