Blog post for week 3: Passing settings around the right way

Lukas0907
Published: 06/22/2020

Last week I did not implement new features but rather incorporated feedback from my mentors. In this blog post I want to write about one specific problem that I encountered last week.

To connect to Redis it is necessary for Scrapy to know the hostname and port which are project specific and part of the settings object. The settings object is not global; it must be passed through whenever it is needed. Classes in Scrapy often come with the factory methods from_crawler() and from_settings(). These class methods allow to create an object from a common interface and also inject the crawler or settings object.

This is also the case for ScrapyRequestQueue classes (of which _RedisQueue is one of them):

class ScrapyRequestQueue(queue_class):

    def __init__(self, crawler, key):
        self.spider = crawler.spider
        super(ScrapyRequestQueue, self).__init__(key)

    @classmethod
    def from_crawler(cls, crawler, key, *args, **kwargs):
        return cls(crawler, key)

The queue object is created by the from_crawler() factory method which has access to the crawler object. The constructor sets the spider instance variable and calls the parent constructor.

The problem is that instances of the class PickleFifoRedisQueueNonRequest do not have the spider variable set, only instances of PickleFifoRedisQueue (the NonRequest version of the class is used for testing while the full class is used for storing the Request objects).

My solution to accommodate both use case was now as follows:

class _RedisQueue(ABC):

    @classmethod
    def from_settings(cls, settings, path):
        return cls(path, settings)

    def __init__(self, path, settings=None):
        # (...)

        # If called from from_crawler() method, self.spider is set.
        # If called from from_settings() method, settings is given.
        if not settings:
            settings = self.crawler.settings

Depending on how the object was constructed, I was relying on the fact that spider was set or the settings given as an argument. The problem with this approach, as pointed out by mentors, is that different code paths are executed when the code is tested and when it is actually used. Preserving backwards compatibility for other queues (i.e. not adding a new required parameter) and also coming up with a clean way to implement this requirement was not straight forward, partially also due to the complex inheritance hierarchy.

I managed to come up with the following solution:

class ScrapyRequestQueue(queue_class):

    def __init__(self, crawler, key):
        self.spider = crawler.spider
        super(ScrapyRequestQueue, self).__init__(key, crawler.settings)

# ...

class SerializableQueue(queue_class):

    def __init__(self, path, settings=None, *args, **kwargs):
        self.settings = settings
        super(SerializableQueue, self).__init__(path, *args, **kwargs)

# ...

class _RedisQueue(ABC):

    def __init__(self, path):
        # Accessing settings via self.settings regardless of how the object is created

I pass the settings in the call to the super constructor from the ScrapyRequestQueue class (i.e. PickleFifoRedisQueue) and accept it as an optional parameter in the constructor of the parent class SerializableQueue (i.e. PickleFifoRedisQueueNonRequest). This way the settings object can be accessed in both cases while preserving backwards compatibility. This might seem like an obvious solution but still took some tinkering to come up.

This week I'm working on documentation so that users know how to use Redis as an external queue. I will benchmark the implementation to see how it fares against the disk-based queues and I will also add tests.

1000 characters left