Last week I did not implement new features but rather incorporated feedback from my mentors. In this blog post I want to write about one specific problem that I encountered last week.
To connect to Redis, Scrapy needs to know the hostname and port, which are project specific and part of the settings object. The settings object is not global; it must be passed through wherever it is needed. Classes in Scrapy often come with factory methods such as from_settings(). These class methods allow creating an object through a common interface and also inject the crawler or settings object.
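To illustrate the idea behind this factory pattern, here is a minimal, self-contained sketch. The class and setting names (RedisBackend, REDIS_HOST, REDIS_PORT) are made up for the example and are not Scrapy's actual API:

```python
class Settings:
    """Stand-in for Scrapy's settings object."""

    def __init__(self, values):
        self._values = values

    def get(self, key, default=None):
        return self._values.get(key, default)


class RedisBackend:
    """Hypothetical class that reads its configuration from settings."""

    def __init__(self, host, port):
        self.host = host
        self.port = port

    @classmethod
    def from_settings(cls, settings):
        # The factory extracts what it needs from the settings object,
        # so callers only pass settings around, not individual values.
        return cls(
            host=settings.get("REDIS_HOST", "localhost"),
            port=settings.get("REDIS_PORT", 6379),
        )


backend = RedisBackend.from_settings(Settings({"REDIS_HOST": "redis.example"}))
```

The caller never touches the individual connection parameters; it only hands over the settings object, which is exactly why the settings must be reachable from wherever the object is constructed.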
This is also the case for the ScrapyRequestQueue classes (of which _RedisQueue is one):
```python
class ScrapyRequestQueue(queue_class):
    def __init__(self, crawler, key):
        self.spider = crawler.spider
        super(ScrapyRequestQueue, self).__init__(key)

    @classmethod
    def from_crawler(cls, crawler, key, *args, **kwargs):
        return cls(crawler, key)
```
The queue object is created by the
from_crawler() factory method which has access to the crawler object. The constructor sets the
spider instance variable and calls the parent constructor.
The problem is that instances of the class PickleFifoRedisQueueNonRequest do not have the spider variable set; only instances of PickleFifoRedisQueue do (the NonRequest version of the class is used for testing, while the full class is used for storing the actual requests).
My solution to accommodate both use cases was as follows:
```python
class _RedisQueue(ABC):
    @classmethod
    def from_settings(cls, settings, path):
        return cls(path, settings)

    def __init__(self, path, settings=None):
        # (...)
        # If called from the from_crawler() method, self.spider is set.
        # If called from the from_settings() method, settings is given.
        if not settings:
            settings = self.crawler.settings
```
Depending on how the object was constructed, I relied on either spider being set or settings being given as an argument. The problem with this approach, as my mentors pointed out, is that different code paths are executed when the code is tested and when it is actually used. Preserving backwards compatibility for other queues (i.e. not adding a new required parameter) while also coming up with a clean way to implement this requirement was not straightforward, partially due to the complex inheritance hierarchy.
I managed to come up with the following solution:
```python
class ScrapyRequestQueue(queue_class):
    def __init__(self, crawler, key):
        self.spider = crawler.spider
        super(ScrapyRequestQueue, self).__init__(key, crawler.settings)

    # ...


class SerializableQueue(queue_class):
    def __init__(self, path, settings=None, *args, **kwargs):
        self.settings = settings
        super(SerializableQueue, self).__init__(path, *args, **kwargs)

    # ...


class _RedisQueue(ABC):
    def __init__(self, path):
        # Access settings via self.settings regardless of how the object
        # was created.
        ...
```
I pass the settings in the call to the super constructor from the ScrapyRequestQueue class (i.e. PickleFifoRedisQueue) and accept them as an optional parameter in the constructor of the parent class (i.e. PickleFifoRedisQueueNonRequest). This way the settings object can be accessed in both cases while preserving backwards compatibility. This might seem like an obvious solution, but it still took some tinkering to come up with.
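To make the two construction paths concrete, here is a simplified, runnable sketch of the idea. The class names mirror the post, but the bodies are illustrative stand-ins (FakeCrawler in particular is invented for the example), not Scrapy's actual implementation:

```python
class SerializableQueue:
    """Parent class: accepts settings as an optional parameter."""

    def __init__(self, path, settings=None):
        self.path = path
        self.settings = settings  # stored unconditionally


class ScrapyRequestQueue(SerializableQueue):
    """Child class: forwards the crawler's settings to the parent."""

    def __init__(self, crawler, key):
        self.spider = crawler.spider
        super(ScrapyRequestQueue, self).__init__(key, crawler.settings)

    @classmethod
    def from_crawler(cls, crawler, key):
        return cls(crawler, key)


class FakeCrawler:
    """Minimal stand-in for Scrapy's crawler object."""

    def __init__(self, spider, settings):
        self.spider = spider
        self.settings = settings


# Test path: the queue is constructed directly with a settings object.
q1 = SerializableQueue("queue-dir", settings={"REDIS_HOST": "localhost"})

# Production path: the queue is constructed via from_crawler().
crawler = FakeCrawler(spider="myspider", settings={"REDIS_HOST": "localhost"})
q2 = ScrapyRequestQueue.from_crawler(crawler, "queue-dir")

# Both paths end up with the same self.settings attribute.
assert q1.settings == q2.settings
```

Because SerializableQueue stores the settings unconditionally, the code that reads self.settings is identical in tests and in production, which removes the divergent code paths my mentors objected to.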
This week I'm working on documentation so that users know how to use Redis as an external queue. I will benchmark the implementation to see how it fares against the disk-based queues and I will also add tests.