Last week I did not implement new features but rather incorporated feedback from my mentors. In this blog post I want to write about one specific problem that I encountered along the way.
To connect to Redis, Scrapy needs to know the hostname and port, which are project specific and part of the settings object. The settings object is not global; it must be passed through wherever it is needed. Classes in Scrapy often come with the factory methods from_crawler() and from_settings(). These class methods allow creating an object through a common interface and also inject the crawler or settings object.
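To make this concrete, here is a minimal sketch of what such a component could look like. The RedisConnectionInfo class and the REDIS_HOST/REDIS_PORT setting names are placeholders I made up for illustration; only the factory-method pattern itself is the point:

import scrapy  # noqa: F401  (only needed so Settings is available)
from scrapy.settings import Settings

class RedisConnectionInfo:
    # Hypothetical component, not part of Scrapy.
    def __init__(self, host, port):
        self.host = host
        self.port = port

    @classmethod
    def from_settings(cls, settings):
        # The settings object is injected instead of being read from a global.
        return cls(settings.get("REDIS_HOST", "localhost"),
                   settings.getint("REDIS_PORT", 6379))

    @classmethod
    def from_crawler(cls, crawler):
        # The crawler carries the settings, so delegate to from_settings().
        return cls.from_settings(crawler.settings)

info = RedisConnectionInfo.from_settings(Settings({"REDIS_PORT": 6380}))
assert (info.host, info.port) == ("localhost", 6380)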
This is also the case for the ScrapyRequestQueue classes (of which _RedisQueue is one):
class ScrapyRequestQueue(queue_class):
    def __init__(self, crawler, key):
        self.spider = crawler.spider
        super(ScrapyRequestQueue, self).__init__(key)

    @classmethod
    def from_crawler(cls, crawler, key, *args, **kwargs):
        return cls(crawler, key)
The queue object is created by the from_crawler() factory method, which has access to the crawler object. The constructor sets the spider instance variable and calls the parent constructor.
The problem is that instances of the class PickleFifoRedisQueueNonRequest do not have the spider variable set; only instances of PickleFifoRedisQueue do (the NonRequest version of the class is used for testing, while the full class is used for storing the Request objects).
My solution to accommodate both use cases was as follows:
class _RedisQueue(ABC):
    @classmethod
    def from_settings(cls, settings, path):
        return cls(path, settings)

    def __init__(self, path, settings=None):
        # (...)
        # If called from the from_crawler() method, self.spider is set.
        # If called from the from_settings() method, settings is given.
        if not settings:
            settings = self.spider.crawler.settings
Depending on how the object was constructed, I relied either on the spider variable being set or on the settings being given as an argument. The problem with this approach, as pointed out by my mentors, is that different code paths are executed when the code is tested and when it is actually used. Preserving backwards compatibility for the other queues (i.e. not adding a new required parameter) while also coming up with a clean way to implement this requirement was not straightforward, partially due to the complex inheritance hierarchy.
I managed to come up with the following solution:
class ScrapyRequestQueue(queue_class):
    def __init__(self, crawler, key):
        self.spider = crawler.spider
        super(ScrapyRequestQueue, self).__init__(key, crawler.settings)

    # ...

class SerializableQueue(queue_class):
    def __init__(self, path, settings=None, *args, **kwargs):
        self.settings = settings
        super(SerializableQueue, self).__init__(path, *args, **kwargs)

    # ...

class _RedisQueue(ABC):
    def __init__(self, path):
        # Accessing settings via self.settings regardless of how the
        # object is created.
I pass the settings in the call to the super constructor from the ScrapyRequestQueue class (i.e. PickleFifoRedisQueue) and accept it as an optional parameter in the constructor of the parent class SerializableQueue (i.e. PickleFifoRedisQueueNonRequest). This way the settings object can be accessed in both cases while preserving backwards compatibility. This might seem like an obvious solution, but it still took some tinkering to come up with.
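To illustrate why this works in both cases, here is a small self-contained sketch of the two construction paths. The crawler and spider are faked and the class bodies reduced to the essentials, so this is a sketch of the idea rather than the real implementation:

class _RedisQueue:
    # Reads settings via self.settings no matter which path created it.
    def __init__(self, path):
        self.path = path
        self.host = self.settings.get("REDIS_HOST", "localhost")

class SerializableQueue(_RedisQueue):
    def __init__(self, path, settings=None):
        self.settings = settings  # set before the parent constructor runs
        super().__init__(path)

class ScrapyRequestQueue(SerializableQueue):
    def __init__(self, crawler, key):
        self.spider = crawler.spider
        super().__init__(key, crawler.settings)

    @classmethod
    def from_crawler(cls, crawler, key):
        return cls(crawler, key)

class _FakeSpider:
    name = "example"

class _FakeCrawler:
    spider = _FakeSpider()
    settings = {"REDIS_HOST": "redis.example"}  # a Settings object in Scrapy

# Production path: the settings travel through the super() calls.
q1 = ScrapyRequestQueue.from_crawler(_FakeCrawler(), "requests")
# Test path: the settings are passed explicitly as the optional parameter.
q2 = SerializableQueue("requests", settings={"REDIS_HOST": "redis.example"})
assert q1.host == q2.host == "redis.example"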
This week I'm working on documentation so that users know how to use Redis as an external queue. I will benchmark the implementation to see how it fares against the disk-based queues, and I will also add tests.