Blog post for week 1: Introducing support for Redis

Lukas0907
Published: 06/08/2020

Scrapy uses queues for handling requests. The scheduler pushes requests to the queue and pops them from the queue when the next request is ready to be made. At the moment, Scrapy has no built-in support for external message queues (e.g. Redis, Kafka, etc.). However, there are external libraries (https://github.com/rmax/scrapy-redis and others) that bridge Scrapy with external message queues.

The goal of the first week was to implement a new disk-based queue with Redis as the message queue backend. In this first iteration, which happened last week, Redis is used only for storing and retrieving requests. Metadata (request fingerprints, etc.) is still saved on disk in the directory given by the JOBDIR setting. If the SCHEDULER_DISK_QUEUE setting is set to the name of a Redis-based class, e.g. scrapy.squeues.PickleFifoRedisQueue, the Redis-based implementation is used as the queue backend.
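As a rough illustration, a project's settings.py might select the Redis-backed queue like this. The class path is the one named above and belongs to this work in progress; the JOBDIR value is a made-up example path.

```python
# settings.py (sketch)

# Directory where Scrapy keeps job state. In this first iteration the
# request fingerprints and other metadata still live here.
# "crawls/example-run-1" is a hypothetical path chosen for illustration.
JOBDIR = "crawls/example-run-1"

# Use the Redis-backed FIFO queue in place of the default disk queue.
SCHEDULER_DISK_QUEUE = "scrapy.squeues.PickleFifoRedisQueue"
```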

Implementation

The classes PickleFifoRedisQueue and PickleLifoRedisQueueNonRequest are wrappers around the actual Redis queue classes _FifoRedisQueue, _LifoRedisQueue and _RedisQueue, which handle connecting to Redis and issuing commands. The only difference between a FIFO and a LIFO queue is the side from which an element is popped after it has been pushed to the queue (the left side for a LIFO queue, the right side for a FIFO queue). Both queues are therefore based on the common abstract base class _RedisQueue, which contains most of the code. The exception is the pop() method, which is abstract in the base class and implemented in _FifoRedisQueue and _LifoRedisQueue. The implementation uses the redis-py library (https://pypi.org/project/redis/) under the hood. redis-py is the Python client recommended by the Redis project.
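The class layout described above could look roughly like the sketch below. It is not the actual code from the branch: the constructor signature, the choice to always push on the left, and the InMemoryListClient helper are assumptions made for illustration. The helper only imitates the four redis-py list commands used here (lpush, lpop, rpop, llen) so that the sketch can be tried without a running server.

```python
from abc import ABC, abstractmethod
from collections import deque


class _RedisQueue(ABC):
    """Shared logic; subclasses only decide which end to pop from."""

    def __init__(self, client, key):
        # `client` is assumed to behave like redis.Redis for list commands.
        self.client = client
        self.key = key

    def push(self, data):
        # Assumption: always push on the left side of the Redis list.
        self.client.lpush(self.key, data)

    @abstractmethod
    def pop(self):
        """Return the next element, or None if the queue is empty."""

    def __len__(self):
        return self.client.llen(self.key)


class _FifoRedisQueue(_RedisQueue):
    def pop(self):
        # Right side holds the oldest element: first in, first out.
        return self.client.rpop(self.key)


class _LifoRedisQueue(_RedisQueue):
    def pop(self):
        # Left side holds the newest element: last in, first out.
        return self.client.lpop(self.key)


class InMemoryListClient:
    """Hypothetical stand-in for redis.Redis, list commands only."""

    def __init__(self):
        self._lists = {}

    def lpush(self, key, value):
        self._lists.setdefault(key, deque()).appendleft(value)
        return len(self._lists[key])

    def lpop(self, key):
        items = self._lists.get(key)
        return items.popleft() if items else None

    def rpop(self, key):
        items = self._lists.get(key)
        return items.pop() if items else None

    def llen(self, key):
        return len(self._lists.get(key, ()))


client = InMemoryListClient()
fifo = _FifoRedisQueue(client, "demo:fifo")
lifo = _LifoRedisQueue(client, "demo:lifo")
for payload in (b"first", b"second"):
    fifo.push(payload)
    lifo.push(payload)
```

With a real server, the stand-in would be replaced by a redis-py client, for example redis.Redis.from_url("redis://localhost:6379/0"). Keeping push() in the base class and leaving only pop() abstract mirrors the split described above.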

Testing

Although I was planning to write tests a bit later this month, I had some time and already experimented with testing the Redis integration. Scrapy already comes with tests for generic memory and disk-based queues. An additional requirement for a Redis queue is that the tests need a running redis-server. On a CI service like Travis CI, this can be achieved by enabling redis-server in the CI's configuration file. However, the tests should also run outside of the CI with little manual intervention. The usual approach in the Scrapy test code base is therefore to start and stop a process if the code under test needs it. Further, tests should not be executed if redis-server is not available. Fortunately, pytest supports skipping tests based on a condition. I added a function that checks for the condition, and added decorators to the appropriate tests so that they are skipped if redis-server is not available.
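A minimal sketch of such a skip condition is shown below. It assumes that "available" simply means a redis-server executable can be found on the PATH; the names redis_server_available and requires_redis_server are made up for this example, and the actual check in the branch may differ.

```python
import shutil
import subprocess

import pytest


def redis_server_available():
    # Assumption: the server counts as available if the binary is on PATH.
    return shutil.which("redis-server") is not None


# Reusable decorator: tests marked with it are skipped, not failed,
# when redis-server is missing.
requires_redis_server = pytest.mark.skipif(
    not redis_server_available(),
    reason="redis-server is not installed",
)


@requires_redis_server
def test_redis_server_binary_runs():
    # Only runs when the binary exists. `--version` prints version
    # information and exits without starting a server.
    result = subprocess.run(
        ["redis-server", "--version"],
        capture_output=True,
        text=True,
    )
    assert result.returncode == 0
```

The condition is evaluated once, when the test module is imported, so the check itself stays cheap even if many tests share the decorator.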

Outlook

This week I am working on saving even more data in Redis and on no longer storing meta information about the crawl job on the file system. The idea is to use Redis not only as a queue for requests but also as a store for meta information that needs to persist between crawls.
