Last week I worked on the specification of the queue interface. A queue is expected to behave in certain ways, so we thought it was worthwhile to document these properties. The documentation should make it easier for users to figure out how queues work (especially in the case of an error), and easier for a Scrapy user to develop a custom queue (facade).
A queue consists of five methods:
1. from_crawler(): A commonly used pattern to create an object from a crawler object.
2. push(): Adds a request to the queue. Requests are usually added to the beginning (“left side”) of the queue but this is left to the implementation.
3. pop(): Takes a request from the queue. Requests can be taken from the beginning (“left side”) or the end (“right side”). Depending on how push() is implemented, the result is either a FIFO or a LIFO queue. This is, again, left to the implementation.
4. close(): Releases internal resources. After this, no push() or pop() is expected to happen.
5. __len__(): This dunder method is expected to return the number of elements in the queue. The reason for using a dunder method is that this way len(queue) can be used and “if queue” evaluates to False in case the queue is empty and True otherwise.
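The five methods above can be sketched as an abstract base class plus a small in-memory implementation. This is only an illustration: the class names QueueInterface and MemoryFifoQueue are made up for this post, and returning None from pop() on an empty queue is an assumption of the sketch.

```python
from abc import ABC, abstractmethod
from collections import deque


class QueueInterface(ABC):
    """Hypothetical sketch of the five-method queue interface."""

    @classmethod
    @abstractmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        """Create a queue from a crawler object."""

    @abstractmethod
    def push(self, request):
        """Add a request to the queue."""

    @abstractmethod
    def pop(self):
        """Take a request from the queue."""

    @abstractmethod
    def close(self):
        """Release internal resources."""

    @abstractmethod
    def __len__(self):
        """Return the number of elements in the queue."""


class MemoryFifoQueue(QueueInterface):
    """In-memory FIFO: push on the left side, pop from the right side."""

    def __init__(self):
        self._items = deque()

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # Nothing to configure for a purely in-memory queue.
        return cls()

    def push(self, request):
        self._items.appendleft(request)

    def pop(self):
        # Assumption of this sketch: an empty queue yields None.
        return self._items.pop() if self._items else None

    def close(self):
        self._items.clear()

    def __len__(self):
        return len(self._items)
```

Popping from the left side instead (popleft() on the deque) would turn the same class into a LIFO queue. Because __len__() is defined, an empty instance is falsy, so "if queue" works as described above.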
The tricky part is getting the error handling right. It’s generally a good idea to fail early so that the user has the chance to notice a problem immediately. Therefore the from_crawler() method (or the constructor that it calls) is expected to verify the passed arguments, open a socket/connection if the queue is reached over the network, and send a test command (“ping”). This way, invalid configurations are noticed immediately and not only after the first (failed) push.
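A minimal sketch of the fail-early idea could look like the following. Everything named here is hypothetical: the NetworkQueue and FakeClient classes, the client_factory parameter and the EXTERNAL_QUEUE_URL setting are assumptions made for illustration, and the crawler is replaced by a simple stand-in object whose settings are a plain dict.

```python
from types import SimpleNamespace


class FakeClient:
    """Stand-in for a network client; a real one would open a socket."""

    def __init__(self, url):
        self.url = url

    def ping(self):
        # A real client would send a test command and raise on failure.
        return True


class NetworkQueue:
    def __init__(self, url, client_factory=FakeClient):
        # 1. Verify the passed arguments.
        if not isinstance(url, str) or not url:
            raise ValueError("queue URL must be a non-empty string")
        # 2. Open the connection.
        self._client = client_factory(url)
        # 3. Send a test command; an unreachable server fails here,
        #    not at the first push().
        self._client.ping()

    @classmethod
    def from_crawler(cls, crawler, client_factory=FakeClient):
        # EXTERNAL_QUEUE_URL is a made-up setting name.
        url = crawler.settings.get("EXTERNAL_QUEUE_URL")
        return cls(url, client_factory)


# Usage with a stand-in crawler object.
crawler = SimpleNamespace(settings={"EXTERNAL_QUEUE_URL": "redis://localhost:6379"})
queue = NetworkQueue.from_crawler(crawler)
```

With a missing setting the constructor raises ValueError right away, and a client whose ping() raises lets that exception propagate out of from_crawler().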
The push() method differentiates between temporary problems and permanent problems. Admittedly, the line between the two is a bit blurry and it’s not always clear on which side a problem falls. A dropped connection followed by a failed reconnect can usually be considered a temporary problem: If the connection worked before but does not anymore, then it’s likely that the problem will resolve on its own (e.g. a server or router is temporarily offline due to a power outage). In such a case a TransientError exception is raised, which causes the caller (i.e. the Scrapy scheduler) to fall back to a memory queue. Another kind of temporary error is a serialization problem. In such a case a ValueError should be raised (which also results in a fallback to the memory queue). All other exceptions are not handled by the scheduler; the crawling process is halted.
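The push() behaviour described above might be sketched as follows. TransientError is defined locally here because this post does not say where it lives; FlakyQueue, its online flag and the schedule() helper are hypothetical stand-ins for an external queue and the scheduler, and pickle is just one possible serializer.

```python
import pickle
from collections import deque


class TransientError(Exception):
    """Temporary problem, e.g. a dropped connection (defined here for the sketch)."""


class FlakyQueue:
    """Hypothetical external queue whose connection can be switched off."""

    def __init__(self):
        self.online = True
        self.items = deque()

    def push(self, request):
        try:
            data = pickle.dumps(request)
        except (pickle.PicklingError, TypeError, AttributeError) as exc:
            # Serialization problems are reported as ValueError.
            raise ValueError(f"cannot serialize request: {exc}") from exc
        if not self.online:
            raise TransientError("connection lost and reconnect failed")
        self.items.appendleft(data)


def schedule(request, external_queue, memory_queue):
    """Stand-in for the scheduler: fall back to memory on temporary errors."""
    try:
        external_queue.push(request)
        return "external"
    except (TransientError, ValueError):
        memory_queue.appendleft(request)
        return "memory"
    # Any other exception propagates and halts the crawl.
```

Only the two exception types named in the text are caught; anything else raised by push() is deliberately left unhandled.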
The pop() method also differentiates between temporary and permanent problems but handles them a bit differently than the push() method. In case of a temporary problem, None should be returned. This will cause the caller (i.e. the Scrapy scheduler) to retry later. Exceptions are not handled by the scheduler; the crawling process is halted.
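For pop(), a sketch under similar assumptions could look like this. FlakyPopQueue, its online flag and the _fetch() helper are hypothetical, and ConnectionError stands in for whatever a real client raises on a dropped connection.

```python
from collections import deque


class FlakyPopQueue:
    """Hypothetical external queue whose connection can be switched off."""

    def __init__(self, items):
        self._items = deque(items)
        self.online = True

    def _fetch(self):
        # Stand-in for the network round trip.
        if not self.online:
            raise ConnectionError("connection dropped")
        return self._items.pop()

    def pop(self):
        try:
            return self._fetch()
        except ConnectionError:
            # Temporary problem: signal "nothing right now" to the caller,
            # which is expected to try again later.
            return None
```

The request stays in the queue while the connection is down, so a later pop() can still return it once the connection is back.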
The __len__() method also needs to take problems into account. In case of a temporary error, the method is expected to still return the (last known) number of elements in the queue. This can be implemented by tracking the number of push and pop calls. It’s important that the method does not return 0 because this would cause the queue to be closed.
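Tracking the size locally might be sketched like this. FlakyBackend and CountingQueue are made-up names, TransientError is again defined locally, and the backend methods add(), take() and size() are assumptions of the sketch rather than an existing API.

```python
class TransientError(Exception):
    """Temporary problem (defined here for the sketch)."""


class FlakyBackend:
    """Hypothetical remote store that can be switched offline."""

    def __init__(self):
        self.items = []
        self.online = True

    def _check(self):
        if not self.online:
            raise TransientError("backend unreachable")

    def add(self, item):
        self._check()
        self.items.insert(0, item)

    def take(self):
        self._check()
        return self.items.pop()

    def size(self):
        self._check()
        return len(self.items)


class CountingQueue:
    def __init__(self, backend):
        self._backend = backend
        self._count = 0  # last known number of elements

    def push(self, request):
        self._backend.add(request)
        self._count += 1  # only reached if add() succeeded

    def pop(self):
        try:
            request = self._backend.take()
        except TransientError:
            return None
        self._count -= 1
        return request

    def __len__(self):
        try:
            self._count = self._backend.size()
        except TransientError:
            pass  # keep the last known value instead of reporting 0
        return self._count
```

While the backend is offline, len(queue) keeps reporting the last known size, so "if queue" stays True and the queue is not mistaken for an empty one.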
Next week will bring a separation of Scrapy core and external queues: The idea is to create a project “scrapy-external-queues” which can be installed separately and which provides bridges to other message queues like Redis.