Lukas0907's Blog

Check-in for week 8

Lukas0907
Published: 07/27/2020

1. What did you do this week?

Last week I split the pull request into two parts: one part that goes into the Scrapy core, and one part that goes into a new library, scrapy-external-queues. I also refactored the payload generation code in scrapy-bench to avoid duplicated code and to use random payloads instead of fixed ones.

2. What is coming up next?

The new project scrapy-external-queues needs a bit of cleanup. It also lacks tooling like flake8, tox or Travis. Additionally, I want to add support for RabbitMQ as a second message queue backend.

3. Did you get stuck anywhere?

No.


Blog post for week 7: Queue interface documentation

Lukas0907
Published: 07/20/2020

Last week I was working on the specification of the queue interface. There are some important properties of how a queue is expected to behave, so we thought it was worthwhile to document them. The documentation should make it easier for users to figure out how queues work (especially in the case of an error), but also make it easier for a user of Scrapy to develop a custom queue (facade).

A queue consists of five methods (a sketch of the full interface follows after the list):

1. from_crawler(): A class method following the commonly used Scrapy pattern to create an instance from a Crawler object.

2. push(): Adds a request to the queue. Requests are usually added to the beginning ("left side") of the queue, but this is left to the implementation.

3. pop(): Takes a request from the queue. Requests can be taken from the beginning ("left side") or the end ("right side"). Depending on how push() is implemented, the result is either a FIFO or a LIFO queue. This is, again, left to the implementation.

4. close(): Releases internal resources. After this, no push() or pop() is expected to happen.

5. __len__(): This dunder method is expected to return the number of elements in the queue. The reason for using a dunder method is that this way len(queue) can be used and "if queue" evaluates to False if the queue is empty and to True otherwise.
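Putting the five methods together, a minimal sketch of such a queue interface could look as follows. This is only an illustration of the documented behavior, not the exact code from the pull request:

class BaseRequestQueue:
    """Sketch of the queue interface described above (illustrative only)."""

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        # Create the queue from a Crawler object, typically using crawler.settings.
        return cls(*args, **kwargs)

    def push(self, request):
        # Add a request to the queue; which side it is added to is up to the implementation.
        raise NotImplementedError

    def pop(self):
        # Remove and return a request, or None if the queue is empty or temporarily unavailable.
        raise NotImplementedError

    def close(self):
        # Release internal resources; no push() or pop() is expected afterwards.
        pass

    def __len__(self):
        # Number of elements; len(queue) == 0 also makes "if queue" evaluate to False.
        raise NotImplementedError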

The tricky part is getting the error handling right. It's generally a good idea to fail early so that the user has the chance to notice a problem immediately. Therefore the from_crawler() method (or the constructor that it calls) is expected to verify the passed arguments, open a socket/connection in case the queue is reached via the network, and send a test command ("ping"). This way, invalid configurations are noticed immediately and not only after the first (failed) push.
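For a Redis-backed queue, the fail-early behavior could be sketched like this (using redis-py; the class and setting names are illustrative, not necessarily the ones used in the actual pull request):

import redis
from scrapy.exceptions import NotConfigured


class RedisQueue:
    def __init__(self, url):
        if not url:
            raise NotConfigured("A Redis URL must be configured")
        # Connect and send a test command right away so that an invalid
        # configuration fails before the first push().
        self.client = redis.from_url(url)
        self.client.ping()  # raises redis.exceptions.ConnectionError if unreachable

    @classmethod
    def from_crawler(cls, crawler):
        # "REDIS_URL" is an illustrative setting name.
        return cls(crawler.settings.get("REDIS_URL"))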

The push() method differentiates between temporary problems and permanent problems. Admittedly, the line between the two is a bit blurry and it's not always clear on which side a problem falls. A dropped connection and failed reconnect can usually be considered a temporary problem: If the connection worked before but does not anymore, then it's likely that the problem will be resolved on its own (e.g. the server or router is temporarily offline due to a power outage, etc.). In such a case a TransientError exception is raised, which causes the caller (i.e. the Scrapy scheduler) to fall back to a memory queue. Another kind of temporary error is a serialization problem. In such a case a ValueError should be raised (which also results in a fallback to the memory queue). All other cases are not handled by the scheduler; the crawling process is halted.
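Continuing the RedisQueue sketch from above, push() could handle these cases roughly as follows. TransientError stands for the exception class described in this post and is only defined here as a placeholder; the pickle-based serialization and the self.key attribute are simplifications:

import pickle

import redis


class TransientError(Exception):
    """Placeholder for the exception class described in this post."""


class RedisQueue:
    # ... __init__() and from_crawler() as sketched above, plus a self.key
    # attribute naming the Redis list that backs this queue ...

    def push(self, request):
        try:
            payload = pickle.dumps(request)  # serialization simplified for the sketch
        except (pickle.PicklingError, TypeError) as e:
            # Serialization problem: the scheduler falls back to the memory queue.
            raise ValueError(str(e))
        try:
            self.client.lpush(self.key, payload)
        except redis.exceptions.ConnectionError as e:
            # Temporary problem: the scheduler also falls back to the memory queue.
            raise TransientError(str(e))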

The pop() method also differentiates between temporary and permanent problems but handles them a bit differently than the push() method. In case of a temporary problem, None should be returned. This causes the caller (i.e. the Scrapy scheduler) to retry later. Exceptions are not handled by the scheduler; the crawling process is halted.
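Again as a rough sketch, continuing the class from above:

import pickle

import redis


class RedisQueue:
    # ... __init__(), from_crawler() and push() as sketched above ...

    def pop(self):
        try:
            payload = self.client.lpop(self.key)
        except redis.exceptions.ConnectionError:
            # Temporary problem: returning None makes the scheduler retry later.
            return None
        if payload is None:
            # The queue is empty.
            return None
        # Permanent problems (e.g. corrupt data) are deliberately not caught;
        # the resulting exception halts the crawling process.
        return pickle.loads(payload)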

The __len__() method also needs to take problems into account. In case of a temporary error, the method is expected to still return the (last known) number of elements in the queue. This can be implemented by tracking the number of push and pop calls. It's important that the method does not return 0 because this would cause the queue to be closed.
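One way to sketch this is to keep a local counter instead of asking the server (again, purely illustrative):

class RedisQueue:
    # ... as sketched above, with self._count initialized to 0 in __init__(),
    # incremented after a successful push() and decremented when pop()
    # returns a request ...

    def __len__(self):
        # Returning the locally tracked count means a temporarily unreachable
        # server never makes the queue look empty, which would close it.
        return self._count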

The next week will bring a separation of Scrapy core and external queues: The idea is to create a project "scrapy-external-queues" which can be installed separately and which provides bridges to other message queues like Redis.


Check-in for week 6

Lukas0907
Published: 07/14/2020

1. What did you do this week?

Last week I was reworking the error handling once again. I moved the self-check from the general queue class to the Redis class and changed it to a simple Redis ping command. I also checked the code coverage w.r.t. tests and made requests with big payloads (e.g. fat cookies).

2. What is coming up next?

The next step is to extract the Redis support into its own Scrapy library. This library should be an official add-on to Scrapy and eventually contain support not only for Redis but also for other message queues.

3. Did you get stuck anywhere?

No.

 


Blog post for week 5: Polishing

Lukas0907
Published: 07/06/2020

Last week was another week of code and documentation polishing. Originally I planned to implement duplicate filtering with external data sources, however, I already did that in week 2 when I evaluated the possibility of disk-less external queues (see pull request #2).

One of the easier changes was to replace the hostname/port/database settings triple with a single Redis URL. redis-py supports initializing a client instance from a URL, e.g. redis://[[username]:[password]]@localhost:6379/0. The biggest advantage of the URL is its flexibility. It allows the user to optionally specify username and password, hostname, port, database name and even certain settings. The Redis URL scheme is also specified at https://www.iana.org/assignments/uri-schemes/prov/redis.
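With redis-py this boils down to a single call (the URL and credentials shown here are made up):

import redis

# One URL replaces the former hostname/port/database triple and can also
# carry credentials.
client = redis.from_url("redis://user:secret@localhost:6379/0")
client.ping()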

While working on the URL refactoring, we also noticed a subtle bug: A spider can have its own settings and hence it's possible for different spiders to use different Redis instances. My implementation was reusing an existing Redis connection but didn't account for spiders having different settings. The fix was easy: The client object is now cached in a dict and indexed by the Redis URL. This way, the object is only reused if the URL matches.
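A sketch of that caching approach (the helper name and module-level dict are illustrative):

import redis

_clients = {}  # maps a Redis URL to a client instance


def get_redis_client(url):
    # Reuse an existing client only if the URL matches exactly, so spiders
    # with different settings get different connections.
    if url not in _clients:
        _clients[url] = redis.from_url(url)
    return _clients[url]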

Another thing that kept me busy last week was detecting and handling various errors. If a user configures an unreachable hostname or wrong credentials, Scrapy should fail early and not in the middle of the crawl. The difficulty was that depending on whether a new crawl is started or a previous crawl is resumed, queues would be created lazily (new crawl) or eagerly (continued crawl). To unify the behavior, I introduced a queue self-check which not only works for Redis but for all queues (i.e. also plain old disk-based queues). The idea is that upon initialization of a priority queue, it pushes a fake Request object, pops it again from the queue and compares the fingerprints. If the fingerprints match and no exception was raised up to this point, the self-check succeeded.

The code for this self-check looks as follows:

def selfcheck(self):
    # Find an empty/unused queue.
    while True:
        # Use random priority to not interfere with existing queues.
        priority = random.randrange(2**64)
        q = self.qfactory(priority)
        if not q:  # Queue is empty
            break
        q.close()

    self.queues[priority] = q
    self.curprio = priority
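    # Push a dummy request into the (empty) queue, pop it again and compare
    # fingerprints to verify that the queue round-trips requests correctly.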
    req1 = Request('http://hostname.invalid', priority=priority)
    self.push(req1)
    req2 = self.pop()
    if request_fingerprint(req1) != request_fingerprint(req2):
        raise ValueError(
            "Pushed request %s with priority %d but popped different request %s."
            % (req1, priority, req2)
        )

    if q:
        raise ValueError(
            "Queue with priority %d should be empty after selfcheck!" % priority
        )


The code may seem a bit complicated because of the random call, which needs additional explanation. The difficulty for the self-check is that a queue is basically identified by its priority, and if a queue with a given priority already exists, it will be reused. This means that if we used a static priority, we could pick up an existing queue and push to and pop from it. This is not really a problem for LIFO queues, where the last element that is pushed is also the one that is popped. But for FIFO queues that are not empty, an arbitrary element is popped (and not the request that we pushed). The solution for this problem is to generate a random priority, get a queue for that priority and only use it if it is empty. Otherwise, generate a new priority randomly. Due to the large range for the priority (0..2**64-1) it is extremely unlikely that a queue with that priority already exists, but even if it does, the loop makes sure that another priority is generated.

For this week, I will do another feedback iteration with my mentors and prepare for the topic of next week: Implementing distributed crawling using common message queues.


Check-in for week 4

Lukas0907
Published: 06/29/2020

1. What did you do this week?

Last week I was mainly polishing my work so far. The unit tests would sometimes fail due to a race condition when the server was not yet ready to accept connections. I now wait until the server is ready and only then start the test. I also added error handling for cases when the configuration is missing or invalid (and tests for these scenarios). Further, I wrote documentation on how to actually use the Redis integration. Last but not least, I benchmarked the code to see how the Redis queue compares to the normal disk queue and to make sure that I did not introduce a performance regression.

2. What is coming up next?

Implementing duplicate filtering with external data sources, for which I already did some work in week #2 when I evaluated disk-less external queues.

3. Did you get stuck anywhere?

No.
