Blog post for week 5: Polishing

Lukas0907
Published: 07/06/2020

Last week was another week of code and documentation polishing. Originally I had planned to implement duplicate filtering with external data sources; however, I already did that in week 2 when I evaluated the possibility of disk-less external queues (see pull request #2).

One of the easier changes was to replace the hostname/port/database settings triple with a single Redis URL. redis-py supports initializing a client instance from a URL, e.g. redis://[[username]:[password]]@localhost:6379/0. The biggest advantage of the URL is its flexibility: it allows the user to optionally specify username and password, hostname, port, database number and even certain connection settings. The Redis URL scheme is also registered at https://www.iana.org/assignments/uri-schemes/prov/redis.
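
For illustration, here is a minimal sketch of building a client from such a URL with redis-py (the URL and credentials are made up):

import redis

# Username, password, port and database number are all optional parts of the
# URL; further connection settings can be passed as query parameters.
client = redis.Redis.from_url("redis://user:secret@localhost:6379/0")
client.ping()  # fails early if the server is unreachable or the credentials are wrong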

While working on the URL refactoring, we also noticed a subtle bug: A spider can have its own settings and hence it's possible for different spiders to use different Redis instances. My implementation was reusing an existing Redis connection but didn't account for spiders having different settings. The fix was easy: The client object is now cached in a dict and indexed by the Redis URL. This way, the object is only reused if the URL matches.
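
The caching works roughly like the following sketch (the helper name and the module-level dict are illustrative, not the actual names from the pull request):

import redis

# Cache of Redis client objects, indexed by the Redis URL they were created from.
_clients = {}

def get_redis_client(url):
    # Reuse an existing client only if the URL matches exactly; spiders
    # configured with a different Redis URL get a client of their own.
    if url not in _clients:
        _clients[url] = redis.Redis.from_url(url)
    return _clients[url]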

Another thing that kept me busy last week was detecting and handling various errors. If a user configures an unreachable hostname or wrong credentials, Scrapy should fail early and not in the middle of the crawl. The difficulty was that, depending on whether a new crawl is started or a previous crawl is resumed, queues would be created lazily (new crawl) or eagerly (resumed crawl). To unify the behavior, I introduced a queue self-check which works not only for Redis but for all queues (i.e. also plain old disk-based queues). The idea is that upon initialization of a priority queue, it pushes a fake Request object, pops it from the queue again and compares the fingerprints. If the fingerprints match and no exception was raised up to this point, the self-check succeeded.

The code for this self-check looks as follows:

def selfcheck(self):
    # Find an empty/unused queue.
    while True:
        # Use random priority to not interfere with existing queues.
        priority = random.randrange(2**64)
        q = self.qfactory(priority)
        if not q:  # Queue is empty
            break
        q.close()

    self.queues[priority] = q
    self.curprio = priority
    req1 = Request('http://hostname.invalid', priority=priority)
    self.push(req1)
    req2 = self.pop()
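    # The popped request must be the very request that was just pushed.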
    if request_fingerprint(req1) != request_fingerprint(req2):
        raise ValueError(
            "Pushed request %s with priority %d but popped different request %s."
            % (req1, priority, req2)
        )

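    # After one push and one pop, the temporary queue must be empty again.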
    if q:
        raise ValueError(
            "Queue with priority %d should be empty after selfcheck!" % priority
        )


The code might seem a bit complicated because of the call to random.randrange(), so it deserves some additional explanation. The difficulty for the self-check is that a queue is basically identified by its priority, and if a queue with a given priority already exists, it will be reused. This means that with a static priority we could pick up an existing queue and push to and pop from it. That is not really a problem for LIFO queues, where the element pushed last is also the one popped first. But for FIFO queues that are not empty, it means that an older element is popped (and not the request that we pushed). The solution for this problem is to generate a random priority, get a queue for that priority and only use it if it is empty; otherwise, a new priority is generated randomly. Due to the large range for the priority (0..2**64-1), it is extremely unlikely that a queue with that priority already exists, but even if it does, the loop makes sure that another priority is generated.
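
To illustrate the difference with plain Python containers rather than the actual queue classes:

from collections import deque

# FIFO: popping returns the oldest element, not the one we just pushed.
fifo = deque(["old-request-1", "old-request-2"])
fifo.append("fake-request")
assert fifo.popleft() == "old-request-1"

# LIFO: popping returns the element pushed last, i.e. our fake request.
lifo = ["old-request-1", "old-request-2"]
lifo.append("fake-request")
assert lifo.pop() == "fake-request"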

For this week, I will do another feedback iteration with my mentors and prepare for next week's topic: implementing distributed crawling using common message queues.