Building the Spider Failure Detector and Making a Scrapy Plugin

In the last two weeks, I worked on:
1. Building the Spider Failure Detector.
2. Learning to build a plugin for the Scrapy Spider Auto-repair code.

To build the Spider Failure Detector, we assume that the callback/parse method of a spider returns three types of objects (a minimal illustration follows the list below):
1. Items
2. Dictionaries
3. Requests.
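
For context, here is a minimal sketch of a parse callback that yields all three kinds of objects. This is not the project's actual code; the spider and item names are made up:

import scrapy


class ArticleItem(scrapy.Item):
    # hypothetical item with a single field
    title = scrapy.Field()


class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["http://example.com"]

    def parse(self, response):
        # 1. an Item
        yield ArticleItem(title=response.css("title::text").get())
        # 2. a plain dictionary
        yield {"url": response.url}
        # 3. a Request for a further page
        yield scrapy.Request(response.urljoin("/about"), callback=self.parse)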

So, to detect the failure of a spider, we compare the data extracted by the working spider from the old page with the data extracted by the same spider from the new page. If the two sets of data are equal, we say the spider is good; otherwise, we say it has failed.
The data extracted by the spider can be of the following types:
1. Items
2. Dictionaries
3. Requests.
The values corresponding to the keys of a dictionary, and the values of the attributes of the item objects, can themselves be objects. However, if we keep going deeper, i.e. taking the values of values of values of … values of the first item attribute/dictionary key, the values will ultimately be primitive datatypes.
So we recursively compare the extracted objects to see if the extracted data is equal.
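
A minimal sketch of this recursive comparison might look like the following. The actual implementation lives in the pull request; the function name extracted_equal and the choice to compare Requests by their URL are illustrative assumptions:

import scrapy


def extracted_equal(old, new):
    """Recursively compare two extracted objects (Items, dicts, Requests
    or nested values) until primitive values are reached."""
    # Items and dicts are both compared key by key
    if isinstance(old, (scrapy.Item, dict)) and isinstance(new, (scrapy.Item, dict)):
        return (set(old.keys()) == set(new.keys())
                and all(extracted_equal(old[k], new[k]) for k in old.keys()))
    # Requests are compared by the page they point to (an assumption)
    if isinstance(old, scrapy.Request) and isinstance(new, scrapy.Request):
        return old.url == new.url
    # Lists/tuples of extracted values are compared element-wise
    if isinstance(old, (list, tuple)) and isinstance(new, (list, tuple)):
        return (len(old) == len(new)
                and all(extracted_equal(o, n) for o, n in zip(old, new)))
    # Base case: primitive values
    return old == new
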
Here is the link to the pull request for this feature.

Apart from the Spider Failure Detector, I also worked on making a plugin for Scrapy and figuring out a way to install the auto-repair code in that plugin.

From next week onwards, I will work on implementing the plugin and writing some supporting code.

Creating a Dataset by Scraping Data from the Wayback Machine

As described in my previous blog post, I was supposed to scrape data from the Wayback Machine and create a dataset out of it so that my auto repair code can be tested on this dataset. In the last week, I did exactly this.

Wayback Machine APIs

The Internet Archive Wayback Machine supports a number of different APIs to make it easier for developers to retrieve information about Wayback capture data.

Wayback Availability JSON API

This simple API for Wayback is a test to see if a given URL is archived and currently accessible in the Wayback Machine. This API is useful for providing a 404 or other error handler which checks Wayback to see if it has an archived copy ready to display. The API can be used as follows:

http://archive.org/wayback/available?url=example.com

which might return:

{
    "archived_snapshots": {
        "closest": {
            "available": true,
            "url": "http://web.archive.org/web/20130919044612/http://example.com/",
            "timestamp": "20130919044612",
            "status": "200"
        }
    }
}

if the URL is available. When available, url is the link to the archived snapshot in the Wayback Machine. At this time, archived_snapshots returns just a single closest snapshot, but additional snapshots may be added in the future.

If the URL is not available (not archived or currently not accessible), the response will be:

{"archived_snapshots":{}}

Other Options

Additional options which may be specified are timestamp and callback:

    • timestamp is the timestamp to look up in Wayback. If not specified, the most recently available capture in Wayback is returned. The format of the timestamp is 1-14 digits (YYYYMMDDhhmmss), e.g.:

http://archive.org/wayback/available?url=example.com&timestamp=20060101

may result in the following response (note that the snapshot timestamp is now close to 20060101):

{
    "archived_snapshots": {
        "closest": {
            "available": true,
            "url": "http://web.archive.org/web/20060101064348/http://www.example.com:80/",
            "timestamp": "20060101064348",
            "status": "200"
        }
    }
}

    • callback is an optional callback which may be specified to produce a JSONP response.
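
Continuing the sketch above, the timestamp option is simply an extra query parameter on the same endpoint (again an illustration, not the project's code):

import requests

# ask for the snapshot closest to 1 January 2006
params = {"url": "example.com", "timestamp": "20060101"}
resp = requests.get("http://archive.org/wayback/available", params=params)
closest = resp.json()["archived_snapshots"]["closest"]
print(closest["url"], closest["timestamp"])
# e.g. http://web.archive.org/web/20060101064348/http://www.example.com:80/ 20060101064348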

How the spider works

The first step is to check whether a snapshot of a given URL exists. To check this, we use the Wayback Machine's JSON API with the timestamp option. Also, as I explained in one of my previous blog posts, the goal is to scrape the snapshotted versions of the websites listed in a CSV file containing the top 500 most visited websites. You can find the .csv file here.

To do this, for each website URL in the .csv file and for each year from 1996 to 2018, we check whether a snapshot exists. For example, the API call for the query http://archive.org/wayback/available?url=example.com&timestamp=20060101 returns something like:

{
    "archived_snapshots": {
        "closest": {
            "available": true,
            "url": "http://web.archive.org/web/20060101064348/http://www.example.com:80/",
            "timestamp": "20060101064348",
            "status": "200"
        }
    }
}

From the above JSON response, we pick the url attribute under archived_snapshots > closest.

Once we get this URL, we visit it, scrape the data there, and organize the scraped data into files and folders.
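
The real spider (linked below) does this with Scrapy; the following simplified sketch uses the requests library instead, and names such as TOP_SITES_CSV and the folder layout are assumptions:

import csv
import os

import requests

TOP_SITES_CSV = "top-500-sites.csv"  # hypothetical path to the CSV of top 500 sites
OUTPUT_DIR = "dataset"

with open(TOP_SITES_CSV) as f:
    # assuming the first column of each row holds the site domain
    sites = [row[0] for row in csv.reader(f) if row]

for site in sites:
    for year in range(1996, 2019):
        # ask the Wayback Machine for the snapshot closest to 1 January of that year
        params = {"url": site, "timestamp": f"{year}0101"}
        resp = requests.get("http://archive.org/wayback/available", params=params)
        closest = resp.json().get("archived_snapshots", {}).get("closest")
        if not closest or not closest.get("available"):
            continue
        # fetch the archived page and store it as <site>/<year>.html
        page = requests.get(closest["url"])
        folder = os.path.join(OUTPUT_DIR, site.replace("/", "_"))
        os.makedirs(folder, exist_ok=True)
        with open(os.path.join(folder, f"{year}.html"), "wb") as out:
            out.write(page.content)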

Link to the scraped dataset –  https://github.com/virmht/Scrapy-Spider-Autorepair/blob/master/Dataset.rar

Link to the Scrapy code used for scraping – https://github.com/virmht/Scrapy-Spider-Autorepair/blob/master/data_extractor_scrapy.py.