Creating a Dataset by Scraping Data from the Wayback Machine

As described in my previous blog post, my next task was to scrape data from the Wayback Machine and build a dataset from it, so that my auto-repair code can be tested against that dataset. Over the last week, I did exactly that.

Wayback Machine APIs

The Internet Archive Wayback Machine supports a number of different APIs to make it easier for developers to retrieve information about Wayback capture data.

Wayback Availability JSON API

This simple API tests whether a given URL is archived and currently accessible in the Wayback Machine. It is useful for building a 404 or other error handler that checks Wayback for an archived copy ready to display. The API can be used as follows:

http://archive.org/wayback/available?url=example.com

which might return:

{
    "archived_snapshots": {
        "closest": {
            "available": true,
            "url": "http://web.archive.org/web/20130919044612/http://example.com/",
            "timestamp": "20130919044612",
            "status": "200"
        }
    }
}

if the URL is available. When available, the url field is the link to the archived snapshot in the Wayback Machine. At this time, archived_snapshots returns just a single closest snapshot, but additional snapshots may be added in the future.

If the url is not available (not archived or currently not accessible), the response will be:

{"archived_snapshots":{}}
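
Both response shapes can be handled by one small helper. The following is a minimal Python sketch, not the code used in this project: the function name closest_snapshot_url is made up for illustration, and the two sample bodies are copied from the responses shown above rather than fetched over the network.

```python
import json
from typing import Optional


def closest_snapshot_url(body: str) -> Optional[str]:
    """Return the archived snapshot URL from an availability response, or None.

    Hypothetical helper: it only assumes the two response shapes shown above.
    """
    data = json.loads(body)
    # An unarchived URL yields {"archived_snapshots": {}}, so "closest" is absent.
    closest = data.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest.get("url")


# Sample bodies copied from the responses above (no network access needed).
found = (
    '{"archived_snapshots": {"closest": {"available": true, '
    '"url": "http://web.archive.org/web/20130919044612/http://example.com/", '
    '"timestamp": "20130919044612", "status": "200"}}}'
)
missing = '{"archived_snapshots":{}}'

print(closest_snapshot_url(found))
print(closest_snapshot_url(missing))
```

Returning None for the empty case lets the caller skip unarchived URLs with a simple check instead of catching a KeyError.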

Other Options

Additional options which may be specified are timestamp and callback:

    • timestamp is the timestamp to look up in Wayback. If not specified, the most recently available capture in Wayback is returned. The format of the timestamp is 1-14 digits (YYYYMMDDhhmmss), for example:

http://archive.org/wayback/available?url=example.com&timestamp=20060101

may result in the following response (note that the snapshot timestamp is now close to 20060101):

{
    "archived_snapshots": {
        "closest": {
            "available": true,
            "url": "http://web.archive.org/web/20060101064348/http://www.example.com:80/",
            "timestamp": "20060101064348",
            "status": "200"
        }
    }
}

    • callback is an optional callback which may be specified to produce a JSONP response.
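
To make the options concrete, here is a small Python sketch that assembles a query URL from them. The helper name build_availability_url is hypothetical, and the digit check simply mirrors the 1-14 digit rule stated above; it is an illustration, not part of the API itself.

```python
import re
from typing import Optional
from urllib.parse import urlencode

API_ENDPOINT = "http://archive.org/wayback/available"


def build_availability_url(
    url: str,
    timestamp: Optional[str] = None,
    callback: Optional[str] = None,
) -> str:
    """Build an availability query; timestamp and callback are optional."""
    params = {"url": url}
    if timestamp is not None:
        # The API accepts 1-14 digits in YYYYMMDDhhmmss order.
        if not re.fullmatch(r"[0-9]{1,14}", timestamp):
            raise ValueError("timestamp must be 1-14 digits (YYYYMMDDhhmmss)")
        params["timestamp"] = timestamp
    if callback is not None:
        # Asks the API for a JSONP response wrapped in this function name.
        params["callback"] = callback
    return f"{API_ENDPOINT}?{urlencode(params)}"


print(build_availability_url("example.com", timestamp="20060101"))
```

Using urlencode means a target URL that contains characters such as ":" or "/" is percent-encoded rather than pasted into the query string as-is.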

How the spider works

The first step is to check whether a given URL exists in the archive. To check this, we use the Wayback Machine's JSON API with the timestamp option. As I explained in one of my previous blog posts, the goal is to scrape the snapshotted versions of the websites listed in a CSV file containing the 500 most visited websites. You can find the .csv file here.

To do this, for each website URL in the .csv file and for each year from 1996 to 2018, we check whether a snapshot of the URL exists. For example, the API call http://archive.org/wayback/available?url=example.com&timestamp=20060101 returns a response like this:

{
    "archived_snapshots": {
        "closest": {
            "available": true,
            "url": "http://web.archive.org/web/20060101064348/http://www.example.com:80/",
            "timestamp": "20060101064348",
            "status": "200"
        }
    }
}

From the above JSON response, we pick the url attribute at archived_snapshots > closest > url.
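
The site-by-year enumeration can be sketched in a few lines of Python. This is an illustration under assumptions rather than the actual spider: the function name yearly_queries is made up, the site URL is assumed to sit in the first CSV column, and January 1 is assumed as the timestamp for each year (matching the 20060101 example).

```python
import csv
import io
from typing import Iterable, Iterator, List, Tuple


def yearly_queries(
    rows: Iterable[List[str]],
    first_year: int = 1996,
    last_year: int = 2018,
    url_column: int = 0,  # assumption: the site URL is in the first column
) -> Iterator[Tuple[str, int, str]]:
    """Yield (site, year, query_url) for every site and every year."""
    for row in rows:
        if not row:
            continue  # skip blank lines in the CSV
        site = row[url_column].strip()
        for year in range(first_year, last_year + 1):
            # Assumption: ask for the snapshot closest to January 1 of the year.
            query = (
                "http://archive.org/wayback/available"
                f"?url={site}&timestamp={year}0101"
            )
            yield site, year, query


# Stand-in for the real .csv file, whose exact layout is not shown here.
sample_csv = io.StringIO("example.com\nexample.org\n")
queries = list(yearly_queries(csv.reader(sample_csv)))
print(len(queries))
print(queries[0])
```

Each query URL would then be fetched, and the url attribute read from its JSON response as described above.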

Once we have this URL, we visit it, scrape the data found there, and organize the scraped data into files and folders.
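
One possible way to lay the scraped pages out on disk is a folder per website and a subfolder per year. The Python sketch below shows that idea only; the layout, the helper names snapshot_path and save_snapshot, and the index.html file name are all assumptions for illustration and may differ from how the published dataset is organized.

```python
import re
import tempfile
from pathlib import Path
from urllib.parse import urlparse

# Matches .../web/<timestamp>/<original URL> and captures the year and the URL.
SNAPSHOT_RE = re.compile(r"/web/([0-9]{4})[0-9]*/(.+)$")


def snapshot_path(root: Path, snapshot_url: str) -> Path:
    """Map a snapshot URL to <root>/<host>/<year>/index.html (assumed layout)."""
    match = SNAPSHOT_RE.search(snapshot_url)
    if match is None:
        raise ValueError(f"not a Wayback snapshot URL: {snapshot_url}")
    year, original_url = match.groups()
    host = urlparse(original_url).hostname or "unknown-host"
    return root / host / year / "index.html"


def save_snapshot(root: Path, snapshot_url: str, html: str) -> Path:
    """Write the scraped HTML to its folder, creating directories as needed."""
    path = snapshot_path(root, snapshot_url)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding="utf-8")
    return path


with tempfile.TemporaryDirectory() as tmp:
    saved = save_snapshot(
        Path(tmp),
        "http://web.archive.org/web/20060101064348/http://www.example.com:80/",
        "<html></html>",
    )
    print(saved.relative_to(tmp))
```

Keying folders on host and year keeps all captures of one site together, which makes it easy to compare how a page changed over time.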

Link to the scraped dataset – https://github.com/virmht/Scrapy-Spider-Autorepair/blob/master/Dataset.rar

Link to the Scrapy code used for scraping – https://github.com/virmht/Scrapy-Spider-Autorepair/blob/master/data_extractor_scrapy.py
