Adding Documentation, Tests and Refactoring Code

This week and the previous week I worked on the following:

1. I have written an API that the user can use to automatically repair their spider.

This function, auto_repair_lst, takes four parameters:

  • old_page_path (type: string)
  • new_page_path (type: string)
  • lst_extracted_old_subtree (type: list of lxml.etree._Element objects)
  • rules (type: list) (OPTIONAL)

old_page_path is the path to a file containing the old HTML page on which the spider worked and correctly extracted the required data.

new_page_path is the path to a file containing the new HTML page on which the spider fails to extract the correct data; this is the file from which you would like the repaired spider to extract the data.

lst_extracted_old_subtree is a list of objects of type lxml.etree._Element. Each object in this list is a subtree of the old_page HTML tree which was extracted from the old page when the spider was not broken.

If rules (the fourth parameter) is given, the function will use these rules to extract the relevant information from the new page.

This function takes the above arguments and returns two things:

  • Rules for data extraction
  • List of repaired subtrees. Each subtree in this list is an object of type lxml.etree._Element.

Let’s take a simple example.

Suppose the old page contains the following HTML code:

<html>
    <body>
        <div>
            <p>Username</p>
            <p>Password</p>
            <div>Submit</div>
        </div>
        <div>
           <div>
                <div>
                    <div>
                        <p>Username</p>
                        <p>email</p>
                        <p>Captcha1</p>
                        <p>Captcha2</p>
                    </div>
                </div>
            </div>
        </div> 
        <p>This should not be extracted</p>
    </body>
</html>

And the new page contains the following HTML code:

<html>
    <body>
        <div>
            <p>Username</p>
            <p>email</p>
        </div> 
        <p>This should not be extracted</p>
        <div>
            <p>Hello World</p>
            <div>
                <p>Username</p>
                <p>Password</p>
            </div>
            <div>Submit</div>
        </div>
    </body>
</html>

Now you can run the following code to correctly extract data.

>>> from lxml.etree import tostring
>>> # Page and auto_repair_lst come from the auto-repair code (auto_repair_code.py)
>>> old_page_path = 'Examples/Autorepair_Old_Page.html'
>>> new_page_path = 'Examples/Autorepair_New_Page.html'
>>> old_page = Page(old_page_path, 'html')
>>> new_page = Page(new_page_path, 'html')
>>> lst_extracted_old_subtrees = [old_page.tree.getroot()[0][1][0][0]]
>>> lst_rules, lst_repaired_subtrees = auto_repair_lst(old_page_path, new_page_path, lst_extracted_old_subtrees)
>>> lst_rules
[[([0, 0], [0, 0, 0]), ([0, 1], [0, 0, 1])]]
>>> len(lst_repaired_subtrees)
1
>>> tostring(lst_repaired_subtrees[0])
b'<div>\n                    <div>\n                        <p>Username</p>\n            <p>email</p>\n        <p>Captcha1</p>\n                        <p>Captcha2</p>\n                    </div>\n                </div>\n            '
>>>

From the above example, you can see that since Captcha1 and Captcha2 could not be found in the new page, they are kept untouched.
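
As I read the returned rules, each pair maps an index path inside the old extracted subtree to an index path in the new page's tree; a path such as [0, 0, 1] means "first child, then first child, then second child". A minimal helper for resolving such a path (the helper name is mine, not part of the API):

def node_at_path(root, path):
    """Follow a list of child indices from `root` down an lxml tree."""
    node = root
    for index in path:
        node = node[index]
    return node

# Continuing the session above, and assuming the second path of each rule is
# relative to the new page's root, [0, 0, 1] would resolve to <p>email</p>:
#   node_at_path(new_page.tree.getroot(), [0, 0, 1])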

Now, whenever you encounter a webpage whose layout is similar to the new page's layout, you can directly reuse the rules returned during repair to extract data from it.

To do this, simply pass the returned rules (lst_rules in the example above) back to the auto_repair_lst function in the following manner:

Extending the above example, suppose a webpage having layout similar to the new_page layout is the following:

<html>
    <body>
        <div>
            <p>Google</p>
            <p>Microsoft</p>
        </div> 
        <p>This should not be extracted</p>
        <div>
            <p>Hello World</p>
            <div>
                <p>foo</p>
                <p>bar</p>
            </div>
            <div>blah...</div>
        </div>
    </body>
</html>

You can write the following extra code:

>>> new_page_similar_path = 'C:/Users/Viral Mehta/Desktop/Scrapy-Spider-Autorepair/Examples/Autorepair_New_page_similar.html'
>>> rules, lst_repaired_subtrees = auto_repair_lst(old_page_path, new_page_similar_path, lst_extracted_old_subtrees, lst_rules)
>>> tostring(lst_repaired_subtrees[0])
b'<div>\n                    <div>\n                        <p>Google</p>\n            <p>Microsoft</p>\n        <p>Captcha1</p>\n                        <p>Captcha2</p>\n                    </div>\n                </div>\n            '
>>>

2. Adding Documentation: For each function, I have added a short description of what the function does, the values returned by the function and some examples to show how to use the function.
3. Adding Tests: For each function I have added at least one test. Testing framework used – pytest.
4. Refactoring Code: I have divided the code into functions in such a way that each function is a part of an appropriate class.
5. I have also packaged the code and uploaded it to PyPI. The PyPI URL for my project is:
https://pypi.org/project/scrapy-spider-auto-repair/
6. Apart from that, I have also written the README and contribution guidelines.

You can find all the above in my recent PR – https://github.com/virmht/Scrapy-Spider-Autorepair/pull/3

For the next week, I will wrap up all the pending tasks. I would like to set up Continuous Integration in my repository, for which I will be using Travis CI.

Building the Spider Failure Detector and Making a Scrapy Plugin

In the last two weeks, I worked on:
1. Building the Spider Failure Detector.
2. Learning to build a plugin for the Scrapy Spider Auto-repair code.

To build the Spider Failure Detector, we are assuming that the callback/parse method of a spider returns three types of objects:
1. Items
2. Dictionaries
3. Requests.

So, to detect the failure of a spider, we compare the data extracted by the working spider on the old page with the data extracted by the same spider on the new page. If the two are equal, we say that the spider is working; otherwise, we say that it has failed.
The data extracted by the spider can be of the following types:
1. Items
2. Dictionaries
3. Requests.
The values corresponding to the keys in a dictionary and the values of the attributes of the item objects can also be objects. However, if we keep descending, i.e. looking at the values of values of values of … values of the first item attribute/dictionary key, the values will ultimately be primitive datatypes.
So we recursively compare the extracted objects to see if the extracted data is equal.
Here is the link to the pull request for this feature.
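
For illustration, here is a minimal sketch of that recursive comparison (not the exact code from the pull request; it assumes items are standard scrapy.Item objects and that requests are compared by their URL):

import scrapy

def extracted_equal(a, b):
    """Recursively compare two objects returned by a spider callback."""
    # Requests are compared by their target URL.
    if isinstance(a, scrapy.Request) or isinstance(b, scrapy.Request):
        return (isinstance(a, scrapy.Request) and isinstance(b, scrapy.Request)
                and a.url == b.url)
    # Items and dictionaries are compared key by key.
    if isinstance(a, (dict, scrapy.Item)) or isinstance(b, (dict, scrapy.Item)):
        if not (isinstance(a, (dict, scrapy.Item)) and isinstance(b, (dict, scrapy.Item))):
            return False
        if set(a.keys()) != set(b.keys()):
            return False
        return all(extracted_equal(a[key], b[key]) for key in a.keys())
    # Lists and tuples are compared element-wise.
    if isinstance(a, (list, tuple)) and isinstance(b, (list, tuple)):
        return len(a) == len(b) and all(extracted_equal(x, y) for x, y in zip(a, b))
    # Everything else should be a primitive datatype by now.
    return a == b

def spider_failed(old_output, new_output):
    """The spider is considered broken if the old and new outputs differ."""
    return not extracted_equal(list(old_output), list(new_output))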

Apart from the spider failure detector, I also worked on making a plugin for Scrapy and figuring out a way to install the auto-repair code in that plugin.

Next week onwards, I will work on implementing the plugin and writing some supporting code.

Creating a Dataset by Scraping Data from the Wayback Machine

As described in my previous blog post, I was supposed to scrape data from the Wayback Machine and create a dataset out of it so that my auto repair code can be tested on this dataset. In the last week, I did exactly this.

Wayback Machine APIs

The Internet Archive Wayback Machine supports a number of different APIs to make it easier for developers to retrieve information about Wayback capture data.

Wayback Availability JSON API

This simple API for Wayback is a test to see if a given url is archived and currently accessible in the Wayback Machine. This API is useful for providing a 404 or other error handler which checks Wayback to see if it has an archived copy ready to display. The API can be used as follows:

http://archive.org/wayback/available?url=example.com

which might return:

{
    "archived_snapshots": {
        "closest": {
            "available": true,
            "url": "http://web.archive.org/web/20130919044612/http://example.com/",
            "timestamp": "20130919044612",
            "status": "200"
        }
    }
}

if the url is available. When available, the url is the link to the archived snapshot in the Wayback Machine. At this time, archived_snapshots just returns a single closest snapshot, but additional snapshots may be added in the future.

If the url is not available (not archived or currently not accessible), the response will be:

{"archived_snapshots":{}}

Other Options

Additional options which may be specified are timestamp and callback.

    • timestamp is the timestamp to look up in Wayback. If not specified, the most recently available capture in Wayback is returned. The format of the timestamp is 1-14 digits (YYYYMMDDhhmmss), for example:

http://archive.org/wayback/available?url=example.com&timestamp=20060101

may result in the following response (note that the snapshot timestamp is now close to 20060101):

{
    "archived_snapshots": {
        "closest": {
            "available": true,
            "url": "http://web.archive.org/web/20060101064348/http://www.example.com:80/",
            "timestamp": "20060101064348",
            "status": "200"
        }
    }
}

callback is an optional callback which may be specified to produce a JSONP response.

How does the spider work?

The first step is to check whether a snapshot of a given URL exists. To check this, we will use the Wayback Machine's JSON API with the timestamp option. Also, as I explained in one of my previous blogs, the goal is to scrape the snapshotted versions of the websites listed in a CSV file containing the top 500 most visited websites. You can find the .csv file here.

To do this, for each website URL in the .csv file and for each year from 1996 to 2018, we check whether a snapshot of the URL exists. For example, the API call corresponding to the query http://archive.org/wayback/available?url=example.com&timestamp=20060101 returns something like:

{
    "archived_snapshots": {
        "closest": {
            "available": true,
            "url": "http://web.archive.org/web/20060101064348/http://www.example.com:80/",
            "timestamp": "20060101064348",
            "status": "200"
        }
    }
}

From the above JSON response, we pick the URL attribute in archived_snapshots>closest>url.

Once we get this URL, we go to that URL, scrape the data there, and organize the scraped data in the form of files and folders.
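
A small sketch of this lookup step, using the requests library for brevity (the actual scraping is done with Scrapy; the CSV file name and output handling here are illustrative):

import requests

WAYBACK_API = 'http://archive.org/wayback/available'

def closest_snapshot_url(url, year):
    """Return the closest archived snapshot URL for `url` around 1 January of `year`."""
    params = {'url': url, 'timestamp': '%d0101' % year}
    response = requests.get(WAYBACK_API, params=params)
    closest = response.json().get('archived_snapshots', {}).get('closest')
    # An empty archived_snapshots object means no capture is available.
    return closest['url'] if closest and closest.get('available') else None

# Check every site in the top-500 list for every year from 1996 to 2018.
with open('top500.csv') as f:                 # file name assumed for illustration
    sites = [line.strip() for line in f if line.strip()]

for site in sites:
    for year in range(1996, 2019):
        snapshot = closest_snapshot_url(site, year)
        if snapshot:
            print(site, year, snapshot)       # the real spider downloads and stores this page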

Link to the scraped dataset –  https://github.com/virmht/Scrapy-Spider-Autorepair/blob/master/Dataset.rar

Link to the scrapy code used for scraping – https://github.com/virmht/Scrapy-Spider-Autorepair/blob/master/data_extractor_scrapy.py.

Implementation of Ideas

In my previous blog I described the ideas that were supposed to be implemented. I spent the last two weeks doing exactly this.

I initially spent a long time implementing the most efficient version of the Hungarian algorithm; however, in my quest for the most efficient implementation, I came across SciPy's implementation of it. Here is the link to it.
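
For reference, a small usage sketch of scipy.optimize.linear_sum_assignment, which solves the assignment problem on a cost matrix:

import numpy as np
from scipy.optimize import linear_sum_assignment

# cost[i][j] = cost of assigning worker i to job j
cost = np.array([[4, 1, 3],
                 [2, 0, 5],
                 [3, 2, 2]])

row_ind, col_ind = linear_sum_assignment(cost)
print(list(zip(row_ind, col_ind)))    # optimal worker -> job assignment
print(cost[row_ind, col_ind].sum())   # minimum total cost (here 1 + 2 + 2 = 5)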

Another idea to speed up the algorithm is to do some pre-processing before computing the edit distance. Once this preprocessing is done, the edit distance between a string A and any substring of string B can be found in O(1) time. The preprocessing itself takes O(|A||B| + |A| + |B|) time. This brings the overall time complexity down from O(|A||B||B|) to O(|A||B| + |A| + |B|), which is a good improvement. More details about this algorithm are given in this paper.

Future Plan:

The goal is to write a spider in Scrapy to gather the data from the top 500 websites. I will be spending my next week doing exactly this.

Link to the implementation code – https://github.com/virmht/Scrapy-Spider-Autorepair/blob/master/auto_repair_code.py

Maximizing Joint Probability of Alignment and Data Extraction

In my previous blog post, I explained maximizing the joint probability of alignment, i.e., choosing the alignment whose probability is maximum. To do this, we will use the Hungarian algorithm and its variants. Let's begin by describing the problem a little more formally.

Assume that we have N workers and N jobs that should be done. For each pair (worker, job) we know the salary that should be paid to the worker to perform that job. Our goal is to complete all jobs while minimizing the total cost, assigning each worker to exactly one job and vice versa.

Converting this problem to a formal mathematical definition we can form the following equations:
  • $C = [c_{ij}]_{N \times N}$ – cost matrix, where $c_{ij}$ is the cost of worker $i$ performing job $j$.
  • $X = [x_{ij}]_{N \times N}$ – resulting binary matrix, where $x_{ij} = 1$ if and only if the $i$-th worker is assigned to the $j$-th job.
  • $\sum_{j=1}^{N} x_{ij} = 1$ for each $i$ – one worker to one job assignment.
  • $\sum_{i=1}^{N} x_{ij} = 1$ for each $j$ – one job to one worker assignment.
  • $\min \sum_{i=1}^{N} \sum_{j=1}^{N} c_{ij} x_{ij}$ – total cost function.

 


Data Extraction:

The above ideas look good in theory; however, we need to test them. In order to perform this test, we want to gather snapshots of different websites at different dates. Once we have this data, we will generate random spiders to extract certain sub-trees from two different snapshots of a web-page. Because the layout of a web-page can change with time, there is a small probability that a spider written to extract data from one snapshot of a web-page fails to extract the same sub-tree from another snapshot of the web-page. Once it fails, we will use the auto-repair code to correct the spider. To test the improvement in performance, we will incrementally add features to the code and test the performance. As we make our algorithm better, we expect to see improvements in the performance as well. For this purpose, the extracted data is useful.

Here is the link to the data extraction code.

Future Plan of Work:

The plan is to implement the ideas in the existing code. In particular, I wish to make the following improvements –

  1. Improve the tree edit distance run-time complexity
  2. Implement the idea of using nearby landmarks as feature vectors which serve as the weights in the alignment problem.
  3. Implement the idea of maximizing joint probability of alignment.
  4. Test the implementation for accuracy.

The plan is to complete the above by the start of second evaluation.

GSoC 2018: Building the Essential Components of Spider Auto Repair

Problem Description:

Web data extraction using XPaths is brittle and vulnerable to site layout changes. The goal of my project is to make web extraction robust by automatically correcting XPaths and rearranging data in containers to ensure that the data extracted before and after changes in web site layout is the same. We assume that we have access to the concerned web page before and after layout changes.

Approach Taken:

Any web page can be thought of as a tree, where the tags form the internal nodes of the tree and the content inside the tags forms the child elements.
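
For example, lxml parses a page into exactly this kind of tree (a small sketch; the Page class used earlier in this write-up wraps an lxml parse of this sort):

from lxml import html

# Parse an HTML document into a tree of elements.
tree = html.fromstring("""
<html>
  <body>
    <div><p>Username</p><p>Password</p></div>
  </body>
</html>
""")

# Tags are internal nodes; the text inside them hangs off those nodes.
for element in tree.iter():
    print(element.tag, element.text)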

Detecting Failure in Data Extraction:

The first step is to detect if data extraction has failed. We assume that the location of extracted data is specified using an XPath. Such XPath expressions can be used to extract a sub-tree of the HTML tree of a web page. We compare the extracted sub-tree before and after the change in layout; if they are the same, we are done. If they are different, we try to repair the XPaths to ensure that the extracted data is the same before and after the page layout changes.
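
A minimal sketch of this check, assuming the extraction location is given as an XPath string (the file names and the XPath are illustrative):

from lxml import etree, html

def extract_subtree(page_path, xpath):
    """Parse a saved HTML page and return the serialized sub-tree at `xpath`."""
    tree = html.parse(page_path)
    matches = tree.xpath(xpath)
    return etree.tostring(matches[0]) if matches else None

old_subtree = extract_subtree('old_page.html', '//div[@id="content"]')
new_subtree = extract_subtree('new_page.html', '//div[@id="content"]')

# If the serialized sub-trees differ (or the XPath no longer matches anything),
# data extraction is considered to have failed and the repair step kicks in.
extraction_failed = old_subtree != new_subtree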

Repairing the Spider:

We will call the web page before the layout change P1 and the web page after the layout change P2.

Stage – 1:

Case I – Sub-tree extracted from P1 can be found in P2:

This is the simplest case. We simply search for the sub-tree extracted from P1 in P2. We use Levenshtein distance to find the distance between two sub-trees represented as HTML strings. If this distance is smaller than a threshold, we pass this sub-tree on to Stage 2 for further processing and remember its XPath.
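
A sketch of this search, using a plain dynamic-programming Levenshtein distance over serialized sub-trees (the threshold value is illustrative, not the one used in the project):

from lxml import etree

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two sequences."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                  # deletion
                               current[j - 1] + 1,               # insertion
                               previous[j - 1] + (ca != cb)))    # substitution
        previous = current
    return previous[-1]

def find_subtree(st1, p2_root, threshold=10):
    """Return the sub-tree of P2 closest to ST1, if it is within `threshold`."""
    target = etree.tostring(st1)
    best_node, best_dist = None, None
    for node in p2_root.iter():
        dist = levenshtein(target, etree.tostring(node))
        if best_dist is None or dist < best_dist:
            best_node, best_dist = node, dist
    return best_node if best_dist is not None and best_dist <= threshold else None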

Case II – Sub-tree extracted from P1 cannot be found in P2 but the leaf nodes of the extracted sub-tree can be found in P2:

Let's call the sub-tree extracted from P1 ST1. We try to find the largest sub-tree of ST1 which is present in P2 (using Levenshtein distance, as stated in Case I above). We remember the path of this sub-tree in P2 and write this path in place of the sub-tree in ST1. Let me illustrate this with a simple example:

Fig. 1

 

The above figure shows the HTML trees of web pages P1 and P2. One can also see the sub-tree ST1 of P1 inside the blue boundary. Let's say we wish to extract this sub-tree. Now, we will try to look for ST1 in P2. Clearly, ST1 cannot be found in P2. You can see that if node 8 (in the red boundary) were absent, we would be able to find ST1 in P2. Now, let's take a look at the numbers written on the edges of the two trees. These numbers indicate the path taken from the root to reach a given node. So, as node 2 is the first child of node 1 in P1, we say that one must follow the path [0] to reach node 2. Similarly, node 3 is the second child (ordered left to right) of node 1, hence, one must follow the path [1] to reach node 3. Similarly, one must follow the path [0, 1] from the root of P1 to reach node 5.

Since ST1 is not present in P2, we will try to search for sub-trees of ST1 in P2. Now, one can see that the sub-tree containing just node 4 can be found in P2. Similarly, the sub-tree consisting of nodes 5 and 6 can also be found in P2, so we will replace these two sub-trees in ST1 with their paths in P2. ST1 now becomes the tree shown in the figure Fig. 2 below.

Fig. 2

You can find the code which implements the above logic here and the README for the code here.

Stage – 2:

The above logic should work well except for one case: when the sub-tree to be extracted can be found at multiple places in P2. The question then is – the path of which sub-tree in P2 should be used to replace the sub-tree of ST1, as shown in Fig. 2?

To deal with this issue, we will use the fact that the surroundings of a sub-tree should be somewhat similar in both P1 and P2. For example, if you want to extract the Username element in an HTML form, you would expect the Password element to be somewhere near it. The idea is to exploit this. The plan is to create a binary feature vector over the K nearest landmarks with respect to ST1. If ST1 is the Username element, then it's logical to expect the Password and Submit button landmarks to be near that element. So, with the two nearest neighbours being the Password element and the Submit button, the feature vector over the landmarks (Password element, Submit button, Sign-up button) is [1, 1, 0]. Then, for a particular sub-tree in P2 whose Levenshtein distance from ST1 is less than the threshold, we look at its surrounding landmarks. Suppose it only has the Submit button (nearest element) and the Sign-up button (second-nearest element) but no Password element; then the feature vector for this sub-tree will be [0, 1, 1]. Now, to find the similarity between these two vectors, we use cosine similarity. For these two vectors, the cosine similarity is 0.5. For each occurrence of ST1 in P2, we find the similarity of that sub-tree with ST1 using this neighborhood information.
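
The cosine similarity in this example can be checked with a few lines of NumPy (a small sketch using the two feature vectors from the text):

import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

st1_features = np.array([1, 1, 0])        # Password, Submit button, Sign-up button near ST1
candidate_features = np.array([0, 1, 1])  # landmarks near the candidate sub-tree in P2

print(cosine_similarity(st1_features, candidate_features))  # approximately 0.5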

Can we do better?

It turns out that we can do much better. There are two important ideas:

  1. Using weights: Some landmarks are more important than other landmarks. So, instead of having binary feature vectors, let’s assign weights to the features. We will talk more about this in future blogs.
  2. Using the idea of Maximizing Joint Probability of a Matching: One can see that for each copy of ST1 in P2, we will find the similarity using neighborhood information. So, we can say that one sub-tree ST1 can be matched to N different sub-trees in P2, each with a certain weight given by the similarity. Suppose N = 2 and there are 2 locations of ST1 in P2. Because the content remains the same, we would expect that there are exactly 2 locations of ST1 in P1. So, we have a 2 by 2 matching problem here. In general, there can be an N by M matching problem, because the content of a web page does not always remain the same. Note that this matching is bijective. Suppose the probability of each matching for the 2 by 2 case is written on the edges as shown in Fig. 3 below –

    Fig. 3

In the above figure Fig. 3, nodes A and B are the locations of two different instances of ST1 in P1 and nodes 1 and 2 are the locations of two different instances of ST1 in P2.

We have the following alignments:

  • [(A – 1), (B – 2)]
  • [(A – 2), (B – 1)]

As the similarity algorithm we are using assigns the weights of A independently of the weights of B, we can say that:

Probability([(A – 1), (B – 2)]) = Probability([(A – 1)])*Probability([B – 2]) = w1*w4

Probability([(A – 2), (B – 1)]) = Probability([(A – 2)])*Probability([B – 1]) = w2*w3.

Now, we pick the alignment having the highest joint probability. For an N by N mapping, the number of different alignments is N!, which is a huge number. We can actually solve this problem efficiently using the Hungarian Algorithm. We will talk more about this in future blogs.
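
In practice, maximizing a product of probabilities is the same as minimizing the sum of their negative logarithms, which turns the search into a standard assignment problem. A small sketch using SciPy's solver (the probability values are made up for illustration):

import numpy as np
from scipy.optimize import linear_sum_assignment

# prob[i][j] = probability that instance i in P1 aligns with instance j in P2
prob = np.array([[0.7, 0.2],
                 [0.4, 0.9]])

# Maximizing the product of probabilities == minimizing the sum of -log(probability).
cost = -np.log(prob)
row_ind, col_ind = linear_sum_assignment(cost)

print(list(zip(row_ind, col_ind)))    # best alignment, here A -> 1 and B -> 2
print(prob[row_ind, col_ind].prod())  # its joint probability: 0.7 * 0.9 = 0.63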

 

My First Blog

Hello world!

I am glad to be selected for GSoC 2018 at the Python Software Foundation to work on Scrapy Spider Auto-repair. The idea of this project is as follows: spiders can become broken due to changes on the target site, which lead to different page layouts (and therefore broken XPath and CSS extractors). Often, however, the information content of a page remains the same, just in a different form or layout. This project would concern the use of snapshotted versions of a target page, combined with extracted data from that page, to infer rules for scraping the new layout automatically. “Scrapely” is an example of a pre-existing tool that might be instrumental in this project. I am expected to build a tool that can, in some fortunate cases, automatically infer extraction rules to keep a spider up-to-date with site changes. Preferably, these rules can be emitted as new XPath or CSS queries in log output to the user, so that they can incorporate these new rules in the spider for a more maintainable long-term fix.

This blog will serve as a medium for sharing my experiences while I code through the project.

If you have no idea what Google Summer of Code is, it is a program by Google where college students spend the whole summer contributing to selected open source projects. You can find out more about that here.