In my previous blog post, I explained the idea of maximizing the joint probability of an alignment, i.e., choosing the alignment whose probability is maximum. To do this, we will use the Hungarian algorithm and its variants. Let's begin by describing the problem a little more formally.
Assume that we have N workers and N jobs that should be done. For each (worker, job) pair we know the salary that must be paid to the worker for him to perform the job. Our goal is to complete all jobs while minimizing the total cost, assigning each worker to exactly one job and vice versa.
Converting this problem into a formal mathematical definition, we get the following:

\(C = [c_{ij}]\) – the cost matrix, where \(c_{ij}\) is the cost of worker \(i\) performing job \(j\).
\(X = [x_{ij}]\) – the resulting binary matrix, where \(x_{ij} = 1\) if and only if the \(i\)-th worker is assigned to the \(j\)-th job.
\(\sum_{j=1}^{N} x_{ij} = 1\) for every \(i\) – one worker to one job assignment.
\(\sum_{i=1}^{N} x_{ij} = 1\) for every \(j\) – one job to one worker assignment.
\(\min \sum_{i=1}^{N} \sum_{j=1}^{N} c_{ij} x_{ij}\) – the total cost function.
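As a quick sanity check of the formulation above, here is a brute-force sketch that enumerates every bijective worker-to-job assignment for a hypothetical 3×3 cost matrix and keeps the cheapest one. This is only feasible for tiny N; the Hungarian algorithm avoids the N! blow-up.

```python
# Brute-force assignment: try every bijective worker -> job assignment
# and keep the cheapest one. Only feasible for tiny N.
from itertools import permutations

cost = [  # hypothetical cost matrix: cost[i][j] = cost of worker i on job j
    [4, 1, 3],
    [2, 0, 5],
    [3, 2, 2],
]
n = len(cost)

best_total, best_assignment = min(
    (sum(cost[i][jobs[i]] for i in range(n)), jobs)
    for jobs in permutations(range(n))
)
print(best_assignment, best_total)  # (1, 0, 2) with total cost 5
```

For real inputs one would use an O(N³) implementation such as SciPy's `linear_sum_assignment` rather than this N! loop.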
The above ideas look good in theory; however, we need to test them. To perform this test, we want to gather snapshots of different websites at different dates. Once we have this data, we will generate random spiders to extract certain sub-trees from two different snapshots of a web page. Because the layout of a web page can change with time, there is a small probability that a spider written to extract data from one snapshot of a web page fails to extract the same sub-tree from another snapshot of the web page. Once it fails, we will use the auto-repair code to correct the spider. To test the improvement in performance, we will incrementally add features to the code and measure the performance. As we make our algorithm better, we expect to see improvements in the performance as well. For this purpose, the extracted data is useful.
Web data extraction using XPaths is brittle and vulnerable to site layout changes. The goal of my project is to make web extraction robust by automatically correcting XPaths and rearranging data in containers to ensure that the data extracted before and after changes in web site layout is the same. We assume that we have access to the concerned web page before and after layout changes.
Any web page can be thought of as a tree where the tags form the internal nodes of the tree and the content inside the tags forms the child elements.
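As an illustration, Python's standard-library ElementTree exposes exactly this tree view for well-formed markup (a real, messy page would need a forgiving parser such as lxml.html; the page string here is a made-up example):

```python
# Parse a (well-formed) HTML snippet and walk its tag tree.
import xml.etree.ElementTree as ET

page = "<html><body><div><p>Hello</p><p>World</p></div></body></html>"
root = ET.fromstring(page)

# Depth-first, document-order walk over the internal (tag) nodes.
tags = [element.tag for element in root.iter()]
print(tags)  # ['html', 'body', 'div', 'p', 'p']
```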
Detecting Failure in Data Extraction:
The first step is to detect whether data extraction has failed. We assume that the location of the extracted data is specified using an XPath. Such XPath expressions can be used to extract a sub-tree of the HTML tree of a web page. We compare the extracted sub-tree before and after the changes in layout. If they are the same, we are done; if they are different, we try to repair the XPaths to ensure that the extracted data is the same before and after the page layout changes.
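A minimal sketch of this failure check, using ElementTree's limited XPath support; the `id`-based extractor and both page snippets are hypothetical examples:

```python
# Detect extraction failure: run the same extractor on both snapshots
# and compare the serialised sub-trees.
import xml.etree.ElementTree as ET

XPATH = ".//div[@id='price']"  # hypothetical extractor (ElementTree's XPath subset)

before = "<html><body><div id='price'>42</div></body></html>"
after = "<html><body><span id='price'>42</span></body></html>"  # layout changed

def extract(page):
    """Return the serialised sub-tree matched by XPATH, or None."""
    node = ET.fromstring(page).find(XPATH)
    return None if node is None else ET.tostring(node)

ok = extract(before) == extract(after)
print("extraction still works" if ok else "extraction broken -> try repair")
```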
Repairing the Spider:
We will refer to the web page before layout changes as P1 and the web page after layout changes as P2.
Stage – 1:
Case I – Sub-tree extracted from P1 can be found in P2:
This is the simplest case. We simply search for the sub-tree extracted from P1 in P2. We use the Levenshtein distance to measure the distance between two sub-trees represented as HTML strings. If this distance is smaller than a threshold, we pass this sub-tree on for further processing in Stage – 2 and we remember its XPath.
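A minimal pure-Python sketch of this distance check; the threshold value here is a made-up example, not the one used in the project:

```python
# Levenshtein (edit) distance between two sub-trees serialised as HTML
# strings; a candidate match is accepted when the distance is under a threshold.
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

THRESHOLD = 5  # hypothetical tolerance
d = levenshtein("<div><p>Hi</p></div>", "<div><p>Hi!</p></div>")
print(d, d < THRESHOLD)  # 1 True
```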
Case II – Sub-tree extracted from P1 cannot be found in P2 but the leaf nodes of the extracted sub-tree can be found in P2:
Let's call the sub-tree extracted from P1 ST1. We try to find the largest sub-tree of ST1 which is present in P2 (using the Levenshtein distance as stated in Case I above). We remember the path of this sub-tree of ST1 and write this path in place of the sub-tree in ST1. Let me illustrate this with a simple example:
The above figure shows the HTML trees of web pages P1 and P2. One can also see the sub-tree ST1 of P1 inside the blue boundary. Let's say we wish to extract this sub-tree. Now, we will try to look for ST1 in P2. Clearly, ST1 cannot be found in P2. You can see that if node 8 (in the red boundary) were absent, then we would be able to find ST1 in P2. Now, let's take a look at the numbers written on the edges of the two trees. These numbers indicate the path taken from the root to reach a given node. Since node 2 is the first child of node 1 in P1, one must follow the path [0] to reach node 2. Similarly, node 3 is the second child (ordered left to right) of node 1, so one must follow the path [1] to reach node 3. Likewise, one must follow the path [0, 1] from the root to reach node 5 in P1.
Since ST1 is not present in P2, we will try to search for sub-trees of ST1 in P2. One can see that the sub-tree containing just node 4 can be found in P2. Similarly, the sub-tree consisting of nodes 5 and 6 can also be found in P2, so we will replace these two sub-trees in ST1 with their paths in P2. ST1 then becomes the tree shown in Fig. 2 below.
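The edge numbering used in the figures can be computed as a zero-based child-index path from the root; a small sketch over a hypothetical tree:

```python
# Compute the child-index path from the root to a target node,
# matching the edge numbers written on the trees in the figures.
import xml.etree.ElementTree as ET

def path_to(root, target):
    """Return the list of zero-based child indices from root to target, or None."""
    if root is target:
        return []
    for i, child in enumerate(root):
        sub = path_to(child, target)
        if sub is not None:
            return [i] + sub
    return None

root = ET.fromstring("<a><b/><c><d/><e/></c></a>")
e = root.find("./c/e")       # second child of the second child of the root
print(path_to(root, e))      # [1, 1]
```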
You can find the code which implements the above logic here and the README for the code here.
Stage – 2:
The above logic should work well except in one case: when the sub-tree to be extracted can be found at multiple places in P2. The question then is: the path of which sub-tree in P2 should be used to replace the sub-tree of ST1, as shown in Fig. 2?
To deal with this issue, we use the fact that the surroundings of a sub-tree should be somewhat similar in both P1 and P2. For example, if you want to extract the Username element of an HTML form, you would expect the Password element to be somewhere near it. The plan is to create a binary feature vector over the K nearest landmarks with respect to ST1. If ST1 is the Username element, then it's logical to expect the landmarks Password and Submit button to be near it. So the feature vector for the 2 nearest neighbours is [1, 1, 0] ([Password element, Submit button, Sign-up button]). Then, for each sub-tree in P2 whose Levenshtein distance is below the threshold, we look at its surrounding landmarks. Suppose it only has the Submit button (nearest element) and the Sign-up button (second-nearest element) but no Password element; then the feature vector for this sub-tree is [0, 1, 1]. To measure the similarity between these two vectors, we use cosine similarity, which for these two vectors is 0.5. For each occurrence of ST1 in P2, we find the similarity of that sub-tree with ST1 using this neighbourhood information.
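The worked example above can be checked directly; a small sketch computing the cosine similarity of the two landmark vectors:

```python
# Cosine similarity between the binary landmark vectors from the example:
# components are [Password, Submit button, Sign-up button].
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

st1_neighbours = [1, 1, 0]        # Password and Submit near ST1 in P1
candidate_neighbours = [0, 1, 1]  # Submit and Sign-up near the candidate in P2
print(cosine(st1_neighbours, candidate_neighbours))  # ~0.5
```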
Can we do better?
It turns out that we can do much better. There are two important ideas:
Using weights: Some landmarks are more important than others. So, instead of using binary feature vectors, let's assign weights to the features. We will talk more about this in future blogs.
Using the idea of Maximizing the Joint Probability of a Matching: For each copy of ST1 in P2, we find the similarity using neighbourhood information. So, we can say that one sub-tree ST1 can be matched to N different sub-trees in P2, each with a certain weight given by the similarity. Suppose N = 2 and there are 2 locations of ST1 in P2. Because the content remains the same, we would expect that there are exactly 2 locations of ST1 in P1. So, we have a 2-by-2 matching problem here. In general, there can be an N-by-M matching problem because it is not always the case that the content of a web page stays the same. Note that this matching is bijective. Suppose the probability of each matching for the 2-by-2 case is written on the edges, as shown in Fig. 3 below –
In the above figure Fig. 3, nodes A and B are the locations of two different instances of ST1 in P1 and nodes 1 and 2 are the locations of two different instances of ST1 in P2.
We have the following alignments:
[(A – 1), (B – 2)]
[(A – 2), (B – 1)]
As the similarity algorithm we are using assigns the weights of A independently of the weights of B, we can say that

P[(A – 1), (B – 2)] = P(A – 1) × P(B – 2)
P[(A – 2), (B – 1)] = P(A – 2) × P(B – 1)
Now, we pick the alignment with the highest joint probability. For an N-by-N mapping, the number of different alignments is N!, which is a huge number. We can solve this problem efficiently using the Hungarian Algorithm. We will talk more about this in future blogs.
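For the 2-by-2 case above, brute force over the 2! alignments is enough to pick the highest-probability one; a sketch with made-up edge probabilities:

```python
# Pick the alignment with the highest joint probability by brute force.
# The Hungarian algorithm on -log(p) costs gives the same answer in O(N^3).
from itertools import permutations
from math import prod

# Hypothetical edge probabilities p[i][j]: P1 instances A, B vs P2 instances 1, 2.
p = [[0.9, 0.2],   # A
     [0.3, 0.7]]   # B

# jobs[i] is the P2 instance matched to P1 instance i.
best = max(permutations(range(2)),
           key=lambda jobs: prod(p[i][jobs[i]] for i in range(2)))
print(best)  # (0, 1), i.e. the alignment [(A - 1), (B - 2)]
```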
I am glad to have been selected for GSoC 2018 at the Python Software Foundation to work on Scrapy Spider Auto-repair. The idea of this project is as follows: spiders can break due to changes on the target site that lead to different page layouts (and therefore broken XPath and CSS extractors). Often, however, the information content of a page remains the same, just in a different form or layout. This project concerns the use of snapshotted versions of a target page, combined with data extracted from that page, to infer rules for scraping the new layout automatically. "Scrapely" is an example of a pre-existing tool that might be instrumental in this project, and I am expected to build a tool that can, in some fortunate cases, automatically infer extraction rules to keep a spider up-to-date with site changes. Preferably, these rules can be emitted as new XPath or CSS queries in log output, so that users can incorporate them into the spider for a more maintainable long-term fix.
This blog will serve as a medium for sharing my experiences while I code through the project.
If you have no idea what Google Summer of Code is: it is a program by Google in which college students spend the summer contributing to selected open-source projects. You can find out more about it here.