Web data extraction using XPaths is brittle and vulnerable to site layout changes. The goal of my project is to make web extraction robust by automatically correcting XPaths and rearranging data in containers to ensure that the data extracted before and after changes in web site layout is the same. We assume that we have access to the concerned web page before and after layout changes.
Any web page can be thought to be a tree where the tags form the internal nodes of the tree and the content inside the tags form the child elements.
Detecting Failure in Data Extraction:
The first step is to detect if data extraction has failed. We assume that the location of extracted data is specified using an XPath. Such XPath expressions can be used to extract a sub-tree of the HTML tree of a web page. We compare the extracted sub-tree before and after changes in layout, if they are the same, we are done, however, if they are different, then we try to repair the XPaths to ensure that the extracted data is the same before and after page layout changes.
Repairing the Spider:
We will call the web page before layout changes as P1 and the web page after layout changes as P2.
Stage – 1:
Case I – Sub-tree extracted from P1 can be found in P2:
This is the simplest case. We simply search for the sub-tree extracted from P1 in P2. We use Levenshtein distance to find the distance between two sub-trees represented as an HTML strings. If this distance is smaller than a threshold, we pass this sub-tree for further processing for stage – 2 and we remember its XPath.
Case II – Sub-tree extracted from P1 cannot be found in P2 but the leaf nodes of the extracted sub-tree can be found in P2:
Let’s call the sub-tree extracted from P1 as ST1. We try to find the largest sub-tree of ST1 which is present in P2(using Levenshtein Distance as stated in Case I above). We remember the Path of this sub-tree of ST1 and write this path in place of this sub-tree in ST1. Let me illustrate this with a simple example:
The above figure shows the HTML trees of web pages P1 and P2. One can also see the sub-tree ST1 of P1 inside the blue boundary. Let’s say, we wish to extract this sub-tree. Now, we will try to look for this ST1 in P2. Clearly, ST1 cannot be found in P2. You can see that if the node 8(in red boundary) would have been absent, then, we would be ale to find ST1 in P2. Now, let’s take a look at the numbers written on the edges of the two trees. These numbers indicate the path taken from the root to reach a given node. So, as node 2 is the first child of node 1 in P1, we say that one must follow the path  to reach node 2. Similarly, node 3 is the second child(ordered left to right) of node 1, hence, one must follow the path  to reach node 3. Similarly, to reach node 5 in P1, one must follow the path [0, 1] from the root to reach node 5.
Since, ST1 is not present in P2, we will try to search for sub-trees of ST1 in P2. Now, one can see that the sub-tree containing just the node 4 can be found in P2. Similarly, the subtree consisting of nodes 5 and 6 can also be found in P2, so, we will replace these two sub-trees in ST1 with their paths in P2. So now, ST1 becomes the tree shown in the figure Fig. 2 below.
Stage – 2:
The above logic should work well except for one case when the sub-tree to be extracted can be found at multiple places in P2, so now, the question is – the path of which sub-tree in P2 should be used to replace sub-tree of ST1 as shown in Fig. 2?
To deal with this issue, we will use the fact that the surroundings of a sub-tree should be somewhat similar in both P1 and P2. For example, if you want to extract the Username element in an HTML form, you would expect that the Password element is some where near it. The idea is to use this. The plan is to create a binary feature vector for the K nearest landmarks with respect to ST1. If ST1 is the Username element, then its logical to expect that landmarks – Password and Submit button are near that element. So the feature vector for the 2 nearest neighbours = [1, 1, 0]([for Password element, for Submit button, for Sign up button]). So, for a particular sub-tree in P2 for which the Levenshtien distance is less than the threshold, we look for its surrounding landmarks. Suppose, it only has the submit button(nearest element) and the sign-up button (second-nearest element) but no password element, then, the feature vector for this sub-tree will be [0, 1, 1]. Now, to find the similarity between these two vectors, we use Cosine Similarity. So for these two vectors, the cosine similarity is 0.5. For each occurrence of ST1 in P2, we find the similarity of that sub-tree with ST1 using neighborhood information.
Can we do better?
It turns out that we can do much better. There are two important ideas:
- Using weights: Some landmarks are more important than other landmarks. So, instead of having binary feature vectors, let’s assign weights to the features. We will talk more about this in future blogs.
- Using the idea of Maximizing Joint Probability of a Matching: One can see that for each copy of ST1 in P2, we will find the similarity using neighborhood information. So, we can say that one sub-tree ST1 can be matched to N different sub-trees in P2 each with a certain weight given by the similarity. Suppose, N = 2 and there are 2 locations of ST1 in P2. Because the content remains the same, we would expect that there should be 2 exactly locations of ST1 in P1. So, we have a 2 by 2 matching problem here. In general, there can be a N by M matching problem because it is not always necessary that the content of a web page remains same always. Note that this matching is bijective. Suppose, the probability of each matching for the 2 by 2 case is written on the edges as shown in Fig. 3 below –
In the above figure Fig. 3, nodes A and B are the locations of two different instances of ST1 in P1 and nodes 1 and 2 are the locations of two different instances of ST1 in P2.
We have the following alignments:
- [(A – 1), (B – 2)]
- [(A – 2), (B – 1)]
As the similarity algorithm we are using assigns weights of A independently from weights of B, we can say that,
Probability([(A – 1), (B – 2)]) = Probability([(A – 1)])*Probability([B – 2]) = w1*w4
Probability([(A – 2), (B – 1)]) = Probability([(A – 2)])*Probability([B – 1]) = w2*w3.
Now, we pick the alignment having the most joint probability. For an N by N mapping, no. of different alignments = N! which is a huge number. We can actually solve this problem using the Hungarian Algorithm. We will talk more about this in future blogs.