Maximizing Joint Probability of Alignment and Data Extraction

In my previous blog post, I had explained about maximizing joint probability of alignment, i.e., choosing the alignment whose probability is maximum. To do this, we will use Hungarian algorithm  and its variants. Let’s begin by describing the problem a little more formally.

Assume that we have N workers and N jobs that should be done. For each pair (worker, job) we know salary that should be paid to worker for him to perform the job. Our goal is to complete all jobs minimizing total inputs, while assigning each worker to exactly one job and vice versa.

Converting this problem to a formal mathematical definition we can form the following equations:
 – cost matrix, where cij – cost of worker i to perform job j.
 – resulting binary matrix, where xij = 1 if and only if ith worker is assigned to jth job.
 – one worker to one job assignment.
 – one job to one worker assignment.
 – total cost function.

 

Figure – 1

I didn’t know how to insert latex in this blog post, hence, putting a picture below –

Figure – 2

Data Extraction:

The above ideas look good in theory, however, we need to test them. In order to perform this test, we want to gather snapshots of different websites at different dates. Once we have this data, we will generate random spiders to extract certain sub-trees from two different snapshots of a web-page. Because the layout of a web-page can change with time, there is a small probability that a spider written to extract data from one snapshot of a web-page fails to extract the same sub-tree from another snapshot of the web-page. Once it fails, we will use the auto-repair code to correct the spider. To text the improvement in performance, we will incrementally add features to the code and test the performance. As we make our algorithm better, we expect to see improvements in the performance as well. For this purpose, the extracted data is useful.

Here is the link to the data extraction code.

Future Plan of Work:

The plan is to implement the ideas in the existing code. Particularly, I wish t make the following improvements –

  1. Improve the tree edit distance run-time complexity
  2. Implement the idea of using nearby landmarks as feature vectors which serve as the weights in the alignment problem.
  3. Implement the idea of maximizing joint probability of alignment.
  4. Testing the implementation for accuracy.

The plan is to complete the above by the start of second evaluation.

Leave a Reply

Your email address will not be published. Required fields are marked *