In my previous blog post I described the ideas that were to be implemented. I spent the last two weeks doing exactly that.
I initially spent a long time implementing the most efficient version of the Hungarian algorithm I could. In my quest for the most efficient implementation, however, I came across SciPy's implementation of it. Here is the link to it.
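SciPy exposes the Hungarian algorithm as `scipy.optimize.linear_sum_assignment`. A minimal usage sketch, with a toy cost matrix of my own choosing (not data from the project):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy example: cost[i][j] is the cost of assigning row i to column j.
cost = np.array([[4, 1, 3],
                 [2, 0, 5],
                 [3, 2, 2]])

# Returns the row indices and the optimal column index for each row.
row_ind, col_ind = linear_sum_assignment(cost)

total = cost[row_ind, col_ind].sum()
print(col_ind)  # optimal assignment of columns to rows
print(total)    # → 5, the minimum total cost
```

Using the library routine avoids maintaining a hand-rolled implementation, and SciPy's version is already well optimized.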
Another idea to speed up the algorithm is to do some preprocessing before computing the edit distance. Once this preprocessing is done, the edit distance between a string A and any substring of a string B can be found in O(1) time. The preprocessing itself takes O(|A||B| + |A| + |B|) time. This brings the overall time complexity down from O(|A||B|²) to O(|A||B| + |A| + |B|), which is a good improvement. More details about this algorithm are given in this paper.
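To make the baseline concrete, here is the classic dynamic-programming edit distance (Wagner–Fischer), which costs O(|A||B|) per query; the preprocessing above is what lets repeated substring queries avoid re-running this DP. This is a sketch of the baseline, not the paper's algorithm:

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Wagner-Fischer DP: O(|a||b|) time, O(|b|) space."""
    prev = list(range(len(b) + 1))  # distances for the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free on match)
            ))
        prev = curr
    return prev[-1]

print(edit_distance("kitten", "sitting"))  # → 3
```

Running this once per candidate substring of B is what yields the original O(|A||B|²) bound that the preprocessing eliminates.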
The goal now is to write a spider in Scrapy to gather data from the top 500 websites. I will be spending my next week doing exactly this.
Link to the implementation code – https://github.com/virmht/Scrapy-Spider-Autorepair/blob/master/auto_repair_code.py