Here is the link to the source code.
I am glad to have been selected for GSoC 2018 with the Python Software Foundation to work on Scrapy Spider Auto-repair. The idea of the project is as follows. Spiders can break when the target site changes its page layout, which invalidates their XPath and CSS extractors. Often, however, the information content of a page remains the same, just in a different form or layout. This project concerns using snapshotted versions of a target page, combined with the data previously extracted from that page, to infer rules for scraping the new layout automatically. “Scrapely” is an example of a pre-existing tool that might be instrumental in this project. I am expected to build a tool that can, in some fortunate cases, automatically infer extraction rules to keep a spider up to date with site changes. Preferably, these rules can be emitted as new XPath or CSS queries in log output, so that users can incorporate them into the spider for a more maintainable long-term fix.
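To make the idea concrete, here is a minimal sketch of what "inferring a rule from previously extracted data" could look like. It is not the project's actual method and it does not use Scrapely. The function `infer_xpath` and the two sample pages are hypothetical, and the sketch assumes well-formed markup and an exact text match, which real pages rarely guarantee.

```python
import xml.etree.ElementTree as ET


def infer_xpath(page_source, known_value):
    """Return a positional XPath for the first element whose text equals
    known_value, or None if no element matches.

    Hypothetical helper for illustration only: it assumes the page is
    well-formed XML and that the value appears verbatim as element text.
    """
    root = ET.fromstring(page_source)

    def walk(element, path):
        if (element.text or "").strip() == known_value:
            return path
        # Count same-tag siblings so each step gets a positional index.
        seen = {}
        for child in element:
            seen[child.tag] = seen.get(child.tag, 0) + 1
            found = walk(child, f"{path}/{child.tag}[{seen[child.tag]}]")
            if found:
                return found
        return None

    return walk(root, f"/{root.tag}")


# Made-up snapshot of the old layout and a made-up new layout.
OLD_PAGE = "<html><body><div><h1>Blue Widget</h1><span>$9.99</span></div></body></html>"
NEW_PAGE = "<html><body><section><p>Sale</p><p>Blue Widget</p></section><span>$9.99</span></body></html>"

# The value "Blue Widget" was extracted from the old snapshot, so we look
# for it again in the new layout and report where it now lives.
old_rule = infer_xpath(OLD_PAGE, "Blue Widget")
new_rule = infer_xpath(NEW_PAGE, "Blue Widget")
print("old rule:", old_rule)
print("suggested new rule:", new_rule)
```

The printed rule is the kind of suggestion that could go into the spider's log output. A real tool would need tolerant HTML parsing, fuzzy matching, and more robust selectors than purely positional paths.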
This blog will serve as a medium for sharing my experiences while I code through the project.
If you have no idea what Google Summer of Code is, it is a program run by Google in which college students spend the summer contributing to selected open source projects. You can find out more about that here.