In the last two weeks, I worked on:
1. Building the Spider Failure Detector.
2. Learning to build a plugin for the Scrapy Spider Auto-repair code.
To build the Spider Failure Detector, we assume that a spider's callback/parse method returns three types of objects:
1. Dictionaries of extracted data.
2. Item objects.
3. Request objects (for following links to further pages).
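As a rough sketch of what such a callback looks like, the example below yields all three kinds of objects. The `ProductItem` and `Request` classes here are hypothetical plain-Python stand-ins for Scrapy's `Item` and `Request` classes, and `parse` takes a plain dict instead of a real `Response`, purely for illustration:

```python
from dataclasses import dataclass
from collections import namedtuple

# Hypothetical stand-in for a scrapy.Item subclass.
@dataclass
class ProductItem:
    name: str
    price: float

# Hypothetical stand-in for scrapy.Request.
Request = namedtuple("Request", ["url"])

def parse(response):
    # 1. A plain dict of extracted fields.
    yield {"title": response["title"]}
    # 2. An Item object.
    yield ProductItem(name=response["name"], price=response["price"])
    # 3. A follow-up Request for the next page.
    yield Request(url=response["next_url"])
```

The failure detector only needs to compare the first two kinds (the extracted data); Requests describe crawling behaviour rather than extracted content.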
So, to detect the failure of a spider, we compare the data extracted by the spider on the old page with the data extracted by the same spider on the new page. If the two are equal, we say the spider is working; otherwise, we say it has failed.
The data extracted by the spider can be of the following types:
1. Dictionaries.
2. Item objects.
The values corresponding to the keys in a dictionary, and the values of the attributes of Item objects, can themselves be objects. However, if we keep descending, i.e. to the values of values of values of … values of the first item attribute/dictionary key, the values will ultimately be primitive datatypes.
So we recursively compare the extracted objects to see if the extracted data is equal.
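The recursive comparison above can be sketched as follows. This is a minimal illustration, not the actual implementation from the pull request; it assumes dictionaries, sequences, and attribute-bearing objects (such as Items) as the only container types, with everything else compared as a primitive:

```python
def deep_equal(a, b):
    # Dicts: same keys, and recursively equal values.
    if isinstance(a, dict) and isinstance(b, dict):
        return a.keys() == b.keys() and all(deep_equal(a[k], b[k]) for k in a)
    # Lists/tuples: same length, and recursively equal elements.
    if isinstance(a, (list, tuple)) and isinstance(b, (list, tuple)):
        return len(a) == len(b) and all(deep_equal(x, y) for x, y in zip(a, b))
    # Item-like objects: same type, and recursively equal attributes.
    if hasattr(a, "__dict__") and hasattr(b, "__dict__"):
        return type(a) is type(b) and deep_equal(vars(a), vars(b))
    # Base case: primitive datatypes compared directly.
    return a == b
```

The recursion terminates because, as noted above, descending through nested values eventually reaches primitive datatypes.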
Here is the link to the pull request for this feature.
Apart from the Spider Failure Detector, I also worked on making a plugin for Scrapy and figuring out a way to install the auto-repair code in that plugin.
From next week onwards, I will work on implementing the plugin and writing some supporting code.