Adding Documentation, Tests and Refactoring Code

This week and the previous week I worked on the following:

1. I have written an API which users can call to automatically repair their spiders.

This function, auto_repair_lst, takes four parameters:

  • old_page_path (type: string)
  • new_page_path (type: string)
  • lst_extracted_old_subtrees (type: list of lxml.etree._Element objects)
  • rules (type: list, optional)

old_page_path is the path to a file containing the old HTML page on which the spider worked and correctly extracted the required data.

new_page_path is the path to a file containing the new HTML page, on which the spider fails to extract the correct data; this is the file from which you would like the repaired spider to extract the data.

lst_extracted_old_subtrees is a list of objects of type lxml.etree._Element. Each object in this list is a subtree of the old page's HTML tree that was extracted from the old page while the spider was still working.

If rules (the fourth, optional parameter) is given, the function will use these rules to extract the relevant information from the new page directly.

This function takes the above arguments and returns two things:

  • Rules for data extraction
  • List of repaired subtrees. Each subtree in this list is an object of type lxml.etree._Element.
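
Putting this together, the function's contract looks roughly like the sketch below. The type hints and docstring are mine, added for orientation only, and are not copied from the source:

from typing import List, Optional, Tuple
from lxml.etree import _Element

def auto_repair_lst(old_page_path: str,
                    new_page_path: str,
                    lst_extracted_old_subtrees: List[_Element],
                    rules: Optional[list] = None) -> Tuple[list, List[_Element]]:
    """Repair the extraction of lst_extracted_old_subtrees on the new page.

    Returns a pair: (rules for data extraction, list of repaired
    subtrees, each of type lxml.etree._Element).
    """
    ...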

Let’s take a simple example.

Suppose the old page contains the following HTML code:

<html>
    <body>
        <div>
            <p>Username</p>
            <p>Password</p>
            <div>Submit</div>
        </div>
        <div>
            <div>
                <div>
                    <div>
                        <p>Username</p>
                        <p>email</p>
                        <p>Captcha1</p>
                        <p>Captcha2</p>
                    </div>
                </div>
            </div>
        </div> 
        <p>This should not be extracted</p>
    </body>
</html>

And the new page contains the following HTML code:

<html>
    <body>
        <div>
            <p>Username</p>
            <p>email</p>
        </div> 
        <p>This should not be extracted</p>
        <div>
            <p>Hello World</p>
            <div>
                <p>Username</p>
                <p>Password</p>
            </div>
            <div>Submit</div>
        </div>
    </body>
</html>

Now you can run the following code to correctly extract data.

>>> from lxml.etree import tostring
>>> # Page and auto_repair_lst come from this project's package
>>> old_page_path = 'Examples/Autorepair_Old_Page.html'
>>> new_page_path = 'Examples/Autorepair_New_Page.html'
>>> old_page = Page(old_page_path, 'html')
>>> new_page = Page(new_page_path, 'html')
>>> lst_extracted_old_subtrees = [old_page.tree.getroot()[0][1][0][0]]
>>> lst_rules, lst_repaired_subtrees = auto_repair_lst(old_page_path, new_page_path, lst_extracted_old_subtrees)
>>> lst_rules
[[([0, 0], [0, 0, 0]), ([0, 1], [0, 0, 1])]]
>>> len(lst_repaired_subtrees)
1
>>> tostring(lst_repaired_subtrees[0])
b'<div>\n                    <div>\n                        <p>Username</p>\n            <p>email</p>\n        <p>Captcha1</p>\n                        <p>Captcha2</p>\n                    </div>\n                </div>\n            '
>>>

From the above example, you can see that since Captcha1 and Captcha2 could not be found in the new page, they are kept untouched in the repaired subtree.
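
It is also worth decoding the rules value above. Each rule appears to be a list of (old_path, new_path) pairs of child-index paths: in this example, [0, 0] addresses <p>Username</p> inside the extracted old subtree, while [0, 0, 0] addresses the matching <p>Username</p> under the new page's root. That reading is my interpretation of the example output, not documented behaviour. Continuing the session above, a small helper shows how such an index path walks an lxml tree:

>>> def node_at(root, path):
...     """Follow a list of child indices down from root."""
...     for i in path:
...         root = root[i]
...     return root
...
>>> node_at(lst_extracted_old_subtrees[0], [0, 0]).text
'Username'
>>> node_at(new_page.tree.getroot(), [0, 0, 0]).text
'Username'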

Now, whenever you encounter a webpage whose layout is similar to the new page's layout, you can reuse the rules returned during the repair to extract data from it directly.

To do this, simply pass the returned lst_rules as the fourth argument to auto_repair_lst.

Extending the above example, suppose a webpage with a layout similar to the new page's layout is the following:

<html>
    <body>
        <div>
            <p>Google</p>
            <p>Microsoft</p>
        </div> 
        <p>This should not be extracted</p>
        <div>
            <p>Hello World</p>
            <div>
                <p>foo</p>
                <p>bar</p>
            </div>
            <div>blah...</div>
        </div>
    </body>
</html>

You can write the following extra code:

>>> new_page_similar_path = 'C:/Users/Viral Mehta/Desktop/Scrapy-Spider-Autorepair/Examples/Autorepair_New_page_similar.html'
>>> lst_rules, lst_repaired_subtrees = auto_repair_lst(old_page_path, new_page_similar_path, lst_extracted_old_subtrees, lst_rules)
>>> tostring(lst_repaired_subtrees[0])
b'<div>\n                    <div>\n                        <p>Google</p>\n            <p>Microsoft</p>\n        <p>Captcha1</p>\n                        <p>Captcha2</p>\n                    </div>\n                </div>\n            '
>>>
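
Since the returned rules are plain Python data structures, you could also persist them between runs and reuse them whenever a page with a known layout turns up. A minimal sketch using pickle (the cache file name is illustrative):

>>> import pickle
>>> with open('rules_cache.pkl', 'wb') as f:
...     pickle.dump(lst_rules, f)
...
>>> # later, possibly in another session:
>>> with open('rules_cache.pkl', 'rb') as f:
...     cached_rules = pickle.load(f)
...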

2. Adding Documentation: For each function, I have added a short description of what it does, the values it returns, and some examples showing how to use it.
3. Adding Tests: For each function, I have added at least one test. The testing framework used is pytest (see the sketch after this list).
4. Refactoring Code: I have divided the code into functions in such a way that each function belongs to an appropriate class.
5. I have also packaged the code and uploaded it to PyPI. The PyPI URL for my project is:
https://pypi.org/project/scrapy-spider-auto-repair/
6. Apart from that, I have also written the README and contribution guidelines.
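
To illustrate point 3, a test for auto_repair_lst might look like the sketch below. The actual tests in the PR may differ in structure; Page and auto_repair_lst are assumed to be importable from the package under test:

from lxml.etree import tostring

def test_auto_repair_lst_keeps_unmatched_nodes():
    old_page_path = 'Examples/Autorepair_Old_Page.html'
    new_page_path = 'Examples/Autorepair_New_Page.html'
    old_page = Page(old_page_path, 'html')
    subtrees = [old_page.tree.getroot()[0][1][0][0]]

    rules, repaired = auto_repair_lst(old_page_path, new_page_path, subtrees)

    assert len(repaired) == 1
    html = tostring(repaired[0])
    # Username/email are taken from the new page; the captchas,
    # which have no match there, are kept untouched.
    assert b'email' in html
    assert b'Captcha1' in html and b'Captcha2' in html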

You can find all the above in my recent PR – https://github.com/virmht/Scrapy-Spider-Autorepair/pull/3

For the next week, I will wrap up all the pending tasks. I would also like to set up Continuous Integration in my repository, for which I will be using Travis CI.