
GSoC 2018: Final week

By the final week, pretty much everything was done and the Scurl library is under control. However, there is still a lot to do, especially finishing the PRs that integrate Scurl into Scrapy and w3lib, a subproject of the Scrapy organization that provides the canonicalize_url function.

First things first: the Scurl codebase was a mess because I had not paid attention to where each piece of code lived. The canonicalize_url component was mixed in with the urlparse functions, so I needed to sort that out before I could officially release the library. I have now moved all the code related to canonicalize_url into canonicalize.pyx and all the code related to urlparse, urljoin and urlsplit into cgurl.pyx. In addition, I moved the property methods into several separate classes and let the ParseResult and SplitResult classes inherit from them. Previously, the properties of SplitResult and ParseResult were created inside the __new__ method, which did not make much sense.
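To illustrate the property refactoring, here is a minimal pure-Python sketch of the idea. It is not the actual Scurl code (which is Cython): the mixin name NetlocPropertiesMixin and the simplified hostname/port logic are my own assumptions for the example, modeled on how the standard library structures its result classes.

```python
from collections import namedtuple


class NetlocPropertiesMixin:
    """Hypothetical mixin: properties live here instead of in __new__."""
    __slots__ = ()

    @property
    def hostname(self):
        # Drop any "user:password@" prefix, then the port.
        host = self.netloc.rpartition("@")[2]
        if host.startswith("["):
            # Bracketed IPv6 literal such as "[::1]:8080"
            return host[1:].partition("]")[0].lower() or None
        return host.partition(":")[0].lower() or None

    @property
    def port(self):
        host = self.netloc.rpartition("@")[2]
        _, sep, port = host.rpartition(":")
        if sep and port.isdigit():
            return int(port)
        return None


class SplitResult(
    namedtuple("SplitResult", "scheme netloc path query fragment"),
    NetlocPropertiesMixin,
):
    __slots__ = ()


class ParseResult(
    namedtuple("ParseResult", "scheme netloc path params query fragment"),
    NetlocPropertiesMixin,
):
    __slots__ = ()
```

Both result classes get the same properties from one place, so a fix to hostname or port only has to be made once.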

Now that the code is better organized, I also needed to improve the performance of the Scurl functions. First, there was a double-parsing issue in urljoin: I created one GURL container just to check whether the input URL could be parsed, so that the function could fall back to the original urljoin when the input is considered invalid, and then a second one to actually parse the URLs. Removing the redundant parse improved the performance of urljoin significantly!
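The shape of that fix can be sketched in plain Python. This is only an illustration under assumptions: fast_parse and fast_join are hypothetical stand-ins for the GURL-backed code (here they simply lean on the standard library), and the validity check is deliberately simplistic. The point is that the base URL is parsed once and the same result is used both for the validity check and for the join.

```python
from urllib.parse import urljoin as stdlib_urljoin, urlsplit


def fast_parse(url):
    """Hypothetical stand-in for the GURL parser.

    Returns the parsed result, or None if the URL is considered invalid.
    """
    parts = urlsplit(url)
    if not parts.scheme or not parts.netloc:
        return None
    return parts


def fast_join(parsed_base, url):
    """Hypothetical stand-in for the fast join on an already parsed base."""
    return stdlib_urljoin(parsed_base.geturl(), url)


def urljoin(base, url):
    # Parse once; the result doubles as the validity check.
    parsed_base = fast_parse(base)
    if parsed_base is None:
        # Invalid for the fast path: fall back to the original urljoin.
        return stdlib_urljoin(base, url)
    # Reuse the parsed base instead of parsing it a second time.
    return fast_join(parsed_base, url)
```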

Another task was removing the many remaining calls to the “quote” function from urllib.parse. I replaced all of them with calls to canonicalize_component from GURL. This also brought a significant improvement in the performance of the canonicalize_url function!
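For context, this is roughly the kind of pure-Python call that was replaced. The helper name quote_path and the choice of safe characters are assumptions for the example, not the exact calls from the Scurl or w3lib source, and the GURL side is not shown because its interface is not described here.

```python
from urllib.parse import quote


def quote_path(path):
    """Percent-encode a URL path in pure Python, keeping "/" as-is.

    Hypothetical helper showing the per-component work that quote() does.
    """
    return quote(path, safe="/")


print(quote_path("/a b/é"))  # /a%20b/%C3%A9
```

Each such call runs in Python for every URL component, which is why moving the work into the C++ side pays off.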

Other than that, I will work on the final report next!