GSoC 2018: Testing Rust-url, Chromium-gurl, uriparser

The Python libraries I tested earlier performed no differently from the original urllib, so I decided to take a look at other libraries: rust-url, chromium-gurl and uriparser.

Speed test:

For the speed test, I used the Chromium URL test file, which contains about 83k unique URLs. Here are the results for each library's urlparse function:

 

Parsing speed (in seconds)

Library       Time (s)
urllib        0.66
rust          0.4
uriparser     0.15
gurl-cython   0.13
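
To make the measurement concrete, here is a minimal sketch of how such a timing could be taken for urllib. It is not the actual benchmark script: the helper name `time_parser` and the generated `sample_urls` are assumptions for illustration, and the real run used the Chromium URL test file instead.

```python
import time
from urllib.parse import urlparse


def time_parser(parse, urls):
    """Return the wall-clock seconds needed to call `parse` on every URL."""
    start = time.perf_counter()
    for url in urls:
        parse(url)
    return time.perf_counter() - start


# Hypothetical stand-in data; the real test used about 83k unique URLs.
sample_urls = ["http://example.com/path/%d?q=%d#frag" % (i, i) for i in range(1000)]

elapsed = time_parser(urlparse, sample_urls)
print("urllib: %.4fs for %d urls" % (elapsed, len(sample_urls)))
```

Any other parser can be passed in place of `urlparse`, as long as it takes a single URL string, so the same loop can be reused for each library.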

As the results show, all three libraries parse faster than the original urllib: roughly 1.6x for rust, 4.4x for uriparser and 5x for gurl-cython on this test file. If we replace urllib with one of them, I expect performance to improve by around 100%, since canonicalize_url, the Scrapy function that takes a lot of time while a spider is running, makes heavy use of the parsing functions.

 

Correctness test

How each library parses URLs matters a great deal, since we don't want Scrapy to parse URLs incorrectly. I therefore wrote a correctness test for each library to check its parsing behavior. The test file can be found here, and the repo with the correctness tests is here. The results are noted below:

 

Number of wrong parsing results (total test cases: 409)

Library       Wrong scheme   Wrong netloc   Wrong path
urllib        1              50             1
rust          0              0              0
uriparser     0              11             18
gurl-cython   0              34             56
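
The counts above come from comparing each parsed field with an expected value. A rough sketch of that idea, using urllib as the parser under test, could look like the following. The function `count_mismatches` and the two sample cases are assumptions for illustration and are not taken from the actual test repo.

```python
from urllib.parse import urlsplit

# Hypothetical cases: (input URL, expected scheme, expected netloc, expected path).
# The real test file has 409 cases; these two are only for illustration.
CASES = [
    ("http://example.com/foo", "http", "example.com", "/foo"),
    ("http://[2001::1]/", "http", "[2001::1]", "/"),
]


def count_mismatches(parse, cases):
    """Count how often `parse` disagrees with the expected scheme, netloc and path."""
    wrong = {"scheme": 0, "netloc": 0, "path": 0}
    for url, scheme, netloc, path in cases:
        result = parse(url)
        if result.scheme != scheme:
            wrong["scheme"] += 1
        if result.netloc != netloc:
            wrong["netloc"] += 1
        if result.path != path:
            wrong["path"] += 1
    return wrong


print(count_mismatches(urlsplit, CASES))
```

A wrapper around another library only needs to return an object with `scheme`, `netloc` and `path` attributes to be checked the same way.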

 

Uriparser:

netloc

  1. unmatched netloc: http://[2001::1]/, the result is: 2001::1, while it should be: [2001::1]
  2. unmatched netloc: http://[::7f00:1]/, the result is: ::7f00:1, while it should be: [::7f00:1]
  3. unmatched netloc: http://[::d01:4403]/, the result is: ::d01:4403, while it should be: [::d01:4403]
  4. unmatched netloc: http://[2001::1]/, the result is: 2001::1, while it should be: [2001::1]
  5. unmatched netloc: sc://%1F!"$&'()*+,-.;<=>^_`{|}~/, the result is: %1F!, while it should be: %1F!"$&'()*+,-.;<=>^_`{|}~
  6. unmatched netloc: http://[1::]/, the result is: 1::, while it should be: [1::]
  7. unmatched netloc: non-special://[1:2:0:0:5::]/, the result is: 1:2:0:0:5::, while it should be: [1:2:0:0:5::]
  8. unmatched netloc: non-special://[1:2::3]/, the result is: 1:2::3, while it should be: [1:2::3]
  9. unmatched netloc: non-special://[1:2::3]:80/, the result is: 1:2::3, while it should be: [1:2::3]
  10. unmatched netloc: http://[0:1:0:1:0:1:0:1]/, the result is: 0:1:0:1:0:1:0:1, while it should be: [0:1:0:1:0:1:0:1]
  11. unmatched netloc: http://[1:0:1:0:1:0:1:0]/, the result is: 1:0:1:0:1:0:1:0, while it should be: [1:0:1:0:1:0:1:0]
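
Ten of the eleven mismatches above differ only in the square brackets around an IPv6 address. urllib exposes both forms, which shows that the two results describe the same host. The helper `bracket_ipv6` below is a hypothetical illustration of how a wrapper might restore the brackets; it is not part of uriparser or of the test code.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://[2001::1]/")
print(parts.netloc)    # [2001::1]
print(parts.hostname)  # 2001::1


def bracket_ipv6(host):
    """Hypothetical helper: re-add brackets around a bare IPv6 literal."""
    if ":" in host and not host.startswith("["):
        return "[" + host + "]"
    return host


print(bracket_ipv6("2001::1"))      # [2001::1]
print(bracket_ipv6("example.com"))  # example.com
```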

 

path

  1. ('unmatched path at', 'a: foo.com', 'the result is', '', 'expected', ' foo.com')
  2. ('unmatched path at', 'lolscheme:x x#x%20x', 'the result is', '', 'expected', 'x x')
  3. ('unmatched path at', 'http://example.org/foo/bar#\\', 'the result is', '', 'expected', '/foo/bar')
  4. ('unmatched path at', 'http://foo/path;a??e#f#g', 'the result is', '', 'expected', '/path;a')
  5. ('unmatched path at', 'http://example.org/foo/[61:24:74]:98', 'the result is', '', 'expected', '/foo/[61:24:74]:98')
  6. ('unmatched path at', 'http://example.org/foo/[61:27]/:foo', 'the result is', '', 'expected', '/foo/[61:27]/:foo')
  7. ('unmatched path at', 'http://example.com/foo/%2e%2', 'the result is', '', 'expected', '/foo/%2e%2')
  8. ('unmatched path at', 'http://example.com/foo%', 'the result is', '', 'expected', '/foo%')
  9. ('unmatched path at', 'http://example.com/foo%2', 'the result is', '', 'expected', '/foo%2')
  10. ('unmatched path at', 'http://example.com/foo%2zbar', 'the result is', '', 'expected', '/foo%2zbar')
  11. ('unmatched path at', 'http://example.com/foo%2%C3%82%C2%A9zbar', 'the result is', '', 'expected', '/foo%2%C3%82%C2%A9zbar')
  12. ('unmatched path at', 'http://%60%7B%7D:%60%7B%7D@h/%60%7B%7D?`{}', 'the result is', '', 'expected', '/%60%7B%7D')
  13. ('unmatched path at', 'sc:\\../', 'the result is', '', 'expected', '\\../')
  14. ('unmatched path at', 'wow:%NBD', 'the result is', '', 'expected', '%NBD')
  15. ('unmatched path at', 'wow:%1G', 'the result is', '', 'expected', '%1G')
  16. ('unmatched path at', 'file://host/dir/C|a', 'the result is', '', 'expected', '/dir/C|a')
  17. ('unmatched path at', 'http://example.org/test?%GH', 'the result is', '', 'expected', '/test')
  18. ('unmatched path at', 'http://example.org/test?a#%GH', 'the result is', '', 'expected', '/test')

 

Chromium GURL:

netloc (the result is empty in every case below)

  1. unmatched netloc at non-special://test@test/x the result is  while it should be test
  2. unmatched netloc at non-special://test/x the result is  while it should be test
  3. unmatched netloc at httpa://foo:80/ the result is  while it should be foo
  4. unmatched netloc at sc://fa%C3%9F.ExAmPlE/ the result is  while it should be fa%C3%9F.ExAmPlE
  5. unmatched netloc at sc://ho/i the result is  while it should be ho
  6. unmatched netloc at sc://ho/i the result is  while it should be ho
  7. unmatched netloc at sc://ho/i the result is  while it should be ho
  8. unmatched netloc at sc://ho/pa?i the result is  while it should be ho
  9. unmatched netloc at sc://ho/pa#i the result is  while it should be ho
  10. unmatched netloc at sc://%C3%B1.test/ the result is  while it should be %C3%B1.test
  11. unmatched netloc at sc://%1F!"$&'()*+,-.;<=>^_`{|}~/ the result is  while it should be %1F!"$&'()*+,-.;<=>^_`{|}~
  12. unmatched netloc at sc://%/ the result is  while it should be %
  13. unmatched netloc at sc://%C3%B1/x the result is  while it should be %C3%B1
  14. unmatched netloc at sc://%C3%B1 the result is  while it should be %C3%B1
  15. unmatched netloc at sc://%C3%B1?x the result is  while it should be %C3%B1
  16. unmatched netloc at sc://%C3%B1#x the result is  while it should be %C3%B1
  17. unmatched netloc at sc://%C3%B1#x the result is  while it should be %C3%B1
  18. unmatched netloc at sc://%C3%B1?x the result is  while it should be %C3%B1
  19. unmatched netloc at tftp://foobar.com/someconfig;mode=netascii the result is  while it should be foobar.com
  20. unmatched netloc at telnet://user:pass@foobar.com:23/ the result is  while it should be foobar.com
  21. unmatched netloc at ut2004://10.10.10.10:7777/Index.ut2 the result is  while it should be 10.10.10.10
  22. unmatched netloc at redis://foo:bar@somehost:6379/0?baz=bam&qux=baz the result is  while it should be somehost
  23. unmatched netloc at rsync://foo@host:911/sup the result is  while it should be host
  24. unmatched netloc at git://github.com/foo/bar.git the result is  while it should be github.com
  25. unmatched netloc at irc://myserver.com:6999/channel?passwd the result is  while it should be myserver.com
  26. unmatched netloc at dns://fw.example.org:9999/foo.bar.org?type=TXT the result is  while it should be fw.example.org
  27. unmatched netloc at ldap://localhost:389/ou=People,o=JNDITutorial the result is  while it should be localhost
  28. unmatched netloc at git+https://github.com/foo/bar the result is  while it should be github.com
  29. unmatched netloc at non-special://%E2%80%A0/ the result is  while it should be %E2%80%A0
  30. unmatched netloc at non-special://H%4fSt/path the result is  while it should be H%4fSt
  31. unmatched netloc at non-special://[1:2:0:0:5::]/ the result is  while it should be [1:2:0:0:5::]
  32. unmatched netloc at non-special://[1:2::3]/ the result is  while it should be [1:2::3]
  33. unmatched netloc at non-special://[1:2::3]:80/ the result is  while it should be [1:2::3]
  34. unmatched netloc at a://b/test-a-colon-slash-slash-b.html the result is  while it should be b

 

path (a blank after "should be" means the expected path is empty)

  1. unmatched path at non-special://test@test/x the result is //test@test/x while it should be /x
  2. unmatched path at non-special://test/x the result is //test/x while it should be /x
  3. unmatched path at foo:// the result is // while it should be
  4. unmatched path at foo:///////// the result is ///////// while it should be ///////
  5. unmatched path at foo://///////bar.com/ the result is /////////bar.com/ while it should be ///////bar.com/
  6. unmatched path at foo:////:///// the result is ////:///// while it should be //://///
  7. unmatched path at http://example.com/foo/%2e%2 the result is /foo/.%2 while it should be /foo/%2e%2
  8. unmatched path at http://example.com/%2e.bar the result is /..bar while it should be /%2e.bar
  9. unmatched path at http://example.com/foo%41%7a the result is /fooAz while it should be /foo%41%7a
  10. unmatched path at http://example.com/foo%00%51 the result is /foo%00Q while it should be /foo%00%51
  11. unmatched path at http://www/foo%2Ehtml the result is /foo.html while it should be /foo%2Ehtml
  12. unmatched path at httpa://foo:80/ the result is //foo:80/ while it should be /
  13. unmatched path at sc://fa%C3%9F.ExAmPlE/ the result is //fa%C3%9F.ExAmPlE/ while it should be /
  14. unmatched path at mailto:x@x.com#x the result is x@x.com#x while it should be x@x.com
  15. unmatched path at sc://ho/i the result is //ho/i while it should be /i
  16. unmatched path at sc:///pa/i the result is ///pa/i while it should be /pa/i
  17. unmatched path at sc://ho/i the result is //ho/i while it should be /i
  18. unmatched path at sc:///i the result is ///i while it should be /i
  19. unmatched path at sc://ho/i the result is //ho/i while it should be /i
  20. unmatched path at sc:///i the result is ///i while it should be /i
  21. unmatched path at sc://ho/pa?i the result is //ho/pa while it should be /pa
  22. unmatched path at sc:///pa/pa?i the result is ///pa/pa while it should be /pa/pa
  23. unmatched path at sc://ho/pa#i the result is //ho/pa while it should be /pa
  24. unmatched path at sc:///pa/pa#i the result is ///pa/pa while it should be /pa/pa
  25. unmatched path at sc://%C3%B1.test/ the result is //%C3%B1.test/ while it should be /
  26. unmatched path at sc://%1F!"$&'()*+,-.;<=>^_`{|}~/ the result is //%1F!"$&'()*+,-.;<=>^_`{|}~/ while it should be /
  27. unmatched path at sc://%/ the result is //%/ while it should be /
  28. unmatched path at sc://%C3%B1/x the result is //%C3%B1/x while it should be /x
  29. unmatched path at file://host/dir/C|a the result is /dir/C%7Ca while it should be /dir/C|a
  30. unmatched path at sc://%C3%B1 the result is //%C3%B1 while it should be
  31. unmatched path at sc://%C3%B1?x the result is //%C3%B1 while it should be
  32. unmatched path at sc://%C3%B1#x the result is //%C3%B1 while it should be
  33. unmatched path at sc://%C3%B1#x the result is //%C3%B1 while it should be
  34. unmatched path at sc://%C3%B1?x the result is //%C3%B1 while it should be
  35. unmatched path at sc://? the result is // while it should be
  36. unmatched path at sc://# the result is // while it should be
  37. unmatched path at sc:/// the result is /// while it should be /
  38. unmatched path at sc://// the result is //// while it should be //
  39. unmatched path at sc:////x/ the result is ////x/ while it should be //x/
  40. unmatched path at tftp://foobar.com/someconfig;mode=netascii the result is //foobar.com/someconfig;mode=netascii while it should be /someconfig;mode=netascii
  41. unmatched path at telnet://user:pass@foobar.com:23/ the result is //user:pass@foobar.com:23/ while it should be /
  42. unmatched path at ut2004://10.10.10.10:7777/Index.ut2 the result is //10.10.10.10:7777/Index.ut2 while it should be /Index.ut2
  43. unmatched path at redis://foo:bar@somehost:6379/0?baz=bam&qux=baz the result is //foo:bar@somehost:6379/0 while it should be /0
  44. unmatched path at rsync://foo@host:911/sup the result is //foo@host:911/sup while it should be /sup
  45. unmatched path at git://github.com/foo/bar.git the result is //github.com/foo/bar.git while it should be /foo/bar.git
  46. unmatched path at irc://myserver.com:6999/channel?passwd the result is //myserver.com:6999/channel while it should be /channel
  47. unmatched path at dns://fw.example.org:9999/foo.bar.org?type=TXT the result is //fw.example.org:9999/foo.bar.org while it should be /foo.bar.org
  48. unmatched path at ldap://localhost:389/ou=People,o=JNDITutorial the result is //localhost:389/ou=People,o=JNDITutorial while it should be /ou=People,o=JNDITutorial
  49. unmatched path at git+https://github.com/foo/bar the result is //github.com/foo/bar while it should be /foo/bar
  50. unmatched path at non-special://%E2%80%A0/ the result is //%E2%80%A0/ while it should be /
  51. unmatched path at non-special://H%4fSt/path the result is //H%4fSt/path while it should be /path
  52. unmatched path at non-special://[1:2:0:0:5::]/ the result is //[1:2:0:0:5::]/ while it should be /
  53. unmatched path at non-special://[1:2::3]/ the result is //[1:2::3]/ while it should be /
  54. unmatched path at non-special://[1:2::3]:80/ the result is //[1:2::3]:80/ while it should be /
  55. unmatched path at a:///test-a-colon-slash-slash.html the result is ///test-a-colon-slash-slash.html while it should be /test-a-colon-slash-slash.html
  56. unmatched path at a://b/test-a-colon-slash-slash-b.html the result is //b/test-a-colon-slash-slash-b.html while it should be /test-a-colon-slash-slash-b.html
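
Some of the path mismatches above (items 7 to 11) differ only in whether percent-escapes were decoded. The sketch below, which uses urllib's `unquote` purely as an illustration, shows that the expected path and the GURL result decode to the same string in each of those cases. This is one possible reading of why such differences may not matter in practice; it is my own check and not part of the original test code.

```python
from urllib.parse import unquote

# (expected path from the test file, path returned by GURL), copied from items 7 to 11
pairs = [
    ("/foo/%2e%2", "/foo/.%2"),
    ("/%2e.bar", "/..bar"),
    ("/foo%41%7a", "/fooAz"),
    ("/foo%00%51", "/foo%00Q"),
    ("/foo%2Ehtml", "/foo.html"),
]

for expected, got in pairs:
    # Compare the fully decoded forms of both paths.
    same = unquote(expected) == unquote(got)
    print("%-12s %-10s same after decoding: %s" % (expected, got, same))
```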

 

In addition, the issue mentioned in #1304 is not a problem for these three libraries, as they handle relative URLs correctly.

As can be seen, rust-url passed all the test cases, but it is slower than both chromium-gurl and uriparser.

Uriparser, however, does not support parsing international URLs, so we will not use it for this project. Chromium-gurl does fail some of the test cases, but after discussing with my mentor, Konstantin, we decided to move forward with it, since it is faster than rust-url and the failed test cases are harmless. I will therefore be working on a wrapper for chromium-gurl over the next few weeks!