GSOC 2020 - Final Report

arnav_k
Published: 08/27/2020

Hi! First of all, here is the link to the newly created number-parser library: https://github.com/scrapinghub/number-parser. The entire library was created from scratch as part of GSoC 2020. Going over the GitHub stats:

  • 58 commits
  • 46000+ lines of code added
  • 22000+ lines of code deleted

Phew, that's a lot. Before going into the details of the challenges and the work done, I would like to thank all my mentors: this journey was possible only because of their support and input. Special shout-out to Marc (@noviluni): without your constant code reviews and feedback, this library would not have been half as good.
 

Work Done
The README gives a more detailed explanation of what the library can do and how to use it. Basically, the goal was to convert numbers written in natural language to their numeric form, and I am proud to say that we succeeded to a large extent.
Additionally, it supports multiple languages:

For cardinal numbers -> English, Russian, Hindi, Spanish
For ordinal numbers -> English
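
To give a feel for the core idea of turning number words into a numeric value, here is a minimal sketch for English cardinal numbers only. It is not the code of number-parser and does not use its API; the function name words_to_number and the lookup tables are assumptions made up for this illustration, and the real library handles far more cases (multiple languages, ordinals, numbers embedded in sentences).

```python
# Illustrative sketch only: NOT the number-parser implementation.
# All names below are hypothetical and chosen for this example.

UNITS = {
    "zero": 0, "one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
    "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10,
    "eleven": 11, "twelve": 12, "thirteen": 13, "fourteen": 14,
    "fifteen": 15, "sixteen": 16, "seventeen": 17, "eighteen": 18,
    "nineteen": 19,
}
TENS = {
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
    "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90,
}
MULTIPLIERS = {
    "thousand": 1_000,
    "million": 1_000_000,
    "billion": 1_000_000_000,
}


def words_to_number(text):
    """Convert an English cardinal phrase to an int, or return None."""
    total = 0      # sum of completed groups (thousands, millions, ...)
    current = 0    # value of the group currently being read (0..999)
    seen_number_word = False

    for word in text.lower().replace("-", " ").split():
        if word == "and":
            # "two thousand and twenty": the connector carries no value.
            continue
        if word in UNITS:
            current += UNITS[word]
        elif word in TENS:
            current += TENS[word]
        elif word == "hundred":
            # A bare "hundred" is treated as "one hundred".
            current = max(current, 1) * 100
        elif word in MULTIPLIERS:
            # Close the current group and add it to the running total.
            total += max(current, 1) * MULTIPLIERS[word]
            current = 0
        else:
            # Unknown word: this sketch gives up instead of guessing.
            return None
        seen_number_word = True

    return total + current if seen_number_word else None


print(words_to_number("two thousand and twenty"))
print(words_to_number("one hundred twenty-three"))
```

The sketch keeps a running group value and flushes it whenever a multiplier such as "thousand" appears, which is one common way to approach the problem; supporting another language would at minimum mean swapping in different lookup tables.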

Challenges / Work Remaining
The toughest challenge was starting the process from scratch with little reference material. Once this phase was done, however, the work was fun and relatively smooth.
One major aspect was setting up good tests to ensure the library works well. Once again, thanks to all my mentors for helping with adding tests in different languages.
Apart from that, the library is also planned to be a dependency of dateparser (https://github.com/scrapinghub/dateparser/pull/711), which in turn meant keeping a high level of code quality. It was also important to structure the code and add features (like automatic language detection) so that the integration would be smooth.

The library is still in its early stages and there is plenty of scope for improvement, so I request one and all to contribute and make it better. The major issues to tackle in order to revamp the library can be found in the list of pending issues:
https://github.com/scrapinghub/number-parser/issues

All in all, it has been an amazing summer, and I would like to thank everyone who was part of it.

Signing off
Arnav
