Hi! First of all, here is the link to the newly created number-parser library - https://github.com/scrapinghub/number-parser. The entire library was created from scratch
as part of GSoC 2020. Going over the GitHub stats:
- 58 commits
- 46,000+ lines of code added
- 22,000+ lines of code deleted
Phew, that's a lot. Before going into the details of the challenges and the work done, I would like to thank all my mentors - this journey was possible only because of their support and input. Special shout-out to Marc (@noviluni): without your constant code reviews and input, this library would not have been half as good.
Work Done
The README gives a more detailed explanation of what the library can do and how to use it. The goal was to convert numbers written in natural language into their numeric form, and I am proud to say that we largely succeeded.
Additionally, it supports multiple languages:
For cardinal numbers -> English, Russian, Hindi, Spanish
For ordinal numbers -> English
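To give a feel for the core idea, here is a minimal, self-contained sketch of turning English cardinal number words into an integer. It is only an illustration and not the library's actual implementation or API: the function name `words_to_int` and the lookup tables are hypothetical, and it assumes simple, well-formed English cardinals.

```python
# Hypothetical lookup tables for this sketch (not taken from number-parser).
UNITS = {word: value for value, word in enumerate(
    "zero one two three four five six seven eight nine ten eleven twelve "
    "thirteen fourteen fifteen sixteen seventeen eighteen nineteen".split()
)}
TENS = {word: value * 10 for value, word in enumerate(
    "twenty thirty forty fifty sixty seventy eighty ninety".split(), start=2
)}
SCALES = {"thousand": 10**3, "million": 10**6, "billion": 10**9}


def words_to_int(text):
    """Convert a simple English cardinal such as 'thirty seven' to an int."""
    total = 0      # sum of finished groups (thousands, millions, ...)
    current = 0    # value of the group currently being built (0..999)
    seen = False   # whether at least one number word was consumed

    for token in text.lower().replace("-", " ").split():
        if token == "and":
            continue  # filler word, as in "two thousand and twenty"
        if token in UNITS:
            current += UNITS[token]
        elif token in TENS:
            current += TENS[token]
        elif token == "hundred":
            # A bare "hundred" counts as "one hundred".
            current = max(current, 1) * 100
        elif token in SCALES:
            # Close the current group and add it to the running total.
            total += max(current, 1) * SCALES[token]
            current = 0
        else:
            raise ValueError(f"not a number word: {token!r}")
        seen = True

    if not seen:
        raise ValueError("no number words found")
    return total + current


print(words_to_int("two thousand and twenty"))    # 2020
print(words_to_int("one hundred twenty-three"))   # 123
```

The sketch keeps a running group (`current`) and folds it into `total` whenever a scale word like "thousand" appears. Supporting other languages and ordinals needs different word tables and extra rules, which is where most of the real work lies.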
Challenges / Work Remaining
The toughest challenge was starting from scratch with little reference material. Once that phase was done, the work was fun and went relatively smoothly.
One major aspect was setting up good tests to ensure the library works well. Once again, thanks to all my mentors for helping to add tests in different languages.
Apart from that, the library is planned to become a dependency of dateparser (https://github.com/scrapinghub/dateparser/pull/711), which in turn meant keeping a high level of code quality. It was also important to structure the code and add features (like automatic language detection) so that the integration would be smooth.
The library is still in its early stages and there is plenty of scope for improvement, so I invite one and all to contribute and make it better. From the list of pending issues
(https://github.com/scrapinghub/number-parser/issues), the major ones for improving the library are:
- https://github.com/scrapinghub/number-parser/issues/40 - Support for more languages -> While the code is modular enough to accommodate more languages, some languages and specific edge cases still need work. Additionally, ordinal numbers are currently supported only in English, and this needs to be expanded.
- https://github.com/scrapinghub/number-parser/issues/11 - Support for decimal and negative numbers -> Currently the library is limited to integers.
All in all, it has been an amazing summer, and I would like to thank everyone who was part of it.
Signing off
Arnav