arnav_k's Blog

Weekly Check-In #5 (19th Jul - 26th Jul)

arnav_k
Published: 07/27/2020

So we have almost reached the second evaluation stage, and the past month went by surprisingly fast. 😅
 

What did you do this week?
Fixes were made to handle the long and short scales used by different locales. Additionally, the input text needed to be normalized, specifically to handle accents in number words. All these changes were approved and will be merged soon.
I also worked a bit on the PR for incorporating this code into the date-parser library.
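
For context on the scales themselves: the same word can denote different values under the two systems, so the parser has to pick the right mapping per locale. Below is a minimal sketch of the idea (the dictionaries and function here are illustrative, not the library's actual API):

```python
# Illustrative only: how long vs. short scale changes multiplier values.
# The short scale advances by powers of 1,000, the long scale by powers
# of 1,000,000, so "billion" means 10**9 or 10**12 depending on locale.
SHORT_SCALE = {"million": 10**6, "billion": 10**9, "trillion": 10**12}
LONG_SCALE = {"million": 10**6, "billion": 10**12, "trillion": 10**18}

def multiplier_value(word, use_long_scale=False):
    """Resolve a scale word against the locale's numbering system."""
    scale = LONG_SCALE if use_long_scale else SHORT_SCALE
    return scale[word]

print(multiplier_value("billion"))                       # 1000000000
print(multiplier_value("billion", use_long_scale=True))  # 1000000000000
```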
 

Did you get stuck anywhere?
Nothing major as such. I was experimenting a bit with different functions to normalize the text, as they needed to work across a wide variety of languages.
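
As an illustration of the kind of approach I was experimenting with (a generic standard-library sketch, not necessarily the exact function that ended up in the PR):

```python
import unicodedata

def strip_accents(text):
    """Remove combining accent marks, e.g. 'zéro' -> 'zero'.

    NFKD decomposition splits an accented character into its base
    character plus combining marks, which can then be dropped.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("zéro"))       # zero
print(strip_accents("veintidós"))  # veintidos
```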
 

What is coming up next?
The plan for the next week is to wrap this up and get a first version out. Thereafter, I will discuss with the mentors whether to add further support to this library (ordinal numbers and decimals) or move on to the incorporation work.


Weekly Blog #4 (12th Jul - 19th Jul)

arnav_k
Published: 07/20/2020

Hi all, we were hoping to get the first version out last week; however, some bugs crept up and it was important to resolve them before moving ahead. We now hope to release the first version to PyPI by this week's end. The two major troublesome cases were as follows:
 
  1. Adding large constants - We have supplementary JSON data files that are used to add language-specific information, i.e. information missing from the CLDR repository data source. This also allows users to easily add language-specific data. We can add large constants like Gúgol and centillón to the supplementary JSON file using terms like 1e600 (which JSON supports). However, while reading them with a Python script and merging them into the final Python language file, an error occurred: the range of a Python float tops out at about 1.79e+308, so larger numbers get translated to infinity. We still don't have a fix for this, but since it is a minor issue impacting only very large numbers, we can let it be for the time being. (A short demonstration of the problem follows the list.)
  2. Long and short scales - One of the things not considered until now was long and short scales. The long scale is based on powers of one million (1,000,000), whereas the short scale is based on powers of one thousand (1,000). Again, this impacts the logic only for numbers greater than 10**9. It hasn't been incorporated yet, and we plan to fix this. More details about the issue here: https://github.com/arnavkapoor/number-parser/pull/24
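
To make the first issue concrete, here is a minimal reproduction of the float-range problem (the key name is illustrative; the behaviour itself is standard Python):

```python
import json

# JSON happily accepts exponents beyond float range in the source text...
data = json.loads('{"centillon": 1e600}')

# ...but Python floats cap out near 1.79e+308, so the value parses as inf.
print(data["centillon"])                  # inf
print(float("1e600"))                     # inf
print(data["centillon"] == float("inf"))  # True
```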

Despite these issues, the week was productive in general. I have incorporated most of the mentor's feedback into the PR. My mentor also found an effective way to automate part of the testing, which will help us be more confident in the code.
The plan for next week is to ensure these minor issues are fixed and to release the first version.

Weekly Check-In #4 (5th Jul - 12th Jul)

arnav_k
Published: 07/13/2020

So we are about halfway through the project and the number-parser library is going strong. We are very close to publishing version 1.0 to PyPI.

What did you do this week?
The parser library was refactored to create a language class and to use the language data Python files. Tests were added for the supported languages, which helped identify a number of small bugs across the board; these were gradually fixed.
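
A rough sketch of the shape such a refactor can take (the class and the sample data below are hypothetical, not the library's actual internals):

```python
class Language:
    """Bundle per-language parsing data behind one object."""

    def __init__(self, unit_numbers, mtens, multipliers):
        # Each table maps a word to its numeric value, e.g. {"two": 2},
        # and would normally come from a generated language data file.
        self.unit_numbers = unit_numbers
        self.mtens = mtens
        self.multipliers = multipliers

    def value_of(self, token):
        """Look a word up across the data tables; None if unknown."""
        for table in (self.unit_numbers, self.mtens, self.multipliers):
            if token in table:
                return table[token]
        return None

english = Language({"one": 1, "two": 2}, {"twenty": 20}, {"hundred": 100})
print(english.value_of("twenty"))  # 20
```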

Did you get stuck anywhere?
One of the bugs was how to elegantly handle multiple consecutive multipliers, for example 'thousand millions'. In the end I came up with a reasonable solution that should work across all languages, but more tests need to be added to ascertain this.
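
The core observation for a case like 'thousand millions' is that consecutive multipliers compose by multiplication (1,000 x 1,000,000 = 10**9). A minimal sketch of that logic (simplified; the real implementation has to handle much more structure):

```python
MULTIPLIERS = {"hundred": 100, "thousand": 1_000, "million": 1_000_000}

def combine_multipliers(tokens):
    """Fold a run of consecutive multiplier words into a single value."""
    value = 1
    for token in tokens:
        value *= MULTIPLIERS[token]
    return value

print(combine_multipliers(["thousand", "million"]))  # 1000000000, i.e. 10**9
```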

What is coming up next?
This week the first version will be released. Getting there will require a round of code cleanup, adding documentation, and more tests for all supported languages.


Weekly Blog #3 (29th Jun - 5th Jul)

arnav_k
Published: 07/06/2020

Hey everyone, we are done with the first third of the program, so I will use this blog both to give the weekly update and to summarize the current state of progress. In the past 4 weeks, we have created a new number-parser library from scratch and built an MVP that is being continuously improved.

Last week was spent fine-tuning the parser to retrieve the relevant data from the CLDR RBNF repo. This rule-based number format (RBNF) repo is essentially the data behind a Java library that converts a number (23) to the corresponding words (twenty-three). It has a lot of hard-coded values and data that are very useful to our library, and thus we plan to extract all this information accurately and efficiently.
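
For a flavour of what this data looks like, the spellout rule sets are roughly of this shape (a simplified illustration of the rule syntax, not an exact excerpt from the repo):

```
0: zero; 1: one; 2: two; ... 19: nineteen;
20: twenty[->>];
30: thirty[->>];
100: << hundred[ >>];
1000: << thousand[ >>];
```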

In addition to this, there are multiple nuances in each of the languages that needed to be taken care of, such as accents. For example, the French '0' is written as zéro, with an accent aigu over the e. However, we don't expect users to enter these accents each time, hence we need to normalize (i.e. remove) them.

The most challenging aspect was definitely understanding the CLDR RBNF structure (which I am still not completely clear on). There is only a little documentation explaining some of the basic rules, and it's tough to identify which rules are relevant and which aren't.

Originally I was hoping to add more tests this week as well; however, all this took longer than expected, so the testing aspect is being pushed to the current week.


Weekly Check-In #3 (22nd Jun - 29th Jun)

arnav_k
Published: 06/28/2020

Hi, so we are done with week 4 of the program - a third of the way in. It has been a fun ride, and progress is on par with the expected timeline.

What did you do this week?
A major overhaul of the code was done to incorporate the multiple-locale features: https://github.com/arnavkapoor/number-parser/pull/12. The base structure is now in place for multiple-language support; currently we have the data for 4 supported languages (English, Russian, Spanish, Hindi). This involved parsing raw data for each of these languages.

The main approach rests on creating 6 dictionaries for each of the languages (a sketch of what these look like for English follows the list):

  • UNIT_NUMBERS -> Numbers from 1 to 9.

  • BASE_NUMBERS -> These contain uniquely defined numbers (i.e. ones that don't use any prefix). The maximum range is [10,99]; the exact range changes between languages.

    • English -> This range is from [10,19] (ten, eleven, twelve, ..., nineteen).
    • Hindi -> This range is from [10,99]; unique words exist all the way up to 100.
    • Spanish -> This range is from [10,29].
  • MTENS -> These are multiples of ten from 20 to 90 that are used along with unit numbers to form the complete number. This might be empty for certain languages like Hindi. For English this list is twenty, thirty, forty, ..., ninety.

  • MHUNDREDS -> These are multiples of one hundred, from 200 to 900. This is a new set, added because it wasn't needed for English or Hindi; however, it is widely used in Russian and Spanish, and probably other languages too.

    • This includes words like doscientos (200), quinientos (500), пятьсот (500), and двести (200).
      One alternate approach was to parse substrings instead, e.g. splitting doscientos into 'dos' as two and 'cientos' as hundred (100). However, the lack of delimiters would have meant a major upheaval in the logic. Also, words like quinientos don't have any root word (5 is cinco), and the Russian suffix differs based on the number, e.g. сти for 200 but сот for 500.
      Thus I decided to create this dictionary as opposed to parsing substrings.
  • MULTIPLIERS -> These are simply powers of 10, e.g. for English: hundred, thousand, and so on.

  • VALID_TOKENS -> Certain words are ignored when they appear between numbers: 'and' for English, 'y' for Spanish, and so on.
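
To make the structure concrete, here is roughly what these dictionaries look like for English (abridged sample values; the real generated data files are more complete):

```python
# Abridged English data, in the shape described above.
UNIT_NUMBERS = {"one": 1, "two": 2, "three": 3, "nine": 9}
BASE_NUMBERS = {"ten": 10, "eleven": 11, "twelve": 12, "nineteen": 19}
MTENS = {"twenty": 20, "thirty": 30, "forty": 40, "ninety": 90}
MHUNDREDS = {}  # not needed for English, unlike Russian or Spanish
MULTIPLIERS = {"hundred": 100, "thousand": 1_000, "million": 1_000_000}
VALID_TOKENS = {"and"}

# e.g. "four hundred and twenty" tokenizes into UNIT_NUMBERS["four"],
# MULTIPLIERS["hundred"], a VALID_TOKENS word to skip, and MTENS["twenty"].
```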

 
Did you get stuck anywhere?
Getting the language data and parsing its content was tricky, as some of the languages have multiple forms of the same number. Getting this raw data took the most time; in the end I stuck with the CLDR RBNF data for populating the above dictionaries.


What is coming up next?
The mentors were really helpful with a detailed review of the PR. The most important thing to do now is to add more tests; the project is highly test-driven, and thus creating a robust set of tests for each language is essential.
