Articles on arnav_k's Blog

GSOC 2020 - Final Report

arnav.kapoor@research.iiit.ac.in (arnav_k) — Thu, 27 Aug 2020 11:33:43 +0000

Hi, first of all, here is the link to the newly created number-parser library: https://github.com/scrapinghub/number-parser. The entire library was created from scratch
as part of GSoC 2020. Going over the GitHub stats :-

58 commits
46000+ lines of code added
22000+ lines of code deleted.

Phew, that's a lot. Before going any further into the details of the challenges and work done, I would like to thank all my mentors - this journey was possible only due to their support and inputs. Special shout-out to Marc @noviluni: without your constant code reviews and inputs this library would not have been half as good.

Work Done
The README gives a more detailed explanation of what the library is capable of and how to use it. Basically, the goal was to have a way to convert numbers written in natural language to their numeric form, and I am proud to say that we succeeded in doing so to a large extent.
Additionally it supports multiple languages :-

For cardinal numbers -> English , Russian , Hindi , Spanish
For ordinal numbers -> English

Challenges / Work Remaining
The toughest challenge was actually starting the process from scratch with little reference. However, once this phase was done the work was fun and relatively smooth.
One major aspect was setting up good tests to ensure the library works well; once again, thanks to all my mentors for helping with adding tests in different languages.
Apart from that, the library is also planned to be a dependency of date-parser (https://github.com/scrapinghub/dateparser/pull/711), which in turn meant keeping a high level of code quality. It was also important to structure the code and add features (like auto-language detection) so that the incorporation was smooth.

The library is still in the early stages and there is endless scope for improvement, so I request one and all to contribute and make it better. From the list of pending issues
https://github.com/scrapinghub/number-parser/issues, the major ones to revamp the library are :-

https://github.com/scrapinghub/number-parser/issues/40 - Support for more languages -> While the code is modular enough to work with more languages, there are some languages and specific edge cases that need to be worked on. Additionally, ordinal numbers are currently supported for English only, which needs to be expanded.
https://github.com/scrapinghub/number-parser/issues/11 - Supporting decimal and negative numbers -> Currently we are limited to integers.

All in all it has been an amazing summer and I would like to thank everyone who was part of it.

Signing off
Arnav

Weekly Check-In #7 (16th Aug - 23rd Aug)

arnav.kapoor@research.iiit.ac.in (arnav_k) — Tue, 25 Aug 2020 07:46:27 +0000

Hi, so we have almost reached the end of the program. It's time to wrap up all the work and polish it.

What did you do this week ?
I created the PR for date-parser incorporation and nearly all the test cases seem to work, so that's good. While testing I did come across a bug for Hindi in number-parser, caused by the different tokenization method used for Hindi, which I plan to fix this week.

Did you get stuck anywhere ?
Nothing major as such.

What is coming up next ?
Most of the coding part is pretty much wrapped up , just need to finalize the code , documentation etc for the final submission.

Weekly Blog #6 (9th Aug - 16th Aug)

arnav.kapoor@research.iiit.ac.in (arnav_k) — Mon, 17 Aug 2020 10:53:22 +0000

Hi all, we have nearly reached the end. This week mostly involved creating and finalizing the auto-language detection feature.

Earlier we needed to pass a language parameter with the input string to parse it correctly. However, for a smooth incorporation with date-parser we need to find the best language even if it is not passed. This uses a very simplistic count-based approach: all the words of the string are compared against the supported languages and the language with the maximum count is selected. It does require checking against all the languages, which might become a bottleneck in the future and thus will need to be monitored.
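A minimal sketch of such a count-based detector is below. It is not the library's actual code: the NUMBER_WORDS table, the detect_language name and the default parameter are all made up for illustration, and the vocabularies are heavily truncated.

```python
# Hypothetical, heavily truncated vocabularies. A real data file would hold
# every number word of each supported language.
NUMBER_WORDS = {
    "en": {"one", "two", "three", "twenty", "hundred", "thousand", "and"},
    "es": {"uno", "dos", "tres", "veinte", "mil", "y"},
    "hi": {"एक", "दो", "तीन", "सौ"},
    "ru": {"один", "два", "три", "двадцать", "сто"},
}


def detect_language(text, default="en"):
    """Return the language whose vocabulary matches the most words in text."""
    tokens = text.lower().split()
    # Every language is checked for every token, which is the potential
    # bottleneck mentioned above.
    counts = {
        lang: sum(token in words for token in tokens)
        for lang, words in NUMBER_WORDS.items()
    }
    best = max(counts, key=counts.get)
    # Fall back to the default when no word matched any language.
    return best if counts[best] > 0 else default


print(detect_language("tengo veinte y dos gatos"))
print(detect_language("I have twenty two cats"))
```

Ties are resolved here by dictionary order, which is an arbitrary choice of this sketch.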

I didn't get stuck anywhere in particular. Since the primary goal of creating a working number-parser library was completed, my mentor and I had created a mini-plan for the last month. The goal for this week is to create (and hopefully merge) the date-parser PR, and then we can create a similar draft PR for price-parser.

Weekly Check-In #6 (2nd Aug - 9th Aug)

arnav.kapoor@research.iiit.ac.in (arnav_k) — Tue, 11 Aug 2020 07:34:11 +0000

So we have almost reached the end of the program and it was a fun learning experience.

What did you do this week ?
We were looking to restructure some core chunks of the code to allow easier testing and future contributions. Some progress was made in this direction, mainly identifying possible problems. Apart from that, basic support for ordinal numbers (English only) was added, which will make the library more useful when it is integrated with date-parser.

Did you get stuck anywhere ?
The restructuring part was quite tough because with the current logic - it's hard to restructure the code in the necessary logical flow without completely revamping everything. Hence for the time-being we will let it be in the current form.

What is coming up next ?
The plan for the next week is to incorporate number-parser with date-parser. To achieve this we need to have auto-language detection in the number-parser code. Currently you need to supply a mandatory language parameter which we will do away with.

Weekly Blog #5 (26th Jul - 2nd Aug)

arnav.kapoor@research.iiit.ac.in (arnav_k) — Mon, 03 Aug 2020 18:28:13 +0000

Hi all, we have released number-parser v0.1.0.

This was the primary goal of this project, so I am really happy to get it done. This week was mostly spent finalizing the version release, which primarily involved creating a detailed README and minor bug fixes. My mentor and I also discussed the plans for the last month; we have laid down an ambitious plan and hopefully we can achieve it. The two major goals for the last month are as follows :-

Support for Ordinal Numbers - Currently only cardinal numbers are supported; however, with respect to dates, ordinal numbers (first, second, tenth etc.) are very common. The implementation will be structured very similarly: the data files will have the dictionary words for ORDINAL_NUMBERS, and a similar logic should work out.
Integration with date-parser - Once a basic version of cardinal numbers is supported the goal would be to incorporate it with date-parser. This would lead to date-parser having number-parser as a dependency. Since date-parser is a very highly used library already , it's important to ensure that nothing is broken by this merge.

I hope to get these two goals done in the upcoming weeks , I am really thankful to my mentors for the continuous support that has allowed me to grow as a programmer. They have also ensured that the past two months were really smooth and a lot of fun !!

Weekly Check-In #5 (19th Jul - 26th Jul)

arnav.kapoor@research.iiit.ac.in (arnav_k) — Mon, 27 Jul 2020 08:42:14 +0000

So we have almost reached the second stage of evaluation , and the past month did go surprisingly fast. 😅

What did you do this week ?
Fixes were made to handle the long and short scales used by different locales. Additionally, normalization of the input text was needed, specifically to handle accents in numbers. All these changes were approved and will be merged soon.
I also worked a bit on the PR of incorporation of this code with the date-parser library.
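To make the long/short scale difference concrete: in the short scale the n-th "-illion" word is 10**(3n + 3), while in the long scale it is 10**(6n), so "billion" means 10**9 in one and 10**12 in the other. Here is a small sketch of that arithmetic. The ILLION_RANK table and the illion_value function are hypothetical names for illustration, not part of number-parser.

```python
# Rank n of each "-illion" word (illustrative subset only).
ILLION_RANK = {"million": 1, "billion": 2, "trillion": 3}


def illion_value(word, long_scale=False):
    """Value of an '-illion' word under the short or the long scale."""
    n = ILLION_RANK[word]
    if long_scale:
        return 10 ** (6 * n)   # powers of one million
    return 10 ** (3 * n + 3)   # powers of one thousand


for word in ILLION_RANK:
    print(word, illion_value(word), illion_value(word, long_scale=True))
```

Both scales agree on "million" and diverge from "billion" onwards, which is why only large numbers are affected.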

Did you get stuck anywhere ?
Nothing major as such , I was experimenting a bit with different functions to normalize the text , as they needed to work on a large variety of languages.

What is coming up next ?
The plan for the next week is to wrap this up and get a first version out. Thereafter, I will discuss with the mentors to either add additional support to this library ( Ordinal Numbers and Decimals) or move to incorporation.

Weekly Blog #4 (12th Jul - 19th Jul)

arnav.kapoor@research.iiit.ac.in (arnav_k) — Mon, 20 Jul 2020 10:56:30 +0000

Hi all, we were hoping to get the first version out last week; however, some bugs did crop up and it was important to resolve them before moving ahead. Hence we hope to release the first version to PyPI by the end of this week. The two major troublesome cases were as follows :-

Adding large constants - We have supplementary JSON data files that hold language-specific information missing from the CLDR repository data source, which also lets users add such information easily. Large constants such as gúgol or centillón can be written in a supplementary JSON file using terms like 1e600 (which JSON supports). However, reading them with the Python script and merging them into the final Python language file caused an error: the maximum Python float is about 1.79e+308, so anything larger is read as infinity. We still don't have a fix for this, but since it is a minor issue affecting only very large numbers, we can let it be for the time being.
Long and short scales - One thing not considered until now was the difference between the long and short scales. The long scale is based on powers of one million (1,000,000), whereas the short scale is based on powers of one thousand (1,000). Again, this affects the logic only for large numbers, of the order of 10**9 and above. It hasn't been incorporated yet and we plan to fix it. More details about these issues are here: https://github.com/arnavkapoor/number-parser/pull/24
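The first problem is easy to reproduce with the standard json module. The snippet below also shows one possible workaround, parsing the literals as Decimal and converting them to Python integers. This is only a sketch of an option, not something these posts say was adopted, and the JSON content is made up for the demonstration.

```python
import json
import math
from decimal import Decimal

# Made-up supplementary data for the demonstration.
raw = '{"gúgol": 1e100, "centillón": 1e600}'

# Default parsing goes through float, so 1e600 overflows to infinity.
as_float = json.loads(raw)
print(math.isinf(as_float["centillón"]))

# Possible workaround: read number literals as Decimal, then convert them to
# Python integers, which have arbitrary precision.
as_decimal = json.loads(raw, parse_float=Decimal)
as_int = {name: int(value) for name, value in as_decimal.items()}
print(as_int["centillón"] == 10**600)
```

The parse_float hook is part of json.loads, so no extra dependency is needed for this approach.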

Despite these issues the week in general was productive , I have incorporated most of the feedback from the mentor into the PR. Also my mentor found an effective way to automate some part of testing which will help us to be more sure of the code.
The plan for next-week is to ensure these minor issues are fixed and we release the first version.

Weekly Check-In #4 (5th Jul - 12th Jul)

arnav.kapoor@research.iiit.ac.in (arnav_k) — Mon, 13 Jul 2020 08:18:34 +0000

So we are about halfway through the project and the number-parser library is going strong. We are very close to publishing the first version to PyPI.

What did you do this week ?
The parser library was refactored to create a language class and use the language data Python files. Tests were added for the supported languages, which helped to identify a number of small bugs across the board, which were gradually fixed.

Did you get stuck anywhere ?
One of the bugs was how to elegantly handle multiple consecutive multipliers for example 'thousand millions'. In the end I did come up with a reasonable solution that should be working across all languages but more tests would need to be added to ascertain this.

What is coming up next ?
The first version will be released this week. That will require a round of code cleanup, adding documentation, and more tests for all supported languages.

Weekly Blog #3 (29th Jun - 5th Jul)

arnav.kapoor@research.iiit.ac.in (arnav_k) — Mon, 06 Jul 2020 05:03:41 +0000

Hey everyone, we are done with the first third of the program, and I will use this blog to both give the weekly update and summarize the current state of progress. In the past 4 weeks, we have created a new number-parser library from scratch and built an MVP that is being continuously improved.

Last week was spent fine-tuning the parser to retrieve the relevant data from the CLDR RBNF repo. RBNF stands for 'rule-based number format': a set of rules, used by libraries such as ICU, for converting a number (23) to the corresponding words (twenty-three). It has a lot of hard-coded values and data that are very useful to our library, so we plan to extract all this information accurately and efficiently.

In addition, each language has nuances that need to be taken care of, such as accents. For example, the French word for 0 is written zéro (with an accent aigu over the e). However, we don't expect users to enter these accents every time, so we need to normalise (i.e. remove) them.
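One standard way to remove accents in Python is to decompose the text with unicodedata and drop the combining marks. This is a sketch of that general technique; strip_accents is a hypothetical helper name, and the posts do not say which normalization function the library finally settled on.

```python
import unicodedata


def strip_accents(text):
    """Remove combining marks (accents) after canonical decomposition."""
    # NFD splits 'é' into 'e' followed by a combining acute accent.
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() is non-zero for combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("zéro"))
print(strip_accents("dieciséis"))
```

A blanket approach like this has side effects in other scripts. For instance, it turns the Russian letter й into и, so it needs checking against every supported language.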

The most challenging aspect was definitely understanding the CLDR RBNF structure (which I am still not completely clear about). There is only a little documentation explaining some of the basic rules, and it's tough to identify which rules are relevant and which aren't.

Originally I was hoping to add more tests as well in this week however all this took longer than expected so the testing aspect is going to be pushed to the current week.

Weekly Check-In #3 (22nd Jun - 29th Jun)

arnav.kapoor@research.iiit.ac.in (arnav_k) — Sun, 28 Jun 2020 19:10:16 +0000

Hi, so we are done with week 4 of the program - a third of the way in and it has been a fun ride and the progress is on par with the expected timeline.

What did you do this week ?
A major overhaul of the code was done to incorporate the multiple-locale features: https://github.com/arnavkapoor/number-parser/pull/12. Now the base structure is in place to incorporate multiple language support.
Currently we have the data for 4 supported languages (English, Russian, Spanish, Hindi). This involved parsing raw data for each of these languages.

The main approach rests on creating 6 sets of dictionaries for each of the languages:-

UNIT_NUMBERS -> Numbers from 1 to 9.
BASE_NUMBERS -> These contain uniquely defined numbers (i.e. words that don't use any prefix). The widest possible range is [10,99], and the actual range changes from language to language.
- English -> This range is [10,19] (ten, eleven, twelve ... nineteen)
- Hindi -> This range is [10,99]. Unique words exist all the way up to 100.
- Spanish -> This range is [10,29]
MTENS -> These are multiples of ten from 20 to 90 that are used along with unit numbers to form the complete number. This might be empty for certain languages like Hindi. For English this list is twenty, thirty, forty ... ninety
MHUNDREDS -> These are multiples of a hundred from 200 to 900. This is a new set, added because it wasn't needed for English or Hindi. However, it is widely used in Russian and Spanish and probably other languages too.
- This includes words like doscientos (200), quinientos (500), пятьсот (500), двести (200)
  One alternative approach was to parse substrings instead, as in doscientos: 'dos' as two and 'cientos' as hundred (100). However, the lack of delimiters would mean a major upheaval in the logic. Also, words like quinientos don't contain the root word (5 is cinco). Similarly, the suffix in Russian differs based on the number, e.g. сти for 200, сот for 500.
  Thus I decided to create this dictionary as opposed to parsing it.
MULTIPLIERS -> These are simply powers of 10, e.g. for English -> hundred, thousand ... and so on.
VALID_TOKENS -> Certain words are ignored when they appear between the numbers: 'and' for English, 'y' for Spanish, and so on.
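As a rough picture of what these six sets might look like, here is a truncated sketch for English and Spanish. The layout, the handful of entries and the classify helper are all assumptions for illustration. The real data files are generated from CLDR RBNF and are far more complete.

```python
# Hypothetical, truncated language data using the six set names above.
ENGLISH = {
    "UNIT_NUMBERS": {"one": 1, "two": 2, "nine": 9},
    "BASE_NUMBERS": {"ten": 10, "twelve": 12, "nineteen": 19},
    "MTENS": {"twenty": 20, "thirty": 30, "ninety": 90},
    "MHUNDREDS": {},  # not needed for English
    "MULTIPLIERS": {"hundred": 100, "thousand": 1000},
    "VALID_TOKENS": {"and"},
}
SPANISH = {
    "UNIT_NUMBERS": {"dos": 2, "cinco": 5},
    "BASE_NUMBERS": {"diez": 10, "veinte": 20},
    "MTENS": {"treinta": 30, "noventa": 90},
    "MHUNDREDS": {"doscientos": 200, "quinientos": 500},
    "MULTIPLIERS": {"mil": 1000},
    "VALID_TOKENS": {"y"},
}
NUMERIC_SETS = ("UNIT_NUMBERS", "BASE_NUMBERS", "MTENS", "MHUNDREDS", "MULTIPLIERS")


def classify(token, data):
    """Return (set name, value) for a known token, or None for other words."""
    for name in NUMERIC_SETS:
        if token in data[name]:
            return name, data[name][token]
    if token in data["VALID_TOKENS"]:
        return "VALID_TOKENS", None
    return None


print(classify("twelve", ENGLISH))
print(classify("quinientos", SPANISH))
```

Looking up quinientos as a whole word sidesteps the substring problem described above, since nothing has to recognise 'quini' as five.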

Did you get stuck anywhere ?
Getting the language data and parsing its content. Some of the languages have multiple forms of the same number, so getting this raw data took the most time. In the end I stuck with CLDR RBNF data for populating the above dictionaries.

What is coming up next ?
The mentors were really helpful with a detailed review of the PR. The most important thing to do is to add more tests: the project is highly test-driven, so creating a robust set of tests for each language is essential.

Weekly Blog #2 ( 15th June - 22nd June)

arnav.kapoor@research.iiit.ac.in (arnav_k) — Sun, 21 Jun 2020 19:23:19 +0000

Hi all, we have had 3 weeks of coding till now and overall I am pleased with the progress of the project. It has been going smoothly without too many major issues.
The three major milestones achieved this week were :-
1. Created a draft PR for incorporation with date-parser. The number-parser library is constantly improving, but one of the major goals was also to incorporate it with other Scrapinghub libraries (primarily date-parser and price-parser). The incorporation needed to be seamless, without too much modification to the date-parser base code. Once the number-parser library improves we can add it as a dependency.
2. Issue creation and resolution on the number-parser library. Since we are done with the MVP, we can now move to a more organized structure where we discuss bugs/features and I create PRs to target specific issues.
3. Implemented a parse_number feature that parses single numbers written in natural language, e.g. 'fifty seven' -> 57, 'cats' -> None

The plan for next week is to tackle the multi-language feature (starting with Spanish, Hindi and Russian), and hopefully by the end of the week we will have the pipeline to incorporate multiple languages in place.

Weekly Check-In #2 ( 7th Jun - 14th Jun )

arnav.kapoor@research.iiit.ac.in (arnav_k) — Sun, 14 Jun 2020 19:10:57 +0000

Hey back with the second check in blog covering week 2.

What did you do this week ?
Fixes and features - still tweaking the number-parser library. It now handles multiple numbers (even when they are not separated by a delimiter) and returns a set of words as opposed to a single word. I also experimented with the date-parser library and looked into the integration.

Did you get stuck anywhere ?
Nothing major as such . I was modifying, updating the overall structure of the library , which needed some research into best python practices.

What is coming up next ?
The next week involves completing the integration with date-parser and price-parser. Additionally hoping to handle date-specific years.

Weekly Post #1 ( 1st June - 7th June)

arnav.kapoor@research.iiit.ac.in (arnav_k) — Sun, 07 Jun 2020 18:40:27 +0000

Weekly Update

Hey everyone, number-parser is up and running. Do check it out and raise issues, feature requests etc.

It was a really fun and productive first week and I have got the basic version done and will keep on refining it in the upcoming weeks.
So the procedure for the parser is as follows :-

Identify all the words which are numbers / part of a number in natural language (hundred, twelve, seven, million)
This list of tokens is passed to a number builder.
The number is built by putting appropriate signs between the tokens (the current value is multiplied on encountering a multiplier like hundred, thousand, million etc.)
- [nine, hundred, and, seven, thousand] - the parser would treat it as (9 * 100 + 7) * 1000 = 907000

The parser takes a string as input and only changes the words which are numbers. Thus non number words are ignored.
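The procedure above can be sketched roughly as follows. This is one common way of writing such a builder (a running group value plus a total for groups closed by thousand or million), not necessarily how number-parser implements it. The word tables are truncated, and build_number and parse are illustrative names.

```python
# Truncated, illustrative vocabulary.
NUMBER_WORDS = {
    "one": 1, "two": 2, "three": 3, "seven": 7, "nine": 9,
    "twelve": 12, "twenty": 20, "fifty": 50,
}
MULTIPLIERS = {"hundred": 100, "thousand": 1000, "million": 1_000_000}
IGNORED = {"and"}


def build_number(tokens):
    """Combine number-word tokens into an integer."""
    total = 0    # groups already closed by thousand, million, ...
    current = 0  # group currently being built
    for token in tokens:
        if token in IGNORED:
            continue
        if token in NUMBER_WORDS:
            current += NUMBER_WORDS[token]
        elif token == "hundred":
            current = max(current, 1) * 100
        else:  # thousand, million: multiply the group and close it
            total += max(current, 1) * MULTIPLIERS[token]
            current = 0
    return total + current


def parse(text):
    """Replace each run of number words in text, leaving other words alone."""
    output, run = [], []

    def flush():
        # Put back a trailing 'and' that did not join two number words.
        trailing = []
        while run and run[-1] in IGNORED:
            trailing.insert(0, run.pop())
        if run:
            output.append(str(build_number(run)))
        output.extend(trailing)
        run.clear()

    for word in text.split():
        token = word.lower()
        if token in NUMBER_WORDS or token in MULTIPLIERS or (token in IGNORED and run):
            run.append(token)
        else:
            flush()
            output.append(word)
    flush()
    return " ".join(output)


print(build_number(["nine", "hundred", "and", "seven", "thousand"]))
print(parse("I have twenty three apples and two cats"))
```

Keeping a separate total means a phrase such as "one thousand two hundred" gives 1200, which a single running value multiplied at every multiplier would get wrong.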

Most of the learning was in the setup needed to create the library. This involved configuring the setup.py and setting up a robust testing environment (tox). The mentors were really helpful and reviewed the code mid-week and gave important insights. Overall it was a smooth first week of coding with no major issues.

Next Week

The plan for next week is to do a more robust testing of the parser (adding more test-cases ) and then move to integration of the current work of number-parser with the date-parser library.

Weekly Check-In #1 - Community Bonding ( 4th May - 31st May )

arnav.kapoor@research.iiit.ac.in (arnav_k) — Sat, 30 May 2020 20:16:23 +0000

Hi, I am Arnav Kapoor, a 3rd year undergraduate student from IIIT-Hyderabad, and I will be working with the Scrapinghub sub-org this summer. The project goal is to create a number-parser library to parse numbers in natural language and incorporate the same with existing libraries.

What did you do this week ?
The community bonding phase mostly involved researching more into the existing solutions, understanding the pros and cons of each. I also got to know the mentors and we have set up weekly meetings for the duration of the program.

Did you get stuck anywhere ?
No there weren't any hurdles as such.

What is coming up next ?
The next week involves creating a basic English-only version, which will gradually be built upon. It's time to begin coding and face the challenges as and when they come.