arnav_k's Blog

Weekly Check-In #5 (19th Jul - 26th Jul)

arnav_k
Published: 07/27/2020

So we have almost reached the second evaluation stage, and the past month went by surprisingly fast. 😅
 

What did you do this week?
Fixes were made to handle the long and short scales used by different locales. Additionally, the input text needed to be normalized, specifically to handle accents in number words. All these changes were approved and will be merged soon.
I also worked a bit on the PR for incorporating this code into the date-parser library.
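
For context on the scales themselves: the same word can denote different values under the two systems, so the parser has to pick the right mapping per locale. Below is a minimal sketch of the idea (the dictionaries and function here are illustrative, not the library's actual API):

```python
# Illustrative only: how long vs. short scale changes multiplier values.
# The short scale advances by powers of 1,000, the long scale by powers
# of 1,000,000, so "billion" means 10**9 or 10**12 depending on locale.
SHORT_SCALE = {"million": 10**6, "billion": 10**9, "trillion": 10**12}
LONG_SCALE = {"million": 10**6, "billion": 10**12, "trillion": 10**18}

def multiplier_value(word, use_long_scale=False):
    """Resolve a scale word against the locale's numbering system."""
    scale = LONG_SCALE if use_long_scale else SHORT_SCALE
    return scale[word]

print(multiplier_value("billion"))                       # 1000000000
print(multiplier_value("billion", use_long_scale=True))  # 1000000000000
```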
 

Did you get stuck anywhere?
Nothing major as such. I was experimenting a bit with different functions to normalize the text, as they needed to work across a wide variety of languages.
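
As an illustration of the kind of approach I was experimenting with (a generic standard-library sketch, not necessarily the exact function that ended up in the PR):

```python
import unicodedata

def strip_accents(text):
    """Remove combining accent marks, e.g. 'zéro' -> 'zero'.

    NFKD decomposition splits an accented character into its base
    character plus combining marks, which can then be dropped.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(strip_accents("zéro"))       # zero
print(strip_accents("veintidós"))  # veintidos
```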
 

What is coming up next?
The plan for the next week is to wrap this up and get a first version out. Thereafter, I will discuss with the mentors whether to add further support to this library (ordinal numbers and decimals) or move on to the incorporation work.


Weekly Blog #4 (12th Jul - 19th Jul)

arnav_k
Published: 07/20/2020

Hi all, we were hoping to get the first version out last week; however, some bugs crept up and it was important to resolve them before moving ahead. We now hope to release the first version to PyPI by this week's end. The two major troublesome cases were as follows:
 
  1. Adding large constants - We have supplementary JSON data files that are used to add language-specific information, i.e. information missing from the CLDR repository data source. This also allows users to easily add language-specific data. We can add large constants like Gúgol and centillón to the supplementary JSON file using terms like 1e600 (which JSON supports). However, while reading them with a Python script and merging them into the final Python language file, an error occurred: the range of a Python float tops out at about 1.79e+308, so larger numbers get translated to infinity. We still don't have a fix for this, but since it is a minor issue impacting only very large numbers, we can let it be for the time being. (A short demonstration of the problem follows the list.)
  2. Long and short scales - One of the things not considered until now was long and short scales. The long scale is based on powers of one million (1,000,000), whereas the short scale is based on powers of one thousand (1,000). Again, this impacts the logic only for numbers greater than 10**9. It hasn't been incorporated yet, and we plan to fix this. More details about the issue here: https://github.com/arnavkapoor/number-parser/pull/24
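
To make the first issue concrete, here is a minimal reproduction of the float-range problem (the key name is illustrative; the behaviour itself is standard Python):

```python
import json

# JSON happily accepts exponents beyond float range in the source text...
data = json.loads('{"centillon": 1e600}')

# ...but Python floats cap out near 1.79e+308, so the value parses as inf.
print(data["centillon"])                  # inf
print(float("1e600"))                     # inf
print(data["centillon"] == float("inf"))  # True
```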

Despite these issues, the week was productive in general. I have incorporated most of the mentor's feedback into the PR. My mentor also found an effective way to automate part of the testing, which will help us be more confident in the code.
The plan for next week is to ensure these minor issues are fixed and to release the first version.

Weekly Check-In #4 (5th Jul - 12th Jul)

arnav_k
Published: 07/13/2020

So we are about halfway through the project and the number-parser library is going strong. We are very close to publishing version 1.0 to PyPI.

What did you do this week?
The parser library was refactored to create a language class and to use the language data Python files. Tests were added for the supported languages, which helped identify a number of small bugs across the board; these were gradually fixed.
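
A rough sketch of the shape such a refactor can take (the class and the sample data below are hypothetical, not the library's actual internals):

```python
class Language:
    """Bundle per-language parsing data behind one object."""

    def __init__(self, unit_numbers, mtens, multipliers):
        # Each table maps a word to its numeric value, e.g. {"two": 2},
        # and would normally come from a generated language data file.
        self.unit_numbers = unit_numbers
        self.mtens = mtens
        self.multipliers = multipliers

    def value_of(self, token):
        """Look a word up across the data tables; None if unknown."""
        for table in (self.unit_numbers, self.mtens, self.multipliers):
            if token in table:
                return table[token]
        return None

english = Language({"one": 1, "two": 2}, {"twenty": 20}, {"hundred": 100})
print(english.value_of("twenty"))  # 20
```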

Did you get stuck anywhere?
One of the bugs was how to elegantly handle multiple consecutive multipliers, for example 'thousand millions'. In the end I came up with a reasonable solution that should work across all languages, but more tests need to be added to ascertain this.
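
The core observation for a case like 'thousand millions' is that consecutive multipliers compose by multiplication (1,000 x 1,000,000 = 10**9). A minimal sketch of that logic (simplified; the real implementation has to handle much more structure):

```python
MULTIPLIERS = {"hundred": 100, "thousand": 1_000, "million": 1_000_000}

def combine_multipliers(tokens):
    """Fold a run of consecutive multiplier words into a single value."""
    value = 1
    for token in tokens:
        value *= MULTIPLIERS[token]
    return value

print(combine_multipliers(["thousand", "million"]))  # 1000000000, i.e. 10**9
```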

What is coming up next?
This week the first version will be released. Getting there will require a round of code cleanup, adding documentation, and more tests for all supported languages.


Weekly Blog #3 (29th Jun - 5th Jul)

arnav_k
Published: 07/06/2020

Hey everyone, we are done with the first third of the program, so I will use this blog both to give the weekly update and to summarize the current state of progress. In the past 4 weeks, we have created a new number-parser library from scratch and built an MVP that is being continuously improved.

Last week was spent fine-tuning the parser to retrieve the relevant data from the CLDR RBNF repo. This rule-based number format (RBNF) repo is essentially the data behind a Java library that converts a number (23) to the corresponding words (twenty-three). It has a lot of hard-coded values and data that are very useful to our library, and thus we plan to extract all this information accurately and efficiently.
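
For a flavour of what this data looks like, the spellout rule sets are roughly of this shape (a simplified illustration of the rule syntax, not an exact excerpt from the repo):

```
0: zero; 1: one; 2: two; ... 19: nineteen;
20: twenty[->>];
30: thirty[->>];
100: << hundred[ >>];
1000: << thousand[ >>];
```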

In addition to this, there are multiple nuances in each of the languages that needed to be taken care of, such as accents. For example, the French '0' is written as zéro, with an accent aigu over the e. However, we don't expect users to enter these accents each time, hence we need to normalize (i.e. remove) them.

The most challenging aspect was definitely understanding the CLDR RBNF structure (which I am still not completely clear on). There is only a little documentation explaining some of the basic rules, and it's tough to identify which rules are relevant and which aren't.

Originally I was hoping to add more tests this week as well; however, all this took longer than expected, so the testing aspect is being pushed to the current week.


Weekly Check-In #3 (22nd Jun - 29th Jun)

arnav_k
Published: 06/28/2020

Hi, so we are done with week 4 of the program - a third of the way in. It has been a fun ride, and progress is on par with the expected timeline.

What did you do this week?
A major overhaul of the code was done to incorporate the multiple-locale features: https://github.com/arnavkapoor/number-parser/pull/12. The base structure is now in place for multiple-language support; currently we have the data for 4 supported languages (English, Russian, Spanish, Hindi). This involved parsing raw data for each of these languages.

The main approach rests on creating 6 dictionaries for each of the languages (a sketch of what these look like for English follows the list):

  • UNIT_NUMBERS -> Numbers from 1 to 9.

  • BASE_NUMBERS -> These contain uniquely defined numbers (i.e. ones that don't use any prefix). The maximum range is [10,99]; the exact range changes between languages.

    • English -> This range is from [10,19] (ten, eleven, twelve, ..., nineteen).
    • Hindi -> This range is from [10,99]; unique words exist all the way up to 100.
    • Spanish -> This range is from [10,29].
  • MTENS -> These are multiples of ten from 20 to 90 that are used along with unit numbers to form the complete number. This might be empty for certain languages like Hindi. For English this list is twenty, thirty, forty, ..., ninety.

  • MHUNDREDS -> These are multiples of one hundred, from 200 to 900. This is a new set, added because it wasn't needed for English or Hindi; however, it is widely used in Russian and Spanish, and probably other languages too.

    • This includes words like doscientos (200), quinientos (500), пятьсот (500), and двести (200).
      One alternate approach was to parse substrings instead, e.g. splitting doscientos into 'dos' as two and 'cientos' as hundred (100). However, the lack of delimiters would have meant a major upheaval in the logic. Also, words like quinientos don't have any root word (5 is cinco), and the Russian suffix differs based on the number, e.g. сти for 200 but сот for 500.
      Thus I decided to create this dictionary as opposed to parsing substrings.
  • MULTIPLIERS -> These are simply powers of 10, e.g. for English: hundred, thousand, and so on.

  • VALID_TOKENS -> Certain words are ignored when they appear between numbers: 'and' for English, 'y' for Spanish, and so on.
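
To make the structure concrete, here is roughly what these dictionaries look like for English (abridged sample values; the real generated data files are more complete):

```python
# Abridged English data, in the shape described above.
UNIT_NUMBERS = {"one": 1, "two": 2, "three": 3, "nine": 9}
BASE_NUMBERS = {"ten": 10, "eleven": 11, "twelve": 12, "nineteen": 19}
MTENS = {"twenty": 20, "thirty": 30, "forty": 40, "ninety": 90}
MHUNDREDS = {}  # not needed for English, unlike Russian or Spanish
MULTIPLIERS = {"hundred": 100, "thousand": 1_000, "million": 1_000_000}
VALID_TOKENS = {"and"}

# e.g. "four hundred and twenty" tokenizes into UNIT_NUMBERS["four"],
# MULTIPLIERS["hundred"], a VALID_TOKENS word to skip, and MTENS["twenty"].
```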

 
Did you get stuck anywhere?
Getting the language data and parsing its content was tricky, as some of the languages have multiple forms of the same number. Getting this raw data took the most time; in the end I stuck with the CLDR RBNF data for populating the above dictionaries.


What is coming up next?
The mentors were really helpful with a detailed review of the PR. The most important thing to do now is to add more tests; the project is highly test-driven, and thus creating a robust set of tests for each language is essential.
