Weekly Check-In #3 (22nd Jun - 29th Jun)

arnav_k
Published: 06/28/2020

Hi, so we are done with week 4 of the program - a third of the way in and it has been a fun ride and the progress is on par with the expected timeline.

What did you do this week ?
A major upheaval of the code was done to incorporate the multiple-locale features. https://github.com/arnavkapoor/number-parser/pull/12. Now the base structure is in place to incorporate multiple language support ,
currently we have the data for 4 supported languages (English, Russian, Spanish, Hindi). This involved parsing raw data for each of these languages . -

The main approach rests on creating 6 sets of dictionaries for each of the languages:-

UNIT_NUMBERS -> Numbers from 1 to 9.
BASE_NUMBERS -> These contain uniquely defined numbers (i.e don't use any prefix). The maximum range is from [10,99]. For different languages, this range changes.
- English -> This range is from [10,19] (ten,eleven , twelve ... , nineteen)
- Hindi -> This range is from [10,99] Unique words exists all the way upto 100.
- Spanish -> This range is from [10,29]
MTENS -> These are multiples of tens from 20 to 90 that are used along with unit numbers to form the complete number, This might be empty for certain languages like Hindi. For English this list is twenty,thirty, forty ... ninety
MHUNDREDS -> These are multiples of hundreds from 200 to 900. This is a new set added as it wasn't needed for English or Hindi. However it is widely used for Russian and Spannish and probably other languages too,
- This includes words like doscientos (200), quinientos (500), пятьсот (500) , двести (200)
  Now one alternate approach was to parse substrings instead as in doscientos - 'dos' as two and cientos as hundred '100'. However the lack of delimiters would mean major upheaval in the logic. Also, words like quinientos don't have any root word (5 is cinco). Similarly the suffix in russian is different based on numbers. eg) сти for 200 , сот for 500.
  Thus decided to create this dictionary as opposed to parsing it.
MULTIPLIERS -> These are simply powers of 10.eg for English -> Hundred , Thousand ....... and so on.
VALID_TOKENS -> Presence of certain words are ignored between the numbers. 'and' for English, 'y' for Spanish, and so on.

Did you get stuck anywhere ?
Getting the lanuage-data and parsing it's content , some of the languages have multiple forms of the same number. Thus getting this raw-data took the most-time , in the end I stuck with CLDR-RBNF data for populating the above dictionaries.

What is coming up next ?
The mentors were really helpful with a detailed review of the PR, The most important thing to do is to add more tests , the project is highly test-driven and thus creating a robust set of tests for each language is essential.

Weekly Check-In #3 (22nd Jun - 29th Jun)

Versions

Time

Settings from gsoc.settings

Headers

Request

SQL queries from 1 connection

Static files (2312 found, 3 used)

Templates (11 rendered)

Cache calls from 1 backend

Signals

Log messages