arnav_k's Blog

Weekly Blog #3 (29th Jun - 5th Jul)

Published: 07/06/2020

Hey everyone, we are done with the first third of the program, and I will use this blog to give the weekly update as well as summarize the current state of progress. In the past 4 weeks, we have created a new number-parser library from scratch and built an MVP that is being continuously improved.

Last week was spent fine-tuning the parser to retrieve the relevant data from the CLDR RBNF repo. This 'Rule-Based Number Format' (RBNF) repo is basically a Java library that converts a number (23) to the corresponding words (twenty-three). It has a lot of hard-coded values and data that are very useful to our library, so we plan to extract all this information accurately and efficiently.

In addition to this, each language has multiple nuances that had to be taken care of, such as accents. For example, the French '0' is written as zéro (with an accent aigu over the e). However, we don't expect users to enter these accents each time, hence we need to normalise (i.e. remove) them.
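A minimal sketch of what such accent normalisation could look like in Python, using only the standard library's unicodedata module. The function name strip_accents is made up for illustration, and this is not necessarily how number-parser does it.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Return text with combining accent marks removed (illustrative helper)."""
    # NFD decomposition splits 'é' into a plain 'e' followed by a
    # combining acute accent (U+0301).
    decomposed = unicodedata.normalize("NFD", text)
    # Combining marks have the Unicode category 'Mn'; drop them, keep the rest.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


print(strip_accents("zéro"))  # zero
```

One caveat: stripping every combining mark is too aggressive for some scripts (Devanagari vowel signs, for instance, are also combining marks), so a step like this would probably have to be enabled per language.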

The most challenging aspect was definitely understanding the CLDR RBNF structure (which I am still not completely clear on). There is only a little documentation explaining some of the basic rules, and it's tough to identify which rules are relevant and which aren't.

Originally I was hoping to add more tests this week as well, but all this took longer than expected, so the testing is going to be pushed to the current week.


Weekly Check-In #3 (22nd Jun - 29th Jun)

Published: 06/28/2020

Hi, so we are done with week 4 of the program, a third of the way in. It has been a fun ride and the progress is on par with the expected timeline.

What did you do this week?
A major overhaul of the code was done to incorporate the multiple-locale features. The base structure is now in place to support multiple languages;
currently we have the data for 4 supported languages (English, Russian, Spanish, Hindi). This involved parsing raw data for each of these languages.

The main approach rests on creating 6 sets of dictionaries for each of the languages:

  • UNIT_NUMBERS -> Numbers from 1 to 9.

  • BASE_NUMBERS -> These contain uniquely defined numbers (i.e. ones that don't use any prefix). The widest possible range is [10,99], but the actual range differs between languages.

    • English -> This range is [10,19] (ten, eleven, twelve, ..., nineteen).
    • Hindi -> This range is [10,99]. Unique words exist all the way up to 100.
    • Spanish -> This range is [10,29].
  • MTENS -> These are multiples of ten from 20 to 90 that are used along with unit numbers to form the complete number. This might be empty for certain languages like Hindi. For English this list is twenty, thirty, forty, ..., ninety.

  • MHUNDREDS -> These are multiples of a hundred from 200 to 900. This is a new set, added because it wasn't needed for English or Hindi. However, it is widely used in Russian and Spanish, and probably in other languages too.

    • This includes words like doscientos (200), quinientos (500), пятьсот (500), двести (200).
      One alternative approach was to parse substrings instead, as in doscientos: 'dos' as two and 'cientos' as hundred (100). However, the lack of delimiters would mean a major upheaval in the logic. Also, words like quinientos don't contain the root word (5 is cinco). Similarly, the suffix in Russian differs based on the number, e.g. сти for 200 and сот for 500.
      Thus I decided to create this dictionary as opposed to parsing it.
  • MULTIPLIERS -> These are simply powers of 10; for English -> hundred, thousand, and so on.

  • VALID_TOKENS -> Certain words are ignored when they appear between the numbers: 'and' for English, 'y' for Spanish, and so on.
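To make the six dictionaries more concrete, here is a rough sketch of how they could be laid out per language. The dictionary names come from the list above; the LANGUAGE_DATA container, the language keys, the lookup helper and the (partial, hand-written) word lists are assumptions for illustration and not the library's actual data files.

```python
# Illustrative, partial data only. The real library populates these
# dictionaries from CLDR RBNF data rather than hand-written samples.
LANGUAGE_DATA = {
    "en": {
        "UNIT_NUMBERS": {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
                         "six": 6, "seven": 7, "eight": 8, "nine": 9},
        "BASE_NUMBERS": {"ten": 10, "eleven": 11, "twelve": 12, "thirteen": 13,
                         "fourteen": 14, "fifteen": 15, "sixteen": 16,
                         "seventeen": 17, "eighteen": 18, "nineteen": 19},
        "MTENS": {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
                  "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90},
        "MHUNDREDS": {},  # not needed for English
        "MULTIPLIERS": {"hundred": 100, "thousand": 1000, "million": 1000000},
        "VALID_TOKENS": ["and"],
    },
    "es": {
        "UNIT_NUMBERS": {"uno": 1, "dos": 2, "tres": 3, "cinco": 5},
        "BASE_NUMBERS": {"diez": 10, "once": 11, "doce": 12, "veinte": 20},
        "MTENS": {"treinta": 30, "cuarenta": 40, "cincuenta": 50},
        # Looked up as whole words instead of being split into 'dos' + 'cientos'.
        "MHUNDREDS": {"doscientos": 200, "quinientos": 500},
        "MULTIPLIERS": {"cien": 100, "mil": 1000},
        "VALID_TOKENS": ["y"],
    },
}

# The five dictionaries that map a word to a numeric value.
VALUE_DICTIONARIES = ("UNIT_NUMBERS", "BASE_NUMBERS", "MTENS",
                      "MHUNDREDS", "MULTIPLIERS")


def lookup(word, lang):
    """Return the numeric value of a single word, or None if it is unknown."""
    data = LANGUAGE_DATA[lang]
    for name in VALUE_DICTIONARIES:
        if word in data[name]:
            return data[name][word]
    return None


print(lookup("quinientos", "es"))  # 500
print(lookup("twelve", "en"))      # 12
print(lookup("cats", "en"))        # None
```

Keeping quinientos as its own entry avoids any substring logic: the word is simply found in MHUNDREDS, even though it doesn't contain cinco.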

Did you get stuck anywhere?
Getting the language data and parsing its content. Some of the languages have multiple forms of the same number, so getting this raw data took the most time. In the end I stuck with the CLDR RBNF data for populating the above dictionaries.

What is coming up next?
The mentors were really helpful with a detailed review of the PR. The most important thing to do is to add more tests; the project is highly test-driven, so creating a robust set of tests for each language is essential.


Weekly Blog #2 (15th June - 22nd June)

Published: 06/21/2020

Hi all, so we have had 3 weeks of coding till now and overall I am pleased with the progress of the project. It has been going smoothly without too many major issues.
The three major milestones achieved this week were:
1. Created a draft PR for incorporation with date-parser. The number-parser library is constantly improving, but one of the major goals was also to incorporate it with other Scrapy libraries (primarily date-parser and price-parser). The incorporation needed to be seamless, without too much modification to the date-parser base code. Once the number-parser library improves, we can add it as a dependency.
2. Issue creation and resolution on the number-parser library. Since we are done with the MVP, we can now move to a more organized structure where we discuss bugs/features and I create PRs to target specific issues.
3. Implemented a parse_number feature that parses single numbers written in natural language, e.g. 'fifty seven' -> 57, 'cats' -> None.
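To illustrate the behaviour from point 3, here is a small stand-alone sketch. Only the name parse_number and the two examples come from this post; the signature, the English-only word tables and the logic below are assumptions and should not be read as the library's implementation.

```python
from typing import Optional

# Hand-written English sample data (assumed for this sketch).
NUMBER_WORDS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6,
    "seven": 7, "eight": 8, "nine": 9, "ten": 10, "eleven": 11,
    "twelve": 12, "thirteen": 13, "fourteen": 14, "fifteen": 15,
    "sixteen": 16, "seventeen": 17, "eighteen": 18, "nineteen": 19,
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50, "sixty": 60,
    "seventy": 70, "eighty": 80, "ninety": 90,
}
MULTIPLIERS = {"hundred": 100, "thousand": 1000, "million": 1000000}


def parse_number(text: str) -> Optional[int]:
    """Parse a single number written in words; return None if it isn't one."""
    tokens = text.lower().replace("-", " ").split()
    total = 0      # value of the groups already closed by thousand/million
    current = 0    # value of the group being built
    seen_number = False
    for token in tokens:
        if token == "and":
            continue  # filler word between number words
        if token in NUMBER_WORDS:
            current += NUMBER_WORDS[token]
        elif token == "hundred":
            current = max(current, 1) * 100
        elif token in MULTIPLIERS:
            total += max(current, 1) * MULTIPLIERS[token]
            current = 0
        else:
            return None  # any non-number word makes the input invalid
        seen_number = True
    return total + current if seen_number else None


print(parse_number("fifty seven"))  # 57
print(parse_number("cats"))         # None
```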

The plan for next week is to tackle the multi-language feature (starting with Spanish, Hindi, Russian), and hopefully by the end of the week we will have the pipeline to incorporate multiple languages in place.


Weekly Check-In #2 (7th Jun - 14th Jun)

Published: 06/14/2020

Hey, back with the second check-in blog covering week 2.

What did you do this week?
Fixes and features. I am still tweaking the number-parser library; it now handles multiple numbers that are not separated by a delimiter, returning a set of words as opposed to a single word. I also experimented with the date-parser library and looked into the integration.

Did you get stuck anywhere?
Nothing major as such. I was modifying and updating the overall structure of the library, which needed some research into Python best practices.

What is coming up next?
Next week involves completing the integration with date-parser and price-parser. Additionally, I am hoping to handle date-specific years.


Weekly Post #1 (1st June - 7th June)

Published: 06/07/2020

Weekly Update

Hey everyone, number-parser is up and running (number parser). Do check it out and raise issues, feature requests, etc.

It was a really fun and productive first week. I have got the basic version done and will keep refining it in the upcoming weeks.
The procedure for the parser is as follows:

  • Identify all the words which are numbers or parts of a number in natural language (hundred, twelve, seven, million).
  • This list of tokens is passed to a number builder.
  • The number is built by putting the appropriate operators between the tokens (the current value is multiplied on encountering a multiplier like hundred, thousand, million, etc.).
    • [nine, hundred, and, seven, thousand] - the parser would treat it as (9 * 100 + 7) * 1000 = 907000

The parser takes a string as input and only changes the words which are numbers. Thus non-number words are left untouched.
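The procedure above can be sketched roughly as follows. This is a simplified, English-only illustration: the names build_number and parse and the word tables are made up for the example, so it shows the idea rather than the library's actual code.

```python
# Hand-written English sample data (assumed for this sketch).
NUMBERS = {
    "one": 1, "two": 2, "three": 3, "four": 4, "five": 5, "six": 6,
    "seven": 7, "eight": 8, "nine": 9, "ten": 10, "eleven": 11,
    "twelve": 12, "thirteen": 13, "fourteen": 14, "fifteen": 15,
    "sixteen": 16, "seventeen": 17, "eighteen": 18, "nineteen": 19,
    "twenty": 20, "thirty": 30, "forty": 40, "fifty": 50, "sixty": 60,
    "seventy": 70, "eighty": 80, "ninety": 90,
}
MULTIPLIERS = {"hundred": 100, "thousand": 1000, "million": 1000000}
IGNORED = {"and"}


def is_number_word(word):
    return word in NUMBERS or word in MULTIPLIERS


def build_number(tokens):
    """Add plain number words; multiply the running value on a multiplier."""
    value = 0
    for token in tokens:
        if token in MULTIPLIERS:
            value = max(value, 1) * MULTIPLIERS[token]
        elif token in NUMBERS:
            value += NUMBERS[token]
        # any other token (such as 'and') is skipped
    return value


def parse(text):
    """Replace runs of number words with digits; keep every other word."""
    words = text.split()
    output, run = [], []
    for i, word in enumerate(words):
        lowered = word.lower()
        next_word = words[i + 1].lower() if i + 1 < len(words) else ""
        if is_number_word(lowered):
            run.append(lowered)
        elif lowered in IGNORED and run and is_number_word(next_word):
            continue  # 'and' sitting between two number words
        else:
            if run:  # a non-number word closes the current run
                output.append(str(build_number(run)))
                run = []
            output.append(word)
    if run:
        output.append(str(build_number(run)))
    return " ".join(output)


print(build_number(["nine", "hundred", "and", "seven", "thousand"]))  # 907000
print(parse("I have twenty three cats and dogs"))  # I have 23 cats and dogs
```

A builder this naive has clear limits: 'one thousand two hundred' would be mis-read, because the trailing 'hundred' multiplies the whole running value instead of only the 'two'. Handling that needs extra bookkeeping about which multiplier applies to which group.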

Most of the learning was in the setup needed to create the library. This involved configuring the and setting up a robust testing environment (tox). The mentors were really helpful  and reviewed the code mid-week and gave important insights. Overall it was a smooth first week of coding with no major issues.


Next Week

The plan for next week is to test the parser more robustly (adding more test cases) and then move on to integrating the current number-parser work with the date-parser library.

