Niraj-Kamdar's Blog

GSoC: Week 8: InputEngine.extend(functionalities)

Niraj-Kamdar
Published: 07/19/2020

What did I do this week?

I didn't know about the usage of other triage data like custom severity, so I asked my mentor about it and she gave me various use-case scenarios where it can be useful. After understanding the requirements, I added support for three new fields in our input_engine: 1) comments, 2) cve_number and 3) severity. Users can now specify this triage data and it will be reflected in all machine-readable output formats. I have also added support for the wheel and egg archive formats, modernized error handling in the OutputEngine and Extractor, and fixed a bug that was causing the progress bar to be displayed in quiet mode.
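
To make the new fields concrete, here is a minimal sketch of parsing a triage file that carries them; the column names, values, and CVE number are placeholders chosen for illustration, not necessarily the exact schema input_engine uses.

```python
import csv
from io import StringIO

# Hypothetical triage CSV carrying the three new fields (placeholder values).
TRIAGE_CSV = """vendor,product,version,remarks,comments,cve_number,severity
haxx,curl,7.59.0,Mitigated,patched in our build,CVE-2020-1234,LOW
"""

def parse_triage(text):
    """Return triage data keyed by (vendor, product, version)."""
    triage = {}
    for row in csv.DictReader(StringIO(text)):
        key = (row["vendor"], row["product"], row["version"])
        triage[key] = {
            "remarks": row.get("remarks", ""),
            "comments": row.get("comments", ""),
            "cve_number": row.get("cve_number", ""),
            "severity": row.get("severity", ""),
        }
    return triage

print(parse_triage(TRIAGE_CSV))
```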

What am I doing this week? 

I am going to work on the configuration file this week. I will most likely choose TOML as our config file format, as recommended by PEP.
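
As a rough idea of the direction, a config file could be loaded with the third-party toml package along these lines; the file name, section name, and keys below are assumptions for illustration, not a finalized design.

```python
import toml  # third-party TOML parser

# Hypothetical cve_bin_tool.toml (keys are assumptions):
#
#   [cve-bin-tool]
#   input_file = "triage.csv"
#   log_level = "debug"
#   quiet = false

def load_config(path="cve_bin_tool.toml"):
    """Read the TOML file and return the tool's settings as a dict."""
    with open(path) as config_file:
        data = toml.load(config_file)
    return data.get("cve-bin-tool", {})
```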

Have I got stuck anywhere?

No, I didn't get stuck anywhere this week.


GSoC: Week 7: with ErrorHandler()

Niraj-Kamdar
Published: 07/12/2020

What did I do this week?

This week my mentor pointed out several issues in my InputEngine PR, and I fixed them. I fixed the issue "Use patterns in VERSION_PATTERNS as valid CONTAINS_PATTERNS by default" by changing the checker metaclass to include VERSION_PATTERNS as valid CONTAINS_PATTERNS by default. I also updated the mapping test data of all checkers and removed the now-redundant CONTAINS_PATTERNS, and fixed an escape sequence issue. I have also created an error_handler module which provides an ErrorHandler context manager. It displays a colorful traceback and sets a custom exit code. Currently, it supports four different modes of error handling:

  1. TruncTrace - displays a truncated traceback (default)
    • Truncated traceback output
  2. FullTrace - displays the full traceback (used when the logging level is debug, which can be set via the -l debug option)
    • Full traceback output
  3. NoTrace - displays no traceback (used when the logging level is critical, which can be set via the -q (--quiet) flag)
    • No traceback output
  4. Ignore - ignores any raised exception (only used internally)

I have moved all custom exceptions into the error_handler module so that it is easy to assign error codes. I have also changed the excepthook to display a colorized traceback, and updated the unit tests for cli and input_engine to incorporate the changes in exception handling. If an error is raised outside the context manager, the full traceback is displayed regardless of the mode set, so always raise exceptions through the ErrorHandler context manager, or wrap code that can raise exceptions in it.
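
For readers who haven't seen the PR, here is a minimal sketch of the idea behind such a context manager; the real ErrorHandler in cve-bin-tool has more features (modes, colorized output), so the names and details below are simplified assumptions rather than the actual implementation.

```python
import sys
import traceback
from contextlib import AbstractContextManager

class SimpleErrorHandler(AbstractContextManager):
    """Sketch of an error-handling context manager (simplified assumption,
    not the actual cve_bin_tool.error_handler implementation)."""

    def __init__(self, exit_code=1, truncate=True):
        self.exit_code = exit_code
        self.truncate = truncate

    def __exit__(self, exc_type, exc_value, exc_tb):
        if exc_type is None:
            return False  # no error: nothing to do
        if self.truncate:
            # only the last frame, similar in spirit to a "TruncTrace" mode
            traceback.print_exception(exc_type, exc_value, exc_tb, limit=-1)
        else:
            traceback.print_exception(exc_type, exc_value, exc_tb)
        sys.exit(self.exit_code)

# Usage: any exception inside the block prints a (truncated) traceback
# and exits with the chosen code instead of the default one.
with SimpleErrorHandler(exit_code=21):
    raise ValueError("demo error")
```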

What am I doing this week? 

I am going to improve the InputEngine and Extractor modules this week.

Have I got stuck anywhere?

I wanted to improve InputEngine this week, but the ideas discussed in the issue about InputEngine's other functionality weren't clear, so I wanted to discuss future plans for InputEngine in this week's meeting. Unfortunately, the mentors were busy this week and the meeting got canceled. However, terriko opened issues regarding exceptions, which gave me the idea to colorize tracebacks and extend custom error codes to every module, so I did that instead, and as you can see it looks awesome now.


GSoC: Week 6: class InputEngine

Niraj-Kamdar
Published: 07/06/2020

What did I do this week?

I started working on the input engine this week. Currently we only have csv2cve, which accepts a CSV file of vendor, product and version as input and produces a list of CVEs as output, and it is a separate module with its own command-line entry point. I have created a module called input_engine that can process data from any input format (currently CSV and JSON). Users can now add a remarks field in CSV or JSON, which can have any of the following values (the values in parentheses are aliases for that type; a sketch of how these aliases might be normalized follows the list):

  1. NewFound (1, n, N)
  2. Unexplored (2, u, U)
  3. Mitigated (3, m, M)
  4. Confirmed (4, c, C)
  5. Ignored (5, i, I)
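
Here is a minimal sketch of how such alias normalization could work; the mapping and function names are assumptions for illustration, not necessarily how input_engine implements it.

```python
# Hypothetical alias table: any of these spellings maps to the canonical remark.
REMARK_ALIASES = {
    "NewFound": ["1", "n", "N", "NewFound"],
    "Unexplored": ["2", "u", "U", "Unexplored"],
    "Mitigated": ["3", "m", "M", "Mitigated"],
    "Confirmed": ["4", "c", "C", "Confirmed"],
    "Ignored": ["5", "i", "I", "Ignored"],
}

# Invert to a flat lookup: alias -> canonical remark.
ALIAS_TO_REMARK = {
    alias: remark for remark, aliases in REMARK_ALIASES.items() for alias in aliases
}

def normalize_remark(value, default="NewFound"):
    """Return the canonical remark for a user-supplied value."""
    return ALIAS_TO_REMARK.get(value.strip(), default)

assert normalize_remark("m") == "Mitigated"
assert normalize_remark("2") == "Unexplored"
```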

I have added an --input-file (-i) option to cli.py to specify the input file, which input_engine parses to create an intermediate data structure used by output_engine to display data according to remarks. Output is displayed in the same order as the priority given to the remarks. I have also created a dummy csv2cve which just calls cli.py with the -i option set to the file given to csv2cve. Here is an example of using -i with an input file to produce CVEs: cve-bin-tool -i=test.csv. Users can also use -i to supplement remarks data while scanning a directory so that the output is sorted according to remarks. Here is an example of that: cve-bin-tool -i=test.csv /path/to/scan.

I have also added test cases for input_engine and removed the old csv2cve test cases.

What am I doing this week? 

I have exams this week, from today to 9th July, so I won't be able to do much during the week, but I will spend the weekend improving input_engine, e.g. giving more fine-grained control for providing remarks and custom severity.

Have I got stuck anywhere?

No, I didn't get stuck anywhere this week :)


GSoC: Week 5: improve CVEDB

Niraj-Kamdar
Published: 06/29/2020

What did I do this week?

I finished my work on improving cvedb this week. I am using aiohttp to download the NVD dataset instead of requesting it with a multiprocessing pool. This has improved our download speed, since every task now downloads concurrently in the same thread instead of 4 tasks at a time with a process pool. I also measured the performance of aiosqlite, but it was significantly slower when writing to the database, so I decided to keep the writing process synchronous. I have also added a beautiful progress bar with the help of the rich module, so users now get feedback about the progress of downloading and updating the database. Here is a demo of how it looks now:

[Demo: rich progress bar shown while downloading and updating the database]
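
The gist of the change is something like the following sketch: fetch many URLs concurrently in one event loop and advance a rich progress bar as each download finishes. The URLs and helper names are placeholders, and the real cvedb code does more (caching, SHA validation, gzip handling).

```python
import asyncio
import aiohttp
from rich.progress import Progress

# Placeholder feed URLs; the real code scrapes the NVD site for the actual links.
FEED_URLS = [
    "https://example.org/nvdcve-1.1-2019.json.gz",
    "https://example.org/nvdcve-1.1-2020.json.gz",
]

async def fetch(session, url, progress, task_id):
    """Download one feed and advance the progress bar when it completes."""
    async with session.get(url) as response:
        data = await response.read()
    progress.update(task_id, advance=1)
    return data

async def download_all(urls):
    with Progress() as progress:
        task_id = progress.add_task("Downloading NVD feeds", total=len(urls))
        async with aiohttp.ClientSession() as session:
            return await asyncio.gather(
                *(fetch(session, url, progress, task_id) for url in urls)
            )

if __name__ == "__main__":
    asyncio.run(download_all(FEED_URLS))
```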

It was taking 2 minutes to download and update the database with multiprocessing; now it only takes 1 minute, so we roughly halved the run time (about a 2x speedup) just by converting IO-bound tasks into asynchronous coroutines. I have also fixed an event loop bug that we were sometimes hitting due to parallel execution of pytest, fixed a small bug in the extractor in PR #767, and created some utility functions to reduce code repetition.

What am I doing this week? 

I have started working on the input engine and I hope to provide basic triage support by the end of this week.

Have I got stuck anywhere?

No, I didn't get stuck anywhere this week :)


GSoC: Week 4: Status - 300 Multiple Choice

Niraj-Kamdar
Published: 06/22/2020

Hello everyone,

What did I do this week?

I have fixed several bugs in my Asynchronous File module PR. I have also started working on making cvedb run asynchronously. Currently, we have a cache_update function which downloads the JSON dataset from the NVD site and stores it in the user's local cache directory. The cvedb module contains a CVEDB class which has a method named nist_scrape that scrapes the NVD site to find appropriate links for cache_update to use. It also has the following methods:

  • init_database - creates tables if the database is empty.
  • get_cvelist_if_stale - updates only if the local database is more than one day old; this avoids the full, slow update on every execution.
  • populate_db - populates the database from the JSON dataset.
  • refresh - refreshes the CVE database and updates it if it is stale.
  • clear_cached_data - removes all data from the cache directory.

It also has other methods; some aren't related to updating the database and some are just helpers. We are currently using multiprocessing to download data, which isn't necessary since downloading is an IO-bound task, and asyncio is a good fit for IO-bound tasks. I am not yet sure how I am going to implement it, since there are multiple ways to achieve the same result, and I am experimenting with and benchmarking each approach. I think storing the JSON dataset is unnecessary since we have already populated the sqlite database from it. After populating the database, we only use the cached dataset to check whether it is stale, and for that we compute the SHA sum of the cached dataset and compare it to the latest SHA sum listed in the metadata on the NVD site. We could save a significant amount of space by just storing the SHA sum of each dataset from the NVD site and comparing that instead. I am also thinking about splitting the CVEDB class into three classes: 1) NVDDownloader - handles downloading and pre-processing of data, 2) CVEDB - creates the database if there isn't one and populates it with the data it gets from NVDDownloader, and 3) CVEScanner - scans the database and finds CVEs for a given vendor, product and version.
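
As a rough illustration of the stale-check idea, here is a minimal sketch that compares a stored SHA-256 digest against the one advertised in an NVD-style .meta file; the URL, file layout, and parsing below are assumptions, not the actual cvedb code.

```python
import hashlib
import urllib.request

def sha256_of(data: bytes) -> str:
    """Hex digest of a downloaded (or cached) dataset."""
    return hashlib.sha256(data).hexdigest().upper()

def latest_sha_from_meta(meta_url: str) -> str:
    """Fetch a .meta file and return the advertised sha256 value.

    Assumes the file contains a line like 'sha256:ABCDEF...'.
    """
    with urllib.request.urlopen(meta_url) as response:
        text = response.read().decode("utf-8")
    for line in text.splitlines():
        if line.lower().startswith("sha256:"):
            return line.split(":", 1)[1].strip().upper()
    raise ValueError("no sha256 entry found in meta file")

def is_stale(stored_sha: str, meta_url: str) -> bool:
    """The dataset is stale when the stored digest differs from the latest one."""
    return stored_sha.upper() != latest_sha_from_meta(meta_url)
```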

In the new architecture I am considering, we aren't storing the dataset on disk. So how does the populate_db function get the data it needs to populate the sqlite database? We can use a very popular technique we learned in OS classes: the producer-consumer pattern. In our case, NVDDownloader acts as the producer, CVEDB acts as the consumer, and a queue acts as the pipeline connecting them. There are several benefits to this architecture: 1) we only need to wait if the queue is either full or empty (a queue without a size limit isn't practical, because in our case the producer is too fast), and 2) we get a performance improvement since we are using RAM as intermediate storage instead of disk.
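
A minimal sketch of that producer-consumer wiring with an asyncio.Queue might look like the following; the item shapes and names are placeholders rather than the eventual NVDDownloader/CVEDB code.

```python
import asyncio

async def producer(queue: asyncio.Queue) -> None:
    """Stands in for NVDDownloader: pushes pre-processed CVE records."""
    for year in range(2002, 2021):
        record = {"year": year, "cves": []}  # placeholder payload
        await queue.put(record)              # blocks when the queue is full
    await queue.put(None)                    # sentinel: no more data

async def consumer(queue: asyncio.Queue) -> None:
    """Stands in for CVEDB: pops records and writes them to the database."""
    while True:
        record = await queue.get()           # blocks when the queue is empty
        if record is None:
            break
        # ... insert record into sqlite here ...
        queue.task_done()

async def main() -> None:
    queue: asyncio.Queue = asyncio.Queue(maxsize=4)  # bounded on purpose
    await asyncio.gather(producer(queue), consumer(queue))

asyncio.run(main())
```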

First I wrote the code for NVDDownloader and benchmarked it: with asyncio queues and a dummy consumer, it took 33 seconds to complete the whole task. I then tried to improve its performance by using a ProcessPoolExecutor with 4 workers to pre-process the data, and it finished in 22 seconds. But all this work to optimize the producer is in vain, because our code is only as fast as its slowest part, and in my case that's the consumer. Database transactions are very slow, so it doesn't matter how fast my producer is; I need to improve the performance of writing to the database. sqlite can handle around 50,000 insertions per second, but only a few transactions per second, and we are currently committing after every execute statement. We can instead batch thousands of insert statements into one transaction and commit that. We can also improve database write performance by sacrificing some durability, which I guess won't be a problem for us since we only need the database for CVE lookups: we can do an integrity check when the application starts and refresh the database if it is corrupted (which will be rare). I have to do several experiments before I finalize the best solution.
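
To make the batching point concrete, here is a minimal sketch contrasting per-row commits with a single commit per batch, plus the kind of PRAGMA that trades durability for write speed; the table name and data are made up for illustration.

```python
import sqlite3

rows = [(f"CVE-2020-{i:04d}", "vendor", "product", str(i)) for i in range(10_000)]

con = sqlite3.connect("cve_sketch.db")
con.execute(
    "CREATE TABLE IF NOT EXISTS cves (cve_number TEXT, vendor TEXT, product TEXT, version TEXT)"
)

# Slow: one transaction per row (commit after every insert).
# for row in rows:
#     con.execute("INSERT INTO cves VALUES (?, ?, ?, ?)", row)
#     con.commit()

# Faster: relax durability (the database can be rebuilt from NVD if it is lost)
# and insert the whole batch in a single transaction.
con.execute("PRAGMA synchronous = OFF")
con.execute("PRAGMA journal_mode = MEMORY")
con.executemany("INSERT INTO cves VALUES (?, ?, ?, ?)", rows)
con.commit()
con.close()
```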

What am I doing this week? 

I am going to discuss with my mentors what I should do to implement the above: should I keep the whole dataset file or just keep the metadata? I will run several experiments, benchmark them, and choose the best solution for the problem. I am also thinking about improving the UX by displaying a progress bar.

Have I got stuck anywhere?

Yes, I need to confirm with my mentors whether we want to keep the cached JSON files or not. I have informed them about this on gitter and I am also going to discuss it in this week's meeting.
