Hello everyone,
What did I do this week?
I have fixed several bugs in my PR for the Asynchronous File module. I have also started working on making cvedb run asynchronously. Currently, we have a cache_update function which downloads the JSON dataset from the NVD site and stores it in the user's local cache directory. The cvedb module contains a CVEDB class, which has a method named nist_scrape for scraping the NVD site to find the links that cache_update needs. It also has the following methods:
- init_database - Creates tables if the database is empty.
- get_cvelist_if_stale - Updates if the local database is more than one day old; this avoids the full, slow update on every execution.
- populate_db - Populates the database from the JSON dataset.
- refresh - Refreshes the CVE database, updating it if it is stale.
- clear_cached_data - Removes all data from the cache directory.
It also has other methods; some aren't related to updating the database, and some are just helpers. We are currently using multiprocessing to download the data, which isn't necessary: downloading is an IO-bound task, and for IO-bound tasks asyncio is a good solution. I am not yet sure how I am going to implement it, since there are multiple ways to achieve the same result, so I am experimenting with each approach and benchmarking the results.

I think storing the JSON dataset is unnecessary, since we have already populated the sqlite database from it. After populating the database, we only use the dataset to check whether it is stale, and for that we compute the SHA sum of the cached dataset and compare it to the latest SHA sum listed in the metadata on the NVD site. We could save a significant amount of space by storing just the SHA sum of each dataset and comparing that instead.

I am also thinking about splitting the CVEDB class into three classes: 1) NVDDownloader - handles downloading and pre-processing of the data, 2) CVEDB - creates the database if there isn't one and populates it with the data it gets from NVDDownloader, and 3) CVEScanner - scans the database to find CVEs for a given vendor, product, and version.
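To make that metadata-only staleness check concrete, here is a minimal sketch. It assumes the NVD .meta files keep their plain-text key:value format (which includes a sha256 line) and uses aiohttp as the async HTTP client; the function names and URL template are illustrative, not existing code.

```python
import asyncio
import aiohttp  # assumed HTTP client for the async rewrite

META_URL = "https://nvd.nist.gov/feeds/json/cve/1.1/nvdcve-1.1-{year}.meta"

async def fetch_latest_sha(session, year):
    # The .meta file is a small plain-text file with one key:value per line;
    # we only need the sha256 entry.
    async with session.get(META_URL.format(year=year)) as response:
        text = await response.text()
    for line in text.splitlines():
        if line.startswith("sha256:"):
            return line.split(":", 1)[1].strip()
    raise ValueError(f"no sha256 entry in meta file for {year}")

async def is_stale(stored_sha, year):
    # Compare the SHA we recorded at the last update against
    # the one NVD lists right now.
    async with aiohttp.ClientSession() as session:
        return stored_sha != await fetch_latest_sha(session, year)
```

With something like this, a refresh only needs to download a few hundred bytes of metadata per feed to decide whether anything changed, instead of keeping the full JSON around just to hash it.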
In the new architecture I have in mind, we don't store the dataset on disk. So how does populate_db get the data it needs to populate the sqlite database? We can use a very popular technique we learnt in OS classes: the producer-consumer pattern. In our case, NVDDownloader will act as the producer, CVEDB will act as the consumer, and a Queue will act as the pipeline connecting them. This architecture has several benefits: 1) we only need to wait when the queue is either full or empty (a queue without a size limit isn't practical, because in our case the producer is too fast), and 2) we get a performance improvement, since we are using RAM as intermediate storage instead of the disk.
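Here is a minimal sketch of that pipeline with asyncio queues; download_and_parse and write_to_db are stand-ins for the real NVDDownloader and CVEDB work, and the queue size of 4 is arbitrary:

```python
import asyncio

async def download_and_parse(feed):
    # Stand-in for the real download + pre-processing step.
    await asyncio.sleep(0.1)
    return f"records from feed {feed}"

def write_to_db(records):
    # Stand-in for the real sqlite writes (see the batching notes below).
    print("inserted", records)

async def producer(queue, feeds):
    # NVDDownloader's role: fetch and pre-process, then hand off via the queue.
    for feed in feeds:
        records = await download_and_parse(feed)
        # put() blocks once the queue is full, so a fast producer
        # can't outrun the consumer and eat all the RAM.
        await queue.put(records)
    await queue.put(None)  # sentinel: no more work

async def consumer(queue):
    # CVEDB's role: drain the queue and write each batch to the database.
    while True:
        records = await queue.get()  # blocks while the queue is empty
        if records is None:
            break
        write_to_db(records)

async def main():
    queue = asyncio.Queue(maxsize=4)  # the bounded pipeline
    await asyncio.gather(producer(queue, range(2002, 2021)), consumer(queue))

asyncio.run(main())
```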
First I wrote the code for NVDDownloader and benchmarked it: with asyncio queues and a dummy consumer, it took 33 seconds to complete the whole task. I then tried to improve its performance by using a ProcessPoolExecutor with 4 workers to pre-process the data, which brought it down to 22 seconds. But all this work to optimize the producer is in vain, because our code is only as fast as its slowest part, and in my case that's the consumer. Database transactions are very slow, and it doesn't matter how fast my producer is; I need to improve the performance of writing to the database.

sqlite can handle 50000 insertions/second, but only a few transactions/second. We are currently committing with every execute statement. We can instead batch thousands of insert statements into a single transaction and commit that. We can also improve write performance by sacrificing some durability of the database. I don't think that will be a problem for us, since we only need the database for CVE lookups: we can run an integrity check when the application starts and refresh the database if it is corrupted (which will be rare). I have to run several experiments before I finalize the best solution.
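As a sketch of the batching and durability ideas (the table schema here is made up for illustration and isn't cvedb's real one):

```python
import sqlite3

def bulk_insert(db_path, rows, batch_size=10_000):
    connection = sqlite3.connect(db_path)
    # Durability trade-off: synchronous=OFF skips fsync, so a power cut can
    # corrupt the file, but the db is just a rebuildable cache and we can run
    # "PRAGMA integrity_check" on startup and refresh it if needed.
    connection.execute("PRAGMA synchronous = OFF")
    connection.execute("PRAGMA journal_mode = MEMORY")
    connection.execute(
        "CREATE TABLE IF NOT EXISTS cve "
        "(id TEXT, vendor TEXT, product TEXT, version TEXT)"
    )
    cursor = connection.cursor()
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        cursor.executemany("INSERT INTO cve VALUES (?, ?, ?, ?)", batch)
        # One commit per batch of thousands of inserts,
        # instead of one commit per execute.
        connection.commit()
    connection.close()
```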
What am I doing this week?
I am going to discuss with my mentors what I should do to implement the design above. Should we keep the whole dataset file or just the metadata? I will run several experiments, benchmark them, and choose the best solution for the problem. I am also thinking about improving the UX by displaying a progress bar.
Have I got stuck anywhere?
Yes, I need to confirm with my mentors whether we want to keep the cached JSON files or not. I have informed them about this on Gitter, and I am also going to discuss it in this week's meeting.