These two weeks, I've been working on fixing bug #1. I think my algorithm will work in general, but right now the get_cves function doesn't behave as I expected, which would cause problems with the data format later on. Previously, the function ran a query that selected every version of a product/vendor pair, saved the result in memory, and then used Python code to check whether the version we wanted was in it. This is inefficient and memory-consuming, so I refactored the query: it now selects only the version we want and returns just that. The time for running the whole test suite dropped from 83 seconds to 61 seconds. Next, I will try to implement the dynamic programming method and see whether the time complexity stays acceptable.
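A minimal sketch of the refactor, assuming a hypothetical `cves` table with `vendor`, `product`, `version`, and `cve_number` columns (the real cve-bin-tool schema and function signature may differ): the filtering moves from Python into the SQL `WHERE` clause, so only matching rows ever reach memory.

```python
import sqlite3

# Hypothetical schema and names: this only illustrates the idea of
# filtering by version in SQL instead of fetching every version.
def get_cves(conn, vendor, product, version):
    """Return CVE ids for one exact version, filtered in SQL."""
    cursor = conn.cursor()
    cursor.execute(
        "SELECT cve_number FROM cves "
        "WHERE vendor = ? AND product = ? AND version = ?",
        (vendor, product, version),
    )
    return [row[0] for row in cursor.fetchall()]
```

The old approach would have been `SELECT version FROM cves WHERE vendor = ? AND product = ?` followed by an `in` check in Python, which pulls every version row into memory first.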
Last time I spent my time testing on Windows and refining the documentation. While trying to get the tool running on Windows, I found that environment setup is a real problem. One solution is to write a script that configures everything automatically. In the following week I will try to fix issue #1, which is about the NVD database.
Last week, I implemented multithreaded scanning with John's help. At first, I thought the logic should be a function that instantiates a database connection every time and closes it at the end, but that would be too inefficient, since connecting and disconnecting could take a long time if every thread called the function for each file. Instead, we can use a queue to hold all the files to be scanned; each thread opens the database once and closes it only when there are no jobs left. We also don't need to worry about thread safety here, since the queue in Python is already thread-safe. Besides that, I added a flag to enable/disable updating the database, so users can save time when testing or running the tool.
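A minimal sketch of that pattern, with the real scanning logic and database simplified away (the worker body and names here are illustrative, not the actual implementation): each thread opens one connection, drains the shared `queue.Queue` until it is empty, then closes the connection and exits.

```python
import queue
import sqlite3
import threading

def scan_worker(job_queue, results, lock):
    # One connection per thread, opened once -- not once per file.
    conn = sqlite3.connect(":memory:")
    while True:
        try:
            filename = job_queue.get_nowait()
        except queue.Empty:
            break  # no jobs left: fall through and close the connection
        # ... scan `filename` against the database here ...
        with lock:
            results.append(filename)
    conn.close()

def scan_all(filenames, num_threads=4):
    job_queue = queue.Queue()  # queue.Queue is already thread-safe
    for name in filenames:
        job_queue.put(name)
    results, lock = [], threading.Lock()
    threads = [
        threading.Thread(target=scan_worker, args=(job_queue, results, lock))
        for _ in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because `queue.Queue` handles its own locking, the workers need no extra synchronization to pull jobs; the lock above only guards the shared results list.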
Compared with C, I think it is easier to implement multithreading/multiprocessing in Python. For example, there are more ways for processes/threads to communicate; in C we can only use signals, shared memory, pipes, and message queues. In addition, in Python we can call `join()` on each thread, which is like `wait()` in C. But in C the parent process is the only one that calls `wait()`, so in terms of coding, we have to implement the parent and child processes separately.
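The `join()` usage mentioned above looks like this in a toy example (the worker function is made up for illustration): the launching code blocks on each thread the same way a C parent blocks in `wait()`.

```python
import threading

results = []

def worker(n):
    # list.append is atomic under the GIL, so this toy example
    # needs no explicit lock.
    results.append(n * n)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
# join() blocks until the thread finishes, analogous to wait() in C,
# but it can be called from any thread that holds the handle.
for t in threads:
    t.join()
```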
The other thing I learned is about code coverage. Multithreaded code is hard to debug because it is difficult to track every thread at the same time. With code coverage's help, we can see a report of which parts of the code were not covered during the test, and then figure out why they were never entered.
After doing some traceback of the program, I found that when running multithreaded scanning, each thread instantiates an NVD object (NVDSQLite), and in nvd.get_cvelist_if_stale() the object calls init_database() regardless of the database's state. Then, at line 131 of NVDAutoUpdate.py, it executes CREATE TABLE IF NOT EXISTS. Since this is not thread-safe, the database might get locked.
Therefore, there might be a logical error in the previous code, which I posted in issue #177. I think that when getting CVEs we don't need to check the database status, since we have already checked it before; if not, we could manually check the status first and then try to get the CVEs. That way, we only need to create a cursor and connect to the database. So I will fix that first and then test multithreaded mode.
Actually, I don't think sqlite3 is a good fit for multithreading, since Python's official documentation says: "Older SQLite versions had issues with sharing connections between threads. That's why the Python module disallows sharing connections and cursors between threads. If you still try to do so, you will get an exception at runtime."
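That runtime exception is easy to reproduce. In a small sketch (names here are illustrative), a connection created in the main thread and used from a worker thread raises `sqlite3.ProgrammingError`, because `sqlite3.connect` defaults to `check_same_thread=True`:

```python
import sqlite3
import threading

conn = sqlite3.connect(":memory:")  # created in the main thread
errors = []

def use_shared_connection():
    try:
        conn.execute("SELECT 1")
    except sqlite3.ProgrammingError as exc:
        # sqlite3 refuses cross-thread use of a connection by default
        errors.append(str(exc))

t = threading.Thread(target=use_shared_connection)
t.start()
t.join()
```

This is why the queue-based design above has each worker thread open its own connection rather than sharing one.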
This week I'm working on implementing multithreaded extracting and scanning. So far the extracting works. This week I learned how a thread pool works in Python. The pool has a method called `map`, which is basically the same as Python's built-in `map`: it takes a function and an iterable as input and applies the function to the iterable across multiple threads. For extracting, this is easy to implement because we don't need to worry about thread safety. However, for scanning, this could be a problem. We know that multithreading at the operating-system level is hard because we need to keep all processes/threads synchronized using locks/mutexes/semaphores. At first I thought this was fine, because the scanning function is supposed to just run multiple queries at the same time, so we shouldn't need a lock.
But actually this becomes a problem, which I will explain later.
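The `pool.map` pattern described above can be sketched like this, assuming `multiprocessing.pool.ThreadPool` and a made-up `extract` function standing in for the real extraction step:

```python
from multiprocessing.pool import ThreadPool

def extract(filename):
    # Placeholder for the real extraction step; each file is
    # independent, so no locking is needed here.
    return "extracted:" + filename

# Python 3: the pool works as a context manager.
with ThreadPool(4) as pool:
    results = pool.map(extract, ["a.tar.gz", "b.rpm", "c.deb"])
```

Like the built-in `map`, `pool.map` preserves the order of the input iterable in its results, even though the calls run concurrently.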
Another problem is the difference in pool usage between Python 2 and Python 3. In Python 3, the pool is supported as a context manager, so we can use something like `with Pool() as pool: pool.map(...)`, but this is invalid in Python 2. So we had to go back to using try/finally, or the pool will not terminate correctly.
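The try/finally form looks like this (a sketch with a made-up worker function): the `finally` block guarantees the pool is torn down even if `map` raises, which is what the context manager would otherwise do for us in Python 3.

```python
from multiprocessing.pool import ThreadPool

def scan(filename):
    # Stand-in for the real per-file scanning work.
    return len(filename)

# Python 2 compatible: the pool is not a context manager there,
# so terminate it explicitly in a finally block.
pool = ThreadPool(2)
try:
    results = pool.map(scan, ["a.out", "lib.so"])
finally:
    pool.terminate()
    pool.join()
```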