Articles on wzao1515's Blog

Week 10 & 11

zw2498@columbia.edu (wzao1515) — Tue, 20 Aug 2019 02:56:27 +0000

For the past two weeks, I've been working on fixing bug #1. I think my algorithm will work in general, but right now the get_cves function doesn't behave as I expected, which would cause problems with the data format further on. Previously, the function ran a query that selected every version of a product/vendor pair, kept the whole result in memory, and then used Python code to check whether the version we wanted was in it. This is inefficient and memory-consuming, so I refactored the query code. Now the query selects only the version we want and returns just that. The time to run the whole test suite dropped from 83 seconds to 61 seconds. Next, I will try to implement the dynamic programming method and see whether its time complexity is acceptable.
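The change to the query can be sketched roughly as follows. This is only an illustration: the table `cve_versions`, its columns, and the CVE identifiers are made up for the example and are not the tool's real schema or data.

```python
import sqlite3


def build_demo_db():
    """Build an in-memory database with a hypothetical schema and placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cve_versions "
        "(vendor TEXT, product TEXT, version TEXT, cve_number TEXT)"
    )
    conn.executemany(
        "INSERT INTO cve_versions VALUES (?, ?, ?, ?)",
        [
            ("haxx", "curl", "7.59.0", "CVE-0000-0001"),  # placeholder IDs
            ("haxx", "curl", "7.59.0", "CVE-0000-0002"),
            ("haxx", "curl", "7.60.0", "CVE-0000-0003"),
        ],
    )
    return conn


def get_cves_filter_in_python(conn, vendor, product, version):
    """Old approach: fetch every version of the pair, then filter in Python."""
    rows = conn.execute(
        "SELECT version, cve_number FROM cve_versions "
        "WHERE vendor = ? AND product = ?",
        (vendor, product),
    ).fetchall()
    return sorted(cve for ver, cve in rows if ver == version)


def get_cves_filter_in_sql(conn, vendor, product, version):
    """New approach: let the query return only the rows for the wanted version."""
    rows = conn.execute(
        "SELECT cve_number FROM cve_versions "
        "WHERE vendor = ? AND product = ? AND version = ?",
        (vendor, product, version),
    ).fetchall()
    return sorted(cve for (cve,) in rows)
```

Both functions return the same list; the second one simply moves the filtering into the database so that fewer rows are copied into memory.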

Week 9

zw2498@columbia.edu (wzao1515) — Fri, 02 Aug 2019 03:51:36 +0000

Last week I spent my time testing on Windows and refining the documentation. While trying to get the tool running on Windows, I found that setting up the environment is a real problem. One solution is to write a script that configures everything automatically. In the following week I will try to fix issue #1, which is about the NVD database.

Week 8

zw2498@columbia.edu (wzao1515) — Fri, 26 Jul 2019 19:37:07 +0000

Last week, I implemented multithreaded scanning with John's help. At first, I thought the logic should be a function that opens a database connection every time it is called and closes it at the end, but that would be too inefficient: connecting and disconnecting takes time, and each thread would call the function for each file. Instead, we can put all the files to be scanned in a queue, and each thread opens the database once and closes it only when there are no jobs left. We also don't need to handle thread safety for the queue ourselves, since Python's queue is already thread-safe. Besides that, I added a flag to enable/disable the database update so that users can save time when testing or running the tool.
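A minimal sketch of that queue-based design is below. It is not the tool's actual code: the table `known_files`, the helper `make_demo_db`, and the "scan" (a simple lookup) are assumptions made up to keep the example runnable.

```python
import os
import queue
import sqlite3
import tempfile
import threading


def make_demo_db():
    """Create a throwaway database file with one hypothetical table."""
    path = os.path.join(tempfile.mkdtemp(), "demo.db")
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE known_files (name TEXT)")
    conn.execute("INSERT INTO known_files VALUES ('libcurl.so')")
    conn.commit()
    conn.close()
    return path


def scan_worker(jobs, results, db_path):
    """Open the database once, drain the job queue, then close the database."""
    conn = sqlite3.connect(db_path)  # one connection per thread
    try:
        while True:
            try:
                filename = jobs.get_nowait()
            except queue.Empty:
                break  # no jobs left, so this thread is done
            row = conn.execute(
                "SELECT COUNT(*) FROM known_files WHERE name = ?", (filename,)
            ).fetchone()
            results.put((filename, row[0] > 0))
    finally:
        conn.close()


def scan_all(filenames, db_path, num_threads=4):
    """Fill the queue first, then let the worker threads share it."""
    jobs = queue.Queue()
    for name in filenames:
        jobs.put(name)
    results = queue.Queue()
    threads = [
        threading.Thread(target=scan_worker, args=(jobs, results, db_path))
        for _ in range(num_threads)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    found = {}
    while not results.empty():
        name, hit = results.get()
        found[name] = hit
    return found
```

Because the queue is filled before any thread starts, an empty queue reliably means "no more work", so no extra lock or sentinel value is needed in this sketch.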

Compared with C, I think multithreading/multiprocessing is easier to implement in Python. For example, there are more ways for processes/threads to communicate; in C we could only use signals, shared memory, pipes and message queues. In addition, in Python we can call `join()` on each thread, which is like wait() in C. But in C the parent process is the only one that needs to call wait(), so in terms of coding, we have to implement the parent and child processes separately.

The other thing I learnt about is code coverage. Multithreaded code is hard to debug because it is difficult to track every thread at the same time. With a coverage report, we can see which parts of the code were not executed during the tests, and then work out why they were never entered.

Week 7 P2

zw2498@columbia.edu (wzao1515) — Fri, 19 Jul 2019 02:16:11 +0000

After tracing through the program, I found that when running multithreaded scanning, each thread instantiates an NVDSQLite object (nvd), and in nvd.get_cvelist_if_stale() the object calls init_database() regardless of the state of the database; at line 131 of NVDAutoUpdate.py, it executes CREATE TABLE IF NOT EXISTS. Since this is not thread-safe, the database might be locked.

So there might be a logic error in the previous code, which is posted as issue #177. I think that when getting CVEs we don't need to check the database status, since we have already checked it before; if we haven't, we could check the status manually first and then try to get the CVEs. That way, we only need to connect to the database and create a cursor. I will fix that first and then test multithreaded mode.

Actually, I don't think sqlite3 is a good fit for multithreading, since Python's official documentation says: "Older SQLite versions had issues with sharing connections between threads. That’s why the Python module disallows sharing connections and cursors between threads. If you still try to do so, you will get an exception at runtime."
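The restriction quoted above is easy to reproduce. The sketch below uses only the standard sqlite3 and threading modules with the default connection settings; the function names are made up for the example.

```python
import sqlite3
import threading


def use_shared_connection(conn):
    """Try to use a connection created in another thread; collect the errors."""
    errors = []

    def worker():
        try:
            conn.execute("SELECT 1")
        except sqlite3.ProgrammingError as exc:
            # Raised because the connection belongs to a different thread.
            errors.append(exc)

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    return errors


def use_private_connection(db_path):
    """Let the thread open and close its own connection instead."""
    results = []

    def worker():
        conn = sqlite3.connect(db_path)
        try:
            results.append(conn.execute("SELECT 1").fetchone()[0])
        finally:
            conn.close()

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    return results
```

The second function reflects the direction taken in these posts: every thread connects on its own rather than sharing one connection object.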

Week 7

zw2498@columbia.edu (wzao1515) — Fri, 19 Jul 2019 02:08:48 +0000

This week I'm working on implementing multithreaded extracting and scanning. So far the extracting works. This week I learnt how the thread pool works in Python. The pool has a method called `map`, which is basically the same as the built-in `map`: it takes a function and an iterable as input and applies the function to the items of the iterable across multiple threads. For extracting, this is easy to implement because we don't need to worry about thread safety. For scanning, however, it could be a problem. We know that concurrency in an operating system is really hard because we need to keep all processes/threads synchronized using locks/mutexes/semaphores. At first I thought this would be okay, because in the scanning function we are only supposed to run multiple queries at the same time, so we wouldn't need a lock.

But actually this becomes a problem. I will explain later.

Another problem is the difference in pool handling between python2 and python3. In python3, the pool can be used as a context manager, so we can wrap the pool.map() call in a `with` statement, while this is invalid in python2. So we had to go back to try/finally, or the pool would not terminate correctly.
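The two styles can be compared side by side. This sketch assumes the pool is `multiprocessing.pool.ThreadPool` (the post does not say which pool class was used), and `label_file` is a stand-in for the real per-file work.

```python
from multiprocessing.pool import ThreadPool


def label_file(name):
    """Stand-in for per-file work such as extracting an archive."""
    return "{}:{}".format(name, len(name))


def map_with_context_manager(names, workers=4):
    """Python 3 style: the with statement cleans up the pool on exit."""
    with ThreadPool(workers) as pool:
        return pool.map(label_file, names)


def map_with_try_finally(names, workers=4):
    """Style that also works on Python 2: clean up the pool explicitly."""
    pool = ThreadPool(workers)
    try:
        return pool.map(label_file, names)
    finally:
        pool.close()
        pool.join()
```

In both versions `pool.map` blocks until every item has been processed and returns the results in input order, so the cleanup only runs after the work is done.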

Week 6

zw2498@columbia.edu (wzao1515) — Wed, 10 Jul 2019 04:01:46 +0000

This week, I implemented several new checkers for the tool. Since I have implemented some checkers before, this was not difficult. The best way to implement a checker is to look up the NVD database first; from there you can find the vendor name, library name, versions... The rest of the process is to look into the archives (released packages) and come up with a regex to match and guess the version.
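As an illustration of the regex step, here is a small sketch for an imaginary library called `libexample`. The pattern and the function name are invented for the example; real checkers need patterns derived from the strings found in actual released packages.

```python
import re

# Hypothetical pattern: matches "libexample/1.2.3", "libexample-1.2" and similar.
VERSION_PATTERN = re.compile(r"libexample[ /-](\d+\.\d+(?:\.\d+)?)")


def guess_version(lines):
    """Return the first version string found in the lines, or "UNKNOWN"."""
    for line in lines:
        match = VERSION_PATTERN.search(line)
        if match:
            return match.group(1)
    return "UNKNOWN"
```

For example, `guess_version(["built with libexample/1.2.3 (x86_64)"])` returns `"1.2.3"`, while input with no match returns `"UNKNOWN"`.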

When I was looking at the source code of scan_file, I had a question: what is the difference between "is" and "contains"? I found that if the name of the file is "xxx package", then the value is "is"; if the name appears in some lines of the file, then the value is "contains". But I'm still wondering why we need to distinguish between them.

Another issue is that the scan process is inefficient, which is why we need the multithreading work.

Week 5

zw2498@columbia.edu (wzao1515) — Fri, 28 Jun 2019 14:42:36 +0000

This week, I basically finished the Python implementation of the extractors; Python has some packages that make this easy. One thing to note is that we should avoid running commands from the shell / command line where possible, for efficiency. And if we do have to (for example through subprocess), we should avoid the flag "shell=True". If this flag is true, Python opens a new shell and hands it whatever string is passed in the parameter, so our environment might be exposed to injection attacks.
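The safer calling style can be sketched like this. The child command here is just the Python interpreter echoing its argument, chosen so the example runs on any platform; `echo_argument` is a made-up name.

```python
import subprocess
import sys


def echo_argument(untrusted):
    """Run a child process with an argument list, so no shell parses the text."""
    completed = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.argv[1])", untrusted],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return completed.stdout.strip()
```

With an argument list and no `shell=True`, a value such as `"notes.txt; echo injected"` reaches the child as one literal argument. If the same text were interpolated into a single command string and run with `shell=True`, the shell would treat the part after `;` as a second command.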

As a result, I have two more weeks that are not in the plan. So I think I will help Terri look into some pull requests for bug fixes after some refinements of my code.

Week 4

zw2498@columbia.edu (wzao1515) — Tue, 18 Jun 2019 00:33:07 +0000

After implementing strings, I worked on implementing the extractors. I found that Python actually provides several modules for extracting files. For example, by instantiating a 'tarfile' object, we can extract files that end with .gz, .bz... So I modified the extractor file so that if the system doesn't have programs like 'tar' or 'unzip', it uses the Python implementation to extract the target. This works for tar, zip, exe... For rpm and deb, I found that there is a package called libarchive, so maybe we could use it. For the rest of this week I will start implementing the other extractors, the ones Python doesn't have a module for.

Right now I'm working on extracting cab files. I saw on GitHub that people have implemented such an extractor in Python, so I will understand their code first and see what I can do (use it, rewrite it in C, etc...)
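The fallback idea can be sketched with the standard library alone. The function names are invented for this example and the real extractor is organised differently; note also that `extractall` should only be used on archives you trust.

```python
import os
import shutil
import subprocess
import tarfile
import zipfile


def extract_with_python(path, dest):
    """Extract a tar (optionally compressed) or zip archive with stdlib modules."""
    if tarfile.is_tarfile(path):
        with tarfile.open(path) as archive:  # compression is auto-detected
            archive.extractall(dest)
    elif zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as archive:
            archive.extractall(dest)
    else:
        raise ValueError("unsupported archive: {}".format(path))
    return sorted(os.listdir(dest))


def extract(path, dest):
    """Use the external tar program when it exists, else fall back to Python."""
    if tarfile.is_tarfile(path) and shutil.which("tar"):
        subprocess.run(["tar", "-xf", path, "-C", dest], check=True)
        return sorted(os.listdir(dest))
    return extract_with_python(path, dest)
```

`shutil.which` returns `None` when the program is not on the PATH, which is what triggers the pure-Python branch on a system without `tar`.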

Week 3 (P2)

zw2498@columbia.edu (wzao1515) — Wed, 12 Jun 2019 00:39:28 +0000

Continuing from last time, I decided to implement strings and file in C on Linux first, and I learnt a lot from this.

One thing is how to extend Python with C. The entry point of the C module should be PyMODINIT_FUNC PyInit_modulename(void); the import machinery looks for this name when the module built by setuptools is loaded. In that function, you return the object created by PyModule_Create(&modulename), where the argument points to a data structure declared as static struct PyModuleDef. In it, you define the Python-facing wrappers through the function table (similar to the system call table in the Linux kernel). This is really a brilliant design since it reflects object-oriented ideas: encapsulation, inheritance and polymorphism. One thing to note is that there is a significant difference between python2 and python3 in how this is implemented, so I decided to implement the python3 version first since python2 is nearly at end of life.

Another thing is the actual implementation of strings and file. There are two difficulties that I addressed. The first is reading the file. There is a slight difference between Python and C: in C, you first need to get the file size and allocate memory for the buffer. This can be dangerous since the memory might leak (maybe you forget to free it after use). The second is how to iterate over the buffer after reading. This is trivial but important. At first, I simply used strlen(buffer), but this is not correct, because strlen stops at the first NUL byte and binary files contain many of them; the loop bound should be the file length you got before. Another important thing I learnt is the difference between char * and a char array. You can modify the contents of an array, but not a string literal reached through a char pointer. The reason is that char [] allocates the string on the stack, while char * only creates a pointer on the stack and the string itself is stored in the (read-only) data segment.
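The iteration point is easier to see in a Python sketch than in C, because a bytes object always knows its own length. This is a simplified take on what a strings-like scan does, not the C code from this project; it treats only bytes 0x20 to 0x7e as printable, which is narrower than the real strings command.

```python
def extract_strings(data, min_length=4):
    """Collect runs of printable ASCII at least min_length bytes long."""
    found = []
    current = bytearray()
    for byte in data:  # walks the whole buffer, NUL bytes included
        if 32 <= byte <= 126:
            current.append(byte)
        else:
            if len(current) >= min_length:
                found.append(current.decode("ascii"))
            current = bytearray()
    if len(current) >= min_length:  # flush a run that reaches the end
        found.append(current.decode("ascii"))
    return found
```

On the buffer `b"\x00\x01curl 7.59\x00ab\x00libssl\xff"` this returns `["curl 7.59", "libssl"]`. A loop bounded by strlen on the same bytes in C would stop immediately, since the very first byte is NUL.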

Week 3

zw2498@columbia.edu (wzao1515) — Mon, 10 Jun 2019 15:05:40 +0000

Currently, I'm still working on implementing the C extension.

The problem I ran into is that there is a significant difference between extensions on Windows and on Linux, while the instructions on the PSF site are mainly about Linux. For example, in the new .c file, on Linux I just need to declare the function and implement it, but on Windows I need to create a new module initializer named PyInit_, which Windows compilers use to identify a Python extension.

Besides, since I'm using Visual Studio for development, there are some issues with paths. But in the end I found the official Visual Studio instructions very helpful: https://docs.microsoft.com/en-us/visualstudio/python/working-with-c-cpp-python-in-visual-studio?view=vs-2019

I haven't talked to Terri and John yet, so I will leave this as it is for now and change it after the meeting.

Week 2

zw2498@columbia.edu (wzao1515) — Thu, 06 Jun 2019 14:48:01 +0000

As described last week, this week I mainly focused on solving the environment setup and enhancing the test files to support Windows commands. I have made a pull request at https://github.com/intel/cve-bin-tool/pull/146.

What I changed is this: under Windows, the header (magic bytes) of the compiled binary file is not the same as on Linux; it is 'MZ\x90\x00'. Besides, the "rm" command has to be replaced with "erase" under Windows.
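A header check along those lines might look like the sketch below. The Windows prefix is the one given in this post; the ELF prefix `\x7fELF` is the standard magic for Linux binaries. The function name and return labels are made up for the example.

```python
ELF_MAGIC = b"\x7fELF"       # start of a Linux ELF binary
PE_MAGIC = b"MZ\x90\x00"     # start of the Windows binaries seen in the tests


def binary_kind(header):
    """Classify a file by the first bytes of its content."""
    if header.startswith(ELF_MAGIC):
        return "elf"
    if header.startswith(PE_MAGIC):
        return "pe"
    return "unknown"
```

In a test, the header would come from something like `open(path, "rb").read(4)`, and the expected label would depend on the platform the binary was compiled on.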

This week, I will try to rewrite the "strings" and "file" commands in C and test them on both Linux and Windows. On Linux, I hope to run the whole process and scan different types of files like rpm, tar... On Windows, since the extractor is not implemented, I will just test the parsing functionality. After that, I will start implementing the extractor under Windows; the first goal is zip. Some test cases are needed, ideally curl. The problem I might encounter is how to extend Python with C. There are two ways but I'm not sure which one is better. I asked John for help, so I might wait to hear from him first.

Week 1

zw2498@columbia.edu (wzao1515) — Wed, 29 May 2019 20:54:43 +0000

It's a great pleasure to work on this project! I talked to Terri and we settled this week's plan.

Earlier this week, I set up the Windows environment for the binary tool: I installed MinGW for gcc and make, plus some Python modules. In the following days, I will work on coming up with Windows test cases. In addition, when I was testing the program, I found that it currently uses many subprocess calls like "rm", which are not compatible with Windows. Therefore, it would be ideal to modify those test files. My idea is to make an abstract test class first; when the user runs the tests, it checks the operating system and then picks the matching subclass to run, which is exactly like the factory pattern in design patterns.

Besides, we talked about future plans. We both decided to keep the plan from the proposal for now and adjust it as needed. The priority of tasks might change, since some tasks, like extracting an rpm file on Windows, are not important.

In a nutshell, I'm really excited! Definitely going to do a good job on this!