wzao1515's Blog

Week 6

Published: 07/10/2019

This week I implemented several new checkers for the tool. Since I had implemented some checkers before, this was not difficult. The best way to implement a checker is to look up the NVD database first; from there you can find the vendor name, library name, affected versions, and so on. The rest of the process is to look into the archive (the released packages) and come up with a regex that matches and guesses the version.
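As a rough sketch of that last regex step: the library name "examplelib" and its banner format below are made up for illustration and are not taken from any real checker in the tool.

```python
import re

# Hypothetical signature: assume the library embeds a banner such as
# "examplelib 1.2.3" somewhere in its released binaries.
VERSION_PATTERN = re.compile(rb"examplelib[ /-]v?(\d+(?:\.\d+)+)")


def guess_version(data: bytes):
    """Return the first version number found in data, or None if absent."""
    match = VERSION_PATTERN.search(data)
    return match.group(1).decode("ascii") if match else None


# Binary content usually mixes readable banners with arbitrary bytes.
sample = b"\x00\x01garbage\x00examplelib 1.2.3\x00more"
print(guess_version(sample))
```

A real checker would need one such pattern per library, based on what actually shows up in the released packages.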

While reading the source code of scan_file, I had a question: what is the difference between "is" and "contains"? I found that if the name of the file itself is "xxx package", the value is "is"; if the name only appears in some lines of the file, the value is "contains". But I'm still wondering why we need to distinguish between them.

Another issue is that the scan process is inefficient, which is why we need multi-threading.


Week 5

Published: 06/28/2019

This week I basically finished the Python implementation of the extractors; Python has some packages that make this easy. One thing to note is that we should try to avoid running commands from the shell / command line, for efficiency reasons. If we do have to (for example through subprocess), we should avoid the flag "shell=True". With this flag set, Python starts a new shell and hands it the whole command string to interpret, so our environment could suffer from injection whenever that string contains untrusted input.
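A small sketch of the safer form, assuming Python 3.7 or later. The helper name run_without_shell and the hostile file name are only for illustration.

```python
import subprocess
import sys


def run_without_shell(args):
    """Run a command given as a list of arguments; no shell is started."""
    result = subprocess.run(args, capture_output=True, text=True, check=True)
    return result.stdout


# A file name that would be dangerous if a shell got to interpret it.
hostile_name = "archive.tar; echo INJECTED"

# Passed as a list item, the name reaches the child process as one literal
# argument. The child here is just Python echoing back its first argument.
out = run_without_shell(
    [sys.executable, "-c", "import sys; print(sys.argv[1])", hostile_name]
)
print(out)
```

With shell=True and the same text pasted into a single command string, the shell would treat the part after the semicolon as a second command.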

As a result, I have two extra weeks that were not in the plan. I think I will help Terri look into some pull requests for bug fixes once I have refined my code.


Week 4

Published: 06/18/2019

After implementing strings, I worked on implementing extractors. I found that Python actually provides several modules for extracting files. For example, by instantiating a 'tarfile' object, we can extract files that end with .gz, .bz... Therefore, I modified the extractor file so that if the system doesn't have programs like 'tar' or 'unzip', it uses the Python implementation to extract the target. This works for tar and zip, exe... Besides, for rpm and deb, I found that there is a package called libarchive, so I guess maybe we could use it. In the remaining days of this week I will start implementing the other extractors, for formats that Python doesn't have a module for.
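The fallback idea can be sketched roughly like this. The function name extract_tar is hypothetical and this is not the tool's actual extractor code, only the shape of it.

```python
import shutil
import subprocess
import tarfile


def extract_tar(archive_path, dest_dir):
    """Extract a tar archive into dest_dir (which must already exist).

    Uses the system 'tar' program when it is on PATH, and falls back to
    Python's tarfile module otherwise.
    """
    if shutil.which("tar"):
        # Argument list, no shell involved.
        subprocess.run(["tar", "-xf", archive_path, "-C", dest_dir], check=True)
    else:
        # "r:*" lets tarfile detect the compression (gzip, bzip2, ...).
        # Note: extractall trusts the paths stored inside the archive, so
        # untrusted archives need extra checks before this call.
        with tarfile.open(archive_path, "r:*") as archive:
            archive.extractall(dest_dir)
```

The same pattern would apply to zip, with the 'unzip' program on one side and the zipfile module on the other.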

Right now I'm working on extracting cab files. I saw on GitHub that people have implemented such an extractor in Python, so I will understand their code first and see what I can do with it (use it, rewrite it in C, etc...)


Week 3 (P2)

Published: 06/12/2019

Continuing from last time, I decided to implement strings and file in C on Linux first, and I learnt a lot from this.

One thing is how to extend Python with C. The entry point of the C extension should be PyMODINIT_FUNC PyInit_modulename(void); Python looks for this function when the module is imported. Inside it, you return the object produced by PyModule_Create(), which takes a pointer to a module definition. That definition is a static struct PyModuleDef, and in it you need to provide the Python-facing wrappers through the function table (similar to the system call table in the Linux kernel). This is really a brilliant design, since it covers all the object-oriented ideas: encapsulation, inheritance and polymorphism. One thing to note is that the implementation differs significantly between Python 2 and Python 3, so I decided to implement the Python 3 version first, since Python 2 is almost deprecated.

Another thing is the actual implementation of strings and file. There were two difficulties that I addressed. The first one is reading the file, where there is a difference between Python and C. In C, you first need to get the file size and allocate memory for the buffer. This can be dangerous, since the memory might leak (maybe you forget to free it after use). The second one is how to iterate over the buffer after reading. This is trivial but important. At first I simply used strlen(buffer), but this is not correct: strlen stops at the first NUL byte, and a binary file can contain many of them, so the loop should use the file length obtained earlier. Another important thing I learnt is the difference between a char * and a char array. You can modify a char array initialized from a string, but not a string literal that a char * points to. The reason is that char [] allocates a copy of the string on the stack, while char * only creates a pointer on the stack and the string itself is stored in the (read-only) data segment.
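For comparison, here is a small Python sketch of what strings does and why the length has to come from the file size. This is not the C code I wrote; the sample bytes and the name "libfoo" are made up.

```python
import re


def printable_strings(data: bytes, min_len: int = 4):
    """Return runs of at least min_len printable ASCII bytes, like strings."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [run.decode("ascii") for run in pattern.findall(data)]


# Made-up binary content with NUL bytes between the readable parts.
data = b"ab\x00libfoo 1.0\x00\x01\x02version\x00"

# What a strlen-style scan would see: only the bytes before the first NUL.
strlen_view = data[: data.index(b"\x00")]

print(len(strlen_view), len(data))
print(printable_strings(data))
```

In Python the bytes object carries its own length, so this problem does not show up; in C the file size has to be kept alongside the buffer.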



Week 3

Published: 06/10/2019

Currently, I'm still working on implementing the C extension.

The problem I ran into is that the extension works quite differently on Windows and on Linux, while the instructions from the PSF are mainly about Linux. For example, in the new .c file, on Linux I just need to declare the function and implement it, but on Windows I need to define an initialization function whose name starts with PyInit_, which the Windows build uses to identify the Python extension.

Besides, since I'm using Visual Studio for development, there were some issues with paths. In the end I found the official Visual Studio instructions very helpful: https://docs.microsoft.com/en-us/visualstudio/python/working-with-c-cpp-python-in-visual-studio?view=vs-2019

I haven't talked to Terri and John yet, so I will leave this as it is for now and change it after the meeting.
