Niraj-Kamdar's Blog

GSoC: Week 3: Awaiting the Future

Niraj-Kamdar
Published: 06/15/2020

Hello everyone,

What did I do this week?

I have started optimizing the concurrency of the CVE Binary Tool. I am going to use asyncio for IO-bound tasks and a process pool for long CPU-bound tasks. I have converted the IO-bound synchronous functions of the extractor (PR#741), strings (PR#746) and file (PR#750) modules into asynchronous coroutines. I have also created an async_utils module, which provides the asynchronous utility functions and classes that the other modules need. Since asyncio's event loop doesn't support file IO directly, I searched for an external library that provides the functionality I need and found one: aiofiles. However, it lacks many features, such as asynchronous tempfile and shutil equivalents, and it has many issues and PRs that have been open for more than a year. So I decided to write one myself. After 2-3 days of research and coding, I created an asynchronous FileIO class with all the methods that a synchronous file object provides, and implemented tempfile's TemporaryFile, NamedTemporaryFile and SpooledTemporaryFile classes on top of it. I have also created an asynchronous run_command coroutine, which runs a command in a non-blocking manner, since we use subprocess in many places. Finally, I converted the synchronous unit tests to asynchronous ones using the pytest-asyncio plugin.
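To give a feel for what a non-blocking run_command can look like, here is a minimal sketch built on asyncio's subprocess support. It is not the code from the PRs above: the signature and the returned tuple are my own assumptions for illustration.

```python
import asyncio
import sys


async def run_command(args):
    """Run a command without blocking the event loop.

    Hypothetical signature: returns (stdout, stderr, returncode).
    """
    process = await asyncio.create_subprocess_exec(
        *args,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    # communicate() waits for the process to exit and collects its output
    stdout, stderr = await process.communicate()
    return stdout, stderr, process.returncode


async def main():
    # Use the current interpreter so the example does not depend on any
    # particular external tool being installed.
    cmd = [sys.executable, "-c", "print('hello')"]
    # Both commands are awaited concurrently instead of one after another.
    return await asyncio.gather(run_command(cmd), run_command(cmd))


results = asyncio.run(main())
```

While one subprocess is running, the event loop is free to make progress on other coroutines, which is the point of replacing blocking subprocess calls.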

What am I doing this week? 

I am going to refactor the scanner into two separate modules: 1) version_scanner and 2) cve_scanner. I am thinking about calling the latter cve_fetcher to avoid misunderstanding, but since I have used the name cve_scanner in my proposal and issues, let's keep it for now. I will merge the get_cves methods of cvedb and scanner into the cve_scanner module, which uses cvedb. This will make the code more maintainable and readable once I convert it to asynchronous code.

Have I got stuck anywhere?

I couldn't decide whether I should use aiofiles and implement the functions it lacks on top of it, or implement everything on my own. I was torn because I didn't want to reinvent the wheel, and the code base of aiofiles looked scary at first glance. Then I figured out that the code of aiofiles is unnecessarily complicated. So I borrowed some of its logic and wrote all the functionality it provides, plus the tempfile functionality I need, in a compact form.
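The core idea behind this kind of wrapper is to hand each blocking file call to a thread pool so the event loop stays responsive. Below is a minimal sketch of that technique. The AsyncFile class and its methods are hypothetical names for illustration, not the actual FileIO class from async_utils, and only a few methods are shown.

```python
import asyncio
import functools
import tempfile


class AsyncFile:
    """Hypothetical wrapper: runs blocking file methods in a thread pool."""

    def __init__(self, file):
        self._file = file

    async def _run(self, func, *args):
        loop = asyncio.get_running_loop()
        # None selects the event loop's default ThreadPoolExecutor
        return await loop.run_in_executor(None, functools.partial(func, *args))

    async def read(self, size=-1):
        return await self._run(self._file.read, size)

    async def write(self, data):
        return await self._run(self._file.write, data)

    async def seek(self, offset, whence=0):
        return await self._run(self._file.seek, offset, whence)

    async def close(self):
        return await self._run(self._file.close)


async def roundtrip(data):
    # Creating the temporary file is itself a blocking call; a fuller
    # implementation would delegate that to the executor as well.
    f = AsyncFile(tempfile.TemporaryFile())
    await f.write(data)
    await f.seek(0)
    content = await f.read()
    await f.close()
    return content


result = asyncio.run(roundtrip(b"Bash version 1.14.0"))
```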

I am also thinking about creating my own library as an alternative to aiofiles, which would also implement other file IO functionality such as shutil and os, and publishing it on PyPI.


GSoC 20: Week 2: del legacy.c

Niraj-Kamdar
Published: 06/08/2020

Hello everyone!

It's Niraj again. Today, I will be sharing my code contribution of this week.

What did I do this week?

This week I completed my work on removing the compiler dependency for testing and opened a PR. We have been using C files to create binary files that contain the same version strings as the product a checker targets, so that we can assert that our checker and scanner modules work correctly. We call this test mapping_test. Most of the strings generated by compiling a C file are just compiler output, which we ignore anyway. So why not use struct (as suggested by @pdxjohnny) or plain binary strings, which saves time and space? I experimented with struct and found that the binary file it produces is the same as the one we get by simply writing binary strings to a file.
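The equivalence is easy to check with a small experiment. The sketch below assumes the version strings are stored one after another, each terminated by a NUL byte; the exact layout used in the PR may differ.

```python
import struct

version_strings = [b"Bash version 1.14.0", b"libbfd-2.31.1-system.so"]

# Option 1: plain byte strings, each followed by a NUL terminator
plain = b"".join(s + b"\x00" for s in version_strings)

# Option 2: struct with one "<n>s" field per string, where n is the string
# length plus one, so struct pads each field with a single NUL byte
fmt = "".join(f"{len(s) + 1}s" for s in version_strings)
packed = struct.pack(fmt, *version_strings)

same = packed == plain
```

Since both approaches yield the same bytes under this layout, writing plain byte strings is the simpler choice.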

To make the basic test suite run quickly, we create "faked" binary files to test the CVE mappings. However, we also want to test real files to check that the signatures work on real-world data. We have a _file_test function that takes a URL, a package name and a version, downloads the file, and runs the scanner against it. We call this test the package test.

Initially, I proposed a file named mapping_test_data.py for the mapping_test of test_scanner, containing a list of dictionaries with version, checker name (module) and version_strings, and a file named package_test_data.py for the package_test of test_scanner, containing a list of tuples of URL, package name, module name and version. For example:

import itertools

mapping_test_data = [
    {
        "module": "bash",
        "version": "1.14.0",
        "version_strings": ["Bash version 1.14.0"],
    },
    {
        "module": "binutils",
        "version": "2.31.1",
        "version_strings": [
            "Using the --size-sort and --undefined-only options together",
            "libbfd-2.31.1-system.so",
            "Auxiliary filter for shared object symbol table",
        ],
    },
]
package_test_data = itertools.chain(
    [
        # Filetests for bash checker
        (
            "https://kojipkgs.fedoraproject.org/packages/bash/4.0/1.fc11/x86_64/",
            "bash-4.0-1.fc11.x86_64.rpm",
            "bash",
            "4.0.0",
        ),
        (
            "http://rpmfind.net/linux/mageia/distrib/4/x86_64/media/core/updates/",
            "bash-4.2-53.1.mga4.x86_64.rpm",
            "bash",
            "4.2.53",
        ),
    ],
    [
        # Filetests for binutils checker
        (
            "http://security.ubuntu.com/ubuntu/pool/main/b/binutils/",
            "binutils_2.26.1-1ubuntu1~16.04.8_amd64.deb",
            "binutils",
            "2.26.1",
        ),
        (
            "http://mirror.centos.org/centos/7/os/x86_64/Packages/",
            "binutils-2.27-43.base.el7.x86_64.rpm",
            "binutils",
            "2.27",
        ),
    ],
)

Although this format is better than creating C files and keeping test data in the test_scanner file, my mentors pointed out in this week's virtual conference that keeping test data for all checkers in one file would make it hard to navigate, since the number of checkers will grow over time. They asked me to create a separate test_data file for each checker with two attributes: 1) mapping_test_data, which contains test data for our mapping test, and 2) package_test_data, which contains test data for our package test. So I created a separate test_data file for each checker. For example, the test_data file for the bash checker looks like this:

mapping_test_data = [
    {"module": "bash", "version": "1.14.0", "version_strings": ["Bash version 1.14.0"]}
]
package_test_data = [
    {
        "url": "https://kojipkgs.fedoraproject.org/packages/bash/4.0/1.fc11/x86_64/",
        "package_name": "bash-4.0-1.fc11.x86_64.rpm",
        "module": "bash",
        "version": "4.0.0",
    },
    {
        "url": "http://rpmfind.net/linux/mageia/distrib/4/x86_64/media/core/updates/",
        "package_name": "bash-4.2-53.1.mga4.x86_64.rpm",
        "module": "bash",
        "version": "4.2.53",
    },
]

We also have to add an entry for the checker we are testing to the __all__ list in the __init__.py file of the test_data module, if it isn't there already, because I use this list to load the test_data files at runtime.
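Loading modules from an __all__ list at runtime can be done with importlib. The following is a minimal sketch of the technique, not the loader from the PR. To keep it self-contained it builds a throwaway package in a temporary directory, and the package name demo_test_data is made up for the example.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package that mimics the test_data layout (hypothetical
# name "demo_test_data", used only so this example can run on its own).
root = Path(tempfile.mkdtemp())
package = root / "demo_test_data"
package.mkdir()
(package / "__init__.py").write_text('__all__ = ["bash"]\n')
(package / "bash.py").write_text(
    "mapping_test_data = [\n"
    '    {"module": "bash", "version": "1.14.0",\n'
    '     "version_strings": ["Bash version 1.14.0"]}\n'
    "]\n"
    "package_test_data = []\n"
)
sys.path.insert(0, str(root))
importlib.invalidate_caches()

# Import the package, then import every submodule named in __all__ and
# collect its mapping_test_data entries.
test_data = importlib.import_module("demo_test_data")
mapping_test_data = []
for name in test_data.__all__:
    module = importlib.import_module(f"{test_data.__name__}.{name}")
    mapping_test_data.extend(module.mapping_test_data)
```

With this pattern, adding a checker's test data only requires a new file plus one new name in __all__.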

Once this PR is merged, a checker developer only needs to create two files: 1) a checker class file under the checkers directory and 2) a test_data file under the test_data directory. This spares them from navigating the whole test_scanner file (around 2500 lines) just to add test data for the checker they have written.

What am I doing this week?

I am going to make the extractor module asynchronous this week. I have started working on it and created some functions for it. By the end of the week, I want to have an asynchronous extractor module and an asynchronous test_extractor.

Have I got stuck anywhere?

As I mentioned in my previous blog post, the Unix file utility wasn't flagging the binaries I generated as executable binary files. After some research, I learned about the magic signature that the file utility uses to identify a binary file, and I added it to the binary files I was creating. Here is the magic hex signature, which appears at the beginning of ELF executables (its first four bytes are \x7f followed by "ELF"):

b"\x7f\x45\x4c\x46\x02\x01\x01\x03"

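As an illustration, a fake test binary can be produced by writing this signature followed by the version strings. This is a minimal sketch: the helper name write_fake_binary and the NUL-separated layout are assumptions, not the code from the PR.

```python
import tempfile
from pathlib import Path

# ELF magic number plus the next four identification bytes, as quoted above
ELF_SIGNATURE = b"\x7f\x45\x4c\x46\x02\x01\x01\x03"


def write_fake_binary(path, version_strings):
    """Hypothetical helper: write the signature, then NUL-terminated strings."""
    with open(path, "wb") as f:
        f.write(ELF_SIGNATURE)
        for s in version_strings:
            f.write(s.encode("ascii") + b"\x00")


path = Path(tempfile.mkdtemp()) / "fake-bash"
write_fake_binary(path, ["Bash version 1.14.0"])
data = path.read_bytes()
```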
 


GSoC: Week 1: __init__.py

Niraj-Kamdar
Published: 06/01/2020

Hello everyone!

I am Niraj Kamdar, a third-year undergraduate at DA-IICT, India. I will be working on the CVE Binary Tool under the umbrella of the Python Software Foundation.

What is the CVE Binary Tool?

The CVE Binary Tool scans for a number of common, vulnerable open source components like openssl, libpng, libxml2, expat etc. to let you know if a given directory or binary file includes common libraries with known vulnerabilities. 

How does it work?

We have checkers for popular open source libraries. Each checker contains methods that look at the strings found in a binary file to see whether they match certain unique strings found in an open source library, and then try to guess its version. The scanner module recursively scans every binary file in the given directory, parses the strings from each file and forwards them to every checker. The checkers determine the vendor, product and version and pass them back to the scanner, which then looks up all the vulnerabilities associated with that product in a local copy of the NVD database and displays them. We support several output formats, including JSON, CSV and a nice console format.
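The strings-then-match flow can be sketched in a few lines. This is a simplified illustration only: extract_strings, BashChecker and the attribute names are hypothetical and do not reflect the tool's real checker API.

```python
import re

# Runs of at least four printable ASCII characters, roughly what the
# Unix strings utility reports
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{4,}")


def extract_strings(data):
    """Return the printable ASCII strings found in raw binary data."""
    return [m.group().decode("ascii") for m in PRINTABLE_RUN.finditer(data)]


class BashChecker:
    """Hypothetical checker: names are illustrative, not the real API."""

    VENDOR_PRODUCT = ("gnu", "bash")
    VERSION_PATTERN = re.compile(r"Bash version ([0-9]+\.[0-9]+\.[0-9]+)")

    def get_version(self, lines):
        # Return the first version captured from any string, else None
        for line in lines:
            match = self.VERSION_PATTERN.search(line)
            if match:
                return match.group(1)
        return None


binary = b"\x7fELF\x02\x01\x01\x03\x00\x00Bash version 1.14.0\x00\x01\x02"
lines = extract_strings(binary)
version = BashChecker().get_version(lines)
```

In the real tool the scanner would then look up the (vendor, product, version) triple in its local copy of the NVD database.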

What did I do in Community Bonding Period?

I fixed several bugs (such as stale egg info and extractor bugs on Windows), wrote a faster native Python solution to replace the C strings extension module, and refactored the whole checkers module to use an object-oriented approach that reduces code repetition. Previously, we had to write several functions to create a checker; now all we need to do is write 5 class attributes. If you want to learn more about how to write a checker, check out our contributing checker guidelines.

I also had weekly video conference meetings with my mentors every Wednesday, where we discussed the design and implementation of the project. Since my project involves adding concurrency to the CVE Binary Tool, I studied the asyncio and concurrent.futures modules during this time. My mentor also helped me and recommended a few articles.

What am I doing this week?

I will be working on removing the compiler dependency of test_scanner, which is part of my GSoC project. I started 3-4 days early and have already finished the first task of this week, which was splitting the cli.py module into cli.py and scanner.py.

Have I got stuck anywhere?

While working on the issue of removing the compiler dependency of test_scanner, I learned that I also have to add some of the binary strings that a compiler normally adds. We use the file utility to check whether the file we are scanning is a binary, and it currently isn't flagging the files I generate as binary files because they lack the signatures normally found in one. I have mentioned this problem to my mentors and I expect they will reply soon. Meanwhile, I will look into it myself.
