<p><a href="https://blogs.python-gsoc.org">Articles on Niraj-Kamdar's Blog</a></p> <p>Updates on different articles published on Niraj-Kamdar's Blog. Mon, 24 Aug 2020 14:29:39 +0000</p> <p><a href="https://blogs.python-gsoc.org/en/niraj-kamdars-blog/gsoc-week-13-create-github-action/">GSoC: Week 13: Create GitHub Action</a></p> <h2><b>What did I do this week?</b></h2> <p>I worked on documentation this week. I added an example GitHub Actions workflow so that users can easily integrate CVE Binary Tool into their CI/CD pipeline. The workflow uses <code>actions/setup-python</code> to set up Python for CVE Binary Tool and <code>actions/cache</code> to cache the database and dependencies, which decreases CI runtime. The example uses the latest version of CVE Binary Tool because the current stable version lacks many features, such as config files and HTML reports. It uses <code>actions/upload-artifact</code> to upload the generated report as a GitHub artifact that can be downloaded later.</p> <p>I have also made a pull request to integrate caching into our own CI. It can help reduce CI runtime a little.</p> <h2><strong>What am I doing this week? </strong></h2> <p>I am going to start writing the final project report this week, and I will complete it before 31st August.</p> <h2><strong>Have I got stuck anywhere?</strong></h2> <p>No, I didn't get stuck this week.</p> <p>201701184@daiict.ac.in (Niraj-Kamdar), Mon, 24 Aug 2020 14:29:39 +0000</p> <p><a href="https://blogs.python-gsoc.org/en/niraj-kamdars-blog/gsoc-week-12-scanning-docker/">GSoC: Week 12: Scanning docker</a></p> <h2><b>What did I do this week?</b></h2> <p>I worked on documentation this week. I added a how-to guide for scanning a docker image, which was requested by one of our users. 
I listed two different ways to scan a docker image:</p> <ol> <li>Install <code>cve-bin-tool</code> inside a docker container, scan the directory just as you normally would, and export the report to the host.</li> <li>Export the directory you want to scan from the container to the host and scan it on the host.</li> </ol> <p>I also discussed the pros and cons of both methods. I found that when multiple files contain the same product, CVEScanner performs unnecessary database IO, which can become a performance bottleneck, so I short-circuited the flow when a product has already been scanned. I also fixed the filename generation bugs reported by Harmandeep Singh and reviewed the exclude path PR.</p> <h2><strong>What am I doing this week? </strong></h2> <p>I still have some documentation left to write. In these last two weeks I am also going to improve the tests for the module I have created, and go through the entire code base to add appropriate comments and docstrings for new contributors.</p> <h2><strong>Have I got stuck anywhere?</strong></h2> <p>No, I didn't get stuck this week.</p> <p>201701184@daiict.ac.in (Niraj-Kamdar), Mon, 17 Aug 2020 11:10:02 +0000</p> <p><a href="https://blogs.python-gsoc.org/en/niraj-kamdars-blog/gsoc-week-11-inputengine-add-paths/">GSoC: Week 11: InputEngine.add(paths)</a></p> <p>Hello guys, </p> <h2><b>What did I do this week?</b></h2> <p>After we added support for file paths in the output, I found a bug that broke cve_scanner whenever the <code>--input-file</code> flag was used to scan CVEs from a CSV or JSON file. I also found several other issues in the previous data structures, listed below: </p> <ol> <li>The old CVEData was a NamedTuple, and since the newly added path attribute was mutable, it could create hard-to-find bugs. 
</li> <li>To update a path, we needed to scan all_cve_data to find the product to which we wanted to append paths.<br> <code>Time Complexity: O(n**2)</code>, which can be reduced to <code>O(n)</code> with a better structure.</li> <li>Passing vendor, product and version separately to different functions decreased readability. A ProductInfo structure is a nice way to pack this data together, since we never need any of these fields alone.</li> <li>The TriageData structure wasn't in sync with the old CVEData, so csv2cve or input_engine would break.</li> </ol> <p>So, I decided to change the current structure to handle all these issues. Previously, <code>all_cve_data</code> was a <code>Set[CVEData]</code>, which was sufficient because all attributes of <code>CVEData</code> were immutable and we only used the set to remove duplicates from the output. But once we introduced the <code>paths</code> attribute, we needed to update <code>paths</code> every time we detected the same product again, and a set has no easy way to retrieve a stored value apart from looping over the whole set (sets aren't made for storing mutable types). So, I refactored the structure into two parts: 1) an immutable <code>ProductInfo(vendor, product, version)</code> and 2) a mutable <code>CVEData(list_of_cves, paths_of_cves)</code>. <code>all_cve_data</code> now stores a mapping from <code>ProductInfo</code> to <code>CVEData</code>, so we can access the CVEData of a product without traversing the whole of <code>all_cve_data</code>. I also moved all data structures into utils to avoid circular imports, and added a test for paths.</p> <h2><strong>What am I doing this week? </strong></h2> <p>I will continue to improve the documentation of the code I wrote by adding docstrings and comments. I am also going to add the requested how-to guides to improve the user experience. 
</p> <h2><strong>Have I got stuck anywhere?</strong></h2> <p>No, I didn't get stuck this week.</p> <p>201701184@daiict.ac.in (Niraj-Kamdar), Tue, 11 Aug 2020 11:59:08 +0000</p> <p><a href="https://blogs.python-gsoc.org/en/niraj-kamdars-blog/gsoc-week-10-documentation/">GSoC: Week 10: ''' Documentation '''</a></p> <p>Hello guys, </p> <p>I hope you are all doing great. Today, I am going to talk about what I did this week.</p> <h2><b>What did I do this week?</b></h2> <p>I am working on the documentation of the code I produced during the first two phases. I have updated the user manual and the README, and I am going to update the other documentation as well. I have written user manual sections for the new input engine features and the config file feature.</p> <h2><strong>What am I doing this week? </strong></h2> <p>I talked with a user, and we came to the conclusion that our documentation lacks some important how-to guides, which are necessary, as Daniele Procida explains in his amazing <a href="https://www.youtube.com/watch?v=azf6yzuJt54">PyCon talk</a>. So, I am going to create a how-to directory inside our doc folder, which will contain recipes for different use cases. For example:</p> <ol> <li>How to change the theme of the HTML report?</li> <li>How to add a custom checker (out-of-tree checker)?</li> <li>How to scan a docker image?</li> <li>How to run scans in parallel?</li> </ol> <h2><strong>Have I got stuck anywhere?</strong></h2> <p>No, I didn't get stuck anywhere this week.</p> <p>201701184@daiict.ac.in (Niraj-Kamdar), Mon, 03 Aug 2020 03:15:29 +0000</p> <p><a href="https://blogs.python-gsoc.org/en/niraj-kamdars-blog/gsoc-week-9-configparser/">GSoC: Week 9: ConfigParser()</a></p> <h2><b>What did I do this week?</b></h2> <p>I researched various configuration file formats and compiled the outcome in an issue: <a href="https://github.com/intel/cve-bin-tool/issues/832"> Discussion: Configuration file format</a>. 
Some users recommended INI files because the format is very old and still popular, but INI has no built-in type support and lacks a formal specification. Everything is parsed as a string, so we would have to post-process the data parsed by configparser to convert it into something usable.<br> Our example data would be parsed as the following dictionary:</p> <pre><code class="language-json">{
    "checker": {
        "runs": "[curl,binutils]",  # This has to be transformed into a list
        "skips": "[python,bzip2]"
    },
    "input": {
        "directory": "test/assets",
        "input_file": "test/csv/triage.csv"
    },
}</code></pre> <p>So, parsing an INI file won't be as easy as parsing TOML or YAML, which support complex data types by default. Other data types such as integers and floats are not easy to parse from INI either.</p> <p>TOML is very similar to INI, and it supports complex data types by default.</p> <pre><code class="language-json">{
    'checker': {
        'runs': ['curl', 'binutils'],  # this is correctly parsed as a list
        'skips': ['python', 'bzip2']
    },
    'input': {
        'directory': 'test/assets',
        'input_file': 'test/csv/triage.csv'
    },
}
</code></pre> <p>I concluded that TOML and YAML are both very easy to read and write for both machines and humans, so we should use one of them. We discussed which format to use in the meeting, and my mentors had various opinions on it. The summary of our discussion was: "The top contenders among our team seem to be TOML (readable, familar to python folk and close enough to INI for skill transfer for windows folk) and YAML (which might be a better fit for the dev-ops community that we hope will be among the biggest users of cve-bin-tool)."</p> <p>Since the parsers for both formats produce similar Python structures, I created a ConfigParser class that can parse both the YAML and TOML file formats. I have also added basic tests for it. 
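</p> <p>The INI limitation discussed above can be reproduced with the standard library's <code>configparser</code>. The following is only a minimal sketch: the <code>parse_ini</code> and <code>to_list</code> helpers are hypothetical names made up for this illustration and are not part of cve-bin-tool.</p>
<pre><code class="language-python">import configparser

# Same example data as above, written as an INI file.
INI_TEXT = """
[checker]
runs = [curl,binutils]
skips = [python,bzip2]

[input]
directory = test/assets
input_file = test/csv/triage.csv
"""


def parse_ini(text):
    """Parse INI text into a plain dict of sections (hypothetical helper)."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {section: dict(parser[section]) for section in parser.sections()}


def to_list(raw):
    """Turn the string "[a,b]" into the list ["a", "b"] (hypothetical helper)."""
    return [item.strip() for item in raw.strip("[]").split(",") if item.strip()]


config = parse_ini(INI_TEXT)
# configparser returns the raw string, so the list must be rebuilt by hand.
runs = to_list(config["checker"]["runs"])
</code></pre>
<p>Every value comes back as a string here, which is why the extra conversion step is needed; a TOML or YAML parser would return the list directly.</p> <p>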
I also changed the architecture of the main function of cli.py to add support for config files, and I made sure that options given on the command line take precedence over options from the config file. I am also going to add tests for this. I have also fixed the quiet mode bugs.</p> <h2><strong>What am I doing this week? </strong></h2> <p>I am going to write tests for config files in test_cli.py, and since I have completed almost all the work related to InputEngine, I think it's a good time to document it. </p> <h2><strong>Have I got stuck anywhere?</strong></h2> <p>Yes, I need my <a href="https://github.com/intel/cve-bin-tool/pull/830">Quiet mode bug fix</a> PR merged, since I changed TestCLI in it and I need the latest TestCLI for testing ConfigParser.</p> <p>201701184@daiict.ac.in (Niraj-Kamdar), Sun, 26 Jul 2020 11:22:14 +0000</p> <p><a href="https://blogs.python-gsoc.org/en/niraj-kamdars-blog/gsoc-week-8-inputengine-extend-functionalities/">GSoC: Week 8: InputEngine.extend(functionalities)</a></p> <h2><b>What did I do this week?</b></h2> <p>I didn't know how other triage data, such as custom severity, would be used, so I asked my mentor about it, and she gave me various use-case scenarios where it can be useful. After understanding the requirements, I added support for three new fields to our input_engine: 1) comments, 2) cve_number and 3) severity. Users can now specify this triage data, and it will be reflected in all the machine-readable output formats. I also added support for the wheel and egg archive formats, and modernized error handling in outputengine and extractor. I also fixed a bug that caused the progress bar to be displayed in quiet mode. </p> <h2><strong>What am I doing this week? </strong></h2> <p>I am going to work on the configuration file this week. I am most likely going to choose TOML as our config file format, as recommended by PEP. 
</p> <h2><strong>Have I got stuck anywhere?</strong></h2> <p>No, I didn't get stuck anywhere this week.</p> <p>201701184@daiict.ac.in (Niraj-Kamdar), Sun, 19 Jul 2020 10:38:21 +0000</p> <p><a href="https://blogs.python-gsoc.org/en/niraj-kamdars-blog/gsoc-week-7-with-errorhandler/">GSoC: Week 7: with ErrorHandler()</a></p> <h2><b>What did I do this week?</b></h2> <p>This week my mentor pointed out several issues in my <code>InputEngine</code> PR, and I fixed them. I fixed <a href="https://github.com/intel/cve-bin-tool/issues/789">Issue: Use patterns in VERSION_PATTERNS as valid CONTAINS_PATTERNS by default</a>; for that, I changed the checker metaclass to include <code>VERSION_PATTERNS</code> as valid <code>CONTAINS_PATTERNS</code> by default. I also updated the mapping test data of all checkers and removed the redundant <code>CONTAINS_PATTERNS</code>. I also fixed the <a href="https://github.com/intel/cve-bin-tool/issues/792">Escape sequence issue</a>. I also created an <code>error_handler</code> module, which provides an <code>ErrorHandler</code> context manager. It displays a colorful traceback and sets a custom exit code. 
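</p> <p>To make the idea concrete, here is a minimal sketch of such a context manager. It is not the actual <code>error_handler</code> module: the <code>InsufficientArgs</code> exception, the <code>EXIT_CODES</code> table and the <code>full_trace</code> parameter are assumptions made for illustration, and the output is not colorized.</p>
<pre><code class="language-python">import sys
import traceback


class InsufficientArgs(Exception):
    """Hypothetical custom exception used only for this sketch."""


# Hypothetical mapping from exception type to process exit code.
EXIT_CODES = {InsufficientArgs: 2}


class ErrorHandler:
    """Print a traceback and exit with a code chosen by exception type."""

    def __init__(self, full_trace=False):
        self.full_trace = full_trace

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        if exc_type is None or issubclass(exc_type, SystemExit):
            return False  # nothing to handle, or already exiting
        # A negative limit keeps only the last abs(limit) stack entries.
        limit = None if self.full_trace else -1
        traceback.print_exception(
            exc_type, exc_val, exc_tb, limit=limit, file=sys.stderr
        )
        # Unknown exception types fall back to exit code 1.
        sys.exit(EXIT_CODES.get(exc_type, 1))


# Usage: the raised error becomes a process exit with the mapped code.
try:
    with ErrorHandler():
        raise InsufficientArgs("no input given")
except SystemExit as exc:
    exit_code = exc.code
</code></pre>
<p>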
Currently, it supports four different modes for error handling:</p> <ol> <li>TruncTrace - displays a truncated traceback (default) <ul> <li><img alt="truncated traceback output" src="https://i.imgur.com/BQZInWX.png"></li> </ul> </li> <li>FullTrace - displays the full traceback (used when the logging level is debug, which can be set via the <code>-l debug</code> option) <ul> <li><img alt="Full traceback output" src="https://i.imgur.com/tUdFpRt.png"></li> </ul> </li> <li>NoTrace - displays no traceback (used when the logging level is critical, which can be set via the <code>-q(--quiet) flag</code>) <ul> <li><img alt="no traceback output" src="https://i.imgur.com/XGVtOXP.png"></li> </ul> </li> <li>Ignore - ignores any raised exception (only used internally)</li> </ol> <p>I moved all custom exceptions into the <code>error_handler</code> module so that it is easy to assign error codes. I also changed <code>excepthook</code> to display a colorized traceback, and updated the unit tests for <code>cli</code> and <code>input_engine</code> to incorporate the changes in exception handling. If an error is raised outside the context manager, the full traceback is shown regardless of the mode that was set. So, always use the ErrorHandler context manager to raise an exception, or wrap it around code that can raise one.</p> <h2><strong>What am I doing this week? 
</strong></h2> <p>I am going to improve the InputEngine and Extractor modules this week.</p> <h2><strong>Have I got stuck anywhere?</strong></h2> <p>I wanted to improve InputEngine this week, but the ideas discussed in the issue about its other functionalities weren't clear. I wanted to discuss future plans for InputEngine in this week's meeting, but unfortunately the mentors were busy and the meeting got canceled. However, terriko had opened issues regarding exceptions, and I got the idea to colorize tracebacks and extend custom error codes to every module, so I did that instead. As you can see, it looks awesome now.</p> <p>201701184@daiict.ac.in (Niraj-Kamdar), Sun, 12 Jul 2020 13:24:45 +0000</p> <p><a href="https://blogs.python-gsoc.org/en/niraj-kamdars-blog/gsoc-week-6-class-inputengine/">GSoC: Week 6: class InputEngine</a></p> <h2><b>What did I do this week?</b></h2> <p><span style="font-family: Verdana,Geneva,sans-serif;"><span style="font-size: 14px;">I started working on the input engine this week. Currently, we only have <em>csv2cve</em>, which accepts a CSV file of vendor, product and version as input and produces a list of CVEs as output. <em>csv2cve</em> is a separate module with a separate command line entry point. I created a module called <em>input_engine</em> that can process data from any input format (currently CSV and JSON). Users can now add a remarks field in the CSV or JSON, which can take any of the following values. (Here, values in parentheses are aliases for that specific type. 
)</span></span></p> <ol> <li><span style="font-size: 14px;"><span style="font-family: Verdana,Geneva,sans-serif;">NewFound (1, n, N)</span></span></li> <li><span style="font-size: 14px;"><span style="font-family: Verdana,Geneva,sans-serif;">Unexplored (2, u, U)</span></span></li> <li><span style="font-size: 14px;"><span style="font-family: Verdana,Geneva,sans-serif;">Mitigated (3, m, M)</span></span></li> <li><span style="font-size: 14px;"><span style="font-family: Verdana,Geneva,sans-serif;">Confirmed (4, c, C)</span></span></li> <li><span style="font-size: 14px;"><span style="font-family: Verdana,Geneva,sans-serif;">Ignored (5, i, I)</span></span></li> </ol> <p><span style="font-family: Verdana,Geneva,sans-serif;"><span style="font-size: 14px;">I added the --input-file (-i) option to <em>cli.py</em> to specify an input file, which <em>input_engine</em> parses to create an intermediate data structure that output_engine uses to display data according to remarks. Output is displayed in the order of priority given to the remarks. I also created a dummy <em>csv2cve</em> which just calls <em>cli.py</em> with the -i option set to the argument given to <em>csv2cve</em>. Here is an example of using -i with an input file to produce CVEs: </span></span> <code>cve-bin-tool -i=test.csv </code> <span style="font-size: 14px;"><span style="font-family: Verdana,Geneva,sans-serif;">Users can also use -i to supplement remarks data while scanning a directory, so that the output is sorted according to remarks. Here is an example of that: </span></span><code>cve-bin-tool -i=test.csv /path/to/scan</code></p> <p><span style="font-family: Verdana,Geneva,sans-serif;"><span style="font-size: 14px;">I have also added test cases for <em>input_engine</em> and removed the old test cases of <em>csv2cve</em>.</span></span></p> <h2><strong>What am I doing this week? 
</strong></h2> <p><span style="font-family: Verdana,Geneva,sans-serif;"><span style="font-size: 14px;">I have exams this week, from today to 9th July, so I won't be able to do much. I will spend my weekend improving input_engine, for example by giving more fine-grained control over remarks and custom severity.</span></span></p> <h2><strong>Have I got stuck anywhere?</strong></h2> <p><span style="font-size: 14px;"><span style="font-family: Verdana,Geneva,sans-serif;">No, I didn't get stuck anywhere this week :)</span></span></p> <p>201701184@daiict.ac.in (Niraj-Kamdar), Mon, 06 Jul 2020 07:12:24 +0000</p> <p><a href="https://blogs.python-gsoc.org/en/niraj-kamdars-blog/gsoc-week-5-improve-cvedb/">GSoC: Week 5: improve CVEDB</a></p> <h2><b>What did I do this week?</b></h2> <p>I finished my work on improving cvedb this week. I am now using <em>aiohttp</em> to download the NVD dataset instead of making requests from a multiprocessing pool. This has improved our download speed, since all tasks now download concurrently in the same thread instead of 4 tasks at a time with a process pool. I also measured the performance of <em>aiosqlite</em>, but it was significantly slower when writing to the database, so I decided to keep the writing process synchronous. I also added a beautiful progress bar with the help of the <em>rich</em> module, so users now get feedback about the progress of downloading and updating the database. Here is a demo of how it looks now. </p> <p><img alt="" src="/media/uploads/6de513b8-20e8-4cbf-b07a-542c4348301b.gif"></p> <p>It used to take 2 minutes to download and update the database with multiprocessing. Now it takes only 1 minute. So, we have doubled the speed just by converting IO-bound tasks into asynchronous coroutines. I also fixed an event loop bug that we sometimes hit due to parallel execution of pytest. 
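</p> <p>The benefit of running IO-bound downloads concurrently in a single thread, as described above, can be illustrated with the standard library alone. This is a rough sketch, not the cvedb code: <code>fake_download</code> is a hypothetical stand-in that sleeps instead of making an <em>aiohttp</em> request, and the file names are only illustrative.</p>
<pre><code class="language-python">import asyncio
import time


async def fake_download(name, delay=0.2):
    """Stand-in for one HTTP download; the sleep models waiting on the network."""
    await asyncio.sleep(delay)
    return name


async def download_all(names):
    # All coroutines wait on "the network" at the same time, in one thread.
    return await asyncio.gather(*(fake_download(name) for name in names))


# Illustrative file names, one per yearly feed.
names = [f"nvdcve-1.1-{year}.json.gz" for year in range(2002, 2012)]

start = time.perf_counter()
results = asyncio.run(download_all(names))
elapsed = time.perf_counter() - start
</code></pre>
<p>Ten simulated downloads of 0.2 seconds each would take about 2 seconds one after another; run concurrently, the total stays close to the duration of a single download.</p> <p>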
I also fixed a small bug in the extractor in <a href="https://github.com/intel/cve-bin-tool/pull/767">PR #767</a>, and created some utility functions to reduce code repetition.</p> <h2><strong>What am I doing this week? </strong></h2> <p>I have started working on the input engine, and I am hoping to provide basic triage support by the end of this week.</p> <h2><strong>Have I got stuck anywhere?</strong></h2> <p>No, I didn't get stuck anywhere this week :)</p> <p>201701184@daiict.ac.in (Niraj-Kamdar), Mon, 29 Jun 2020 18:26:38 +0000</p> <p><a href="https://blogs.python-gsoc.org/en/niraj-kamdars-blog/gsoc-week-4-status-300-multiple-choice/">GSoC: Week 4: Status - 300 Multiple Choice</a></p> <p>Hello everyone,</p> <h2><b>What did I do this week?</b></h2> <p>I fixed several bugs in my <a href="https://github.com/intel/cve-bin-tool/pull/751">PR: Asynchronous File module</a>. I also started working on making <em>cvedb</em> run asynchronously. Currently, we have a cache_update function which downloads the JSON dataset from the <a href="https://nvd.nist.gov/vuln/data-feeds">NVD site</a> and stores it in the user's local cache directory. The <em>cvedb</em> module contains a CVEDB class, which has a method named nist_scrape for scraping the NVD site to find the links that cache_update uses. It also has the following methods:</p> <ul> <li>init_database - creates tables if the database is empty</li> <li>get_cvelist_if_stale - updates if the local db is more than one day old. This avoids the full slow update with every execution.</li> <li>populate_db - populates the database from the JSON.</li> <li>refresh - refreshes the CVE database and updates it if it is stale</li> <li>clear_cached_data - removes all data from the cache directory.</li> </ul> <p>It also has other methods: some aren't related to updating the database, and some are just helper methods. 
We currently use multiprocessing to download data, which isn't necessary: downloading is an IO-bound task, and asyncio is a good fit for IO-bound tasks. I am not yet sure how I am going to implement it, since there are multiple ways to achieve the same result, and I am experimenting with and benchmarking each of them. I think storing the JSON dataset is unnecessary, since we have already populated the sqlite database from it. After populating the database, we only use the cached dataset to check whether it is stale: we compute the SHA sum of the cached dataset and compare it to the latest SHA sum listed in the metadata on the NVD site. We could save a significant amount of space by storing just the SHA sum of each dataset and comparing that instead. I am also thinking about splitting the CVEDB class into three classes: 1) NVDDownloader - handles downloading and pre-processing of the data, 2) CVEDB - creates the database if there isn't one and populates it with the data it gets from NVDDownloader, and 3) CVEScanner - scans the database and finds CVEs for a given vendor, product and version. </p> <p>In the new architecture I have in mind, we don't store the dataset on disk. So how does the populate_db function get the data it needs to populate the sqlite database? We can use a very popular technique we learnt in OS classes: the producer-consumer pattern. In our case, NVDDownloader acts as the producer, CVEDB acts as the consumer, and a Queue acts as the pipeline connecting them. This architecture has several benefits: 1) we only need to wait when the queue is either full or empty (a queue without a size limit isn't practical because our producer is too fast), and 2) we get a performance improvement since we use RAM as intermediate storage instead of disk.</p> <p>First I wrote the code for NVDDownloader and benchmarked it: with asyncio queues and a dummy consumer, it took 33 seconds to complete the whole task. 
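</p> <p>The producer-consumer arrangement described above can be sketched with a bounded <code>asyncio.Queue</code>. This is an illustration under stated assumptions rather than the real NVDDownloader or CVEDB code: the <code>producer</code>, <code>consumer</code> and <code>pipeline</code> names, the <code>None</code> sentinel and the in-memory <code>store</code> list (standing in for database inserts) are all hypothetical.</p>
<pre><code class="language-python">import asyncio


async def producer(queue, items):
    """Plays the role of the downloader: pushes records into the queue."""
    for item in items:
        await queue.put(item)  # waits only while the queue is full
    await queue.put(None)  # sentinel: no more data


async def consumer(queue, store):
    """Plays the role of the database writer: drains the queue."""
    while True:
        item = await queue.get()  # waits only while the queue is empty
        if item is None:
            break
        store.append(item)  # stand-in for a database insert


async def pipeline(items, maxsize=2):
    # A bounded queue keeps a fast producer from filling up memory.
    queue = asyncio.Queue(maxsize=maxsize)
    store = []
    await asyncio.gather(producer(queue, items), consumer(queue, store))
    return store


stored = asyncio.run(pipeline(["record-1", "record-2", "record-3", "record-4"]))
</code></pre>
<p>With <code>maxsize=2</code>, the producer is paused whenever it gets two records ahead of the consumer, so the data only ever lives in RAM for a short time.</p> <p>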
So, I tried to improve its performance by using a ProcessPoolExecutor with 4 workers to pre-process the data, and it then completed in 22 seconds. But everything I did to optimize the producer was in vain, because our code is only as fast as its slowest part, and in my case that is the consumer. Database transactions are very slow, so no matter how fast my producer is, I need to improve the performance of writing to the database. sqlite can handle 50000 insertions/second, but only a few transactions/second. We are currently committing after every execute statement. Instead, we can group thousands of insert statements into one transaction and commit it once. We can also improve the performance of database writes by sacrificing durability. I guess that won't be a problem for us, since we only need the database for CVE lookup, and we can do an integrity check when the application starts and refresh the database if it's corrupted (which will be rare). I have to do several experiments before I settle on the best solution.</p> <h2><strong>What am I doing this week? </strong></h2> <p>I am going to discuss with my mentors how I should implement the above. Should I keep the whole dataset file or just keep the metadata? I will run several experiments, benchmark them, and choose the best solution for the problem. I am also thinking about improving UX by displaying a progress bar.</p> <h2><strong>Have I got stuck anywhere?</strong></h2> <p>Yes, I need to confirm with my mentors whether we want to keep the cached JSON files or not. 
I have informed them about this on gitter, and I am also going to discuss it in this week's meeting.</p> <p>201701184@daiict.ac.in (Niraj-Kamdar), Mon, 22 Jun 2020 15:35:20 +0000</p> <p><a href="https://blogs.python-gsoc.org/en/niraj-kamdars-blog/gsoc-week-3-awaiting-the-future/">GSoC: Week 3: Awaiting the Future</a></p> <p><span style="font-family: Verdana,Geneva,sans-serif;">Hello everyone,</span></p> <h2><span style="font-family: Verdana,Geneva,sans-serif;"><b>What did I do this week?</b></span></h2> <p><span style="font-family: Verdana,Geneva,sans-serif;">I have started working on optimizing the concurrency of CVE Binary Tool. I am going to use asyncio for IO-bound tasks and a process pool for long CPU-bound tasks. I have converted the IO-bound synchronous functions of the extractor (<a href="https://github.com/intel/cve-bin-tool/pull/741">PR#741</a>), strings (<a href="https://github.com/intel/cve-bin-tool/pull/746">PR#746</a>) and file (<a href="https://github.com/intel/cve-bin-tool/pull/750">PR#750</a>) modules into asynchronous coroutines. I have also created an async_utils module, which provides the necessary asynchronous utility functions and classes for every module. Since asyncio's event loop doesn't support file IO directly, I searched for an external library that might provide the functionality I need, and I found one: <a href="https://github.com/Tinche/aiofiles">aiofiles</a>. But it lacked many features, such as asynchronous tempfile and shutil, and it also had many issues and PRs that had been open for more than a year. So, I decided to write one myself. After 2-3 days of research and coding, I finally created an asynchronous FileIO class with all the methods that a synchronous file object provides, and implemented tempfile's TemporaryFile, NamedTemporaryFile and SpooledTemporaryFile classes on top of it. 
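</span></p> <p>One common way to build this kind of wrapper is to hand each blocking file call to a thread pool with <code>run_in_executor</code>. The sketch below assumes that approach; the <code>AsyncFile</code> class and the <code>roundtrip</code> helper are hypothetical names for this illustration and are much smaller than the FileIO class described above.</p>
<pre><code class="language-python">import asyncio
import os
import tempfile
from functools import partial


class AsyncFile:
    """Hypothetical async wrapper that runs blocking file calls in a thread."""

    def __init__(self, path, mode="r"):
        self._path = path
        self._mode = mode
        self._file = None

    async def _run(self, func, *args):
        # Run the blocking call in the default thread pool executor.
        loop = asyncio.get_running_loop()
        return await loop.run_in_executor(None, partial(func, *args))

    async def __aenter__(self):
        self._file = await self._run(open, self._path, self._mode)
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        await self._run(self._file.close)

    async def read(self):
        return await self._run(self._file.read)

    async def write(self, data):
        return await self._run(self._file.write, data)


async def roundtrip(text):
    """Write text to a temporary file and read it back without blocking the loop."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.txt")
        async with AsyncFile(path, "w") as f:
            await f.write(text)
        async with AsyncFile(path) as f:
            return await f.read()


result = asyncio.run(roundtrip("hello"))
</code></pre>
<p>The event loop stays free to run other coroutines while a worker thread performs the actual read or write.</p> <p><span style="font-family: Verdana,Geneva,sans-serif;">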
I also created an asynchronous run_command coroutine, which runs a command in a non-blocking manner, since we use subprocess in many places. I also converted the synchronous unit tests to asynchronous ones using the pytest-asyncio plugin. </span></p> <h2><strong><span style="font-family: Verdana,Geneva,sans-serif;">What am I doing this week? </span></strong></h2> <p><span style="font-family: Verdana,Geneva,sans-serif;">I am going to refactor the scanner into two separate modules: 1) version_scanner and 2) cve_scanner. I am thinking about calling the latter cve_fetcher to avoid misunderstanding, but since I have mentioned cve_scanner in my proposal and issues, let's keep that name for now. I will be merging the get_cves methods of cvedb and scanner into one module called cve_scanner, which uses cvedb. This will make the code more maintainable and readable once I convert it to asynchronous code.</span></p> <h2><strong><span style="font-family: Verdana,Geneva,sans-serif;">Have I got stuck anywhere?</span></strong></h2> <p><span style="font-family: Verdana,Geneva,sans-serif;">I couldn't decide whether I should use aiofiles and implement the functions it lacks on top of it, or implement my own. I was confused because I didn't want to reinvent the wheel, and the code base of aiofiles was scary at first glance. But then I figured out that the code of aiofiles is unnecessarily complicated. So, I borrowed some of their logic and wrote all the functionality it provides, plus the tempfile functionality I need, in a compact form. </span></p> <blockquote> <p><em><span style="font-family: Verdana,Geneva,sans-serif;">I am also thinking about making my own library as an alternative to aiofiles, which would also implement other file IO functionality such as shutil and os, and publishing it on PyPI. 
</span></em></p> </blockquote> <p>201701184@daiict.ac.in (Niraj-Kamdar), Mon, 15 Jun 2020 13:23:05 +0000</p> <p><a href="https://blogs.python-gsoc.org/en/niraj-kamdars-blog/gsoc-20-week-2-del-legacy-c/">GSoC 20: Week 2: del legacy.c</a></p> <p><span style="font-size: 14px;"><span style="font-family: Verdana,Geneva,sans-serif;">Hello everyone!</span></span></p> <p><span style="font-size: 14px;"><span style="font-family: Verdana,Geneva,sans-serif;">It's Niraj again. Today, I will be sharing my code contributions of this week.</span></span></p> <h2 style="text-align: justify;"><span style="font-family: Verdana,Geneva,sans-serif;">What did I do this week?</span></h2> <p><span style="font-size: 14px;"><span style="font-family: Verdana,Geneva,sans-serif;">I completed my work on <a href="https://github.com/intel/cve-bin-tool/issues/638">removing compiler dependency</a> for testing this week and opened a <a href="https://github.com/intel/cve-bin-tool/pull/716">PR</a>. We have been using C files to create binary files that contain the same version strings as can be found in the product for which we have made a checker, so that we can assert that our checker and scanner modules are working correctly; we call this test mapping_test. Most of the strings generated by compiling a C file are just compiler dump, which we ignore anyway. So, why not use struct (as mentioned by <a href="https://github.com/pdxjohnny">@pdxjohnny</a>) or plain binary strings, which would save time and space? I experimented with struct and found that the binary file it produces is the same as the one we get by just writing binary strings to a file. </span></span></p> <p><span style="font-size: 14px;"><span style="font-family: Verdana,Geneva,sans-serif;">To make the basic test suite run quickly, we create "faked" binary files to test the CVE mappings. 
However, we also want to test real files, to check that the signatures work on real-world data.</span></span> <span style="font-family: Verdana,Geneva,sans-serif;"><span style="font-size: 14px;">We have a _file_test function that takes a url, a package name and a version, downloads the file and runs the scanner against it; we call this test package test.</span></span></p> <p><span style="font-size: 14px;"><span style="font-family: Verdana,Geneva,sans-serif;">Initially, I proposed a file named mapping_test_data.py for the mapping_test of test_scanner, containing a list of dictionaries of version, checker_name (module_name) and version_strings, and a package_test_data.py file for the package_test of test_scanner, containing a list of tuples of url, package_name, module_name and version. For example:</span></span></p> <pre><code class="language-python">mapping_test_data = [
    {
        "module": "bash",
        "version": "1.14.0",
        "version_strings": ["Bash version 1.14.0"],
    },
    {
        "module": "binutils",
        "version": "2.31.1",
        "version_strings": [
            "Using the --size-sort and --undefined-only options together",
            "libbfd-2.31.1-system.so",
            "Auxiliary filter for shared object symbol table",
        ],
    },
]</code></pre> <pre><code class="language-python">import itertools

# Each tuple is (url, package_name, module_name, version).
package_test_data = itertools.chain(
    [
        # Filetests for bash checker
        (
            "https://kojipkgs.fedoraproject.org/packages/bash/4.0/1.fc11/x86_64/",
            "bash-4.0-1.fc11.x86_64.rpm",
            "bash",
            "4.0.0",
        ),
        (
            "http://rpmfind.net/linux/mageia/distrib/4/x86_64/media/core/updates/",
            "bash-4.2-53.1.mga4.x86_64.rpm",
            "bash",
            "4.2.53",
        ),
    ],
    [
        # Filetests for binutils checker
        (
            "http://security.ubuntu.com/ubuntu/pool/main/b/binutils/",
            "binutils_2.26.1-1ubuntu1~16.04.8_amd64.deb",
            "binutils",
            "2.26.1",
        ),
        (
            "http://mirror.centos.org/centos/7/os/x86_64/Packages/",
            "binutils-2.27-43.base.el7.x86_64.rpm",
            "binutils",
            "2.27",
        ),
    ],
)</code></pre> <p><span style="font-size: 14px;"><span style="font-family: Verdana,Geneva,sans-serif;">Although this format is better than creating 
C files and adding test data directly to the test_scanner file, my mentors pointed out in this week's virtual conference that keeping the test data for all checkers in one file will make it hard to navigate as the number of checkers grows. So they asked me to create a separate test_data file for each checker, containing two attributes: 1) mapping_test_data, which contains test data for our mapping test, and 2) package_test_data, which contains test data for our package test. So I created a separate test_data file for each checker. For example, the test_data file for the bash checker looks like this:</span></span></p> <pre><code class="language-python">mapping_test_data = [
    {"module": "bash", "version": "1.14.0", "version_strings": ["Bash version 1.14.0"]}
]
package_test_data = [
    {
        "url": "https://kojipkgs.fedoraproject.org/packages/bash/4.0/1.fc11/x86_64/",
        "package_name": "bash-4.0-1.fc11.x86_64.rpm",
        "module": "bash",
        "version": "4.0.0",
    },
    {
        "url": "http://rpmfind.net/linux/mageia/distrib/4/x86_64/media/core/updates/",
        "package_name": "bash-4.2-53.1.mga4.x86_64.rpm",
        "module": "bash",
        "version": "4.2.53",
    },
]</code></pre> <p><span style="font-size: 14px;"><span style="font-family: Verdana,Geneva,sans-serif;">We also have to add a new entry to the __all__ list in the __init__.py file of the test_data module for the checker we are writing tests for, if it doesn't already exist, because I use this list to load the test_data files at runtime. </span></span></p> <p><span style="font-size: 14px;"><span style="font-family: Verdana,Geneva,sans-serif;">After this PR gets merged, a checker developer will only need to create two files: 1) a checker class file under the checkers directory and 2) a test_data file under the test_data directory. 
This will spare them the time of navigating the whole test_scanner file (around 2500 lines) just to add test data for the checker they have written.</span></span></p> <h2 style="text-align: justify;"><span style="font-family: Verdana,Geneva,sans-serif;">What am I doing this week?</span></h2> <p><span style="font-size: 14px;"><span style="font-family: Verdana,Geneva,sans-serif;">I am going to make the extractor module asynchronous this week. I have started working on it and created some functions for it. By the end of the week, I want to have an asynchronous extractor module and an asynchronous test_extractor.</span></span></p> <h2 style="text-align: justify;"><span style="font-family: Verdana,Geneva,sans-serif;">Have I got stuck anywhere?</span></h2> <p><span style="font-size: 14px;"><span style="font-family: Verdana,Geneva,sans-serif;">As I mentioned in my previous blog post, the Unix <em>file</em> utility wasn't flagging the binaries I generated as executable binary files. After some research, I learned about the magic signature that the <em>file</em> utility uses to identify a binary file, and I have added it to the binary files I create. Here is the magic hex signature, which can be found at the beginning of most executable files on Linux (it is the start of the ELF header): </span></span></p> <pre><code class="language-python">b"\x7f\x45\x4c\x46\x02\x01\x01\x03"</code></pre> <p> </p>201701184@daiict.ac.in (Niraj-Kamdar)Mon, 08 Jun 2020 05:36:41 +0000https://blogs.python-gsoc.org/en/niraj-kamdars-blog/gsoc-20-week-2-del-legacy-c/GSoC: Week 1: __init__.pyhttps://blogs.python-gsoc.org/en/niraj-kamdars-blog/gsoc-week-1-init-py/<p style="text-align: justify;"><span style="font-family: Verdana,Geneva,sans-serif;"><span style="font-size: 14px;">Hello everyone!</span></span></p> <p style="text-align: justify;"><span style="font-family: Verdana,Geneva,sans-serif;"><span style="font-size: 14px;">I am Niraj Kamdar, a third-year undergraduate at DA-IICT, India. 
I will be working on the CVE Binary Tool under the umbrella of the Python Software Foundation. </span></span></p> <h2 style="text-align: justify;"><span style="font-family: Verdana,Geneva,sans-serif;">What is the CVE Binary Tool?</span></h2> <p><span style="font-size: 14px;"><span style="font-family: Verdana,Geneva,sans-serif;">The CVE Binary Tool scans for a number of common, vulnerable open source components such as openssl, libpng, libxml2 and expat to let you know if a given directory or binary file includes common libraries with known vulnerabilities. </span></span></p> <h2><span style="font-family: Verdana,Geneva,sans-serif;">How does it work?</span></h2> <p><span style="font-size: 14px;"><span style="font-family: Verdana,Geneva,sans-serif;">We have <em>checkers</em> for popular open source libraries. Each checker contains methods that look at the strings found in a binary file to see whether they match certain unique strings found in an open source library, and tries to guess its version. We have a <em>scanner</em> module which recursively scans every binary file in the given directory, parses the strings from each binary file and forwards them to every checker. The checkers determine the vendor, product and version and pass them back to the <em>scanner</em>, which then looks in a local copy of the NVD database, finds all the vulnerabilities associated with the given product and displays them. We support several output formats, including JSON, CSV and a nice console format.  </span></span></p> <h2 style="text-align: justify;"><span style="font-family: Verdana,Geneva,sans-serif;">What did I do in the Community Bonding Period?</span></h2> <p><span style="font-family: Verdana,Geneva,sans-serif;"><span style="font-size: 14px;">I fixed several bugs (such as stale egg info and extractor bugs on Windows), wrote a faster native Python solution to replace the C strings extension module and refactored the whole checkers module to use an object-oriented approach to reduce code repetition. 
Previously, we had to write several functions to create a checker; now all we need to do is write 5 class attributes. If you want to learn more about how to write a checker, check out our <a href="https://github.com/intel/cve-bin-tool/blob/master/cve_bin_tool/checkers/README.md">contributing checker guidelines</a>.</span></span><br>  </p> <p><span style="font-size: 14px;"><span style="font-family: Verdana,Geneva,sans-serif;">I also had video conference meetings with my mentors every Wednesday, where we discussed the project design and implementation aspects. Since my project involves adding concurrency to the CVE Binary Tool, I was studying the <em>asyncio</em> and <em>concurrent.futures</em> modules during this time. My mentor also helped me and recommended a few articles.</span></span></p> <h2><span style="font-family: Verdana,Geneva,sans-serif;">What am I doing this week?</span></h2> <p><span style="font-size: 14px;"><span style="font-family: Verdana,Geneva,sans-serif;">I will be working on <a href="https://github.com/intel/cve-bin-tool/issues/638">removing the compiler dependency of test_scanner</a>, which is part of my GSoC project. I started 3-4 days early and have already finished the first task of this week, which was <a href="https://github.com/intel/cve-bin-tool/pull/696">splitting the <em>cli.py</em> module into <em>cli.py</em> and <em>scanner.py</em></a>. 
</span></span></p> <h2><span style="font-family: Verdana,Geneva,sans-serif;">Have I got stuck anywhere?</span></h2> <p><span style="font-size: 14px;"><span style="font-family: Verdana,Geneva,sans-serif;">While working on the <a href="https://github.com/intel/cve-bin-tool/issues/638">removing compiler dependency of test_scanner</a> issue, I realized that I also have to add some of the binary strings that a compiler normally adds, because we use the <em>file</em> utility to check whether the file we are scanning is a binary. It currently doesn't flag the files I generate as binary files, because they lack the signatures that can normally be found in a binary file. I have mentioned this problem to my mentors and I expect they will reply soon. Meanwhile, I will look into this myself.</span></span></p>201701184@daiict.ac.in (Niraj-Kamdar)Mon, 01 Jun 2020 06:24:11 +0000https://blogs.python-gsoc.org/en/niraj-kamdars-blog/gsoc-week-1-init-py/